What are the main rules for UTF-8 encoding validation?

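UTF-8 uses fixed bit patterns: a 1-byte character starts with '0'; a 2-byte character starts with '110' and is followed by one continuation byte starting with '10'; a 3-byte character starts with '1110' and is followed by two '10' continuation bytes; a 4-byte character starts with '11110' and is followed by three '10' continuation bytes.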
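As an illustration of these patterns, the following minimal sketch classifies a lead byte with bit masks. The function name `sequence_length` is hypothetical, and the sketch assumes each value is already in the range 0 to 255.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total byte length announced by a lead byte, or 0 if the
    byte cannot start a character (assumes 0 <= first_byte <= 255)."""
    if (first_byte & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (first_byte & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (first_byte & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (first_byte & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    return 0  # 10xxxxxx (a continuation byte) or 11111xxx
```

A return value of 0 marks a byte that may only appear as a continuation byte or that is not allowed by these rules at all.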

How do I process continuation bytes in the UTF-8 validation problem?

Continuation bytes must always start with the bit pattern '10'. If any expected continuation byte doesn't follow this pattern, or the data ends before all of them have been seen, the encoding is invalid.
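One possible way to express this check is sketched below. The helper names `is_continuation` and `continuations_ok` are hypothetical, and the sketch assumes every value is in the range 0 to 255.

```python
from typing import List


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (assumes 0 <= byte <= 255)."""
    return (byte >> 6) == 0b10


def continuations_ok(data: List[int], start: int, count: int) -> bool:
    """Check that `count` bytes exist from index `start` onward and that
    every one of them is a continuation byte."""
    chunk = data[start:start + count]
    # A short slice means the data ended before the character was complete.
    return len(chunk) == count and all(is_continuation(b) for b in chunk)
```

The length comparison is what rejects a sequence that is cut off in the middle of a multi-byte character.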

How does the number of leading ones in the first byte help in UTF-8 validation?

The number of leading ones in the first byte gives the total length of the character in bytes, so one fewer continuation byte is expected. For example, 3 leading ones (1110xxxx) mean a 3-byte character, so 2 continuation bytes are required.
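Counting the leading ones can be sketched as follows. The function name `leading_ones` is hypothetical, and the input is assumed to be a single byte value between 0 and 255.

```python
def leading_ones(byte: int) -> int:
    """Count consecutive 1 bits from the most significant bit of a byte."""
    count = 0
    mask = 0b10000000
    # Walk the mask from bit 7 down to bit 0 until a 0 bit is found.
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count
```

Under the rules of this problem, a count of 0 is a 1-byte character, a count of 2, 3, or 4 announces that many bytes in total (so one fewer continuation byte), a count of 1 is a continuation byte, and a count of 5 or more cannot start a character.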

What is the space complexity for the UTF-8 validation problem?

The space complexity is O(1) because the solution can be implemented with a fixed amount of extra space, regardless of the input size.

Can this problem be solved using simple loops?

Yes, the problem can be solved efficiently using loops and bit manipulation to check each byte and its continuation bytes.

#393

Medium


UTF-8 Validation


Array, Bit Manipulation

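Problem description

Given an integer array data representing the data, return whether it is a valid UTF-8 encoding.

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is 0, and the following 7 bits are the character's Unicode code.
For an n-byte character (n > 1), the first n bits of the first byte are all 1, the (n+1)-th bit is 0, and every following byte starts with the two bits 10. All remaining bits not mentioned carry the character's Unicode code.

This is how the UTF-8 encoding works: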

      Number of Bytes  |        UTF-8 octet sequence
                       |              (binary)
   --------------------+---------------------------------------------
            1          | 0xxxxxxx
            2          | 110xxxxx 10xxxxxx
            3          | 1110xxxx 10xxxxxx 10xxxxxx
            4          | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

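x denotes one bit in the binary form, which can be either 0 or 1.

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.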
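To illustrate the "least significant 8 bits" note, here is a small sketch that masks each integer down to one byte. The helper name `low_byte` and the sample values are hypothetical. With the stated constraint 0 <= data[i] <= 255 the mask changes nothing, so it only matters if larger integers are allowed.

```python
def low_byte(value: int) -> int:
    """Keep only the least significant 8 bits of an integer."""
    return value & 0xFF


# Hypothetical input with extra high bits set on the first and last items.
raw = [197 + 256, 130, 1 + 512]
bytes_only = [low_byte(v) for v in raw]  # [197, 130, 1]
```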

Example 1:

Input: data = [197,130,1]
Output: true
Explanation: data represents the byte sequence 11000101 10000010 00000001.
This is a valid UTF-8 encoding of a 2-byte character followed by a 1-byte character.

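Example 2:

Input: data = [235,140,4]
Output: false
Explanation: data represents the 8-bit sequence 11101011 10001100 00000100.
The first 3 bits are all 1 and the 4th bit is 0, which means it is a 3-byte character.
The next byte is a continuation byte that starts with 10, which is correct.
But the second continuation byte does not start with 10, so the encoding is invalid.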
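The binary forms quoted in the two examples can be reproduced with a short helper. This is only a convenience sketch for inspecting inputs, and the name `to_bits` is hypothetical.

```python
from typing import List


def to_bits(data: List[int]) -> str:
    """Render each integer's low 8 bits as a zero-padded binary string."""
    return " ".join(format(v & 0xFF, "08b") for v in data)


print(to_bits([197, 130, 1]))  # 11000101 10000010 00000001
print(to_bits([235, 140, 4]))  # 11101011 10001100 00000100
```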

Constraints:

1 <= data.length <= 2 * 10⁴
0 <= data[i] <= 255


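Solution approach

Method 1: Single pass

We use a variable $cnt$ to record how many bytes starting with $10$ are still needed to complete the current character. Initially $cnt = 0$.

Iterate over each integer $v$ in the array and apply the first rule that matches:

If $cnt > 0$, check whether $v$ starts with $10$. If it does not, return false; otherwise decrement $cnt$ by one.
Otherwise, if the highest bit of $v$ is $0$, set $cnt = 0$.
Otherwise, if the highest three bits of $v$ are $110$, set $cnt = 1$.
Otherwise, if the highest four bits of $v$ are $1110$, set $cnt = 2$.
Otherwise, if the highest five bits of $v$ are $11110$, set $cnt = 3$.
Otherwise, return false.

Finally, return true if $cnt = 0$, and false otherwise.

The time complexity is $O(n)$, where $n$ is the length of the array data. The space complexity is $O(1)$.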


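from typing import List


class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        cnt = 0  # continuation bytes still expected for the current character
        for v in data:
            if cnt > 0:
                # Inside a multi-byte character: v must look like 10xxxxxx.
                if v >> 6 != 0b10:
                    return False
                cnt -= 1
            elif v >> 7 == 0:  # 0xxxxxxx: 1-byte character
                cnt = 0
            elif v >> 5 == 0b110:  # 110xxxxx: 2-byte character
                cnt = 1
            elif v >> 4 == 0b1110:  # 1110xxxx: 3-byte character
                cnt = 2
            elif v >> 3 == 0b11110:  # 11110xxx: 4-byte character
                cnt = 3
            else:
                # 10xxxxxx with nothing pending, or 11111xxx
                return False
        # An unfinished trailing character leaves cnt > 0.
        return cnt == 0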
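For comparison, the same single-pass idea can also be written in a table-driven style. This is an alternative sketch, not the solution above: the names `LEAD_PATTERNS` and `valid_utf8` are hypothetical, and it assumes every value is in the range 0 to 255 as the constraints state.

```python
from typing import List

# (mask, expected pattern, number of continuation bytes that follow)
LEAD_PATTERNS = [
    (0b10000000, 0b00000000, 0),  # 0xxxxxxx
    (0b11100000, 0b11000000, 1),  # 110xxxxx
    (0b11110000, 0b11100000, 2),  # 1110xxxx
    (0b11111000, 0b11110000, 3),  # 11110xxx
]


def valid_utf8(data: List[int]) -> bool:
    pending = 0  # continuation bytes still owed to the current character
    for v in data:
        if pending:
            if (v & 0b11000000) != 0b10000000:
                return False
            pending -= 1
            continue
        for mask, pattern, follow in LEAD_PATTERNS:
            if (v & mask) == pattern:
                pending = follow
                break
        else:
            # No lead pattern matched: stray 10xxxxxx or 11111xxx.
            return False
    return pending == 0


print(valid_utf8([197, 130, 1]))  # True  (Example 1)
print(valid_utf8([235, 140, 4]))  # False (Example 2)
print(valid_utf8([197]))          # False (character cut off at the end)
```

The table keeps the four lead-byte patterns in one place, which some readers find easier to audit than a chain of shifts. It still runs in a single pass and uses constant extra space.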


Complexity analysis

Metric	Value
Time	O(n)
Space	O(1)


Follow-up questions interviewers often ask

Look for clear identification of how many continuation bytes follow each byte.
Assess understanding of bit manipulation and byte operations.
Ensure the solution handles edge cases like incomplete byte sequences.


Common pitfalls

Failing to properly identify the number of expected continuation bytes based on the first byte's leading bits.
Misunderstanding the rules of valid continuation bytes, particularly the '10' pattern.
Incorrectly processing arrays with invalid sequences, such as bytes that don't follow the proper encoding pattern.


Advanced variants

Modify the problem to validate UTF-16 or UTF-32 encoding.
Introduce a larger input size to assess the solution's efficiency under heavy constraints.
Ask for an optimization to minimize space usage further.
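As a rough idea of what the UTF-16 variant might look like, the sketch below checks surrogate pairing in a list of 16-bit code units. It is an assumption about how such a variant could be posed, not part of the original problem: the function name `valid_utf16` is hypothetical, and each value is assumed to already be one code unit in the range 0 to 0xFFFF.

```python
from typing import List


def valid_utf16(units: List[int]) -> bool:
    """Check that every high surrogate (0xD800-0xDBFF) is immediately
    followed by a low surrogate (0xDC00-0xDFFF), and that no low
    surrogate appears on its own."""
    expect_low = False
    for u in units:
        is_high = 0xD800 <= u <= 0xDBFF
        is_low = 0xDC00 <= u <= 0xDFFF
        if expect_low:
            if not is_low:
                return False
            expect_low = False
        elif is_low:
            return False  # low surrogate without a preceding high one
        elif is_high:
            expect_low = True
    # A trailing high surrogate leaves the pair incomplete.
    return not expect_low
```

The structure mirrors the UTF-8 pass: one small piece of state records what the next unit must be, so the extra space stays constant.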


Keep practicing

#421 Maximum XOR of Two Numbers in an Array, #318 Maximum Product of Word Lengths, #473 Matchsticks to Square