【笔记】字符编码

Unicode平面

Plane 0, 称作基本平面(Basic Plane), 剩余的称作扩展平面(Supplementary Plane)

Unicode code points	Range	Encoding Binary value (码点)
U+000000-U+00007f	0xxxxxxx	0xxxxxxx
U+000080-U+0007ff	110yyyxx 10xxxxxx	00000yyy xxxxxxxx
U+000800-U+00ffff	1110yyyy 10yyyyxx 10xxxxxx	yyyyyyyy xxxxxxxx
U+010000-U+10ffff	11110zzz 10zzyyyy 10yyyyxx 10xxxxxx	000zzzzz yyyyyyyy xxxxxxxx

Python中的转义序列

转义序列	意义	示例	码点范围
\ooo(三个八进制数)	ooo码点的字符	'\141' (a)	U+0 - U+01FF
\xhh	hh码点的字符	'\x61' (a)	U+0 - U+FF
\uxxxx	xxxx码点的字符	'\u4e2d\u6587' (中文)	U+0 - U+FFFF
\Uxxxxxxxx	xxxxxxxx码点的字符	'\U0001f466' (👦)	U+0 - U+FFFFFFFF

注: \u和\U仅在字符串字面值中可用. o为八进制digit, x和h均为十六进制digit.
注: 对于\ooo, 和Standard C一样, up to three octal digits are accepted.
对于\xhh和\ooo, In a bytes literal, 十六进制和八进制转义序列表示 the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value.

'中'.encode('utf-8')
# b'\xe4\xb8\xad'
# 11100100 10111000 10101101‬

Java中的转义序列

转义序列	意义	示例
\uxxxx	xxxx码点的字符	"\u4e2d\u6587" (中文)
\uxxxx	xxxx码点的字符	"\ud83d\udc66" (👦)

注意: Java中没有\Uxxxxxxxx这种形式的转义序列

How many characters can be mapped with Unicode?

The count of all the possible valid combinations in Unicode.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,994 code points are actually assigned in Unicode 12.1.
utf 8 - How many characters can be mapped with Unicode? - Stack Overflow
编码方式之ASCII、ANSI、Unicode概述 - 蓝海人 - 博客园

十六进制、八进制和二进制的关系

一个十六进制digit对应4个bit, 即半个字节. 一个八进制digit对应3个bit.

UTF-16

在基本平面内, 从U+D800到U+DFFF的码点不对应任何字符.
UTF-16就利用保留下来的0xD800到0xDFFF的码点来映射扩展平面的字符.
U+D800 - U+DBFF, U+DC00 - U+DFFF, 分别有1024个不同的码点.
利用这两段码点结对编码扩展平面的字符, 扩展平面可编码字符数: 1024*1024=1048576.
Unicode 编码及 UTF-32, UTF-16 和 UTF-8 - 知乎

UTF-8

Modified UTF-8

In Modified UTF-8 (MUTF-8), the null character (U+0000) uses the two-byte overlong encoding 11000000 10000000 (hexadecimal C0 80), instead of 00000000 (hexadecimal 00).
BMP的字符, 除了U+0000, 其他的字符Modified UTF-8和UTF-8编码方式一致.
non-BMP的字符, Modified UTF-8用6个字节编码, 而UTF-8用4个字节编码.
UTF8 --> CESU-8 (non-BMP用6个字节编码) --> Modified UTF-8(U+0000用两个字节编码; non-BMP同CESU-8)
Chapter 4. The class File Format
cesu8 - Rust
CESU-8 - Wikipedia

码点和代码单元

码点(Code Point): 是Unicode代码空间中的一个值, 取值从0x0到0x10FFFF, 代表一个字符.
代码单元(Code Unit): 是具体编码形式中的最小单位. 比如UTF-16中一个代码单元为16 bits, UTF-8中一个代码单元为8 bits. 一个码点可能由一个或多个代码单元表示. 在U+10000之前的码点可以由一个UTF-16代码单元表示, U+10000及之后的码点由两个UTF-16代码单元表示.
代码点(Code Point)和代码单元(Code Unit) - zhangzl419 - 博客园
 Emoji与Unicode · Zablog

import java.nio.charset.StandardCharsets;

public class Demo {
    public static void main(String[] args) {
        String str = "你";
        printLen(str);
        str = "👦";
        printLen(str);
        str = "你👦";
        printLen(str);

        //只得到了半个👦
        System.out.println(str.charAt(1));
        System.out.println(str.substring(str.indexOf("👦"), str.toCharArray().length));
    }

    public static void printLen(String str) {
        System.out.println(str);
        System.out.println("string length = " + str.length());
        System.out.println("string bytes length (UTF-8) = " + str.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("string bytes length (UTF-16BE) = " + str.getBytes(StandardCharsets.UTF_16BE).length);
        System.out.println("string char length = " + str.toCharArray().length);
        System.out.println("string code points count = " + str.codePoints().count());
        System.out.println();
    }
}
/*
你
string length = 1
string bytes length (UTF-8) = 3
string bytes length (UTF-16BE) = 2
string char length = 1
string code points count = 1

👦
string length = 2
string bytes length (UTF-8) = 4
string bytes length (UTF-16BE) = 4
string char length = 2
string code points count = 1

你👦
string length = 3
string bytes length (UTF-8) = 7
string bytes length (UTF-16BE) = 6
string char length = 3
string code points count = 2

?
👦

代码片段

//遍历码点
String str = "👦abc";
str.codePoints().forEach(c -> new String(Character.toChars(c)));

//取得第一个码点的字符
String firstSymbol = new String(Character.toChars(str.codePointAt(0)));
System.out.println(firstSymbol);

获取字符的码点及其编码(UTF,GB18030)

str = '👦'
print('Code Point', format(ord(str), 'x'))

encodingList = ["UTF-8", "UTF-16-LE",
               "UTF-16-BE", "UTF-32-LE", "UTF-32-BE", "GB18030"]

for encoding in encodingList:
    byteHexStr = ''
    for byte in bytearray(str, encoding):
        byteHexStr += format(byte, 'x').zfill(2)
    print(encoding, byteHexStr)

# 输出
"""
Code Point 1f466
UTF-8 f09f91a6
UTF-16-LE 3dd866dc
UTF-16-BE d83ddc66
UTF-32-LE 66f40100
UTF-32-BE 0001f466
GB18030 9439d336
"""

import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class Demo {
  private static final char[] HEX_ARRAY = "0123456789abcdef".toCharArray();

  public static String bytesToHex(byte[] bytes) {
    char[] hexChars = new char[bytes.length * 2];
    for (int j = 0; j < bytes.length; j++) {
      int v = bytes[j] & 0xFF;
      hexChars[j * 2] = HEX_ARRAY[v >>> 4];
      hexChars[j * 2 + 1] = HEX_ARRAY[v & 0x0F];
    }
    return new String(hexChars);
  }

  public static void main(String args[]) throws Exception {
    String str = "👦";
    int[] codePointArr = str.codePoints().toArray();
    System.out.println("Code Point " + Integer.toHexString(codePointArr[0]));

    String[] encodingArr = { "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE", "GB18030" };
    Map<String, String> encodingMap = new LinkedHashMap<>();

    for (String encoding : encodingArr) {
      byte[] arr = str.getBytes(encoding);
      encodingMap.put(encoding, bytesToHex(arr));
    }

    encodingMap.forEach((k, v) -> System.out.println(k + " " + v));
  }
}

字节序和BOM

计算机硬件有两种储存数据的方式: 大端字节序(big endian)和小端字节序(little endian).
例如, 数值0x2211使用两个字节储存: 高位字节是0x22, 低位字节是0x11.
大端字节序: 高位字节在前, 低位字节在后, 即0x2211.
小端字节序: 低位字节在前, 高位字节在后, 即0x1122.
BOM=Byte Order Mark, 即"字节顺序标识".

UTF-16, JVM中缺省是大端法，Windows平台下缺省为小端法.
理解字节序 - 阮一峰的网络日志
 字符集与编码（七）——BOM – 肖国栋的i自留地

MySQL排序规则

排序规则	Unicode collation algorithm
utf8mb4_unicode_ci	基于UCA 4.0.0
utf8mb4_unicode_520_ci	基于UCA 5.2.0
utf8mb4_0900_ai_ci	基于UCA 9.0.0, since MySQL 8.0

The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which defines a customizable method to compare two strings.