b5b2ea2c3b
The unicode standard allows for certain (visually) identical characters to be represented in different ways. For example the character ä may be represented as a single combined codepoint "Latin Small Letter A with Diaeresis" (U+00E4) or by the combination of "Latin Small Letter A" (U+0061) followed by "Combining Diaeresis" (U+0308). When encoded with UTF-8, these are represented as respectively the two bytes 0xC3 0xA4, and the three bytes 0x61 0xCC 0x88. A user linking to notes with these characters in their titles would expect these two variants to link to the same file, given they are visually identical and have the exact same semantic meaning. The unicode standard defines a method to deconstruct and normalize these forms, so that a byte comparison on the normalized forms of these variants ends up comparing the same thing. This is called Unicode Normalization, defined in Unicode® Standard Annex #15 (http://www.unicode.org/reports/tr15/). The W3C Working Group has written an excellent explanation of the problems regarding string matching, and how unicode normalization helps with this process: https://www.w3.org/TR/charmod-norm/#unicodeNormalization With this change, obsidian-export will perform unicode normalization (specifically the C (or NFC) normalization form) on all note titles while looking up link references, ensuring visually identical links are treated as being similar, even if they were encoded as different variants. A special thanks to Hans Raaf (@oderwat) for reporting and helping track down this issue. --- Closes #126