Nick Groenen b5b2ea2c3b
New: apply Unicode normalization while resolving notes
The Unicode standard allows certain visually identical characters to be
represented in different ways.

For example, the character ä may be represented as the single precomposed
code point "Latin Small Letter A with Diaeresis" (U+00E4) or as the
sequence "Latin Small Letter A" (U+0061) followed by "Combining
Diaeresis" (U+0308).

Encoded as UTF-8, these are the two bytes 0xC3 0xA4 and the three bytes
0x61 0xCC 0x88, respectively.
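
The byte sequences above can be checked with a short script. This is
only an illustration in Python, not code from obsidian-export; the
variable names are made up for the example.

```python
# Two spellings of the same visible character "ä".
precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(precomposed_bytes.hex(" "))  # c3 a4
print(decomposed_bytes.hex(" "))   # 61 cc 88

# Without normalization the two strings do not compare equal.
print(precomposed == decomposed)   # False
```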

A user linking to notes with these characters in their titles would
expect these two variants to link to the same file, given they are
visually identical and have the exact same semantic meaning.

The Unicode standard defines a method to decompose and normalize these
forms, so that a byte-wise comparison of the normalized variants treats
them as equal. This is called Unicode Normalization, defined in
Unicode® Standard Annex #15
(http://www.unicode.org/reports/tr15/).

The W3C Internationalization Working Group has written an excellent
explanation of the problems regarding string matching, and how Unicode
normalization helps with this process:
https://www.w3.org/TR/charmod-norm/#unicodeNormalization

With this change, obsidian-export will perform Unicode normalization
(specifically Normalization Form C, or NFC) on all note titles while
looking up link references, ensuring visually identical links are
treated as identical, even if they were encoded as different variants.

A special thanks to Hans Raaf (@oderwat) for reporting and helping track
down this issue.

---

Closes #126
2022-11-19 16:58:48 +01:00