Ghost/ghost/link-replacer/lib/link-replacer.js


class LinkReplacer {
    /**
     * Replaces the links in the provided HTML
     * @param {string} html
     * @param {(url: URL, originalPath: string) => Promise<URL|string|false>} replaceLink
     * @param {object} options
     * @param {string} [options.base] Base URL used to resolve relative links to absolute URLs, which are then passed to replaceLink as well
     * @returns {Promise<string>}
     */
    async replace(html, replaceLink, options = {}) {
        // 2024-03-06 12:11:49 +03:00
        // 🐛 Fixed unexpected conversion of single-quoted attributes in HTML cards (#19727), closes ENG-627
        //
        // We were using `cheerio` to parse+modify+serialize our rendered HTML to modify links
        // for member attribution. Cheerio's serializer has a long-standing issue
        // (https://github.com/cheeriojs/cheerio/issues/720), which we've had to deal with before
        // (https://github.com/TryGhost/SDK/issues/124), where it replaces single-quoted attributes
        // with double-quoted attributes. That broke rendering when content used single quotes,
        // such as HTML cards with JSON data inside a `data-` attribute, or content that used
        // single quotes to avoid escaping double quotes in an attribute value.
        //
        // - swapped the `cheerio` implementation for one that uses `html5parser` to tokenize the
        //   html string; we loop over the tokens and replace the href attribute values in the
        //   original string without touching any other part of the content. This avoids a full
        //   parse+serialize process, which is more costly and can result in unexpected content
        //   changes due to serializer opinions
        // - fixes the quote change bug
        // - uses tokenization directly to avoid the cost of building a full AST
        // - updated the Content API Posts snapshot: one of our fixtures has a missing closing tag
        //   which we no longer "fix" with a full parse+serialize step in the link replacer (keeps
        //   modified src closer to the original and better matches behaviour elsewhere in the app
        //   and without member attribution applied)
        // - the link replacer no longer converts `attr=""` to `attr` (these are equivalent in the
        //   HTML spec, so no change in behaviour other than preserving the original source html)
        // - added a benchmark test file comparing the two implementations, because the link
        //   replacer runs on render and so is in a hot path; the new implementation has a
        //   3x performance improvement
        // - the separate files with the old/new implementations have been cleaned up, but the
        //   benchmark test file is left in place for future reference
        //
        // Benchmark results comparing implementations (node test/benchmark.js):
        //   cheerio:     5.03K /s ±2.20% (5037 iterations)
        //   html5parser: 16.5K /s ±0.43% (16534 iterations)
        //   Completed benchmark in 0.9976526670455933s
        const {tokenize} = require('html5parser');
        const entities = require('entities');
        try {
            const tokens = tokenize(html); // IToken[]
            const replacements = [];
            let inAnchor = false;
            let inHref = false;

            // interface IToken {
            //     start: number;
            //     end: number;
            //     value: string;
            //     type: TokenKind;
            // }
            // const enum TokenKind {
            //     0 Literal,
            //     1 OpenTag,     // trim leading '<'
            //     2 OpenTagEnd,  // trim trailing '>', only could be '/' or ''
            //     3 CloseTag,    // trim leading '</' and trailing '>'
            //     4 Whitespace,  // the whitespace between attributes
            //     5 AttrValueEq,
            //     6 AttrValueNq,
            //     7 AttrValueSq,
            //     8 AttrValueDq,
            // }
            for (const token of tokens) {
                // 1 = OpenTag: the start of an <a ...> tag
                if (token.type === 1 && token.value === 'a') {
                    inAnchor = true;
                }

                if (inAnchor) {
                    // 2 = OpenTagEnd: the opening tag is finished
                    if (token.type === 2) {
                        inAnchor = false;
                        inHref = false;
                    }

                    // 6 = AttrValueNq: an unquoted token, matched here against the attribute name
                    if (token.type === 6 && token.value === 'href') {
                        inHref = true;
                    }
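                    // 8 = AttrValueDq: a double-quoted attribute value
                    if (inHref && token.type === 8) {
                        // Strip the surrounding quotes, then decode HTML entities
                        const path = entities.decode(token.value.substring(1, token.value.length - 1));
                        let url;
                        try {
                            url = new URL(path, options.base);
                        } catch (e) {
                            // Ignore invalid URLs
                        }
                        if (url) {
                            url = await replaceLink(url, path);
                            const str = url.toString();
                            // Offsets exclude the quote characters, so the original quotes are preserved
                            replacements.push({url: str, start: token.start + 1, end: token.end - 1});
                        }
                        inHref = false;
                    }
                }
            }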
            // Apply the replacements in document order; offsetAdjustment tracks how far
            // earlier replacements have shifted the positions of later ones
            let offsetAdjustment = 0;
            replacements.forEach(({url, start, end}) => {
                const originalLength = end - start;
                const replacementLength = url.length;
                html = html.slice(0, start + offsetAdjustment) + url + html.slice(end + offsetAdjustment);
                offsetAdjustment += replacementLength - originalLength;
            });

            return html;
        } catch (e) {
            // do nothing in case of error,
            // we don't want to break the content for the sake of member attribution
            return html;
        }
    }
}

module.exports = new LinkReplacer();
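
The offset bookkeeping used when the collected replacements are spliced back into the original string can be looked at in isolation. The following is a minimal sketch, not part of the module: `applyReplacements` is a hypothetical helper name, the example URLs are made up, and the `start`/`end` offsets are computed with `indexOf` instead of coming from tokenizer output. It assumes the replacements are sorted by `start` and do not overlap, which holds when they are collected in token order.

```javascript
// Hypothetical helper for illustration only; it mirrors the splice loop in replace().
// Assumes `replacements` is sorted by `start` and the ranges do not overlap.
function applyReplacements(html, replacements) {
    let offsetAdjustment = 0;
    for (const {url, start, end} of replacements) {
        // Positions refer to the ORIGINAL string, so shift them by the
        // net length change introduced by earlier replacements.
        html = html.slice(0, start + offsetAdjustment) + url + html.slice(end + offsetAdjustment);
        offsetAdjustment += url.length - (end - start);
    }
    return html;
}

// Made-up input; the offsets cover only the attribute value, not the quotes.
const sampleHtml = '<a href="/a">A</a> <a href="/b">B</a>';
const firstStart = sampleHtml.indexOf('/a');
const secondStart = sampleHtml.indexOf('/b');

const sampleResult = applyReplacements(sampleHtml, [
    {url: 'https://example.com/a?ref=x', start: firstStart, end: firstStart + 2},
    {url: 'https://example.com/b?ref=x', start: secondStart, end: secondStart + 2}
]);

console.log(sampleResult);
```

Because only the characters between the quotes are swapped, everything else in the string, including the quote style and any unclosed tags, is left exactly as it was.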