mirror of
https://github.com/squidfunk/mkdocs-material.git
synced 2024-06-14 11:52:32 +03:00
Updated documentation
parent
439708c8a9
commit
29b364656f
@ -1,3 +1,11 @@
mkdocs-material-7.2.6+insiders-3.0.0 (2021-09-13)

* Rewrite of MkDocs' search plugin
* Added support for rich search previews
* Added support for tokenizer with lookahead
* Improved search indexing performance (twice as fast)
* Improved search highlighting

mkdocs-material-7.2.6+insiders-2.13.3 (2021-09-01)

* Added support for disabling social card generation

@ -7,7 +7,9 @@ search:
|
||||
|
||||
# Search: better, faster, smaller
|
||||
|
||||
__This is the story of how we managed to completely rebuild client-side search, delivering a significantly better user experience, while making it faster and smaller at the same time.__
|
||||
__This is the story of how we managed to completely rebuild client-side search,
|
||||
delivering a significantly better user experience, while making it faster and
|
||||
smaller at the same time.__
|
||||
|
||||
<aside class="mdx-author" markdown="1">
|
||||
![@squidfunk][1]
|
||||
@ -15,7 +17,7 @@ __This is the story of how we managed to completely rebuild client-side search,
|
||||
<span>__Martin Donath__ · @squidfunk</span>
|
||||
<span>
|
||||
:octicons-calendar-24: September 13, 2021 ·
|
||||
:octicons-clock-24: 20 min read
|
||||
:octicons-clock-24: 15 min read
|
||||
</span>
|
||||
</aside>
|
||||
|
||||
@ -23,9 +25,19 @@ __This is the story of how we managed to completely rebuild client-side search,
|
||||
|
||||
---
|
||||
|
||||
The [search][2] of Material for MkDocs is by far one of its best and most-loved assets: [multilingual][3], [offline-capable][4] and most importantly: _all client-side_. It provides a solution to empower the users of your documentation to find what they're searching for instantly without the headache of managing additional servers. However, even though several iterations have been made, there's still some room for improvement, which is why we rebuilt the search plugin and integration from the ground up. This article shines some light on the internals of the new search, why it's much more powerful than the previous version and what's about to come.
|
||||
The [search][2] of Material for MkDocs is by far one of its best and most-loved
|
||||
assets: [multilingual][3], [offline-capable][4] and most importantly: _all
|
||||
client-side_. It provides a solution to empower the users of your documentation
|
||||
to find what they're searching for instantly without the headache of managing
|
||||
additional servers. However, even though several iterations have been made,
|
||||
there's still some room for improvement, which is why we rebuilt the search
|
||||
plugin and integration from the ground up. This article shines some light on the
|
||||
internals of the new search, why it's much more powerful than the previous
|
||||
version and what's about to come.
|
||||
|
||||
_The next section discusses the architecture and issues of the current search implementation. If you immediately want to learn what's new, skip to the [section just after that][5]._
|
||||
_The next section discusses the architecture and issues of the current search
|
||||
implementation. If you immediately want to learn what's new, skip to the
|
||||
[section just after that][5]._
|
||||
|
||||
[2]: ../../setup/setting-up-site-search.md
|
||||
[3]: ../../setup/setting-up-site-search.md#lang
|
||||
@ -34,7 +46,11 @@ _The next section discusses the architecture and issues of the current search im
|
||||
|
||||
## Architecture
|
||||
|
||||
Material for MkDocs uses [lunr][6] together with [lunr-languages][7] to implement its client-side search capabilities. When a documentation page is loaded and JavaScript is available, the search index as generated by the [built-in search plugin][8] during the build process is requested from the server:
|
||||
Material for MkDocs uses [lunr][6] together with [lunr-languages][7] to
|
||||
implement its client-side search capabilities. When a documentation page is
|
||||
loaded and JavaScript is available, the search index as generated by the
|
||||
[built-in search plugin][8] during the build process is requested from the
|
||||
server:
|
||||
|
||||
``` ts
|
||||
const index$ = document.forms.namedItem("search")
|
||||
@ -50,7 +66,9 @@ const index$ = document.forms.namedItem("search")
|
||||
|
||||
### Search index
|
||||
|
||||
The search index includes a stripped-down version of all pages. Let's take a look at an example, to understand precisely what the search index contains from the original Markdown file:
|
||||
The search index includes a stripped-down version of all pages. Let's take a
|
||||
look at an example, to understand precisely what the search index contains from
|
||||
the original Markdown file:
|
||||
|
||||
??? example "Expand to inspect example"
|
||||
|
||||
@ -61,8 +79,8 @@ The search index includes a stripped-down version of all pages. Let's take a loo
|
||||
|
||||
## Text
|
||||
|
||||
It's very easy to make some words **bold** and other words *italic* with
|
||||
Markdown. You can even add [links](#), or even `code`:
|
||||
It's very easy to make some words **bold** and other words *italic*
|
||||
with Markdown. You can even add [links](#), or even `code`:
|
||||
|
||||
```
|
||||
if (isAwesome) {
|
||||
@ -124,35 +142,70 @@ The search index includes a stripped-down version of all pages. Let's take a loo
|
||||
|
||||
If we inspect the search index, we immediately see several problems:
|
||||
|
||||
1. __All content is included twice__: the search index includes one entry with the entire contents of the page, and one entry for each section of the page, i.e. each block preceded by a headline or subheadline. This significantly contributes to the size of the search index.
|
||||
1. __All content is included twice__: the search index includes one entry
|
||||
with the entire contents of the page, and one entry for each section of
|
||||
the page, i.e. each block preceded by a headline or subheadline. This
|
||||
significantly contributes to the size of the search index.
|
||||
|
||||
2. __All structure is lost__: when the search index is built, all structural information like HTML tags and attributes are stripped from the content. While this approach works well for paragraphs and inline formatting, it might be problematic for lists and code blocks. An excerpt:
|
||||
2. __All structure is lost__: when the search index is built, all structural
|
||||
information like HTML tags and attributes are stripped from the content.
|
||||
While this approach works well for paragraphs and inline formatting, it
|
||||
might be problematic for lists and code blocks. An excerpt:
|
||||
|
||||
```
|
||||
… links , or even code : if (isAwesome) { … } Lists Sometimes you want …
|
||||
```
|
||||
|
||||
- __Context__: for an untrained eye, the result can look like gibberish, as it's not immediately apparent what classifies as text and what as code. Furthermore, it's not clear that `Lists` is a headline as it's merged with the code block before and the paragraph after it.
|
||||
- __Context__: for an untrained eye, the result can look like gibberish, as
|
||||
it's not immediately apparent what classifies as text and what as code.
|
||||
Furthermore, it's not clear that `Lists` is a headline as it's merged
|
||||
with the code block before and the paragraph after it.
|
||||
|
||||
- __Punctuation__: inline elements like links, that are immediately followed by punctuation are separated by whitespace (see `,` and `:` in the excerpt). This is because all extracted text is joined with a whitespace character during the construction of the search index.
|
||||
- __Punctuation__: inline elements like links, that are immediately followed
|
||||
by punctuation are separated by whitespace (see `,` and `:` in the
|
||||
excerpt). This is because all extracted text is joined with a whitespace
|
||||
character during the construction of the search index.
|
||||
|
||||
It's not difficult to see that it can be quite challenging to implement a good search experience for theme authors, which is why Material for MkDocs (up to now) did some [monkey patching][9] to be able to render slightly more meaningful search previews.
|
||||
It's not difficult to see that it can be quite challenging to implement a good
|
||||
search experience for theme authors, which is why Material for MkDocs (up to
|
||||
now) did some [monkey patching][9] to be able to render slightly more
|
||||
meaningful search previews.
|
||||
|
||||
### Search worker
|
||||
|
||||
The actual search functionality is implemented as part of a web worker[^1], which creates and manages the [lunr][6] search index. When search is initialized, the following steps are taken:
|
||||
The actual search functionality is implemented as part of a web worker[^1],
|
||||
which creates and manages the [lunr][6] search index. When search is
|
||||
initialized, the following steps are taken:
|
||||
|
||||
[^1]: Prior to [version 5.0][10], search was carried out in the main thread which locked up the browser, rendering it unusable. This problem was first reported in #904 and, after some back and forth, fixed and released in version 5.0.
|
||||
[^1]:
|
||||
Prior to [version 5.0][10], search was carried out in the main thread which
|
||||
locked up the browser, rendering it unusable. This problem was first
|
||||
reported in #904 and, after some back and forth, fixed and released in
|
||||
version 5.0.
|
||||
|
||||
1. __Linking sections with pages__: The search index is parsed and each section is linked to its parent page. The parent page itself is _not indexed_, as it would lead to duplicate results, so only the sections remain. Linking is necessary, as search results are grouped by page.
|
||||
1. __Linking sections with pages__: The search index is parsed and each section
|
||||
is linked to its parent page. The parent page itself is _not indexed_, as it
|
||||
would lead to duplicate results, so only the sections remain. Linking is
|
||||
necessary, as search results are grouped by page.
|
||||
|
||||
2. __Tokenization__: The `title` and `text` values of each section are split into tokens by using the [separator][11] as configured in `mkdocs.yml`. Tokenization itself is carried out by [lunr's default tokenizer][12], which doesn't allow for lookahead or separators spanning multiple characters.
|
||||
2. __Tokenization__: The `title` and `text` values of each section are split
|
||||
into tokens by using the [separator][11] as configured in `mkdocs.yml`.
|
||||
Tokenization itself is carried out by [lunr's default tokenizer][12], which
|
||||
doesn't allow for lookahead or separators spanning multiple characters.
|
||||
|
||||
> Why is this important and a big deal? We will see later how much more we can achieve with a tokenizer that is capable of separating strings with lookahead.
|
||||
> Why is this important and a big deal? We will see later how much more we
|
||||
> can achieve with a tokenizer that is capable of separating strings with
|
||||
> lookahead.
|
||||
|
||||
3. __Indexing__: As a final step, each section is indexed. When querying the index, if a search query includes one of the tokens as returned by step 2., the section is considered to be part of the search result and passed to the main thread.
|
||||
3. __Indexing__: As a final step, each section is indexed. When querying the
|
||||
index, if a search query includes one of the tokens as returned by step 2.,
|
||||
the section is considered to be part of the search result and passed to the
|
||||
main thread.
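
The first step above, linking sections with their parent page, can be sketched roughly as follows. This is not the worker's actual code: the `SearchDocument` shape mirrors the `location`, `title` and `text` fields of the search index, while the `groupSectionsByPage` helper and the rule that an entry without a `#` anchor represents the page itself are assumptions made for illustration.

``` ts
// Assumed shape of an entry in the search index
interface SearchDocument {
  location: string
  title: string
  text: string
}

// Hypothetical helper: group sections by their parent page and skip the
// page entries themselves, so that only sections remain
function groupSectionsByPage(
  docs: SearchDocument[]
): Map<string, SearchDocument[]> {
  const pages = new Map<string, SearchDocument[]>()
  for (const doc of docs) {
    const hash = doc.location.indexOf("#")

    // Assumption: entries without an anchor are pages, not sections
    if (hash === -1)
      continue

    // Everything before the anchor identifies the parent page
    const path = doc.location.slice(0, hash)
    const sections = pages.get(path) ?? []
    sections.push(doc)
    pages.set(path, sections)
  }
  return pages
}

const byPage = groupSectionsByPage([
  { location: "setup/", title: "Setup", text: "…" },
  { location: "setup/#search", title: "Search", text: "…" },
  { location: "setup/#lang", title: "Language", text: "…" }
])
console.log(byPage.get("setup/")?.map(section => section.title))
```

Grouping by the part of `location` before the anchor keeps the page out of the result set while still allowing results to be rendered per page.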
|
||||
|
||||
Now, that's basically how the search worker operates. Sure, there's a little more magic involved, e.g. search results are [post-processed][13] and [rescored][14] to account for some shortcomings of [lunr][6], but in general this is how data gets into and out of the index.
|
||||
Now, that's basically how the search worker operates. Sure, there's a little
|
||||
more magic involved, e.g. search results are [post-processed][13] and
|
||||
[rescored][14] to account for some shortcomings of [lunr][6], but in general
|
||||
this is how data gets into and out of the index.
|
||||
|
||||
[9]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/document/index.ts#L68-L71
|
||||
[10]: https://squidfunk.github.io/mkdocs-material/upgrading/#upgrading-from-4x-to-5x
|
||||
@ -163,9 +216,15 @@ Now, that's basically how the search worker operates. Sure, there's a little mor
|
||||
|
||||
### Search previews
|
||||
|
||||
Users should be able to quickly scan an evaluate the relevance of a search result in the given context, which is why a concise summary with highlighted occurrences of the search terms found is an essential part of a great search experience.
|
||||
Users should be able to quickly scan and evaluate the relevance of a search
|
||||
result in the given context, which is why a concise summary with highlighted
|
||||
occurrences of the search terms found is an essential part of a great search
|
||||
experience.
|
||||
|
||||
This is where the current search preview generation falls short, as some of the search previews appear to not include any occurrence of any of the search terms. This was due to the fact that search previews were [truncated after a maximum of 320 characters][15], as can be seen here:
|
||||
This is where the current search preview generation falls short, as some of the
|
||||
search previews appear to not include any occurrence of any of the search
|
||||
terms. This was due to the fact that search previews were [truncated after a
|
||||
maximum of 320 characters][15], as can be seen here:
|
||||
|
||||
<figure markdown="1">
|
||||
|
||||
@ -173,40 +232,76 @@ This is where the current search preview generation falls short, as some of the
|
||||
|
||||
<figcaption markdown="1">
|
||||
|
||||
The first two results look like they're not relevant, as they don't seem to include the query string the user just searched for. Yet, they are.
|
||||
The first two results look like they're not relevant, as they don't seem to
|
||||
include the query string the user just searched for. Yet, they are.
|
||||
|
||||
</figcaption>
|
||||
</figure>
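
To illustrate why a fixed cut-off produces previews without any visible match, here is a minimal sketch. The limit of 320 characters is taken from the text above; the `truncatePreview` helper and the sample text are hypothetical and do not reproduce the theme's actual template code.

``` ts
// Limit mentioned in the text above
const PREVIEW_LIMIT = 320

// Hypothetical helper: cut the preview after a fixed number of characters
function truncatePreview(text: string, limit: number = PREVIEW_LIMIT): string {
  return text.length > limit ? text.slice(0, limit) : text
}

// A section whose only occurrence of the search term is past the limit
const sectionText = "lorem ipsum ".repeat(30) + "needle"
const preview = truncatePreview(sectionText)

console.log(sectionText.includes("needle")) // the section matches the query
console.log(preview.includes("needle"))     // but the preview doesn't show it
```

The section is a valid search result, yet the truncated preview gives the user no hint as to why.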
|
||||
|
||||
A better solution to this problem has been on the roadmap for a very, very long time, but in order to solve this once and for all, several factors need to be carefully considered:
|
||||
A better solution to this problem has been on the roadmap for a very, very long
|
||||
time, but in order to solve this once and for all, several factors need to be
|
||||
carefully considered:
|
||||
|
||||
1. __Word boundaries__: some themes[^2] for static site generators generate search previews by expanding the text left and right next to an occurrence, stopping at a whitespace character when enough words have been consumed. A preview might look like this:
|
||||
1. __Word boundaries__: some themes[^2] for static site generators generate
|
||||
search previews by expanding the text left and right next to an occurrence,
|
||||
stopping at a whitespace character when enough words have been consumed. A
|
||||
preview might look like this:
|
||||
|
||||
```
|
||||
… channels, e.g. or which can be configured via mkdocs.yml …
|
||||
```
|
||||
|
||||
While this may work for languages that use whitespace as a separator between words, it breaks down for languages like Japanese or Chinese[^3], as they have non-whitespace word boundaries and use dedicated segmenters to split strings into tokens.
|
||||
While this may work for languages that use whitespace as a separator
|
||||
between words, it breaks down for languages like Japanese or Chinese[^3],
|
||||
as they have non-whitespace word boundaries and use dedicated segmenters to
|
||||
split strings into tokens.
|
||||
|
||||
[^2]: At the time of writing, [Just the Docs][17] and [Docusaurus][18] use this method for generating search previews. Note that the latter also integrates with Algolia, which is a fully managed server-based solution.
|
||||
[^2]:
|
||||
At the time of writing, [Just the Docs][17] and [Docusaurus][18] use this
|
||||
method for generating search previews. Note that the latter also integrates
|
||||
with Algolia, which is a fully managed server-based solution.
|
||||
|
||||
[^3]: China and Japan are both within the top 5 countries of origin of users of Material for MkDocs.
|
||||
[^3]:
|
||||
China and Japan are both within the top 5 countries of origin of users of
|
||||
Material for MkDocs.
|
||||
|
||||
[15]: https://github.com/squidfunk/mkdocs-material/blob/master/src/assets/javascripts/templates/search/index.tsx#L90
|
||||
[16]: search-better-faster-smaller/search-preview.png
|
||||
[17]: https://pmarsceill.github.io/just-the-docs/
|
||||
[18]: https://github.com/lelouch77/docusaurus-lunr-search
|
||||
|
||||
2. __Context awareness__: Although whitespace doesn't work for all languages, one could argue that it could be a good-enough solution. Unfortunately, this is not necessarily true for code blocks, as the removal of whitespace might change meaning in some languages.
|
||||
2. __Context awareness__: Although whitespace doesn't work for all languages,
|
||||
one could argue that it could be a good-enough solution. Unfortunately, this
|
||||
is not necessarily true for code blocks, as the removal of whitespace might
|
||||
change meaning in some languages.
|
||||
|
||||
3. __Structure__: Preserving structural information is not a must, but apparently beneficial to build more meaningful search previews which allow for a quick evaluation of relevance. If a word occurrence is part of a code block, it should be rendered as a code block.
|
||||
3. __Structure__: Preserving structural information is not a must, but
|
||||
apparently beneficial to build more meaningful search previews which allow
|
||||
for a quick evaluation of relevance. If a word occurrence is part of a code
|
||||
block, it should be rendered as a code block.
|
||||
|
||||
## What's new?
|
||||
|
||||
After we built a solid understanding of the problem space and before we dive into the internals of the new search implementation to see which of the problems it already solves, a quick overview of what the new search implementation brings:
|
||||
After we built a solid understanding of the problem space and before we dive
|
||||
into the internals of our new search implementation to see which of the
|
||||
problems it already solves, a quick overview of what features and improvements
|
||||
it brings:
|
||||
|
||||
- __Better__: support for [rich search previews][19], preserving the structural information of code blocks, inline code and lists, so they are rendered as-is, as well as [lookahead tokenization][20], [more accurate highlighting][21], and improved stability of typeahead. Also, a [slightly better UX][22].
|
||||
- __Faster__ and __smaller__: significant decrease in search index size of up to 48% due to improved extraction and construction techniques, resulting in a search experience that is up to 95% faster, which is particularly helpful for large documentation projects.
|
||||
- __Better__: support for [rich search previews][19], preserving the structural
|
||||
information of code blocks, inline code and lists, so they are rendered
|
||||
as-is, as well as [lookahead tokenization][20],
|
||||
[more accurate highlighting][21], and improved stability of typeahead. Also,
|
||||
a [slightly better UX][22].
|
||||
- __Faster__ and __smaller__: significant decrease in search index size of up
|
||||
to 48% due to improved extraction and construction techniques, resulting in a
|
||||
search experience that is up to 95% faster, which is particularly helpful for
|
||||
large documentation projects.
|
||||
|
||||
_Note that our new search implementation is currently 'Insiders only', which
|
||||
means that it is reserved for sponsors, because it's those sponsors that make
|
||||
features like this possible._
|
||||
|
||||
[:octicons-heart-fill-24:{ .mdx-heart } I want to become a sponsor](../../insiders/index.md){ .md-button .md-button--primary }
|
||||
|
||||
[19]: #rich-search-previews
|
||||
[20]: #tokenizer-lookahead
|
||||
@ -215,7 +310,10 @@ After we built a solid understanding of the problem space and before we dive int
|
||||
|
||||
### Rich search previews
|
||||
|
||||
As we rebuilt the search plugin from scratch, we reworked the construction of the search index to preserve the structural information of code blocks, inline code, as well as unordered and ordered lists. Using the example from the [search index][23] section, here's how it looks:
|
||||
As we rebuilt the search plugin from scratch, we reworked the construction of
|
||||
the search index to preserve the structural information of code blocks, inline
|
||||
code, as well as unordered and ordered lists. Using the example from the
|
||||
[search index][23] section, here's how it looks:
|
||||
|
||||
=== "Now"
|
||||
|
||||
@ -225,8 +323,9 @@ As we rebuilt the search plugin from scratch, we reworked the construction of th
|
||||
|
||||
![Search preview before][25]
|
||||
|
||||
Now, __code blocks are first-class citizens of search previews__, and even inline code formatting is preserved. Let's take a look at the new structure of the search index to understand why:
|
||||
{ #example }
|
||||
Now, __code blocks are first-class citizens of search previews__, and even
|
||||
inline code formatting is preserved. Let's take a look at the new structure of
|
||||
the search index to understand why:
|
||||
|
||||
??? example "Expand to inspect search index"
|
||||
|
||||
@ -287,9 +386,15 @@ Now, __code blocks are first-class citizens of search previews__, and even inlin
|
||||
|
||||
If we inspect the search index again, we can see how the situation improved:
|
||||
|
||||
1. __Content is included only once__: the search index does not include the content of the page twice, as only the sections of a page are part of the search index. This leads to a significant reduction in size, fewer bytes to transfer and a smaller search index.
|
||||
1. __Content is included only once__: the search index does not include the
|
||||
content of the page twice, as only the sections of a page are part of the
|
||||
search index. This leads to a significant reduction in size, fewer bytes to
|
||||
transfer and a smaller search index.
|
||||
|
||||
2. __Some structure is preserved__: each section of the search index includes a small subset of HTML to provide the necessary structure to allow for more sophisticated search previews. Revisiting our example from before, let's look at an excerpt:
|
||||
2. __Some structure is preserved__: each section of the search index includes a
|
||||
small subset of HTML to provide the necessary structure to allow for more
|
||||
sophisticated search previews. Revisiting our example from before, let's
|
||||
look at an excerpt:
|
||||
|
||||
=== "Now"
|
||||
|
||||
@ -303,7 +408,9 @@ If we inspect the search index again, we can see how the situation improved:
|
||||
… links , or even code : if (isAwesome) { … }
|
||||
```
|
||||
|
||||
The punctuation issue is gone, as no additional whitespace is inserted, and the preserved markup yields additional context to make scanning search results more effective.
|
||||
The punctuation issue is gone, as no additional whitespace is inserted, and
|
||||
the preserved markup yields additional context to make scanning search
|
||||
results more effective.
|
||||
|
||||
On to the next step in the process: __tokenization__.
|
||||
|
||||
@ -313,85 +420,143 @@ On to the next step in the process: __tokenization__.
|
||||
|
||||
### Tokenizer lookahead
|
||||
|
||||
The [default tokenizer][12] of [lunr][6] uses a regular expression to split a given string, by matching each character against the [separator][11] as defined in `mkdocs.yml`. This doesn't allow for more complex separators based on lookahead or multiple characters.
|
||||
The [default tokenizer][12] of [lunr][6] uses a regular expression to split a
|
||||
given string, by matching each character against the [separator][11] as defined
|
||||
in `mkdocs.yml`. This doesn't allow for more complex separators based on
|
||||
lookahead or multiple characters.
|
||||
|
||||
Fortunately, __the new search implementation provides an advanced tokenizer__ that doesn't have these shortcomings and supports more complex regular expressions. As a result, Material for MkDocs just changed its own separator configuration to the following value:
|
||||
Fortunately, __our new search implementation provides an advanced tokenizer__
|
||||
that doesn't have these shortcomings and supports more complex regular
|
||||
expressions. As a result, Material for MkDocs just changed its own separator
|
||||
configuration to the following value:
|
||||
|
||||
```
|
||||
[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;
|
||||
```
|
||||
|
||||
While the first part up to the first `|` contains a list of single control characters at which the string should be split, the following three sections explain the remainder of the regular expression.[^4]
|
||||
While the first part up to the first `|` contains a list of single control
|
||||
characters at which the string should be split, the following three sections
|
||||
explain the remainder of the regular expression.[^4]
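
To get a feel for how the combined separator behaves, here is a minimal sketch that applies it with plain `String.prototype.split`. This is not the tokenizer shipped with the new search implementation; the `tokenize` helper and the sample string are assumptions made for illustration only.

``` ts
// Separator from the text above, built with String.raw so that backslashes
// don't need to be escaped a second time
const separator = new RegExp(
  String.raw`[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;`
)

// Hypothetical helper: split a string at the separator and drop empty tokens
function tokenize(input: string): string[] {
  return input.split(separator).filter(token => token.length > 0)
}

console.log(tokenize("search.highlight PascalCase 7.2.6"))
```

The sample string is split at the whitespace, at the `.` in `search.highlight` and at the case change in `PascalCase`, while `7.2.6` stays in one piece.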
|
||||
|
||||
[^4]: As a fun fact: the [separator default value][26] of the search plugin being `[\s\-]+` always has been kind of irritating, as it suggests that multiple characters can be considered being a separator. However, the `+` is completely irrelevant, as regular expression groups involving multiple characters were never supported by [lunr's default tokenizer][12].
|
||||
[^4]:
|
||||
As a fun fact: the [separator default value][26] of the search plugin being
|
||||
`[\s\-]+` always has been kind of irritating, as it suggests that multiple
|
||||
characters can be considered being a separator. However, the `+` is
|
||||
completely irrelevant, as regular expression groups involving multiple
|
||||
characters were never supported by [lunr's default tokenizer][12].
|
||||
|
||||
[26]: https://www.mkdocs.org/user-guide/configuration/#separator
|
||||
|
||||
#### Case changes
|
||||
|
||||
Many programming languages use `PascalCase` or `camelCase` naming conventions. When a user searches for the term `case`, it's quite natural to expect `PascalCase` and `camelCase` to show up. By adding the following match group to the separator, this can now be achieved with ease:
|
||||
Many programming languages use `PascalCase` or `camelCase` naming conventions.
|
||||
When a user searches for the term `case`, it's quite natural to expect
|
||||
`PascalCase` and `camelCase` to show up. By adding the following match group to
|
||||
the separator, this can now be achieved with ease:
|
||||
|
||||
```
|
||||
(?!\b)(?=[A-Z][a-z])
|
||||
```
|
||||
|
||||
This regular expression is a combination of a negative lookahead (`\b`, i.e. not a word boundary) and a positive lookahead (`[A-Z][a-z]`, i.e. an uppercase character followed by a lowercase character), and has the following behavior:
|
||||
This regular expression is a combination of a negative lookahead (`\b`, i.e.
|
||||
not a word boundary) and a positive lookahead (`[A-Z][a-z]`, i.e. an uppercase
|
||||
character followed by a lowercase character), and has the following behavior:
|
||||
|
||||
- `PascalCase` :octicons-arrow-right-24: `Pascal`, `Case`
|
||||
- `camelCase` :octicons-arrow-right-24: `camel`, `Case`
|
||||
- `UPPERCASE` :octicons-arrow-right-24: `UPPERCASE`
|
||||
|
||||
Searching for [:octicons-search-24: searchHighlight][27] now brings up the section discussing the `search.highlight` feature flag, which also demonstrates that this even works for search queries now![^5]
|
||||
Searching for [:octicons-search-24: searchHighlight][27] now brings up the
|
||||
section discussing the `search.highlight` feature flag, which also demonstrates
|
||||
that this even works for search queries now![^5]
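
The behavior listed above can be reproduced in isolation with a few lines of TypeScript. This is only a sketch using `String.prototype.split`, not the search implementation itself; the `caseChange` name is chosen for illustration.

``` ts
// Zero-width separator: the position must not be a word boundary and must
// be followed by an uppercase character and then a lowercase character
const caseChange = /(?!\b)(?=[A-Z][a-z])/

for (const word of ["PascalCase", "camelCase", "UPPERCASE"]) {
  console.log(word, "->", word.split(caseChange))
}
```

Since the expression consumes no characters, nothing is removed from the tokens. The string is merely cut at the positions where the case changes.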
|
||||
|
||||
[^5]: Previously, the search query was not correctly tokenized due to the way [lunr][6] treats wildcards, as it disables the pipeline for search terms that contain wildcards. In order to provide a good typeahead experience, Material for MkDocs adds wildcards to the end of each search term not explicitly preceded with `+` or `-`, effectively disabling tokenization.
|
||||
[^5]:
|
||||
Previously, the search query was not correctly tokenized due to the way
|
||||
[lunr][6] treats wildcards, as it disables the pipeline for search terms
|
||||
that contain wildcards. In order to provide a good typeahead experience,
|
||||
Material for MkDocs adds wildcards to the end of each search term not
|
||||
explicitly preceded with `+` or `-`, effectively disabling tokenization.
|
||||
|
||||
[27]: ?q=searchHighlight
|
||||
|
||||
#### Version numbers
|
||||
|
||||
Indexing version numbers is another problem that can be solved with a small lookahead. Usually, `.` should be considered a separator to split words like `search.highlight`. However, splitting version numbers at `.` will make them undiscoverable. Thus, the following expression:
|
||||
Indexing version numbers is another problem that can be solved with a small
|
||||
lookahead. Usually, `.` should be considered a separator to split words like
|
||||
`search.highlight`. However, splitting version numbers at `.` will make them
|
||||
undiscoverable. Thus, the following expression:
|
||||
|
||||
```
|
||||
\.(?!\d)
|
||||
```
|
||||
|
||||
This regular expression matches a `.`, but not immediately followed by a digit `\d`, which leaves version numbers discoverable. Searching for [:octicons-search-24: 7.2.6][28] brings up the [7.2.6][29] release notes.
|
||||
This regular expression matches a `.` only when it is not immediately followed by a digit
|
||||
`\d`, which leaves version numbers discoverable. Searching for
|
||||
[:octicons-search-24: 7.2.6][28] brings up the [7.2.6][29] release notes.
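
As a quick check of this expression on its own, the following sketch splits a dotted identifier and a version number with `String.prototype.split`. The `dot` name and the two sample strings are chosen for illustration.

``` ts
// Split at a dot, unless the dot is immediately followed by a digit
const dot = /\.(?!\d)/

console.log("search.highlight".split(dot)) // split into two words
console.log("7.2.6".split(dot))            // kept as a single token
```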
|
||||
|
||||
[28]: ?q=7.2.6
|
||||
[29]: ../../changelog.md#726-_-september-1-2021
|
||||
|
||||
#### HTML/XML tags
|
||||
|
||||
If your documentation includes HTML/XML code examples, you may want to allow users to find specific tag names. Unfortunately, the `<` and `>` control characters are encoded in code blocks as `&lt;` and `&gt;`. Now, adding the following expression to the separator allows for just that:
|
||||
If your documentation includes HTML/XML code examples, you may want to allow
|
||||
users to find specific tag names. Unfortunately, the `<` and `>` control
|
||||
characters are encoded in code blocks as `&lt;` and `&gt;`. Now, adding the
|
||||
following expression to the separator allows for just that:
|
||||
|
||||
```
|
||||
&[lg]t;
|
||||
```
|
||||
|
||||
Searching for [:octicons-search-24: custom search worker script][30] brings up the section on [custom search][31] and matches the `script` tag among the other search terms discovered.
|
||||
Searching for [:octicons-search-24: custom search worker script][30] brings up
|
||||
the section on [custom search][31] and matches the `script` tag among the other
|
||||
search terms discovered.
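
A small sketch of this expression applied to an encoded tag, again with `String.prototype.split` rather than the actual tokenizer. The `angle` name and the sample input are assumptions for illustration.

``` ts
// Treat the encoded angle brackets as separators
const angle = /&[lg]t;/

// An encoded tag as it might appear in the indexed text of a code block
const encodedTag = "&lt;script&gt;"

// Splitting leaves empty strings at both ends, which are dropped
console.log(encodedTag.split(angle).filter(token => token.length > 0))
```

Without this part of the separator, the tag name would stay glued to the surrounding entity text and could not be matched as a term of its own.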
|
||||
|
||||
---
|
||||
|
||||
_We've only just begun to scratch the surface of the new possibilities tokenizer lookahead brings. If you found other useful expressions, you're invited to share them in the comment section._
|
||||
_We've only just begun to scratch the surface of the new possibilities
|
||||
tokenizer lookahead brings. If you found other useful expressions, you're
|
||||
invited to share them in the comment section._
|
||||
|
||||
[30]: ?q=custom+search+worker+script
|
||||
[31]: ../../setup/setting-up-site-search.md#custom-search
|
||||
|
||||
### Accurate highlighting
|
||||
|
||||
Highlighting is the last step in the process of search and involves the highlighting of all search term occurrences in a given search result. For a long time, highlighting was implemented through dynamically generated [regular expressions][32].[^6]
|
||||
Highlighting is the last step in the process of search and involves the
|
||||
highlighting of all search term occurrences in a given search result. For a
|
||||
long time, highlighting was implemented through dynamically generated
|
||||
[regular expressions][32].[^6]
|
||||
|
||||
This approach has some problems with non-whitespace languages like Japanese or Chinese[^3], since it only works if the highlighted term is at a word boundary. However, Asian languages are tokenized using a [dedicated segmenter][33], which cannot be modelled with regular expressions.
|
||||
This approach has some problems with non-whitespace languages like Japanese or
|
||||
Chinese[^3], since it only works if the highlighted term is at a word boundary.
|
||||
However, Asian languages are tokenized using a [dedicated segmenter][33], which
|
||||
cannot be modelled with regular expressions.
|
||||
|
||||
[^6]: Using the separator as defined in `mkdocs.yml`, a regular expression was constructed that was trying to mimic the tokenizer. As an example, the search query `search highlight` was transformed into the rather cumbersome regular expression `(^|<separator>)(search|highlight)`, which only matches at word boundaries.
|
||||
[^6]:
|
||||
Using the separator as defined in `mkdocs.yml`, a regular expression was
|
||||
constructed that was trying to mimic the tokenizer. As an example, the
|
||||
search query `search highlight` was transformed into the rather cumbersome
|
||||
regular expression `(^|<separator>)(search|highlight)`, which only matches
|
||||
at word boundaries.
|
||||
|
||||
Now, as a direct result of the [new tokenization approach][34], __the new search implementation uses token positions for highlighting__, making it exactly as powerful as tokenization:
|
||||
Now, as a direct result of the [new tokenization approach][34], __our new
|
||||
search implementation uses token positions for highlighting__, making it
|
||||
exactly as powerful as tokenization:
|
||||
|
||||
1. __Word boundaries__: as the new highlighter uses token positions, word boundaries are equal to token boundaries. This means that more complex cases of tokenization (e.g. [case changes][35], [version numbers][36], [HTML/XML tags][37]) are now all highlighted accurately.
|
||||
1. __Word boundaries__: as the new highlighter uses token positions, word
|
||||
boundaries are equal to token boundaries. This means that more complex cases
|
||||
of tokenization (e.g. [case changes][35], [version numbers][36], [HTML/XML
|
||||
tags][37]) are now all highlighted accurately.
|
||||
|
||||
2. __Context awareness__: as the new search index preserves some of the structural information of the original document, the content of a section is now divided into separate content blocks – paragraphs, code blocks and lists.
|
||||
1. __Context awareness__: as the new search index preserves some of the
|
||||
structural information of the original document, the content of a section is
|
||||
now divided into separate content blocks – paragraphs, code blocks and lists.
|
||||
|
||||
Now, only the content blocks that actually contain occurrences of one of the search terms are considered for inclusion into the search preview. If a term only occurs in a code block, it's the code block that gets rendered, see for example the results of [:octicons-search-24: twitter][38].
|
||||
Now, only the content blocks that actually contain occurrences of one of
|
||||
the search terms are considered for inclusion into the search preview. If a
|
||||
term only occurs in a code block, it's the code block that gets rendered,
|
||||
see for example the results of [:octicons-search-24: twitter][38].
|
||||
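
The idea of highlighting by token position can be sketched as follows. This is not the implementation used in Material for MkDocs: the `Token` shape, `tokenizeWithPositions`, `highlight` and the separator passed in are all hypothetical names chosen for illustration, and HTML escaping is omitted for brevity:

```typescript
// Hypothetical token shape: the value plus its character offsets in the text
interface Token {
  value: string
  start: number
  end: number
}

// Split text on the separator while recording where each token starts and ends
function tokenizeWithPositions(text: string, separator: RegExp): Token[] {
  const tokens: Token[] = []
  const pattern = new RegExp(separator.source, "g")
  let start = 0
  let match: RegExpExecArray | null
  while ((match = pattern.exec(text)) !== null) {
    if (match[0].length === 0) {
      pattern.lastIndex++ // guard against empty matches looping forever
      continue
    }
    if (match.index > start)
      tokens.push({ value: text.slice(start, match.index), start, end: match.index })
    start = match.index + match[0].length
  }
  if (start < text.length)
    tokens.push({ value: text.slice(start), start, end: text.length })
  return tokens
}

// Wrap every token that equals a search term in <mark>, using its offsets,
// so highlight boundaries are exactly the tokenizer's boundaries
function highlight(text: string, terms: string[], separator: RegExp): string {
  const wanted = new Set(terms.map(term => term.toLowerCase()))
  let result = ""
  let cursor = 0
  for (const token of tokenizeWithPositions(text, separator)) {
    if (!wanted.has(token.value.toLowerCase())) continue
    result += text.slice(cursor, token.start)
    result += `<mark>${token.value}</mark>`
    cursor = token.end
  }
  return result + text.slice(cursor)
}

const highlighted = highlight(
  "Upgrade to 7.2.6. Enjoy search",
  ["7.2.6", "search"],
  /\s+|\.(?!\d)/
)
console.log(highlighted)
```

Because the highlighter reuses the tokenizer's offsets instead of a second regular expression, any separator the tokenizer understands, including lookahead, is highlighted consistently.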
|
||||
[32]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/highlighter/index.ts#L61-91
|
||||
[33]: http://chasen.org/~taku/software/TinySegmenter/
|
||||
@ -403,7 +568,10 @@ Now, as a direct result of the [new tokenization approach][34], __the new search
|
||||
|
||||
### Benchmarks
|
||||
|
||||
We conducted two benchmarks – one with the documentation of Material for MkDocs itself, and one with a rather large corpus of Markdown files with more than 800k words:
|
||||
We conducted two benchmarks – one with the documentation of Material for MkDocs
|
||||
itself, and one with a very large corpus of Markdown files with more than
|
||||
800,000 words – a size most documentation projects will likely never
|
||||
reach:
|
||||
|
||||
<figure markdown="1">
|
||||
|
||||
@ -418,29 +586,57 @@ We conducted two benchmarks – one with the documentation of Material for MkDoc
|
||||
| Index size (`gzip`) | 2.3 MB | __1.2 MB__ | __–48%__ |
|
||||
| Indexing time^‡^ | 2,700 ms | __1,390 ms__ | __–48%__ |
|
||||
|
||||
<figcaption>
|
||||
<p>Benchmark results</p>
|
||||
<small>‡ Smallest value of ten distinct runs</small>
|
||||
</figcaption>
|
||||
|
||||
</figure>
|
||||
|
||||
[^7]: We agnostically use [KJV Markdown][39] as a tool for testing to learn how Material for MkDocs behaves on large corpora, as it's a rather large set of Markdown files with over 800k words.
|
||||
[^7]:
|
||||
We agnostically use [KJV Markdown][39] as a tool for testing to learn how
|
||||
Material for MkDocs behaves on large corpora, as it's a very large set of
|
||||
Markdown files with over 800k words.
|
||||
|
||||
The results show that indexing time, which is the time that it takes to set up the search when the page is loaded, has dropped by up to 48%, which means __the new search is up to 95% faster__. This is a significant improvement, particularly relevant for large documentation projects.
|
||||
The results show that indexing time, which is the time that it takes to set up
|
||||
the search when the page is loaded, has dropped by up to 48%, which means __the
|
||||
new search is up to 95% faster__. This is a significant improvement,
|
||||
particularly relevant for large documentation projects.
|
||||
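
The relation between the drop in indexing time and the speedup can be checked with a few lines of arithmetic. The sketch below uses the indexing times from the table above (2,700 ms before, 1,390 ms after); the function names are hypothetical:

```typescript
// Relative change in time: negative values mean the new version takes less time
function relativeChange(beforeMs: number, afterMs: number): number {
  return (afterMs - beforeMs) / beforeMs
}

// Speedup factor: how many times faster the new version is
function speedupFactor(beforeMs: number, afterMs: number): number {
  return beforeMs / afterMs
}

const beforeMs = 2700
const afterMs = 1390

const change = relativeChange(beforeMs, afterMs) // roughly -0.485
const factor = speedupFactor(beforeMs, afterMs)  // roughly 1.94

console.log(`change: ${(change * 100).toFixed(1)}%`)
console.log(`speedup: ${factor.toFixed(2)}x`)
```

A time reduction of just under half corresponds to a speedup factor of about 1.94, which is why halving the indexing time and being close to twice as fast describe the same measurement.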
|
||||
While 1.3 s may still sound like a long time, using the new client-side search together with [instant loading][40] only initializes the search on the first page load. When navigating, the search index is preserved across pages, so the cost only has to be paid once.
|
||||
While 1.3 s may still sound like a long time, using the new client-side search
|
||||
together with [instant loading][40] only initializes the search on the first
|
||||
page load. When navigating, the search index is preserved across pages, so the
|
||||
cost only has to be paid once.
|
||||
|
||||
[39]: https://github.com/arleym/kjv-markdown
|
||||
[40]: ../../setup/setting-up-navigation.md#instant-loading
|
||||
|
||||
### User interface
|
||||
|
||||
Additionally, some small improvements have been made, most prominently the __# more results on this page__ button, which now sticks to the top of the search result list when open. This enables the user to jump out of the list more quickly.
|
||||
Additionally, some small improvements have been made, most prominently the __#
|
||||
more results on this page__ button, which now sticks to the top of the search
|
||||
result list when open. This enables the user to jump out of the list more
|
||||
quickly.
|
||||
|
||||
## What's next?
|
||||
|
||||
The new search implementation is a big improvement to Material for MkDocs. It solves some long-standing issues which needed to be tackled for years. Yet, it's only the start of a search experience that is going to get better and better. Next up:
|
||||
Our new search implementation is a big improvement to Material for MkDocs. It
|
||||
solves some long-standing issues which needed to be tackled for years. Yet,
|
||||
it's only the start of a search experience that is going to get better and
|
||||
better. Next up:
|
||||
|
||||
- __Context aware search summarization__: currently, the first two matching content blocks are rendered as a search preview. With the new tokenization technique, we laid the groundwork for more sophisticated shortening and summarization methods, which we're tackling next.
|
||||
- __Context aware search summarization__: currently, the first two matching
|
||||
content blocks are rendered as a search preview. With the new tokenization
|
||||
technique, we laid the groundwork for more sophisticated shortening and
|
||||
summarization methods, which we're tackling next.
|
||||
|
||||
- __User interface improvements__: as we've now gained full control over the search plugin, we can add meaningful metadata to provide more context and a better experience. We'll explore some of those paths in the future.
|
||||
- __User interface improvements__: as we have now gained full control over the
|
||||
search plugin, we can now add meaningful metadata to provide more context and
|
||||
a better experience. We'll explore some of those paths in the future.
|
||||
|
||||
If you've made it this far, thank you for your time and interest in Material for MkDocs! This is the first blog article which I decided to write after a short [Twitter survey][41] prompted me to. Feel free to leave a comment if you have something to say.
|
||||
If you've made it this far, thank you for your time and interest in Material
|
||||
for MkDocs! This is the first blog article which I decided to write after a
|
||||
short [Twitter survey][41] prompted me to. Feel free to leave a comment if you
|
||||
have something to say.
|
||||
|
||||
[41]: https://twitter.com/squidfunk/status/1434477478823743488
|
||||
|
@ -8,9 +8,19 @@ search:
|
||||
|
||||
<h2>Search: better, faster, smaller</h2>
|
||||
|
||||
__This is the story of how we managed to completely rebuild client-side search, delivering a significantly better user experience, while making it faster and smaller at the same time.__
|
||||
__This is the story of how we managed to completely rebuild client-side search,
|
||||
delivering a significantly better user experience, while making it faster and
|
||||
smaller at the same time.__
|
||||
|
||||
The search of Material for MkDocs is genuinely one of its best and most-loved assets: fast, multi-lingual, offline-capable and most importantly: _all client-side_. It provides a solution to empower the users of your documentation to find what they're searching for instantly without the headache of managing additional servers. However, even though several iterations have been made, there's still some room for improvement, which is why we rebuilt the search plugin and integration from the ground up. This article shines some light on the internals of the new search, why it's much more powerful than the previous version and what's about to come.
|
||||
The search of Material for MkDocs is by far one of its best and most-loved
|
||||
assets: multilingual, offline-capable and most importantly: _all client-side_.
|
||||
It provides a solution to empower the users of your documentation to find what
|
||||
they're searching for instantly without the headache of managing additional
|
||||
servers. However, even though several iterations have been made, there's still
|
||||
some room for improvement, which is why we rebuilt the search plugin and
|
||||
integration from the ground up. This article shines some light on the internals
|
||||
of the new search, why it's much more powerful than the previous version and
|
||||
what's about to come.
|
||||
|
||||
[Continue reading :octicons-arrow-right-24:][1]{ .md-button }
|
||||
|
||||
|
@ -6,6 +6,14 @@ template: overrides/main.html
|
||||
|
||||
## Material for MkDocs Insiders
|
||||
|
||||
### 3.0.0 <small>_ September 13, 2021</small>
|
||||
|
||||
- Rewrite of MkDocs' search plugin
|
||||
- Added support for rich search previews
|
||||
- Added support for tokenizer with lookahead
|
||||
- Improved search indexing performance (twice as fast)
|
||||
- Improved search highlighting
|
||||
|
||||
### 2.13.3 <small>_ September 1, 2021</small>
|
||||
|
||||
- Added support for disabling social card generation
|
||||
|
@ -137,9 +137,13 @@ The following features are currently exclusively available to sponsors:
|
||||
|
||||
<div class="mdx-columns" markdown="1">
|
||||
|
||||
- [x] [Social cards :material-new-box:][34]
|
||||
- [x] [Cookie consent :material-new-box:][33]
|
||||
- [x] [Linking content tabs :material-new-box:][32]
|
||||
- [x] [Brand new search plugin :material-new-box:][35]
|
||||
- [x] [Rich search previews :material-new-box:][36]
|
||||
- [x] [Tokenizer with lookahead :material-new-box:][37]
|
||||
- [x] [Advanced search highlighting :material-new-box:][38]
|
||||
- [x] [Social cards][34]
|
||||
- [x] [Cookie consent][33]
|
||||
- [x] [Linking content tabs][32]
|
||||
- [x] [Boosting pages in search][30]
|
||||
- [x] [Tags (with search integration)][29]
|
||||
- [x] [Stay on page when switching versions][28]
|
||||
@ -211,8 +215,8 @@ the public for general availability.
|
||||
#### $ 7,000 – Royal Gold
|
||||
|
||||
- [x] [Cookie consent][33]
|
||||
- [ ] Improved search result summaries
|
||||
- [ ] List of last searches
|
||||
- [ ] Exclude pages from search
|
||||
- [ ] Link cards
|
||||
|
||||
[33]: ../setup/setting-up-site-analytics.md#cookie-consent
|
||||
|
||||
@ -224,15 +228,17 @@ the public for general availability.
|
||||
|
||||
[34]: ../setup/setting-up-social-cards.md
|
||||
|
||||
#### Future
|
||||
#### $ 10,000 – Carolina Reaper
|
||||
|
||||
- [ ] [Material for MkDocs Live Edit][35]
|
||||
- [ ] New layouts and styles
|
||||
- [ ] Code block palette toggle
|
||||
- [ ] Native lightbox integration
|
||||
- [ ] Table of contents auto-collapse
|
||||
- [x] [Brand new search plugin][35]
|
||||
- [x] [Rich search previews][36]
|
||||
- [x] [Tokenizer with lookahead][37]
|
||||
- [x] [Advanced search highlighting][38]
|
||||
|
||||
[35]: https://twitter.com/squidfunk/status/1338252230265360391
|
||||
[35]: ../blog/2021/search-better-faster-smaller.md
|
||||
[36]: ../blog/2021/search-better-faster-smaller.md#rich-search-previews
|
||||
[37]: ../blog/2021/search-better-faster-smaller.md#tokenizer-lookahead
|
||||
[38]: ../blog/2021/search-better-faster-smaller.md#accurate-highlighting
|
||||
|
||||
### Goals completed
|
||||
|
||||
@ -297,17 +303,17 @@ implemented behind feature flags; all configuration changes are
|
||||
backward-compatible. This means that your users will be able to build the
|
||||
documentation locally with Material for MkDocs and when they push their changes,
|
||||
it can be built with Insiders (e.g. as part of GitHub Actions). Thus, it's
|
||||
recommended to [install Insiders][36] only in CI, as you don't want to expose
|
||||
recommended to [install Insiders][39] only in CI, as you don't want to expose
|
||||
your `GH_TOKEN` to users.
|
||||
|
||||
[36]: ../publishing-your-site.md#github-pages
|
||||
[39]: ../publishing-your-site.md#github-pages
|
||||
|
||||
### Payment
|
||||
|
||||
_We don't want to pay for sponsorship every month. Are there any other options?_
|
||||
|
||||
Yes. You can sponsor on a yearly basis by [switching your GitHub account to a
|
||||
yearly billing cycle][37]. If for some reason you cannot do that, you could
|
||||
yearly billing cycle][40]. If for some reason you cannot do that, you could
|
||||
also create a dedicated GitHub account with a yearly billing cycle, which you
|
||||
only use for sponsoring (some sponsors already do that).
|
||||
|
||||
@ -315,7 +321,7 @@ If you have any problems or further questions, don't hesitate to contact me at
|
||||
sponsors@squidfunk.com. Note that one-time payments are not eligible for
|
||||
Insiders, but of course, very appreciated.
|
||||
|
||||
[37]: https://docs.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/changing-the-duration-of-your-billing-cycle
|
||||
[40]: https://docs.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/changing-the-duration-of-your-billing-cycle
|
||||
|
||||
### Terms
|
||||
|
||||
@ -324,7 +330,7 @@ commercial project. Can we use Insiders under the same terms and conditions?_
|
||||
|
||||
Yes. Whether you're an individual or a company, you may use _Material for MkDocs
|
||||
Insiders_ precisely under the same terms as Material for MkDocs, which are given
|
||||
by the [MIT license][38]. However, we kindly ask you to respect the following
|
||||
by the [MIT license][41]. However, we kindly ask you to respect the following
|
||||
guidelines:
|
||||
|
||||
- Please __don't distribute the source code__ of Insiders. You may freely use
|
||||
@ -335,7 +341,7 @@ guidelines:
|
||||
- If you cancel your subscription, you're removed as a collaborator and will
|
||||
miss out on future updates of Insiders. However, you may __use the latest
|
||||
version__ that's available to you __as long as you like__. Just remember that
|
||||
[GitHub deletes private forks][39].
|
||||
[GitHub deletes private forks][42].
|
||||
|
||||
[38]: ../license.md
|
||||
[39]: https://docs.github.com/en/github/setting-up-and-managing-your-github-user-account/removing-a-collaborator-from-a-personal-repository
|
||||
[41]: ../license.md
|
||||
[42]: https://docs.github.com/en/github/setting-up-and-managing-your-github-user-account/removing-a-collaborator-from-a-personal-repository
|
||||
|
@ -17,8 +17,13 @@ with some effort, search can be made available [offline][1].
|
||||
|
||||
### Built-in search
|
||||
|
||||
!!! danger "[Search: better, faster, smaller](../blog/2021/search-better-faster-smaller.md)"
|
||||
|
||||
We rebuilt the search plugin and integration from the ground up, introducing [rich search previews](../blog/2021/search-better-faster-smaller.md#rich-search-previews), much better [tokenizer support](../blog/2021/search-better-faster-smaller.md#tokenizer-lookahead), [more accurate highlighting](../blog/2021/search-better-faster-smaller.md#accurate-highlighting) and much more. Read the [blog article](../blog/2021/search-better-faster-smaller.md) to learn more about our new search implementation. Start using it immediately by [becoming a sponsor][20]!
|
||||
|
||||
[:octicons-file-code-24: Source][2] ·
|
||||
[:octicons-cpu-24: Plugin][3]
|
||||
[:octicons-cpu-24: Plugin][3] ·
|
||||
[:octicons-heart-fill-24:{ .mdx-heart } Better in Insiders][20]{ .mdx-insiders }
|
||||
|
||||
The [built-in search plugin][3] integrates seamlessly with Material for MkDocs,
|
||||
adding multilingual client-side search with [lunr][4] and [lunr-languages][5].
|
||||
|