---
template: overrides/blog.html
description: >
  How we rebuilt client-side search, delivering a better user experience while
  making it faster and smaller at the same time
search:
  exclude: true
hide:
  - feedback
---

# Search: better, faster, smaller

__This is the story of how we managed to completely rebuild client-side search,
delivering a significantly better user experience while making it faster and
smaller at the same time.__

<aside class="mdx-author" markdown>
![@squidfunk][@squidfunk avatar]

<span>__Martin Donath__ · @squidfunk</span>
<span>
:octicons-calendar-24: September 13, 2021 ·
:octicons-clock-24: 15 min read ·
[:octicons-tag-24: 7.2.6+insiders-3.0.0][insiders-3.0.0]
</span>
</aside>

  [@squidfunk avatar]: https://avatars.githubusercontent.com/u/932156
  [insiders-3.0.0]: ../../insiders/changelog.md#3.0.0

---

The [search] of Material for MkDocs is by far one of its best and most-loved
assets: [multilingual], [offline-capable], and most importantly: _all
client-side_. It provides a solution to empower the users of your documentation
to find what they're searching for instantly without the headache of managing
additional servers. However, even though several iterations have been made,
there's still some room for improvement, which is why we rebuilt the search
plugin and integration from the ground up. This article shines some light on the
internals of the new search, why it's much more powerful than the previous
version, and what's about to come.

_The next section discusses the architecture and issues of the current search
implementation. If you immediately want to learn what's new, skip to the
[section just after that][what's new]._

  [search]: ../../setup/setting-up-site-search.md
  [multilingual]: ../../setup/setting-up-site-search.md#lang
  [offline-capable]: ../../setup/building-for-offline-usage.md
  [what's new]: #whats-new

## Architecture

Material for MkDocs uses [lunr] together with [lunr-languages] to implement
its client-side search capabilities. When a documentation page is loaded and
JavaScript is available, the search index as generated by the
[built-in search plugin] during the build process is requested from the
server:

``` ts
const index$ = document.forms.namedItem("search")
  ? __search?.index || requestJSON<SearchIndex>(
      new URL("search/search_index.json", config.base)
    )
  : NEVER
```

  [lunr]: https://lunrjs.com
  [lunr-languages]: https://github.com/MihaiValentin/lunr-languages
  [built-in search plugin]: ../../setup/setting-up-site-search.md#built-in-search-plugin

### Search index

The search index includes a stripped-down version of all pages. Let's take a
look at an example to understand precisely what the search index contains from
the original Markdown file:

??? example "Expand to inspect example"

    === ":octicons-file-code-16: docs/page.md"

        ```` markdown
        # Example

        ## Text

        It's very easy to make some words **bold** and other words *italic*
        with Markdown. You can even add [links](#), or even `code`:

        ```
        if (isAwesome) {
          return true
        }
        ```

        ## Lists

        Sometimes you want numbered lists:

        1. One
        2. Two
        3. Three

        Sometimes you want bullet points:

        * Start a line with a star
        * Profit!
        ````

    === ":octicons-codescan-16: search_index.json"

        ``` json
        {
          "config": {
            "indexing": "full",
            "lang": [
              "en"
            ],
            "min_search_length": 3,
            "prebuild_index": false,
            "separator": "[\\s\\-]+"
          },
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            },
            {
              "location": "page/#example",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            }
          ]
        }
        ```

If we inspect the search index, we immediately see several problems:

1.  __All content is included twice__: the search index contains one entry
    with the entire contents of the page, and one entry for each section of
    the page, i.e., each block preceded by a headline or subheadline. This
    significantly contributes to the size of the search index.

2.  __All structure is lost__: when the search index is built, all structural
    information like HTML tags and attributes is stripped from the content.
    While this approach works well for paragraphs and inline formatting, it
    might be problematic for lists and code blocks. An excerpt:

    ```
    … links , or even code : if (isAwesome) { … } Lists Sometimes you want …
    ```

    - __Context__: to the untrained eye, the result can look like gibberish, as
      it's not immediately apparent what classifies as text and what as code.
      Furthermore, it's not clear that `Lists` is a headline as it's merged
      with the code block before and the paragraph after it.

    - __Punctuation__: inline elements like links that are immediately followed
      by punctuation are separated by whitespace (see `,` and `:` in the
      excerpt). This is because all extracted text is joined with a whitespace
      character during the construction of the search index.

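The punctuation issue is easy to reproduce. The following is a minimal sketch
of the effect, not the plugin's actual code: `extractText` is a hypothetical
helper that drops all tags and joins the remaining text fragments with a single
whitespace character, as described above.

``` ts
// Hypothetical helper (an assumption for illustration): strips all tags and
// joins the remaining text fragments with a single whitespace character
function extractText(html: string): string {
  return html
    .split(/<[^>]+>/)                        // drop tags, keep text in between
    .map(fragment => fragment.trim())        // normalize each fragment
    .filter(fragment => fragment.length > 0) // skip empty fragments
    .join(" ")                               // this join adds the stray spaces
}

const html =
  "<p>You can even add <a href=\"#\">links</a>, or even <code>code</code>:</p>"

const flattened = extractText(html)
// flattened is "You can even add links , or even code :"
```

Since `links` and `,` end up in different fragments, the join places a space
between them, which is exactly what the excerpt shows.
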
It's not difficult to see that implementing a good search experience on top of
such an index can be quite challenging for theme authors, which is why Material
for MkDocs (up to now) did some [monkey patching] to be able to render slightly
more meaningful search previews.

  [monkey patching]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/document/index.ts#L68-L71

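For reference, the structure of `search_index.json` from the example can be
described with a few TypeScript interfaces. This is a sketch derived from the
JSON shown above: the interface names and the `measureText` helper are
assumptions for illustration, and the sample data is an abbreviated version of
the example. It also makes the first problem measurable by counting how many
characters are stored in page-level and section-level entries.

``` ts
// Shape of search_index.json as seen in the example (names are assumptions)
interface SearchIndexConfig {
  indexing: string
  lang: string[]
  min_search_length: number
  prebuild_index: boolean
  separator: string
}

interface SearchIndexDocument {
  location: string // "page/" for a page, "page/#anchor" for a section
  title: string
  text: string
}

interface SearchIndex {
  config: SearchIndexConfig
  docs: SearchIndexDocument[]
}

// Count the characters stored in page-level and section-level entries
function measureText(index: SearchIndex): { pages: number; sections: number } {
  let pages = 0
  let sections = 0
  for (const doc of index.docs) {
    if (doc.location.indexOf("#") === -1) pages += doc.text.length
    else sections += doc.text.length
  }
  return { pages, sections }
}

// Abbreviated version of the example index (texts shortened for brevity)
const sample: SearchIndex = {
  config: {
    indexing: "full",
    lang: ["en"],
    min_search_length: 3,
    prebuild_index: false,
    separator: "[\\s\\-]+"
  },
  docs: [
    { location: "page/", title: "Example", text: "Example Text It's very easy Lists Sometimes you want" },
    { location: "page/#example", title: "Example", text: "" },
    { location: "page/#text", title: "Text", text: "It's very easy" },
    { location: "page/#lists", title: "Lists", text: "Sometimes you want" }
  ]
}

const counts = measureText(sample)
```

In this sample, the page-level entry stores more characters than all of its
sections combined, because it repeats their text and adds the headlines.
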
### Search worker

The actual search functionality is implemented as part of a web worker[^1],
which creates and manages the [lunr] search index. When search is initialized,
the following steps are taken:

[^1]:
    Prior to :octicons-tag-24: 5.0.0, search was carried out in the main thread
    which locked up the browser, rendering it unusable. This problem was first
    reported in #904 and, after some back and forth, fixed and released in
    :octicons-tag-24: 5.0.0.

1.  __Linking sections with pages__: The search index is parsed, and each
    section is linked to its parent page. The parent page itself is _not
    indexed_, as it would lead to duplicate results, so only the sections
    remain. Linking is necessary, as search results are grouped by page.

2.  __Tokenization__: The `title` and `text` values of each section are split
    into tokens by using the [`separator`][separator] as configured in
    `mkdocs.yml`. Tokenization itself is carried out by
    [lunr's default tokenizer][default tokenizer], which doesn't allow for
    lookahead or separators spanning multiple characters.

    > Why is this important and a big deal? We will see later how much more we
    > can achieve with a tokenizer that is capable of separating strings with
    > lookahead.

3.  __Indexing__: As a final step, each section is indexed. When querying the
    index, if a search query includes one of the tokens as returned by step 2,
    the section is considered to be part of the search result and passed to the
    main thread.

Now, that's basically how the search worker operates. Sure, there's a little
more magic involved, e.g., search results are [post-processed] and [rescored] to
account for some shortcomings of [lunr], but in general, this is how data gets
into and out of the index.

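The first step, linking sections with pages, can be sketched as follows. This
is a simplified illustration rather than the worker's actual code: the
`SearchDocument` and `Section` types and the `linkSections` function are
hypothetical names, and the sketch assumes that a page entry always precedes
its sections, as it does in the example index.

``` ts
interface SearchDocument {
  location: string
  title: string
  text: string
}

// A section that knows which page it belongs to (hypothetical type)
interface Section extends SearchDocument {
  parent: SearchDocument
}

function linkSections(docs: SearchDocument[]): Section[] {
  const pages: { [path: string]: SearchDocument } = {}
  const sections: Section[] = []
  for (const doc of docs) {
    const hash = doc.location.indexOf("#")
    if (hash === -1) {
      // Page entry: remember it for linking, but don't index it
      pages[doc.location] = doc
    } else {
      // Section entry: link it to the page it belongs to
      const parent = pages[doc.location.slice(0, hash)]
      if (parent) {
        sections.push({
          location: doc.location,
          title: doc.title,
          text: doc.text,
          parent
        })
      }
    }
  }
  return sections
}

const linked = linkSections([
  { location: "page/", title: "Example", text: "" },
  { location: "page/#text", title: "Text", text: "It's very easy" },
  { location: "page/#lists", title: "Lists", text: "Sometimes you want" }
])
// linked contains the two sections, each with parent.location "page/"
```

Keeping a reference to the parent is what later allows results to be grouped
by page, even though only sections are part of the index.
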
  [separator]: ../../setup/setting-up-site-search.md#search-separator
  [default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
  [post-processed]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L249-L272
  [rescored]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L274-L275

### Search previews

Users should be able to quickly scan and evaluate the relevance of a search
result in the given context, which is why a concise summary with highlighted
occurrences of the search terms found is an essential part of a great search
experience.

This is where the current search preview generation falls short, as some of the
search previews appear not to include any occurrence of any of the search
terms. This is due to the fact that search previews are [truncated after a
maximum of 320 characters][truncated], as can be seen here:

<figure markdown>

![search preview]

<figcaption markdown>

The first two results look like they're not relevant, as they don't seem to
include the query string the user just searched for. Yet, they are.

</figcaption>
</figure>

A better solution to this problem has been on the roadmap for a very, very long
time, but in order to solve this once and for all, several factors need to be
carefully considered:

1.  __Word boundaries__: some themes[^2] for static site generators generate
    search previews by expanding the text left and right next to an occurrence,
    stopping at a whitespace character when enough words have been consumed. A
    preview might look like this:

    ```
    … channels, e.g., or which can be configured via mkdocs.yml …
    ```

    While this may work for languages that use whitespace as a separator
    between words, it breaks down for languages like Japanese or Chinese[^3],
    as they have non-whitespace word boundaries and use dedicated segmenters to
    split strings into tokens.

2.  __Context-awareness__: Although whitespace doesn't work for all languages,
    one could argue that it could be a good enough solution. Unfortunately, this
    is not necessarily true for code blocks, as the removal of whitespace might
    change meaning in some languages.

3.  __Structure__: Preserving structural information is not a must, but it is
    clearly beneficial for building more meaningful search previews, which
    allow for a quick evaluation of relevance. If a word occurrence is part of
    a code block, it should be rendered as a code block.

[^2]:
    At the time of writing, [Just the Docs] and [Docusaurus] use this method
    for generating search previews. Note that the latter also integrates with
    Algolia, which is a fully managed server-based solution.

[^3]:
    China and Japan are both within the top 5 countries of origin of users of
    Material for MkDocs.

  [truncated]: https://github.com/squidfunk/mkdocs-material/blob/master/src/assets/javascripts/templates/search/index.tsx#L90
  [search preview]: search-better-faster-smaller/search-preview.png
  [Just the Docs]: https://pmarsceill.github.io/just-the-docs/
  [Docusaurus]: https://github.com/lelouch77/docusaurus-lunr-search

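To see why the whitespace-based approach from the first item breaks down,
consider this minimal sketch. It's not taken from any of the themes mentioned:
`previewAround` is a hypothetical function that splits the text at whitespace
and keeps a fixed number of words on each side of the first occurrence.

``` ts
// Hypothetical preview generator: keeps `context` words on each side of the
// first word that contains the search term
function previewAround(text: string, term: string, context = 3): string {
  const words = text.split(/\s+/)
  const needle = term.toLowerCase()

  // Find the first word containing the search term
  let hit = -1
  for (let i = 0; i < words.length; i++) {
    if (words[i].toLowerCase().indexOf(needle) !== -1) {
      hit = i
      break
    }
  }
  if (hit === -1) return ""

  // Expand left and right, stopping at word (i.e., whitespace) boundaries
  const from = Math.max(0, hit - context)
  const to = Math.min(words.length, hit + context + 1)
  return (
    (from > 0 ? "… " : "") +
    words.slice(from, to).join(" ") +
    (to < words.length ? " …" : "")
  )
}

const english = previewAround("a b c d e f g h i", "e", 2)
// english is "… c d e f g …"

const japanese = previewAround("検索インデックスを構築します", "インデックス", 2)
// japanese is the entire input, as it contains no whitespace to split at
```

For the Japanese sentence, the whole text is treated as a single word, so the
preview can't be shortened at all.
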
## What's new?

Now that we've built a solid understanding of the problem space, and before we
dive into the internals of our new search implementation to see which of the
problems it already solves, here's a quick overview of the features and
improvements it brings:

- __Better__: support for [rich search previews], preserving the structural
  information of code blocks, inline code, and lists, so they are rendered
  as-is, as well as [lookahead tokenization], [more accurate highlighting], and
  improved stability of typeahead. Also, a [slightly better UX].
- __Faster__ and __smaller__: significant decrease in search index size of up
  to 48% due to improved extraction and construction techniques, resulting in a
  search experience that is up to 95% faster, which is particularly helpful for
  large documentation projects.

  [rich search previews]: #rich-search-previews
  [lookahead tokenization]: #tokenizer-lookahead
  [more accurate highlighting]: #accurate-highlighting
  [slightly better UX]: #user-interface

### Rich search previews

As we rebuilt the search plugin from scratch, we reworked the construction of
the search index to preserve the structural information of code blocks, inline
code, as well as unordered and ordered lists. Using the example from the
[search index] section, here's how it looks:

=== "Now"

    ![search preview now]

=== "Before"

    ![search preview before]

Now, __code blocks are first-class citizens of search previews__, and even
inline code formatting is preserved. Let's take a look at the new structure of
the search index to understand why:

??? example "Expand to inspect search index"

    === "Now"

        ``` json
        {
          ...
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "<p>It's very easy to make some words bold and other words italic with Markdown. You can even add links, or even <code>code</code>:</p> <pre><code>if (isAwesome){\n return true\n}\n</code></pre>"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "<p>Sometimes you want numbered lists:</p> <ol> <li>One</li> <li>Two</li> <li>Three</li> </ol> <p>Sometimes you want bullet points:</p> <ul> <li>Start a line with a star</li> <li>Profit!</li> </ul>"
            }
          ]
        }
        ```

    === "Before"

        ``` json
        {
          ...
          "docs": [
            {
              "location": "page/",
              "title": "Example",
              "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            },
            {
              "location": "page/#example",
              "title": "Example",
              "text": ""
            },
            {
              "location": "page/#text",
              "title": "Text",
              "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }"
            },
            {
              "location": "page/#lists",
              "title": "Lists",
              "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!"
            }
          ]
        }
        ```

If we inspect the search index again, we can see how the situation improved:

1.  __Content is included only once__: the search index does not include the
    content of the page twice, as only the sections of a page are part of the
    search index. This significantly reduces the size of the search index,
    leaving fewer bytes to transfer.

2.  __Some structure is preserved__: each section of the search index includes
    a small subset of HTML to provide the necessary structure to allow for more
    sophisticated search previews. Revisiting our example from before, let's
    look at an excerpt:

    === "Now"

        ``` html
        … links, or even <code>code</code>:</p> <pre><code>if (isAwesome){ … }\n</code></pre>
        ```

    === "Before"

        ```
        … links , or even code : if (isAwesome) { … }
        ```

    The punctuation issue is gone, as no additional whitespace is inserted, and
    the preserved markup yields additional context to make scanning search
    results more effective.

On to the next step in the process: __tokenization__.

  [search index]: #search-index
  [search preview now]: search-better-faster-smaller/search-preview-now.png
  [search preview before]: search-better-faster-smaller/search-preview-before.png

### Tokenizer lookahead

The [default tokenizer] of [lunr] uses a regular expression to split a given
string by matching each character against the [`separator`][separator] as
defined in `mkdocs.yml`. This doesn't allow for more complex separators based
on lookahead or multiple characters.

Fortunately, __our new search implementation provides an advanced tokenizer__
that doesn't have these shortcomings and supports more complex regular
expressions. As a result, Material for MkDocs just changed its own separator
configuration to the following value:

```
[\s\-,:!=\[\]()"/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;
```

While the first part up to the first `|` contains a list of single control
characters at which the string should be split, the following three sections
explain the remainder of the regular expression.[^4]

[^4]:
    As a fun fact: the [`separator`][separator] [default value] of the search
    plugin, `[\s\-]+`, has always been somewhat misleading, as it suggests
    that a sequence of multiple characters can be considered a separator.
    However, the `+` is completely irrelevant, as regular expression groups
    involving multiple characters were never supported by
    [lunr's default tokenizer][default tokenizer].

  [default value]: https://www.mkdocs.org/user-guide/configuration/#separator

#### Case changes

Many programming languages use `PascalCase` or `camelCase` naming conventions.
When a user searches for the term `case`, it's quite natural to expect
`PascalCase` and `camelCase` to show up. By adding the following match group to
the separator, this can now be achieved with ease:

```
(?!\b)(?=[A-Z][a-z])
```

This regular expression is a combination of a negative lookahead (`\b`, i.e.,
not a word boundary) and a positive lookahead (`[A-Z][a-z]`, i.e., an uppercase
character followed by a lowercase character), and has the following behavior:

- `PascalCase` :octicons-arrow-right-24: `Pascal`, `Case`
- `camelCase` :octicons-arrow-right-24: `camel`, `Case`
- `UPPERCASE` :octicons-arrow-right-24: `UPPERCASE`

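This behavior can be checked directly in the browser console or with Node.js.
The snippet below is only a quick sketch that uses `String.prototype.split`
with the match group from above. It doesn't use the search implementation's
tokenizer, and the variable names are made up for illustration.

``` ts
// Zero-width separator: matches inside a word, right before an uppercase
// character that is followed by a lowercase character
const caseChange = /(?!\b)(?=[A-Z][a-z])/

const pascal = "PascalCase".split(caseChange) // ["Pascal", "Case"]
const camel = "camelCase".split(caseChange)   // ["camel", "Case"]
const upper = "UPPERCASE".split(caseChange)   // ["UPPERCASE"]
```

Note that the expression matches a position, not a character, so nothing is
removed from the string when it's split.
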
Searching for [:octicons-search-24: searchHighlight][q=searchHighlight]
now brings up the section discussing the `search.highlight` feature flag, which
also demonstrates that this now even works properly for search queries.[^5]

[^5]:
    Previously, the search query was not correctly tokenized due to the way
    [lunr] treats wildcards, as it disables the pipeline for search terms that
    contain wildcards. In order to provide a good typeahead experience,
    Material for MkDocs adds wildcards to the end of each search term not
    explicitly preceded by `+` or `-`, effectively disabling tokenization.

  [q=searchHighlight]: ?q=searchHighlight

#### Version numbers

Indexing version numbers is another problem that can be solved with a small
lookahead. Usually, `.` should be considered a separator to split words like
`search.highlight`. However, splitting version numbers at `.` will make them
undiscoverable. Thus, we use the following expression:

```
\.(?!\d)
```

This regular expression matches a `.` only if it is not immediately followed by
a digit `\d`, which leaves version numbers discoverable. Searching for
[:octicons-search-24: 7.2.6][q=7.2.6] brings up the [7.2.6] release notes.

  [q=7.2.6]: ?q=7.2.6
  [7.2.6]: ../../changelog/index.md#726-_-september-1-2021

#### HTML/XML tags

If your documentation includes HTML/XML code examples, you may want to allow
users to find specific tag names. Unfortunately, the `<` and `>` control
characters are encoded in code blocks as `&lt;` and `&gt;`. Now, adding the
following expression to the separator allows for just that:

```
&[lg]t;
```

Searching for [:octicons-search-24: custom search worker script][q=script]
brings up the section on [custom search] and matches the `script` tag among the
other search terms discovered.

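Putting all parts together, the effect of the complete separator can be
explored with a few lines of TypeScript. As before, this is a sketch based on
`String.prototype.split` and not the tokenizer of the new search itself. The
`tokenize` function is a hypothetical helper that additionally drops empty
tokens.

``` ts
// The separator from above, written as a regular expression literal
const separator =
  /[\s\-,:!=\[\]()"\/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;/

// Hypothetical helper: split at the separator and drop empty tokens
function tokenize(text: string): string[] {
  return text.split(separator).filter(token => token.length > 0)
}

const flag = tokenize("search.highlight") // ["search", "highlight"]
const version = tokenize("7.2.6")         // ["7.2.6"]
const tag = tokenize("&lt;script&gt;")    // ["script"]
```

The `.` in `search.highlight` is treated as a separator, the dots in `7.2.6`
are not, and the encoded angle brackets are stripped from the tag name.
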
---

_We've only just begun to scratch the surface of the new possibilities
tokenizer lookahead brings. If you found other useful expressions, you're
invited to share them in the comment section._

  [q=script]: ?q=custom+search+worker+script
  [custom search]: ../../setup/setting-up-site-search.md#custom-search

### Accurate highlighting

Highlighting is the last step in the process of search and involves the
highlighting of all search term occurrences in a given search result. For a
long time, highlighting was implemented through dynamically generated
[regular expressions].[^6]

This approach has some problems with non-whitespace languages like Japanese or
Chinese[^3] since it only works if the highlighted term is at a word boundary.
However, Asian languages are tokenized using a [dedicated segmenter], which
cannot be modeled with regular expressions.

[^6]:
    Using the separator as defined in `mkdocs.yml`, a regular expression was
    constructed that tried to mimic the tokenizer. As an example, the
    search query `search highlight` was transformed into the rather cumbersome
    regular expression `(^|<separator>)(search|highlight)`, which only matches
    at word boundaries.

Now, as a direct result of the [new tokenization approach], __our new search
implementation uses token positions for highlighting__, making it exactly as
powerful as tokenization:

1.  __Word boundaries__: as the new highlighter uses token positions, word
    boundaries are equal to token boundaries. This means that more complex cases
    of tokenization (e.g., [case changes], [version numbers], [HTML/XML tags]),
    are now all highlighted accurately.

2.  __Context-awareness__: as the new search index preserves some of the
    structural information of the original document, the content of a section
    is now divided into separate content blocks – paragraphs, code blocks, and
    lists.

    Now, only the content blocks that actually contain occurrences of one of
    the search terms are considered for inclusion into the search preview. If a
    term only occurs in a code block, it's the code block that gets rendered.
    See, for example, the results of
    [:octicons-search-24: twitter][q=twitter].

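The idea of highlighting by token position can be illustrated with a small
sketch. It's deliberately simplified and not the actual highlighter: the
`Token` type and the `tokenizeWithPositions` and `highlight` functions are
hypothetical names, terms are matched by exact (case-insensitive) comparison,
and occurrences are wrapped in `<mark>` for demonstration purposes.

``` ts
interface Token {
  value: string
  start: number // index of the first character
  end: number   // index after the last character
}

const separator =
  /[\s\-,:!=\[\]()"\/]+|(?!\b)(?=[A-Z][a-z])|\.(?!\d)|&[lg]t;/

// Split the text at the separator, remembering where each token is located
function tokenizeWithPositions(text: string): Token[] {
  const pattern = new RegExp(separator.source, "g")
  const tokens: Token[] = []
  let start = 0
  let match: RegExpExecArray | null
  while ((match = pattern.exec(text)) !== null) {
    if (match.index > start)
      tokens.push({ value: text.slice(start, match.index), start, end: match.index })
    start = match.index + match[0].length
    // Zero-width matches don't advance the pattern, so we do it manually
    if (match[0].length === 0) pattern.lastIndex++
  }
  if (start < text.length)
    tokens.push({ value: text.slice(start), start, end: text.length })
  return tokens
}

// Wrap all tokens matching one of the terms, using only their positions
function highlight(text: string, terms: string[]): string {
  let result = ""
  let cursor = 0
  for (const token of tokenizeWithPositions(text)) {
    if (terms.indexOf(token.value.toLowerCase()) === -1) continue
    result += text.slice(cursor, token.start) + "<mark>" + token.value + "</mark>"
    cursor = token.end
  }
  return result + text.slice(cursor)
}

const highlighted = highlight("Enable searchHighlight in 7.2.6", ["highlight"])
// highlighted is "Enable search<mark>Highlight</mark> in 7.2.6"
```

As the highlighter reuses the tokenizer's boundaries, `Highlight` is marked
even though it's not preceded by whitespace.
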
  [regular expressions]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/highlighter/index.ts#L61-L91
  [dedicated segmenter]: http://chasen.org/~taku/software/TinySegmenter/
  [new tokenization approach]: #tokenizer-lookahead
  [case changes]: #case-changes
  [version numbers]: #version-numbers
  [HTML/XML tags]: #htmlxml-tags
  [q=twitter]: ?q=twitter

### Benchmarks

We conducted two benchmarks – one with the documentation of Material for MkDocs
itself, and one with a very large corpus of Markdown files with more than
800,000 words – a size most documentation projects will likely never
reach:

<figure markdown>

|                         |   Before |            Now |     Relative |
| ----------------------- | -------: | -------------: | -----------: |
| __Material for MkDocs__ |          |                |              |
| Index size              |   573 kB |     __335 kB__ |     __–42%__ |
| Index size (`gzip`)     |   105 kB |      __78 kB__ |     __–27%__ |
| Indexing time[^7]       |   265 ms |     __177 ms__ |     __–34%__ |
| __KJV Markdown[^8]__    |          |                |              |
| Index size              |   8.2 MB |     __4.4 MB__ |     __–47%__ |
| Index size (`gzip`)     |   2.3 MB |     __1.2 MB__ |     __–48%__ |
| Indexing time           | 2,700 ms |   __1,390 ms__ |     __–48%__ |

<figcaption>
<p>Benchmark results</p>
</figcaption>

</figure>

[^7]:
    Smallest value of ten distinct runs.

[^8]:
    We agnostically use [KJV Markdown] as a tool for testing to learn how
    Material for MkDocs behaves on large corpora, as it's a very large set of
    Markdown files with over 800k words.

The results show that indexing time, which is the time that it takes to set up
the search when the page is loaded, has dropped by up to 48%. As cutting the
time almost in half amounts to almost doubling the speed, this means __the
new search is up to 95% faster__. This is a significant improvement,
particularly relevant for large documentation projects.

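The relation between the drop in indexing time and the gain in speed can be
reproduced from the table. The following sketch uses the rounded indexing times
of the KJV Markdown benchmark, so the results may deviate slightly from the
figures quoted above. The function names are chosen for illustration only.

``` ts
// Relative change of a measurement, e.g., -0.25 for a drop of 25%
function relativeChange(before: number, after: number): number {
  return (after - before) / before
}

// Factor by which a task got faster, e.g., 2 for "twice as fast"
function speedup(before: number, after: number): number {
  return before / after
}

// Indexing time of the KJV Markdown benchmark in milliseconds
const change = relativeChange(2700, 1390) // ≈ -0.485, i.e., about –48%
const factor = speedup(2700, 1390)        // ≈ 1.94, i.e., almost twice as fast
```

A factor of about 1.94 corresponds to indexing being roughly 94% faster, which
is close to the figure stated above.
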
While 1.3 s may still sound like a long time, using the new client-side search
together with [instant loading] only creates the search index on the initial
page load. When navigating, the search index is preserved across pages, so the
cost only has to be paid once.

  [KJV Markdown]: https://github.com/arleym/kjv-markdown
  [instant loading]: ../../setup/setting-up-navigation.md#instant-loading

### User interface

Additionally, some small improvements have been made, most prominently the
__more results on this page__ button, which now sticks to the top of the search
result list when open. This enables the user to jump out of the list more
quickly.

## What's next?

Our new search implementation is a big improvement to Material for MkDocs. It
solves some long-standing issues that had needed to be tackled for years. Yet,
it's only the start of a search experience that is going to get better and
better. Next up:

- __Context-aware search summarization__: currently, the first two matching
  content blocks are rendered as a search preview. With the new tokenization
  technique, we laid the groundwork for more sophisticated shortening and
  summarization methods, which we're tackling next.

- __User interface improvements__: now that we have full control over the
  search plugin, we can add meaningful metadata to provide more context and
  a better experience. We'll explore some of those paths in the future.

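To make the starting point for summarization more tangible, here's a rough
sketch of selecting the first two matching content blocks of a section. It
should be read as an illustration under simplifying assumptions, not as the
actual implementation: `splitBlocks` and `preview` are hypothetical functions,
blocks are detected with a regular expression that assumes they are not nested,
and terms are matched by plain substring search.

``` ts
// Split the HTML of a section into its top-level content blocks, assuming
// that paragraphs, code blocks, and lists are not nested into each other
function splitBlocks(html: string): string[] {
  return html.match(/<(p|pre|ol|ul)>[\s\S]*?<\/\1>/g) || []
}

// Return the first `limit` blocks whose text contains the search term
function preview(html: string, term: string, limit = 2): string[] {
  const needle = term.toLowerCase()
  return splitBlocks(html)
    .filter(block => {
      const text = block.replace(/<[^>]+>/g, "") // ignore markup when matching
      return text.toLowerCase().indexOf(needle) !== -1
    })
    .slice(0, limit)
}

// The "text" value of the section "page/#text" from the example index
const section =
  "<p>It's very easy to make some words bold and other words italic with " +
  "Markdown. You can even add links, or even <code>code</code>:</p> " +
  "<pre><code>if (isAwesome){\n return true\n}\n</code></pre>"

const blocks = preview(section, "isAwesome")
// blocks contains only the code block, as the paragraph doesn't match
```

More sophisticated methods could, for example, rank blocks or shorten long
ones instead of simply taking the first two.
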
If you've made it this far, thank you for your time and interest in Material
for MkDocs! This is the first blog article that I decided to write after a
short [Twitter survey] made me do it. You're invited to leave a comment
to share your experiences with the new search implementation.

  [Twitter survey]: https://twitter.com/squidfunk/status/1434477478823743488