Set up blog and started article about new search

squidfunk 2021-09-12 16:41:19 +02:00
parent cc724f77d4
commit a5e3052db6
8 changed files with 267 additions and 12 deletions


@ -0,0 +1,235 @@
---
template: overrides/main.html
search:
boost: 0.5
---
# Search: better, faster, smaller
__This is the story of how we managed to completely rebuild client-side search, delivering a significantly better user experience, while making it faster and smaller at the same time.__
<aside class="mdx-author" markdown="1">
![@squidfunk][1]
<span>__Martin Donath__ · @squidfunk</span>
<span>
:octicons-calendar-24: September 12, 2021 ·
:octicons-clock-24: 15 min read
</span>
</aside>
[1]: https://avatars.githubusercontent.com/u/932156
---
The [search][2] of Material for MkDocs is genuinely one of its best and most-loved assets: fast, [multi-lingual][3], [offline-capable][4] and most importantly: _all client-side_. It provides a solution to empower the users of your documentation to find what they're searching for instantly without the headache of managing additional servers. However, even though several iterations have been made, there's still some room for improvement, which is why we rebuilt the search plugin and integration from the ground up. This article shines some light on the internals of the new search, why it's much more powerful than the previous version and what's about to come.
_The next section explains the architecture and issues of the current search implementation. If you want to learn what's new right away, skip to [What's new?][5]._
[2]: ../../setup/setting-up-site-search.md
[3]: ../../setup/setting-up-site-search.md#lang
[4]: ../../setup/setting-up-site-search.md#offline-search
[5]: #whats-new
## Architecture
Material for MkDocs uses [lunr][6] together with [lunr-languages][7] to implement its client-side search capabilities. When a documentation page is loaded and JavaScript is available, the search index as generated by the [built-in search plugin][8] during the build process is requested from the server:
``` ts
const index$ = document.forms.namedItem("search")
  ? __search?.index || requestJSON<SearchIndex>(
      new URL("search/search_index.json", config.base)
    )
  : NEVER
```
[6]: https://lunrjs.com
[7]: https://github.com/MihaiValentin/lunr-languages
[8]: ../../setup/setting-up-site-search.md#built-in-search
### Search index
The search index includes a stripped-down version of all pages. Let's take a look at an example, to understand precisely what the search index contains from the original Markdown file:
??? example "Expand to see full example"
=== "`docs/page.md`"
```` markdown
# Example
## Text
It's very easy to make some words **bold** and other words *italic* with
Markdown. You can even add [links](#), or even `code`:
```
if (isAwesome) {
  return true
}
```
## Lists
Sometimes you want numbered lists:
1. One
2. Two
3. Three
Sometimes you want bullet points:
* Start a line with a star
* Profit!
````
=== "`search_index.json`"
``` json
{
  "config": {
    "indexing": "full",
    "lang": [
      "en"
    ],
    "min_search_length": 3,
    "prebuild_index": false,
    "separator": "[\\s\\-]+"
  },
  "docs": [
    {
      "location": "page/",
      "text": "Example Text It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true } Lists Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!",
      "title": "Example"
    },
    {
      "location": "page/#example",
      "text": "",
      "title": "Example"
    },
    {
      "location": "page/#text",
      "text": "It's very easy to make some words bold and other words italic with Markdown. You can even add links , or even code : if (isAwesome) { return true }",
      "title": "Text"
    },
    {
      "location": "page/#lists",
      "text": "Sometimes you want numbered lists: One Two Three Sometimes you want bullet points: Start a line with a star Profit!",
      "title": "Lists"
    }
  ]
}
```
If we inspect the search index, we immediately see several problems:
1. __All content is included twice__: the search index includes one entry with the entire contents of the page, and one entry for each section of the page, i.e. each block preceded by a headline or subheadline. This significantly increases the size of the search index.
2. __All structure is lost__: when the search index is built, all structural information like HTML tags and attributes is stripped from the content. While this approach works well for paragraphs and inline formatting, it can be problematic for lists and code blocks. An excerpt:
```
[...] links , or even code : if (isAwesome) { ... } Lists Sometimes [...]
```
- __Context__: to the untrained eye, the result can look like gibberish, as it's not immediately apparent what classifies as text and what as code. Furthermore, it's not clear that `Lists` is a headline, as it's merged with the code block before and the paragraph after it.
- __Punctuation__: inline elements like links that are immediately followed by punctuation are separated from it by whitespace (see `,` and `:` in the excerpt). This is because all extracted text is joined with a whitespace character during the construction of the search index.
It's not difficult to see that it can be quite challenging for theme authors to implement a good search experience, which is why Material for MkDocs (up to now) does some [monkey patching][9] to be able to render more meaningful search previews.
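The punctuation problem can be reproduced with a minimal sketch. The `fragments` array is a hypothetical stand-in for the text nodes extracted from the example paragraph. The actual extraction happens in the search plugin during the build and is not shown here:

``` ts
// Hypothetical text fragments, roughly as they might be extracted from the
// inline elements of the example paragraph (an assumption for illustration)
const fragments = [
  "You can even add",
  "links",
  ", or even",
  "code",
  ":"
]

// Joining all fragments with a single whitespace character detaches the
// punctuation from the inline element that precedes it
const joined = fragments.join(" ")
console.log(joined)
```

The joined string contains `links ,` and `code :`, the same artifacts that are visible in the excerpt from `search_index.json`.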
### Search worker
The actual search functionality is implemented as part of a web worker[^1], which creates and manages the [lunr][6] search index. When search is initialized, the following steps are taken:
[^1]: Prior to [version 5.0][10], search was carried out in the main thread which locked up the browser, rendering it unusable. This problem was first reported in #904 and, after some back and forth, fixed and released in version 5.0.
1. __Linking sections with pages__: The search index is parsed and each section is linked to its parent page. The parent page itself is _not indexed_, as it would lead to duplicate results, so only the sections remain. Linking is necessary, as search results need to be grouped by page.
2. __Tokenization__: The `title` and `text` values of each section are split into tokens by using the [separator][11] as configured in `mkdocs.yml`. Tokenization itself is carried out by [lunr's default tokenizer][12], which doesn't allow for lookahead or separators spanning multiple characters.
> Why is this important and a big deal? We will see later how much more we can achieve with a tokenizer that is capable of separating strings with lookahead.
3. __Indexing__: As a final step, each section is indexed. When querying the index, if a search query includes one of the tokens as returned by step 2., the section is considered to be part of the search result and passed to the main thread.
Now, that's basically how the search worker operates. Sure, there's a little more magic involved, e.g. search results are [post-processed][13] and [rescored][14] to account for some shortcomings of [lunr][6], but in general this is how search results get into and out of the index.
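The three steps can be sketched without lunr as a plain inverted index. This is a simplified illustration, not the actual worker code: `linkSections`, `buildIndex` and `search` are hypothetical names, the documents are a shortened version of the example index, and stemming, scoring and post-processing are left out entirely:

``` ts
// Shape of an entry in the "docs" array of search_index.json
interface SearchDocument {
  location: string
  title: string
  text: string
}

// Separator as in the example configuration
const separator = /[\s\-]+/

// Step 1: link each section to its parent page. This sketch assumes that a
// page always precedes its sections, as in the example index.
function linkSections(docs: SearchDocument[]): Map<string, SearchDocument> {
  const pages = new Map<string, SearchDocument>()
  const parents = new Map<string, SearchDocument>()
  docs.forEach(doc => {
    const hash = doc.location.indexOf("#")
    if (hash === -1) {
      pages.set(doc.location, doc) // page: remembered, but not indexed
    } else {
      const parent = pages.get(doc.location.slice(0, hash))
      if (parent) parents.set(doc.location, parent)
    }
  })
  return parents
}

// Step 2: split a value into lowercase tokens using the separator
function tokenize(value: string): string[] {
  return value.toLowerCase().split(separator).filter(token => token.length > 0)
}

// Step 3: index sections only, mapping each token to section locations
function buildIndex(docs: SearchDocument[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>()
  docs
    .filter(doc => doc.location.includes("#"))
    .forEach(doc => {
      tokenize(`${doc.title} ${doc.text}`).forEach(token => {
        const locations = index.get(token) ?? new Set<string>()
        locations.add(doc.location)
        index.set(token, locations)
      })
    })
  return index
}

// Query: a section matches if it contains at least one of the query tokens
function search(index: Map<string, Set<string>>, query: string): string[] {
  const matches = new Set<string>()
  tokenize(query).forEach(token => {
    index.get(token)?.forEach(location => matches.add(location))
  })
  return Array.from(matches)
}

// Shortened, hypothetical version of the example index
const exampleDocs: SearchDocument[] = [
  { location: "page/", title: "Example", text: "Example Text Lists" },
  { location: "page/#text", title: "Text", text: "You can even add links" },
  { location: "page/#lists", title: "Lists", text: "Sometimes you want bullet points" }
]

const parents = linkSections(exampleDocs)
const results = search(buildIndex(exampleDocs), "bullet")
console.log(results, results.map(location => parents.get(location)?.title))
```

Because only sections are indexed, a query for `example` returns nothing in this sketch, even though the page entry contains the word. The link to the parent page is only used to group results.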
[9]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/document/index.ts#L68-L71
[10]: https://squidfunk.github.io/mkdocs-material/upgrading/#upgrading-from-4x-to-5x
[11]: ../../setup/setting-up-site-search.md#separator
[12]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
[13]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L249-L272
[14]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L274-L275
### Search previews
Users should be able to quickly scan and evaluate the relevance of a search result in the given context, which is why a concise summary with highlighted occurrences of the words found is an essential part of a great search experience.
This is where the current search preview generation falls short, as some of the search previews appear not to include any occurrence of any of the search terms. This is because search previews are [truncated after 320 characters][15], as can be seen here:
<figure markdown="1">
[![Search previews][16]][16]
<figcaption markdown="1">
The first two results look like they're not relevant, as they don't seem to include the query string the user just searched for. Yet, they are.
</figcaption>
</figure>
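A minimal sketch of why a fixed cut-off hides matches, assuming the preview is simply the first 320 characters of the section text. The `truncate` helper is hypothetical and only mirrors the limit mentioned above, not the linked implementation:

``` ts
// 320 is the limit mentioned above; the helper itself is hypothetical
const limit = 320

function truncate(value: string, length = limit): string {
  return value.length > length ? value.slice(0, length) : value
}

// A section whose only match occurs after the first 320 characters
const section = "lorem ".repeat(60) + "needle"
const shown = truncate(section)

console.log(section.includes("needle")) // true: the section is a valid result
console.log(shown.includes("needle"))   // false: the preview hides the match
```

The section is correctly returned as a result, but since the match lies beyond the cut-off, nothing in the preview can be highlighted.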
A better solution to this problem has been on the roadmap for a very, very long time, but in order to solve this once and for all, several factors need to be carefully considered:
1. __Word boundaries__: some themes[^2] for static site generators generate search previews by expanding the text left and right next to an occurrence, stopping at a whitespace character when enough words have been consumed. A preview might look like this:
```
... channels, e.g. or which can be configured via mkdocs.yml.
```
While this may work for languages that use whitespace as a separator between words, it breaks down for languages like Japanese or Chinese[^3], as they have non-whitespace word boundaries and use dedicated segmenters to split strings into tokens.
[^2]: At the time of writing, [Just the Docs][17] and [Docusaurus][18] use this method for generating search previews. Note that the latter by default uses Algolia, which is a fully managed server-based solution.
[^3]: In fact, China and Japan are both within the top 5 countries of origin of users of Material for MkDocs.
[15]: https://github.com/squidfunk/mkdocs-material/blob/master/src/assets/javascripts/templates/search/index.tsx#L90
[16]: search-better-faster-smaller/search-preview.png
[17]: https://pmarsceill.github.io/just-the-docs/
[18]: https://github.com/lelouch77/docusaurus-lunr-search
2. __Context awareness__: Although whitespace doesn't work for all languages, one could argue that it could be a good-enough solution. Unfortunately, this is not necessarily true for code blocks, as the removal of whitespace might change meaning in some languages.
3. __Structure__: Preserving structural information is not a must, but it clearly helps to build more meaningful search previews which allow for a quick evaluation of relevance. If the user expects a match within a code block, it should be rendered as a code block.
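The whitespace-based approach from the first point can be sketched as follows. This is a simplified illustration, not the code of any of the themes mentioned: `excerpt` is a hypothetical helper that keeps a fixed number of words around the first occurrence of a term:

``` ts
// Build a preview by keeping `context` words to the left and right of the
// first word that contains the term, using whitespace as the word boundary
function excerpt(text: string, term: string, context = 3): string | undefined {
  const words = text.split(/\s+/).filter(word => word.length > 0)
  const at = words.findIndex(word =>
    word.toLowerCase().includes(term.toLowerCase())
  )
  if (at === -1) return undefined

  // Expand to both sides, stopping at the start or end of the text
  const from = Math.max(0, at - context)
  const to = Math.min(words.length, at + context + 1)

  // Mark omitted text with an ellipsis on either side
  return [
    from > 0 ? "..." : "",
    words.slice(from, to).join(" "),
    to < words.length ? "..." : ""
  ].filter(part => part.length > 0).join(" ")
}

const english = "Sometimes you want numbered lists: One Two Three"
console.log(excerpt(english, "numbered", 1))

// Without whitespace between words, the whole text counts as a single word,
// so the preview cannot be shortened at all
const japanese = "検索結果をすばやく確認できます"
console.log(excerpt(japanese, "結果", 1) === japanese)
```

For the English sentence, the preview shrinks to the words around the match. For the Japanese sentence, the entire string is returned, which illustrates why a dedicated segmenter is needed for such languages.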
## What's new?
Now that we have a solid understanding of the problem space, here's a quick overview before we dive into the internals of the new search implementation to see which of the problems it already solves:
- __Better__: support for [rich search results][19], preserving the structural information of code blocks, inline code and lists, so they are rendered as-is, as well as more [accurate highlighting][20] and improved stability of typeahead. Additionally, some small [interface improvements][21].
- __Faster__: Up to 40% faster indexing and querying
- __Smaller__: Up to 50% savings in index size
[19]: #rich-search-results
[20]: #accurate-highlighting
[21]: #interface-improvements
### Rich search results
- HTML awareness
- tokenization now
- faster indexing
- smaller index size
- faster search results
## Highlighting
x the problem with highlighting
x how highlighting was implemented
- how it's implemented now
- division into blocks
- lookahead tokenization
- add "jump to improvements"
- ux improvements
- scrolling more button
##

Binary file not shown.



@ -0,0 +1,15 @@
---
template: overrides/main.html
---
# Blog
<h2>Search: better, faster, smaller</h2>
__This is the story of how we managed to completely rebuild client-side search, delivering a significantly better user experience, while making it faster and smaller at the same time.__
The search of Material for MkDocs is genuinely one of its best and most-loved assets: fast, multi-lingual, offline-capable and most importantly: _all client-side_. It provides a solution to empower the users of your documentation to find what they're searching for instantly without the headache of managing additional servers. However, even though several iterations have been made, there's still some room for improvement, which is why we rebuilt the search plugin and integration from the ground up. This article shines some light on the internals of the new search, why it's much more powerful than the previous version and what's about to come.
[Continue reading :octicons-arrow-right-24:][1]{ .md-button }
[1]: 2021/search-better-faster-smaller.md


@ -76,7 +76,7 @@ and embed them into any other file.
_Example_:
=== "docs/page.md"
=== "`docs/page.md`"
```` markdown
The HTML specification is maintained by the W3C.
@ -84,7 +84,7 @@ _Example_:
--8<-- "includes/abbreviations.md"
````
=== "includes/abbreviations.md"
=== "`includes/abbreviations.md`"
```` markdown
*[HTML]: Hyper Text Markup Language


@ -53,7 +53,7 @@ configuring syntax highlighting of code blocks:
respective stylesheet and JavaScript from a [CDN][9] serving
Highlight.js in `mkdocs.yml`:
=== "docs/javascripts/config.js"
=== "`docs/javascripts/config.js`"
``` js
document$.subscribe(() => {
@ -61,7 +61,7 @@ configuring syntax highlighting of code blocks:
})
```
=== "mkdocs.yml"
=== "`mkdocs.yml`"
``` yaml
extra_javascript:


@ -120,7 +120,7 @@ If you want to make data tables sortable, you can add [tablesort][5], which is
natively integrated with Material for MkDocs and will also work with [instant
loading][6] via [additional JavaScript][2]:
=== "docs/javascripts/tables.js"
=== "`docs/javascripts/tables.js`"
``` js
document$.subscribe(function() {
@ -131,7 +131,7 @@ loading][6] via [additional JavaScript][2]:
})
```
=== "mkdocs.yml"
=== "`mkdocs.yml`"
``` yaml
extra_javascript:


@ -70,13 +70,13 @@ storage and management.
_Example_:
=== "docs/page.md"
=== "`docs/page.md`"
```` markdown
The unit price is {{ unit.price }}
````
=== "mkdocs.yml"
=== "`mkdocs.yml`"
``` yaml
extra:
@ -109,13 +109,13 @@ In your Markdown file, include snippets with Jinja's [`include`][4] function:
_Example_:
=== "snippets/definitions.md"
=== "`snippets/definitions.md`"
``` markdown
The unit price is {{ page.meta.unit.price }}
```
=== "docs/page-1.md"
=== "`docs/page-1.md`"
``` markdown
---
@ -126,7 +126,7 @@ _Example_:
{% include "definitions.md" %}
```
=== "docs/page-2.md"
=== "`docs/page-2.md`"
``` markdown
---


@ -54,7 +54,7 @@ theme:
- content.tabs.link
# - header.autohide
# - navigation.expand
# - navigation.indexes
- navigation.indexes
# - navigation.instant
- navigation.sections
- navigation.tabs
@ -211,6 +211,11 @@ nav:
- MathJax: reference/mathjax.md
- Meta tags: reference/meta-tags.md
- Variables: reference/variables.md
- Blog:
- blog/index.md
- 2021:
- blog/2021/search-better-faster-smaller.md
- Insiders:
- Sponsorship: insiders/index.md
- Getting started: