Adventures in Searching with Postgres: Part 2

Matt Wean

In part one, we looked at querying 100,000 International Statistical Classification of Diseases (ICD) code records for 3-7 character codes. In part two of three, we’ll cover full-text search of the record descriptions using Postgres and Arel.

What’s a tsvector?

The main component of full-text search in Postgres is the tsvector. This is basically a list of tokens and their positions in a text. For example:

The quick brown fox jumped over the lazy dog

becomes

'brown':3 'dog':9 'fox':4 'jumped':5 'lazy':8 'over':6 'quick':2 'the':1,7
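That output is what you get from to_tsvector with the simple configuration, which lowercases each word and records its positions but doesn't remove stop-words or stem anything:

```sql
SELECT to_tsvector('simple', 'The quick brown fox jumped over the lazy dog');
-- 'brown':3 'dog':9 'fox':4 'jumped':5 'lazy':8 'over':6 'quick':2 'the':1,7
```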

To go beyond simple string matching, Postgres comes with a set of functions, parsers, and dictionaries which combine to allow much more powerful searching. The functions to_tsvector and to_tsquery tokenize a string, remove common “stop-words”, and stem each word. This converts our example sentence to:
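```sql
-- With the built-in english configuration (stop-word removal plus stemming)
SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog');
-- 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
```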

You can see that stop-words “the” and “over” were removed, and the words “jumped” and “lazy” were converted to their stems.

We can use the match operator @@ to query our string:
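```sql
-- Illustrative queries: 'jumping' stems to 'jump', which is in the vector...
SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog')
       @@ to_tsquery('english', 'jumping & quick');
-- t

-- ...while 'cat' is not
SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog')
       @@ to_tsquery('english', 'cat');
-- f
```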

Since Arel doesn’t provide built-in methods for full-text queries, we have to create our own custom operators and functions:
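```ruby
# A sketch of the custom Arel pieces; the model, table, and column names here
# are hypothetical. Arel::Nodes::NamedFunction wraps a SQL function call and
# Arel::Nodes::InfixOperation builds the @@ match expression.
class IcdCode < ActiveRecord::Base
  def self.search_description(query_string)
    where(Arel::Nodes::InfixOperation.new('@@', ts_vector, ts_query(query_string)))
  end

  def self.ts_vector
    Arel::Nodes::NamedFunction.new('to_tsvector', [arel_table[:description]])
  end

  def self.ts_query(query_string)
    # Join every word in the search string with the boolean AND operator
    terms = query_string.split(/\s+/).join(' & ')
    Arel::Nodes::NamedFunction.new('to_tsquery', [Arel::Nodes.build_quoted(terms)])
  end
end
```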

For the tsquery, each word in the query string must be combined using a boolean operator: (quick & brown) | dog. In our case, I simply joined all of the words with &.

The resulting query is something like this:
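```sql
-- For IcdCode.search_description("quick brown dog"), roughly:
SELECT "icd_codes".*
FROM "icd_codes"
WHERE to_tsvector("icd_codes"."description")
      @@ to_tsquery('quick & brown & dog')
```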

If you need to combine text from multiple columns, you can use the concatenation operator ||:
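```sql
-- Illustrative column names; in practice you may want coalesce(col, '') so a
-- NULL column doesn't blank out the concatenated string
SELECT "icd_codes".*
FROM "icd_codes"
WHERE to_tsvector("short_description" || ' ' || "long_description")
      @@ to_tsquery('quick & brown')
```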

With a working query, we can test against the real data. Unfortunately, benchmarks with 1-5 search terms gave some troubling results:

Search Terms | Avg. Query Time
-------------|----------------
1            | 1,500 ms
2            | 1,700 ms
3            | 1,800 ms
4            | 1,900 ms
5            | 2,000 ms

Adding an Index

We obviously want to add an index for this query, but it turns out there are two options. GiST, or Generalized Search Tree, works by hashing each word in a text to a single bit in a fixed-length signature. This leads to hash collisions and causes some false matches, which have to be re-checked against the actual rows before returning a result. GIN (Generalized Inverted Index) instead works by mapping each word to all the records it is found in. This means that a GIN index does not suffer from hash collisions, which leads to better performance in most cases.

GIN is about 3x faster, but that comes at the price of slower updates, 3x more time to build, and 2-3x more space. The takeaway: use GIN for relatively static data, use GiST if you’re worried about write performance.

Since our data will be mostly static, it made sense to use a GIN index:
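```sql
-- First attempt, using illustrative index, table, and column names
CREATE INDEX index_icd_codes_on_description
  ON icd_codes
  USING gin(to_tsvector(description));
```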

But this didn’t work. Instead, I received the error “functions in index expression must be marked IMMUTABLE”. After some research, I found that the problem was the optional configuration parameter of to_tsvector: the single-argument form falls back to the default_text_search_config setting, so Postgres can’t treat it as immutable, and the configuration has to be given explicitly for the index to work. A configuration describes how a string is tokenized, which stop-words are removed, and what stemming is performed. In most cases, you’ll want to use “english”. It’s also important to remember that all other to_tsvector and to_tsquery calls need to include the “english” configuration in order to use the index.
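With the configuration spelled out, the index builds successfully. It looks something like this (names again illustrative):

```sql
CREATE INDEX index_icd_codes_on_description
  ON icd_codes
  USING gin(to_tsvector('english', description));
```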

The new index had a tremendous impact:

Search Terms | Avg. Query Time
-------------|----------------
1            | 8 ms
2            | 0.44 ms
3            | 0.47 ms
4            | 0.6 ms
5            | 0.78 ms

To understand why the query time is so much lower with more than one search term, we can look at some EXPLAIN queries:
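```sql
-- Illustrative table, column, and search terms
EXPLAIN SELECT * FROM icd_codes
WHERE to_tsvector('english', description) @@ to_tsquery('english', 'fracture');

EXPLAIN SELECT * FROM icd_codes
WHERE to_tsvector('english', description) @@ to_tsquery('english', 'fracture & femur');
```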

What these query plans show is that Postgres first does an index scan to find the records that contain the given terms and then fetches those rows. In the case of two terms, the Postgres planner estimates the index scan will return 2 rows (rows=2), whereas the single-term query is expected to return 275. Since the heap scan that fetches all of the matching rows is so much slower than the index scan, having fewer matching records makes the query faster.

Sorting the Results

Postgres has two ranking functions for full-text queries: ts_rank and ts_rank_cd. ts_rank is based simply on the number of times a word occurs in a record, whereas ts_rank_cd uses the “cover density”, which also takes into account how close the search terms are to each other. We implemented the order clause using another custom function:
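```ruby
# A sketch building on the earlier class (names still hypothetical): order the
# results by ts_rank_cd of the document vector against the query, highest first.
def self.ranked_by(query_string)
  ts_vector = Arel::Nodes::NamedFunction.new(
    'to_tsvector',
    [Arel::Nodes.build_quoted('english'), arel_table[:description]]
  )
  ts_query = Arel::Nodes::NamedFunction.new(
    'to_tsquery',
    [Arel::Nodes.build_quoted('english'),
     Arel::Nodes.build_quoted(query_string.split(/\s+/).join(' & '))]
  )
  rank = Arel::Nodes::NamedFunction.new('ts_rank_cd', [ts_vector, ts_query])
  order(Arel::Nodes::Descending.new(rank))
end
```

Chained onto the search scope, this appends an ORDER BY ts_rank_cd(...) DESC clause to the generated SQL.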

The benchmarks for the two ranking functions were nearly identical, so we chose ts_rank_cd, since it made the most sense for our application. The sorted queries took about twice as long for a single search term, but were almost the same for more terms:

Search Terms | Avg. Query Time
-------------|----------------
1            | 16 ms
2            | 0.57 ms
3            | 0.54 ms
4            | 0.67 ms
5            | 0.85 ms

Result

At this point, we have a fast full-text search for the ICD descriptions, but it isn’t yet integrated with the code search we built in part one.

In part three, we’ll pull everything together with search objects and talk about some improvements like partial matches, spell-checking, and synonyms.