In part one, we looked at querying 100,000 International Statistical Classification of Diseases (ICD) code records for 3-7 character codes. In part two of three, we’ll cover full-text search of the record descriptions using Postgres and Arel.
What’s a tsvector?
The main component of full-text search in Postgres is the
tsvector. This is basically a list of tokens and their positions in a text. For example:
The quick brown fox jumped over the lazy dog
'brown':3 'dog':9 'fox':4 'jumped':5 'lazy':8 'over':6 'quick':2 'the':1,7
To go beyond simple string matching, Postgres comes with a set of functions, parsers, and dictionaries which combine to allow much more powerful searching. The functions
to_tsquery tokenize a string, remove common “stop-words”, and stem each word. This converts our example sentence to:
You can see that stop-words “the” and “over” were removed, and the words “jumped” and “lazy” were converted to their stems.
We can use the match operator
@@ to query our string:
Since Arel doesn’t provide built-in methods for full-text queries, we have to create our own custom operators and functions:
For the query vector, each word in the query string must be combined using a boolean operator:
(quick & brown) | dog. In our case, I simply joined all words with
The resulting query is something like this:
If you need to combine text from multiple columns, you can use the concatenation operator
With a working query, we can test against the real data. Unfortunately, benchmarks with 1-5 search terms gave some troubling results:
|Avg. Query Time
Adding an Index
We obviously want to add an index to the query, but it turns out there are two options. GiST, or Generalized Search Tree, works by hashing each word in a text to a single bit. This leads to hash collisions and causes some false matches which need to be checked before returning a result. GIN (Generalized Inverted Index) instead works by mapping a word to all the records it is found in. This means that a GIN index does not suffer from hash collisions, which leads to better performance in most cases.
GIN is about 3x faster, but that comes at the price of slower updates, 3x more time to build, and 2-3x more space. The takeaway: use GIN for relatively static data, use GiST if you’re worried about write performance.
Since our data will be mostly static, it made sense to use a GIN index:
But this didn’t work. Instead, I received the error “functions in index expression must be marked IMMUTABLE”. After some research, I found the problem was the
to_tsvector function takes an optional
configuration parameter, which needs to be specified for the index to work. A configuration describes how a string is tokenized, which stop-words are removed, and what stemming is performed. In most cases, you’ll want to use “english”. It’s also important to remember that all other
to_tsquery calls need to include the “english” configuration in order to use the index.
The new index had a tremendous impact:
|Avg. Query Time
To understand why the query time is so much lower with more than one search term, we can look at some
What these queries are showing is that Postgres will first do an index scan to find the records that contain the given terms and then will select those records. In the case of two terms, the Postgres planner estimates the index search will return 2 rows (
rows=2), whereas the single-term query is expected to return 275. Since the heap scan to select all matching rows is so much slower than the index scan, having fewer matching records makes the query faster.
Sorting the Results
Postgres has two sorting functions for full-text queries:
ts_rank is simply based on the number of times a word occurs in a record, where
ts_rank_cd uses the “cover density”, which is how close the search terms are to each other. We implemented the order clause using another custom function:
The benchmarks for each type of sort were nearly identical, so we chose
ts_rank_cd, since it made most sense for our application. The sorted queries took about twice as long for a single search term, but were almost the same for more terms:
|Avg. Query Time
At this point, we have a fast full-text search system for the ICD descriptions, but it isn’t integrated with the other system for searching the codes.
In part three, we’ll pull everything together with search objects and talk about some improvements like partial matches, spell-checking, and synonyms.