For a recent project, we built a live-search for over 60,000 records which used both simple pattern matching and full-text search. The records we needed to search were diagnoses that are used to assign patients to a therapy. I had never done full-text search or anything real-time with that many records, so I ended up doing a lot of experimentation. These posts will cover my experience, and I hope they’ll be of value to anyone implementing their own Postgres search.
The records we were searching were International Statistical Classification of Diseases (ICD) codes. Each record consists of a 3-7 character code, a short description averaging 11 words, and a few other data and association fields (for this post, I generated 100,000 records matching the real ICD format). We needed to be able to search by code and description, and users would be changing their search query quickly to find the right record, so it needed to be responsive. In part one, I’ll cover the code search where a user enters one or more codes (which may be partial).
Given the need for partial matches, I started off with the built-in Arel #matches
method:
This generates the query:
I ran benchmarks on this using queries of different lengths and got an average of 39.5ms, which is not ideal.
You can see from the query that we’re using the Postgres ILIKE
operator, which is case-insensitive. All the codes were capitalized in the database, so I wondered if using a plain LIKE
query would be any faster. Since I couldn’t find a built in Arel method for case-sensitive match, I had to use a custom operator:
As I suspected, this did speed up the queries, giving an average of 10.7ms. This was much better, but I hadn’t even added an index!
I added a normal index to the codes
column, and then… no change. I was still getting average query times of about 11ms. So why was there no change? I ran an EXPLAIN
query (ICDCode.where(···).explain
) and saw this:
We can see that it’s running a sequential scan of the table rather than using the index. The PostgreSQL manual suggests that the standard B-tree index should work with pattern matching operators as long as the pattern is anchored at the beginning of the string, which mine was. After some searching, I found that the Postgres indices are affected by the locale settings. By default, Postgres indices use the generic “C” locale, whereas my machine was using the more modern en_US.UTF-8
. Since the locales didn’t match, I couldn’t use the index.
Fortunately, you can specify an operator class for the index, which ignores the locale rules and just compares strings character by character.
After adding the operator class to the index, another EXPLAIN
query showed that it was being used:
We can see that the condition is looking for strings >=
A123 and <
A124 rather than scanning all values, and our benchmark reflects the change: average query times were down to 1.3ms!
When looking at EXPLAIN
results, keep in mind the Postgres query planner may use different approaches depending on the specifics of the query. For example, using a shorter query string resulted in a different plan:
You can read more about the bitmap heap scan here.
The next issue was handling queries containing multiple codes. A common technique for combining an arbitrary number of Arel conditions is using #reduce
with the #and
and #or
methods:
This combines all the LIKE
conditions and can be used in our #where
call:
This generates the query:
I ran some benchmarks using different numbers of search terms:
Term Count | Avg. Query Time |
---|---|
1 | 1.4ms |
2 | 2.6ms |
3 | 3.7ms |
4 | 4.8ms |
5 | 5.9ms |
Looking good!
These query times were plenty fast, but since we were only using a few of the columns from the table, I wondered if only selecting those would improve performance. My query then changed to:
and resulted in a 19-25% boost:
Term Count | SELECT * | SELECT id, code, description | Improvement |
---|---|---|---|
1 | 1.4ms | 1.1ms | 19% |
2 | 2.6ms | 2.0ms | 24% |
3 | 3.7ms | 2.8ms | 24% |
4 | 4.8ms | 3.6ms | 25% |
5 | 5.9ms | 4.4ms | 25% |
After these improvements, we had query responses under 10ms and a fluid user experience. In part two, I’ll cover full-text search of the ICD code descriptions and many more benchmarks!