SQL supports wildcard search on strings using
LIKE operator which accepts
_ wildcards. The problem with
LIKE is it's not very fast if you have a lot of rows and the query is non-sargable. And in some cases you need to provide fuzzy search capabilities where the results don't have to exactly match the query.
PostgreSQL has the
pg_trgm extension that solves both problems:
- It has
gistindexes for speeding up
LIKEand other string operators
- It has
%operator for string similarity search using trigrams.
Let's assume we have this table:
CREATE TABLE persons ( id int4 NOT NULL GENERATED ALWAYS AS IDENTITY, forenames varchar(100) NOT NULL, surname varchar(100) NOT NULL, forenames_normalized varchar(100) NOT NULL, surname_normalized varchar(100) NOT NULL, CONSTRAINT persons_pk PRIMARY KEY (id) );
Note: Normalized columns are lowercase versions of the normal columns and special characters are removed. You can also remove character accents. This is to make the search experience better for the user as they don't have to type in the exact case and punctuations.
If we run a
LIKE query on it:
select * from persons p where surname_normalized like '%tche%' and forenames_normalized like '%nde%'
On my laptop it takes PostgreSQL about a second to return the results:
Gather (cost=1000.00..142174.75 rows=10 width=30) (actual time=9.719..639.460 rows=75 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on persons p (cost=0.00..141173.75 rows=4 width=30) (actual time=3.425..605.240 rows=25 loops=3) Filter: (((surname_normalized)::text ~~ '%tche%'::text) AND ((forenames_normalized)::text ~~ '%nde%'::text)) Rows Removed by Filter: 3333308 Planning Time: 0.097 ms Execution Time: 639.494 ms
It seems like all of rows rows are scanned in the table. To speed things up, first we need to enable the
pg_trgm extension on the database:
create extension if not exists pg_trgm;
Then we can use the
gin index on the normalized columns:
create index if not exists idx_gin_persons_on_names on persons using gin (forenames_normalized gin_trgm_ops, surname_normalized gin_trgm_ops)
gin index and
gin_trgm_ops operator are part of
gin index took about a minute on my laptop for 10M rows.
Now let's see if the results have improved:
Bitmap Heap Scan on persons p (cost=54.20..3692.46 rows=995 width=30) (actual time=4.011..4.066 rows=75 loops=1) Recheck Cond: (((forenames_normalized)::text ~~ '%nde%'::text) AND ((surname_normalized)::text ~~ '%tche%'::text)) Heap Blocks: exact=75 -> Bitmap Index Scan on idx_gin_persons_on_names (cost=0.00..53.95 rows=995 width=0) (actual time=3.999..3.999 rows=75 loops=1) Index Cond: (((forenames_normalized)::text ~~ '%nde%'::text) AND ((surname_normalized)::text ~~ '%tche%'::text)) Planning Time: 0.092 ms Execution Time: 4.120 ms
639.494 ms for execution time, now it only takes
4.1 ms! That's because instead of sequentially scanning all of the rows in the document, it scanned the
Great, now let's take a look at how to do fuzzy search:
Let's say we are trying to find someone with forename(s) of
anderson and surname of
select id, forenames, surname, ((similarity('mitchel', surname_normalized) + similarity('andersen', forenames_normalized)) / 2) as score from persons p order by score desc limit 10
This query takes about 58 seconds to complete. The
similarity function is expensive, so we have to try not to use it as much as possible. For that, we can use the similarity operator (
%) to filter out the rows that are below a certain threshold. By default the threshold is 30% similarity (
0.3) but you can change that using
set_limit. Now let's use it:
select id, forenames, surname, ((similarity('mitchel', surname_normalized) + similarity('andersen', forenames_normalized)) / 2) as score from persons p where forenames_normalized % 'andersen' and surname_normalized % 'mitchel' order by score desc limit 10
Now it takes about
100ms on my laptop. A huge improvement over 58 seconds :)
pg_trgm uses tri-grams for indexing. It means that each string is broken into all possible 3 letter components. For example
mitchel's trigrams are:
michelle's trigrams are:
lle. They share 2 trigrams so the similarity of
michelle is 30%.
This approach is not useful for words that are less than 3 letters. As you can't form any trigrams. So this query:
select * from persons p where surname_normalized like '%he%' and forenames_normalized like '%de%'
Takes the same amount of time on both the indexed table and the non-indexed table because PostgreSQL does sequential scan for both of them:
Gather (cost=1000.00..147095.90 rows=49229 width=30) (actual time=1.169..655.329 rows=21216 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Seq Scan on persons p (cost=0.00..141173.00 rows=20512 width=30) (actual time=0.397..583.521 rows=7072 loops=3) Filter: (((surname_normalized)::text ~~ '%he%'::text) AND ((forenames_normalized)::text ~~ '%de%'::text)) Rows Removed by Filter: 3326261 Planning Time: 0.105 ms Execution Time: 655.974 ms
There can be cases where the index makes things slower. So please test it for your own use case and weight the trade-offs. Also keep in mind that inserts and updates take longer with the index.
I wrote some very simple benchmarks using BenchmarkDotNet and here is the results:
// * Summary * BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.928 (2004/?/20H1) Intel Core i7-8550U CPU 1.80GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores .NET Core SDK=5.0.201 [Host] : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT DefaultJob : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT | Method | Mean | Error | StdDev | Median | |---------------------:|-------------:|-----------:|-----------:|-----------:| | LikeOnGinIndex | 5.398 ms | 0.7167 ms | 2.113 ms | 4.170 ms | | Like | 1,035.140 ms | 55.0098 ms | 158.716 ms | 991.495 ms | | SimilarityOnGinIndex | 137.339 ms | 14.7610 ms | 43.523 ms | 114.342 ms |
Note: Please download the database dump and code on GitHub.comments powered by Disqus