SQL supports wildcard search on strings using LIKE
operator which accepts %
and _
wildcards. The problem with LIKE
is it's not very fast if you have a lot of rows and the query is non-sargable. And in some cases you need to provide fuzzy search capabilities where the results don't have to exactly match the query.
PostgreSQL has the pg_trgm
extension that solves both problems:
- It has
gin
andgist
indexes for speeding upLIKE
and other string operators - It has
similarity
function and%
operator for string similarity search using trigrams.
Let's assume we have this table:
CREATE TABLE persons (
id int4 NOT NULL GENERATED ALWAYS AS IDENTITY,
forenames varchar(100) NOT NULL,
surname varchar(100) NOT NULL,
forenames_normalized varchar(100) NOT NULL,
surname_normalized varchar(100) NOT NULL,
CONSTRAINT persons_pk PRIMARY KEY (id)
);
Note: Normalized columns are lowercase versions of the normal columns and special characters are removed. You can also remove character accents. This is to make the search experience better for the user as they don't have to type in the exact case and punctuations.
I inserted 10M rows of fake data generated by Bogus into the table. You can download the dump here.
If we run a LIKE
query on it:
select * from persons p
where surname_normalized like '%tche%' and forenames_normalized like '%nde%'
On my laptop it takes PostgreSQL about a second to return the results:
Gather (cost=1000.00..142174.75 rows=10 width=30) (actual time=9.719..639.460 rows=75 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on persons p (cost=0.00..141173.75 rows=4 width=30) (actual time=3.425..605.240 rows=25 loops=3)
Filter: (((surname_normalized)::text ~~ '%tche%'::text) AND ((forenames_normalized)::text ~~ '%nde%'::text))
Rows Removed by Filter: 3333308
Planning Time: 0.097 ms
Execution Time: 639.494 ms
It seems like all of rows rows are scanned in the table. To speed things up, first we need to enable the pg_trgm
extension on the database:
create extension if not exists pg_trgm;
Then we can use the gin
index on the normalized columns:
create index if not exists idx_gin_persons_on_names on persons using gin (forenames_normalized gin_trgm_ops, surname_normalized gin_trgm_ops)
Note: gin
index and gin_trgm_ops
operator are part of pg_trgm
.
Adding the gin
index took about a minute on my laptop for 10M rows.
Now let's see if the results have improved:
Bitmap Heap Scan on persons p (cost=54.20..3692.46 rows=995 width=30) (actual time=4.011..4.066 rows=75 loops=1)
Recheck Cond: (((forenames_normalized)::text ~~ '%nde%'::text) AND ((surname_normalized)::text ~~ '%tche%'::text))
Heap Blocks: exact=75
-> Bitmap Index Scan on idx_gin_persons_on_names (cost=0.00..53.95 rows=995 width=0) (actual time=3.999..3.999 rows=75 loops=1)
Index Cond: (((forenames_normalized)::text ~~ '%nde%'::text) AND ((surname_normalized)::text ~~ '%tche%'::text))
Planning Time: 0.092 ms
Execution Time: 4.120 ms
Instead of 639.494 ms
for execution time, now it only takes 4.1 ms
! That's because instead of sequentially scanning all of the rows in the document, it scanned the gin
index.
Great, now let's take a look at how to do fuzzy search:
Let's say we are trying to find someone with forename(s) of anderson
and surname of mitchell
:
select id, forenames, surname, ((similarity('mitchel', surname_normalized) + similarity('andersen', forenames_normalized)) / 2) as score from persons p
order by score desc
limit 10
This query takes about 58 seconds to complete. The similarity
function is expensive, so we have to try not to use it as much as possible. For that, we can use the similarity operator (%
) to filter out the rows that are below a certain threshold. By default the threshold is 30% similarity (0.3
) but you can change that using set_limit
. Now let's use it:
select id, forenames, surname, ((similarity('mitchel', surname_normalized) + similarity('andersen', forenames_normalized)) / 2) as score from persons p
where forenames_normalized % 'andersen' and surname_normalized % 'mitchel'
order by score desc
limit 10
Now it takes about 100ms
on my laptop. A huge improvement over 58 seconds :)
Edge Cases
pg_trgm
uses tri-grams for indexing. It means that each string is broken into all possible 3 letter components. For example mitchel
's trigrams are: mit
,itc
,tch
,che
,hel
and michelle
's trigrams are: mic
,ich
,che
,hel
,ell
,lle
. They share 2 trigrams so the similarity of mitchel
with michelle
is 30%.
This approach is not useful for words that are less than 3 letters. As you can't form any trigrams. So this query:
select * from persons p
where surname_normalized like '%he%' and forenames_normalized like '%de%'
Takes the same amount of time on both the indexed table and the non-indexed table because PostgreSQL does sequential scan for both of them:
Gather (cost=1000.00..147095.90 rows=49229 width=30) (actual time=1.169..655.329 rows=21216 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on persons p (cost=0.00..141173.00 rows=20512 width=30) (actual time=0.397..583.521 rows=7072 loops=3)
Filter: (((surname_normalized)::text ~~ '%he%'::text) AND ((forenames_normalized)::text ~~ '%de%'::text))
Rows Removed by Filter: 3326261
Planning Time: 0.105 ms
Execution Time: 655.974 ms
There can be cases where the index makes things slower. So please test it for your own use case and weight the trade-offs. Also keep in mind that inserts and updates take longer with the index.
Benchmarks
I wrote some very simple benchmarks using BenchmarkDotNet and here is the results:
// * Summary *
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.928 (2004/?/20H1)
Intel Core i7-8550U CPU 1.80GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.201
[Host] : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT
DefaultJob : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT
| Method | Mean | Error | StdDev | Median |
|---------------------:|-------------:|-----------:|-----------:|-----------:|
| LikeOnGinIndex | 5.398 ms | 0.7167 ms | 2.113 ms | 4.170 ms |
| Like | 1,035.140 ms | 55.0098 ms | 158.716 ms | 991.495 ms |
| SimilarityOnGinIndex | 137.339 ms | 14.7610 ms | 43.523 ms | 114.342 ms |
Note: Please download the database dump and code on GitHub.
comments powered by Disqus