Muhammad Azeez - String similarity search and fast LIKE operator using pg

SQL supports wildcard search on strings using LIKE operator which accepts % and _ wildcards. The problem with LIKE is it's not very fast if you have a lot of rows and the query is non-sargable. And in some cases you need to provide fuzzy search capabilities where the results don't have to exactly match the query.

PostgreSQL has the pg_trgm extension that solves both problems:

It has gin and gist indexes for speeding up LIKE and other string operators
It has similarity function and % operator for string similarity search using trigrams.

Let's assume we have this table:

CREATE TABLE persons (
	id int4 NOT NULL GENERATED ALWAYS AS IDENTITY,
	forenames varchar(100) NOT NULL,
	surname varchar(100) NOT NULL,
	forenames_normalized varchar(100) NOT NULL,
	surname_normalized varchar(100) NOT NULL,
	CONSTRAINT persons_pk PRIMARY KEY (id)
);

Note: Normalized columns are lowercase versions of the normal columns and special characters are removed. You can also remove character accents. This is to make the search experience better for the user as they don't have to type in the exact case and punctuations.

I inserted 10M rows of fake data generated by Bogus into the table. You can download the dump here.

If we run a LIKE query on it:

select * from persons p 
where surname_normalized like '%tche%' and forenames_normalized like '%nde%'

On my laptop it takes PostgreSQL about a second to return the results:

Gather  (cost=1000.00..142174.75 rows=10 width=30) (actual time=9.719..639.460 rows=75 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on persons p  (cost=0.00..141173.75 rows=4 width=30) (actual time=3.425..605.240 rows=25 loops=3)
        Filter: (((surname_normalized)::text ~~ '%tche%'::text) AND ((forenames_normalized)::text ~~ '%nde%'::text))
        Rows Removed by Filter: 3333308
Planning Time: 0.097 ms
Execution Time: 639.494 ms

It seems like all of rows rows are scanned in the table. To speed things up, first we need to enable the pg_trgm extension on the database:

create extension if not exists pg_trgm;

Then we can use the gin index on the normalized columns:

create index if not exists idx_gin_persons_on_names on persons using gin (forenames_normalized gin_trgm_ops, surname_normalized gin_trgm_ops)

Note: gin index and gin_trgm_ops operator are part of pg_trgm.

Adding the gin index took about a minute on my laptop for 10M rows.

Now let's see if the results have improved:

Bitmap Heap Scan on persons p  (cost=54.20..3692.46 rows=995 width=30) (actual time=4.011..4.066 rows=75 loops=1)
  Recheck Cond: (((forenames_normalized)::text ~~ '%nde%'::text) AND ((surname_normalized)::text ~~ '%tche%'::text))
  Heap Blocks: exact=75
  ->  Bitmap Index Scan on idx_gin_persons_on_names  (cost=0.00..53.95 rows=995 width=0) (actual time=3.999..3.999 rows=75 loops=1)
        Index Cond: (((forenames_normalized)::text ~~ '%nde%'::text) AND ((surname_normalized)::text ~~ '%tche%'::text))
Planning Time: 0.092 ms
Execution Time: 4.120 ms

Instead of 639.494 ms for execution time, now it only takes 4.1 ms! That's because instead of sequentially scanning all of the rows in the document, it scanned the gin index.

Great, now let's take a look at how to do fuzzy search:

Let's say we are trying to find someone with forename(s) of anderson and surname of mitchell:

select id, forenames, surname, ((similarity('mitchel', surname_normalized) + similarity('andersen', forenames_normalized)) / 2) as score from persons p
order by score desc
limit 10

This query takes about 58 seconds to complete. The similarity function is expensive, so we have to try not to use it as much as possible. For that, we can use the similarity operator (%) to filter out the rows that are below a certain threshold. By default the threshold is 30% similarity (0.3) but you can change that using set_limit. Now let's use it:

select id, forenames, surname, ((similarity('mitchel', surname_normalized) + similarity('andersen', forenames_normalized)) / 2) as score from persons p
where forenames_normalized % 'andersen' and surname_normalized % 'mitchel'
order by score desc
limit 10

Now it takes about 100ms on my laptop. A huge improvement over 58 seconds :)

Edge Cases

pg_trgm uses tri-grams for indexing. It means that each string is broken into all possible 3 letter components. For example mitchel's trigrams are: mit,itc,tch,che,hel and michelle's trigrams are: mic,ich,che,hel,ell,lle. They share 2 trigrams so the similarity of mitchel with michelle is 30%.

This approach is not useful for words that are less than 3 letters. As you can't form any trigrams. So this query:

select * from persons p 
where surname_normalized like '%he%' and forenames_normalized like '%de%'

Takes the same amount of time on both the indexed table and the non-indexed table because PostgreSQL does sequential scan for both of them:

Gather  (cost=1000.00..147095.90 rows=49229 width=30) (actual time=1.169..655.329 rows=21216 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on persons p  (cost=0.00..141173.00 rows=20512 width=30) (actual time=0.397..583.521 rows=7072 loops=3)
        Filter: (((surname_normalized)::text ~~ '%he%'::text) AND ((forenames_normalized)::text ~~ '%de%'::text))
        Rows Removed by Filter: 3326261
Planning Time: 0.105 ms
Execution Time: 655.974 ms

There can be cases where the index makes things slower. So please test it for your own use case and weight the trade-offs. Also keep in mind that inserts and updates take longer with the index.

Benchmarks

I wrote some very simple benchmarks using BenchmarkDotNet and here is the results:

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.928 (2004/?/20H1)
Intel Core i7-8550U CPU 1.80GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.201
  [Host]     : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT
  DefaultJob : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT


|            Method    |         Mean |      Error |     StdDev |     Median |
|---------------------:|-------------:|-----------:|-----------:|-----------:|
|    LikeOnGinIndex    |     5.398 ms |  0.7167 ms |   2.113 ms |   4.170 ms |
|              Like    | 1,035.140 ms | 55.0098 ms | 158.716 ms | 991.495 ms |
| SimilarityOnGinIndex |   137.339 ms | 14.7610 ms |  43.523 ms | 114.342 ms |

Note: Please download the database dump and code on GitHub.