unbrokendub
unbrokendub

Reputation: 23

Any way to find and delete almost similar records with SQL?

I have a table in Postgres DB, that has a lot of almost identical rows. For example:

1. 00Zicky_-_San_Pedro_Danilo_Vigorito_Remix
2. 00Zicky_-_San_Pedro__Danilo_Vigorito_Remix__
3. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx__
4. 0101_-_Try_To_Say__Strictlyjaz_Unit_Future_Rmx_
5. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__
6. 01_-_Digital_Excitation_-_Brothers_Gonna_Work_it_Out__Piano_Mix__

I think about to writing a little golang script to remove duplicates, but maybe SQL can do it?

Table definition:

\d+ songs
                                                                       Table "public.songs"
    Column     |            Type             | Collation | Nullable |                Default                 | Storage  | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+----------------------------------------+----------+-------------+--------------+-------------
 song_id       | integer                     |           | not null | nextval('songs_song_id_seq'::regclass) | plain    |             |              |
 song_name     | character varying(250)      |           | not null |                                        | extended |             |              |
 fingerprinted | smallint                    |           |          | 0                                      | plain    |             |              |
 file_sha1     | bytea                       |           |          |                                        | extended |             |              |
 total_hashes  | integer                     |           | not null | 0                                      | plain    |             |              |
 date_created  | timestamp without time zone |           | not null | now()                                  | plain    |             |              |
 date_modified | timestamp without time zone |           | not null | now()                                  | plain    |             |              |
Indexes:
    "pk_songs_song_id" PRIMARY KEY, btree (song_id)
Referenced by:
    TABLE "fingerprints" CONSTRAINT "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap

Tried several methods to find duplicates, but that methods search only for exact similarity.

Upvotes: 0

Views: 78

Answers (1)

Belayer
Belayer

Reputation: 14934

There is no operator which is essentially A almost = B. (Well there is full text search, but that seems to be a little excessive here.) If the only difference is the number of - and _ then just get rid of them and compare the the resulting difference. If they are equal, then one is a duplicate. You can use the replace() function to remove them. So something like: (see demo)

delete  
  from songs s2 
 where exists ( select null 
                  from songs s1 
                 where s1.song_id < s2.song_id
                   and replace(replace(s1.name, '_',''),'-','') = 
                       replace(replace(s2.name, '_',''),'-','')
              ); 

If your table is large this will not be fast, but a functional index may help:

create index song_name_idx on songs
        (replace(replace(name, '_',''),'-',''));

Upvotes: 1

Related Questions