Roman
Roman

Reputation: 883

SQLite3 How to combine the IN operator with the LIKE operator to get partial text matching to a subquery?

I am trying to do detect duplicates in a database. I have came up with a MWE of what I am trying to do. Not every id column has a value but the id nevertheless may be in the filename. I am trying to find all rows which have an id which appears in a filename which is in a different row.

enter image description here

This query is kind of what I am looking for, but the problem is that doing exact matches

SELECT * FROM items WHERE id IN (
    SELECT filename FROM items
);

The IN operator is shorthand for:

name IN ("Bob Walters", "Alice Reed") ==> name == "Bob Walters" OR name = "Alice Reed"

However I want an operator that does this:

_________ ==> "%" || name || "%" LIKE "Bob Walters" OR "%" || name || "%" LIKE "Alice Reed"

I am looking for what I can put in the blank space to achieve this operation. My table has just over 10,000 rows (so the solution doesn't necessarily have to be optimized for millions of rows).

Upvotes: 1

Views: 1309

Answers (3)

Roman
Roman

Reputation: 883

Things I've tried

  1. Using the WHERE EXISTS operator in conjunction with the LIKE operator (~50secs on a 10,000 item database)

    SELECT * FROM items AS i1 WHERE EXISTS (
        SELECT * FROM items AS i2 WHERE i1.id != i2.id AND i2.filename LIKE '%' || i1.id || '%'
    );
    
  2. Using the WHERE EXISTS operator in conjunction with the instr function (~50secs on a 10,000 item database)

    SELECT * FROM items AS i1 WHERE EXISTS (
        SELECT * FROM items AS i2 WHERE i1.id != i2.id AND instr(i2.filename, i1.id) != 0
    );
    
  3. Using the WHERE EXISTS operator in conjunction with the LIKE operator along with looking only at rows with a null id (~30secs on a 10,000 item database)

  4. Using the WHERE EXISTS operator in conjunction with the instr function along with looking only at rows with a null id (~30secs on a 10,000 item database)
  5. In Python, get a list of ids, then for each ID call the database looking for matches (~17secs on a 10,000 item database)
  6. In Python, get a list of ids and filenames from the database, then do the search in Python (~10secs on a 10,000 item database) <-- solution I used

Other Possible solutions

  1. Using the WHERE EXISTS operator in conjunction with the REGEXP operator
  2. Enabling the FTS4 full text search extension and creating a virtual table, so using the WHERE EXISTS operator in conjunction with the MATCH operator (or another FTS4 operator)
  3. Using SQLAlchemy in Python
  4. Restructuring the database (for example pulling out ID look alikes from filenames into a separate id column, so we can do exact column matches instead of wildcard searches)

Code for my Solution

statement_id = 'SELECT * FROM items WHERE id IS NOT NULL and id != ""'
cursor.execute(statement_id)
ids = cursor.fetchall()

statement_title = 'SELECT * FROM items WHERE title IS NOT NULL AND title != ""'
cursor.execute(statement_title)
titles = cursor.fetchall()

matches = []
for id in ids:
    for title in titles:
        if id['id'] in title['title']:
            matches.append([id, title])

Conclusion

In conclusion, the solution is ugly but I think given the circumstances (it's a personal project, I won't be doing this data deduplication operation all that often, etc). I think the full text extensions or restructuring the database might be nicer solutions.

Upvotes: 0

Tim Biegeleisen
Tim Biegeleisen

Reputation: 521854

If you are using SQLite 3+, you may try using the REGEXP operator:

SELECT *
FROM items i1
WHERE EXISTS (SELECT 1 FROM items i2
              WHERE i2.filename REGEXP '\b' || i1.id || '\b' AND
              i1.id <> i2.id);

If your version of SQLite does not support REGEXP, then you can use LIKE in its place:

SELECT *
FROM items i1
WHERE EXISTS (SELECT 1 FROM items i2
              WHERE i2.filename LIKE '%' || i1.id || '%' AND
              i1.id <> i2.id);

I highlighted can above, because the problem with LIKE and wildcards is that it would not only match exact matches, but also substrings, e.g. if id=34983 appeared as a substring of another id in a filename in a different record, there would be a false positive.

Upvotes: 1

Shawn
Shawn

Reputation: 52499

You want EXISTS, not IN. Try:

SELECT *
FROM items AS i
WHERE EXISTS (SELECT *
              FROM items AS i2
              WHERE i.filename LIKE '%' || i2.id || '%' AND i.filename <> i2.filename)

Upvotes: 2

Related Questions