Reputation: 883
I am trying to do detect duplicates in a database. I have came up with a MWE of what I am trying to do. Not every id column has a value but the id nevertheless may be in the filename. I am trying to find all rows which have an id which appears in a filename which is in a different row.
This query is kind of what I am looking for, but the problem is that doing exact matches
SELECT * FROM items WHERE id IN (
SELECT filename FROM items
);
The IN operator is shorthand for:
name IN ("Bob Walters", "Alice Reed")
==> name == "Bob Walters" OR name = "Alice Reed"
However I want an operator that does this:
_________
==> "%" || name || "%" LIKE "Bob Walters" OR "%" || name || "%" LIKE "Alice Reed"
I am looking for what I can put in the blank space to achieve this operation. My table has just over 10,000 rows (so the solution doesn't necessarily have to be optimized for millions of rows).
Upvotes: 1
Views: 1309
Reputation: 883
Using the WHERE EXISTS
operator in conjunction with the LIKE
operator (~50secs on a 10,000 item database)
SELECT * FROM items AS i1 WHERE EXISTS (
SELECT * FROM items AS i2 WHERE i1.id != i2.id AND i2.filename LIKE '%' || i1.id || '%'
);
Using the WHERE EXISTS
operator in conjunction with the instr
function (~50secs on a 10,000 item database)
SELECT * FROM items AS i1 WHERE EXISTS (
SELECT * FROM items AS i2 WHERE i1.id != i2.id AND instr(i2.filename, i1.id) != 0
);
Using the WHERE EXISTS
operator in conjunction with the LIKE
operator along with looking only at rows with a null id (~30secs on a 10,000 item database)
WHERE EXISTS
operator in conjunction with the instr
function along with looking only at rows with a null id (~30secs on a 10,000 item database)<--
solution I usedWHERE EXISTS
operator in conjunction with the REGEXP
operatorWHERE EXISTS
operator in conjunction with the MATCH
operator (or another FTS4 operator)statement_id = 'SELECT * FROM items WHERE id IS NOT NULL and id != ""'
cursor.execute(statement_id)
ids = cursor.fetchall()
statement_title = 'SELECT * FROM items WHERE title IS NOT NULL AND title != ""'
cursor.execute(statement_title)
titles = cursor.fetchall()
matches = []
for id in ids:
for title in titles:
if id['id'] in title['title']:
matches.append([id, title])
In conclusion, the solution is ugly but I think given the circumstances (it's a personal project, I won't be doing this data deduplication operation all that often, etc). I think the full text extensions or restructuring the database might be nicer solutions.
Upvotes: 0
Reputation: 521854
If you are using SQLite 3+, you may try using the REGEXP
operator:
SELECT *
FROM items i1
WHERE EXISTS (SELECT 1 FROM items i2
WHERE i2.filename REGEXP '\b' || i1.id || '\b' AND
i1.id <> i2.id);
If your version of SQLite does not support REGEXP
, then you can use LIKE
in its place:
SELECT *
FROM items i1
WHERE EXISTS (SELECT 1 FROM items i2
WHERE i2.filename LIKE '%' || i1.id || '%' AND
i1.id <> i2.id);
I highlighted can above, because the problem with LIKE
and wildcards is that it would not only match exact matches, but also substrings, e.g. if id=34983
appeared as a substring of another id in a filename in a different record, there would be a false positive.
Upvotes: 1
Reputation: 52499
You want EXISTS, not IN. Try:
SELECT *
FROM items AS i
WHERE EXISTS (SELECT *
FROM items AS i2
WHERE i.filename LIKE '%' || i2.id || '%' AND i.filename <> i2.filename)
Upvotes: 2