Reputation: 41523
Hay, I created a spider to crawl through a PDF document and log every word in the document into a table in a MySQL database.
Obviously words like 'the', 'and', 'or' etc appear in a book many, many times.
I'm just wondering what's the quickest method to remove dupe values from a table?
Upvotes: 0
Views: 323
Reputation: 2093
delete from words where idcolumn not in
(select min(idcolumn)
from words T2
where T2.plain = WordsTable.plain)
This works if you added (idcolumn, plain)
for every word you found.
If you do not have an id column (pk) then you can use Anax's solution.
In addition to not inserting duplicates (codeburger comment), you can just set a unique index on your plain column.
Upvotes: 0
Reputation: 398
If you can rerun the script to populate the database, you could add a unique key on the "word" field and instead of INSERT INTO do a REPLACE INTO. This will delete the previous instance of the record before adding a duplicate field. This may not be the most efficient way to do it, but it's rather simple. See here for more details:
http://dev.mysql.com/doc/refman/5.0/en/replace.html
Upvotes: 1
Reputation: 9382
Create a table without indexing the words and put in all the words from the book using mass inserts (you could also use LOAD DATA). When you're done with insertions, add a new Index on the word
field
Then create a second table using:
CREATE TABLE newTable SELECT DISTINCT word FROM oldTable
Upvotes: 3
Reputation: 12323
Instead of removing duplicates, you could make sure that no duplicates ever make it into the table.
Presuming your table has only 2 fields, id and word:
INSERT INTO table SELECT null, 'word' FROM table WHERE NOT EXISTS (SELECT * FROM table WHERE word = 'word') LIMIT 1;
This will insert the word into the table only if it's not already in there
Upvotes: 1
Reputation: 6840
select distinct on word field, and then delete all rows that have a different id? I'm not a master in subqueries so no example atm :)
Upvotes: 0