Reputation: 4907
I have a MySQL database table with a couple thousand rows. The table is setup like so:
id | text
The id
column is an auto-incrementing integer, and the text
column is a 200-character varchar.
Say I have the following rows:
3 | I think I'll have duck tonight
4 | Maybe the chicken will be alright
5 | I have a pet duck now, awesome!
6 | I love duck
Then the list I'm wanting to generate might be something like:
Plus, I'll probably want to maintain a list of substrings to ignore from the list, like 'I', 'will' and 'have. It's important to note that I do not know what people will post.
I do not have a list of words that I want to monitor, I just want to find the most common substrings. I'll then filter out any erroneous substrings that are not interesting from the list manually by editing the query.
Can anyone suggest the best way to do this? Thanks everyone!
Upvotes: 0
Views: 563
Reputation: 1133
MySQL already does this for you.
First make sure your table is a MyISAM table
Define a FULLTEXT index on your column
On a shell command line navigate to the folder where your MySQL data is stored, then type:
myisam_ftdump -c yourtablename 1 >wordfreq.dump
You can then process wordfreq.dump to eliminate the unwanted column and sort by frequency decending.
You could do all the above with a single command line and some sed/awk wizardry no doubt. And you could incorporate it into your program without needing a dump file.
More info on myisam_ftdump here: http://dev.mysql.com/doc/refman/5.0/en/myisam-ftdump.html
Oh... one more thing, the stopwords for MySQL are precompiled into the engine. And words with 3 or less characters are not indexed. The full list is here:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html
If this list isn't adequate for your needs, or you need words with less than 3 characters to count, the only way is to recompile MySQL with different rules for FULLTEXT. I don't recommend that!
Upvotes: 2
Reputation: 23141
Extract to flat file and then use your favorite quick language, perl, python, ruby, etc to process the flat file.
If you don't have one these languages as part of your skillset, this is a perfect little task to start using one, and it won't take you long.
Some database tasks are just so much easier to do OUTSIDE of the database.
Upvotes: 0