Reputation: 752
I have a large (+/- 300,000 lines) dataset of text fragments that contain some noisy elements. By noisy I mean slang words, typing errors, etc. I wish to filter out these noisy elements to obtain a cleaner dataset.
I read some papers that propose to filter these out by keeping track of the number of occurrences of each word. By setting a threshold (e.g. fewer than 20 occurrences) we can assume these words are noise and can safely be removed from the corpus.
Maybe there are some libraries or algorithms that do this in a fast and efficient way. Of course I tried it myself first, but my attempt is EXTREMELY slow!
So to summarize, I am looking for an algorithm that can filter out, in a fast and efficient way, all words that occur less often than a particular threshold. Here is a small example:
This is just an example of whaat I wish to acccomplish.
The words 'whaat' and 'acccomplish' are misspelled and thus likely to occur less often (assuming we live in a perfect world where typos are rare …). I wish to end up with
This is just an example of I wish to.
Thanks!
PS: If possible, I'd like to have an algorithm in Java (or pseudo-code so I can write it myself)
Upvotes: 0
Views: 418
Reputation: 178491
I think you are overcomplicating it with the approach suggested in the comments.
You can do it with two passes over the data:
1. Build a Map<String,Integer> that counts the number of occurrences of each word.
2. Go over the data again and keep only the words for which map.get(word) > THRESHOLD.
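A minimal sketch of the two passes in Java (the class and method names NoiseFilter, countWords, and filterLine are my own, not from any library):

```java
import java.util.*;
import java.util.stream.Collectors;

public class NoiseFilter {

    // Pass 1: count how often each whitespace-separated word occurs.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Pass 2: rebuild a line, keeping only words whose count exceeds the threshold.
    static String filterLine(String line, Map<String, Integer> counts, int threshold) {
        return Arrays.stream(line.split("\\s+"))
                     .filter(w -> counts.getOrDefault(w, 0) > threshold)
                     .collect(Collectors.joining(" "));
    }
}
```

Both passes are linear in the size of the corpus, so this should handle 300,000 lines comfortably.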
As a side note, I think a fixed-threshold approach is not the best choice; I would personally filter out words that occur less than MEAN - 3*STD times, where MEAN is the average number of occurrences per word and STD is the standard deviation. (Three standard deviations means you are catching words that fall outside the expected normal distribution with probability of ~99%.) You can 'play' with the constant factor and find what best suits your needs.
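The statistical threshold could be computed over the count map from the first pass like this (StatThreshold and the parameter k, the number of standard deviations, are my own names; this assumes the counts are at least roughly normally distributed, which is often not true for word frequencies):

```java
import java.util.Collection;

public class StatThreshold {

    // Returns MEAN - k * STD over the per-word occurrence counts;
    // words occurring less often than this are treated as noise.
    static double threshold(Collection<Integer> counts, double k) {
        double mean = counts.stream()
                            .mapToInt(Integer::intValue)
                            .average().orElse(0);
        double variance = counts.stream()
                                .mapToDouble(c -> (c - mean) * (c - mean))
                                .average().orElse(0);
        return mean - k * Math.sqrt(variance);
    }
}
```

You would then pass counts.values() from the first pass and use the result in place of the fixed THRESHOLD in the second pass.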
Upvotes: 4