Reputation: 5938
I have a large data set of URLs and I need a way to parse the words out of them, e.g.:
realestatesales.com -> {"real","estate","sales"}
I would prefer to do it in Python. This seems like it should be possible with some kind of English-language dictionary. There might be some ambiguous cases, but I feel like there should be a solution out there somewhere.
Upvotes: 3
Views: 1838
Reputation: 3604
Ternary search trees, when filled with a word dictionary, can find the full set of matched terms (words) in a string rather efficiently. This is the solution I've used previously.
You can get a C/Python implementation of a tst here: http://github.com/nlehuen/pytst
Example:
import tst

tree = tst.TST()
# the tree must be filled with dictionary words before scanning
for word in ("multiple", "word", "string"):
    tree.put(word, word)
# tst.ListAction() collects each matched term into a list
words = tree.scan("multiplewordstring", tst.ListAction())
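To apply this to real URLs, you would load a full word list into the tree and scan the host name with the TLD stripped. A minimal sketch, assuming the put()/scan() API above and a Unix-style word list at /usr/share/dict/words (adjust the path for your platform):

import tst

tree = tst.TST()
# load a system word list; the path is an assumption, adjust for your platform
with open("/usr/share/dict/words") as f:
    for line in f:
        word = line.strip().lower()
        if len(word) > 2:  # skip one- and two-letter noise entries
            tree.put(word, word)

# strip the TLD and scan the remaining host name
host = "realestatesales.com".rsplit(".", 1)[0]
print(tree.scan(host, tst.ListAction()))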
Other Resources:
The open-source search engine Solr uses what it calls a "Word-Boundary-Filter" to deal with this problem; you might want to have a look at it.
Upvotes: 4
Reputation: 16039
This problem is known as word segmentation, and an efficient dynamic programming solution exists. This page discusses how you could implement it. I have also answered this question on SO before, but I can't find a link to that answer; please feel free to edit my post if you do.
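For reference, here is a minimal dynamic programming sketch of that approach; the tiny hard-coded word set stands in for a real dictionary, and real implementations typically score candidate splits by word frequency rather than simply preferring fewer words:

def segment(text, dictionary):
    """Split text into dictionary words, preferring segmentations with fewer words.

    best[i] holds the best segmentation of text[:i], or None if there is none.
    """
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            if best[j] is not None and word in dictionary:
                candidate = best[j] + [word]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]

# toy dictionary; a real one would come from a word list, ideally with frequencies
words = {"real", "estate", "sales", "es", "tate", "reale"}
print(segment("realestatesales", words))  # -> ['real', 'estate', 'sales']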
Upvotes: 2
Reputation: 133
This might be of use to you: http://www.clips.ua.ac.be/pattern
It's a set of modules which, depending on your system, might already be installed. It does all kinds of interesting stuff, and even if it doesn't do exactly what you need, it might get you started on the right path.
Upvotes: 2