user1893354

Reputation: 5938

Python parse words from URL string

I have a large data set of URLs and I need a way to parse words out of them, e.g.:

realestatesales.com -> {"real","estate","sales"}

I would prefer to do it in Python. This seems like it should be possible with some kind of English-language dictionary. There might be some ambiguous cases, but I feel like there should be a solution out there somewhere.

Upvotes: 3

Views: 1838

Answers (3)

Ben DeMott

Reputation: 3604

A ternary search tree (TST), when filled with a word dictionary, can find the most complex set of matched terms (words) in a string rather efficiently. This is the solution I've used before.
You can get a C/Python implementation of a TST here: http://github.com/nlehuen/pytst

Example:

import tst
tree = tst.TST()
# The tree is empty here; fill it with your dictionary words before scanning.
# Note that tst.ListAction() collects each matched term into a list.
words = tree.scan("MultipleWordString", tst.ListAction())

Other Resources:

The open-source search engine Solr uses what it calls a "Word Delimiter Filter" to deal with this problem; you might want to have a look at it.

Upvotes: 4

mbatchkarov

Reputation: 16039

This is a word segmentation problem, and an efficient dynamic programming solution exists. This page discusses how you could implement it. I have also answered this question on SO before, but I can't find a link to the answer. Please feel free to edit my post if you do.
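To give a rough idea of what a dynamic programming approach to word segmentation can look like, here is a minimal sketch. It is not the implementation from the page mentioned above. The tiny WORDS set, the function names segment and words_from_domain, and the "fewest words wins" scoring rule are all assumptions made for illustration; a real solution would load a full English word list and would probably score candidates by word frequency instead.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; a real run would load a full English word list.
WORDS = {"real", "estate", "sales", "sale", "state", "a"}
MAX_WORD_LEN = max(len(w) for w in WORDS)


def segment(text):
    """Split text into the fewest dictionary words, or return None if impossible."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Best segmentation of text[start:] as a tuple of words, or None.
        if start == len(text):
            return ()
        result = None
        last_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            if word not in WORDS:
                continue
            rest = best(end)
            if rest is None:
                continue
            candidate = (word,) + rest
            # Assumed scoring rule: prefer the split with the fewest words.
            if result is None or len(candidate) < len(result):
                result = candidate
        return result

    found = best(0)
    return list(found) if found is not None else None


def words_from_domain(domain):
    # Drop the top-level domain, assuming a bare host like "realestatesales.com".
    name = domain.rsplit(".", 1)[0]
    return segment(name)


print(words_from_domain("realestatesales.com"))  # ['real', 'estate', 'sales']
```

Because each suffix of the string is solved only once (the lru_cache memoizes best), the work grows roughly with the string length times the longest dictionary word, rather than with the number of possible splits.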

Upvotes: 2

Rune

Reputation: 133

This might be of use to you: http://www.clips.ua.ac.be/pattern

It's a set of modules which, depending on your system, might already be installed. It does all kinds of interesting stuff, and even if it doesn't do exactly what you need it might get you started on the right path.

Upvotes: 2
