Reputation: 53
How can I get the individual words from a string (a URL) in Python? From a URL like:
http://www.sample.com/level1/level2/index.html?id=1234
I want to get words like:
http, www, sample, com, level1, level2, index, html, id, 1234
Any solution using Python is welcome.
Thanks.
Upvotes: 2
Views: 2307
Reputation: 7091
This is how you can do it for any URL:
import re
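def getWordsFromURL(url):
    # Split on runs of URL punctuation; "." is included so that the host
    # name and the file extension are broken into separate words too.
    return re.compile(r'[\:/?=\-&.]+', re.UNICODE).split(url)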
Now you can use it like this:
url = "http://www.sample.com/level1/level2/index.html?id=1234"
words = getWordsFromURL(url)
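One caveat with this kind of split: re.split returns empty strings when the URL starts or ends with a separator (a trailing / for instance). Here is a minimal sketch that drops them. The name split_url_words and the exact separator set are assumptions made for illustration, not part of the answer above:

```python
import re

# Assumed separator set: common URL punctuation, including "." and "-".
SEPARATORS = re.compile(r"[:/?=\-&.]+")

def split_url_words(url):
    """Split url on runs of separators and drop any empty strings."""
    return [word for word in SEPARATORS.split(url) if word]

# The trailing "/" would otherwise leave an empty string at the end.
print(split_url_words("http://www.sample.com/level1/level2/"))
```

Without the filter, the last element of the result for that URL would be an empty string.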
Upvotes: 5
Reputation: 140168
Just regex-split on the longest runs of non-alphanumeric characters:
import re
l = re.split(r"\W+","http://www.sample.com/level1/level2/index.html?id=1234")
print(l)
yields:
['http', 'www', 'sample', 'com', 'level1', 'level2', 'index', 'html', 'id', '1234']
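This is simple but, as someone noted, it doesn't work when there are characters such as _ or - in URL names: \W+ splits on - (breaking a hyphenated name apart) while treating _ as a word character. So the less fun solution would be to list all the characters that can separate path parts: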
l = re.split(r"[/:\.?=&]+","http://stackoverflow.com/questions/41935748/splitting-a-string-url-into-words-using-python")
(I admit that I may have forgotten some separation symbols)
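If guessing the separator list feels fragile, another option is to let the standard library parse the URL first and then split each component on its own delimiter. This is only a sketch under stated assumptions: the function name url_words is hypothetical, and keeping - and _ inside words is my assumption about what should count as a word.

```python
import re
from urllib.parse import urlsplit, parse_qsl

def url_words(url):
    """Return the words of a URL, keeping '-' and '_' inside names."""
    parts = urlsplit(url)
    words = [parts.scheme]
    # Host labels are separated by dots (port and credentials are ignored).
    words += (parts.hostname or "").split(".")
    # Path segments are separated by "/"; "." splits name from extension.
    words += re.split(r"[/.]+", parts.path)
    # Query string: key=value pairs joined by "&".
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        words += [key, value]
    # Drop empty strings left by leading or trailing separators.
    return [w for w in words if w]

print(url_words("http://www.sample.com/level1/level2/index.html?id=1234"))
```

Note that urlsplit lowercases the host name and parse_qsl decodes percent-escapes in the query, so the words may not be byte-for-byte identical to the input.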
Upvotes: 2