Reputation: 3417
The following code splits a string into a list of words but does not include numbers:
import re
txt="there_once was,a-monkey.called phillip?09.txt"
sep=re.compile(r"[\s\.,-_\?]+")
sep.split(txt)
['there', 'once', 'was', 'a', 'monkey', 'called', 'phillip', 'txt']
This code gives me words and numbers but still includes "_" as a valid character:
re.findall(r"\w+|\d+",txt)
['there_once', 'was', 'a', 'monkey', 'called', 'phillip', '09', 'txt']
What do I need to alter in either piece of code to end up with the desired result of:
['there', 'once', 'was', 'a', 'monkey', 'called', 'phillip', '09', 'txt']
Upvotes: 1
Views: 2450
Reputation: 77400
For the example case,
sep = re.compile(r"[^a-zA-Z0-9]+")
sep.split(txt)
should work. To separate numbers from words, try
re.findall(r"[a-zA-Z]+|\d+", txt)
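The two patterns above give the same result on the example string, but they differ when letters and digits sit next to each other. A minimal sketch of that difference follows; the string "track07_final" is a made-up input for illustration, not something from the question.

```python
import re

txt = "there_once was,a-monkey.called phillip?09.txt"

# Split on every run of characters that is not an ASCII letter or digit.
split_words = re.split(r"[^a-zA-Z0-9]+", txt)

# Collect runs of letters and runs of digits as separate tokens.
found_words = re.findall(r"[a-zA-Z]+|\d+", txt)

print(split_words)
print(found_words)

# Hypothetical input where a word and a number are adjacent.
mixed = "track07_final"
mixed_split = re.split(r"[^a-zA-Z0-9]+", mixed)   # keeps "track07" together
mixed_found = re.findall(r"[a-zA-Z]+|\d+", mixed)  # separates "track" and "07"

print(mixed_split)
print(mixed_found)
```

Note that re.split returns empty strings at the ends of the list if the text starts or ends with a separator, whereas re.findall does not, so findall may be the more convenient choice for arbitrary input.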
Upvotes: 2
Reputation: 131570
Here's a quick way that should do it:
re.findall(r"[a-zA-Z0-9]+",txt)
Here's another:
re.split(r"[\s\.,\-_\?]+",txt)
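(you just needed to escape the hyphen: unescaped, ",-_" inside a character class is read as the range from "," to "_", which also covers the digits, so the numbers were being swallowed as separators)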
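To see the effect of the hyphen, here is a small sketch comparing the original character class from the question with the escaped version. The variable names are only for illustration.

```python
import re

txt = "there_once was,a-monkey.called phillip?09.txt"

# Unescaped: ",-_" is a range from "," (0x2C) to "_" (0x5F).
# That range contains the digits 0-9, so "09" counts as separator text.
unescaped = re.compile(r"[\s\.,-_\?]+")

# Escaped: "\-" is a literal hyphen, so only the listed characters separate.
escaped = re.compile(r"[\s\.,\-_\?]+")

unescaped_result = unescaped.split(txt)
escaped_result = escaped.split(txt)

print(unescaped_result)  # no '09' in the list
print(escaped_result)    # '09' is kept

# The digits alone are matched entirely by the unescaped separator pattern.
digits_are_separators = unescaped.fullmatch("09") is not None
print(digits_are_separators)
```

Placing the hyphen first or last in the class (for example "[\s.,_?-]+") should also make it literal without a backslash.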
Upvotes: 2