Reputation: 3377
I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, such as the accepted answer to this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
Upvotes: 4
Views: 10669
Reputation: 11032
I think you are trying to split on non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
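[^A-Za-z0-9] can also be written as [\W_] (the two are equivalent for ASCII input; in Python 3, \W is Unicode-aware for str patterns, so they can differ on non-ASCII letters).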
Python Code
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, matching the words directly instead of splitting on the separators:
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
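As a rough sketch of how this might be applied to each row of the file from the question: the names words_per_line and sample_rows are made up here, and the list stands in for the open helper.txt file object, which could be passed in the same way.

```python
import re

# Runs of letters/digits; the underscore is excluded so it acts as a separator.
WORD = re.compile(r'[^\W_]+')

def words_per_line(lines):
    """Yield the list of words found in each line of an iterable of strings."""
    for line in lines:
        yield WORD.findall(line)

# Hypothetical stand-in for the rows of helper.txt.
sample_rows = ["feature.append(freq_and_feature(text, freq))\n"]
for words in words_per_line(sample_rows):
    print(words)
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```

Because findall collects matches rather than splitting, the trailing newline and the closing parentheses never produce empty strings.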
Upvotes: 1
Reputation: 1797
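words = re.split('[.(_,)]+', row)  # avoid naming the variable str, which shadows the builtin
words.pop()  # drop the last piece (the trailing newline read from the file)
print(words)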
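This results in the following (note that ' freq' keeps its leading space, because the space is not in the character class):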
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
Upvotes: 1
Reputation: 626929
It seems you want to split a string on non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches one or more characters that are either non-word characters (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
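A minimal sketch of that trim-then-split variant, for comparison with the filtered version; the variable names trimmed, untrimmed and words are just illustrative.

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'

# Without trimming, the trailing "))" leaves an empty string at the end.
untrimmed = re.split(r'[\W_]+', s)
print(untrimmed)
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq', '']

# Strip leading/trailing separators first, then split: no filtering needed.
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)
words = re.split(r'[\W_]+', trimmed)
print(words)
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```

One caveat: if the string consists only of separators, the trimmed string is empty and re.split returns [''], so the if x filter is still the safer choice for arbitrary input.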
Upvotes: 4
Reputation: 25327
re.split('\' .,()_', row)
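This looks for the whole sequence ' .,()_ as one pattern to split on (and since . and () are regex metacharacters, not even that exact literal text), so nothing in your row matches and the row comes back unsplit. You probably meant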
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, so we don't have to write '| |.|,|(|..., there's a nice form where you can use []s to state that everything inside should be treated as "match one of these".
Upvotes: 4