
Reputation: 3377

Python split with multiple delimiters not working

I have a string:

feature.append(freq_and_feature(text, freq))

I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.

These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, such as the accepted answer to this one (Python: Split string with multiple delimiters):

import re
with open("helper.txt", "r") as helper:
    for row in helper:
        print(re.split('\' .,()_', row))

However, I get the following, which is not what I want.

['    feature.append(freq_and_feature(text, freq))\n']

Upvotes: 4

Views: 10669

Answers (4)

rock321987

Reputation: 11032

I think you are trying to split on the basis of non-word characters. It should be

re.split(r'[^A-Za-z0-9]+', s)

For ASCII input, [^A-Za-z0-9] is equivalent to [\W_].

Python Code

import re
s = 'feature.append(freq_and_feature(text, freq))'
# filter out the empty string left after the trailing '))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])

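Matching the words directly, instead of splitting on the delimiters, also works:

import re
p = re.compile(r'[^\W_]+')  # runs of word characters, excluding underscores
test_str = "feature.append(freq_and_feature(text, freq))"
print(p.findall(test_str))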
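The question reads its rows from a file, so the same findall approach can be applied line by line. The following is a minimal sketch, not the asker's actual setup: io.StringIO stands in for open("helper.txt"), and the file contents are assumed to be the single sample line from the question.

```python
import io
import re

# Runs of word characters, with underscores treated as separators.
word_re = re.compile(r'[^\W_]+')

# Stand-in for open("helper.txt", "r"); the contents are assumed for illustration.
helper = io.StringIO("    feature.append(freq_and_feature(text, freq))\n")

# findall ignores the leading spaces and the trailing newline,
# so no stripping or filtering of empty strings is needed.
rows = [word_re.findall(row) for row in helper]
print(rows)
```

Because findall returns only the matches, it never produces empty strings, whatever whitespace surrounds the row.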


Upvotes: 1

cjahangir

Reputation: 1797

You can try this

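import re
# row is one line of the input; a name other than str avoids shadowing the builtin
words = re.split('[.(_,)]+', row)
words.pop()  # drop the last element (whatever follows the final '))')
print(words)

For the sample string on its own (no surrounding whitespace or newline), this gives:

['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']

The last item keeps its leading space because the space is not in the character class.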

Upvotes: 1

Wiktor Stribiżew

Reputation: 626929

It seems you want to split a string on non-word or underscore characters. Use

import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']


The [\W_]+ regex matches one or more characters that are either non-word characters (\W, which is [^a-zA-Z0-9_] for ASCII text) or underscores.
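One caveat on that equivalence: in Python 3, \W on a str pattern is Unicode-aware unless re.ASCII is passed, so [\W_] and [^A-Za-z0-9] can differ on non-ASCII input. The sketch below uses a made-up sample string (not from the question) to show the difference.

```python
import re

# Hypothetical sample: "naive_cafe" with accented i and e, written as escapes.
s = "na\u00efve_caf\u00e9"

# Explicit ASCII class: accented letters count as delimiters.
ascii_only = re.split(r'[^A-Za-z0-9]+', s)

# Default Unicode matching: accented letters are word characters.
unicode_aware = re.split(r'[\W_]+', s)

# With re.ASCII, \W behaves like [^a-zA-Z0-9_] again.
ascii_flag = re.split(r'[\W_]+', s, flags=re.ASCII)

print(ascii_only, unicode_aware, ascii_flag)
```

For pure ASCII input such as the string in the question, all three calls give the same result.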

You can drop the if x filter if you first strip leading and trailing non-word characters from the input string, e.g. with re.sub(r'^[\W_]+|[\W_]+$', '', s).
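As a minimal sketch of that trim-then-split variant, assuming the input looks like a row read from the file (leading spaces and a trailing newline):

```python
import re

# Assumed input: the sample line with the whitespace a file row would carry.
s = "  feature.append(freq_and_feature(text, freq))\n"

# Remove delimiter runs at the very start and very end of the string.
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)

# With nothing to split off at either edge, no empty strings are produced.
words = re.split(r'[\W_]+', trimmed)
print(words)
```

This costs an extra pass over the string, so the list comprehension with if x is arguably just as reasonable; it is a matter of taste.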

Upvotes: 4

Justin

Reputation: 25327

re.split('\' .,()_', row)

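This treats ' .,()_ as one sequence to match (an apostrophe, a space, any character, a comma, an empty group, then an underscore), not as a set of alternative delimiters. That sequence never occurs in the row, so nothing is split. You probably meant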

re.split('[\' .,()_]', row)

re.split takes a regular expression as its first argument. To say "this OR that" in a regular expression, you can write a|b, which matches either a or b; writing ab matches only a followed by b. Rather than writing out a long alternation like '| |\.|,|\(|... (where characters such as . and ( also need escaping), you can put the characters inside [] to mean "match any one of these".
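To make the difference concrete, here is a small sketch comparing the three forms on the sample string from the question. The variable names are only illustrative.

```python
import re

row = "feature.append(freq_and_feature(text, freq))"

# No brackets: the pattern is a single sequence that never occurs in row,
# so re.split returns the whole string as one element.
no_class = re.split('\' .,()_', row)

# Character class: split on any ONE of the listed characters.
char_class = re.split('[\' .,()_]', row)

# The same thing spelled out as an alternation; . ( and ) must be escaped.
alternation = re.split(r"'| |\.|,|\(|\)|_", row)

# Adjacent delimiters such as ", " or "))" leave empty strings; filter them out.
words = [w for w in char_class if w]
print(no_class, char_class, words, sep="\n")
```

Adding a + after the character class ('[\' .,()_]+') collapses adjacent delimiters into one split point, which removes the empty strings in the middle but still leaves one at the end after the closing parentheses.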

Upvotes: 4
