Splitting in Python with everything BUT particular set of cases

Question

I am not very good with regex, and it confuses me every time it comes up. So, instead of writing a possibly incorrect regex string, I want to split a string a different way.

Let's say I have a string "hello, my name is Joseph! Haha, hello!" and I want to split it whenever I encounter a non-alphanumeric character. So then, in this case, I would obtain:

"hello" "my" "name" "is" "Joseph" "Haha" "hello"

Is there a way to do this without a regex string? As in: split whenever character != alphanumeric?

(Yes, I do realize it is probably not a smart thing to do to not correct my regex deficiency!)

Wiktor Stribiżew · Accepted Answer

Personally, I think it is appropriate to use simple and straightforward regexes for such simple tasks.

Compare the itertools and re solutions:

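import itertools
import re

s = "hello, my name is Joseph! Haha, hello!"
# itertools: group consecutive characters by str.isalnum, then keep every other chunk
print(["".join(x) for _, x in itertools.groupby(s, key=str.isalnum)][0::2])
# re: collect every run of word characters
print(re.findall(r"\w+", s))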


As for me, I'd vote for the regex here. The \w+ matches one or more word characters (letters, digits, underscores) and the re.findall returns all the non-overlapping occurrences.
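One caveat worth illustrating: because \w also matches underscores, it is slightly broader than "alphanumeric". The sketch below uses a made-up input string (an assumption, not from the question) and shows one possible pattern, [^\W_]+, that excludes underscores. Treat it as an illustration rather than the only way to do it.

```python
import re

# Hypothetical input containing an underscore, chosen only for illustration.
s = "snake_case, hello!"

# \w+ treats the underscore as a word character, so "snake_case" stays whole.
with_underscore = re.findall(r"\w+", s)

# [^\W_]+ means "word characters except the underscore",
# which is closer to splitting on every non-alphanumeric character.
without_underscore = re.findall(r"[^\W_]+", s)

print(with_underscore)
print(without_underscore)
```

For the string in the question both patterns give the same result, since it contains no underscores.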

itertools.groupby groups consecutive characters into chunks according to the key, here str.isalnum, so alphanumeric and non-alphanumeric chunks alternate. The [0::2] slice keeps the chunks at even indices and drops every second one (the non-word chunks in this concrete case). If the string starts with a non-word character, this won't work, because the slice then keeps the non-word chunks instead. A regex solution is safer and easier.
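If a regex-free version is still wanted, one way to sidestep the leading-character problem is to filter on the group key instead of slicing. This is a minimal sketch; the function name alnum_chunks is hypothetical and not part of the original answer.

```python
import itertools

def alnum_chunks(s):
    """Return the runs of alphanumeric characters in s, in order."""
    # groupby yields (key, group) pairs; the key is True for alphanumeric runs.
    # Keeping only the True groups works regardless of how the string starts.
    return [
        "".join(group)
        for is_alnum, group in itertools.groupby(s, key=str.isalnum)
        if is_alnum
    ]

print(alnum_chunks("hello, my name is Joseph! Haha, hello!"))
print(alnum_chunks("!hello, my name"))
```

Unlike the [0::2] slice, this does not depend on which kind of chunk comes first.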
