user6952520

Reputation: 23

Splitting in Python on everything BUT a particular set of characters

I am not very good with regex, and it confuses me every time it comes up, so instead of writing a possibly incorrect regex string, I want to split a string a different way.

Let's say I have a string "hello, my name is Joseph! Haha, hello!" and I want to split it whenever I encounter a non-alphanumeric character. So then, in this case, I would obtain:

"hello" "my" "name" "is" "Joseph" "Haha" "hello"

Is there a way to do this without a regex string? As in: split whenever character != alphanumeric?

(Yes, I do realize it is probably not a smart thing to do to not correct my regex deficiency!)

Upvotes: 1

Views: 60

Answers (2)

I'm always glad when someone tries to avoid using regex ;) But here it's probably the best tool for the job.

You can write your own parser, but that is much more verbose:

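s = "hello, my name is Joseph! Haha, hello!"
words = []
lasti = 0
lastp = True  # treat the position before the string as a separator
# p is True for separators (non-alphanumeric), False inside a word
for i, p in enumerate(not c.isalnum() for c in s):
    if p != lastp:
        if p: words.append(s[lasti:i])  # a word just ended at index i
        lasti, lastp = i, p
if not lastp: words.append(s[lasti:])  # flush a word that runs to the end

print(', '.join(words))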
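
If the verbosity is the main concern, a shorter regex-free variant is possible. This is only a rough sketch: it replaces every non-alphanumeric character with a space and lets str.split() collapse the gaps. The helper name split_alnum is made up for illustration.

```python
def split_alnum(text):
    # Hypothetical helper: turn every non-alphanumeric character into a space,
    # then split on runs of whitespace (str.split() with no argument).
    return "".join(c if c.isalnum() else " " for c in text).split()

s = "hello, my name is Joseph! Haha, hello!"
print(split_alnum(s))
# ['hello', 'my', 'name', 'is', 'Joseph', 'Haha', 'hello']
```

Because split() without arguments discards empty pieces, leading or trailing punctuation does not produce empty strings.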

Upvotes: 0

Wiktor Stribiżew

Reputation: 626903

Personally, I think it is appropriate to use simple and straightforward regexes for such simple tasks.

Compare the itertools and re solutions:

import itertools, re
s = "hello, my name is Joseph! Haha, hello!"
print(["".join(x) for _, x in itertools.groupby(s, key=str.isalnum)][0::2])
print(re.findall(r"\w+", s))


As for me, I'd vote for the regex here. The \w+ matches one or more word characters (letters, digits, underscores) and the re.findall returns all the non-overlapping occurrences.
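
One thing to keep in mind: since \w includes the underscore, it is slightly broader than "alphanumeric". If underscores should split as well, a character class such as [^\W_] (word characters minus the underscore) is one way to express that. A small sketch, with a made-up sample string:

```python
import re

sample = "snake_case, hello!"  # illustrative input, not from the question

with_underscore = re.findall(r"\w+", sample)         # underscore is a word character
without_underscore = re.findall(r"[^\W_]+", sample)  # underscore acts as a separator

print(with_underscore)     # ['snake_case', 'hello']
print(without_underscore)  # ['snake', 'case', 'hello']
```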

The itertools.groupby call groups consecutive characters into chunks according to the key, which is set to str.isalnum, and [0::2] keeps every other chunk starting from the first, which drops the non-word chunks in this concrete case. If a string starts with a non-word char, this won't work, because the slice then keeps the separators instead of the words. A regex solution is safer and easier.
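
If you still prefer groupby, the dependence on the first character can be avoided by filtering on the group key rather than slicing. A minimal sketch, where alnum_runs is just an illustrative name:

```python
import itertools

def alnum_runs(text):
    # Hypothetical helper: keep only the groups whose key (str.isalnum) is True,
    # so it does not matter whether the text starts with a separator.
    return ["".join(group)
            for is_word, group in itertools.groupby(text, key=str.isalnum)
            if is_word]

print(alnum_runs("...hello, my name is Joseph!"))
# ['hello', 'my', 'name', 'is', 'Joseph']
```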

Upvotes: 1
