Mohamed Taher Alrefaie

Reputation: 16233

How to split string by space and treat special characters as a separate word in Python?

Assume I have a string,

"I want that one, it is great."

I want to split this string into

["I", "want", "that", "one", ",", "it", "is", "great", "."]

Special characters such as ",.:;" (and possibly others) should each be treated as a separate word.

Is there an easy way to do this with Python 2.7?

Update

For an example such as "I don't.", the result should be ["I", "don", "'", "t", "."]. Ideally it would also work with non-English punctuation such as ؛ and others.

Upvotes: 1

Views: 3714

Answers (4)

unutbu

Reputation: 879321

In [70]: re.findall(r"[^,.:;' ]+|[,.:;']", "I want that one, it is great.")
Out[70]: ['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.']

In [76]: re.findall(r"[^,.:;' ]+|[,.:;']", "I don't.")
Out[76]: ['I', 'don', "'", 't', '.']

The regex [^,.:;' ]+|[,.:;'] matches (1-or-more characters other than ,, ., :, ;, ' or a literal space), or (the literal characters ,, ., :, ;, or ').
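As a rough sketch, the same pattern can be wrapped in a small helper so the set of special characters is easy to change. The function name split_keep_specials and its specials parameter are made up for illustration; only re.findall and re.escape come from the standard library.

```python
import re

def split_keep_specials(text, specials=",.:;'"):
    # Escape each special so that characters such as ']' or '^'
    # would be taken literally inside the character class.
    cls = "".join(re.escape(ch) for ch in specials)
    # Either a run of characters that are neither special nor a space,
    # or exactly one special character.
    pattern = r"[^{0} ]+|[{0}]".format(cls)
    return re.findall(pattern, text)

print(split_keep_specials("I want that one, it is great."))
print(split_keep_specials("I don't."))
```

Passing a different specials string (for example ",.:;'!?") changes which characters are split off, without touching the pattern by hand.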


Or, using the regex module, you could easily expand this to include all punctuation and symbols by using the [:punct:] character class:

In [77]: import regex

In Python 2:

In [4]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""A \N{ARABIC SEMICOLON} B""")
Out[4]: [u'A', u'\u061b', u'B']

In [6]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""He said, "I don't!" """)
Out[6]: [u'He', u'said', u',', u'"', u'I', u'don', u"'", u't', u'!', u'"']

In Python 3:

In [105]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """A \N{ARABIC SEMICOLON} B""")
Out[105]: ['A', '؛', 'B']

In [83]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """He said, "I don't!" """)
Out[83]: ['He', 'said', ',', '"', 'I', 'don', "'", 't', '!', '"']

Note that it is important to pass a unicode string (not a byte string) as the second argument to regex.findall if you want [:punct:] to match Unicode punctuation or symbols.

In Python 2:

import regex
print(regex.findall(r"[^[:punct:] ]+|[[:punct:]]", 'help؛'))
print(regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u'help؛'))

prints

['help\xd8\x9b']
[u'help', u'\u061b']
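If installing the third-party regex module is not an option, a similar effect can be sketched in Python 3 with the standard re module, where str patterns are Unicode-aware by default. This is an alternative under that assumption, not the [:punct:] approach above: it treats anything that is neither a word character nor whitespace as a separate token, so the underscore counts as part of a word and symbols such as $ are split off too. The name split_unicode is hypothetical.

```python
import re

def split_unicode(text):
    # \w+     : a run of word characters (letters, digits, underscore)
    # [^\w\s] : one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(split_unicode("help\N{ARABIC SEMICOLON}"))
print(split_unicode('He said, "I don\'t!" '))
```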

Upvotes: 1

Dan

Reputation: 4663

You can use a regex and a simple list comprehension to do this. The regex splits the string into words and punctuation, and the list comprehension removes the whitespace-only pieces.

import re
s = "I want that one, it is great. Don't do it."
new_s = [c.strip() for c in re.split(r'(\W+)', s) if c.strip() != '']
print new_s

The output of new_s will be:

['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.', 'Don', "'", 't', 'do', 'it', '.']
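One thing to watch with (\W+): it captures a whole run of non-word characters as a single piece, so adjacent punctuation stays glued together. The sketch below (Python 3 syntax, sample string chosen for illustration) compares it with (\W), which captures one character at a time.

```python
import re

s = 'He said, "I don\'t!"'

# (\W+) keeps runs such as ', "' and '!"' together as one token.
grouped = [c.strip() for c in re.split(r'(\W+)', s) if c.strip()]

# (\W) splits on each non-word character separately; the extra empty
# strings and spaces it produces are dropped by the filter.
single = [c for c in re.split(r'(\W)', s) if c.strip()]

print(grouped)
print(single)
```

With the one-character version every punctuation mark ends up as its own list item, which is closer to what the question asks for.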

Upvotes: 1

Liam McElhaney

Reputation: 3

I don't know of any built-in function that does this, but you could use a for loop.

Something like this:

words = []
word = ""
specials = ",.:;'"
for ch in stringName:
    if ch == " " or ch in specials:
        # A space or special character ends the current word
        if word:
            words.append(word)
            word = ""
        # Special characters become words of their own; spaces are dropped
        if ch != " ":
            words.append(ch)
    else:
        word += ch
# Don't lose a word that runs to the end of the string
if word:
    words.append(word)

Hope this works... sorry if it is not the best way.

Upvotes: 0

Greg Sadetsky

Reputation: 5082

See here for a similar question. The answer there applies to you as well:

import re
print re.split(r'(\W)', "I want that one, it is great.")
print re.split(r'(\W)', "I don't.")

You can remove the spaces and empty strings returned by re.split using a filter:

s = "I want that one, it is great."
print filter(lambda tok: tok not in [' ', ''], re.split(r'(\W)', s))
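For reference, here is a Python 3 sketch of the same idea that also shows what re.split returns before filtering. In Python 3, filter returns a lazy iterator rather than a list, so a list comprehension is used instead; the variable names are only illustrative.

```python
import re

s = "I don't."

# The capturing group makes re.split keep the separators, and a
# separator at the very end leaves a trailing empty string.
parts = re.split(r'(\W)', s)
print(parts)

# Drop the single spaces and empty strings.
tokens = [p for p in parts if p not in (' ', '')]
print(tokens)
```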

Upvotes: 1
