Reputation: 16233
Assume I have a string,
"I want that one, it is great."
I want to split up this string to be
["I", "want", "that", "one", ",", "it", "is", "great", "."]
Special characters such as ",.:;" (and possibly others) should each be treated as a separate word.
Is there any easy way to do this with Python 2.7?
For an example such as "I don't.", it should be ["I", "don", "'", "t", "."]. It would ideally work with non-English punctuation such as ؛ and others.
Upvotes: 1
Views: 3714
Reputation: 879321
In [70]: re.findall(r"[^,.:;' ]+|[,.:;']", "I want that one, it is great.")
Out[70]: ['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.']
In [76]: re.findall(r"[^,.:;' ]+|[,.:;']", "I don't.")
Out[76]: ['I', 'don', "'", 't', '.']
The regex [^,.:;' ]+|[,.:;'] matches either one or more characters other than , . : ; ' or a literal space, or a single one of the literal characters , . : ; or '.
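To make the set of separator characters easy to change, the same idea can be wrapped in a small helper. This is only a sketch: the function name split_keep_punct and its punct parameter are made up for illustration, and it uses \s (any whitespace) where the regex above uses a literal space.

```python
import re

def split_keep_punct(text, punct=",.:;'"):
    """Return words and punctuation marks as separate tokens."""
    # Escape the characters so they are safe inside a character class.
    cls = re.escape(punct)
    # Either a run of characters that are neither punctuation nor whitespace,
    # or exactly one punctuation character.
    pattern = r"[^%s\s]+|[%s]" % (cls, cls)
    return re.findall(pattern, text)

print(split_keep_punct("I want that one, it is great."))
print(split_keep_punct("I don't."))
```

Passing a different punct string (for example ",.!?") changes which characters are split out, without touching the pattern by hand.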
Or, using the regex module (a third-party package), you could easily expand this to include all punctuation and symbols by using the [:punct:] character class:
In [77]: import regex
In Python2:
In [4]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""A \N{ARABIC SEMICOLON} B""")
Out[4]: [u'A', u'\u061b', u'B']
In [6]: regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u"""He said, "I don't!" """)
Out[6]: [u'He', u'said', u',', u'"', u'I', u'don', u"'", u't', u'!', u'"']
In Python3:
In [105]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """A \N{ARABIC SEMICOLON} B""")
Out[105]: ['A', '؛', 'B']
In [83]: regex.findall(r"[^[:punct:] ]+|[[:punct:]]", """He said, "I don't!" """)
Out[83]: ['He', 'said', ',', '"', 'I', 'don', "'", 't', '!', '"']
Note that it is important that you pass a unicode string as the second argument to regex.findall if you wish [:punct:] to match unicode punctuation or symbols.
In Python2:
import regex
print(regex.findall(r"[^[:punct:] ]+|[[:punct:]]", 'help؛'))
print(regex.findall(ur"[^[:punct:] ]+|[[:punct:]]", u'help؛'))
prints
['help\xd8\x9b']
[u'help', u'\u061b']
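If installing the regex module is not an option, a rough approximation with the standard re module is possible. Note that this is a different technique from [:punct:]: it treats any character that is neither a word character nor whitespace as its own token, so underscores stay inside words. The names TOKEN_RE and tokenize are hypothetical, and the input is assumed to be a unicode string.

```python
import re

# \w+ matches a run of word characters; [^\w\s] matches a single character
# that is neither a word character nor whitespace (punctuation, symbols).
# re.UNICODE matters on Python 2; on Python 3 it is already the default for str.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def tokenize(text):
    """Split text into words and single punctuation/symbol tokens."""
    return TOKEN_RE.findall(text)

print(tokenize(u"I don't."))
print(tokenize(u"help\u061b"))  # \u061b is ARABIC SEMICOLON
```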
Upvotes: 1
Reputation: 4663
You can use Regex and a simple list comprehension to do this. The regex will pull out words and separate punctuation, and the list comprehension will remove the blank spaces.
import re
s = "I want that one, it is great. Don't do it."
new_s = [c.strip() for c in re.split(r'(\W+)', s) if c.strip() != '']
print new_s
The output of new_s will be:
['I', 'want', 'that', 'one', ',', 'it', 'is', 'great', '.', 'Don', "'", 't', 'do', 'it', '.']
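One caveat worth checking: because \W+ matches a whole run of non-word characters, adjacent punctuation marks end up in one token, and strip() only trims the ends. A small comparison against splitting on a single \W, using an example sentence that is not from the question:

```python
import re

s = 'He said, "hi!"'

# Splitting on runs (\W+) keeps adjacent punctuation together.
grouped = [t.strip() for t in re.split(r'(\W+)', s) if t.strip()]

# Splitting on a single character (\W) yields one token per punctuation mark.
single = [t for t in re.split(r'(\W)', s) if t.strip()]

print(grouped)  # the comma and the opening quote form one token
print(single)   # every punctuation mark is separate
```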
Upvotes: 1
Reputation: 3
I don't know of any built-in function that does this, but you could use a for loop.
Something like this:
    words = []
    word = ""
    for ch in stringName:
        if ch.isalnum():
            word += ch  # still inside a word
        else:
            if word:
                words.append(word)  # the current word just ended
                word = ""
            if not ch.isspace():
                words.append(ch)  # punctuation becomes its own token
    if word:
        words.append(word)  # flush the last word
Hope this works... sorry if it is not the best way.
Upvotes: 0
Reputation: 5082
See here for a similar question. The answer there applies to you as well:
import re
print re.split('(\W)', "I want that one, it is great.")
print re.split('(\W)', "I don't.")
You can remove the spaces and empty strings returned by re.split using a filter:
s = "I want that one, it is great."
print filter(lambda _: _ not in [' ', ''], re.split('(\W)', s))
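The snippets above use the Python 2 print statement, and in Python 3 filter returns an iterator rather than a list. A sketch of the same idea that should run on either version uses a list comprehension instead; it drops any whitespace-only piece, which is slightly broader than the filter above.

```python
import re

s = "I want that one, it is great."

# re.split with a capturing group keeps the separators in the result;
# the comprehension then drops empty strings and whitespace-only pieces.
tokens = [t for t in re.split(r'(\W)', s) if t.strip()]

print(tokens)
```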
Upvotes: 1