Reputation: 4695
How can I write a program in Python that splits a string on more than one word or character?
For example, I have these sentences: Hi, This is a test. Are you surprised?
In this example I need my program to split the sentences on ',', '!', '?' and '.'. I know about the split method of str and about NLTK,
but I would like to know whether there is a built-in, Pythonic way to do this, like split.
Upvotes: 1
Views: 204
Reputation: 17
def get_words(s):
    l = []
    w = ''
    for c in s:
        if c in '-!?,. ':
            if w != '':
                l.append(w)
            w = ''
        else:
            w = w + c
    if w != '':
        l.append(w)
    return l
>>> s = "Hi, This is a test. Are you surprised?"
>>> print(get_words(s))
['Hi', 'This', 'is', 'a', 'test', 'Are', 'you', 'surprised']
If you change '-!?,. ' to '-!?,.' (removing the space from the delimiters), the output will be:
['Hi', ' This is a test', ' Are you surprised']
Upvotes: 0
Reputation: 4695
I think I found a tricky way to solve my problem, and I don't need any modules for it. I can use the replace method of str to replace characters like ! or ? with ., and then use the split method to split my text on the . character.
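A minimal sketch of that idea, assuming the delimiters are the four characters from the question. The helper name split_on_any and its parameters are made up for illustration:

```python
def split_on_any(text, delimiters=",!?", target="."):
    """Split text on any delimiter by first mapping them all to target."""
    # Replace every extra delimiter with the single target delimiter.
    for d in delimiters:
        text = text.replace(d, target)
    # Split on the target, then trim whitespace and drop empty pieces.
    return [part.strip() for part in text.split(target) if part.strip()]


print(split_on_any("Hi, This is a test. Are you surprised?"))
# ['Hi', 'This is a test', 'Are you surprised']
```

The strip and the filter are optional extras: without them the result keeps the leading spaces and a trailing empty string, because the text ends with a delimiter.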
Upvotes: 1
Reputation: 14370
You are looking for the tokenizers in the NLTK package (nltk.tokenize). NLTK stands for Natural Language Toolkit.
Or try re.split from the re module.
From the re docs:
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']
Upvotes: 1
Reputation: 12220
Use re.split:
import re
string = 'Hi, This is a test. Are you surprised?'
words = re.split('[,!?.]', string)
print(words)
[u'Hi', u' This is a test', u' Are you surprised', u'']
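The result above keeps the leading spaces and a trailing empty string. A possible cleanup is sketched below, assuming empty pieces are unwanted. The name split_sentences is hypothetical:

```python
import re


def split_sentences(text):
    """Split on , ! ? or . and return the non-empty pieces."""
    # \s* after the character class swallows whitespace following a delimiter.
    parts = re.split(r"[,!?.]\s*", text)
    # A delimiter at the end of the text leaves a trailing empty string.
    return [p for p in parts if p]


print(split_sentences("Hi, This is a test. Are you surprised?"))
# ['Hi', 'This is a test', 'Are you surprised']
```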
Upvotes: 3