Reputation: 3733
Input:
Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc.
Code:
import re
input_words = list(re.split('\s+', input()))
print(input_words)
This works perfectly and returns:
['Some', 'Text', 'here:', 'Java,', 'PHP,', 'JS,', 'HTML', '5,', 'CSS,', 'Web,', 'C#,', 'SQL,', 'databases,', 'AJAX,', 'etc.']
But when I add some other separators, like this:
import re
input_words = list(re.split('\s+ , ; : . ! ( ) " \' \ / [ ] ', input()))
print(input_words)
It doesn't split on spaces anymore. Where am I going wrong?
The expected output would be:
['Some', 'Text', 'here', 'Java', 'PHP', 'JS', 'HTML', '5', 'CSS', 'Web', 'C#', 'SQL', 'databases', 'AJAX', 'etc']
Upvotes: 2
Views: 139
Reputation: 237
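Write the separators inside square brackets (a character class) and put the + after the closing bracket, as shown below. Inside the brackets, + would be treated as a literal plus sign. Hope it helps.
import re
# The + after the class treats a run of separators as a single split point
input_words = re.split(r'[\s,:.!()]+', input())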
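As a small sketch of why the position of the quantifier matters (the sample string and variable names here are made up for illustration): inside square brackets, + is an ordinary character, so it only makes a literal plus sign a separator, while a + placed after the closing bracket lets a whole run of separators count as one split point.

```python
import re

sample = "Java, PHP, C++"  # hypothetical input, not taken from the question

# '+' inside the class is a literal plus sign, and every separator splits on its own
inside = re.split(r'[\s+,:.!()]', sample)

# '+' after the class means "one or more separators in a row"
outside = re.split(r'[\s,:.!()]+', sample)

print(inside)   # ['Java', '', 'PHP', '', 'C', '', '']
print(outside)  # ['Java', 'PHP', 'C++']
```

The empty strings in the first result come from adjacent separators such as a comma followed by a space.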
Upvotes: 1
Reputation: 1875
Word tokenization using the nltk module:
#!/usr/bin/python3
import nltk
sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
words = nltk.tokenize.word_tokenize(sentence)
print(words)
Output:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', '...', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
Upvotes: 0
Reputation: 521457
You should be splitting on a regex character class containing all of those symbols:
import re
input_words = re.split(r'[\s,;:.!()"\'\\\[\]]', input())
print(input_words)
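This is a literal answer to your question, and it returns an empty string wherever two separators are adjacent (for example, a comma followed by a space). The solution you probably want is to split on runs of one or more symbols and whitespace, e.g.:
text = "A B ; C.D ! E[F] G"
# The trailing + treats a run of separators as a single split point
input_words = re.split(r'[\s,;:.!()"\'\\\[\]]+', text)
print(input_words)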
Prints:
['A', 'B', 'C', 'D', 'E', 'F', 'G']
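Applied to the sentence from the question, a sketch along these lines gives the expected list. The helper name split_words and the SEPARATORS constant are illustrative and not part of the original answer; / is added to the class because the question lists it, and the final filter drops the empty string that the trailing . would otherwise leave behind.

```python
import re

# Hypothetical helper. Separators from the question: , ; : . ! ( ) " ' \ / [ ] plus whitespace
SEPARATORS = re.compile(r'[\s,;:.!()"\'\\/\[\]]+')

def split_words(text):
    # Split on runs of separators and discard empty strings at the edges.
    return [word for word in SEPARATORS.split(text) if word]

line = "Some Text here: Java, PHP, JS, HTML 5, CSS, Web, C#, SQL, databases, AJAX, etc."
print(split_words(line))
```

The pattern in the question failed because whitespace inside a regex is matched literally (unless re.VERBOSE is used), so it required that whole sequence of symbols to appear in order instead of treating each one as a separate delimiter.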
Upvotes: 6