Reputation: 1471
I am trying to split the sentences in words.
words = content.lower().split()
this gives me the list of words like
'evening,', 'and', 'there', 'was', 'morning--the', 'first', 'day.'
and with this code:
def clean_up_list(word_list):
clean_word_list = []
for word in word_list:
symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
for i in range(0, len(symbols)):
word = word.replace(symbols[i], "")
if len(word) > 0:
clean_word_list.append(word)
I get something like:
'evening', 'and', 'there', 'was', 'morningthe', 'first', 'day'
if you see the word "morningthe" in the list, it used to have "--" in between words. Now, is there any way I can split them in two words like "morning","the"
??
Upvotes: 4
Views: 7025
Reputation: 3
You could also do this:
import re
def word_list(text):
return list(filter(None, re.split('\W+', text)))
print(word_list("Here we go round the mulberry-bush! And even---this and!!!this."))
Returns:
['Here', 'we', 'go', 'round', 'the', 'mulberry', 'bush', 'And', 'even', 'this', 'and', 'this']
Upvotes: 0
Reputation: 82934
Trying to do this with regexes will send you crazy e.g.
>>> re.findall(r'\w+', "Don't read O'Rourke's books!")
['Don', 't', 'read', 'O', 'Rourke', 's', 'books']
Definitely look at the nltk
package.
Upvotes: 1
Reputation: 48077
Alternatively, you may also use itertools.groupby
along with str.alpha()
to extract alphabets-only words from the string as:
>>> from itertools import groupby
>>> sentence = 'evening, and there was morning--the first day.'
>>> [''.join(j) for i, j in groupby(sentence, str.isalpha) if i]
['evening', 'and', 'there', 'was', 'morning', 'the', 'first', 'day']
PS: Regex based solution is much cleaner. I have mentioned this as an possible alternative to achieve this.
Specific to OP: If all you want is to also split on --
in the resultant list, then you may firstly replace hyphens '-'
with space ' '
before performing split. Hence, your code should be:
words = content.lower().replace('-', ' ').split()
where words
will hold the value you desire.
Upvotes: 3
Reputation: 611
Besides the solutions given already, you could also improve your clean_up_list
function to do a better work.
def clean_up_list(word_list):
clean_word_list = []
# Move the list out of loop so that it doesn't
# have to be initiated every time.
symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
for word in word_list:
current_word = ''
for index in range(len(word)):
if word[index] in symbols:
if current_word:
clean_word_list.append(current_word)
current_word = ''
else:
current_word += word[index]
if current_word:
# Append possible last current_word
clean_word_list.append(current_word)
return clean_word_list
Actually, you could apply the block in for word in word_list:
to the whole sentence to get the same result.
Upvotes: 0
Reputation: 423
I would suggest a regex-based solution:
import re
def to_words(text):
return re.findall(r'\w+', text)
This looks for all words - groups of alphabetic characters, ignoring symbols, seperators and whitespace.
>>> to_words("The morning-the evening")
['The', 'morning', 'the', 'evening']
Note that if you're looping over the words, using re.finditer
which returns a generator object is probably better, as you don't have store the whole list of words at once.
Upvotes: 4