Hi, I have a question about splitting strings into tokens.
Here is an example string:
string = "As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests."
and I'm trying to split string correctly into its tokens.
Here is my function count_words:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split("[\s.,!?:;'\"-]+",lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
and here is the result of split:
['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out', 'of', 'a', 'side', 'room', 'and', 'at', 'a', 'glance', 'i', 'was', 'sure', 'he', 'must', 'be', 'long', 'john', 'his', 'left', 'leg', 'was', 'cut', 'off', 'close', 'by', 'the', 'hip', 'and', 'under', 'the', 'left', 'shoulder', 'he', 'carried', 'a', 'crutch', 'which', 'he', 'managed', 'with', 'wonderful', 'dexterity', 'hopping', 'about', 'upon', 'it', 'like', 'a', 'bird', 'he', 'was', 'very', 'tall', 'and', 'strong', 'with', 'a', 'face', 'as', 'big', 'as', 'a', 'ham—plain', 'and', 'pale', 'but', 'intelligent', 'and', 'smiling', 'indeed', 'he', 'seemed', 'in', 'the', 'most', 'cheerful', 'spirits', 'whistling', 'as', 'he', 'moved', 'about', 'among', 'the', 'tables', 'with', 'a', 'merry', 'word', 'or', 'a', 'slap', 'on', 'the', 'shoulder', 'for', 'the', 'more', 'favoured', 'of', 'his', 'guests', '']
As you can see, there is an empty string '' at the last index of the split list.
Please help me understand why this empty string is in the list and how to split this example string correctly.
Upvotes: 5
Views: 6945
Reputation: 11
import string

def count_words(text):
    counts = dict()
    text = text.translate(text.maketrans('', '', string.punctuation))
    text = text.lower()
    words = text.split()
    print(words)
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
It works.
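One caveat, as far as I can tell: string.punctuation only covers ASCII punctuation, so the em-dash (U+2014) joining "ham" and "plain" in the example string survives this approach. Below is a rough sketch of one way around it. Adding the em-dash to the set, and mapping punctuation to spaces instead of deleting it, are my own assumptions about what is wanted here; the names EM_DASH, sample, ascii_only and with_dash are made up for the illustration.

```python
import string

EM_DASH = "\u2014"  # the dash used in the example text
sample = "a face as big as a ham" + EM_DASH + "plain and pale, but smiling."

# Deleting only ASCII punctuation leaves the em-dash inside one token.
ascii_only = sample.translate(
    str.maketrans("", "", string.punctuation)
).lower().split()

# Assumption: map each punctuation character (plus the em-dash) to a space,
# so that words joined by a dash are separated rather than glued together.
table = str.maketrans({ch: " " for ch in string.punctuation + EM_DASH})
with_dash = sample.translate(table).lower().split()

print(ascii_only)
print(with_dash)
```

With the second table, "ham" and "plain" come out as separate tokens.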
Upvotes: 0
Reputation: 1970
You get an empty string because the period at the end of the string also matches the split pattern, and nothing comes after it. You can, however, filter out empty strings with the filter
function and thus complete your function:
import re
import collections

def count_words(text):
    """Count how many times each unique word occurs in text."""
    lowerText = text.lower()
    # raw string, so the backslashes reach the regex engine unchanged
    split = re.split(r"[ .,!?:;'\"\-]+", lowerText)
    # filter out empty strings and count words
    return collections.Counter(filter(None, split))

count_words(text=string)
# Counter({'a': 9, 'he': 6, 'the': 6, 'and': 5, 'as': 4, 'was': 4, 'with': 3, 'his': 2, 'about': 2, 'i': 2, 'of': 2, 'shoulder': 2, 'left': 2, 'dexterity': 1, 'seemed': 1, 'managed': 1, 'among': 1, 'indeed': 1, 'favoured': 1, 'moved': 1, 'it': 1, 'slap': 1, 'cheerful': 1, 'at': 1, 'in': 1, 'close': 1, 'glance': 1, 'face': 1, 'pale': 1, 'smiling': 1, 'out': 1, 'tables': 1, 'cut': 1, 'ham': 1, 'for': 1, 'long': 1, 'intelligent': 1, 'waiting': 1, 'wonderful': 1, 'which': 1, 'under': 1, 'must': 1, 'bird': 1, 'guests': 1, 'more': 1, 'hip': 1, 'be': 1, 'sure': 1, 'leg': 1, 'very': 1, 'big': 1, 'spirits': 1, 'upon': 1, 'but': 1, 'like': 1, 'most': 1, 'carried': 1, 'whistling': 1, 'merry': 1, 'tall': 1, 'word': 1, 'strong': 1, 'by': 1, 'on': 1, 'john': 1, 'off': 1, 'room': 1, 'hopping': 1, 'or': 1, 'crutch': 1, 'man': 1, 'plain': 1, 'side': 1, 'came': 1})
Upvotes: 1
Reputation: 6036
Python's documentation for re.split explains this behavior:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string.
Even though yours is not actually a capturing group, the effect is the same. Note that the empty string could appear at the start as well as at the end (for instance, if your string started with whitespace).
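To see this on something smaller, here is a quick demonstration with the pattern from the question. The short sample strings and the variable names are made up for the illustration.

```python
import re

pattern = r"[\s.,!?:;'\"-]+"

# The separator matches only at the end: one trailing empty string.
end_only = re.split(pattern, "long john.")

# The separator matches at the start and at the end: an empty string on each side.
both_ends = re.split(pattern, " long john.")

print(end_only)   # ['long', 'john', '']
print(both_ends)  # ['', 'long', 'john', '']
```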
The two solutions already proposed (more or less) by others are these:
findall
As other users pointed out, you can use findall and invert the logic of the pattern. With yours, you can easily negate your character class: [^\s\.,!?:;'\"-]+ .
But it depends on your regex pattern, because it is not always that easy.
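A minimal sketch of the findall variant with the negated class, run on one sentence from the example (the variable names text and tokens are only for illustration):

```python
import re

text = "He was very tall and strong, with a face as big as a ham."

# Match runs of characters that are NOT separators, instead of splitting on separators.
tokens = re.findall(r"[^\s.,!?:;'\"-]+", text.lower())

print(tokens)
```

Since findall only returns what the pattern matched, and the pattern needs at least one character, no empty strings can show up in the result.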
Instead of checking whether each token is != '', you can just look at the first and the last tokens: since the pattern greedily consumes every run of the characters you split on, an empty string can only appear at either end of the list.
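split = re.split(r"[\s\.,!?:;'\"-]+", lowerText)
# `split and` guards against an IndexError: after the first trim
# the list can be empty (for example when lowerText is '')
if split and split[0] == '':
    split = split[1:]
if split and split[-1] == '':
    split = split[:-1]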
Upvotes: 1
Reputation: 4302
That happens because the string ends with . and . is in the split pattern, so when the final . is matched, what remains after it is an empty string, and that is why you see '' .
I suggest using re.findall instead, which works the opposite way (it matches the words rather than the separators), like this:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.findall(r"[a-z\-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary
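The last TODO (aggregating the counts) is still open in the snippet above. One possible way to finish it, as a sketch only: collections.Counter is my choice for the aggregation and is not part of the original snippet, and the short test sentence is made up.

```python
import re
from collections import Counter

def count_words(text):
    """Count how many times each unique word occurs in text."""
    # Same pattern as above: runs of lowercase letters and hyphens.
    words = re.findall(r"[a-z\-]+", text.lower())
    # Counter is a dict subclass mapping each word to its count.
    return Counter(words)

counts = count_words("A man came out of a side room.")
print(counts["a"])  # 2
```

Note that this pattern keeps hyphens inside tokens and ignores digits, so it may need adjusting for other texts.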
Upvotes: 2
Reputation: 452
You could use a list comprehension to iterate over the list items produced by re.split and only keep them if they are not empty strings:
def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split("[\s.,!?:;'\"-]+",lowerText)
    split = [x for x in split if x != '']  # <- list comprehension
    print(split)
You should also consider returning the data from the function, and printing it from the caller rather than printing it from within the function. That will provide you with flexibility in future.
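A rough sketch of what that could look like, with the function returning the counts and the caller doing the printing. The aggregation with dict.get and the short sample sentence are my additions for illustration, not part of the code above.

```python
import re

def count_words(text):
    """Count how many times each unique word occurs in text."""
    split = re.split(r"[\s.,!?:;'\"-]+", text.lower())
    words = [x for x in split if x != '']  # drop empty strings
    counts = {}
    for word in words:
        # dict.get supplies 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts  # return the data instead of printing it here

# The caller decides what to do with the result.
result = count_words("Hopping about upon it like a bird. A bird!")
print(result)
```

Because nothing is printed inside the function, the same function can be reused for sorting, writing to a file, or testing.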
Upvotes: 4