Python split text into tokens using regex

Hi, I have a question about splitting strings into tokens.

Here is an example string:

string = "As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests."

and I'm trying to split this string correctly into its tokens.

Here is my function count_words:

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.split("[\s.,!?:;'\"-]+",lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary

and here is the resulting split list:

['as', 'i', 'was', 'waiting', 'a', 'man', 'came', 'out', 'of', 'a', 'side', 'room', 'and', 'at', 'a', 'glance', 'i', 'was', 'sure', 'he', 'must', 'be', 'long', 'john', 'his', 'left', 'leg', 'was', 'cut', 'off', 'close', 'by', 'the', 'hip', 'and', 'under', 'the', 'left', 'shoulder', 'he', 'carried', 'a', 'crutch', 'which', 'he', 'managed', 'with', 'wonderful', 'dexterity', 'hopping', 'about', 'upon', 'it', 'like', 'a', 'bird', 'he', 'was', 'very', 'tall', 'and', 'strong', 'with', 'a', 'face', 'as', 'big', 'as', 'a', 'ham—plain', 'and', 'pale', 'but', 'intelligent', 'and', 'smiling', 'indeed', 'he', 'seemed', 'in', 'the', 'most', 'cheerful', 'spirits', 'whistling', 'as', 'he', 'moved', 'about', 'among', 'the', 'tables', 'with', 'a', 'merry', 'word', 'or', 'a', 'slap', 'on', 'the', 'shoulder', 'for', 'the', 'more', 'favoured', 'of', 'his', 'guests', '']

As you can see, there is an empty string '' at the last index of the split list.

Please help me understand why this empty string is in the list, and how to split this example string correctly.

Upvotes: 5

Answers (5)

Lima

Reputation: 11

import string

def count_words(text):

    counts = dict() 

    text = text.translate(text.maketrans('', '', string.punctuation))
    text = text.lower()

    words = text.split()
    print(words)

    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

    return counts

It works.
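One thing worth checking with this approach: string.punctuation contains only ASCII punctuation, so the em dash in the question's text is not removed and "ham—plain" stays a single token. The following is a minimal sketch, assuming the em dash should act as a word separator; the names sample, ascii_only and table are illustrative and not part of the answer above.

```python
import string

sample = "a face as big as a ham—plain and pale, but intelligent"

# string.punctuation contains only ASCII punctuation characters,
# so the em dash survives this translation.
ascii_only = sample.translate(str.maketrans("", "", string.punctuation))
print(ascii_only.split())

# Assumption: the em dash should separate words, so map it to a space
# and delete the ASCII punctuation in the same translation table.
table = str.maketrans("—", " ", string.punctuation)
words = sample.translate(table).split()
print(words)
```

Mapping the dash to a space rather than deleting it keeps "ham" and "plain" as two words instead of fusing them into one.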

Upvotes: 0

Ulises Rosas-Puchuri

Reputation: 1970

You get an empty string because the final period also matches the split pattern at the very end of the string, and nothing follows it. You can, however, filter out empty strings with the filter function and thus complete your function:

import re
import collections


def count_words(text):
    """Count how many times each unique word occurs in text."""

    lowerText = text.lower()

    # The em dash is included so that "ham—plain" splits, matching the output below.
    split = re.split(r"[ .,!?:;'\"\-—]+", lowerText)
    # Filter out empty strings and count the words:

    return collections.Counter(filter(None, split))


count_words(text=string)
# Counter({'a': 9, 'he': 6, 'the': 6, 'and': 5, 'as': 4, 'was': 4, 'with': 3, 'his': 2, 'about': 2, 'i': 2, 'of': 2, 'shoulder': 2, 'left': 2, 'dexterity': 1, 'seemed': 1, 'managed': 1, 'among': 1, 'indeed': 1, 'favoured': 1, 'moved': 1, 'it': 1, 'slap': 1, 'cheerful': 1, 'at': 1, 'in': 1, 'close': 1, 'glance': 1, 'face': 1, 'pale': 1, 'smiling': 1, 'out': 1, 'tables': 1, 'cut': 1, 'ham': 1, 'for': 1, 'long': 1, 'intelligent': 1, 'waiting': 1, 'wonderful': 1, 'which': 1, 'under': 1, 'must': 1, 'bird': 1, 'guests': 1, 'more': 1, 'hip': 1, 'be': 1, 'sure': 1, 'leg': 1, 'very': 1, 'big': 1, 'spirits': 1, 'upon': 1, 'but': 1, 'like': 1, 'most': 1, 'carried': 1, 'whistling': 1, 'merry': 1, 'tall': 1, 'word': 1, 'strong': 1, 'by': 1, 'on': 1, 'john': 1, 'off': 1, 'room': 1, 'hopping': 1, 'or': 1, 'crutch': 1, 'man': 1, 'plain': 1, 'side': 1, 'came': 1})

Upvotes: 1

Marco Luzzara

Reputation: 6036

The Python documentation for re.split explains this behavior:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string

Even though yours is not actually a capturing group, the effect is the same. Note that the empty string can appear at the start as well as at the end (for instance, if your string started with whitespace).
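A minimal sketch of that behavior, using short sample strings rather than the full text from the question (the names pattern, trailing and leading are only illustrative):

```python
import re

# Same character class as in the question, written as a raw string.
pattern = r"[\s.,!?:;'\"-]+"

# Separator at the end: the text after the last match is empty.
trailing = re.split(pattern, "long john.")
print(trailing)  # ['long', 'john', '']

# Separator at the start: the text before the first match is empty.
leading = re.split(pattern, " long john")
print(leading)  # ['', 'long', 'john']
```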

The two solutions already proposed (more or less) by others are these:

Solution 1: `findall`

As other users pointed out, you can use findall and invert the logic of the pattern. With yours, you can simply negate the character class: [^\s\.,!?:;'\"-]+.

But this depends on your regex pattern, because inverting it is not always that easy.
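As a rough sketch of the negated class in use, on a short sample sentence (token_pattern, sample and tokens are illustrative names):

```python
import re

# Negated version of the question's character class: match runs of
# characters that are NOT separators, instead of splitting on separators.
token_pattern = r"[^\s.,!?:;'\"-]+"

sample = "He was very tall and strong."
tokens = re.findall(token_pattern, sample.lower())
print(tokens)  # ['he', 'was', 'very', 'tall', 'and', 'strong']
```

Because findall only returns what matched, no empty strings appear. Any character that is not listed in the class, such as the em dash in the question's text, still counts as part of a token.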

Solution 2: check on the starting and ending token

Instead of checking whether each token is != '', you can just look at the first and last tokens. Because the pattern greedily consumes every consecutive separator character, empty strings can only appear at the two ends of the list.

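split = re.split(r"[\s\.,!?:;'\"-]+", lowerText)

# Guard with `split and` so that an empty input does not raise an IndexError.
if split and split[0] == '':
    split = split[1:]

if split and split[-1] == '':
    split = split[:-1]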

Upvotes: 1

Mohammed Elhag

Reputation: 4302

This happens because the string ends with ., which is in the split pattern. When the pattern matches that final ., nothing is left after it, so the last element of the result is the empty string ''.

I suggest using re.findall instead, which works the opposite way by matching the words rather than the separators:

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation
    # (Hint: Use regex to split on non-alphanumeric characters)
    split = re.findall(r"[a-z\-]+", lowerText)
    print(split)
    # TODO: Aggregate word counts using a dictionary

Upvotes: 2

DWD

Reputation: 452

You could use a list comprehension to iterate over the list items produced by re.split and only keep them if they are not empty strings:

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()  # dictionary of { <word>: <count> } pairs to return
    #counts["I"] = 1
    print(text)
    # TODO: Convert to lowercase
    lowerText = text.lower()
    # TODO: Split text into tokens (words), leaving out punctuation 
    # (Hint: Use regex to split on non-alphanumeric characters) 

    split = re.split("[\s.,!?:;'\"-]+",lowerText)
    split = [x for x in split if x != '']  # <- list comprehension
    print(split)

You should also consider returning the data from the function, and printing it from the caller rather than printing it from within the function. That will provide you with flexibility in future.
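A minimal sketch of that suggestion, assuming collections.Counter is acceptable for the aggregation step (the sample sentence is illustrative):

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercase word to its count."""
    tokens = re.split(r"[\s.,!?:;'\"-]+", text.lower())
    # Drop the empty strings re.split leaves at the start or end.
    return Counter(token for token in tokens if token)


# The caller decides what to do with the result; here it prints one count.
counts = count_words("A man came out of a side room.")
print(counts["a"])  # 2
```

With the printing moved to the caller, the same function can be reused in tests or in other code that needs the counts without any console output.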

Upvotes: 4
