UndisclosedCurtain

Reputation: 173

Removing punctuation from my nested and tokenized list

I am trying to remove the punctuation from my nested and tokenized list. I have tried several different approaches to this, but to no avail. My most recent attempt looks like this:

def tokenizeNestedList(listToTokenize):
    flat_list = [item.lower() for sublist in paragraphs_no_guten for item in sublist]
    tokenList = []
    for sentence in flat_list:
        sentence.translate(str.maketrans(",",string.punctuation))
        tokenList.append(nltk.word_tokenize(sentence))
    return tokenList

As you can see, I'm trying to remove the punctuation as I tokenize the list, since the list is being traversed anyway when my function is called. However, with this approach I get the error:

ValueError: the first two maketrans arguments must have equal length

I sort of understand why this happens. Running my code without trying to remove punctuation and printing the first 10 elements gives me this (so you have an idea of what I'm working with):

[[], ['title', ':', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of', 'nations'], ['author', ':', 'adam', 'smith'], ['posting', 'date', ':', 'february', '28', ',', '2009', '[', 'ebook', '#', '3300', ']'], ['release', 'date', ':', 'april', ',', '2002'], ['[', 'last', 'updated', ':', 'june', '5', ',', '2011', ']'], ['language', ':', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]

Any and all advice appreciated.

Upvotes: 1

Views: 1317

Answers (2)

Konstantin Grigorov

Reputation: 1642

For this to work as written, you need to run Python 3.x. Here, b contains the example nested list which you have provided:

import string
# Remove empty lists
b = [x for x in b if x]
# Make flat list
b = [x for sbl in b for x in sbl]
# Define translation
translator = str.maketrans('', '', string.punctuation)
# Apply translation
b = [x.translate(translator) for x in b]
# Remove empty strings
b = list(filter(None, b))

A reference on why it didn't work before: Python 2 maketrans() function doesn't work with Unicode: "the arguments are different lengths" when they actually are
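As a side note on the error in the question: in Python 3 the two-argument form of str.maketrans maps characters one-to-one, so both strings must have the same length, and "," (length 1) against string.punctuation (length 32) raises the ValueError. The three-argument form deletes every character listed in the third argument. The snippet above also flattens the list; if the nesting should be kept, a minimal sketch could look like the following. The helper name strip_punctuation_nested and the shortened sample data are only illustrative, not from the question.

```python
import string

# Three-argument form: the third argument lists characters to delete.
translator = str.maketrans('', '', string.punctuation)


def strip_punctuation_nested(nested):
    """Remove punctuation from every token while keeping the sublists."""
    cleaned = []
    for sublist in nested:
        # translate() returns a new string; it does not modify in place.
        tokens = [tok.translate(translator) for tok in sublist]
        # Drop tokens that became empty (they were pure punctuation).
        cleaned.append([tok for tok in tokens if tok])
    return cleaned


# Shortened sample in the shape of the question's data (illustrative only).
b = [[], ['title', ':', 'an', 'inquiry'],
     ['posting', 'date', ':', 'february', '28', ',', '2009',
      '[', 'ebook', '#', '3300', ']']]

cleaned = strip_punctuation_nested(b)
print(cleaned)
```

Note that translate strips punctuation inside tokens too, so a token such as "don't" would become "dont"; whether that is wanted depends on the text.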

Upvotes: 1

Dani Mesejo

Reputation: 61930

Assuming each punctuation mark is a separate token, you could do something like this:

import string

sentences = [[], ['title', ':', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of',
             'nations'], ['author', ':', 'adam', 'smith'],
             ['posting', 'date', ':', 'february', '28', ',', '2009', '[', 'ebook', '#', '3300', ']'],
             ['release', 'date', ':', 'april', ',', '2002'], ['[', 'last', 'updated', ':', 'june', '5', ',', '2011', ']'],
             ['language', ':', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]


result = [list(filter(lambda x: x not in string.punctuation, sentence)) for sentence in sentences]

print(result)

Output

[[], ['title', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of', 'nations'], ['author', 'adam', 'smith'], ['posting', 'date', 'february', '28', '2009', 'ebook', '3300'], ['release', 'date', 'april', '2002'], ['last', 'updated', 'june', '5', '2011'], ['language', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]

The idea is to use filter to remove the tokens that are punctuation. Since filter returns an iterator in Python 3, wrap it in list to convert it back to a list. You could also use the equivalent list comprehension:

result = [[token for token in sentence if token not in string.punctuation] for sentence in sentences]
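One caveat: `token not in string.punctuation` is a substring test, so it only catches tokens that appear as a contiguous piece of string.punctuation. Multi-character tokens such as '...' or '--' would slip through, and an empty string would be removed. If that matters for your data, a possible variant checks each character against a set instead. This is just a sketch; the names PUNCT, is_punct_token and strip_punct_tokens, and the sample tokens, are made up for illustration.

```python
import string

# Set of individual punctuation characters for exact membership tests.
PUNCT = set(string.punctuation)


def is_punct_token(token):
    """True only for non-empty tokens made up entirely of punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


def strip_punct_tokens(sentences):
    """Drop pure-punctuation tokens, keeping the nested structure."""
    return [[tok for tok in sent if not is_punct_token(tok)]
            for sent in sentences]


# Illustrative sample, not taken from the question's data.
sample = [[], ['title', ':', 'an', 'inquiry'],
          ['wait', '...', 'what', '--', "don't"]]

filtered = strip_punct_tokens(sample)
print(filtered)
```

Tokens that merely contain punctuation, like "don't", are left untouched here.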

Upvotes: 2
