Removing punctuation from my nested and tokenized list

Question

I am trying to remove the punctuation from my nested and tokenized list. I have tried several different approaches to this, but to no avail. My most recent attempt looks like this:

def tokenizeNestedList(listToTokenize):
    flat_list = [item.lower() for sublist in paragraphs_no_guten for item in sublist]
    tokenList = []
    for sentence in flat_list:
        sentence.translate(str.maketrans(",",string.punctuation))
        tokenList.append(nltk.word_tokenize(sentence))
    return tokenList

As you can see, I'm trying to remove the punctuation as I tokenize the list, since the list is being traversed anyway when my function is called. However, with this approach I get the error:

ValueError: the first two maketrans arguments must have equal length

I sort of understand why this happens. Running my code without trying to remove punctuation and printing the first 10 elements gives me this (so you have an idea of what I'm working with):

[[], ['title', ':', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of', 'nations'], ['author', ':', 'adam', 'smith'], ['posting', 'date', ':', 'february', '28', ',', '2009', '[', 'ebook', '#', '3300', ']'], ['release', 'date', ':', 'april', ',', '2002'], ['[', 'last', 'updated', ':', 'june', '5', ',', '2011', ']'], ['language', ':', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]

Any and all advice appreciated.

Konstantin Grigorov · Accepted Answer

For this to work as written, you need to run Python 3.x. Here, b holds the example nested list you provided.

import string
# Remove empty lists
b = [x for x in b if x]
# Make flat list
b = [x for sbl in b for x in sbl]
# Define translation
translator = str.maketrans('', '', string.punctuation)
# Apply translation
b = [x.translate(translator) for x in b]
# Remove empty strings
b = list(filter(None, b))
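
The snippet above flattens everything into a single list. If you would rather keep one sublist per sentence, the same translation table can be applied per sublist. This is only a minimal sketch: the helper name strip_punctuation_nested and the sample data are made up for illustration, and it assumes every token is a plain str.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation_nested(token_lists):
    """Hypothetical helper: return a new nested list without punctuation."""
    cleaned = []
    for tokens in token_lists:
        # str.translate returns a new string, so collect the results.
        stripped = [tok.translate(PUNCT_TABLE) for tok in tokens]
        # Tokens that were pure punctuation are now empty strings; drop them.
        stripped = [tok for tok in stripped if tok]
        if stripped:  # skip sublists that are empty after cleaning
            cleaned.append(stripped)
    return cleaned

# Small sample in the same shape as the output shown in the question.
sample = [
    [],
    ['title', ':', 'an', 'inquiry'],
    ['posting', 'date', ':', 'february', '28', ',', '2009',
     '[', 'ebook', '#', '3300', ']'],
]
result = strip_punctuation_nested(sample)
print(result)
```

Note that this also strips punctuation inside tokens (for example an apostrophe or hyphen), which may or may not be what you want.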

For reference on why it didn't work before, see: Python 2 maketrans() function doesn't work with Unicode: "the arguments are different lengths" when they actually are
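
Separately from that reference, two things in the original call look worth checking on Python 3. The two-argument form of str.maketrans maps characters one-to-one, so both strings must have the same length, and str.translate returns a new string rather than changing the original. A small sketch, with a made-up sample sentence:

```python
import string

# Two-argument form: each character in the first string maps to the
# character at the same position in the second, so lengths must match.
try:
    str.maketrans(",", string.punctuation)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Three-argument form: the third string lists characters to delete.
table = str.maketrans('', '', string.punctuation)

sentence = "posting date: february 28, 2009 [ebook #3300]"  # sample input
cleaned = sentence.translate(table)  # the result must be assigned

print(error_message)
print(cleaned)
```

Because strings are immutable, calling sentence.translate(...) without using the return value leaves sentence untouched, so the result has to be assigned before it is passed to nltk.word_tokenize.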
