Reputation: 173
I am trying to remove the punctuation from my nested and tokenized list. I have tried several different approaches to this, but to no avail. My most recent attempt looks like this:
def tokenizeNestedList(listToTokenize):
    flat_list = [item.lower() for sublist in paragraphs_no_guten for item in sublist]
    tokenList = []
    for sentence in flat_list:
        sentence.translate(str.maketrans(",",string.punctuation))
        tokenList.append(nltk.word_tokenize(sentence))
    return tokenList
As you can see, I'm trying to remove the punctuation as I tokenize the list, since the list is being traversed anyway when my function is called. However, with this approach I get the error:
ValueError: the first two maketrans arguments must have equal length
I sort of understand why this happens. To give you an idea of what I'm working with, running my code without the punctuation removal and printing the first 10 elements gives me this:
[[], ['title', ':', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of', 'nations'], ['author', ':', 'adam', 'smith'], ['posting', 'date', ':', 'february', '28', ',', '2009', '[', 'ebook', '#', '3300', ']'], ['release', 'date', ':', 'april', ',', '2002'], ['[', 'last', 'updated', ':', 'june', '5', ',', '2011', ']'], ['language', ':', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]
Any and all advice appreciated.
Upvotes: 1
Views: 1317
Reputation: 1642
For this to work as written, you need to run Python 3.x. Here, b holds the example nested list that you provided.
import string
# Remove empty lists
b = [x for x in b if x]
# Make flat list
b = [x for sbl in b for x in sbl]
# Define translation
translator = str.maketrans('', '', string.punctuation)
# Apply translation
b = [x.translate(translator) for x in b]
# Remove empty strings
b = list(filter(None, b))
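The snippet above flattens everything into a single list. If the sentence grouping matters, the same translation table can be applied per sublist instead. The following is only a sketch under that assumption; the function name `strip_punctuation_nested` and the `sample` data are made up for illustration and are not part of the original answer.

```python
import string

# Three-argument maketrans: every character in the third argument is deleted.
translator = str.maketrans('', '', string.punctuation)

def strip_punctuation_nested(sentences):
    """Strip punctuation from each token, keeping one sublist per sentence."""
    cleaned = []
    for sentence in sentences:
        # translate returns a new string, so collect the results
        tokens = [tok.translate(translator) for tok in sentence]
        # drop tokens that became empty (they were pure punctuation)
        tokens = [tok for tok in tokens if tok]
        # skip sentences that end up with no tokens at all
        if tokens:
            cleaned.append(tokens)
    return cleaned

sample = [[], ['author', ':', 'adam', 'smith'], ['release', 'date', ':', 'april', ',', '2002']]
print(strip_punctuation_nested(sample))
```

Whether empty sentences should be dropped or kept as empty lists depends on what the later processing expects; removing the `if tokens:` check keeps them.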
A reference on why it didn't work before: Python 2 maketrans() function doesn't work with Unicode: "the arguments are different lengths" when they actually are
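For the exact call in the question, the error can also be reproduced on Python 3: with two arguments, str.maketrans maps characters one-to-one, so both strings must have the same length, and "," (one character) is paired there with all of string.punctuation. The three-argument form deletes the characters instead. Note also that translate returns a new string, so its result has to be assigned. A small sketch to illustrate (the sample string is invented for the example):

```python
import string

# Two-argument form: character-for-character mapping, lengths must match.
try:
    str.maketrans(",", string.punctuation)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Three-argument form: characters in the third argument are mapped to None (deleted).
translator = str.maketrans("", "", string.punctuation)

# Strings are immutable, so keep the value that translate returns.
cleaned = "posting date: february 28, 2009 [ebook #3300]".translate(translator)

print(error_message)
print(cleaned)
```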
Upvotes: 1
Reputation: 61930
Assuming each punctuation mark is a separate token, you could do something like this:
import string
sentences = [[], ['title', ':', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of',
'nations'], ['author', ':', 'adam', 'smith'],
['posting', 'date', ':', 'february', '28', ',', '2009', '[', 'ebook', '#', '3300', ']'],
['release', 'date', ':', 'april', ',', '2002'], ['[', 'last', 'updated', ':', 'june', '5', ',', '2011', ']'],
['language', ':', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]
result = [list(filter(lambda x: x not in string.punctuation, sentence)) for sentence in sentences]
print(result)
Output
[[], ['title', 'an', 'inquiry', 'into', 'the', 'nature', 'and', 'causes', 'of', 'the', 'wealth', 'of', 'nations'], ['author', 'adam', 'smith'], ['posting', 'date', 'february', '28', '2009', 'ebook', '3300'], ['release', 'date', 'april', '2002'], ['last', 'updated', 'june', '5', '2011'], ['language', 'english'], [], [], ['produced', 'by', 'colin', 'muir']]
The idea is to use filter to remove the tokens that are punctuation. Because filter returns an iterator, wrap it in list to convert it back to a list. You could also use the equivalent list comprehension:
result = [[token for token in sentence if token not in string.punctuation] for sentence in sentences]
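One caveat: `token not in string.punctuation` is a substring test against the punctuation string, so it works for single-character tokens but keeps multi-character ones such as '--' or '...'. If the tokenizer can produce those, a per-character check is one possible alternative. This is a sketch only; the helper names `is_punctuation` and `drop_punctuation_tokens` and the `sample` data are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCTUATION for ch in token)

def drop_punctuation_tokens(sentences):
    """Remove punctuation-only tokens while keeping the nested list structure."""
    return [[tok for tok in sentence if not is_punctuation(tok)] for sentence in sentences]

sample = [
    [],
    ['posting', 'date', ':', 'february', '28', ',', '2009', '[', 'ebook', '#', '3300', ']'],
    ['well', '--', 'known', '...'],
]
result = drop_punctuation_tokens(sample)
print(result)
```

Tokens that merely contain punctuation (for example "don't") are left untouched by this check, which matches the behavior of the filter-based version.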
Upvotes: 2