Reputation: 11
I'm new to Python and I don't know how to lemmatize an array. What I do is:
from nltk.stem import WordNetLemmatizer

lmtzr = WordNetLemmatizer()

data = 'data/new 1.txt'
file_tagged = open(data)
verses_tagged = file_tagged.readlines()
num_lines = sum(1 for line in open(data))

i = 0
dataPair = []
tokenP1 = []
tokenP2 = []

def tokenPhrasebase(verse):
    return verse.split('}')

for i in range(0, num_lines):
    dataPair.append(verses_tagged[i].split('\t'))
    tokenP1.append(tokenPhrasebase(dataPair[i][0]))
    tokenP2.append(tokenPhrasebase(dataPair[i][1]))
    for j in range(len(tokenP1[i])):
        tokenP1[i][j] = tagRemoval(tokenP1[i][j])
    for j in range(len(tokenP2[i])):
        tokenP2[i][j] = tagRemoval(tokenP2[i][j])

for y in range(0, num_lines):
    tokenP1[y] = lmtzr.lemmatize(tokenP1[y])
    tokenP2[y] = lmtzr.lemmatize(tokenP2[y])
What I want to do is lemmatize every string inside the array without changing the array structure, but I get an error like this: TypeError: unhashable type: 'list'
Can anyone help?
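For reference, lemmatize() expects a single string, so passing it a whole list reproduces the same error. A minimal sketch of the problem, assuming a standard NLTK install with the WordNet data downloaded:

from nltk.stem import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
print(lmtzr.lemmatize('cats'))   # works on a single string -> 'cat'
lmtzr.lemmatize(['cats'])        # raises TypeError: unhashable type: 'list'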
Upvotes: 1
Views: 1678
Reputation: 3577
You are probably passing an array of arrays to the lemmatizer instead of an array of tokens. You can see a working example below:
import pandas as pd
import nltk as nl
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# reading the data in lower case
dbFilepandas = pd.read_csv('yourfilename.csv').apply(lambda x: x.astype(str).str.lower())

# declaring the lemmatizer to use
lemmatizer = WordNetLemmatizer()

# getting the data as an array of strings
train = []
# I am only using the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    str1 = ' '.join(sentences)  # join the column values with a space so the words stay separated
    train.append(str1)

# removing punctuation
tokenizer = RegexpTokenizer(r'\w+')
# get the tokens
tokens_to_lematize = [tokenizer.tokenize(sentences) for sentences in train]

filtered_tokens_Array = []
for item in tokens_to_lematize:
    words = [lemmatizer.lemmatize(word) for word in item]
    # removing stopwords
    filtered_words = [word for word in words if word not in nl.corpus.stopwords.words('english')]
    filtered_tokens_Array.append(filtered_words)
Now, if you print filtered_tokens_Array you have the "lemmatized" tokens:
print("\nfiltered:\n",filtered_tokens_Array)
[example output]:
filtered: [['men', 'shirt', 'running', 'color', 'yellow', 'size', 'springfield'], ['men', 'shirt', 'passed', 'color', 'red', 'size', 'springfield'], ['woman', 'shirt', 'color', 'green', 'size', 'springfield'], ['men', 'shirt', 'color', 'red', 'size', 'l', 'springfield']]
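If you want to keep the nested structure from your question (a list of verses, each one a list of token strings), lemmatize each inner string instead of the inner list itself. A minimal sketch, assuming tokenP1 holds lists of plain word strings (the sample words here are just placeholders):

from nltk.stem import WordNetLemmatizer

lmtzr = WordNetLemmatizer()

# hypothetical stand-in for the question's tokenP1 structure
tokenP1 = [['cats', 'verses'], ['dogs', 'running']]

# lemmatize every string while keeping the list-of-lists shape
tokenP1 = [[lmtzr.lemmatize(word) for word in verse] for verse in tokenP1]
print(tokenP1)  # [['cat', 'verse'], ['dog', 'running']]

This way the outer array formation stays exactly the same; only the individual strings are replaced by their lemmas.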
Upvotes: 1