Reza Rifaldi

Reputation: 11

Lemmatization inside an array using NLTK in Python

I'm new to Python and I don't know how to lemmatize the strings inside an array. What I do is:

from nltk.stem import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
data = 'data/new 1.txt'
file_tagged = open(data)
verses_tagged = file_tagged.readlines()
num_lines = sum(1 for line in open(data))
i = 0
dataPair = []
tokenP1 = []
tokenP2 = []

def tokenPhrasebase(verse):
    return verse.split('}')

for i in range(0, num_lines):
    dataPair.append(verses_tagged[i].split('\t'))

    tokenP1.append(tokenPhrasebase(dataPair[i][0]))
    tokenP2.append(tokenPhrasebase(dataPair[i][1]))

    for j in range(len(tokenP1[i])):
        tokenP1[i][j] = tagRemoval(tokenP1[i][j])
    for j in range(len(tokenP2[i])):
        tokenP2[i][j] = tagRemoval(tokenP2[i][j])

    for y in range(0, num_lines):
        tokenP1[y] = lmtzr.lemmatize(tokenP1[y])
        tokenP2[y] = lmtzr.lemmatize(tokenP2[y])

What I want to do is lemmatize every string inside the array without changing the array structure, but I get this error: TypeError: unhashable type: 'list'. Can anyone help?

Upvotes: 1

Views: 1678

Answers (1)

Joel Carneiro

Reputation: 3577

You are probably passing a list of lists to the lemmatizer instead of individual token strings; lemmatize() expects a single string, which is why it fails on a list. You can see a working example below:

import pandas as pd
import nltk as nl
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# reading the data in lower case
dbFilepandas = pd.read_csv('yourfilename.csv').apply(lambda x: x.astype(str).str.lower())

# declaring the lemmatizer to use
lemmatizer = WordNetLemmatizer()

# Getting the data as an array of strings
train = []
# I am only using the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    str1 = ' '.join(sentences)
    train.append(str1)
# removing punctuation
tokenizer = RegexpTokenizer(r'\w+')
# get the tokens for each sentence
tokens_to_lemmatize = [tokenizer.tokenize(sentence) for sentence in train]

filtered_tokens_Array = []
for item in tokens_to_lemmatize:
    words = [lemmatizer.lemmatize(word) for word in item]
    # removing stopwords
    filtered_words = [word for word in words if word not in nl.corpus.stopwords.words('english')]
    filtered_tokens_Array.append(filtered_words)

Now, if you print filtered_tokens_Array, you have the lemmatized tokens:

 print("\nfiltered:\n",filtered_tokens_Array)

Example output:

filtered: [['men', 'shirt', 'running', 'color', 'yellow', 'size', 'springfield'], ['men', 'shirt', 'passed', 'color', 'red', 'size', 'springfield'], ['woman', 'shirt', 'color', 'green', 'size', 'springfield'], ['men', 'shirt', 'color', 'red', 'size', 'l', 'springfield']]
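If you want to keep the nested-list structure from the question instead of going through pandas, the same idea applies directly: call lemmatize() on each string, never on an inner list. A minimal sketch, using made-up data standing in for the asker's file:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# made-up nested list standing in for tokenP1/tokenP2 from the question:
# each inner list holds the token strings from one line of the file
tokens = [['dogs', 'are', 'running'], ['churches', 'feet']]

# lemmatize() expects a single string, so apply it element by element;
# handing it a whole inner list is what raises "unhashable type: 'list'"
lemmatized = [[lemmatizer.lemmatize(word) for word in inner] for inner in tokens]

print(lemmatized)
# roughly: [['dog', 'are', 'running'], ['church', 'foot']]

Because the comprehension rebuilds each inner list element by element, the array formation stays exactly the same. Note that lemmatize() treats words as nouns by default, so 'running' is left as is unless you pass pos='v'.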

Upvotes: 1
