random student

Reputation: 775

ValueError: Length of values does not match length of index in nested loop

I'm trying to remove the stopwords in each row of my column. Since I already word-tokenized each row with NLTK, every row of the column now contains a list of tokens. I'm trying to remove the stopwords with this nested list comprehension, but it raises ValueError: Length of values does not match length of index. How do I fix this?

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                    encoding = "latin-1")

data = data[['v1','v2']]

data = data.rename(columns = {'v1': 'label', 'v2': 'text'})

stopwords = set(stopwords.words('english'))

data['text'] = data['text'].str.lower()
data['new'] = [word_tokenize(row) for row in data['text']]
data['new'] = [word for new in data['new'] for word in new if word not in stopwords]

My text data

data['text'].head(5)
Out[92]: 
0    go until jurong point, crazy.. available only ...
1                        ok lar... joking wif u oni...
2    free entry in 2 a wkly comp to win fa cup fina...
3    u dun say so early hor... u c already then say...
4    nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object

After I word-tokenized it with NLTK

data['new'].head(5)
Out[89]: 
0    [go, until, jurong, point, ,, crazy.., availab...
1             [ok, lar, ..., joking, wif, u, oni, ...]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, ..., u, c, alrea...
4    [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: new, dtype: object

The Traceback

runfile('D:/python projects/NLP_nltk_first.py', wdir='D:/python projects')
Traceback (most recent call last):

  File "D:\python projects\NLP_nltk_first.py", line 36, in <module>
    data['new'] = [new for new in data['new'] for word in new if word not in stopwords]

  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3487, in __setitem__
    self._set_item(key, value)

  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3564, in _set_item
    value = self._sanitize_column(key, value)

  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3749, in _sanitize_column
    value = sanitize_index(value, self.index, copy=False)

  File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 612, in sanitize_index
    raise ValueError("Length of values does not match length of index")

ValueError: Length of values does not match length of index

Upvotes: 2

Views: 7903

Answers (1)

shadowtalker

Reputation: 13823

Read the error message carefully:

ValueError: Length of values does not match length of index

The "values" in this case is the stuff on the right of the =:

values = [word for new in data['new'] for word in new if word not in stopwords]

The "index" in this case is the row index of the DataFrame:

index = data.index

The index here always has the same number of rows as the DataFrame itself.
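To see this rule in isolation, here is a minimal sketch with a made-up two-row frame (not your data):

import pandas as pd

df = pd.DataFrame({'text': ['hello there', 'bye']})

df['ok'] = [1, 2]      # fine: 2 values for 2 rows
df['bad'] = [1, 2, 3]  # raises ValueError: Length of values does not match length of index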

The problem is that values is too long for the index -- i.e. it is too long for the DataFrame. If you inspect your code this should be immediately obvious. If you still don't see the problem, try this:

data['text_tokenized'] = [word_tokenize(row) for row in data['text']]

values = [word for new in data['text_tokenized'] for word in new if word not in stopwords]

print('N rows:', data.shape[0])
print('N new values:', len(values))

As for how to fix the problem -- it depends entirely on what you're trying to achieve. One option is to "explode" the data (also note the use of .map instead of a list comprehension):

data['text_tokenized'] = data['text'].map(word_tokenize)

# Flatten the token lists without a nested list comprehension
tokens_flat = data['text_tokenized'].explode()

# Join your labels w/ your flattened tokens, if desired
data_flat = data[['label']].join(tokens_flat)

# Add a 2nd index level to track token appearance order,
# which might make your life easier
data_flat['token_id'] = data_flat.groupby(level=0).cumcount()
data_flat = data_flat.set_index('token_id', append=True)
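Once the tokens sit one per row like this, removing stopwords no longer needs a nested comprehension at all -- it becomes a plain boolean mask. As a sketch continuing the code above (it assumes the stopwords set from your question is still in scope):

# Drop stopword rows with a boolean mask
data_flat = data_flat[~data_flat['text_tokenized'].isin(stopwords)]

# If you want one cleaned token list per message again,
# group the surviving tokens back by the original row index
cleaned = data_flat.groupby(level=0)['text_tokenized'].agg(list)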

As an unrelated tip, you can make your CSV processing more efficient by only loading the columns you need, as follows:

data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
                    encoding="latin-1",
                    usecols=["v1", "v2"])
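If you like, the rename can be chained onto the same call, so the raw v1/v2 names never appear downstream (a small stylistic sketch, same columns as above):

data = (pd.read_csv(r"D:/python projects/read_files/spam.csv",
                    encoding="latin-1",
                    usecols=["v1", "v2"])
          .rename(columns={"v1": "label", "v2": "text"}))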

Upvotes: 3
