MLNLPEnhusiast

Reputation: 153

Cumulative Unique words in huge dataframe

How do I get the cumulative unique words from a dataframe column that has more than 500 words per row? The dataframe has ~300,000 rows.

I read the CSV file into a dataframe, with Column A holding the text data. I have tried creating a couple of columns (B & C) by looping through Column A: I take the unique words from each row of Column A as a set, store them in Column B, and store their count in Column C.

For each subsequent row, I take the union of the current row's set of words from Column A with the previous row's set in Column B.

This works for a small number of rows, but once the number of rows exceeds 10,000, performance degrades and the kernel eventually dies.

Is there a better way of doing this for a huge dataframe?

I tried creating a separate dataframe with just the unique words and the count, but I still have the issue.

Sample code:

for index, row in DF.iterrows():
    # Unique words in the current row
    result = set(row['Column A'].lower().split())
    if index == 0:
        DF.at[index, 'Column B'] = result
    else:
        # Union with the cumulative set from the previous row
        DF.at[index, 'Column B'] = result.union(DF.loc[index - 1, 'Column B'])
DF['Column C'] = DF['Column B'].apply(len)

Upvotes: 0

Views: 674

Answers (2)

Lante Dellarovere

Reputation: 1858

Exploit the uniqueness of dictionary keys to accumulate words.

I create a dict cumulative_words that stores the unique words row by row: for each row's sentence, I update it with a dictionary whose keys are that row's unique words.

Code:

cumulative_words = {}

def cumulate(x):
    # dict keys are unique, so updating with this row's words
    # silently skips any word that has been seen before
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
    return list(cumulative_words.keys())

df["Column B"] = df["Column A"].apply(cumulate)
df["Column C"] = df["Column B"].apply(len)

Update:

Given that you said this code still has memory issues at ~200k rows, I'll try something very simple to understand a bit more:

  1. Just update the dictionary of cumulative words

Create the dictionary of unique words before any dataframe operations:

cumulative_words = {}

for x in df["Column A"].values:
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))

If this still breaks, I think we have to change the method.

  2. Append the words to a list

I think this is the critical point, because it creates a list of lists containing, roughly, billions of words:

cumulative_words = {}
cumulative_column = []

for x in df["Column A"].values:
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
    # copy the keys into a list; appending the live keys() view would
    # make every entry reflect the final state of the dictionary
    cumulative_column.append(list(cumulative_words.keys()))

  3. Assign the created list to Column B and count

df["Column B"] = cumulative_column
df["Column C"] = df["Column B"].apply(len)

Maybe there are just too many words to store and the dataframe simply can't be created, or maybe I don't know how to do it. Let me know.

Upvotes: 0

Brice

Reputation: 372

You could use CountVectorizer and compute the cumulative sum afterwards.

Learn more about CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html and about pandas' cumulative sum: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html
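
For reference, a minimal sketch of that approach, assuming the column names from the question (the dense cumsum is fine on a toy frame, but at ~300k rows you would want to keep the matrix sparse or process it in chunks):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"Column A": ["The cat sat", "the dog ran", "a cat ran fast"]})

# token_pattern=r"\S+" mimics the str.split() tokenization from the question;
# the default pattern would silently drop one-letter words like "a"
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"\S+")
counts = vectorizer.fit_transform(df["Column A"])  # sparse document-term matrix

# Cumulative per-word counts, row by row
cumulative = pd.DataFrame(
    counts.toarray(), columns=vectorizer.get_feature_names_out()
).cumsum()

# A word has been seen once its cumulative count is > 0, so counting the
# positive entries per row gives the cumulative number of unique words
df["Column C"] = (cumulative > 0).sum(axis=1)
print(df["Column C"].tolist())  # [3, 5, 7]

Note that get_feature_names_out requires scikit-learn >= 1.0; on older versions use get_feature_names instead.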

Upvotes: 0
