MLNLPEnhusiast

Reputation: 153

Cumulative Unique words in huge dataframe

How do I get the cumulative unique words from a dataframe column that has more than 500 words per row? The dataframe has ~300,000 rows.

I read the CSV file into a dataframe, with Column A holding the text data. I have tried creating a couple of columns (B & C) by looping through Column A: I take the unique words from each row of Column A as a set, store them in Column B, and store their count in Column C.

For each subsequent row, I take the union of the current row's set of words from Column A with the previous row's set in Column B.

This works for a small number of rows, but once the number of rows exceeds 10,000, performance degrades and the kernel eventually dies.

Is there a better way of doing this for a huge dataframe?

I tried creating a separate dataframe with just the unique words and the count, but I still have the issue.

Sample code:

for index, row in DF.iterrows():
    # Unique words in the current row
    result = set(row['Column A'].lower().split())
    if index == 0:
        DF.at[index, 'Column B'] = result
    else:
        # Union with the cumulative set from the previous row
        DF.at[index, 'Column B'] = result.union(DF.loc[index - 1, 'Column B'])
DF['Column C'] = DF['Column B'].apply(len)

Upvotes: 0

Views: 674

Answers (2)

Lante Dellarovere

Reputation: 1858

Exploit the uniqueness of dictionary keys to accumulate words.

I create a dict cumulative_words that stores the unique words row by row: for each row's sentence, I update it with a dictionary whose keys are that row's unique words.

Code:

cumulative_words = {}

def cumulate(x):
    # dict keys are unique, so updating with this row's words
    # silently skips any word that has been seen before
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
    return list(cumulative_words.keys())

df["Column B"] = df["Column A"].apply(cumulate)
df["Column C"] = df["Column B"].apply(len)

Update:

Given that you said this code still has memory issues at ~200k rows, I'll try something very simple to understand a bit more:

  1. Just update the dictionary of cumulative words

Create the dictionary of unique words before any dataframe operations:

cumulative_words = {}

for x in df["Column A"].values:
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))

If this still breaks, I think we have to change the method.

  2. Append the words to a list

I think this is the critical point, because it creates a list of lists containing, roughly, billions of words:

cumulative_words = {}
cumulative_column = []

for x in df["Column A"].values:
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
    # copy the keys into a list; appending the live keys() view would
    # make every entry reflect the final state of the dictionary
    cumulative_column.append(list(cumulative_words.keys()))

  3. Assign the created list to Column B and count

df["Column B"] = cumulative_column
df["Column C"] = df["Column B"].apply(len)

Maybe there are just too many words to store and the dataframe simply can't be created, or maybe I don't know how to do it. Let me know.

Upvotes: 0

Brice

Reputation: 372

You could use CountVectorizer and compute the cumulative sum afterwards.

Learn more about CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html and about pandas' cumulative sum: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html
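
For reference, a minimal sketch of that approach, assuming the column names from the question (the dense cumsum is fine on a toy frame, but at ~300k rows you would want to keep the matrix sparse or process it in chunks):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"Column A": ["The cat sat", "the dog ran", "a cat ran fast"]})

# token_pattern=r"\S+" mimics the str.split() tokenization from the question;
# the default pattern would silently drop one-letter words like "a"
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"\S+")
counts = vectorizer.fit_transform(df["Column A"])  # sparse document-term matrix

# Cumulative per-word counts, row by row
cumulative = pd.DataFrame(
    counts.toarray(), columns=vectorizer.get_feature_names_out()
).cumsum()

# A word has been seen once its cumulative count is > 0, so counting the
# positive entries per row gives the cumulative number of unique words
df["Column C"] = (cumulative > 0).sum(axis=1)
print(df["Column C"].tolist())  # [3, 5, 7]

Note that get_feature_names_out requires scikit-learn >= 1.0; on older versions use get_feature_names instead.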

Upvotes: 0
