Reputation: 153
How do I get cumulative unique words from a dataframe column that has more than 500 words per row? The dataframe has ~300,000 rows
I read the CSV file into a dataframe with column A holding text data. I tried creating a couple of columns (B and C) by looping through column A, taking the unique words of column A as a set, and filling column B with the unique words and column C with their count.
Subsequently, I take the unique words as the union of the current row's column A words and the previous row's column B (a set).
This works for a small number of rows, but once the number of rows exceeds 10,000, performance degrades and the kernel eventually dies.
Is there a better way of doing this for a huge dataframe?
I tried creating a separate dataframe with just the unique words and count, but I still have the issue.
Sample code:
for index, row in DF.iterrows():
    # Unique words in this row's text
    result = set(row['Column A'].lower().split())
    if index == 0:
        DF.at[index, 'Column B'] = result
    else:
        # Union with the cumulative set carried in the previous row
        DF.at[index, 'Column B'] = result.union(DF.loc[index - 1, 'Column B'])
DF['Column C'] = DF['Column B'].apply(len)
Upvotes: 0
Views: 674
Reputation: 1858
Exploit the uniqueness of dictionary keys to accumulate words.
I create a dict cumulative_words
where I store the unique words row by row, updating it with dictionaries whose keys are the unique words of a given row's sentence.
Code:
cumulative_words = {}

def cumulate(x):
    # Add this row's unique words as keys; duplicates are absorbed for free
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
    return list(cumulative_words.keys())

df["Column B"] = df["Column A"].apply(cumulate)
df["Column C"] = df["Column B"].apply(len)
Given that you said this code still has memory issues at ~200k rows, I'll try something very simple to understand a bit more:
Create the dictionary with unique words before dataframe operations
cumulative_words = {}
for x in df["Column A"].values:
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
If this still breaks, I think we have to change method.
The next step is the critical point, I think, because it creates a list of lists holding, roughly, billions of words:
cumulative_words = {}
cumulative_column = []
for x in df["Column A"].values:
    cumulative_words.update(dict.fromkeys(set(x.lower().split())))
    # Copy the keys into a list; appending the live keys() view would make
    # every row reference the same, final dictionary
    cumulative_column.append(list(cumulative_words.keys()))
df["Column B"] = cumulative_column
df["Column C"] = df["Column B"].apply(len)
Maybe there are just too many words to store and the dataframe can't be created, or I don't know how to do it. Let me know.
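One possible fallback (my addition, not part of the answer above): if only the cumulative count in Column C is actually needed, skip building Column B entirely and keep just a running set and a list of counts, which avoids storing the huge list of lists:

seen = set()
counts = []
for text in df["Column A"].values:
    seen.update(text.lower().split())  # fold this row's words into the running set
    counts.append(len(seen))           # cumulative number of unique words so far
df["Column C"] = counts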
Upvotes: 0
Reputation: 372
You could use CountVectorizer and compute the cumulative sum afterwards.
Learn more about CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html and about pandas' cumulative sum: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.cumsum.html
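A minimal sketch of that idea (my reading of this answer, with illustrative toy data; assumes scikit-learn >= 1.0 for get_feature_names_out). Note that CountVectorizer's default tokenizer drops single-character tokens and punctuation, unlike str.split(), and that densifying the matrix is fine for a demo but may not fit in memory for ~300k rows:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"Column A": ["the quick brown fox",
                                "the quick dog",
                                "my lazy dog"]})

# Rows x vocabulary sparse matrix of word counts (lowercased by default)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(df["Column A"])

# A word has been seen once its running count is non-zero, so counting the
# non-zero columns of the row-wise cumulative sum gives the cumulative
# number of unique words per row.
dense = pd.DataFrame(counts.toarray(),
                     columns=vectorizer.get_feature_names_out())
df["Column C"] = (dense.cumsum() > 0).sum(axis=1)  # e.g. 4, 5, 7 here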
Upvotes: 0