Reputation: 165
With Python and Pandas, I'm trying to write a script that takes the data from the text column, evaluates that text with the textstat module, and then writes the results back into the csv under the word_count column.
Here is the structure of the csv:
   user_id         text text_number  word_count
0       10  test text A      text_0         NaN
1       11          NaN         NaN         NaN
2       12          NaN         NaN         NaN
3       13          NaN         NaN         NaN
4       14          NaN         NaN         NaN
5       15  test text B      text_1         NaN
Here is my attempt to loop over the text column and pass each value to textstat:
df = pd.read_csv("texts.csv").fillna('')
text_data = df["text"]
length1 = len(text_data)
for x in range(length1):
    (text_data[x])
    # this is the textstat word count operation
    word_count = textstat.lexicon_count(text_data, removepunct=True)
    output_df = pd.DataFrame({"word_count": [word_count]})
    output_df.to_csv('texts.csv', mode="a", header=False, index=False)
However, I receive this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Any suggestions on how to proceed? All assistance appreciated.
Upvotes: 0
Views: 2660
Reputation: 35646
The more idiomatic pandas approach would be to use fillna + apply, then write the resulting Series directly out with to_csv:
(
    df["text"].fillna('')  # Replace NaN with empty string
    .apply(textstat.lexicon_count,
           removepunct=True)  # Call lexicon_count on each value
    .rename('word_count')  # Rename Series
    .to_csv('texts.csv', mode="a", index=False)  # Write to csv
)
texts.csv:
word_count
1
0
0
0
0
1
To add a column to the existing DataFrame/csv instead of appending to the end of the file, you can also do:
df['word_count'] = (
    df["text"].fillna('')  # Replace NaN with empty string
    .apply(textstat.lexicon_count,
           removepunct=True)  # Call lexicon_count on each value
)
df.to_csv('texts.csv', index=False)  # Write to csv
texts.csv:
user_id,text,text_number,word_count
text,A,text_0,1
,,,0
,,,0
,,,0
,,,0
text,B,text_1,1
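For reference, here is a minimal end-to-end sketch of the same column-assignment approach using the values from the question's table. Two things in it are assumptions for illustration: count_words is a hypothetical stand-in for textstat.lexicon_count(text, removepunct=True) so the snippet runs without textstat installed (its counts may differ from textstat's on unusual input), and the file is written to a temporary directory rather than the real texts.csv.

```python
import string
import tempfile
from pathlib import Path

import pandas as pd


def count_words(text: str) -> int:
    """Hypothetical stand-in for textstat.lexicon_count(text, removepunct=True)."""
    # Drop ASCII punctuation, then count whitespace-separated tokens.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return len(cleaned.split())


# Rebuild the question's table (values taken from the question).
source = pd.DataFrame({
    "user_id": [10, 11, 12, 13, 14, 15],
    "text": ["test text A", None, None, None, None, "test text B"],
    "text_number": ["text_0", None, None, None, None, "text_1"],
    "word_count": [None] * 6,
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "texts.csv"  # temporary path, not the real file
    source.to_csv(csv_path, index=False)

    df = pd.read_csv(csv_path)
    # Fill only the text column so the other columns keep their NaN values.
    df["word_count"] = df["text"].fillna("").apply(count_words)
    df.to_csv(csv_path, index=False)  # overwrite in place, no append mode

    result = pd.read_csv(csv_path)

print(result["word_count"].tolist())
```

Filling NaN on df["text"] alone, rather than on the whole DataFrame as in the question, keeps the empty cells in the other columns as NaN when the file is written back.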
To fix the current implementation, also use fillna and conditionally write the header only on the first iteration:
text_data = df["text"].fillna('')
for i, x in enumerate(text_data):
    # this is the textstat word count operation
    word_count = textstat.lexicon_count(x, removepunct=True)
    output_df = pd.DataFrame({"word_count": [word_count]})
    output_df.to_csv('texts.csv', mode="a", header=(i == 0), index=False)
texts.csv:
word_count
1
0
0
0
0
1
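As for why the original loop raised the TypeError: it passed the whole Series (text_data) to lexicon_count instead of a single string. My understanding is that textstat caches the results of its methods, which requires hashing the argument, and a pandas Series refuses to be hashed. That detail about textstat's internals is an assumption, not something confirmed in this thread. The sketch below reproduces the same failure mode with a hypothetical cached function, cached_word_count, built on functools.lru_cache.

```python
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=128)
def cached_word_count(text):
    """Hypothetical cached function, standing in for a textstat call."""
    return len(text.split())


text_data = pd.Series(["test text A", "", "test text B"])

# Passing the whole Series: the cache tries to hash its argument and fails.
try:
    cached_word_count(text_data)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing one string at a time works, because str is hashable.
counts = [cached_word_count(value) for value in text_data]

print(error_message)
print(counts)
```

This is why the fixed loop passes x (one value) rather than text_data (the whole column) on each iteration.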
DataFrame and imports:
import pandas as pd
import textstat
from numpy import nan
df = pd.DataFrame({
    'user_id': ['text', nan, nan, nan, nan, 'text'],
    'text': ['A', nan, nan, nan, nan, 'B'],
    'text_number': ['text_0', nan, nan, nan, nan, 'text_1'],
    'word_count': [nan, nan, nan, nan, nan, nan]
})
Upvotes: 3