Reputation: 41
I'm trying to use nltk and pandas to find the 100 most common words in one CSV file and write them to a new CSV. I am able to plot the words, but when I write to CSV I get
word | count
52 | 7 <- This is current CSV output
Not sure where I am going wrong, looking for some guidance.
My code is
words = []
with open('SECParse2.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    next(reader)
    freq_all = nltk.FreqDist()
    for row in reader:
        note = row[1]
        tokens = [t for t in note.split()]
        freq = nltk.FreqDist(tokens)
        fd_t100 = freq.most_common(100)
        freq_all.update(tokens)
freq_all.plot(100, cumulative=False)
df3 = pd.DataFrame(freq_all, columns=['word', 'count'], index=[1])
df3.to_csv("./SECParse3.csv", sep=',', index=False)
I'm guessing the problem is my df3 line, but for the life of me I can't get the CSV to show the correct distribution.
I have also tried
df3 = pd.DataFrame(fd_t100,columns=['word','count'])
Some sample content from the source CSV:
filename text
AAL_0000004515_10Q_20200331 generally industry may affected
AAL_0000004515_10Q_20200331 material decrease demand international air travel
AAPL_0000320193_10Q_2020032 february following initial outbreak virus china
AAP_0001158449_10Q_20200418 restructuring cost cost primarily relating early
Upvotes: 1
Views: 827
Reputation: 9711
Here you go. The code is quite compressed, so feel free to expand if you like.
First, ensure the source file is actually a CSV file (i.e. comma separated). I copied/pasted the sample text from the question into a text file and added commas (as shown below).
Breaking the code down line by line:
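Read the CSV file into a DataFrame.
Extract the text column, flatten it into a single string of words, and tokenise.
Count the words and take the 100 most common.
Write the result to a new CSV file.
import pandas as pd
from nltk import FreqDist, word_tokenize
# word_tokenize requires NLTK's Punkt tokenizer data to be downloaded first
df = pd.read_csv('./SECParse3.csv')
words = word_tokenize(' '.join([line for line in df['text'].to_numpy()]))
common = FreqDist(words).most_common(100)
pd.DataFrame(common, columns=['word', 'count']).to_csv('words_out.csv', index=False)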
Source file (./SECParse3.csv):
filename,text
AAL_0000004515_10Q_20200331,generally industry may affected
AAL_0000004515_10Q_20200331,material decrease demand international air travel
AAPL_0000320193_10Q_2020032,february following initial outbreak virus china
AAP_0001158449_10Q_20200418,restructuring cost cost primarily relating early
Output (words_out.csv):
word,count
cost,2
generally,1
industry,1
may,1
affected,1
material,1
decrease,1
...
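If you would rather keep the row-by-row loop from the question, the same idea can be sketched with only the standard library. This is a rough illustration, not the only way to do it: it swaps in collections.Counter for nltk.FreqDist (FreqDist is a Counter subclass, so update() and most_common() behave the same way) and str.split() for word_tokenize, and the function names top_words and write_counts are made up for this example. The key point is that one counter is shared by every row, and the list of (word, count) pairs from most_common() is what gets written out. With pandas, that same list can be passed to pd.DataFrame(pairs, columns=['word', 'count']) as in the code above.

```python
import csv
import io
from collections import Counter


def top_words(csv_text, n=100):
    """Return the n most common whitespace-separated words in the second column."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    counts = Counter()  # one counter shared by every row
    for row in reader:
        counts.update(row[1].split())
    return counts.most_common(n)  # list of (word, count) pairs


def write_counts(pairs, out):
    """Write (word, count) pairs to an open text file as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(pairs)


# Two rows taken from the sample data in the question (assumed comma separated).
sample = (
    "filename,text\n"
    "AAL_0000004515_10Q_20200331,generally industry may affected\n"
    "AAP_0001158449_10Q_20200418,restructuring cost cost primarily relating early\n"
)
pairs = top_words(sample)
buffer = io.StringIO()
write_counts(pairs, buffer)
print(buffer.getvalue())
```

To run it against a real file, read the file's contents into a string and pass that to top_words, then open the output file with newline='' and pass it to write_counts.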
Upvotes: 1