Reputation: 899
I'm trying to read a large CSV file with pandas that won't fit in memory and build a word frequency table from it. My code works when the whole file fits in memory, but when I define a chunk size, each chunk is handled separately, regardless of what is in the previous chunks: it doesn't check the previous chunk to see whether a word is already there and increase its frequency, or append the word to the end if it isn't there. The code I'm trying is:
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic_tmp.append(dic)
dic.to_csv('nenene.csv', index=False, header=None)
For testing purposes I set the chunksize to one for a small CSV file, which looks like:
The output I'm getting is:
while what I'm trying to get is something like this:
Am I doing something wrong in the code? Any advice, please?
Upvotes: 0
Views: 1553
Reputation: 4215
You can simply groupby your created df:
Input:
word freq
0 fly 3
1 Alex 1
2 name 1
0 Alex 1
1 fly 1
df.groupby('word').sum()
Output:
freq
word
Alex 2
fly 4
name 1
Full example:
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic = dic.append(dic_tmp)
dic = dic.groupby('word').sum().reset_index().sort_values('freq', ascending=False)
dic.to_csv('nenene.csv', index=False, header=None)
Upvotes: 1
Reputation: 1765
You are resetting the frequencies in each chunk.
Instead, you can use a Counter for this. Create a counter object at the beginning, update it in each chunk via its update method, and at the end write the output of counter.most_common() to a file as you wish.
Update: An example of this is:
import pandas as pd
from collections import Counter

c = Counter([])  # initiate counter with an empty list so we can update it later
chunks = pd.read_csv("/home/emre/Desktop/hdd/foo.csv", chunksize=1)
for chunk in chunks:
    for i, row in chunk.iterrows():
        c.update(row['sentences'].split(' '))
print(c.most_common())
The output is:
[('fly', 4),
('alex', 2),
('ibrahim', 2),
('hi', 1),
('my', 1),
('name', 1),
('is', 1),
('i', 1),
('am', 1),
('how', 1),
('are', 1),
('you', 1),
('doing', 1)]
Now you can iterate over these most-common pairs and save them to a file:
with open('most_commons.txt', 'w+') as f:
    for word_freq in c.most_common():
        f.write(word_freq[0] + ' ' + str(word_freq[1]) + '\n')
The file:
fly 4
alex 2
ibrahim 2
hi 1
my 1
name 1
is 1
i 1
am 1
how 1
are 1
you 1
doing 1
And this way you don't have to use chunksize=1. Make it something like chunksize=1000 so it won't have to read the file from disk too many times.
Also, the file-writing part could be written more elegantly; it's just for demonstration.
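For example, a slightly tidier way to write the same pairs (just a sketch using the standard csv module, keeping the filename from above):
import csv

with open('most_commons.txt', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=' ')
    # each item from most_common() is a (word, count) tuple
    writer.writerows(c.most_common())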
Upvotes: 1
Reputation: 94
Why did you decide to use pandas? Maybe you can read the file line by line and update a collections.Counter each time:
import collections

freq = collections.Counter()
with open(filename) as f:
    for line in f:
        freq.update(line.split())
After this Python block you will have the frequencies in the variable freq.
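If you then want to save the counts in the word/freq format the question is after, something like this should work (a sketch; 'nenene.csv' is just the output name the question uses):
with open('nenene.csv', 'w') as out:
    # most_common() yields (word, count) pairs, most frequent first
    for word, count in freq.most_common():
        out.write(f'{word},{count}\n')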
Upvotes: 1
Reputation: 969
I think you made a mistake at dic_tmp.append(dic); what you need is dic = dic.append(dic_tmp). Also, you are getting the indices set by pandas in your output before the words; you can pass the index=False parameter to your to_csv() function.
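Applied to the loop from the question, the corrected version would look roughly like this (a sketch; on newer pandas versions DataFrame.append is removed, so pd.concat([dic, dic_tmp]) would be the equivalent):
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic = dic.append(dic_tmp)  # assign back instead of dic_tmp.append(dic)
dic.to_csv('nenene.csv', index=False, header=None)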
Upvotes: 1
Reputation: 699
Here is what you wanna be doing:
chunks = pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=chunksize)
d = pd.concat(chunks)
d2 = d['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq')
Avoiding unwanted loops will also speed up your code when you read in large files.
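If you also want the CSV output from the question, a possible last step (just a sketch, reusing the output filename from the question) is:
d2.to_csv('nenene.csv', index=False, header=False)  # write the word/freq pairs without index or header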
Upvotes: 1