Reputation: 899
I'm trying to read a large CSV file with pandas that won't fit in memory and build a word frequency table from it. My code works when the whole file fits in memory, but when I define a chunk size, each chunk is handled separately, regardless of what is in the previous chunks: it doesn't check the previous chunk to see whether a word is already there and increase its frequency, or append the word to the end if it isn't there. The code I'm trying is:
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic_tmp.append(dic)
dic.to_csv('nenene.csv', index=False, header=None)
For testing purposes I set the chunksize to one for a small CSV file, which looks like:
The output I'm getting is:
while what I'm trying to get is something like this:
Am I doing something wrong in the code? Any advice, please?
Upvotes: 0
Views: 1553
Reputation: 4215
You can simply groupby your created df:
Input:
word freq
0 fly 3
1 Alex 1
2 name 1
0 Alex 1
1 fly 1
df.groupby('word').sum()
Output:
freq
word
Alex 2
fly 4
name 1
Full example:
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic = dic.append(dic_tmp)
dic = dic.groupby('word').sum().reset_index().sort_values('freq', ascending=False)
dic.to_csv('nenene.csv', index=False, header=None)
Upvotes: 1
Reputation: 1765
You are resetting the frequencies in each chunk.
Instead, you can use a Counter for this. Create a counter object at the beginning, update it in each chunk via its update method, and at the end write the output of counter.most_common() to a file as you wish.
Update: An example of this is:
import pandas as pd
from collections import Counter

c = Counter([])  # initiate counter with an empty list so we can update it later
chunks = pd.read_csv("/home/emre/Desktop/hdd/foo.csv", chunksize=1)
for chunk in chunks:
    for i, row in chunk.iterrows():
        c.update(row['sentences'].split(' '))
print(c.most_common())
The output is:
[('fly', 4),
('alex', 2),
('ibrahim', 2),
('hi', 1),
('my', 1),
('name', 1),
('is', 1),
('i', 1),
('am', 1),
('how', 1),
('are', 1),
('you', 1),
('doing', 1)]
Now you can iterate over these most-common pairs and save them to a file:
with open('most_commons.txt', 'w+') as f:
    for word_freq in c.most_common():
        f.write(word_freq[0] + ' ' + str(word_freq[1]) + '\n')
The file:
fly 4
alex 2
ibrahim 2
hi 1
my 1
name 1
is 1
i 1
am 1
how 1
are 1
you 1
doing 1
And this way you don't have to use chunksize=1. Make it something like chunksize=1000 so it won't have to read the file from disk too many times.
Also, the file-writing part could be written more elegantly; it's just for demonstration.
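For example, a slightly tidier way to write the same pairs (just a sketch using the standard csv module, keeping the filename from above):
import csv

with open('most_commons.txt', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=' ')
    # each item from most_common() is a (word, count) tuple
    writer.writerows(c.most_common())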
Upvotes: 1
Reputation: 94
Why did you decide to use pandas? Maybe you can read the file line by line and update a collections.Counter each time:
import collections

freq = collections.Counter()
with open(filename) as f:
    for line in f:
        freq.update(line.split())
After this Python block you will have the frequencies in the variable freq.
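If you then want to save the counts in the word/freq format the question is after, something like this should work (a sketch; 'nenene.csv' is just the output name the question uses):
with open('nenene.csv', 'w') as out:
    # most_common() yields (word, count) pairs, most frequent first
    for word, count in freq.most_common():
        out.write(f'{word},{count}\n')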
Upvotes: 1
Reputation: 969
I think you made a mistake at dic_tmp.append(dic); what you need is dic = dic.append(dic_tmp). Also, you are getting the indices set by pandas in your output before the words; you can pass the index=False parameter to your to_csv() function.
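Applied to the loop from the question, the corrected version would look roughly like this (a sketch; on newer pandas versions DataFrame.append is removed, so pd.concat([dic, dic_tmp]) would be the equivalent):
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic = dic.append(dic_tmp)  # assign back instead of dic_tmp.append(dic)
dic.to_csv('nenene.csv', index=False, header=None)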
Upvotes: 1
Reputation: 699
Here is what you wanna be doing:
chunks = pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=chunksize)
d = pd.concat(chunks)
d2 = d['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq')
Avoiding unwanted loops will also speed up your code when you read in large files.
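If you also want the CSV output from the question, a possible last step (just a sketch, reusing the output filename from the question) is:
d2.to_csv('nenene.csv', index=False, header=False)  # write the word/freq pairs without index or header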
Upvotes: 1