Aakarsh

Reputation: 51

Pandas pd.read_csv isn't working for csv file greater than 900MB

My server has 8 GB of RAM and I am using the pandas read_csv function to read a CSV file into a DataFrame, but the process exits as "Killed" for CSV files larger than 900 MB.

Can anyone help me handle this situation? I am attaching my meminfo output to get advice on how to free memory on the server: Memory info image

Upvotes: 2

Views: 3835

Answers (3)

Alexander Martins

Reputation: 383

In my case, it was a memory-related issue. Setting the nrows parameter in pd.read_csv to limit how many rows are read helped. It's not a full solution, but I was able to debug the problem this way.
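
For example, a minimal sketch of that debugging approach (the file name data.csv is just a placeholder):

import pandas as pd

# read only the first 10,000 rows to confirm the file parses and to
# estimate per-row memory usage before attempting the full load
sample = pd.read_csv('data.csv', nrows=10000)
sample.info(memory_usage='deep')  # prints dtypes and the memory footprint of the sample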

Upvotes: 0

Neill Herbst

Reputation: 2122

pandas can return an iterator for large files.

import pandas as pd

foo = pd.read_csv('bar.csv', iterator=True, chunksize=1000)

This returns an iterator over DataFrame chunks. You can then apply operations to the data chunk by chunk in a for loop, so the whole file is never read into memory at once. chunksize is the number of rows per chunk.

It will be something like this:

for chunk in foo:
    # do something with each chunk (an ordinary DataFrame)
    pass
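
For example, a rough sketch of computing a column mean across chunks (the column name value is just a placeholder for one of your numeric columns):

import pandas as pd

total = 0.0
count = 0
for chunk in pd.read_csv('bar.csv', iterator=True, chunksize=1000):
    # each chunk is an ordinary DataFrame with up to 1000 rows
    total += chunk['value'].sum()
    count += len(chunk)

print(total / count)  # mean of 'value' over the whole file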

EDIT: To the best of my knowledge, you will have to apply functions like unique in chunks as well.

import numpy as np

unique_foo = []
for chunk in foo:
    # collect the unique values seen in each chunk
    unique_foo.append(chunk['foo'].unique())

# combine the per-chunk results and deduplicate across chunks
unique_foo = np.unique(np.concatenate(unique_foo))

Upvotes: 4

Laurent

Reputation: 2023

(You should be a little more specific about what code you're running and what error you're receiving.)

If pandas is not working with a file that large, you can fall back to the more basic csv module. You can still load the result into a DataFrame afterwards if you feel more comfortable that way.

Something like:

with open("file.csv", 'rb') as csv_file:
reader = csv.reader(csv_file, delimiter=',')
df = pd.DataFrame(list(reader))

Upvotes: 0
