Ranyk

Reputation: 267

Pandas: read_csv reading large csv file with no NaNs

I have a large dataset in .csv format, around 60 GB, in which more than 60% of the values are missing in some columns and rows. Since it's not possible to read such a huge file directly into a Jupyter notebook, I want to read only specific columns and only non-null rows using pandas.read_csv. How can this be done?

Thanks in advance!!

Upvotes: 1

Views: 419

Answers (2)

Naga kiran

Reputation: 4607

You can read the CSV file chunk by chunk and retain only the rows you want to keep:

import pandas as pd

# error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there
iter_csv = pd.read_csv('sample.csv', usecols=['col1', 'col2'], iterator=True, chunksize=10000, error_bad_lines=False)
data = pd.concat([chunk.dropna(how='all') for chunk in iter_csv])
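Note that dropna(how='all') only drops rows in which every selected column is missing. If you instead want rows where all the columns of interest are present, a subset-based dropna works; a minimal sketch, assuming the same hypothetical file and column names as above:

import pandas as pd

# Keep only rows where both selected columns have a value
iter_csv = pd.read_csv('sample.csv', usecols=['col1', 'col2'], iterator=True, chunksize=10000)
data = pd.concat(chunk.dropna(subset=['col1', 'col2']) for chunk in iter_csv)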

Upvotes: 2

Albin

Reputation: 912

Check the following suggestion from a previous post.

The pandas documentation suggests that you can read a CSV file selecting only the columns you want:

import pandas as pd

df = pd.read_csv('some_data.csv', usecols=['col1', 'col2'], low_memory=True)
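For a 60 GB file, usecols alone may still not fit in memory, so a quick preview with nrows can help you inspect the data before committing to a full chunked read. A sketch, again assuming the hypothetical file and column names from the answer above:

import pandas as pd

# Load only the first 1,000 rows of the two columns to inspect the data cheaply
preview = pd.read_csv('some_data.csv', usecols=['col1', 'col2'], nrows=1000)
print(preview.isna().mean())  # fraction of missing values per column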

Upvotes: 2
