Reputation: 267
I have a large dataset in .csv format, around 60 GB, and more than 60% of the data is missing in some columns and rows. Since it's not possible to read such a huge file directly into a Jupyter notebook, I want to read only specific columns and only non-null rows using pandas.read_csv
.
How can this be done?
Thanks in advance!!
Upvotes: 1
Views: 419
Reputation: 4607
You can read the CSV file chunk by chunk and keep only the rows you want:
import pandas as pd

# Read in chunks, keeping only the needed columns, then drop rows where all selected columns are NaN
iter_csv = pd.read_csv('sample.csv', usecols=['col1', 'col2'], iterator=True, chunksize=10000, error_bad_lines=False)
data = pd.concat([chunk.dropna(how='all') for chunk in iter_csv])
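If only certain columns must be non-null (rather than dropping rows that are empty across the board), dropna(subset=...) can be applied to each chunk instead. A minimal sketch, assuming pandas 1.3+ (where error_bad_lines was replaced by on_bad_lines) and that col1 and col2 are placeholder names for the columns that must have values:

import pandas as pd

# Keep only rows where both placeholder columns col1 and col2 have values
chunks = pd.read_csv('sample.csv', usecols=['col1', 'col2'], chunksize=10000, on_bad_lines='skip')
data = pd.concat(chunk.dropna(subset=['col1', 'col2']) for chunk in chunks)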
Upvotes: 2
Reputation: 912
Check the following suggestion from a previous post.
The pandas documentation shows that you can read a CSV file and select only the columns you want to load.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols=['col1', 'col2'], low_memory=True)
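This handles the column selection but not the non-null-row part of the question; rows with missing values can be dropped after loading. A minimal sketch, again assuming col1 and col2 are placeholder column names:

import pandas as pd

# Load only the two placeholder columns, then drop rows missing a value in either one
df = pd.read_csv('some_data.csv', usecols=['col1', 'col2'])
df = df.dropna(subset=['col1', 'col2'])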
Upvotes: 2