Lupos

Reputation: 906

How to get the Index of a DataFrame when using the chunksize argument?

I have a very big .csv file which I can't load fully into my RAM. That's why I need to load my dataset with the chunksize argument like this:

import pandas as pd
csv = pd.read_csv("challenger_match_V2.csv", chunksize=100, iterator=True)

But how do I access rows of the dataset by index?
Without the chunksize argument I can just do dataframe[idx:idx].
How can I do that with chunksize?

I tried doing:

for chunk in csv:
    print(chunk[idx])

That didn't work: I got a KeyError for the index I tried to access.

Example:

for chunk in csv:
    print(chunk[5])

Which gave the error:

   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 5

Upvotes: 0

Views: 1218

Answers (2)

Lupos

Reputation: 906

I ended up throwing away some data from my dataframe to reduce the amount of memory needed.
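The post does not say how the data was dropped. One possible way, sketched here under assumptions, is to skip unneeded columns at read time with usecols and to request narrower dtypes. The column names and the in-memory demo data below are hypothetical and not taken from challenger_match_V2.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; the column names are assumptions.
demo = io.StringIO(
    "match_id,duration,notes\n"
    "1,1800,long free text\n"
    "2,2100,more free text\n"
)

# Read only the columns that are needed and store them as 32-bit integers.
df = pd.read_csv(
    demo,
    usecols=["match_id", "duration"],
    dtype={"match_id": "int32", "duration": "int32"},
)
print(df.dtypes)
```

Whether this saves enough memory depends on how many columns can be dropped and on the value ranges in the real data.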

Upvotes: 0

HurtadoLazaro

Reputation: 11

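With chunksize set, pd.read_csv returns an iterable object of type TextFileReader instead of a DataFrame, so you can't index the return value like a regular DataFrame. Each chunk it yields while iterating is an ordinary DataFrame, though. chunk[5] raises KeyError: 5 because square brackets with a single key look up a column named 5; rows are selected with chunk.loc[label] or chunk.iloc[position]. So iterate over csv = pd.read_csv("challenger_match_V2.csv", chunksize=100, iterator=True) and work with each chunk as its own DataFrame. You can also append each chunk to a list and then concat the whole list to get a single DataFrame, although that loads the entire file into memory again.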

Example:

import pandas as pd

csv = pd.read_csv("challenger_match_V2.csv", chunksize=100, iterator=True)

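for chunk in csv:
    # Each chunk is already a DataFrame of up to 100 rows.
    # pd.concat expects a list of frames, so pd.concat(chunk) raises a TypeError.
    df = chunk
    print(df)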

OR

import pandas as pd

csv = pd.read_csv("challenger_match_V2.csv", chunksize=100, iterator=True)
chunk_list = []

for chunk in csv:
    chunk_list.append(chunk)

df = pd.concat(chunk_list)
print(df)

You can also print each chunk with print(chunk) while iterating.
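To look up a single row without concatenating everything, one option is to scan the chunks and stop at the one whose index contains the label. This is a minimal sketch, assuming the default integer index, which keeps counting across chunks (0 to 99, then 100 to 199, and so on). The helper name get_row and the in-memory demo data are made up for illustration.

```python
import io
import pandas as pd


def get_row(csv_source, idx, chunksize=100):
    """Return the row whose index label is idx as a Series, or None if absent."""
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Only one chunk is held in memory at a time.
        if idx in chunk.index:
            return chunk.loc[idx]
    return None


# Hypothetical demo data: 250 rows with columns a and b.
demo = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 10}" for i in range(250)))
row = get_row(demo, 205)
print(row)
```

With the real file, the call would be get_row("challenger_match_V2.csv", 5). Each lookup rereads the file from the start, so this suits occasional lookups rather than random access in a loop.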

Upvotes: 1
