Reputation: 906
I have a very large .csv file that I can't load fully into RAM. That's why I need to load my dataset with the chunksize argument, like this:
import pandas as pd
csv = pd.read_csv("challenger_match_V2.csv", chunksize=100, iterator=True)
But how do I access the dataset by index?
Without the chunksize argument I can just do dataframe[idx:idx].
How can I do that with chunksize?
I tried:
for chunk in csv:
    print(chunk[idx])
That didn't work: I got a KeyError for the index I tried to access.
Example:
for chunk in csv:
    print(chunk[5])
Which gave the error:
2646 return self._engine.get_loc(key)
2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2650 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 5
Upvotes: 0
Views: 1218
Reputation: 906
I ended up throwing away some data from my dataframe to reduce the amount of memory needed.
Upvotes: 0
Reputation: 11
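pd.read_csv with chunksize returns an iterable object of type TextFileReader instead of a DataFrame, so you can't index the result like a regular DataFrame. Each chunk it yields is a DataFrame, but chunk[5] looks up a column labelled 5, which is why you get the KeyError; rows within a chunk are accessed through chunk.iloc or chunk.loc. To get a single DataFrame, iterate over csv = pd.read_csv("challenger_match_V2.csv", chunksize=100, iterator=True)
and concat the chunks. You can also append each chunk to a list and then concat the whole list.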
Example:
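import pandas as pd
csv = pd.read_csv("challenger_match_V2.csv", chunksize=100, iterator=True)
# pd.concat expects an iterable of DataFrames (passing a single chunk raises a TypeError),
# so hand it the reader itself and let it consume every chunk
df = pd.concat(csv)
print(df)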
OR
import pandas as pd
csv = pd.read_csv("challenger_match_V2.csv", chunksize=100, iterator=True)
chunk_list = []
for chunk in csv:
    chunk_list.append(chunk)
df = pd.concat(chunk_list)
print(df)
You can also print each chunk by just doing print(chunk) while iterating.
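Note that concatenating every chunk rebuilds the whole dataset in memory, which may not help if the file doesn't fit in RAM in the first place. If you only need a few rows by position, one possible approach is to walk the chunks and keep only the slice you want. The sketch below assumes a hypothetical helper called get_rows and made-up column names (match_id, duration) for the demo data; neither comes from the question.

```python
import io

import pandas as pd


def get_rows(path_or_buffer, start, stop, chunksize=100):
    """Return rows [start, stop) by global position, reading chunk by chunk.

    Hypothetical helper for illustration; only the requested rows are kept.
    """
    parts = []
    offset = 0  # global position of the first row of the current chunk
    for chunk in pd.read_csv(path_or_buffer, chunksize=chunksize):
        end = offset + len(chunk)
        if end > start:
            # Translate the global range into positions local to this chunk
            lo = max(start - offset, 0)
            hi = min(stop - offset, len(chunk))
            parts.append(chunk.iloc[lo:hi])
        if end >= stop:
            break  # everything requested has been read
        offset = end
    if not parts:
        raise IndexError(f"start position {start} is past the end of the file")
    return pd.concat(parts)


# Demo on an in-memory CSV with made-up columns (stand-in for the real file)
demo_csv = io.StringIO(
    "match_id,duration\n" + "\n".join(f"{i},{i * 10}" for i in range(250))
)
print(get_rows(demo_csv, 98, 102))  # this range spans two chunks
```

With the real file you would call something like get_rows("challenger_match_V2.csv", idx, idx + 1). Each lookup rereads the file from the start, so this only makes sense for occasional access, not for random access in a tight loop.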
Upvotes: 1