Reputation: 772
I am using a huge dataset with 5 columns and more than 90 million rows. The code works fine with part of the data, but when it comes to the whole thing I get a MemoryError. I read about generators, but they seem very complex to me. Can I get an explanation based on this code?
import pandas as pd

df = pd.read_csv('D:.../test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"])
df = df[:-1]
df.loc[:, 'timestamp'] = pd.to_datetime(df.loc[:, 'timestamp'])
pd.set_option('float_format', '{:f}'.format)
df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 10**9
df['day_of_week'] = pd.to_datetime(df['epoch'], unit='s').dt.day_name()
del df['timestamp']

for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
    day_df = df.loc[df['day_of_week'] == day]
    day_df.to_csv(f'{day}.csv', index=False)
The error appears on the last for loop. Sample data:

d4ace40905729245a5a0bc3fb748d2b3 1 2016-06-01T08:18:46.000Z 22.9484 56.7728
d4ace40905729245a5a0bc3fb748d2b3 2 2016-06-01T08:28:05.000Z 22.9503 56.7748
I did this:
# df_chunk is the reader returned by pd.read_csv(..., chunksize=...)
chunk_list = []
for chunk in df_chunk:
    chunk_list.append(chunk)
df_concat = pd.concat(chunk_list)
I have no idea how to proceed from here. How do I apply the rest of the code to the chunks?
Upvotes: 3
Views: 1249
Reputation: 185
You can use something like PyPy (an alternative implementation of Python that is incompatible with many packages, but faster and with better memory management). PyPy did not support pandas for a long time (so you would need to iterate through each row yourself, which PyPy is very fast at), but I believe that versions since this release can be used with pandas.
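To make "iterate through each row" concrete, here is a minimal sketch using only the standard library csv module, which is the kind of plain-Python loop PyPy speeds up. The file name and column order are assumed from the question, and the timestamp format from the sample data:

import csv
from datetime import datetime, timezone

# Stream the file row by row; only one row is in memory at a time.
files, writers = {}, {}
with open('test.csv', newline='') as src:
    for row in csv.reader(src):
        if len(row) != 5:                      # skip malformed lines
            continue
        id_easy, ordinal, timestamp, latitude, longitude = row
        ts = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S.%fZ')
        day = ts.strftime('%A')                # full weekday name, e.g. 'Monday'
        if day not in writers:
            files[day] = open(f'{day}.csv', 'w', newline='')
            writers[day] = csv.writer(files[day])
        epoch = int(ts.replace(tzinfo=timezone.utc).timestamp())
        writers[day].writerow([id_easy, ordinal, epoch, latitude, longitude, day])

for f in files.values():
    f.close()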
Upvotes: -2
Reputation: 4487
My advice is to switch to Dask or Spark.
If you want to continue using pandas, try the following tips for reading the CSV file with pandas.read_csv:

- chunksize parameter: lets you read the file one piece at a time. In your case you could use a chunksize of one million; you would get 90 chunks and could operate on each chunk individually.
- dtype parameter: lets you specify the data type of each column by passing a dictionary such as {'a': np.float32, 'b': np.int32, 'c': 'Int32'}.
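To see what the dtype downcast buys you, you could compare the per-column memory footprint on a sample of the file. A sketch; the column names come from the question and the nrows value is arbitrary:

import pandas as pd

cols = ["id_easy", "ordinal", "timestamp", "latitude", "longitude"]

# Default dtypes: pandas infers int64 / float64
sample64 = pd.read_csv('test.csv', names=cols, nrows=1_000_000)

# Downcast halves the footprint of the numeric columns
sample32 = pd.read_csv('test.csv', names=cols, nrows=1_000_000,
                       dtype={'ordinal': 'int32',
                              'latitude': 'float32',
                              'longitude': 'float32'})

print(sample64.memory_usage(deep=True).sum())   # bytes with default dtypes
print(sample32.memory_usage(deep=True).sum())   # noticeably smaller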
Try this code:
import pandas as pd

df_chunks = pd.read_csv('test.csv', chunksize=1000000, iterator=True,
                        names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"],
                        parse_dates=['timestamp'], error_bad_lines=False,
                        dtype={"ordinal": 'int32', "latitude": 'float32', "longitude": 'float32'})

for chunk in df_chunks:
    # process the single chunk: derive the column used below
    # (parse_dates already made 'timestamp' a datetime)
    chunk['day_of_week'] = chunk['timestamp'].dt.day_name()
    for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
        day_df = chunk.loc[chunk['day_of_week'] == day]
        day_df.to_csv(f'{day}.csv', mode='a', index=False, header=False)
This way you work on one chunk of data at a time and never hold all the data in memory together. mode='a' tells pandas to append.
Note 1: You do not need pandas.concat here. The only thing iterator and chunksize=1000000 do is give you a reader object that yields DataFrames of 1,000,000 rows each instead of reading the whole file at once. With concat you lose all the advantages of the iterator and load the whole file into memory, exactly like calling read_csv without specifying chunksize.
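To make the contrast concrete, a sketch of both patterns (the per-chunk processing is a placeholder):

import pandas as pd

# Anti-pattern: concat rebuilds the full 90M-row DataFrame in memory,
# so the MemoryError comes straight back
df = pd.concat(pd.read_csv('test.csv', chunksize=1000000))

# Streaming pattern: each chunk is processed and then discarded
for chunk in pd.read_csv('test.csv', chunksize=1000000):
    ...  # process chunk, append results to disk, keep nothing in memory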
Note 2: If the MemoryError persists, try a smaller chunksize.
Upvotes: 2
Reputation: 92854
A set of improvements:

- pass chunksize to the read_csv call (specifying the number of rows to read per iteration)
- df = df[:-1] is not applicable in the iterator approach; assuming the last line is malformed (99695386 [space] NaN NaN NaN NaN), we can handle and skip it with the option error_bad_lines=False
- df.loc[:, 'timestamp'] = pd.to_datetime(df.loc[:, 'timestamp']) can be eliminated by passing parse_dates=['timestamp'] as an option to the pd.read_csv call
- mode='a' (append to a file)

In practice:
n_rows = 10 ** 6  # adjust empirically
reader = pd.read_csv('test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"],
                     parse_dates=['timestamp'], chunksize=n_rows, error_bad_lines=False)
day_names = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

for df in reader:
    if not df.empty:
        df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 10**9
        df['day_of_week'] = pd.to_datetime(df['epoch'], unit='s').dt.day_name()
        del df['timestamp']
        for day in day_names:
            day_df = df.loc[df['day_of_week'] == day]
            if not day_df.empty:
                day_df.to_csv(f'{day}.csv', index=False, header=False, mode='a')
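A possible variation on the inner loop: let groupby split each chunk once instead of scanning it seven times. A sketch under the same setup as the code above:

for df in reader:
    if df.empty:
        continue
    df['epoch'] = df['timestamp'].astype('int64') // 10**9
    df['day_of_week'] = df['timestamp'].dt.day_name()
    del df['timestamp']
    # one pass over the chunk instead of seven boolean masks
    for day, day_df in df.groupby('day_of_week'):
        day_df.to_csv(f'{day}.csv', index=False, header=False, mode='a')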
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
Upvotes: 1