Mamed

Reputation: 772

How to reduce the memory usage and speed up the code

I am using a huge dataset with 5 columns and more than 90 million rows. The code works fine with part of the data, but when it comes to the whole dataset I get a MemoryError. I read about generators, but it seems very complex to me. Can I get an explanation based on this code?

import pandas as pd

df = pd.read_csv('D:.../test.csv', names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"])

df = df[:-1]
df.loc[:,'timestamp'] = pd.to_datetime(df.loc[:,'timestamp'])
pd.set_option('float_format', '{:f}'.format)
df['epoch'] = df.loc[:, 'timestamp'].astype('int64')//1e9
df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name
del df['timestamp']

for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
    day_df = df.loc[df['day_of_week'] == day]
    day_df.to_csv(f'{day}.csv', index=False)

The error appears in the last for loop.

Sample data:

d4ace40905729245a5a0bc3fb748d2b3    1   2016-06-01T08:18:46.000Z    22.9484 56.7728
d4ace40905729245a5a0bc3fb748d2b3    2   2016-06-01T08:28:05.000Z    22.9503 56.7748

UPDATED

I did this:

chunk_list = []  

for chunk in df_chunk:  
    chunk_list.append(chunk)
df_concat = pd.concat(chunk_list)

I have no idea how to proceed now. How do I apply the rest of the code?

Upvotes: 3

Views: 1249

Answers (3)

Maddo

Reputation: 185

You can use something like PyPy (an alternative implementation of Python that is incompatible with many packages, but is faster and has better memory management). It did not use to support pandas, so you would have had to iterate through each row yourself (which PyPy is very fast at), but I believe that versions since this release can be used with pandas.
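For reference, a minimal sketch of that row-by-row approach using only the standard library, so it runs unchanged on PyPy. It assumes the file is comma-separated with the five columns from the question; the per-day output naming simply mirrors the original code:

import csv
from datetime import datetime

files, writers = {}, {}
with open('test.csv', newline='') as src:
    for row in csv.reader(src):
        if len(row) != 5:        # skip malformed lines such as the bad last row
            continue
        # row[2] is the timestamp, e.g. 2016-06-01T08:18:46.000Z
        day = datetime.strptime(row[2], '%Y-%m-%dT%H:%M:%S.%fZ').strftime('%A')
        if day not in writers:
            files[day] = open(f'{day}.csv', 'w', newline='')
            writers[day] = csv.writer(files[day])
        writers[day].writerow(row)
for f in files.values():
    f.close()

Only one row is ever held in memory, so this never hits a MemoryError, at the cost of giving up pandas' vectorized operations.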

Upvotes: -2

Massifox

Reputation: 4487

My advice is to switch to Dask or Spark.

If you want to continue using pandas, try the following tips to read a CSV file, with pandas.read_csv:

  1. chunksize parameter: it allows you to read a piece of the file at a time. For example, in your case you could use a chunksize of one million; you would get 90 chunks and could operate on each chunk individually.
  2. dtype parameter: with this parameter you can specify the data type of each column by passing a dictionary like {'a': np.float32, 'b': np.int32, 'c': 'Int32'}.
    Pandas defaults to 64-bit data types, while 32-bit might be enough for you. With this trick you could save about 50% of the space (see the sketch after this list).
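As a rough illustration of point 2 (a minimal sketch: the column names come from the question, 'data.csv' is a placeholder path), memory_usage(deep=True) lets you compare the two footprints:

import numpy as np
import pandas as pd

cols = ["id_easy", "ordinal", "timestamp", "latitude", "longitude"]

# default inference: int64/float64 for the numeric columns
df64 = pd.read_csv('data.csv', names=cols)

# explicit 32-bit types roughly halve the numeric columns' footprint
df32 = pd.read_csv('data.csv', names=cols,
                   dtype={"ordinal": np.int32,
                          "latitude": np.float32,
                          "longitude": np.float32})

print(df64.memory_usage(deep=True).sum())   # total bytes with 64-bit numerics
print(df32.memory_usage(deep=True).sum())   # noticeably smaller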

Your case study

Try this code:

df_chunks = pd.read_csv('test.csv', chunksize=1000000, iterator=True,
                        names=["id_easy", "ordinal", "timestamp", "latitude", "longitude"],
                        parse_dates=['timestamp'], error_bad_lines=False,
                        dtype={"ordinal": 'int32', "latitude": 'float32', "longitude": 'float32'})
for chunk in df_chunks:
    # process one chunk at a time: derive the weekday, then append to the per-day files
    chunk['day_of_week'] = chunk['timestamp'].dt.weekday_name
    for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
        day_df = chunk.loc[chunk['day_of_week'] == day]
        day_df.to_csv(f'{day}.csv', mode='a', index=False, header=False)

This way you work on one chunk of data at a time and never work with all the data together. The mode='a' tells pandas to append.

Note 1: You do not need pandas.concat here. All iterator and chunksize=1000000 do is give you a reader object that yields 1,000,000-row DataFrames instead of reading the whole file at once. By using concat you lose all the advantages of the iterator and load the whole file into memory, exactly like calling read_csv without specifying chunksize.
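To make the difference concrete, a small stripped-down illustration (options omitted for brevity, same test.csv assumed):

reader = pd.read_csv('test.csv', chunksize=1000000)   # a lazy reader object, nothing loaded yet

# lazy: only one chunk (at most 1,000,000 rows) is in memory per iteration
for chunk in reader:
    print(len(chunk))

# by contrast, pd.concat(list(reader)) would materialize every chunk at once,
# which is the same as reading the whole file and brings the MemoryError back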

Note 2: If the MemoryError persists, try a smaller chunksize.

Upvotes: 2

RomanPerekhrest

Reputation: 92854

A combination of improvements:

  • to iterate through a (potentially very large) file lazily rather than reading the entire file into memory, pass a chunksize to the read_csv call (specifying the number of rows to read per iteration)
  • the statement df = df[:-1] is not applicable in the iterator approach; assuming the last line is in a bad format (99695386 [space] NaN NaN NaN NaN), we can handle and skip it with the option error_bad_lines=False
  • the statement df.loc[:,'timestamp'] = pd.to_datetime(df.loc[:,'timestamp']) can also be eliminated by passing parse_dates=['timestamp'] as an option to the pd.read_csv call
  • we append to the existing target csv files with mode='a' (append to a file)

In practice:

n_rows = 10 ** 6  # adjust empirically
reader = pd.read_csv('test.csv', names=["id_easy","ordinal", "timestamp", "latitude", "longitude"], 
                     parse_dates=['timestamp'], chunksize=n_rows, error_bad_lines=False)                               
day_names = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

for df in reader: 
    if not df.empty: 
        df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 1e9 
        df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name 
        del df['timestamp']
        for day in day_names: 
            day_df = df.loc[df['day_of_week'] == day] 
            if not day_df.empty:
                day_df.to_csv(f'{day}.csv', index=False, header=False, mode='a') 

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking

Upvotes: 1
