Is it possible to use pandas to filter hundreds of millions of rows of data

Question

Recently I've been dealing with a large dataset which contains almost 100M rows.
The file when fully loaded into memory is over 15GB. I have no problem with loading all the data into memory since I have a server with 96GB mem.
Here's the output of info():


Int64Index: 97915924 entries, 0 to 117814626
Data columns (total 20 columns):
 #   Column             Non-Null Count     Dtype  
---  -----------------  -----------------  -----  
 0   50p_width          97915924 non-null  float64
 1   80p_width          97915924 non-null  float64
 2   area               97915924 non-null  float64
 3   area_fraction_top  97915924 non-null  float64
 4   center_time        97915924 non-null  float64
 5   event_number       97915924 non-null  int64
 6   event_start_time   97915924 non-null  int64
 7   goodness_of_fit    8205122 non-null   float64
 8   left               97915924 non-null  int64
 9   max_PMT            97915924 non-null  int64
 10  max_PMT_area       97915924 non-null  float64
 11  max_hit_PMT        97915924 non-null  int64
 12  max_hit_area       97915924 non-null  float64
 13  n_PMTS             97915924 non-null  int64
 14  n_hits             97915924 non-null  int64
 15  right              97915924 non-null  int64
 16  run_number         97915924 non-null  int64
 17  type               97915924 non-null  int64
 18  x                  8205122 non-null   float64
 19  y                  8205122 non-null   float64
dtypes: float64(10), int64(10)
memory usage: 15.3 GB

data['exact_time'] = data['center_time'] + data['event_start_time']

My goal here is not to directly analyze on the data but rather filter it so I can do some further research. The type can only be 0, 1, and 2. data['exact_time'] is a unix timestamp in ns. I'd like to find out all the type 0/type 1 events that happened 1 ms (1e6 ns) before the type 2 events and find the out the one with max area correspond to all the type 2.\

I came up with 2 possible approaches, but both requires to iterate through each row.
Method 1: This only filters all the events that's within the time range.

s2_data = data[data['type'] == 2]
s2_time_list = s2_data['exact_time'].tolist()
s2_time_list
for t in tqdm(s2_time_list):
    new_data = new_data.append(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])

Method 2: This will iterate through the whole dataframe once and get t0/t1 event that has max area along with corresponding t2.

#eventlist = pd.DataFrame().reindex(columns=data.columns)
s2_time = 0
for i, row in tqdm(data.iterrows()):
    if row['type'] == 2:
        s2_time = row['exact_time']
        eventlist = eventlist.loc[eventlist['exact_time'] >= s2_time - 1e6]
        ind = eventlist["area"].idxmax()
        max_row = eventlist.iloc[ind,:]
        new_data = new_data.append(max_row)
        new_data = new_data.append(row)
    else:
        eventlist = eventlist.append(row)

I've already created index on exact_time and use modin library to process the data parallelly if possible, but both seems to be super slow and take forever to finish. I don't think apply() or pd/np vectorization would work because this requires datas from multiple rows so I wonder if there's any better way I can do this.

Is it possible to use pandas to filter hundreds of millions of rows of data

Answers (1)

Related Questions