Reputation: 11
Recently I've been dealing with a large dataset that contains almost 100M rows.
When fully loaded into memory, the data takes up over 15 GB. Loading it all into memory is not a problem, since I have a server with 96 GB of memory.
Here's the output of info():
<class 'modin.pandas.dataframe.DataFrame'>
Int64Index: 97915924 entries, 0 to 117814626
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ----------------- ----------------- -----
0 50p_width 97915924 non-null float64
1 80p_width 97915924 non-null float64
2 area 97915924 non-null float64
3 area_fraction_top 97915924 non-null float64
4 center_time 97915924 non-null float64
5 event_number 97915924 non-null int64
6 event_start_time 97915924 non-null int64
7 goodness_of_fit 8205122 non-null float64
8 left 97915924 non-null int64
9 max_PMT 97915924 non-null int64
10 max_PMT_area 97915924 non-null float64
11 max_hit_PMT 97915924 non-null int64
12 max_hit_area 97915924 non-null float64
13 n_PMTS 97915924 non-null int64
14 n_hits 97915924 non-null int64
15 right 97915924 non-null int64
16 run_number 97915924 non-null int64
17 type 97915924 non-null int64
18 x 8205122 non-null float64
19 y 8205122 non-null float64
dtypes: float64(10), int64(10)
memory usage: 15.3 GB
data['exact_time'] = data['center_time'] + data['event_start_time']
My goal here is not to directly analyze the data but rather to filter it so I can do some further research. The type can only be 0, 1, or 2, and data['exact_time'] is a Unix timestamp in ns. I'd like to find all the type 0/type 1 events that happened within 1 ms (1e6 ns) before each type 2 event, and then, for each type 2 event, find the one with the maximum area.
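To make the intended logic concrete, here is a small toy illustration (not my real data; only the three columns that matter here), picking, for a single type 2 event, the type 0/1 event with the largest area within the preceding 1 ms:

import pandas as pd

# Toy frame: times in ns, made-up areas
toy = pd.DataFrame({
    'exact_time': [100, 500_000, 900_000, 1_200_000, 1_500_000],
    'type':       [0,   1,       0,       1,         2],
    'area':       [5.0, 9.0,     3.0,     7.0,       42.0],
})

t2 = toy.loc[toy['type'] == 2, 'exact_time'].iloc[0]       # time of the type 2 event
window = toy[(toy['type'] != 2)
             & (toy['exact_time'] >= t2 - 1e6)
             & (toy['exact_time'] <= t2)]                   # type 0/1 events within 1 ms before it
print(window.loc[window['area'].idxmax()])                  # the one with the largest area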
I came up with two possible approaches, but both require iterating through each row.
Method 1: This only filters the events that are within the time range.
s2_data = data[data['type'] == 2]
s2_time_list = s2_data['exact_time'].tolist()

new_data = pd.DataFrame()
for t in tqdm(s2_time_list):
    new_data = new_data.append(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])
Method 2: This iterates through the whole dataframe once and, for each type 2 event, collects the type 0/type 1 event that has the maximum area along with the corresponding type 2 row.
eventlist = pd.DataFrame().reindex(columns=data.columns)  # buffer of type 0/1 rows seen so far
new_data = pd.DataFrame()
s2_time = 0
for i, row in tqdm(data.iterrows()):
    if row['type'] == 2:
        s2_time = row['exact_time']
        # keep only the buffered events within 1 ms before this type 2 event
        eventlist = eventlist.loc[eventlist['exact_time'] >= s2_time - 1e6]
        ind = eventlist['area'].idxmax()
        max_row = eventlist.loc[ind]   # idxmax returns a label, so use .loc rather than .iloc
        new_data = new_data.append(max_row)
        new_data = new_data.append(row)
    else:
        eventlist = eventlist.append(row)
I've already created an index on exact_time and use the modin library to process the data in parallel where possible, but both methods seem to be super slow and take forever to finish. I don't think apply() or pd/np vectorization would work because this requires data from multiple rows, so I wonder if there's a better way to do this.
Upvotes: 1
Views: 930
Reputation: 5648
Rework your method 1 to build a list of dataframes, then concatenate them once at the end. From what I see, you are appending to a dataframe on each iteration.
I put this example together to illustrate. I created a dataframe with 500,000 rows on a PC with 12 GB of RAM (for comparison). The rework reduced the time from 46 seconds to 25 seconds, so at a minimum you can cut your time roughly in half there.
import time
import pandas as pd

tt = 1605067706567342
n = 500000
exact_time = []
for i in range(n):
    tt = tt + 99999
    exact_time.append(tt)

# 50-element pattern repeated 10,000 times -> 500,000 rows, 20,000 of type 2
type = [0,0,0,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,0,1,1,1,1]*10000

data = pd.DataFrame({'exact_time': exact_time, 'type': type}, columns=['exact_time', 'type'])
print(data)

s2_data = data[data['type'] == 2]
s2_time_list = s2_data['exact_time'].tolist()
print(len(s2_time_list))
#### Your method 1 - 500,000 rows; 20,000 type 2 --- 46.63247585296631 seconds
starttime = time.time()
new_data = pd.DataFrame()
for t in s2_time_list:
    new_data = new_data.append(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])
print(time.time()-starttime)
#### Reworked to create a list of dfs, then concatenate
#### 500,000 rows; 20,000 type 2 --- 25.308656930923462 seconds
starttime = time.time()
new_data_list = []
for t in s2_time_list:
    new_data_list.append(data.loc[(data['exact_time'] >= t-1e6) & (data['exact_time'] <= t)])
new_df = pd.concat(new_data_list, axis=0)
print(time.time()-starttime)
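Applied back to your method 1, the loop would look something like this (a sketch, assuming data, s2_time_list, and tqdm as defined in your question):

# Collect each time-window slice in a list and concatenate once at the end,
# instead of calling append on a growing dataframe inside the loop.
new_data_list = []
for t in tqdm(s2_time_list):
    new_data_list.append(data.loc[(data['exact_time'] >= t - 1e6) & (data['exact_time'] <= t)])
new_data = pd.concat(new_data_list, axis=0)

Each append on a dataframe copies all of the rows accumulated so far, so the cost grows with every iteration; building a plain list and calling pd.concat once avoids that repeated copying.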
Upvotes: 1