Reputation: 145
I have the following df:
import pandas as pd
df = pd.DataFrame({'ID1':[1,2,3,4,5,6],'ID2':[2,6,6,2,1,2],'AREA':[1,1,1,1,1,1]})
...
   ID1  ID2  AREA
0    1    2     1
1    2    6     1
2    3    6     1
3    4    2     1
4    5    1     1
5    6    2     1
I accumulate the AREA column like so:
for id_ in df.ID1:
    id1_filter = df.ID1 == id_
    id2_filter = (df.ID1 == id_) | (df.ID2 == id_)
    df.loc[id1_filter, 'AREA'] = df.loc[id2_filter].AREA.sum()
print(df)
...
   ID1  ID2  AREA
0    1    2     2
1    2    6     5
2    3    6     1
3    4    2     1
4    5    1     1
5    6    2     7
For each id_ in ID1, AREA is summed over the rows where ID1 == id_ or ID2 == id_, and the loop is always run with df sorted on ID1. Note that the updates are sequential: by the time id_ == 2 is processed, row 0 already holds the updated value 2, which is why row 1 ends up as 5 rather than 4.
The real dataframe I'm working on, though, has 150,000 records, each row belonging to a unique ID1.
Running the above on this dataframe takes 2.5 hours. Since this operation will take place repeatedly
for the foreseeable future, I decided to store the indices of the True values in id1_filter
and id2_filter
in a DB with the following schema.
Table ID1:
ID_,INDEX_
1 , 0
2 , 1
etc, etc
Table ID2:
ID_,INDEX_
1 , 0
1 , 4
2 , 0
2 , 1
2 , 3
2 , 5
etc, etc
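For reference, here is a rough sketch of how these two tables can be built from df in one pass and written out with to_sql (using the same db_engine as in the code further down; the intermediate names id1_rows and id2_rows are just for illustration):
# Table ID1: one row per ID1 value with its positional index in df
id1_rows = pd.DataFrame({'ID_': df.ID1.values, 'INDEX_': df.index})
id1_rows.to_sql('ID1', db_engine, if_exists='replace', index=False)

# Table ID2: for each id, every index where ID1 == id or ID2 == id
id2_rows = pd.concat([
    pd.DataFrame({'ID_': df.ID1.values, 'INDEX_': df.index}),
    pd.DataFrame({'ID_': df.ID2.values, 'INDEX_': df.index}),
]).drop_duplicates().sort_values(['ID_', 'INDEX_'])
id2_rows.to_sql('ID2', db_engine, if_exists='replace', index=False)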
The next time I run the accumulation on the AREA column (now filled with different AREA values), I read in the SQL tables and convert them to dicts. I then use these dicts to grab the records I need during the summing loop.
id1_dict = pd.read_sql('select * from ID1',db_engine).groupby('ID_').INDEX_.unique().to_dict()
id2_dict = pd.read_sql('select * from ID2',db_engine).groupby('ID_').INDEX_.unique().to_dict()
# print indices for id1_filter and id2_filter for id 1
print(id1_dict[1])
print(id2_dict[1])
...
[0]
[0, 4]
for id_ in df.ID1:
    df.loc[id1_dict[id_], 'AREA'] = df.loc[id2_dict[id_]].AREA.sum()
When run this way it only takes 6 minutes!
My question: is there a better/standard way to handle this scenario, i.e. storing dataframe selections for later use? As a side note, I have set an index on the SQL tables' ID columns and tried getting the indices by querying the table for each id; that works well, but still takes a little longer than the above (9 mins).
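To make the question concrete, this is the kind of "storing selections for later use" I mean, sketched here with a plain pickle file instead of SQL (the file name is made up):
import pickle

# save the precomputed index lookups once...
with open('area_filters.pkl', 'wb') as f:
    pickle.dump({'id1': id1_dict, 'id2': id2_dict}, f)

# ...and reload them on later runs instead of hitting the DB
with open('area_filters.pkl', 'rb') as f:
    filters = pickle.load(f)
id1_dict, id2_dict = filters['id1'], filters['id2']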
Upvotes: 1
Views: 533
Reputation: 29635
One way to do it is like this:
df = df.set_index('ID1')
# row[3] holds, for each ID1 that also appears in ID2, the list of ID1 labels
# whose ID2 equals it; their (possibly already updated) AREA values are added
for row in df.join(df.groupby('ID2')['AREA'].apply(lambda x: x.index.tolist()), rsuffix='_').dropna().itertuples():
    df.loc[row[0], 'AREA'] += df.loc[row[3], 'AREA'].sum()
df = df.reset_index()
and you get the expected result:
   ID1  ID2  AREA
0    1    2     2
1    2    6     5
2    3    6     1
3    4    2     1
4    5    1     1
5    6    2     7
Now, on a bigger df like:
import numpy as np
df = pd.DataFrame( {'ID1':range(1,1501),'ID2': np.random.randint(1,1501,(1500,)),'AREA':[1]*1500},
                   columns = ['ID1','ID2','AREA'])
the method presented here runs in about 0.76 s on my computer, while your first one takes about 6.5 s.
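If you want to reproduce the comparison yourself, a simple harness along these lines will do (a sketch only; exact numbers depend on your machine):
import time

# original loop on a copy of the 1500-row df above
df1 = df.copy()
start = time.perf_counter()
for id_ in df1.ID1:
    id1_filter = df1.ID1 == id_
    id2_filter = (df1.ID1 == id_) | (df1.ID2 == id_)
    df1.loc[id1_filter, 'AREA'] = df1.loc[id2_filter].AREA.sum()
print('original loop:', time.perf_counter() - start)

# join/itertuples version from this answer
df2 = df.copy().set_index('ID1')
start = time.perf_counter()
joined = df2.join(df2.groupby('ID2')['AREA'].apply(lambda x: x.index.tolist()),
                  rsuffix='_').dropna()
for row in joined.itertuples():
    df2.loc[row[0], 'AREA'] += df2.loc[row[3], 'AREA'].sum()
df2 = df2.reset_index()
print('join/itertuples:', time.perf_counter() - start)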
Ultimately, you could create a df_list such as:
df_list = (df.set_index('ID1')
             .join(df.set_index('ID1').groupby('ID2')['AREA']
                     .apply(lambda x: x.index.tolist()), rsuffix='_ID2')
             .dropna().drop(['AREA', 'ID2'], axis=1))
to keep the information linking ID1 and ID2 somewhere: here, for instance, you can see that ID2 is equal to 2 in the rows where ID1 = 1, 4 and 6:
      AREA_ID2
ID1
1          [5]
2    [1, 4, 6]
6       [2, 3]
and then, instead of re-creating df_list each time, you can run the loop with a small difference in the code:
df = df.set_index('ID1')
for row in df_list.itertuples():
    df.loc[row[0], 'AREA'] += df.loc[row[1], 'AREA'].sum()
df = df.reset_index()
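If you want to persist df_list between runs, the same way you keep your SQL tables, a pickle file is one simple option (file name made up):
# save df_list once...
df_list.to_pickle('df_list.pkl')

# ...and on later runs reload it instead of rebuilding it
df_list = pd.read_pickle('df_list.pkl')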
Hope it's faster
Upvotes: 1