Reputation: 21
I have one dataframe that contains hundreds of columns as clients ID and one row with the total nr of tickets per client ID, it looks like this: (df1 is a result of several transformation of the original csv file)
df1
+-----+----+-----+
| 30 | 5 | 100 |
+-----+----+-----+
| 122 | 40 | 13 |
+-----+----+-----+
And another dataframe that has 2 columns, one account_id and client_id, looks like this:
df2
+------------+-----------+
| account_id | client_id |
+------------+-----------+
| 4char | 4 |
+------------+-----------+
| 3char | 5 |
+------------+-----------+
| 2char | 30 |
+------------+-----------+
| 16char | 9 |
+------------+-----------+
| 17char | 100 |
+------------+-----------+
I want to have a single file that contains 3 columns account_id, client_id and total_tickets, that will look like this:
df
+------------+-----------+---------------+
| account_id | client_id | total_tickets |
+------------+-----------+---------------+
| 4char | 4 | null
+------------+-----------+---------------+
| 3char | 5 | 40
+------------+-----------+---------------+
| 2char | 30 | 122
+------------+-----------+---------------+
| 16char | 9 | null
+------------+-----------+---------------+
| 17char | 100 | 13
+------------+-----------+---------------+
Until now I have reached at this point: I have created a function that iterrows() on both dataframes, check with isin() function if the client_id of df2 is found in the columns of df1, next I add a new column total_tickets with assign() function on df2
f1 = df1, f2 = df2
def populating_df(f1, f2):
for org_nr in f2.iterrows():
for col in f1.iterrows():
matched_org_nr = f2.client_id.isin(f1.columns)
if matched_org_nr.any() == True:
sum_of_tickets_per_col = matched_org_nr
# create a new column in f2 file with the values of total_tickets for each org number matched
f2 = f2.loc[:].assign(Total_Tickets=sum_of_tickets_per_col)
return f2
and I get as a result this table:
+------------+-----------+---------------+
| account_id | client_id | total_tickets |
+------------+-----------+---------------+
| 4char | 4 |False
+------------+-----------+---------------+
| 3char | 5 | True
+------------+-----------+---------------+
| 2char | 30 | True
+------------+-----------+---------------+
| 16char | 9 | False
+------------+-----------+---------------+
| 17char | 100 | True
+------------+-----------+---------------+
I would be happy if anyone has any suggestion on how to solve this problem
Upvotes: 1
Views: 147
Reputation: 148890
merge
is the key, but you have to first transpose the initial data frame, and do some cosmetic changes such as reset its index and provide relevant column names:
The transformation could be:
df1.rename({0: 'total_tickets'}).T.rename_axis('client_id').reset_index()
giving:
client_index total_tickets
0 30 122
1 5 40
2 100 13
Once this is done, the merge is trivial:
result = df2.merge(df1.rename({0: 'total_tickets'}).T.rename_axis('client_id').reset_index(),
on='client_id', how='left')
giving as expected:
account_id client_id total_tickets
0 4char 4 NaN
1 3char 5 40.0
2 2char 30 122.0
3 16char 9 NaN
4 17char 100 13.0
Upvotes: 1
Reputation: 23099
First we need to melt df1 so you have an observation for each row
you can then merge using outer to the get the keys from both columns
df_melt = pd.melt(df1,var_name='client_id',value_name='total_tickets')
df3 = pd.merge(df_melt ,df2,on=['client_id'],how='outer')
#make sure dtypes are the same.
#df_melt ['client_id'] = df_melt ['client_id'].astype(int)
df3 = df3[["account_id", "client_id", "total_tickets"]].sort_values(
"account_id", ascending=False
)
print(df3)
account_id client_id total_tickets
3 4char 4 NaN
1 3char 5 40.0
0 2char 30 122.0
2 17char 100 13.0
4 16char 9 NaN
Upvotes: 1
Reputation: 743
You can use pd.merge
.
df = pd.merge(df1, df2, on="client_id", how='outer')
Upvotes: 0