Reputation: 181
I have what appears to be a simple problem, that I was not able to find a solution to. Namely, I have a table, where first column contains the list of all available applications, while other columns represent users and the list of applications they have:
I'm trying to convert the table into pandas DataFrame and align matching values on the first column. The desired output should look like this:
import pandas as pd
df = pd.read_excel('U:/Desktop/appdata.xlsx')
df.head(10)
Out[21]:
Applications User 1 User 2 User 3 User 4 User 5
0 App1 App1 App2 App1 App1 App2
1 App2 App3 App3 App2 App3 App3
2 App3 App10 App4 App7 App4 App4
3 App4 NaN App5 App8 App5 App5
4 App5 NaN NaN App10 App6 App6
5 App6 NaN NaN NaN NaN App7
6 App7 NaN NaN NaN NaN App8
7 App8 NaN NaN NaN NaN App9
8 App9 NaN NaN NaN NaN NaN
9 App10 NaN NaN NaN NaN NaN
df[df.apply(lambda x: x['Applications'] == x, axis=1)]
Out[22]:
Applications User 1 User 2 User 3 User 4 User 5
0 App1 App1 NaN App1 App1 NaN
1 App2 NaN NaN App2 NaN NaN
2 App3 NaN NaN NaN NaN NaN
3 App4 NaN NaN NaN NaN NaN
4 App5 NaN NaN NaN NaN NaN
5 App6 NaN NaN NaN NaN NaN
6 App7 NaN NaN NaN NaN NaN
7 App8 NaN NaN NaN NaN NaN
8 App9 NaN NaN NaN NaN NaN
9 App10 NaN NaN NaN NaN NaN
Any help is appreciated. Cheers!
Upvotes: 6
Views: 1795
Reputation: 38510
Here is an approach with some numpy tools. Here, apply
loops through the columns of interest, np.isin
performs a search over your first column (dat.Applications) and returns True if the respective element is contained in the current column. This boolean array is then converted to the respective string in dat.Applications or to NAN if there is no match via np.where
. The results are then assigned back to the original DataFrame.
import numpy as np
dat.iloc[:, 1:] = \
dat.iloc[:, 1:].apply(lambda x : np.where(np.isin(dat.Applications, x),
dat.Applications, np.NAN))
Note that it would work to use pd.np.isin
for instance rather than directly importing numpy, but this seems a bit cleaner to me.
Upvotes: 2
Reputation: 2757
A key observation is that in the output, a value in the column of the application also represents the value for that row (index col == value col)
(df.iloc[:,1:]
.melt()
.dropna()
.pivot(index='value',columns='variable',values='value')
.reindex([f'App{x}' for x in range(1,11)]))
Upvotes: 0