Reputation: 1142
This is a subset of a data frame:
drug_id A B C type
lexapro.13 1 SSRI
lexapro.13 1 1 SSRI
lexapro.13 1 SSRI
lexapro.13 1 SSRI
effexor.223 1 SNRI
effexor.223 1 SNRI
cymbalta.18 1 SNRI
cymbalta.18 1 SNRI
As you can see, the drug id is repeated, but the values for A, B, and C differ between rows. I need to group the data by drug_id and then, for each group (for example lexapro.13), set A to 1 if A has the value 1 in any row of that group, and to 0 otherwise. The same rule applies to B and C. The output should look like this:
drug_id A B C type
lexapro.13 1 1 1 SSRI
effexor.223 0 1 1 SNRI
cymbalta.18 1 0 1 SNRI
I think I first need to group the data by the drug_id column using set_index, and then search each group for the value 1 in column A, in column B, and in column C, but I do not know how to do it. Any suggestions?
Upvotes: 1
Views: 316
Reputation: 862761
You can use groupby and aggregate with max, then replace NaNs with fillna, cast to ints with astype, and finally, if you need the column back from the index, add reset_index:
df = df.groupby('drug_id', sort=False).max().fillna(0).astype(int).reset_index()
print (df)
drug_id A B C
0 lexapro.13 1 1 1
1 effexor.223 0 1 1
2 cymbalta.18 1 0 1
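To try the chain end to end, here is a minimal self-contained sketch. The input frame is an assumed reconstruction of the question's data: the exact positions of the 1s are a guess, chosen only to be consistent with the expected output, and the name flags is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data: the cell positions are a
# guess that is merely consistent with the expected output.
df = pd.DataFrame({
    'drug_id': ['lexapro.13'] * 4 + ['effexor.223'] * 2 + ['cymbalta.18'] * 2,
    'A': [1, np.nan, np.nan, np.nan, np.nan, np.nan, 1, np.nan],
    'B': [np.nan, 1, np.nan, 1, 1, np.nan, np.nan, np.nan],
    'C': [np.nan, 1, 1, np.nan, np.nan, 1, np.nan, 1],
})

flags = ['A', 'B', 'C']  # hypothetical name for the indicator columns
out = (df.groupby('drug_id', sort=False)[flags]
         .max()        # NaN is skipped, so a group containing a 1 yields 1
         .fillna(0)    # groups that were entirely NaN become 0
         .astype(int)
         .reset_index())
print(out)
```

With sort=False the groups stay in order of first appearance, so for this assumed input the rows come out in the same order as in the question.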
Another solution uses any, which checks per group and per column whether at least one value is neither zero nor NaN:
df = df.groupby('drug_id', sort=False).any().fillna(0).astype(int).reset_index()
print (df)
drug_id A B C
0 lexapro.13 1 1 1
1 effexor.223 0 1 1
2 cymbalta.18 1 0 1
If you need to check only for values equal to 1 in all columns except drug_id, you can get those column names with difference and then compare with 1 using eq:
cols = df.columns.difference(['drug_id'])
df[cols] = df[cols].eq(1).astype(int)
df = df.groupby('drug_id', sort=False).max().reset_index()
#or
#df = df.groupby('drug_id', sort=False).any().reset_index()
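The difference only shows up when a column holds values other than 1. A small sketch with hypothetical data (the groups x and y and the names plain, flagged and only_ones are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'x' contains a 5 but no 1; group 'y' contains a 1.
df = pd.DataFrame({
    'drug_id': ['x', 'x', 'y', 'y'],
    'A': [5.0, np.nan, 1.0, 3.0],
})

# Plain max keeps the largest value, which is not a 0/1 flag.
plain = df.groupby('drug_id', sort=False)['A'].max()

# Comparing with 1 first turns every cell into a 0/1 indicator
# (NaN compares as False, so it becomes 0).
flagged = df.assign(A=df['A'].eq(1).astype(int))
only_ones = flagged.groupby('drug_id', sort=False)['A'].max()

print(plain)
print(only_ones)
```

Here plain reports 5.0 for x and 3.0 for y, while only_ones reports 0 for x and 1 for y, which is the "contains a 1" flag the question asks for.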
EDIT:
If there is another text column, you need agg to specify an aggregation for each column; otherwise the column is omitted.
import numpy as np
import pandas as pd

d = {'A': [3.0, 1.0, np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
     'type': ['SSRI1', 'SSRI2', 'SSRI3', 'SSRI4', 'SNRI5', 'SNRI6', 'SNRI7', 'SNRI8'],
     'drug_id': ['lexapro.13', 'lexapro.13', 'lexapro.13',
                 'lexapro.13', 'effexor.223', 'effexor.223', 'cymbalta.18', 'cymbalta.18'],
     'B': [np.nan, np.nan, 1.0, 1.0, np.nan, 5.0, 4.0, 1.0],
     'C': [np.nan, 1.0, np.nan, np.nan, 1.0, np.nan, 2.0, np.nan]}
df = pd.DataFrame(d, columns=['drug_id', 'A', 'B', 'C', 'type'])
print (df)
drug_id A B C type
0 lexapro.13 3.0 NaN NaN SSRI1
1 lexapro.13 1.0 NaN 1.0 SSRI2
2 lexapro.13 NaN 1.0 NaN SSRI3
3 lexapro.13 NaN 1.0 NaN SSRI4
4 effexor.223 NaN NaN 1.0 SNRI5
5 effexor.223 NaN 5.0 NaN SNRI6
6 cymbalta.18 NaN 4.0 2.0 SNRI7
7 cymbalta.18 1.0 1.0 NaN SNRI8
Check for values equal to 1:
cols = df.columns.difference(['drug_id', 'type'])
df[cols] = df[cols].eq(1).astype(int)
print (df)
drug_id A B C type
0 lexapro.13 0 0 0 SSRI1
1 lexapro.13 1 0 1 SSRI2
2 lexapro.13 0 1 0 SSRI3
3 lexapro.13 0 1 0 SSRI4
4 effexor.223 0 0 1 SNRI5
5 effexor.223 0 0 0 SNRI6
6 cymbalta.18 0 0 0 SNRI7
7 cymbalta.18 1 1 0 SNRI8
Prepare the dictionary dynamically; the type column needs a different function. Use first for the first value per group, or join to concatenate all values of the group into one string:
d = {x:'max' for x in cols}
d['type'] = 'first'
print (d)
{'A': 'max', 'type': 'first', 'B': 'max', 'C': 'max'}
df1 = df.groupby('drug_id', sort=False).agg(d).reset_index().reindex(columns=df.columns)
print (df1)
drug_id A B C type
0 lexapro.13 1 1 1 SSRI1
1 effexor.223 0 0 1 SNRI5
2 cymbalta.18 1 1 0 SNRI7
d = {x:'max' for x in cols}
d['type'] = ', '.join
print (d)
{'A': 'max', 'type': <built-in method join of str object at 0x000000000B447340>,
'B': 'max', 'C': 'max'}
df2 = df.groupby('drug_id', sort=False).agg(d).reset_index().reindex(columns=df.columns)
print (df2)
drug_id A B C type
0 lexapro.13 1 1 1 SSRI1, SSRI2, SSRI3, SSRI4
1 effexor.223 0 0 1 SNRI5, SNRI6
2 cymbalta.18 1 1 0 SNRI7, SNRI8
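For quick experimentation, the same pattern can be condensed into a self-contained sketch. The miniature frame and the names flag_cols and spec are hypothetical, and the column order is restored with reindex(columns=...), assuming a pandas version that provides it:

```python
import pandas as pd

# Hypothetical miniature of the frame above, already converted to 0/1 flags.
df = pd.DataFrame({
    'drug_id': ['a', 'a', 'b'],
    'A': [0, 1, 0],
    'B': [1, 0, 0],
    'type': ['T1', 'T2', 'T3'],
})

flag_cols = df.columns.difference(['drug_id', 'type'])
spec = {col: 'max' for col in flag_cols}  # numeric flags: any 1 wins
spec['type'] = ', '.join                  # text column: concatenate the values

out = (df.groupby('drug_id', sort=False)
         .agg(spec)
         .reset_index()
         .reindex(columns=df.columns))    # restore the original column order
print(out)
```

Replacing ', '.join with 'first' in spec keeps only the first type per group instead of the joined string.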
Upvotes: 3