Alec
Alec

Reputation: 9536

How to reorganize pandas dataframe such that all duplicate values in one column are condensed into one row containing all the info in another column?

Sorry for the unclear title, but I'm having a hard time putting what I want to do into words.

I have a CSV dataset that's not very clear in how it presents information: enter image description here

Basically, each image may be associated with one or more subtypes. I managed to manipulate the dataset in such a way that it's much easier to read:

enter image description here

Now, I'd like to condense each row into one ID, and True/False for each of the subtypes, like so:

ID        Epidural Intraparenchymal Intraventricular Subarachnoid Subdural Any
aec8e68b3 False    False            False            True        False    True

Hopefully I've made myself clear. Can anybody help in this endeavor?

edit: dictionary at the request of @Ch3steR

{'ID': {21: 'aec8e68b3',
  23: 'aec8e68b3',
  57: 'ff7125125',
  59: 'ff7125125',
  115: '9b7d000a2',
  119: '9b7d000a2',
  172: '85f1fa3bd',
  173: '85f1fa3bd',
  181: '4598fa77a',
  182: '4598fa77a',
  184: '4598fa77a',
  185: '4598fa77a',
  208: '4c45dcf36',
  209: '4c45dcf36',
  313: 'a837bb1fc',
  315: 'a837bb1fc',
  317: 'a837bb1fc',
  358: '8927bc6ee',
  359: '8927bc6ee',
  369: 'ffc54be74',
  371: 'ffc54be74',
  439: '945b3f1e4',
  442: '945b3f1e4',
  443: '945b3f1e4',
  483: '35b9fc2cc'},
 'Category': {21: 'subarachnoid',
  23: 'any',
  57: 'subarachnoid',
  59: 'any',
  115: 'intraparenchymal',
  119: 'any',
  172: 'subdural',
  173: 'any',
  181: 'intraparenchymal',
  182: 'intraventricular',
  184: 'subdural',
  185: 'any',
  208: 'subdural',
  209: 'any',
  313: 'intraparenchymal',
  315: 'subarachnoid',
  317: 'any',
  358: 'subdural',
  359: 'any',
  369: 'subarachnoid',
  371: 'any',
  439: 'intraparenchymal',
  442: 'subdural',
  443: 'any',
  483: 'subarachnoid'}}

Upvotes: 0

Views: 232

Answers (2)

Ch3steR
Ch3steR

Reputation: 20669

You can use pd.crosstab here.

df
            ID          Category
21   aec8e68b3      subarachnoid
23   aec8e68b3               any
57   ff7125125      subarachnoid
59   ff7125125               any
115  9b7d000a2  intraparenchymal
119  9b7d000a2               any
172  85f1fa3bd          subdural
173  85f1fa3bd               any
181  4598fa77a  intraparenchymal
182  4598fa77a  intraventricular
184  4598fa77a          subdural
185  4598fa77a               any
208  4c45dcf36          subdural
209  4c45dcf36               any

pd.crosstab(df['ID'], df['Category']).astype(bool)

Category    any  intraparenchymal  intraventricular  subarachnoid  subdural
ID
4598fa77a  True              True              True         False      True
4c45dcf36  True             False             False         False      True
85f1fa3bd  True             False             False         False      True
9b7d000a2  True              True             False         False     False
aec8e68b3  True             False             False          True     False
ff7125125  True             False             False          True     False
  • To remove axis names use _.rename_axis(index=None, columns=None)

OR

Using pd.pivot_table

df.pivot_table(index='ID', columns='Category', fill_value=0, aggfunc='size')

df.assign(val = True).pivot_table(index='ID', columns= 'Category', 
                                  values = 'val',fill_value=False)

Another idea of using df.unstack

df.assign(val = True).set_index(['ID', 'Category']).val.unstack(fill_value= False)

Upvotes: 1

Yogesh
Yogesh

Reputation: 1432

You can use reset index to get a desired format.

pd.crosstab(df['ID'], df['Category']).astype(bool).reset_index()

Upvotes: 0

Related Questions