Pouyan Ebrahimi
Pouyan Ebrahimi

Reputation: 99

combining different columns

I have a dataframe like:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                'vote':[5,4,5,1,10,1,9],
                'doggo': [None,"doggo",None,None,"doggo",None,None], 
                'floofer': ["floofer",None,None,"floofer",None,None,None],
                'pupper': [None,None,"pupper",None,None,None,None],
               'puppo':[None,None,None,None,None,None,"puppo"]})

I want to combine last 4 columns and gnerate:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                    'vote':[5,4,5,1,10,1,9],
                    'categories': ["floofer","doggo","pupper","floofer","doggo",None,"puppo"]})

Any guidance is appreciated.

Upvotes: 3

Views: 97

Answers (3)

jezrael
jezrael

Reputation: 862511

Solution if each row has only one not None value per categorical columns:

cols = ['doggo','floofer','pupper','puppo']
cols1 = df.columns.difference(cols)
df2 = df[cols1].join(df[cols].ffill(axis=1).iloc[:, -1].rename('Categories'))
print (df2)
   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1       None
6   7     9      puppo

Explanation:

First select only columns with catagorical data and forward fill missing values - expected data are in last column:

print (df[cols].ffill(axis=1))
  doggo  floofer   pupper    puppo
0   None  floofer  floofer  floofer
1  doggo    doggo    doggo    doggo
2   None     None   pupper   pupper
3   None  floofer  floofer  floofer
4  doggo    doggo    doggo    doggo
5   None     None     None     None
6   None     None     None    puppo

Select last column by position:

print (df[cols].ffill(axis=1).iloc[:, -1])
0    floofer
1      doggo
2     pupper
3    floofer
4      doggo
5       None
6      puppo
Name: puppo, dtype: object

Solution if multiple values - data are created from columns names of categorical columns:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                'vote':[5,4,5,1,10,1,9],
                'doggo': [None,"doggo1",None,"doggo2","doggo3",None,None], 
                'floofer': ["floofer1",None,None,"floofer2",None,None,None],
                'pupper': [None,None,"pupper1",None,None,None,None],
               'puppo':["puppo1",None,None,None,None,None,"puppo2"]})
print (df)
   id  vote   doggo   floofer   pupper   puppo
0   1     5    None  floofer1     None  puppo1
1   2     4  doggo1      None     None    None
2   3     5    None      None  pupper1    None
3   4     1  doggo2  floofer2     None    None
4   5    10  doggo3      None     None    None
5   6     1    None      None     None    None
6   7     9    None      None     None  puppo2


s = (df[cols].notnull()
            .dot(pd.Index(cols) + ', ')
            .str.strip(', ')
            .rename('Categories')
            .replace('', np.nan)
            )
df = df[cols1].join(s)
print (df)
   id  vote      Categories
0   1     5  floofer, puppo
1   2     4           doggo
2   3     5          pupper
3   4     1  doggo, floofer
4   5    10           doggo
5   6     1             NaN
6   7     9           puppo

Another solution, expected output is not from columns names:

s = pd.Series(df[cols].add(', ').fillna('').values.sum(axis=1), 
                  index=df.index, name='Categories').str.strip(', ')
df = df[cols1].join(s)
print (df)
   id  vote        Categories
0   1     5  floofer1, puppo1
1   2     4            doggo1
2   3     5           pupper1
3   4     1  doggo2, floofer2
4   5    10            doggo3
5   6     1                  
6   7     9            puppo2

Upvotes: 1

lightalchemist
lightalchemist

Reputation: 10211

We can exploit the fact that x or None is x and use Numpy's logical_or operator reduced/applied along each row to get the category.

import numpy as np

cols = ['doggo','floofer','pupper','puppo']
categories = np.logical_or.reduce(df[cols], axis=1)
df = df.assign(categories=categories).drop(cols, axis=1)

Upvotes: 0

cs95
cs95

Reputation: 402303

bfill + iloc

You can bfill (backfill) and select the first column:

(df.set_index(['id', 'vote'])
   .bfill(axis=1)
   .iloc[:, 0]
   .reset_index(name='Categories'))

   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1       None
6   7     9      puppo

stack + reindex

cats = (df.drop(['id', 'vote'], 1).stack()
          .reset_index(level=1, drop=True).reindex(df.index))
pd.DataFrame(dict(id=df.id, vote=df.vote, Categories=cats))

    
   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1        NaN
6   7     9      puppo

last_valid_index

Slow, but succinct.

(df.set_index(['id', 'vote'])
   .agg(lambda x: x.last_valid_index(), axis=1)
   .reset_index(name='Categories'))

   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1       None
6   7     9      puppo

Assuming "id" and "vote" are the only non-categorical columns.

Upvotes: 1

Related Questions