Reputation: 79

Editing then concatenating values of several columns into a single one (pandas, python)

I'm looking for a way to use pandas to combine several columns of an Excel sheet (the column names are known in advance) into a single new column, keeping the important information, as in the example below:

input:

ID,tp_c,tp_b,tp_p  
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked

desired output:

ID,tp_all  
0,transportation  
1,cars  
2,boats  
3,cars+boats  
4,boats+planes  
5,cars+boats+planes

The row with ID 0 contains a description of each column's contents. Ideally the code would parse that description row, take the text after the '-', and concatenate those values into the new "tp_all" column.

Upvotes: 1

Answers (3)

EdChum

Reputation: 394399

OK a more dynamic method:

In [63]:
# get a list of the columns
col_list = list(df.columns)
# remove 'ID' column
col_list.remove('ID')
# create a dict as a lookup
col_dict = dict(zip(col_list, [df.iloc[0][col].split(' - ')[1] for col in col_list]))
col_dict
Out[63]:
{'tp_b': 'boats', 'tp_c': 'cars', 'tp_p': 'planes'}
In [64]:
# define a func that tests the value and uses the dict to create our string
def func(x):
    temp = ''
    for col in col_list:
        if x[col] == 'checked':
            if len(temp) == 0:
                temp = col_dict[col]
            else:
                temp = temp + '+' + col_dict[col]
    return temp
df['combined'] = df[1:].apply(func, axis=1)
df
Out[64]:
   ID                   tp_c                    tp_b                     tp_p  \
0   0  transportation - cars  transportation - boats  transportation - planes   
1   1                checked                     NaN                      NaN   
2   2                    NaN                 checked                      NaN   
3   3                checked                 checked                      NaN   
4   4                    NaN                 checked                  checked   
5   5                checked                 checked                  checked   

            combined  
0                NaN  
1               cars  
2              boats  
3         cars+boats  
4       boats+planes  
5  cars+boats+planes  

[6 rows x 5 columns]
In [65]:

# .ix has been removed from pandas; .loc does the same label-based selection here
df = df.loc[1:, ['ID', 'combined']]
df
Out[65]:
   ID           combined
1   1               cars
2   2              boats
3   3         cars+boats
4   4       boats+planes
5   5  cars+boats+planes

[5 rows x 2 columns]
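
The outputs above show NaN where the question's input has '-', which suggests the frame was loaded with '-' treated as missing. As a rough, self-contained sketch of the same lookup idea on a current pandas: the `csv_text` string, the `na_values` choice, and the names `combine` and `out` are my assumptions for illustration, not part of the original answer.

```python
import io
import pandas as pd

# Assumption: the sheet has been exported to CSV text shaped like the question's input.
csv_text = """ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
"""

# Assumption: cells that are exactly "-" are read as missing values (NaN).
df = pd.read_csv(io.StringIO(csv_text), na_values=["-"])

# Every column except ID, mapped to the text after " - " in the description row.
col_list = [c for c in df.columns if c != "ID"]
col_dict = {c: df.loc[0, c].split(" - ")[1] for c in col_list}

def combine(row):
    # Join the labels of the checked columns; NaN never equals 'checked'.
    return "+".join(col_dict[c] for c in col_list if row[c] == "checked")

# Skip the description row (label 0) and keep only ID plus the combined column.
out = df.loc[1:, ["ID"]].copy()
out["combined"] = df.loc[1:].apply(combine, axis=1)
print(out)
```

Using `'+'.join(...)` replaces the manual `temp` bookkeeping in `func`; the result should be the same strings.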

Upvotes: 1

Andy Hayden

Reputation: 375865

This is quite interesting as it's a reverse get_dummies...

I think I would manually munge the column names so that you have a boolean DataFrame:

In [11]: df1  # df == 'checked'
Out[11]:
    cars  boats planes
0
1   True  False  False
2  False   True  False
3   True   True  False
4  False   True   True
5   True   True   True

Now you can use an apply with zip:

In [12]: df1.apply(lambda row: '+'.join([col for col, b in zip(df1.columns, row) if b]),
                   axis=1)
Out[12]:
0
1                 cars
2                boats
3           cars+boats
4         boats+planes
5    cars+boats+planes
dtype: object

Now you just have to tweak the headers, to get the desired csv.

It would be nice if there were a faster, less manual way to do a reverse get_dummies...
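
For what it's worth, here is one way the "manual munging" and the final header tweak might look end to end. This is only a sketch under assumptions: `raw` is a hand-built copy of the question's input, and the names `body`, `labels`, and `result` are made up for the example.

```python
import pandas as pd

# Assumption: a hand-built copy of the question's input; "-" marks an unchecked cell.
raw = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5],
    "tp_c": ["transportation - cars", "checked", "-", "checked", "-", "checked"],
    "tp_b": ["transportation - boats", "-", "checked", "checked", "checked", "checked"],
    "tp_p": ["transportation - planes", "-", "-", "-", "checked", "checked"],
})

body = raw.set_index("ID")
# Row 0 holds the descriptions; keep the part after " - " as the new header.
labels = [str(text).split(" - ")[1] for text in body.loc[0]]

# Boolean frame: True where a cell is 'checked', description row dropped.
df1 = body.drop(index=0).eq("checked")
df1.columns = labels

# The same apply-with-zip as above, then name the column and restore ID.
tp_all = df1.apply(
    lambda row: "+".join(col for col, b in zip(df1.columns, row) if b),
    axis=1,
)
result = tp_all.rename("tp_all").reset_index()
print(result.to_csv(index=False))
```

The frame differs from `df1` above only in that the description row is dropped rather than left blank.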

Upvotes: 3

BrenBarn

Reputation: 251578

Here is one way:

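import pandas

# start with an empty string for every row
newCol = pandas.Series('', index=d.index)
# skip the first (ID) column; the other headers look like "transportation - cars"
for col in d.iloc[:, 1:]:
    name = '+' + col.split('-')[1].strip()
    newCol[d[col] == 'checked'] += name
# drop the leading '+'
newCol = newCol.str.strip('+')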

Then:

>>> newCol
0                 cars
1                boats
2           cars+boats
3         boats+planes
4    cars+boats+planes
dtype: object

You can create a new DataFrame with this column or do what you like with it.

Edit: I see that you have edited your question so that the names of the modes of transportation are now in row 0 instead of in the column headers. It is easier if they're in the column headers (as my answer assumes), and your new column headers don't seem to contain any additional useful information, so you should probably start by just setting the column names to the info from row 0, and deleting row 0.
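
A minimal sketch of that preparation step, assuming `d` is loaded as in the edited question (descriptions in row 0); the sample frame is typed in by hand for the example, and the masked update is spelled out with `.loc`:

```python
import pandas as pd

# Assumption: a hand-built copy of the edited question's input.
d = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4, 5],
    "tp_c": ["transportation - cars", "checked", "-", "checked", "-", "checked"],
    "tp_b": ["transportation - boats", "-", "checked", "checked", "checked", "checked"],
    "tp_p": ["transportation - planes", "-", "-", "-", "checked", "checked"],
})

# Use the descriptions from row 0 as column names, then delete row 0.
d.columns = ["ID"] + list(d.iloc[0, 1:])
d = d.iloc[1:].reset_index(drop=True)

# Same loop as in the answer, now reading the names from the headers.
newCol = pd.Series("", index=d.index)
for col in d.columns[1:]:
    name = "+" + col.split("-")[1].strip()
    mask = d[col] == "checked"
    newCol.loc[mask] = newCol.loc[mask] + name
newCol = newCol.str.strip("+")
print(newCol)
```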

Upvotes: 1
