Reputation: 79
I'm looking for a way to use pandas and Python to combine several columns of an Excel sheet, all with known column names, into a single new column, keeping all the important information, as in the example below:
input:
ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
desired output:
ID,tp_all
0,transportation
1,cars
2,boats
3,cars+boats
4,boats+planes
5,cars+boats+planes
The row with an ID of 0 contains a description of the contents of each column. Ideally the code would parse the description in that row, take the text after the '-', and concatenate those values in the new "tp_all" column.
Upvotes: 1
Views: 3766
Reputation: 394399
OK a more dynamic method:
In [63]:
# get a list of the columns
col_list = list(df.columns)
# remove 'ID' column
col_list.remove('ID')
# create a dict as a lookup
col_dict = dict(zip(col_list, [df.iloc[0][col].split(' - ')[1] for col in col_list]))
col_dict
Out[63]:
{'tp_b': 'boats', 'tp_c': 'cars', 'tp_p': 'planes'}
In [64]:
# define a func that tests the value and uses the dict to create our string
def func(x):
    temp = ''
    for col in col_list:
        if x[col] == 'checked':
            if len(temp) == 0:
                temp = col_dict[col]
            else:
                temp = temp + '+' + col_dict[col]
    return temp
df['combined'] = df[1:].apply(lambda row: func(row), axis=1)
df
Out[64]:
ID tp_c tp_b tp_p \
0 0 transportation - cars transportation - boats transportation - planes
1 1 checked NaN NaN
2 2 NaN checked NaN
3 3 checked checked NaN
4 4 NaN checked checked
5 5 checked checked checked
combined
0 NaN
1 cars
2 boats
3 cars+boats
4 boats+planes
5 cars+boats+planes
[6 rows x 5 columns]
In [65]:
# .ix has been removed from pandas; .loc does the same label-based selection here
df = df.loc[1:, ['ID', 'combined']]
df
Out[65]:
ID combined
1 1 cars
2 2 boats
3 3 cars+boats
4 4 boats+planes
5 5 cars+boats+planes
[5 rows x 2 columns]
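The transcript above starts from a df in which the bare '-' cells already show up as NaN. A minimal sketch of one way to get such a frame, assuming the sheet has been exported to CSV text (the `raw` string is just the sample from the question; for a real workbook `pd.read_excel` would take the place of `pd.read_csv`):

```python
import io

import pandas as pd

# Sample data copied from the question, used here as a stand-in for the real file.
raw = """ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
"""

# na_values="-" turns cells that are exactly "-" into NaN. The description row
# is left alone because its cells only contain "-" inside a longer string.
df = pd.read_csv(io.StringIO(raw), na_values="-")
print(df)
```

With the frame loaded this way, the `x[col] == 'checked'` test in func is simply False for the NaN cells.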
Upvotes: 1
Reputation: 375865
This is quite interesting as it's a reverse get_dummies
...
I think I would manually munge the column names so that you have a boolean DataFrame:
In [11]: df1 # df == 'checked'
Out[11]:
cars boats planes
0
1 True False False
2 False True False
3 True True False
4 False True True
5 True True True
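For reference, a sketch of one way to build a boolean frame like this from the question's data instead of renaming by hand. It assumes the data is available as CSV text with ID as the index; `raw` and `labels` are illustrative names, not part of the original answer:

```python
import io

import pandas as pd

# Sample data copied from the question.
raw = """ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
"""

df = pd.read_csv(io.StringIO(raw), index_col="ID")

# Row 0 holds "transportation - cars" etc.; keep the part after " - " as the label.
labels = df.loc[0].str.split(" - ").str[1]

# Drop the description row, test every cell against "checked",
# and rename tp_c/tp_b/tp_p to cars/boats/planes.
df1 = df.drop(index=0).eq("checked").rename(columns=labels.to_dict())
print(df1)
```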
Now you can use an apply with zip:
In [12]: df1.apply(lambda row: '+'.join([col for col, b in zip(df1.columns, row) if b]),
axis=1)
Out[12]:
0
1 cars
2 boats
3 cars+boats
4 boats+planes
5 cars+boats+planes
dtype: object
Now you just have to tweak the headers, to get the desired csv.
It would be nice if there were a less manual, faster way to do a reverse get_dummies
...
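One candidate for a less manual version, offered as a sketch rather than a built-in reverse of get_dummies: take the dot product of the boolean frame with its own column labels. It relies on True * 'cars+' being 'cars+' and False * 'cars+' being the empty string, so each row sums to the concatenation of its checked labels. The frame below is typed in by hand to match df1 above.

```python
import pandas as pd

# Boolean frame equivalent to df1 above, entered by hand for this sketch.
df1 = pd.DataFrame(
    {
        "cars": [True, False, True, False, True],
        "boats": [False, True, True, True, True],
        "planes": [False, False, False, True, True],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

# Each label gets a trailing "+"; the dot product concatenates the labels of
# the True cells in column order, and rstrip removes the final "+".
combined = df1.dot(df1.columns + "+").str.rstrip("+")
print(combined)
```

This avoids the row-wise apply, although it still does Python-level string concatenation under the hood.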
Upvotes: 3
Reputation: 251578
Here is one way:
import pandas
newCol = pandas.Series('', index=d.index)
for col in d.iloc[:, 1:]:  # every column after ID; .ix has been removed from pandas
    name = '+' + col.split('-')[1].strip()
    newCol[d[col] == 'checked'] += name
newCol = newCol.str.strip('+')
Then:
>>> newCol
0 cars
1 boats
2 cars+boats
3 boats+planes
4 cars+boats+planes
dtype: object
You can create a new DataFrame with this column or do what you like with it.
Edit: I see that you have edited your question so that the names of the modes of transportation are now in row 0 instead of in the column headers. It is easier if they're in the column headers (as my answer assumes), and your new column headers don't seem to contain any additional useful information. You should probably start by setting the column names to the values from row 0 and then deleting row 0.
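A sketch of that preliminary step, assuming the edited layout from the question has been read into d with default settings (the `raw` string is the question's sample, used as a stand-in for the real sheet):

```python
import io

import pandas as pd

# Sample data copied from the edited question.
raw = """ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
"""

d = pd.read_csv(io.StringIO(raw))

# Promote the descriptions in row 0 to column headers, keeping "ID" as it is.
d.columns = ["ID"] + d.iloc[0, 1:].tolist()

# Drop the description row and renumber the remaining rows from 0.
d = d.iloc[1:].reset_index(drop=True)
print(d)
```

After this, the headers read 'transportation - cars' and so on, which is the form the loop above splits on.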
Upvotes: 1