Reputation: 43
I have a csv file dataset that contains 170 columns, the first 5 columns contain unique identifiers (Platform, ID, date, length of call, name). The remaining columns 175 contain binary data covering 10 categories. I want to condense those columns so that the number of columns in my data frame is 15. Including an example below:
import pandas as pd
df1 = pd.DataFrame({'Platform': ['Telephone', 'Chat', 'Text'], 'ID': [1, 2,
3], 'Length': [1545,1532,1511], 'Name': ['andy', 'helen', 'peter'], 'Problem:
A':[0,1,0], 'Problem: B':[1,0,0], 'Problem: C': [0,0,1], 'Solution: A':
[0,1,0], 'Solution: B':[1,0,0], 'Solution: C': [0,0,1]})
The output is:
df.head()
ID Date Length\\
1 2015-10-16 1545
2 2015-10-09 1532
3 2015-10-13 1511
Name Problem: A Problem: B Problem: C Solution: A Solution: B Solution: C
andy 0 1 0 0 1 0
helen 1 0 0 1 0 0
peter 0 0 1 0 0 1
What I want the data frame to look like:
Platform ID Length Name Problem Solution
Telephone 1 1545 andy B B
Chat 2 1532 helen A A
Text 3 1511 peter C C
FYI this is not the full dataframe. There are a total of 170 colums that I would like to transform into 15.
Upvotes: 2
Views: 2321
Reputation: 878
I created this custom function which will serve your purpose. I got the idea from this stackoverflow article
def condenseCols(data,finalCol,*cols):
cols = list(cols)
x = data[cols] # Slice the cols
x = x.idxmax(axis=1)
# x is now a series, holding column name of the max value in the row i.e one of the column from cols
x = x.apply(lambda s : s.split(": ")[1]) # extract only the prefix (A,B,C)
data[finalCol] = x
data = data.drop(cols, axis=1, inplace=True) # Drop the columns : cols
return data
Call this method, by passing the column names you want to condense, along with the final name of column
condenseCols(df1,'Problem','Problem: A','Problem: B','Problem: C')
condenseCols(df1,'Solution','Solution: A','Solution: B','Solution: C')
There are other ways too, to do this, as cited in the article stackoverflow article
Upvotes: 0
Reputation: 402872
You can use groupby
+ apply
with a dot product on the columns ;
df = df.set_index('Name')
df.groupby(df.columns.str.split(':').str[0], axis=1).apply(
lambda x: x.dot(x.columns.str.split(': ').str[1])
)
Problem Solution
Name
andy B B
helen A A
peter C C
Upvotes: 1