The Great
The Great

Reputation: 7693

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.

Please find the code to generate the input dataframe

df1 = pd.DataFrame({'VARIABLE':['studyid',1,'age_interview', 65,'Gender','1.Male',
                            '2.Female',
                            'Ethnicity','1.Chinese','2.Indian','3.Malay']})

The data looks like as shown below

enter image description here

Please note that I may not know the column names in advance. But it usually follows this format. What I have shown above is a sample data and real data might have around 600-700 columns and data arranged in this fashion

What I would like to do is convert values which start with non-digits(characters) as new columns in dataframe. It can be a new dataframe.

I attempted to write a for loop but failed to due to the below error. Can you please help me achieve this outcome.

for i in range(3,len(df1)):
#str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):

Through the above loop, I was trying to check whether first char is a digit, if yes, then retain it as a value (ex: 1,2,3 etc) and if it's a character (ex:gender, ethnicity etc), then create a new column. But guess this is an incorrect and lengthy approach

For example, in the above example, the columns would be studyid,age_interview,Gender,Ethnicity.

The final output would look like this

enter image description here

Can you please let me know if there is an elegant approach to do this?

Upvotes: 2

Views: 440

Answers (3)

G.G
G.G

Reputation: 765

col1=(~df1.VARIABLE.astype(str).str[0].str.isdigit()).cumsum()
def function1(dd1:pd.DataFrame):
    return dd1.reset_index(drop=1).T

df1.groupby(col1).apply(function1).set_index(0).T.fillna('')

:

┌───────┬─────────┬───────────────┬──────────┬───────────┐
│ index │ studyid │ age_interview │  Gender  │ Ethnicity │
│ int64 │ varchar │    varchar    │ varchar  │  varchar  │
├───────┼─────────┼───────────────┼──────────┼───────────┤
│     1 │ 1       │ 65            │ 1.Male   │ 1.Chinese │
│     2 │         │               │ 2.Female │ 2.Indian  │
│     3 │         │               │          │ 3.Malay   │
└───────┴─────────┴───────────────┴──────────┴───────────┘

Upvotes: 1

anky
anky

Reputation: 75080

You can use groupby to do something like:

m=~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df=(pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list).
                                    values.tolist()).set_index(0).T)
print(new_df.rename_axis(None,axis=1))

  studyid age_interview    Gender  Ethnicity
1       1            65    1.Male  1.Chinese
2    None          None  2.Female   2.Indian
3    None          None      None    3.Malay

Explanation: m is a helper series which helps seperating the groups:

print(m.cumsum())
0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     4
8     4
9     4
10    4

Then we group this helper series and apply list:

df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1                                 [studyid, 1]
2                          [age_interview, 65]
3                   [Gender, 1.Male, 2.Female]
4    [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object

At this point we have each group as a list with the column name as the first entry. So we create a dataframe with this and set the first column as index and transpose to get our desired output.

Upvotes: 1

Chris
Chris

Reputation: 29732

Use itertools.groupby and then construct pd.DataFrame:

import pandas as pd
import itertools

l = ['studyid',1,'age_interview', 65,'Gender','1.Male',
                            '2.Female',
                            'Ethnicity','1.Chinese','2.Indian','3.Malay']
l = list(map(str, l))
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x:x[0].isnumeric())]
d = {k[0]: v for k,v in zip(grouped[::2],grouped[1::2])}

pd.DataFrame.from_dict(d, orient='index').T

Output:

     Gender studyid age_interview  Ethnicity
0    1.Male       1            65  1.Chinese
1  2.Female    None          None   2.Indian
2      None    None          None    3.Malay

Upvotes: 1

Related Questions