nadavb
nadavb

Reputation: 128

Python 3 match values based on column name similarity

I have a dataframe of the following form:

Year 1 Grade Year 2 Grade Year 3 Grade Year 4 Grade Year 1 Students Year 2 Students Year 3 Students Year 4 Students
60 70 80 100 20 32 18 25

I would like to somehow transpose this table to the following format:

Year Grade Students
1 60 20
2 70 32
3 80 18
4 100 25

I created a list of years and initiated a new dataframe with the "year" column. I was thinking of matching the year integer to the column name containing it in the original DF, match and assign the correct value, but got stuck there.

Upvotes: 4

Views: 281

Answers (3)

mozway
mozway

Reputation: 260420

You need a manual reshaping using a split of the Index into a MultiIndex:

out = (df
 .set_axis(df.columns.str.split(expand=True), axis=1) # make MultiIndex
 .iloc[0]                                             # select row as Series
 .unstack()                                           # unstack Grade/Students
 .droplevel(0)                                        # remove literal "Year"
 .rename_axis('Year')                                 # set index name
 .reset_index()                                       # index to column
)

output:

  Year  Grade  Students
0    1     60        20
1    2     70        32
2    3     80        18
3    4    100        25

Or using pivot_longer from janitor:

# pip install pyjanitor

import janitor

out = (df.pivot_longer(
           names_to = ('ignore', 'Year', '.value'), 
           names_sep = ' ')
         .drop(columns='ignore')
       )

out 
  Year  Grade  Students
0    1     60        20
1    2     70        32
2    3     80        18
3    4    100        25

The .value determines which parts of the sub labels in the columns are retained; the labels are split apart by names_sep, which can be a string or a regex. Another option is by using a regex, with names_pattern to split and reshape the columns:

df.pivot_longer(names_to = ('Year', '.value'), 
                names_pattern = r'.+(\d)\s(.+)')
  Year  Grade  Students
0    1     60        20
1    2     70        32
2    3     80        18
3    4    100        25

Upvotes: 2

Jred
Jred

Reputation: 315

Here's one way to do it. Feel free to ask questions about how it works.

import pandas as pd
cols = ["Year 1 Grade", "Year 2 Grade", "Year 3 Grade" , "Year 4 Grade",
        "Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"]
vals = [60,70,80,100,20,32,18,25]
vals = [[v] for v in vals]
df = pd.DataFrame({k:v for k,v in zip(cols,vals)})

grades = df.filter(like="Grade").T.reset_index(drop=True).rename(columns={0:"Grades"})
students = df.filter(like="Student").T.reset_index(drop=True).rename(columns={0:"Students"})
pd.concat([grades,students], axis=1)

Upvotes: 1

Nuri Taş
Nuri Taş

Reputation: 3845

I came up with these. grades here are your first row.

df = pd.DataFrame(grades) # your dataframe without columns

grades = np.array(df.iloc[0]).reshape(4,2) # extract the first row, turn it into an array and reshape it to 4*2
new_df = pd.DataFrame(grades).reset_index()

new_df.columns = ['Year', 'Grade', 'Students'] # rename the columns

Upvotes: 0

Related Questions