Reputation: 60
I'm trying to split the values in each row of a column into multiple rows while duplicating the corresponding values of other columns. I'm fairly new to python and trying to figure out a way to implement this solution to a bigger dataset.
This is the input file:
Name Year Subject State
Jack 2003 Math, Sci, Music MA
Sam 2004 Math, PE, Language, Social CA
Nicole 2005 Math, Life Sci, Geography, Music, Computer Sci NY
This is what I want as output:
Name Year Subject State
Jack 2003 Math MA
Jack 2003 Sci MA
Jack 2003 Music MA
Sam 2004 Math CA
Sam 2004 PE CA
Sam 2004 Language CA
Sam 2004 Social CA
Nicole 2005 Math NY
Nicole 2005 Life Sci NY
Nicole 2005 Geography NY
Nicole 2005 Music NY
Nicole 2005 Computer Sci NY
I tried this code:
import pandas as pd
df= pd.read_csv('C:/Users/3216140/Desktop/test.csv', delimiter=',', skiprows = 1, names = ["Name","Year","Subject","State"] )
print(df)
sub = df['Subject'].str.split(',').apply(pd.Series, 1).stack()
sub.index = sub.index.droplevel(-1)
sub.name = 'Subject'
print (sub)
del df['Subject']
df.join(sub)
print(df)
But the join doesn't seem to have worked. I just get the input file without the 'Subject' as output.
Upvotes: 2
Views: 87
Reputation: 402852
You can use np.repeat
and itertools.chain
here.
from itertools import chain
v = df.pop('Subject').str.split(r'\s*,\s*')
df_new = pd.DataFrame(
df.values.repeat(v.str.len(), axis=0),
columns=df.columns
)
df_new['Subject'] = list(itertools.chain.from_iterable(v))
df_new
Name State Year Subject
0 Jack 2003 MA Math
1 Jack 2003 MA Sci
2 Jack 2003 MA Music
3 Sam 2004 CA Math
4 Sam 2004 CA PE
5 Sam 2004 CA Language
6 Sam 2004 CA Social
7 Nicole 2005 NY Math
8 Nicole 2005 NY Life Sci
9 Nicole 2005 NY Geography
10 Nicole 2005 NY Music
11 Nicole 2005 NY Computer Sci
Upvotes: 1