Reputation: 135
I am trying to transform a Pandas DataFrame into a new one with every item from a certain column given its own row. For example:
Before:
ID Name Date Location
0 0 John, Dave 01/01/1992 Mexico
1 1 Tim 07/07/1997 Australia
2 2 Mike, John 12/24/2012 Zambia
3 3 Bob, Rick, Tony 05/17/2007 Cuba
4 4 Roger 04/05/2000 Iceland
5 5 Carlos 05/24/1995 Guam
Current Solution:
new_df = pd.DataFrame(columns = df.columns)
for index,row in df.iterrows():
new_row = pd.DataFrame(df.loc[index]).transpose()
target_info = df.loc[index,'Name']
if (len(target_info.split(',')) > 1):
for item in target_info.split(','):
new_row.loc[index,'Name'] = item
new_df = new_df.append(new_row)
else:
new_df = new_df.append(new_row)
After:
ID Name Date Location
0 0 John 01/01/1992 Mexico
0 0 Dave 01/01/1992 Mexico
1 1 Tim 07/07/1997 Australia
2 2 Mike 12/24/2012 Zambia
2 2 John 12/24/2012 Zambia
3 3 Bob 05/17/2007 Cuba
3 3 Rick 05/17/2007 Cuba
3 3 Tony 05/17/2007 Cuba
4 4 Roger 04/05/2000 Iceland
5 5 Carlos 05/24/1995 Guam
Surely there is something more elegant?
Upvotes: 4
Views: 260
Reputation: 33773
You could get the split names as a Series, drop your existing Name column, then join the split names.
# Split the 'Name' column as a Series, setting the appropriate name and index.
split_names = df['Name'].str.split(', ', expand=True).stack()
split_names.name = 'Name'
split_names.index = split_names.index.get_level_values(0)
# Drop the existing 'Name' column, and join the split names.
df.drop('Name', axis=1, inplace=True)
df = df.join(split_names)
The resulting output is the same as in your example, but with the Name column last. You can reorder the columns if you want the original order.
ID Date Location Name
0 0 01/01/1992 Mexico John
0 0 01/01/1992 Mexico Dave
1 1 07/07/1997 Australia Tim
2 2 12/24/2012 Zambia Mike
2 2 12/24/2012 Zambia John
3 3 05/17/2007 Cuba Bob
3 3 05/17/2007 Cuba Rick
3 3 05/17/2007 Cuba Tony
4 4 04/05/2000 Iceland Roger
5 5 05/24/1995 Guam Carlos
Upvotes: 2
Reputation: 210822
you can do it this way:
nm = df.Name.str.split(',\s*', expand=True)
cols=list(set(df.columns) - set(['Name']))
pd.melt(df[cols].join(nm),
id_vars=cols,
value_vars=nm.columns.tolist(),
value_name='Name') \
.dropna() \
.drop(['variable'], axis=1) \
.sort_values('ID')
Step by step:
In [128]: nm = df.Name.str.split(',\s*', expand=True)
In [129]: nm
Out[129]:
0 1 2
0 John Dave None
1 Tim None None
2 Mike John None
3 Bob Rick Tony
4 Roger None None
5 Carlos None None
In [130]: cols=list(set(df.columns) - set(['Name']))
In [131]: cols
Out[131]: ['Date', 'ID', 'Location']
In [133]: pd.melt(df[cols].join(nm),
.....: id_vars=cols,
.....: value_vars=nm.columns.tolist(),
.....: value_name='Name') \
.....: .dropna() \
.....: .drop(['variable'], axis=1) \
.....: .sort_values('ID')
Out[133]:
Date ID Location Name
0 01/01/1992 0 Mexico John
6 01/01/1992 0 Mexico Dave
1 07/07/1997 1 Australia Tim
2 12/24/2012 2 Zambia Mike
8 12/24/2012 2 Zambia John
3 05/17/2007 3 Cuba Bob
9 05/17/2007 3 Cuba Rick
15 05/17/2007 3 Cuba Tony
4 04/05/2000 4 Iceland Roger
5 05/24/1995 5 Guam Carlos
Upvotes: 1