this_is_david
this_is_david

Reputation: 135

DataFrame transformation in Python Pandas

I am trying to transform a Pandas DataFrame into a new one with every item from a certain column given its own row. For example:

Before:

   ID             Name        Date   Location
0   0       John, Dave  01/01/1992     Mexico
1   1              Tim  07/07/1997  Australia
2   2       Mike, John  12/24/2012     Zambia
3   3  Bob, Rick, Tony  05/17/2007       Cuba
4   4            Roger  04/05/2000    Iceland
5   5           Carlos  05/24/1995       Guam

Current Solution:

new_df = pd.DataFrame(columns = df.columns)
for index,row in df.iterrows():
    new_row = pd.DataFrame(df.loc[index]).transpose()
    target_info = df.loc[index,'Name']
    if (len(target_info.split(',')) > 1):
        for item in target_info.split(','):
            new_row.loc[index,'Name'] = item
           new_df = new_df.append(new_row)
    else:
        new_df = new_df.append(new_row)  

After:

  ID    Name        Date   Location
0  0    John  01/01/1992     Mexico
0  0    Dave  01/01/1992     Mexico
1  1     Tim  07/07/1997  Australia
2  2    Mike  12/24/2012     Zambia
2  2    John  12/24/2012     Zambia
3  3     Bob  05/17/2007       Cuba
3  3    Rick  05/17/2007       Cuba
3  3    Tony  05/17/2007       Cuba
4  4   Roger  04/05/2000    Iceland
5  5  Carlos  05/24/1995       Guam

Surely there is something more elegant?

Upvotes: 4

Views: 260

Answers (2)

root
root

Reputation: 33773

You could get the split names as a Series, drop your existing Name column, then join the split names.

# Split the 'Name' column as a Series, setting the appropriate name and index.
split_names = df['Name'].str.split(', ', expand=True).stack()
split_names.name = 'Name'
split_names.index = split_names.index.get_level_values(0)

# Drop the existing 'Name' column, and join the split names.
df.drop('Name', axis=1, inplace=True)
df = df.join(split_names)

The resulting output is the same as in your example, but with the Name column last. You can reorder the columns if you want the original order.

   ID        Date   Location    Name
0   0  01/01/1992     Mexico    John
0   0  01/01/1992     Mexico    Dave
1   1  07/07/1997  Australia     Tim
2   2  12/24/2012     Zambia    Mike
2   2  12/24/2012     Zambia    John
3   3  05/17/2007       Cuba     Bob
3   3  05/17/2007       Cuba    Rick
3   3  05/17/2007       Cuba    Tony
4   4  04/05/2000    Iceland   Roger
5   5  05/24/1995       Guam  Carlos

Upvotes: 2

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210822

you can do it this way:

nm = df.Name.str.split(',\s*', expand=True)
cols=list(set(df.columns) - set(['Name']))

pd.melt(df[cols].join(nm),
        id_vars=cols,
        value_vars=nm.columns.tolist(),
        value_name='Name') \
  .dropna() \
  .drop(['variable'], axis=1) \
  .sort_values('ID')

Step by step:

In [128]: nm = df.Name.str.split(',\s*', expand=True)

In [129]: nm
Out[129]:
        0     1     2
0    John  Dave  None
1     Tim  None  None
2    Mike  John  None
3     Bob  Rick  Tony
4   Roger  None  None
5  Carlos  None  None

In [130]: cols=list(set(df.columns) - set(['Name']))

In [131]: cols
Out[131]: ['Date', 'ID', 'Location']

In [133]: pd.melt(df[cols].join(nm),
   .....:         id_vars=cols,
   .....:         value_vars=nm.columns.tolist(),
   .....:         value_name='Name') \
   .....:   .dropna() \
   .....:   .drop(['variable'], axis=1) \
   .....:   .sort_values('ID')
Out[133]:
          Date  ID   Location    Name
0   01/01/1992   0     Mexico    John
6   01/01/1992   0     Mexico    Dave
1   07/07/1997   1  Australia     Tim
2   12/24/2012   2     Zambia    Mike
8   12/24/2012   2     Zambia    John
3   05/17/2007   3       Cuba     Bob
9   05/17/2007   3       Cuba    Rick
15  05/17/2007   3       Cuba    Tony
4   04/05/2000   4    Iceland   Roger
5   05/24/1995   5       Guam  Carlos

Upvotes: 1

Related Questions