REFER
REFER

Reputation: 43

Split Multiple Values into New Rows

I have a dataframe where a few columns may have multiple values in a single observation. Each observation in these rows has a "/" at the end of the observation, regardless of whether or not there are multiple. This means that some of the values look like this: 'OneThing/' while others like this: 'OneThing/AnotherThing/'

I need to take the values where there is more than one value in an observation and split them into individual rows.

This is a general example of what the dataframe looks like before:

ID  Date   Name ColA   ColB   Col_of_Int                        ColC   ColD
1   09/12  Ann  String String OneThing/                         String String
2   09/13  Pete String String OneThing/AnotherThing             String String
3   09/13  Ann  String String OneThing/AnotherThing/ThirdThing/ String String
4   09/12  Pete String String OneThing/                         String String

What I want the output to be:

ID  Date   Name ColA   ColB   Col_of_Int                        ColC   ColD
1   09/12  Ann  String String OneThing                         String String
2   09/13  Pete String String OneThing                         String String
2   09/13  Pete String String Another Thing                    String String
3   09/13  Ann  String String OneThing                         String String
3   09/13  Ann  String String AnotherThing                     String String
3   09/13  Ann  String String ThirdThing                       String String
4   09/12  Pete String String OneThing/                        String String

I've tried the following:

df = df[df['Column1'].str.contains('/')]
df_split = df[df['Column1'].str.contains('/')]
df1 = df_split.copy()
df2 = df_split.copy()

split_cols = ['Column1']

for c in split_cols:
    df1[c] = df1[c].apply(lambda x: x.split('/')[0])
    df2[c] = df2[c].apply(lambda x: x.split('/')[1])

new_rows = df1.append(df2)
df.drop(df_split.index, inplace=True)
df = df.append(new_rows, ignore_index=True)

This works, but I think it is creating new rows after every '/', which means that one new row is being created for every observation with only one value (where I want zero new rows), and two new rows are being created for every observation with two values (only need one), etc.

This is particularly frustrating where there are three or more values in an observation because I am getting several unnecessary rows.

Is there any way to fix this so that only observations with more than one get added to new rows?

Upvotes: 1

Views: 269

Answers (1)

sacuL
sacuL

Reputation: 51395

Your method would work (I think) if you use df['column_of_interest'] = df['column_of_interest'].str.rstrip('/'), as it would get rid of that annoying / at the end of your observations. However, the loop is inneficient, and the way you have it, requires that you know how many observations you maximally have in your column. Here is another way, which I think achieves what you need:

Take this example df:

df = pd.DataFrame({'column_of_interest':['onething/', 
                                         'onething/twothings/', 
                                         'onething/twothings/threethings/'], 
                   'values1': [1,2,3], 
                   'values2': [5,6,7]})

>>> df
                column_of_interest  values1  values2
0                        onething/        1        5
1              onething/twothings/        2        6
2  onething/twothings/threethings/        3        7

This gets a bit messy because your want to presumably keep the data that is in the columns outside column_of_interest. So, you can temporarily find those and cast those aside, using:

value_columns = [i for i in df.columns if i != 'column_of_interest']

And put them in the index for the following manipulation (which restores them at the end):

new_df = (df.set_index(value_columns)
          .column_of_interest.str.rstrip('/')
          .str.split('/')
          .apply(pd.Series)
          .stack()
          .rename('new_column_of_interest')
          .reset_index(value_columns))

And your new_df then looks like:

>>> new_df
   values1  values2 new_column_of_interest
0        1        5               onething
0        2        6               onething
1        2        6              twothings
0        3        7               onething
1        3        7              twothings
2        3        7            threethings

Or alternatively, using merge:

new_df = (df[value_columns].merge(df.column_of_interest
                        .str.rstrip('/')
                        .str.split('/')
                        .apply(pd.Series)
                        .stack()
                        .reset_index(1, drop=True)
                        .to_frame('new_column_of_interest'),
                        left_index=True, right_index=True))

EDIT: On the dataframe you posted, this results in:

   ID   Date  Name    ColA    ColB    ColC    ColD new_column_of_interest
0   1  09/12   Ann  String  String  String  String               OneThing
0   2  09/13  Pete  String  String  String  String               OneThing
1   2  09/13  Pete  String  String  String  String           AnotherThing
0   3  09/13   Ann  String  String  String  String               OneThing
1   3  09/13   Ann  String  String  String  String           AnotherThing
2   3  09/13   Ann  String  String  String  String             ThirdThing
0   4  09/12  Pete  String  String  String  String               OneThing

Upvotes: 1

Related Questions