Reputation: 43
I have a dataframe where a few columns may have multiple values in a single observation. Each observation in these rows has a "/" at the end of the observation, regardless of whether or not there are multiple. This means that some of the values look like this: 'OneThing/' while others like this: 'OneThing/AnotherThing/'
I need to take the values where there is more than one value in an observation and split them into individual rows.
This is a general example of what the dataframe looks like before:
ID Date Name ColA ColB Col_of_Int ColC ColD
1 09/12 Ann String String OneThing/ String String
2 09/13 Pete String String OneThing/AnotherThing String String
3 09/13 Ann String String OneThing/AnotherThing/ThirdThing/ String String
4 09/12 Pete String String OneThing/ String String
What I want the output to be:
ID Date Name ColA ColB Col_of_Int ColC ColD
1 09/12 Ann String String OneThing String String
2 09/13 Pete String String OneThing String String
2 09/13 Pete String String Another Thing String String
3 09/13 Ann String String OneThing String String
3 09/13 Ann String String AnotherThing String String
3 09/13 Ann String String ThirdThing String String
4 09/12 Pete String String OneThing/ String String
I've tried the following:
df = df[df['Column1'].str.contains('/')]
df_split = df[df['Column1'].str.contains('/')]
df1 = df_split.copy()
df2 = df_split.copy()
split_cols = ['Column1']
for c in split_cols:
df1[c] = df1[c].apply(lambda x: x.split('/')[0])
df2[c] = df2[c].apply(lambda x: x.split('/')[1])
new_rows = df1.append(df2)
df.drop(df_split.index, inplace=True)
df = df.append(new_rows, ignore_index=True)
This works, but I think it is creating new rows after every '/', which means that one new row is being created for every observation with only one value (where I want zero new rows), and two new rows are being created for every observation with two values (only need one), etc.
This is particularly frustrating where there are three or more values in an observation because I am getting several unnecessary rows.
Is there any way to fix this so that only observations with more than one get added to new rows?
Upvotes: 1
Views: 269
Reputation: 51395
Your method would work (I think) if you use df['column_of_interest'] = df['column_of_interest'].str.rstrip('/')
, as it would get rid of that annoying /
at the end of your observations. However, the loop is inneficient, and the way you have it, requires that you know how many observations you maximally have in your column. Here is another way, which I think achieves what you need:
Take this example df
:
df = pd.DataFrame({'column_of_interest':['onething/',
'onething/twothings/',
'onething/twothings/threethings/'],
'values1': [1,2,3],
'values2': [5,6,7]})
>>> df
column_of_interest values1 values2
0 onething/ 1 5
1 onething/twothings/ 2 6
2 onething/twothings/threethings/ 3 7
This gets a bit messy because your want to presumably keep the data that is in the columns outside column_of_interest
. So, you can temporarily find those and cast those aside, using:
value_columns = [i for i in df.columns if i != 'column_of_interest']
And put them in the index for the following manipulation (which restores them at the end):
new_df = (df.set_index(value_columns)
.column_of_interest.str.rstrip('/')
.str.split('/')
.apply(pd.Series)
.stack()
.rename('new_column_of_interest')
.reset_index(value_columns))
And your new_df
then looks like:
>>> new_df
values1 values2 new_column_of_interest
0 1 5 onething
0 2 6 onething
1 2 6 twothings
0 3 7 onething
1 3 7 twothings
2 3 7 threethings
Or alternatively, using merge
:
new_df = (df[value_columns].merge(df.column_of_interest
.str.rstrip('/')
.str.split('/')
.apply(pd.Series)
.stack()
.reset_index(1, drop=True)
.to_frame('new_column_of_interest'),
left_index=True, right_index=True))
EDIT: On the dataframe you posted, this results in:
ID Date Name ColA ColB ColC ColD new_column_of_interest
0 1 09/12 Ann String String String String OneThing
0 2 09/13 Pete String String String String OneThing
1 2 09/13 Pete String String String String AnotherThing
0 3 09/13 Ann String String String String OneThing
1 3 09/13 Ann String String String String AnotherThing
2 3 09/13 Ann String String String String ThirdThing
0 4 09/12 Pete String String String String OneThing
Upvotes: 1