Reputation: 534
One column of my dataframe has a variable number of \n characters inside its content, and I need each line to be on its own row in the final dataframe.
This is a minimal example:
df = pd.DataFrame({'a': ['x', 'y'], 'b':['line 1\nline 2\nline 3', 'line 1' ]})
That produces this starting dataframe:
a b
0 x line 1\nline 2\nline 3
1 y line 1
I want it to become like this one:
a b
0 x line 1
1 x line 2
2 x line 3
3 y line 1
I've seen there is a built-in function, str.extract, that converts each captured pattern to a new column. For example, this is what I tried:
df['b'].str.extract(pat='(.*)\n(.*)', expand=True)
That produces a somewhat interesting output:
0 1
0 line 1 line 2
1 NaN NaN
But this is not a viable solution: the data is split over columns rather than rows, not all patterns matched, and it's not clear how to put the result back into the original dataframe in place and in order. The order of the entries must be preserved, although the dataframe index need not be.
In order to capture all the patterns, it would be possible to do this:
df['b'].transform(lambda x: x.split('\n'))
That yields this output:
0 [line 1, line 2, line 3]
1 [line 1]
But again, I don't see a way to make progress from this to the desired state.
Upvotes: 1
Views: 41
Reputation: 25259
Try using str.split and explode:
df = df.set_index('a').b.str.split('\\n').explode().reset_index()
Out[153]:
a b
0 x line 1
1 x line 2
2 x line 3
3 y line 1
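As a self-contained sketch of the same idea (assuming pandas 0.25 or later), DataFrame.explode can also be applied without moving a into the index, which keeps any other columns of the frame alongside it. The name result is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'],
                   'b': ['line 1\nline 2\nline 3', 'line 1']})

# Split each cell of 'b' on the newline character into a list of lines,
# then give every list element its own row; other columns are repeated.
result = (
    df.assign(b=df['b'].str.split('\n'))
      .explode('b')
      .reset_index(drop=True)  # discard the duplicated original index
)

print(result)
```

Because explode repeats the original index label for every element, reset_index(drop=True) is what produces the fresh 0..n-1 index; the row order of the input is preserved.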
For pandas < 0.25, where explode is not available:
df = (df.set_index('a').b.str.split('\\n', expand=True).stack()
.droplevel(1).reset_index(name='b'))
Out[174]:
a b
0 x line 1
1 x line 2
2 x line 3
3 y line 1
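If neither explode nor the stack approach is convenient, the same reshaping can be sketched by hand. This is not the method from the answer above, just an alternative under the assumption that only columns a and b matter: repeat each value of a once per line with NumPy, and flatten the split lists in order. The names parts, lengths, and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'],
                   'b': ['line 1\nline 2\nline 3', 'line 1']})

# One list of lines per original row.
parts = df['b'].str.split('\n')
# Number of lines in each row, used as repeat counts for column 'a'.
lengths = parts.str.len()

result = pd.DataFrame({
    # Repeat each 'a' value as many times as its row has lines.
    'a': np.repeat(df['a'].values, lengths.values),
    # Flatten the lists while keeping the original order.
    'b': [line for lines in parts for line in lines],
})

print(result)
```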
Upvotes: 3