Jairo Alves

Reputation: 534

Break a row into multiple rows based on the (string) content of a column

One column of my dataframe holds strings with a variable number of \n characters, and I need each line to end up on its own row in the final dataframe.

This is a minimal example:

import pandas as pd
df = pd.DataFrame({'a': ['x', 'y'], 'b': ['line 1\nline 2\nline 3', 'line 1']})

That produces this starting dataframe:

    a   b
0   x   line 1\nline 2\nline 3
1   y   line 1 

I want it to become like this one:

    a   b
0   x   line 1
1   x   line 2
2   x   line 3
3   y   line 1

I've seen that there is a built-in method, str.extract, that converts each captured pattern into a new column. This is what I tried:

df['b'].str.extract(pat='(.*)\n(.*)', expand=True)

That produces a somewhat interesting output:

    0       1
0   line 1  line 2
1   NaN     NaN

But this is not a viable solution: the data is split across columns rather than rows, not every line is matched (the third line of row 0 is lost and row 1 does not match at all), and it's not clear how to put the result back into the original dataframe in the right place and order. The order of the entries must be preserved, although the dataframe index need not be.

To capture every line, it would be possible to split each string into a list:

df['b'].transform(lambda x: x.split('\n'))

That yields this output:

0    [line 1, line 2, line 3]
1                    [line 1]

But again, I don't see how to get from this list-valued column to the desired state.

Upvotes: 1

Views: 41

Answers (1)

Andy L.

Reputation: 25259

Try str.split followed by explode (Series.explode is available from pandas 0.25 onward):

df = df.set_index('a').b.str.split('\\n').explode().reset_index()

Out[153]:
   a       b
0  x  line 1
1  x  line 2
2  x  line 3
3  y  line 1
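
If the dataframe has more columns than just a and b, moving all of them into the index can get awkward. A minimal sketch of an alternative, assuming pandas >= 1.1 (for the ignore_index argument) and reusing the example data from the question, keeps every column in place and explodes only b. The name result is just for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': ['line 1\nline 2\nline 3', 'line 1']})

# Replace each string in 'b' with a list of its lines, then give every
# list element its own row. The other columns are repeated automatically.
# ignore_index=True (assumed: pandas >= 1.1) renumbers the rows from 0.
result = df.assign(b=df['b'].str.split('\n')).explode('b', ignore_index=True)

print(result)
```

The original row order is preserved, and the lines of each cell stay in the order they appeared in the string.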

For pandas < 0.25, which has no explode, split with expand=True and stack the resulting columns instead:

df = (df.set_index('a').b.str.split('\\n', expand=True).stack()
                         .droplevel(1).reset_index(name='b'))

Out[174]:
   a       b
0  x  line 1
1  x  line 2
2  x  line 3
3  y  line 1
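
To see what both versions do under the hood, here is a rough sketch that builds the same result by hand with numpy.repeat. It is an illustration under the question's example data, not part of the answer above, and the names lines, counts, and result are made up for it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': ['line 1\nline 2\nline 3', 'line 1']})

lines = df['b'].str.split('\n')   # Series of lists, one list per original row
counts = lines.str.len()          # how many lines each original row contains

result = pd.DataFrame({
    # Repeat each value of 'a' once per line found in its row.
    'a': np.repeat(df['a'].values, counts.values),
    # Flatten the lists in row order so the lines line up with the repeats.
    'b': [line for cell in lines for line in cell],
})

print(result)
```

This only relies on str.split, str.len, and numpy.repeat, so it should not depend on explode being available.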

Upvotes: 3
