Ajean
Ajean

Reputation: 5659

Split pandas column into new columns in presence of NaN

I have a pandas DataFrame containing a string column which needs splitting into two separate columns. The answer using tolist that I found on SO from this question works like a charm, except when my column contains NaNs. The excerpt below describes the difficulty:

import pandas as pd
import numpy as np

# Example DataFrame
df = pd.DataFrame([[25.0, '34.2/ 18.1', 'one'],
                   [32.6, '28.6/ 17.9', 'two'],
                   [12.5, '30.1/ 17.6', 'three']], columns=['A', 'B', 'C'])
df2 = df.copy()

# This method works when all data are present
df['D'] = pd.DataFrame(df['B'].str.split('/').tolist())[1]

# However, when there are NaNs:
df2['B'][0] = np.nan

# This line fails
df2['D'] = pd.DataFrame(df2['B'].str.split('/').tolist())[1]

It gives me a KeyError, because the intermediate DataFrame only has one column, indicating that the bother of going to a list and back doesn't accomplish anything anymore:

               0
0            NaN
1  [28.6,  17.9]
2  [30.1,  17.6]

I've tried dropping the NaN first via pd.DataFrame(df2['B'].str.split('/').dropna().tolist()), but then I lose my index ... I need to keep the NaN at index 0. I've also thought of somehow duplicating the NaN in the creation of the intermediate DataFrame to force the two columns, but am having no luck.

This is what I would need my data to look like for df2:

      A           B      C     D
0  25.0         NaN    one   NaN
1  32.6  28.6/ 17.9    two  17.9
2  12.5  30.1/ 17.6  three  17.6

Is there a way to do this without using a list as an intermediary? Or somehow deal with the NaN?

Upvotes: 3

Views: 5041

Answers (2)

unutbu
unutbu

Reputation: 880239

The str.extract method allows you to provide a regex pattern. Each group in the pattern is returned as a separate column. NaN is used when no match is found:

df2['D'] = df2['B'].str.extract(r'/(.*)')
print(df2)

yields

      A           B      C      D
0  25.0         NaN    one    NaN
1  32.6  28.6/ 17.9    two   17.9
2  12.5  30.1/ 17.6  three   17.6

Note that if you want the D column to be treated as floats, then you'll also need to call astype:

df2['D'] = df2['D'].astype('float')

Upvotes: 1

Alex Riley
Alex Riley

Reputation: 176938

You can continue to use your method if you use the str accessor again after the split (instead of using tolist() and making another DataFrame):

>>> df2['D'] = df2['B'].str.split('/').str[-1]
>>> df2
      A           B      C      D
0  25.0         NaN    one    NaN
1  32.6  28.6/ 17.9    two   17.9
2  12.5  30.1/ 17.6  three   17.6

This returns NaN if the index doesn't exist, instead of raising the error.

Upvotes: 4

Related Questions