Reputation: 5659
I have a pandas DataFrame containing a string column which needs splitting into two separate columns. The answer using tolist
that I found on SO from this question works like a charm, except when my column contains NaNs. The excerpt below describes the difficulty:
import pandas as pd
import numpy as np
# Example DataFrame
df = pd.DataFrame([[25.0, '34.2/ 18.1', 'one'],
[32.6, '28.6/ 17.9', 'two'],
[12.5, '30.1/ 17.6', 'three']], columns=['A', 'B', 'C'])
df2 = df.copy()
# This method works when all data are present
df['D'] = pd.DataFrame(df['B'].str.split('/').tolist())[1]
# However, when there are NaNs:
df2['B'][0] = np.nan
# This line fails
df2['D'] = pd.DataFrame(df2['B'].str.split('/').tolist())[1]
It gives me a KeyError
, because the intermediate DataFrame only has one column, indicating that the bother of going to a list and back doesn't accomplish anything anymore:
0
0 NaN
1 [28.6, 17.9]
2 [30.1, 17.6]
I've tried dropping the NaN first via pd.DataFrame(df2['B'].str.split('/').dropna().tolist())
, but then I lose my index ... I need to keep the NaN at index 0. I've also thought of somehow duplicating the NaN in the creation of the intermediate DataFrame to force the two columns, but am having no luck.
This is what I would need my data to look like for df2:
A B C D
0 25.0 NaN one NaN
1 32.6 28.6/ 17.9 two 17.9
2 12.5 30.1/ 17.6 three 17.6
Is there a way to do this without using a list as an intermediary? Or somehow deal with the NaN?
Upvotes: 3
Views: 5041
Reputation: 880239
The str.extract
method allows you to provide a regex pattern. Each group in the pattern is returned as a separate column. NaN
is used when no match is found:
df2['D'] = df2['B'].str.extract(r'/(.*)')
print(df2)
yields
A B C D
0 25.0 NaN one NaN
1 32.6 28.6/ 17.9 two 17.9
2 12.5 30.1/ 17.6 three 17.6
Note that if you want the D
column to be treated as floats, then you'll also need to call astype
:
df2['D'] = df2['D'].astype('float')
Upvotes: 1
Reputation: 176938
You can continue to use your method if you use the str
accessor again after the split (instead of using tolist()
and making another DataFrame):
>>> df2['D'] = df2['B'].str.split('/').str[-1]
>>> df2
A B C D
0 25.0 NaN one NaN
1 32.6 28.6/ 17.9 two 17.9
2 12.5 30.1/ 17.6 three 17.6
This returns NaN
if the index doesn't exist, instead of raising the error.
Upvotes: 4