Reputation:
My input data:
df=pd.DataFrame({'A':['adam','monica','joe doe','michael mo'], 'B':['david','valenti',np.nan,np.nan]})
print(df)
A B
0 adam david
1 monica valenti
2 joe doe NaN
3 michael mo NaN
I need to extract strings after space, to a second column, but when I use my code...:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')
print(df)
A B
0 adam NaN
1 monica NaN
2 joe doe doe
3 michael mo mo
...I receive NaN
in each cell where value has not been extracted. How to avoid it?
I tried to extract only from rows where NaN exist using this code:
df.loc[df.B.isna(),'B'] = df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
ValueError: Incompatible indexer with DataFrame
Expected output:
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Upvotes: 0
Views: 1030
Reputation: 863236
I think solution should be simplify - split by spaces and get second lists and pass to Series.fillna
function:
df['B'] = df['B'].fillna(df['A'].str.split().str[1])
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Detail:
print (df['A'].str.split().str[1])
0 NaN
1 NaN
2 doe
3 mo
Name: A, dtype: object
Your solution should be changed:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')[0].fillna(df.B)
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Better solution wich changed regex and expand=False
for Series
:
df['B'] = df['A'].str.extract(r'( [a-zA-Z].*)', expand=False).fillna(df.B)
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Detail:
print (df['A'].str.extract(r'( [a-zA-Z].*)', expand=False))
0 NaN
1 NaN
2 doe
3 mo
Name: A, dtype: object
EDIT:
For extract also values from first column simpliest is use:
df1 = df['A'].str.split(expand=True)
df['A'] = df1[0]
df['B'] = df['B'].fillna(df1[1])
print (df)
A B
0 adam david
1 monica valenti
2 joe doe
3 michael mo
Upvotes: 0
Reputation: 17884
Your approach doesn't function because of the different shapes of the right and the left sides of your statement. The left part has the shape (2,)
and the right part (2, 2)
:
df.loc[df.B.isna(),'B']
Returns:
2 NaN
3 NaN
And you want to fill this with:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
Returns:
0 1
2 doe oe
3 mo o
You can take the column 1
and then it will have the same shape (2,)
as the left part and will fit:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')[1]
Returns:
2 oe
3 o
Upvotes: 1