user8560167
user8560167

Reputation:

Keep original string values after pandas str.extract() if the regex doesn't match

My input data:

df=pd.DataFrame({'A':['adam','monica','joe doe','michael mo'], 'B':['david','valenti',np.nan,np.nan]})
print(df)
            A        B
0        adam    david
1      monica  valenti
2     joe doe      NaN
3  michael mo      NaN

I need to extract strings after space, to a second column, but when I use my code...:

df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')
print(df)
            A     B
0        adam   NaN
1      monica   NaN
2     joe doe   doe
3  michael mo    mo

...I receive NaN in each cell where value has not been extracted. How to avoid it? I tried to extract only from rows where NaN exist using this code:

df.loc[df.B.isna(),'B'] = df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')

ValueError: Incompatible indexer with DataFrame

Expected output:

            A     B
0        adam   david
1      monica   valenti
2     joe doe   doe
3  michael mo    mo

Upvotes: 0

Views: 1030

Answers (2)

jezrael
jezrael

Reputation: 863236

I think solution should be simplify - split by spaces and get second lists and pass to Series.fillna function:

df['B'] = df['B'].fillna(df['A'].str.split().str[1])
print (df)
            A        B
0        adam    david
1      monica  valenti
2     joe doe      doe
3  michael mo       mo

Detail:

print (df['A'].str.split().str[1])
0    NaN
1    NaN
2    doe
3     mo
Name: A, dtype: object

Your solution should be changed:

df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')[0].fillna(df.B)
print (df)
            A        B
0        adam    david
1      monica  valenti
2     joe doe      doe
3  michael mo       mo

Better solution wich changed regex and expand=False for Series:

df['B'] = df['A'].str.extract(r'( [a-zA-Z].*)', expand=False).fillna(df.B)
print (df)
            A        B
0        adam    david
1      monica  valenti
2     joe doe      doe
3  michael mo       mo

Detail:

print (df['A'].str.extract(r'( [a-zA-Z].*)', expand=False))
0     NaN
1     NaN
2     doe
3      mo
Name: A, dtype: object

EDIT:

For extract also values from first column simpliest is use:

df1 = df['A'].str.split(expand=True)

df['A'] = df1[0]
df['B'] = df['B'].fillna(df1[1])
print (df)
         A        B
0     adam    david
1   monica  valenti
2      joe      doe
3  michael       mo

Upvotes: 0

Mykola Zotko
Mykola Zotko

Reputation: 17884

Your approach doesn't function because of the different shapes of the right and the left sides of your statement. The left part has the shape (2,) and the right part (2, 2):

df.loc[df.B.isna(),'B']

Returns:

2    NaN
3    NaN

And you want to fill this with:

df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')

Returns:

      0   1
2   doe  oe
3    mo   o

You can take the column 1 and then it will have the same shape (2,) as the left part and will fit:

df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')[1]

Returns:

2    oe
3     o 

Upvotes: 1

Related Questions