Reputation: 139
DF1
index|Number
0 |[Number 1]
1 |[Number 2]
2 |[kg]
3 |[]
4 |[kg,Number 3]
In the Number column of my dataframe, I need to extract the number if present, kg if the string has kg, and NaN if there is no value. If the row has both the number and kg, then I will extract only the number.
Expected Output
index|Number
0 |1
1 |2
2 |kg
3 |NaN
4 |3
I wrote a lambda function for this, but I am getting an error:
NumorKG = lambda x: x.str.extract('(\d+)') if x.str.extract('(\d+)').isdigit() else 'kg' if x.str.find('kg') else "NaN"
DF1['Number']=DF1['Number'].apply(NumorKG)
The error that I am getting is:
AttributeError: 'str' object has no attribute 'str'
Upvotes: 1
Views: 359
Reputation: 1570
In apply, the function receives each element as a scalar (here a str), so you can't use the .str accessor inside it.
As you are dealing with only one column, there is no need for apply.
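To make the point about scalars concrete, here is a minimal sketch (the two-row Series is made-up sample data, not the asker's frame). It shows that Series.apply hands the function one plain str at a time, while the .str accessor exists only on the Series itself:

```python
import pandas as pd

# made-up sample data for illustration
s = pd.Series(["Number 1", "kg"])

# Series.apply calls the function once per element, so each x is a plain str
seen_types = s.apply(lambda x: type(x).__name__).tolist()

# the .str accessor is defined on the Series, not on its elements
has_str_on_series = hasattr(s, "str")
has_str_on_element = hasattr("Number 1", "str")

# the vectorized call works on the whole column without apply
vectorized = s.str.contains("kg").tolist()

print(seen_types, has_str_on_series, has_str_on_element, vectorized)
```

This is why x.str.extract(...) inside the lambda raises the AttributeError from the question.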
As an alternative to jezrael's answer (and one that is reproducible), this is a possible solution:
import numpy as np
import pandas as pd

DF1 = pd.DataFrame({'Number': [["Number 1"], ["Number 2"], ["kg"], [""], ["kg", "Number 3"]]})
# join each list into a single string so the .str methods apply
DF1['Number'] = DF1.Number.str.join(sep=" ")
digits = DF1.Number.str.extract(r'(\d+)', expand=False)
mask_digit = digits.notna()
mask_kg = DF1['Number'].str.contains('kg', na=False)
DF1.loc[mask_digit, 'Number'] = digits
# a row with both kg and a number keeps the number
DF1.loc[mask_kg & ~mask_digit, 'Number'] = 'kg'
DF1.loc[~(mask_digit | mask_kg), 'Number'] = np.nan
Upvotes: 0
Reputation: 863331
Use numpy.where to set the values:
import numpy as np

# extract the digits into a Series (NaN where there are none)
d = df['Number'].str.extract(r'(\d+)', expand=False)
# test whether digits were found
mask1 = d.str.isdigit().fillna(False)
# test whether the value contains kg
mask2 = df['Number'].str.contains('kg', na=False)
df['Number'] = np.where(mask1, d,
               np.where(mask2 & ~mask1, 'kg', np.nan))
print (df)
  Number
0      1
1      2
2     kg
3    nan
4      3
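The printed output above shows nan in lowercase for row 3, which suggests that cell may hold the text 'nan' rather than a real missing value. If a genuine NaN matters (for isna or dropna, say), one possible variant is to start from the extracted digits and fill in kg with Series.mask. This is only a sketch: the frame below is sample data modelled on the question, and the names digits, has_kg and result are made up for illustration.

```python
import pandas as pd

# sample data modelled on the question (an assumption, not the asker's real frame)
df = pd.DataFrame({"Number": ["Number 1", "Number 2", "kg", "", "kg,Number 3"]})

# digits where present, missing value otherwise
digits = df["Number"].str.extract(r"(\d+)", expand=False)
has_kg = df["Number"].str.contains("kg", na=False)

# keep the digits; write 'kg' only where no digits were found but kg is present
result = digits.mask(digits.isna() & has_kg, "kg")

print(result)
```

Rows with neither digits nor kg stay missing, so result.isna() picks them out directly.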
Your solution should be changed so that the function works on a single string:
import re
import numpy as np

def NumorKG(x):
    # x is a single string here, not a Series
    a = re.findall(r'(\d+)', x)
    if len(a) > 0:
        return a[0]
    elif 'kg' in x:
        return 'kg'
    else:
        return np.nan

df['Number'] = df['Number'].apply(NumorKG)
print (df)
  Number
0      1
1      2
2     kg
3    NaN
4      3
And your lambda function should be changed to:
NumorKG = lambda x: (re.findall(r'(\d+)', x)[0]
                     if len(re.findall(r'(\d+)', x)) > 0
                     else 'kg'
                     if 'kg' in x
                     else np.nan)
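Both the function and the lambda assume every element is a string. re.findall raises a TypeError when it is handed a missing value such as NaN. A guarded variant might look like the sketch below. The name num_or_kg, the isinstance guard and the sample Series are assumptions for illustration, not part of the original answer.

```python
import re

import numpy as np
import pandas as pd


def num_or_kg(x):
    """Return the first run of digits, else 'kg', else NaN."""
    # hypothetical guard: non-string cells (e.g. NaN) are treated as missing
    if not isinstance(x, str):
        return np.nan
    match = re.search(r"\d+", x)
    if match:
        return match.group(0)
    return "kg" if "kg" in x else np.nan


# sample data modelled on the question, plus one genuinely missing cell
s = pd.Series(["Number 1", "Number 2", "kg", "", "kg,Number 3", np.nan])
out = s.apply(num_or_kg)

print(out)
```

Using re.search once avoids running the same pattern twice, as the lambda version does.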
Upvotes: 1