Reputation: 58
I have a column that should only contain ints; however, due to data errors, it currently contains both strings and ints. I need to apply an np.where statement along the lines of np.where(df['IO8'] >= 2002, "NEW", "OLD").
The statement fails with an error saying >= cannot be used on strings. How would I get around this? Any help would be great, and let me know if any more detail is needed. I have also tried to use regex, like the following:
df['split'] = pd.np.where(df['IO8'].str.contains("^\d{4}$", regex=True), "Number", "Error")
df['IO8'] = pd.np.where(df['split'].str.contains("Number"), df['IO8'].astype(int), df['IO8'].astype(str))
df['split1'] = pd.np.where(df['split'].str.contains("Number") & (df['IO8'] >= 2002),"NEW","OLD")
But still get an error on this.
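For reference, here is a minimal example that reproduces the first error (the sample values are made up, but they show the mix of ints and strings in the column):
import numpy as np
import pandas as pd

# column of mixed ints and strings (made-up sample values)
df = pd.DataFrame({'IO8': [2005, '1999', 'dwd21']})

# raises TypeError, because >= cannot compare a str with an int
df['split'] = np.where(df['IO8'] >= 2002, "NEW", "OLD")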
Upvotes: 2
Views: 58
Reputation: 108
@Author, you might like to see this approach too:
b = df['IO8'].apply(lambda x: "New" if (x.isnumeric() and int(x) >= 2002) else "None" if not x.isnumeric() else "Old")
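For example, on a small all-string column (made-up values; note that .isnumeric() is a str method, so this assumes every value in the column is a string):
import pandas as pd

df = pd.DataFrame({'IO8': ['2000', '2009', '20', 'dwd21']})
b = df['IO8'].apply(lambda x: "New" if (x.isnumeric() and int(x) >= 2002) else "None" if not x.isnumeric() else "Old")
print(b.tolist())  # ['Old', 'New', 'Old', 'None']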
Upvotes: 0
Reputation: 863226
Use Series.str.extract to get the years into a new column, converted to floats:
import numpy as np
import pandas as pd

df = pd.DataFrame({'IO8':['2000','2009','20','dwd21']})
df['num'] = df['IO8'].str.extract(r"(^\d{4}$)").astype(float)
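At this point the num column should look roughly like this (same sample data as above):
print(df)
     IO8     num
0   2000  2000.0
1   2009  2009.0
2     20     NaN
3  dwd21     NaN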
Then it is possible to use numpy.select for the 3 states:
m1 = df['num'].notna()
m2 = df['num'] >= 2002
df['split1'] = np.select([m1 & m2, m1 & ~m2], ["NEW", "OLD"], default='no match')
Or use a double np.where:
df['split1'] = np.where(m2, "NEW", np.where(m1, "OLD", 'no match'))
print(df)
     IO8     num    split1
0   2000  2000.0       OLD
1   2009  2009.0       NEW
2     20     NaN  no match
3  dwd21     NaN  no match
Because if only np.where is used, the output is:
df = pd.DataFrame({'IO8':['2000','2009','20','dwd21']})
df['num'] = df['IO8'].str.extract(r"(^\d{4}$)").astype(float)
m1 = df['num'].notna()
m2 = df['num'] >= 2002
df['split1'] = np.where(m1 & m2, "NEW", "OLD")
print(df)
     IO8     num  split1
0   2000  2000.0     OLD
1   2009  2009.0     NEW
2     20     NaN     OLD
3  dwd21     NaN     OLD
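That is because comparing NaN with a number returns False, so the rows without a valid year fall into the OLD branch:
import numpy as np

print(np.nan >= 2002)  # False, so unmatched rows end up as "OLD"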
Upvotes: 3