I have the following output
Age
'1 year old',
'14 years old',
'music store',
'7 years old ',
'16 years old ',
created after using this line of code
df['Age']=df['Age'].str.split('.', expand=True,n=0)[0]
df['Age'].tolist()
I would like to remove the rows that do not start with a number, or with a number followed by "year old" or "years old". Ideally the result would be a copy or a new, filtered dataset rather than the original modified in place.
Expected output
Age (in a new, filtered dataset)
'1 year old',
'14 years old',
'7 years old ',
'16 years old ',
How can I do this?
Upvotes: 1
Views: 503
Reputation: 28644
The code below looks for text that starts with an apostrophe followed by a number, and keeps only those rows:
import pandas as pd
df = pd.read_clipboard(sep=';')
df.loc[df.Age.str.match(r"'\d+")]
Age
0 '1 year old',
1 '14 years old',
3 '7 years old ',
4 '16 years old ',
Note that this only checks for an apostrophe followed by a number; @Shubham's solution covers a lot more.
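For anyone who cannot paste from the clipboard, here is a minimal sketch of the same idea on a self-contained frame. The sample values are retyped from the question, and making the leading apostrophe optional is my own assumption, not part of the answer above. Since `str.match` only tests the beginning of each string, no `^` anchor is needed.

```python
import pandas as pd

# Sample data retyped from the question; the literal quotes and trailing
# commas are kept on purpose to mimic the pasted text.
df = pd.DataFrame({"Age": ["'1 year old',", "'14 years old',", "'music store',",
                           "'7 years old ',", "'16 years old ',"]})

# str.match tests only the start of each string.
# '? makes the leading apostrophe optional (an assumption for illustration).
mask = df["Age"].str.match(r"'?\d+")

# Keep only the rows whose text begins with a number.
filtered = df.loc[mask]
print(filtered)
```

The original `df` is left untouched here; `filtered` holds only the matching rows.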
Upvotes: 0
Reputation: 71689
Use Series.str.contains to create a boolean mask, then filter the dataframe with it:
m = df['Age'].str.contains(r'(?i)^\d+\syears?\sold')
df1 = df[m]
Result:
# print(df1)
Age
0 1 year old
1 14 years old
3 7 years old
4 16 years old
You can test the regex pattern here.
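If the column may contain missing values, or you want an independent copy as the question asks, something along these lines might work. This is only a sketch: the sample data is retyped from the question (with one `None` added for illustration), and `na=False`, `.copy()` and the final `str.strip()` are my additions rather than part of the answer above.

```python
import pandas as pd

# Sample data retyped from the question, plus one missing value for illustration.
df = pd.DataFrame({"Age": ["1 year old", "14 years old", "music store",
                           "7 years old ", "16 years old ", None]})

# (?i) ignores case, ^\d+ requires a leading number,
# and years? accepts both "year" and "years".
pattern = r"(?i)^\d+\syears?\sold"

# na=False treats missing values as non-matches so the mask stays boolean.
m = df["Age"].str.contains(pattern, na=False)

# .copy() returns a new dataframe, so later changes to df1 do not affect df.
df1 = df[m].copy()

# Optional cleanup: drop the trailing spaces seen in the sample data.
df1["Age"] = df1["Age"].str.strip()
print(df1)
```

Taking the copy before modifying `df1` also avoids pandas' `SettingWithCopyWarning` when the stripped values are assigned back.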
Upvotes: 1