Removing rows that do not start with/contain specific words

Question

I have the following output

Age
'1 year old',
'14 years old', 
'music store', 
'7 years old ',
'16 years old ',

created after using this line of code

df['Age']=df['Age'].str.split('.', expand=True,n=0)[0]
df['Age'].tolist()

I would like to remove the rows that do not start with a number, or with a number + year + old, or a number + years + old. It would be better to do this on a copy of the dataset, or to create a new one after filtering.

Expected output

Age (in a new dataset filtered)
'1 year old',
'14 years old', 
'7 years old ',
'16 years old ',

How could I do this?

Shubham Sharma · Accepted Answer

Use Series.str.contains to create a boolean mask, then use that mask to filter the dataframe:

m = df['Age'].str.contains(r'(?i)^\d+\syears?\sold')
df1 = df[m]

Result:

# print(df1)
             Age
0     1 year old
1   14 years old 
3    7 years old
4   16 years old
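
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. The sample dataframe is rebuilt by hand from the values in the question, so it is an assumption about what the real data looks like. Passing na=False (so missing values count as non-matches) and calling .copy() (so df1 is independent of df, as the question asked for) are additions that are not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed, not the asker's real dataset)
df = pd.DataFrame(
    {"Age": ["1 year old", "14 years old", "music store", "7 years old ", "16 years old "]}
)

# Same pattern as in the answer: digits, whitespace, "year" or "years", whitespace, "old"
pattern = r"(?i)^\d+\syears?\sold"

# na=False treats missing values as non-matching instead of propagating NaN into the mask
mask = df["Age"].str.contains(pattern, na=False)

# .copy() gives a new dataframe, leaving the original df untouched
df1 = df[mask].copy()

print(df1)
```

The original index labels (0, 1, 3, 4) are preserved in df1; calling df1.reset_index(drop=True) would renumber them if that is preferred.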

You can test the regex pattern here.
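
The pattern can also be checked locally with Python's built-in re module. This is only a rough sketch: the sample strings beyond those in the question are made up for illustration. Note that a bare number such as '3' does not match this pattern, since it requires the "year(s) old" part.

```python
import re

# Same pattern as in the answer; (?i) makes the match case-insensitive
PATTERN = re.compile(r"(?i)^\d+\syears?\sold")

# The first four values come from the question; the rest are hypothetical test cases
samples = [
    "1 year old",
    "14 years old",
    "music store",
    "7 years old ",
    "16 Years Old",
    "about 3 years old",
    "3",
]

# Map each sample to whether the pattern matches at the start of the string
results = {s: bool(PATTERN.search(s)) for s in samples}

for text, matched in results.items():
    print(repr(text), matched)
```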
