user12809368

Removing rows that do not start with/contain specific words

I have the following output

Age
'1 year old',
'14 years old', 
'music store', 
'7 years old ',
'16 years old ',

created after using this line of code

df['Age']=df['Age'].str.split('.', expand=True,n=0)[0]
df['Age'].tolist()

I would like to remove the rows that do not start with a number, or with a number followed by "year old" or "years old". Ideally the filtering would be done on a copy of the dataset, or produce a new one, rather than modify the original.

Expected output

Age (in a new dataset filtered)
'1 year old',
'14 years old', 
'7 years old ',
'16 years old ',

How could I do this?

Upvotes: 1

Views: 503

Answers (2)

sammywemmy

Reputation: 28644

The code below matches text that starts with an apostrophe followed by a number, and keeps only those rows:

df = pd.read_clipboard(sep=';')


df.loc[df.Age.str.match(r"'\d+")]

            Age
0   '1 year old',
1   '14 years old',
3   '7 years old ',
4   '16 years old ',

Note that this only checks for an apostrophe followed by a number; @Shubham's solution covers a lot more of the pattern.
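If it is not certain whether the values still carry the leading apostrophe (the two answers here show the column both ways), a minimal sketch that makes the quote optional could look like this. The sample frame is rebuilt by hand from the question, and `na=False` is an added assumption so that missing values are dropped rather than raising an error when indexing:

```python
import pandas as pd

# Sample data rebuilt from the question (quotes and commas kept as posted)
df = pd.DataFrame({
    "Age": [
        "'1 year old',",
        "'14 years old',",
        "'music store',",
        "'7 years old ',",
        "'16 years old ',",
    ]
})

# str.match anchors at the start of each string:
# an optional apostrophe, then one or more digits
mask = df["Age"].str.match(r"'?\d+", na=False)

filtered = df.loc[mask]
print(filtered)
```

The original `df` is left untouched; `filtered` keeps the original index labels (0, 1, 3, 4).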

Upvotes: 0

Shubham Sharma

Reputation: 71689

Use Series.str.contains to create a boolean mask, then filter the dataframe with it:

m = df['Age'].str.contains(r'(?i)^\d+\syears?\sold')
df1 = df[m]

Result:

# print(df1)
             Age
0     1 year old
1   14 years old 
3    7 years old
4   16 years old

You can test the regex pattern here.
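Since the question asks for the result in a new dataset, here is a self-contained sketch of the same mask approach that takes an explicit copy. The sample data, the `None` entry, the `na=False` argument, and the final `str.strip()` are assumptions added for illustration, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample data modelled on the question, plus one missing value
df = pd.DataFrame({
    "Age": ["1 year old", "14 years old ", "music store",
            "7 years old", "16 years old", None]
})

# (?i) = case-insensitive; ^\d+ = starts with digits;
# years? = "year" or "years"; \s = a single whitespace character
pattern = r"(?i)^\d+\syears?\sold"

# na=False treats missing values as non-matching
m = df["Age"].str.contains(pattern, na=False)

# .copy() gives an independent frame, so later edits do not touch df
df1 = df[m].copy()
df1["Age"] = df1["Age"].str.strip()  # drop stray trailing spaces
print(df1)
```

Taking the copy up front also avoids pandas' SettingWithCopyWarning when `df1` is modified afterwards.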

Upvotes: 1