La Cordillera

Reputation: 422

Regex for strings that share a prefix but end in a number, to subset a pandas DataFrame

This might be a basic question, but say I have a dataframe that looks like:

import pandas as pd

string_lst = ["bar0001", "bar0002", "bar0003", "bar0003", "bar0004", "bar0004", "bar0005", "bar0006"]
a = pd.DataFrame({'foo': string_lst,
                  'test': [0, 1, 2, 3, 4, 5, 6, 7]})

How do I subset the dataframe so that I get all the "bars" numbered 3 through 6?

I am guessing something along the lines of:

a['foo'== regex 3:6]?

My first thought was to select the last n characters of each string in string_lst, but the real dataframe will have varying numbers of digits, such as bar2005 or bar20005, so I'm not sure how to proceed.

Many thanks!

Upvotes: 0

Views: 49

Answers (4)

Scott Boston

Reputation: 153500

IIUC,

a[a['foo'].str.contains('bar0+[3-6]', regex=True)]

Output:

       foo  test
2  bar0003     2
3  bar0003     3
4  bar0004     4
5  bar0004     5
6  bar0005     6
7  bar0006     7
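
One caveat worth illustrating: str.contains only needs the pattern to appear somewhere in the string, so a value with extra trailing digits can slip through. The sketch below uses made-up rows (bar0035 and bar2005 are not in the question's data) and assumes pandas 1.1 or later, where Series.str.fullmatch is available. The 0* in the anchored pattern is my own variation so that values without leading zeros would also be accepted.

```python
import pandas as pd

# Illustrative data; "bar0035" and "bar2005" are hypothetical extra rows.
a = pd.DataFrame({
    'foo': ["bar0001", "bar0003", "bar0006", "bar0035", "bar2005"],
    'test': [0, 1, 2, 3, 4],
})

# Unanchored: any string that merely contains "bar", zeros, then a digit 3-6.
loose = a[a['foo'].str.contains(r'bar0+[3-6]', regex=True)]

# Anchored: the whole string must be "bar", optional zeros, one digit 3-6.
strict = a[a['foo'].str.fullmatch(r'bar0*[3-6]')]

print(loose['foo'].tolist())   # includes bar0035
print(strict['foo'].tolist())  # only bar0003 and bar0006
```

A character class like [3-6] only covers a single digit, so this style suits small ranges; wider ranges are usually easier to handle by converting the digits to a number.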

Upvotes: 1

wwnde

Reputation: 26676

What do you need?

1. Select index labels 3 to 6?

a.loc[3:6,:]


       foo  test
3  bar0003     3
4  bar0004     4
5  bar0004     5
6  bar0005     6

or

2. Select bars numbered 3 to 6?

a['s'] = a['foo'].str.extract(r'(\d+)$', expand=False).astype(int)
a[a.s.ge(3) & a.s.le(6)].drop(columns='s')

       foo  test
2  bar0003     2
3  bar0003     3
4  bar0004     4
5  bar0004     5
6  bar0005     6
7  bar0006     7
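
As a side note on the first option, .loc slices by index label and includes the end label, whereas .iloc slices by position and excludes the end. A small sketch on the question's data, assuming the default integer index:

```python
import pandas as pd

string_lst = ["bar0001", "bar0002", "bar0003", "bar0003",
              "bar0004", "bar0004", "bar0005", "bar0006"]
a = pd.DataFrame({'foo': string_lst, 'test': [0, 1, 2, 3, 4, 5, 6, 7]})

# Label-based: rows labelled 3, 4, 5 and 6 (end label included).
by_label = a.loc[3:6]

# Position-based: rows at positions 3, 4 and 5 (end position excluded).
by_position = a.iloc[3:6]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Neither form looks at the number inside the string, so it only answers the question if the row order happens to line up with the bar numbers.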

Upvotes: 1

Allen Qin

Reputation: 19957

If your dataset follows the same pattern (bar followed by digits), you can do something like the following. It handles cases like 'bar004', 'bar00004', etc., because the digits are compared as a number rather than as text.

a.loc[a.foo.str.extract(r'(\d+)')[0].astype(float).between(3, 6)]
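
To show how this behaves with varying digit counts and with rows that have no number at all, here is a minimal sketch. The helper name select_numbered and the sample rows are hypothetical, not part of the question. It anchors the digits to the end of the string with $, which is an assumption about where the number sits.

```python
import pandas as pd

# Hypothetical sample: mixed widths plus one row with no digits.
a = pd.DataFrame({
    'foo': ["bar0002", "bar0003", "bar2005", "bar20005", "bar06", "baz"],
    'test': [0, 1, 2, 3, 4, 5],
})

def select_numbered(df, column, low, high):
    """Hypothetical helper: keep rows whose trailing number is in [low, high]."""
    # expand=False returns a Series; rows without trailing digits become NaN.
    numbers = df[column].str.extract(r'(\d+)$', expand=False).astype(float)
    # NaN is never "between" two numbers, so non-matching rows are dropped.
    return df[numbers.between(low, high)]

subset = select_numbered(a, 'foo', 3, 6)
print(subset)
```

Casting to float rather than int is what lets the non-matching row pass through as NaN instead of raising an error.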

Upvotes: 2

Hugzey

Reputation: 21

Your regex string can be "bar[0-9]*". Matched against the whole string, this allows bar1, bar01, and bar000000000001 (and also a bare bar, since * permits zero digits), but not "bar 1" or "bar001a".
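
Whether those last two are rejected depends on how the pattern is applied. A quick sketch with Python's re module; the candidate list is just the examples above plus a bare "bar":

```python
import re

pattern = r"bar[0-9]*"
candidates = ["bar1", "bar01", "bar000000000001", "bar", "bar 1", "bar001a"]

# fullmatch: the entire string must fit the pattern.
full = [s for s in candidates if re.fullmatch(pattern, s)]

# match: only a prefix has to fit, so every candidate here passes.
prefix = [s for s in candidates if re.match(pattern, s)]

# Requiring at least one digit with + also rejects the bare "bar".
strict = [s for s in candidates if re.fullmatch(r"bar[0-9]+", s)]

print(full)
print(prefix)
print(strict)
```

In pandas, Series.str.fullmatch plays the role of re.fullmatch, while str.match and str.contains behave like the looser prefix and substring checks.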

Upvotes: 1
