spitfiredd

Reputation: 3135

Pandas Regular expression, more than one choice?

I have been extracting the date part from strings that look like

A620170101 # output 20170101

In pandas I would just do something like:

df['var'] = df.sba.str.extract(r'A6(.{8})', expand=False)

However, now I need to update to also extract the date from a string that looks like

JT20170101 # output 20170101

I tried adding a | but that didn't work.

Here is some quick test data:

import pandas as pd

d = {'var1': 'A620170101', 'var2': 'JT20170102', 'var3': '', 'var4': 'TG20170102'}
df = pd.DataFrame(list(d.items()), columns=['var', 'sba'])

I just want the date part from values with the A6 or JT prefix.

Upvotes: 1

Views: 122

Answers (3)

Francis Gagnon

Reputation: 3675

If your data is always the same length, as shown above, you could skip the regex and just take the first two characters for the code and the last 8 characters for the date.

If you want a regex that captures any prefix code (of any length) and the date suffix you could use this:

(.*)(\d{8})

I'm not familiar with pandas, but I assume it works with this pattern.
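
For what it's worth, here is a minimal sketch of how that pattern behaves in pandas, assuming the question's test data. Because the pattern has two capture groups, str.extract returns one column per group, and since (.*) accepts any prefix, a separate filter is still needed to keep only A6 and JT. The column names prefix and date are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'sba': ['A620170101', 'JT20170102', '', 'TG20170102']})

# Two capture groups: str.extract returns a DataFrame with one column per group.
parts = df['sba'].str.extract(r'(.*)(\d{8})')
parts.columns = ['prefix', 'date']  # hypothetical column names

# (.*) matches any prefix, so TG also matches; filter on the wanted codes.
df['date'] = parts['date'].where(parts['prefix'].isin(['A6', 'JT']))
print(df)
```

The empty string and the TG row end up as missing values, because neither has a prefix in the allowed list.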

Upvotes: 0

jezrael

Reputation: 863226

Use the solution from the comments:

df['var3'] = df.sba.str.extract(r'(?:JT|A6)(.{8})', expand=False)
print (df)
    var         sba      var3
0  var1  A620170101  20170101
1  var2  JT20170102  20170102
2  var3                   NaN
3  var4  TG20170102       NaN

Another solution is to check the first 2 characters and, if they are in the list, extract the characters at positions 2 through 9 (the slice [2:10]):

import numpy as np

df['var3'] = np.where(df.sba.str[:2].isin(['A6','JT']), df.sba.str[2:10], np.nan)
print(df)
    var         sba      var3
0  var1  A620170101  20170101
1  var2  JT20170102  20170102
2  var3                   NaN
3  var4  TG20170102       NaN
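
If the real data is messier than the sample, a stricter variant of the regex may be worth considering. The sketch below is an assumption about what "stricter" could mean here: ^ requires the prefix at the start of the string, and \d{8} accepts only digits instead of any 8 characters. The last two rows are hypothetical edge cases that are not in the question.

```python
import pandas as pd

# The last two rows are hypothetical edge cases, not part of the question's data.
s = pd.Series(['A620170101', 'JT20170102', '', 'TG20170102',
               'XXJT20170103', 'A6ABCDEFGH'])

# ^ anchors the prefix at the start of the string; \d{8} accepts digits only.
strict = s.str.extract(r'^(?:JT|A6)(\d{8})', expand=False)
print(pd.DataFrame({'sba': s, 'strict': strict}))
```

With the unanchored (?:JT|A6)(.{8}) pattern, both hypothetical rows would produce a value; with this variant they come back as missing.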

Upvotes: 2

Luc

Reputation: 1433

If you want to use the "|" operator, you could try something like:

(?:JT|A6)(.{8})

The previous answer is good too.
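
The question does not show the attempt that failed, so the following is only a guess at what went wrong: a bare | splits the whole pattern, not just the prefix. The sketch compares an ungrouped pattern (an assumed version of the failed attempt) with the grouped one, using the question's test values.

```python
import pandas as pd

s = pd.Series(['A620170101', 'JT20170102', '', 'TG20170102'])

# Ungrouped: parsed as "A6" OR "JT(.{8})", so the A6 branch captures nothing.
ungrouped = s.str.extract(r'A6|JT(.{8})', expand=False)

# Grouped: (?:...) limits the alternation to the prefix without capturing it.
grouped = s.str.extract(r'(?:JT|A6)(.{8})', expand=False)

print(pd.DataFrame({'sba': s, 'ungrouped': ungrouped, 'grouped': grouped}))
```

In the ungrouped version the A6 row matches but its capture group is empty, so only the JT row yields a date. The non-capturing group fixes that while keeping a single output column.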

Upvotes: 0
