Reputation: 3135
I have a column of strings from which I was extracting the date part. The strings looked like:
A620170101 # output 20170101
In pandas I would just do something like,
df['var'] = df.sba.str.extract(r'A6(.{8})', expand=False)
However, I now need to update this to also extract the date from strings that look like:
JT20170101 # output 20170101
I tried adding a | to the pattern, but that didn't work.
Here is some quick test data:
import pandas as pd
d = {'var1': 'A620170101', 'var2': 'JT20170102', 'var3': '', 'var4': 'TG20170102'}
df = pd.DataFrame(list(d.items()), columns=['var', 'sba'])
I just want the date part from strings that have the A6 or JT prefix.
Upvotes: 1
Views: 122
Reputation: 3675
If your data is always the same length, as shown above, you could skip the regex and just take the first two characters for the code and the last 8 characters for the date.
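A minimal sketch of that fixed-width idea in plain Python. The function name `split_code_and_date` and its length parameters are made up for illustration, and it assumes every valid value is exactly 2 + 8 characters long:

```python
def split_code_and_date(value, code_len=2, date_len=8):
    """Split a fixed-width string into (code, date); return None if the length is wrong."""
    # Assumption: valid values are exactly code_len + date_len characters long.
    if len(value) != code_len + date_len:
        return None
    return value[:code_len], value[-date_len:]


print(split_code_and_date("A620170101"))
print(split_code_and_date(""))
```

Note that this does not check which prefix is present, so a value such as TG20170102 would still be split.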
If you want a regex that captures any prefix code (of any length) and the date suffix you could use this:
(.*)(\d{8})
I'm not familiar with pandas, but I'm assuming that this pattern works there too.
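To see what this pattern captures, here is a small sketch using Python's standard `re` module rather than pandas. The helper name `split_prefix_and_date` is hypothetical. Because the pattern accepts any prefix, rows such as TG20170102 also match and would have to be filtered on the first group afterwards:

```python
import re

# Two capturing groups: any prefix, then exactly eight digits.
PATTERN = re.compile(r"(.*)(\d{8})")


def split_prefix_and_date(value):
    """Return (prefix, date) if the whole string matches, else None."""
    match = PATTERN.fullmatch(value)
    return match.groups() if match else None


for sample in ["A620170101", "JT20170102", "", "TG20170102"]:
    print(repr(sample), split_prefix_and_date(sample))
```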
Upvotes: 0
Reputation: 863226
Use the solution from the comment:
df['var3'] = df.sba.str.extract(r'(?:JT|A6)(.{8})', expand=False)
print (df)
    var         sba      var3
0  var1  A620170101  20170101
1  var2  JT20170102  20170102
2  var3                   NaN
3  var4  TG20170102       NaN
Another solution is to check the first 2 characters and, if they are in the list, extract characters 2 to 10:
df['var3'] = np.where(df.sba.str[:2].isin(['A6','JT']), df.sba.str[2:10], np.nan)
print (df)
    var         sba      var3
0  var1  A620170101  20170101
1  var2  JT20170102  20170102
2  var3                   NaN
3  var4  TG20170102       NaN
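If the list of accepted prefixes is likely to grow, one option is to build the pattern from the list instead of typing the alternation by hand. This is only a sketch using the standard `re` module. The names `build_pattern` and `extract_date` are invented for the example, and it keeps the `.{8}` capture from the pattern above:

```python
import re


def build_pattern(prefixes):
    """Build '(?:P1|P2|...)(.{8})' from a list of literal prefixes."""
    # re.escape guards against prefixes that contain regex metacharacters.
    alternation = "|".join(re.escape(p) for p in prefixes)
    return "(?:" + alternation + ")(.{8})"


def extract_date(value, pattern):
    """Return the 8 characters after an accepted prefix, or None."""
    match = re.match(pattern, value)
    return match.group(1) if match else None


pattern = build_pattern(["A6", "JT"])
for sample in ["A620170101", "JT20170102", "", "TG20170102"]:
    print(repr(sample), extract_date(sample, pattern))
```

The resulting pattern string can be passed to `df.sba.str.extract(pattern, expand=False)` in the same way as the hand-written one.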
Upvotes: 2
Reputation: 1433
If you want to use the "|" operator, you could try something like:
(?:JT|A6)(.{8})
The previous answer is good too.
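The question does not show the pattern that failed, so the following is only a guess at one common cause: alternation has the lowest precedence in a regex, so an ungrouped `JT|A6(.{8})` means "JT" or "A6 followed by eight captured characters". The sketch below uses the standard `re` module to compare that with the grouped version:

```python
import re

# Ungrouped: the capture group belongs only to the A6 branch.
ungrouped = re.compile(r"JT|A6(.{8})")
# Grouped: the non-capturing group limits the alternation to the prefix.
grouped = re.compile(r"(?:JT|A6)(.{8})")

value = "JT20170102"
ungrouped_result = ungrouped.match(value).group(1)  # the JT branch has no capture
grouped_result = grouped.match(value).group(1)

print(ungrouped_result)
print(grouped_result)
```

With the ungrouped pattern the JT branch matches but captures nothing, which pandas would report as a missing value for that row.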
Upvotes: 0