JaP

Reputation: 87

How to extract substring with regex

I have SKUs like the following:

SBC225SLB32
SBA2161BRB30
PBA632AS32

The first 3-4 characters are letters (A-Z) and must be extracted. The following 3-4 characters are digits (0-9) and must also be extracted.

For the first, I tried \D{3,4} and for the second, I tried \d{3,4}.

But when using pandas' .str.extract('\D{3,4}'), I got a "pattern contains no capture groups" error. Is there a better way to do this?

Upvotes: 1

Views: 105

Answers (1)

Wiktor Stribiżew

Reputation: 626689

The regex pattern you pass to Series.str.extract contains no capturing groups, while the method expects at least one.
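
As a minimal sketch of what that means, the snippet below reproduces the error on the three sample SKUs from the question and then wraps the same pattern in parentheses so that it has one capturing group. The variable names (skus, raised, letters) are illustrative only.

```python
import pandas as pd

# Sample SKUs taken from the question
skus = pd.Series(["SBC225SLB32", "SBA2161BRB30", "PBA632AS32"])

# No parentheses in the pattern: pandas raises
# "ValueError: pattern contains no capture groups"
try:
    skus.str.extract(r"\D{3,4}")
    raised = False
except ValueError:
    raised = True

# Same pattern wrapped in parentheses: one capturing group.
# With a single group and expand=False the result is a Series.
letters = skus.str.extract(r"(\D{3,4})", expand=False)
print(letters.tolist())
```

Note that \D matches any non-digit character, not only A-Z, so a character class such as [A-Z] is stricter if the prefix must be letters.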

In your case, it is more convenient to grab both values at once with the help of two capturing groups. You can use

df[['Code1', 'Code2']] = df['SKU'].str.extract(r'^([A-Z]{3,4})([0-9]{3,4})', expand=False)

Pattern details:

  • ^ - start of string
  • ([A-Z]{3,4}) - Capturing group 1: three to four uppercase ASCII letters
  • ([0-9]{3,4}) - Capturing group 2: three to four ASCII digits.
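
For reference, here is a small self-contained sketch of the approach on the SKUs from the question. The DataFrame and the column names SKU, Code1 and Code2 follow the line above. The last row ("X1") is a made-up value, added only to show that rows which do not match the pattern come back as NaN.

```python
import pandas as pd

# The first three SKUs are from the question; "X1" is a hypothetical non-matching value
df = pd.DataFrame({"SKU": ["SBC225SLB32", "SBA2161BRB30", "PBA632AS32", "X1"]})

# Two capturing groups give a DataFrame with two columns (0 and 1)
parts = df["SKU"].str.extract(r"^([A-Z]{3,4})([0-9]{3,4})")

# Assign the two extracted columns to new, named columns
df[["Code1", "Code2"]] = parts
print(df)
```

The extracted digits are strings, so Code2 would need an explicit conversion if numeric values are required.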

Upvotes: 2
