Pyspark SparkSQL regex to get substring before space

Question

I have written an SQL in Athena, that uses the regex_extract to extract substring from a column, it extracts string, where there is "X10003" and takes up to when the space appears. for example from 5570 - Site 811111 - X10003-10447-XXX-20443 (CAMP) it extracts X10003-10447-XXX-20443 and it works fine using REGEXP_EXTRACT(site, 'X10033.*?\w+-\d+') in Athena.

Now I am converting this to sparkSQL and it does not work properly, I have applied different ways but nothing works, for example I have used expression: Regexp_extract("5570 - Site 811111 - X10003-10447-XXX-20443 (CAMP)", 'X10003([^\s]+)') gives me -10447-XXX-20443 (CAMP) which is not what I require.

Can anyone tells me what I am doing wrong?

Wiktor Stribiżew · Accepted Answer

You can use

REGEXP_EXTRACT(site, '(X10003.*?[A-Za-z0-9]-[0-9]+)', 1)

See the regex demo.

The (X10003.*?[A-Za-z0-9]-[0-9]+) pattern is a capturing group with ID 1 that matches

X10003 - a literal string
.*? - zero or more chars other than line break chars, as few as possible
[A-Za-z0-9] - an alphanumeric char
- - a hyphen
[0-9]+ - one or more digits.

Pyspark SparkSQL regex to get substring before space

Answers (2)

Related Questions