shzyincu
shzyincu

Reputation: 419

Pyspark SparkSQL regex to get substring before space

I have written an SQL in Athena, that uses the regex_extract to extract substring from a column, it extracts string, where there is "X10003" and takes up to when the space appears. for example from 5570 - Site 811111 - X10003-10447-XXX-20443 (CAMP) it extracts X10003-10447-XXX-20443 and it works fine using REGEXP_EXTRACT(site, 'X10033.*?\w+-\d+') in Athena.

Now I am converting this to sparkSQL and it does not work properly, I have applied different ways but nothing works, for example I have used expression: Regexp_extract("5570 - Site 811111 - X10003-10447-XXX-20443 (CAMP)", 'X10003([^\s]+)') gives me -10447-XXX-20443 (CAMP) which is not what I require.

Can anyone tells me what I am doing wrong?

Upvotes: 2

Views: 2723

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626929

You can use

REGEXP_EXTRACT(site, '(X10003.*?[A-Za-z0-9]-[0-9]+)', 1)

See the regex demo.

The (X10003.*?[A-Za-z0-9]-[0-9]+) pattern is a capturing group with ID 1 that matches

  • X10003 - a literal string
  • .*? - zero or more chars other than line break chars, as few as possible
  • [A-Za-z0-9] - an alphanumeric char
  • - - a hyphen
  • [0-9]+ - one or more digits.

Upvotes: 1

Toto
Toto

Reputation: 91430

Enlarge the capture group:

Regexp_extract("5570 - Site 811111 - X10003-10447-XXX-20443 (CAMP)", '(X10003[^\s]+)')
#                                                             here ___^

Upvotes: 0

Related Questions