ACCazacu
ACCazacu

Reputation: 25

How to do regEx in Spark SQL

I have to create a data frame in which the rows in one column should be a name I extract from a long URL. Let's say I have the following url:

https://xxx.xxxxxx.com/xxxxx/y...y/?...?/<irrelevant>

Now unfortunately I can't disclose the exact URLs but what I can say is that the letters x contain strings that don't change (i.e. all URLs in the database contain those patterns and are known), the y...y is an username that is unknown, with unknown length and may change with each URL and the ?...? is the name in which I am interested in (again a string with unknown length). After that there may be multiple strings separated by / which are not useful. How exactly would I do that? Up until now I used to do three different UDFs which use substrings and indexes but I think that's a very cumbersome solution.

I am not very familiar with Regex or with Spark SQL, so even just the regex would be useful.

Thanks

Edit: I think I got the regex down, now I just need to find out how to use it.

https:\/\/xxx\.xxxxxx\.com\/xxxxx\/(?:[^0-9\/]+)\/([a-zA-z]*)

Upvotes: 1

Views: 10226

Answers (1)

Aleksejs R
Aleksejs R

Reputation: 517

I have a bit modified your regex. Regex:

^https:\/\/www\.example\.com\/user=\/(.*?)\/(.*?)(?:\/.*|$)$

It will capture two groups:

  • 1st group - username
  • 2nd group - some name

You can use regexp_extract spark function for selecting regex capture groups. E.g.

import spark.implicits._
import org.apache.spark.sql.functions.regexp_extract

val df = Seq(
    ("https://www.example.com/user=/username1/name3/asd"),
    ("https://www.example.com/user=/username2/name2"),
    ("https://www.example.com/user=/username3/name1/asd"),
    ("https://www.example.com/user=")
).toDF("url")

val r = "^https:\\/\\/www\\.example\\.com\\/user=\\/(.*?)\\/(.*?)(?:\\/.*|$)$"

df.select(
    $"url",
    regexp_extract($"url", r, 1).as("username"),
    regexp_extract($"url", r, 2).as("name")
).show(false)

Result:

+-------------------------------------------------+---------+-----+
|url                                              |username |name |
+-------------------------------------------------+---------+-----+
|https://www.example.com/user=/username1/name3/asd|username1|name3|
|https://www.example.com/user=/username2/name2    |username2|name2|
|https://www.example.com/user=/username3/name1/asd|username3|name1|
|https://www.example.com/user=                    |         |     | <- not correct url
+-------------------------------------------------+---------+-----+

P.S. you can use regex101.com for validating your regular expressions

Upvotes: 3

Related Questions