Parse URL string in Spark DataFrame with PySpark

Question

I need to parse URL strings from the column refererurl in a Spark DataFrame. The data looks like this:

refererurl
https://www.delish.com/cooking/recipes/t678
https://www.delish.com/food/recipes/a463/
https://www.delish.com/cooking/recipes/g877

I am only interested in the first path segment after delish.com. The desired output is:

content
cooking
food
cooking

I have tried:

data.withColumn("content", fn.regexp_extract('refererurl', 'param1=(\d)', 2))

This returns all null values.

werner · Accepted Answer

You can use parse_url to get the path of the URL and then extract the first level of the path with regexp_extract:

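from pyspark.sql import functions as fn
# parse_url(..., 'PATH') returns the URL path; the regex captures its first segment
df.withColumn("content", fn.expr("regexp_extract(parse_url(refererurl, 'PATH'),'/(.*?)/')")) \
    .show(truncate=False)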

Output:

+-------------------------------------------+-------+
|refererurl                                 |content|
+-------------------------------------------+-------+
|https://www.delish.com/cooking/recipes/t678|cooking|
|https://www.delish.com/food/recipes/a463/  |food   |
|https://www.delish.com/cooking/recipes/g877|cooking|
+-------------------------------------------+-------+
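To see what the two steps do, here is a minimal plain-Python sketch of the same logic, using urllib.parse.urlparse in place of Spark's parse_url and Python's re module in place of regexp_extract. The helper name first_path_segment is made up for illustration, and Spark's Java regex engine may differ from Python's in edge cases, so treat this as an approximation rather than a drop-in equivalent:

```python
import re
from urllib.parse import urlparse


def first_path_segment(url):
    """Return the first path segment of url, or '' when the pattern does not match."""
    # Step 1: keep only the path, e.g. '/cooking/recipes/t678'
    path = urlparse(url).path
    # Step 2: same pattern as in the answer; the lazy group stops at the second slash
    match = re.search(r"/(.*?)/", path)
    # Return '' on no match, which is also what Spark's regexp_extract does
    return match.group(1) if match else ""


urls = [
    "https://www.delish.com/cooking/recipes/t678",
    "https://www.delish.com/food/recipes/a463/",
    "https://www.delish.com/cooking/recipes/g877",
]
contents = [first_path_segment(u) for u in urls]
print(contents)  # ['cooking', 'food', 'cooking']
```

Note that the pattern needs a slash on both sides of the segment, so a URL such as https://www.delish.com/cooking (one segment, no trailing slash) produces an empty string. If such URLs occur in the data, a pattern like '^/([^/]+)' is one possible alternative.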
