Sonya
Sonya

Reputation: 167

parse url string in spark df with PySpark

I need to parse url strings from column refererurl in spark df. The data looks like this:

refererurl
https://www.delish.com/cooking/recipes/t678
https://www.delish.com/food/recipes/a463/
https://www.delish.com/cooking/recipes/g877

I am only interested in what comes after delish.com. Desired output is:

content
cooking
food
cooking

I have tried:

data.withColumn("content", fn.regexp_extract('refererurl', 'param1=(\d)', 2))

Returns all null values

Upvotes: 1

Views: 1232

Answers (2)

dsk
dsk

Reputation: 2011

One more way of solving the problem using Split and element_at function in case we know the position of the string always remains same.

df = spark.createDataFrame([(1,"https://www.delish.com/cooking/recipes/t678"), (2,"https://www.delish.com/food/recipes/a463/"),(3,"https://www.delish.com/cooking/recipes/g877")],[ "col1","col2"])
df.show(truncate=False)
df = df.withColumn("splited_col", F.split("col2", "/"))
df = df.withColumn("content", F.element_at(F.col('splited_col'), 4).alias('content'))
df.show(truncate=False)

Input

+----+-------------------------------------------+
|col1|col2                                       |
+----+-------------------------------------------+
|1   |https://www.delish.com/cooking/recipes/t678|
|2   |https://www.delish.com/food/recipes/a463/  |
|3   |https://www.delish.com/cooking/recipes/g877|
+----+-------------------------------------------+
  

Output

 +----+-------------------------------------------+--------------------------------------------------+-------+


|col1|col2                                       |splited_col                                       |content|
+----+-------------------------------------------+--------------------------------------------------+-------+
|1   |https://www.delish.com/cooking/recipes/t678|[https:, , www.delish.com, cooking, recipes, t678]|cooking|
|2   |https://www.delish.com/food/recipes/a463/  |[https:, , www.delish.com, food, recipes, a463, ] |food   |
|3   |https://www.delish.com/cooking/recipes/g877|[https:, , www.delish.com, cooking, recipes, g877]|cooking|
+----+-------------------------------------------+--------------------------------------------------+-------+

Upvotes: 1

werner
werner

Reputation: 14895

You can use parse_url to the get the path of the url and then get the first level of the path with regexp_extract:

df.withColumn("content", fn.expr("regexp_extract(parse_url(refererurl, 'PATH'),'/(.*?)/')")) \
    .show(truncate=False)

Output:

+-------------------------------------------+-------+
|refererurl                                 |content|
+-------------------------------------------+-------+
|https://www.delish.com/cooking/recipes/t678|cooking|
|https://www.delish.com/food/recipes/a463/  |food   |
|https://www.delish.com/cooking/recipes/g877|cooking|
+-------------------------------------------+-------+

Upvotes: 2

Related Questions