Reputation: 167
I need to parse url strings from column refererurl
in spark df. The data looks like this:
refererurl
https://www.delish.com/cooking/recipes/t678
https://www.delish.com/food/recipes/a463/
https://www.delish.com/cooking/recipes/g877
I am only interested in what comes after delish.com
. Desired output is:
content
cooking
food
cooking
I have tried:
data.withColumn("content", fn.regexp_extract('refererurl', 'param1=(\d)', 2))
Returns all null values
Upvotes: 1
Views: 1232
Reputation: 2011
One more way of solving the problem using Split and element_at function in case we know the position of the string always remains same.
df = spark.createDataFrame([(1,"https://www.delish.com/cooking/recipes/t678"), (2,"https://www.delish.com/food/recipes/a463/"),(3,"https://www.delish.com/cooking/recipes/g877")],[ "col1","col2"])
df.show(truncate=False)
df = df.withColumn("splited_col", F.split("col2", "/"))
df = df.withColumn("content", F.element_at(F.col('splited_col'), 4).alias('content'))
df.show(truncate=False)
Input
+----+-------------------------------------------+
|col1|col2 |
+----+-------------------------------------------+
|1 |https://www.delish.com/cooking/recipes/t678|
|2 |https://www.delish.com/food/recipes/a463/ |
|3 |https://www.delish.com/cooking/recipes/g877|
+----+-------------------------------------------+
Output
+----+-------------------------------------------+--------------------------------------------------+-------+
|col1|col2 |splited_col |content|
+----+-------------------------------------------+--------------------------------------------------+-------+
|1 |https://www.delish.com/cooking/recipes/t678|[https:, , www.delish.com, cooking, recipes, t678]|cooking|
|2 |https://www.delish.com/food/recipes/a463/ |[https:, , www.delish.com, food, recipes, a463, ] |food |
|3 |https://www.delish.com/cooking/recipes/g877|[https:, , www.delish.com, cooking, recipes, g877]|cooking|
+----+-------------------------------------------+--------------------------------------------------+-------+
Upvotes: 1
Reputation: 14895
You can use parse_url to the get the path of the url and then get the first level of the path with regexp_extract:
df.withColumn("content", fn.expr("regexp_extract(parse_url(refererurl, 'PATH'),'/(.*?)/')")) \
.show(truncate=False)
Output:
+-------------------------------------------+-------+
|refererurl |content|
+-------------------------------------------+-------+
|https://www.delish.com/cooking/recipes/t678|cooking|
|https://www.delish.com/food/recipes/a463/ |food |
|https://www.delish.com/cooking/recipes/g877|cooking|
+-------------------------------------------+-------+
Upvotes: 2