Reputation: 291
Experts, I have a simple requirement but am not able to find the function to achieve it.
I am using PySpark (Spark 1.6 with Python 2.7) and have a simple PySpark DataFrame column with values like:
1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC
1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234
1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678
What these values have in common is a single underscore, followed by some characters (there can be any number of them). Those trailing characters are what I want in the output. I want to use a substring or regex function that finds the position of the underscore in each column value and selects everything from that position + 1 through the end of the value. So the output would be a DataFrame with the values:
ABC
1234
12345678
I tried using substring, but couldn't find anything to get the index of the underscore.
Thanks!
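To show the per-string logic I'm after, here it is in plain Python on one value (just a sketch of the intent, not a Spark solution):

```python
# Find the first underscore, then slice from the next character to the end.
val = "1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC"
suffix = val[val.index("_") + 1:]
print(suffix)  # ABC
```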
Upvotes: 1
Views: 9165
Reputation: 655
No need to use any regexp.
Please try the below: split on the _ character and take the second item via getItem(). (Note: this uses the SparkSession entry point spark, which was introduced in Spark 2.0; on Spark 1.6, use sqlContext.createDataFrame instead.)
>>> from pyspark.sql import functions as F
>>> my_list = [("1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",),("1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",),("1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678",)]
>>> my_df = spark.createDataFrame(my_list, schema=["input_v"])
>>> my_df.withColumn("get_val", F.split(F.col("input_v"),"_").getItem(1)).show(20,False)
Result
+-------------------------------------------------------------------+--------+
|input_v |get_val |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC |ABC |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234 |1234 |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+
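One caveat: getItem(1) assumes each value contains exactly one underscore; if a value could ever contain more than one, it returns only the piece between the first two. The difference is easiest to see with plain Python's split (Spark's F.split behaves like the unlimited case):

```python
one = "78a9f4811265_ABC"
two = "78a9f4811265_AB_C"

print(one.split("_")[1])     # ABC
print(two.split("_")[1])     # AB   -- only the piece between the underscores
print(two.split("_", 1)[1])  # AB_C -- everything after the first underscore
```

For the multi-underscore case, the regexp_extract approach in the other answer keeps everything after the first underscore.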
Upvotes: 5
Reputation: 31480
You can use regexp_extract to extract the data after _.
Example:
from pyspark.sql.functions import *
df=spark.sql("""select stack(3,"1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC","1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234","1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678") as (txt)""")
df.withColumn("extract",regexp_extract(col("txt"),"_(.*)",1)).show(10,False)
Result:
+-------------------------------------------------------------------+--------+
|txt |extract |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC |ABC |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234 |1234 |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+
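You can sanity-check the pattern _(.*) in plain Python with the re module; group 1 captures everything after the first underscore, which is what regexp_extract(..., "_(.*)", 1) returns:

```python
import re

vals = [
    "1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",
    "1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",
    "1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678",
]
# Group 1 of "_(.*)" captures everything after the first underscore.
extracted = [re.search("_(.*)", v).group(1) for v in vals]
print(extracted)  # ['ABC', '1234', '12345678']
```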
Upvotes: 2