marc

Reputation: 319

Spark extract value to multiple columns based on name

I have a String column and need to extract its values into multiple columns based on the name associated with each value.

otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 

The columns that need to be formed from the above are:

State     | Area      | Sub Area | ID | Name
DALLocate | SFO-4/3/9 | 8        | 8  | 7

Any help is appreciated.

Upvotes: 0

Views: 795

Answers (2)

Matt

Reputation: 650

If the pattern is always fixed, you could use regexp_extract:

from pyspark.sql.functions import regexp_extract

# sample input with a single string column called "raw"
df = spark.createDataFrame([{"raw": "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "}], 'raw string')

(df
 .select(regexp_extract('raw', 'State ([^_]*)', 1).alias('State'),            # text between "State " and the first "_"
         regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'), # text after the "_" up to the next space
         regexp_extract('raw', 'Area=<(.*)>', 1).alias('Sub Area'),           # value inside the angle brackets
         regexp_extract('raw', 'ID ([^ ]*)', 1).alias('ID'),
         regexp_extract('raw', 'Name ([^ ]*)', 1).alias('Name')).show())

regexp_extract takes three arguments: the first is the column you want to match on, the second is the pattern, and the third is the group you want to extract.

ref: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_extract

Upvotes: 1

Nir Hedvat

Reputation: 870

Try this:

import org.apache.spark.sql.functions.udf
import spark.implicits._  // needed for the $"..." column syntax below

// Split the raw string into an array of 5 values: State, Area, Sub Area, ID, Name
def myFunc: String => Array[String] = s => Array(/* TODO parse the string as you wish */)
val myUDF = udf(myFunc)

df.withColumn("parsedInput", myUDF(df("input")))
  .select(
    $"parsedInput"(0).as("State"),
    $"parsedInput"(1).as("Area"),
    $"parsedInput"(2).as("Sub Area"),
    $"parsedInput"(3).as("ID"),
    $"parsedInput"(4).as("Name"))

Where 'input' is the column holding your original string (e.g. "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 ").

Make sure your UDF returns a valid array (correct number of items, in the right order).
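
For the single sample string in the question, one way to fill in the TODO could look like the sketch below. The regex is only an assumption derived from that one example (same capture order as the select above), so adapt it to your real data:

def myFunc: String => Array[String] = s => {
  // assumed format: "... State <State>_<Area> sub Area=<<Sub Area>> ID <ID> Name <Name> "
  val pattern = """State ([a-zA-Z]+)_(\S+) sub Area=<([^>]*)> ID (\S+) Name (\S+)""".r
  pattern.findFirstMatchIn(s) match {
    case Some(m) => Array(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5))
    case None    => Array("", "", "", "", "")  // keep length 5 so the select never fails on non-matching rows
  }
}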

Upvotes: 0
