marc

Reputation: 319

Spark extract value to multiple columns based on name

I have a String column and need to extract its values into multiple columns based on the name associated with each value.

otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 

The columns that need to be formed from the above are:

State     | Area      | Sub Area | ID | Name
DALLocate | SFO-4/3/9 | 8        | 8  | 7

Any help is appreciated.

Upvotes: 0

Views: 795

Answers (2)

Matt

Reputation: 650

If the pattern is always fixed, you could use regexp_extract:

from pyspark.sql.functions import regexp_extract

# sample input with a single string column called "raw"
df = spark.createDataFrame([{"raw": "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "}], 'raw string')

(df
 .select(regexp_extract('raw', 'State ([^_]*)', 1).alias('State'),            # text between "State " and the first "_"
         regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'), # text after the "_" up to the next space
         regexp_extract('raw', 'Area=<(.*)>', 1).alias('Sub Area'),           # value inside the angle brackets
         regexp_extract('raw', 'ID ([^ ]*)', 1).alias('ID'),
         regexp_extract('raw', 'Name ([^ ]*)', 1).alias('Name')).show())

regexp_extract takes three arguments: the first is the column you want to match on, the second is the pattern, and the third is the group you want to extract.

ref: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_extract

Upvotes: 1

Nir Hedvat

Reputation: 870

Try this:

import org.apache.spark.sql.functions.udf
import spark.implicits._  // needed for the $"..." column syntax below

// Split the raw string into an array of 5 values: State, Area, Sub Area, ID, Name
def myFunc: String => Array[String] = s => Array(/* TODO parse the string as you wish */)
val myUDF = udf(myFunc)

df.withColumn("parsedInput", myUDF(df("input")))
  .select(
    $"parsedInput"(0).as("State"),
    $"parsedInput"(1).as("Area"),
    $"parsedInput"(2).as("Sub Area"),
    $"parsedInput"(3).as("ID"),
    $"parsedInput"(4).as("Name"))

Where 'input' is the column holding your original string (e.g. "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 ").

Make sure your UDF returns a valid array (correct number of items, in the right order).
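
For the single sample string in the question, one way to fill in the TODO could look like the sketch below. The regex is only an assumption derived from that one example (same capture order as the select above), so adapt it to your real data:

def myFunc: String => Array[String] = s => {
  // assumed format: "... State <State>_<Area> sub Area=<<Sub Area>> ID <ID> Name <Name> "
  val pattern = """State ([a-zA-Z]+)_(\S+) sub Area=<([^>]*)> ID (\S+) Name (\S+)""".r
  pattern.findFirstMatchIn(s) match {
    case Some(m) => Array(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5))
    case None    => Array("", "", "", "", "")  // keep length 5 so the select never fails on non-matching rows
  }
}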

Upvotes: 0
