Reputation: 319
i have a String column and need to extract values of it into multiple columns based on the name associated with it.
otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7
the columns need to be formed from above are
State | Area | Sub Area | ID | Name
DALLocate | SFO-4/3/9 | 8 | 8 | 7
any help is appreciated.
Upvotes: 0
Views: 795
Reputation: 650
IF the pattern is always fixed you could use regexp_extract:
from pyspark.sql.functions import regexp_extract
df = spark.createDataFrame([{"raw": "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 "}], 'raw string')
(df
.select(regexp_extract('raw', 'State ([^_]*)', 1).alias('State'),
regexp_extract('raw', 'State ([a-zA-Z]*)_([^ ]*)', 2).alias('Area'),
regexp_extract('raw', 'Area=<(.*)>', 1).alias('Sub Area'),
regexp_extract('raw', 'ID ([^ ]*)', 1).alias('ID'),
regexp_extract('raw', 'Name ([^ ]*)', 1).alias('Name')).show())
regexp_extract
takes 3 arguments the first ist the column you want to match on. the second is the pattern and the third is the group you want to extract.
Upvotes: 1
Reputation: 870
Try this:
import org.apache.spark.sql.functions.udf
def myFunc: String => Array[String] = s => Array(/* TODO parse the string as you wish */)
val myUDF = udf(myFunc)
df.withColumn("parsedInput", myUDF(df("input")))
.select(
$"parsedInput"(0).as("State"),
$"parsedInput"(1).as("Area"),
$"parsedInput"(2).as("Sub Area"),
$"parsedInput"(3).as("ID"),
$"parsedInput"(4).as("Name"))
Where 'input' is your original input (e.g. "otherPartofString State DALLocate_SFO-4/3/9 sub Area=<8> ID 8 Name 7 ").
Make sure your UDF returns a valid array (num of items and order)
Upvotes: 0