Rock

Reputation: 2967

PySpark: nested-splitting one column into multiple new ones

I have a network.log on Hadoop:

{"Source":"Network","Detail":"Event=01|Device=Mobile|ClientIP=10.0.0.0|URL=example.com"}

I want to load it as a DataFrame, splitting Detail on |. Then I want to further split each resulting piece on =, using the left part as the column name and the right part as the value.

The expected result would be:

Source  | Event | Device | ClientIP | URL
Network | 01    | Mobile | 10.0.0.0 | example.com

I've done the first split as follows:

from pyspark import SparkContext
from pyspark.sql import functions, SQLContext

INPUT_PATH = 'network.log'

sc = SparkContext("local", "NetworkEvent")
sqlContext = SQLContext(sc)

# Each line of network.log is a JSON object with Source and Detail fields.
raw = sqlContext.read.json(INPUT_PATH)

# First split: break Detail into its pipe-delimited pieces.
detail_col = functions.split(raw['Detail'], '\|')

for i in range(4):
    raw = raw.withColumn('col_' + str(i), detail_col.getItem(i))

raw.show()

My question is: can I do the second split at the same time, on top of detail_col.getItem(i)? I could write a UDF for each column of the new DataFrame, but is there a more elegant way to do it all in one UDF? Many thanks!

Note: I'm using Spark 1.5.0, so pandas UDFs aren't available.
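For reference, the per-column UDF approach I have in mind looks roughly like this (just a sketch; extract_field is an illustrative helper, not existing code):

from pyspark.sql import functions
from pyspark.sql.types import StringType

# One UDF, applied once per key, that re-parses Detail each time.
def extract_field(detail, key):
    for pair in detail.split('|'):
        k, _, v = pair.partition('=')
        if k == key:
            return v
    return None

extract_udf = functions.udf(extract_field, StringType())

for key in ['Event', 'Device', 'ClientIP', 'URL']:
    raw = raw.withColumn(key, extract_udf(raw['Detail'], functions.lit(key)))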

Upvotes: 0

Views: 917

Answers (2)

Rakesh Kumar

Reputation: 4420

There's no need to write a UDF for this; there are several ways to achieve it. Here is one alternative:

from pyspark import SparkContext
from pyspark.sql import functions, SQLContext

INPUT_PATH = 'network.log'

sc = SparkContext("local", "NetworkEvent")
sqlContext = SQLContext(sc)

raw = sqlContext.read.json(INPUT_PATH)

detail_col = functions.split(raw['Detail'], '\|')

# Take the column names from the left-hand side of each key=value pair
# in the first row.
cols_to_be = raw.select([
    functions.split(detail_col.getItem(i), "=").getItem(0).alias("col_" + str(i))
    for i in range(4)
]).first()

# Use those names for new columns holding the right-hand side values.
for i in range(4):
    raw = raw.withColumn(
        cols_to_be["col_" + str(i)],
        functions.split(detail_col.getItem(i), "=").getItem(1)
    )

raw.show()

+--------------------+-------+-----+------+--------+-----------+
|              Detail| Source|Event|Device|ClientIP|        URL|
+--------------------+-------+-----+------+--------+-----------+
|Event=01|Device=M...|Network|   01|Mobile|10.0.0.0|example.com|
+--------------------+-------+-----+------+--------+-----------+

This assumes your Detail data always follows the same pattern.
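Another alternative in the same spirit (a rough sketch only, assuming every Detail value has the same key=value structure): parse each row into a Row on the RDD side and let createDataFrame rebuild the schema.

from pyspark.sql import Row

# Sketch: turn Detail into key/value pairs per row and rebuild the DataFrame.
def parse_row(row):
    fields = dict(pair.split('=', 1) for pair in row['Detail'].split('|'))
    fields['Source'] = row['Source']
    return Row(**fields)

parsed = sqlContext.createDataFrame(raw.rdd.map(parse_row))
parsed.show()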

Upvotes: 1

Steven

Reputation: 15258

In Spark 1.5.0, you can use regexp_extract.

from pyspark.sql import functions as F

# df is the DataFrame read from network.log ("raw" in the question).
# For each key, extract the value that follows "<key>=" up to the next pipe.
for i in ['Event', 'Device', 'ClientIP', 'URL']:
    df = df.withColumn(i, F.regexp_extract('Detail', "{}=([^\|]+)".format(i), 1))

df.show()

+-------+--------------------+-----+------+--------+-----------+
| Source|              Detail|Event|Device|ClientIP|        URL|
+-------+--------------------+-----+------+--------+-----------+
|Network|Event=01|Device=M...|   01|Mobile|10.0.0.0|example.com|
+-------+--------------------+-----+------+--------+-----------+

Upvotes: 2
