Reputation: 2967
I have a network.log
on Hadoop:
{"Source":"Network","Detail":"Event=01|Device=Mobile|ClientIP=10.0.0.0|URL=example.com"}
I want to load it as a data frame, splitting Detail on |. Then I want to further split each new column on =, with the left part as the column name and the right part as the value.
The expected result would be:
Source | Event | Device | ClientIP | URL
Network | 01 | Mobile | 10.0.0.0 | example.com
I've done the first split as follows:
from pyspark import SparkContext
from pyspark.sql import functions, SQLContext

INPUT_PATH = 'network.log'

sc = SparkContext("local", "NetworkEvent")
sqlContext = SQLContext(sc)

# Read the JSON log and split Detail on '|'
raw = sqlContext.read.json(INPUT_PATH)
detail_col = functions.split(raw['Detail'], r'\|')

for i in range(4):
    raw = raw.withColumn('col_' + str(i), detail_col.getItem(i))

raw.show()
My question is: can I do the second split in the same step, on top of detail_col.getItem(i)? I could write another UDF for each column of the new data frame, but is there a more elegant way to do it all in one UDF (roughly sketched below)? Many thanks!
Note: I'm using Spark 1.5.0 so Pandas' UDFs won't be available.
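For clarity, here is a rough, untested sketch of the single-UDF idea I have in mind (parse Detail into a map once, then pull each key out):
from pyspark.sql.types import MapType, StringType

# Rough sketch only: a single UDF that parses Detail into a key -> value map
parse_detail = functions.udf(
    lambda s: dict(kv.split('=', 1) for kv in s.split('|')),
    MapType(StringType(), StringType()))

with_map = raw.withColumn('detail_map', parse_detail(raw['Detail']))
for name in ['Event', 'Device', 'ClientIP', 'URL']:
    with_map = with_map.withColumn(name, with_map['detail_map'].getItem(name))
with_map.show()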
Upvotes: 0
Views: 917
Reputation: 4420
There's no need to write a UDF for this; there are several alternatives. Here is one of them:
from pyspark import SparkContext
from pyspark.sql import functions, SQLContext

INPUT_PATH = 'network.log'

sc = SparkContext("local", "NetworkEvent")
sqlContext = SQLContext(sc)

raw = sqlContext.read.json(INPUT_PATH)
detail_col = functions.split(raw['Detail'], r'\|')

# Column names come from the left-hand side of '=' in the first row
cols_to_be = raw.select(
    [functions.split(detail_col.getItem(i), "=").getItem(0).alias("col_" + str(i))
     for i in range(4)]
).first()

# Values come from the right-hand side of '='
for i in range(4):
    raw = raw.withColumn(
        cols_to_be["col_" + str(i)],
        functions.split(detail_col.getItem(i), "=").getItem(1)
    )

raw.show()
+--------------------+-------+-----+------+--------+-----------+
| Detail| Source|Event|Device|ClientIP| URL|
+--------------------+-------+-----+------+--------+-----------+
|Event=01|Device=M...|Network| 01|Mobile|10.0.0.0|example.com|
+--------------------+-------+-----+------+--------+-----------+
This assumes your Detail data always follows the same pattern.
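If you also want the result to match the expected layout in the question exactly, you can drop the original Detail column afterwards:
# Optional: keep only the parsed columns
raw = raw.drop('Detail')
raw.show()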
Upvotes: 1
Reputation: 15258
In 1.5.0, you can use regexp_extract.
from pyspark.sql import functions as F

# For each key, grab everything after "<key>=" up to the next '|'
for i in ['Event', 'Device', 'ClientIP', 'URL']:
    df = df.withColumn(i, F.regexp_extract('Detail', r"{}=([^\|]+)".format(i), 1))

df.show()
+-------+--------------------+-----+------+--------+-----------+
| Source| Detail|Event|Device|ClientIP| URL|
+-------+--------------------+-----+------+--------+-----------+
|Network|Event=01|Device=M...| 01|Mobile|10.0.0.0|example.com|
+-------+--------------------+-----+------+--------+-----------+
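If you prefer not to hardcode the key names, one possible variation (assuming every row carries the same keys as the first row) is to derive them from the first record:
# Derive the keys from the first row's Detail string, then extract as above.
# Assumes all rows share the same set of keys.
keys = [kv.split('=', 1)[0] for kv in df.first()['Detail'].split('|')]
for k in keys:
    df = df.withColumn(k, F.regexp_extract('Detail', r"{}=([^\|]+)".format(k), 1))
df.show()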
Upvotes: 2