Eumcoz

Reputation: 2458

Reading json with leading timestamp

I have a data set (~100 GB) in S3 where each record is a timestamp followed by a JSON string rather than just a pure JSON string. The data is gzipped. Is there any way to read this data into a DataFrame without reformatting it to remove the timestamp? I do not need the timestamp at all; it can be disregarded. Here is an example of the data:

2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }

I normally use the Glue library's read_from_options to read the data, but I don't see any option to ignore the timestamp and just read in the JSON string. I was unsure if Spark had any functionality to do this.

Upvotes: 0

Views: 676

Answers (2)

Charlie Flowers

Reputation: 1380

Anamdeo's answer is good, but I would stress that you should avoid using UDFs whenever possible due to performance concerns. In this case, you can easily use regexp_extract to separate the timestamp from the JSON content that you're interested in:

scala> val regex =  "([0-9\\-TZ\\.:]+) (\\{.*)"
regex: String = ([0-9\-TZ\.:]+) (\{.*)

scala> val dff = df.withColumn("tstamp", regexp_extract('json_content, regex, 1)).withColumn("json", regexp_extract('json_content, regex, 2)).drop("json_content")
dff: org.apache.spark.sql.DataFrame = [country: string, city: string ... 2 more fields]

scala> dff.show(false)
+-------+-------+------------------------+----------------------------------+
|country|city   |tstamp                  |json                              |
+-------+-------+------------------------+----------------------------------+
|america|chicago|2019-06-28T00:00:00.000Z|{ "a": 123, "b": "456", "c": 789 }|
|india  |mumbai |2019-06-28T00:00:00.000Z|{ "a": 123, "b": "456", "c": 789 }|
+-------+-------+------------------------+----------------------------------+

From this point, you can use Spark's built-in functions like from_json and get_json_object to work directly with your JSON data if needed.
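For instance, here is a minimal sketch of parsing the extracted json column with from_json. The schema below is an assumption based on the sample row; adjust the field names and types to match your actual data:

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    // Assumed schema inferred from the sample record; adjust to your real data
    val schema = StructType(Seq(
      StructField("a", LongType),
      StructField("b", StringType),
      StructField("c", LongType)
    ))

    // Parse the JSON string into a struct and flatten it into top-level columns
    val parsed = dff
      .withColumn("parsed", from_json(col("json"), schema))
      .selectExpr("country", "city", "tstamp", "parsed.*")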

Upvotes: 3

a9207

Reputation: 354

Let's assume this is how the data looks -

country|city|json_content
america|chicago|2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }
india|mumbai|2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }

Read it into a Spark DataFrame -

    val df = spark
      .read
      .option("header", "true") // Use first line of all files as header
      .option("delimiter", "|")
      .csv("csv_file_path")

Since you have gzipped data, read it into an RDD first as shown below, then convert it into a DataFrame (let me know if you need my help with the RDD-to-DF conversion) -

val rdd = sc.textFile("myFile.gz")

Import important functions -

import org.apache.spark.sql.functions._

Write and register a UDF which extracts only the JSON content from your timestamp+JSON column -

    val getJsonContent = udf{input: String => input.substring(input.indexOf("{"))}

Apply this UDF and create the final DataFrame -

    val finalDf = df.withColumn("json_content",getJsonContent(col("json_content")))

Upvotes: 3
