Reputation: 2458
I have a data set (~100GB) in S3 where each record has a timestamp followed by a JSON string rather than just a pure JSON string. The data is gzipped. Is there any way to read this data into a DataFrame without reformatting it to remove the timestamp? I do not need the timestamp at all; it can be disregarded. Here is an example of the data:
2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }
I normally use the Glue library's read_from_options
to read the data, but I don't see any option to ignore the timestamp and just read in the JSON string. I was unsure if Spark has any functionality to do this.
Upvotes: 0
Views: 676
Reputation: 1380
Anamdeo's answer is good, but I would stress that you should avoid using UDFs whenever possible due to performance concerns. In this case, you can easily use regexp_extract
to separate the timestamp from the JSON content that you're interested in:
scala> val regex = "([0-9\\-TZ\\.:]+) (\\{.*)"
regex: String = ([0-9\-TZ\.:]+) (\{.*)
scala> val dff = df.withColumn("tstamp", regexp_extract('json_content, regex, 1)).withColumn("json", regexp_extract('json_content, regex, 2)).drop("json_content")
dff: org.apache.spark.sql.DataFrame = [country: string, city: string ... 2 more fields]
scala> dff.show(false)
+-------+-------+------------------------+----------------------------------+
|country|city |tstamp |json |
+-------+-------+------------------------+----------------------------------+
|america|chicago|2019-06-28T00:00:00.000Z|{ "a": 123, "b": "456", "c": 789 }|
|india |mumbai |2019-06-28T00:00:00.000Z|{ "a": 123, "b": "456", "c": 789 }|
+-------+-------+------------------------+----------------------------------+
From this point, you can use Spark's built-in functions like from_json
and get_json_object
to work directly with your JSON data if needed.
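For example, a minimal sketch of parsing the extracted json column with from_json (the schema below is only an assumption based on the sample record):
import org.apache.spark.sql.functions.{col, from_json, get_json_object}
import org.apache.spark.sql.types._

// Assumed schema matching the sample { "a": 123, "b": "456", "c": 789 }
val schema = new StructType().add("a", LongType).add("b", StringType).add("c", LongType)

val parsed = dff.withColumn("data", from_json(col("json"), schema))
parsed.select(col("data.a"), col("data.b"), get_json_object(col("json"), "$.c").as("c")).show(false)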
Upvotes: 3
Reputation: 354
Let's assume this is how the data looks -
country|city|json_content
america|chicago|2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }
india|mumbai|2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }
Read it into a Spark DataFrame -
val df = spark
  .read
  .option("header", "true") // Use first line of all files as header
  .option("delimiter", "|")
  .csv("csv_file_path")
Since you have gzipped data, read it into an RDD first as shown below, then convert it into a DataFrame (a sketch of one possible RDD-to-DF conversion follows the snippet) -
val rdd = sc.textFile("myFile.gz")
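A minimal sketch of that RDD-to-DF conversion, assuming the same country|city|json_content layout with a header line as in the example above (dfFromRdd is just an illustrative name) -
import spark.implicits._

// Drop the header line and split each record on "|" (layout is assumed from the example above)
val header = rdd.first()
val dfFromRdd = rdd.filter(_ != header)
  .map(_.split("\\|"))
  .map(arr => (arr(0), arr(1), arr(2)))
  .toDF("country", "city", "json_content")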
Import the required functions -
import org.apache.spark.sql.functions._
Write and register a UDF which extracts only the JSON content from your timestamp+JSON column -
val getJsonContent = udf{input: String => input.substring(input.indexOf("{"))}
Apply this UDF and create the final DataFrame -
val finalDf = df.withColumn("json_content",getJsonContent(col("json_content")))
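If you then need individual fields out of the cleaned JSON column, get_json_object (available from the functions import above) is one option; the field names here are just taken from the sample record -
val withFields = finalDf
  .withColumn("a", get_json_object(col("json_content"), "$.a"))
  .withColumn("b", get_json_object(col("json_content"), "$.b"))
withFields.show(false)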
Upvotes: 3