Reputation: 2458
I have a data set (~100GB) in S3 where each record has a timestamp followed by a JSON string rather than just a pure JSON string. The data is gzipped. Is there any way to read this data into a DataFrame without reformatting it to remove the timestamp? I do not need the timestamp at all; it can be disregarded. Here is an example of the data:
2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }
I normally use the Glue library's read_from_options
to read the data, but I don't see any option to ignore the timestamp and just read in the JSON string. I was unsure if Spark has any functionality to do this.
Upvotes: 0
Views: 676
Reputation: 1380
Anamdeo's answer is good, but I would stress that you should avoid using UDFs whenever possible due to performance concerns. In this case, you can easily use regexp_extract
to separate the timestamp from the JSON content that you're interested in:
scala> val regex = "([0-9\\-TZ\\.:]+) (\\{.*)"
regex: String = ([0-9\-TZ\.:]+) (\{.*)
scala> val dff = df.withColumn("tstamp", regexp_extract('json_content, regex, 1)).withColumn("json", regexp_extract('json_content, regex, 2)).drop("json_content")
dff: org.apache.spark.sql.DataFrame = [country: string, city: string ... 2 more fields]
scala> dff.show(false)
+-------+-------+------------------------+----------------------------------+
|country|city |tstamp |json |
+-------+-------+------------------------+----------------------------------+
|america|chicago|2019-06-28T00:00:00.000Z|{ "a": 123, "b": "456", "c": 789 }|
|india |mumbai |2019-06-28T00:00:00.000Z|{ "a": 123, "b": "456", "c": 789 }|
+-------+-------+------------------------+----------------------------------+
From this point, you can use Spark's built-in functions like from_json
and get_json_object
to work directly with your JSON data if needed.
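For example, a minimal sketch of parsing the extracted json column with from_json (the schema below is only an assumption based on the sample record):
import org.apache.spark.sql.functions.{col, from_json, get_json_object}
import org.apache.spark.sql.types._

// Assumed schema matching the sample { "a": 123, "b": "456", "c": 789 }
val schema = new StructType().add("a", LongType).add("b", StringType).add("c", LongType)

val parsed = dff.withColumn("data", from_json(col("json"), schema))
parsed.select(col("data.a"), col("data.b"), get_json_object(col("json"), "$.c").as("c")).show(false)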
Upvotes: 3
Reputation: 354
Let's assume this is how the data looks -
country|city|json_content
america|chicago|2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }
india|mumbai|2019-06-28T00:00:00.000Z { "a": 123, "b": "456", "c": 789 }
Read it into a Spark DataFrame -
val df = spark
  .read
  .option("header", "true") // Use first line of all files as header
  .option("delimiter", "|")
  .csv("csv_file_path")
Since you have gzipped data, read it into an RDD first as shown below, then convert it into a DataFrame (a sketch of one possible RDD-to-DF conversion follows the snippet) -
val rdd = sc.textFile("myFile.gz")
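A minimal sketch of that RDD-to-DF conversion, assuming the same country|city|json_content layout with a header line as in the example above (dfFromRdd is just an illustrative name) -
import spark.implicits._

// Drop the header line and split each record on "|" (layout is assumed from the example above)
val header = rdd.first()
val dfFromRdd = rdd.filter(_ != header)
  .map(_.split("\\|"))
  .map(arr => (arr(0), arr(1), arr(2)))
  .toDF("country", "city", "json_content")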
Import the required functions -
import org.apache.spark.sql.functions._
Write and register a UDF which extracts only the JSON content from your timestamp+JSON column -
val getJsonContent = udf{input: String => input.substring(input.indexOf("{"))}
Apply this UDF and create the final DataFrame -
val finalDf = df.withColumn("json_content",getJsonContent(col("json_content")))
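If you then need individual fields out of the cleaned JSON column, get_json_object (available from the functions import above) is one option; the field names here are just taken from the sample record -
val withFields = finalDf
  .withColumn("a", get_json_object(col("json_content"), "$.a"))
  .withColumn("b", get_json_object(col("json_content"), "$.b"))
withFields.show(false)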
Upvotes: 3