Reputation: 433
I have been trying to build a DataFrame from a JSONL string. I can build the DataFrame, but only the first row is read and the others are ignored.
Here is what I tried in spark-shell:
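// Imports needed in spark-shell for StructType/StringType and from_json/col.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
// Example JSON string holding several objects.
val jsonEx = "{\"name\":\"James\"}{\"name\":\"John\"}{\"name\":\"Jane\"}"
// Schema for it.
val sch = new StructType().add("name", StringType)
// Note: this Dataset has exactly one element, the whole string.
val ds = Seq(jsonEx).toDS()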
// 1st attempt -- using multiline and spark.json
spark.read.option("multiLine", true).schema(sch).json(ds).show
+-----+
| name|
+-----+
|James|
+-----+
// 2nd attempt -- using from_json
ds.withColumn("json", from_json(col("value"), sch)).select("json.*").show
+-----+
| name|
+-----+
|James|
+-----+
// 3rd attempt -- using from_json in a slightly different way
ds.select(from_json(col("value"), sch) as "json").select("json.*").show
+-----+
| name|
+-----+
|James|
+-----+
I even tried changing the string to
val jsonEx = "{\"name\":\"James\"}\n{\"name\":\"John\"}\n{\"name\":\"Jane\"}"
and
val jsonEx = "{\"name\":\"James\"}\n\r{\"name\":\"John\"}\n\r{\"name\":\"Jane\"}"
but the result was the same.
Does anyone know what I am missing here?
In case someone is wondering why I am reading from a string instead of a file: I have a JSONL config file inside the resources path. When I try to read it using getClass.getResource, Scala gives me an error, while getClass.getResourceAsStream works and I am able to read the data.
val configPath = "/com/org/example/data_sources_config.jsonl"
for(line <- Source.fromInputStream(getClass.getResourceAsStream(configPath)).getLines) { print(line)}
{"name":"james"} ...
But when I do this instead,
for(line <- Source.fromFile(getClass.getResource(configPath).getPath).getLines) { print(line)}
java.io.FileNotFoundException: file:/Users/sachindoiphode/workspace/dap-links-datalake-jobs/target/dap-links-datalake-jobs-0.0.65.jar!/com/org/example/data_sources_config.jsonl (No such file or directory)
Upvotes: 0
Views: 463
Reputation: 2187
Even if jsonEx is multi-line JSON, it is still a single element, so Seq(jsonEx).toDS() produces a Dataset with one row. You need to split it into one string per JSON object first:
val ds = jsonEx.split("\n").toSeq.toDS
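As a rough illustration in plain Scala (no Spark needed), the sketch below shows the difference: wrapping the string in a Seq gives one element, while splitting on the newline gives one element per JSON object. It assumes the newline-separated variant of jsonEx from the question; the names wrapped and rows are only for illustration.

```scala
// Newline-separated variant of jsonEx from the question.
val jsonEx = "{\"name\":\"James\"}\n{\"name\":\"John\"}\n{\"name\":\"Jane\"}"

// What the question does: the whole string becomes a single element.
val wrapped = Seq(jsonEx)

// What the split does: one element per line, i.e. per JSON object.
val rows = jsonEx.split("\n").toSeq

println(wrapped.size) // 1
println(rows.size)    // 3
```

Note that the split only helps when the objects are actually separated by a newline; the first jsonEx in the question has no separator at all, so it would still come out as one element.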
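To read the JSONL file from the resources path, maybe you can try something like this:
val path = "/com/org/example/data_sources_config.jsonl"
// getResourceAsStream also works when the file is packaged inside a jar, where Source.fromFile fails
val source = Source.fromInputStream(getClass.getResourceAsStream(path))
// keep the newline between lines so the content can be split again
val content = source.getLines.mkString("\n")
Then do content.split("\n").toSeq.toDF
if you want to create a DataFrame out of it. That DataFrame has a single string column named value, which you can then parse with from_json.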
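A minimal sketch of the stream-based reading step, assuming the file has one JSON object per line. The helper name readJsonLines is hypothetical, and the in-memory stream below only stands in for getClass.getResourceAsStream(path) so the snippet runs on its own.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import scala.io.Source

// Hypothetical helper: returns the non-empty lines of a stream, one JSON object per line.
def readJsonLines(in: InputStream): Seq[String] = {
  val source = Source.fromInputStream(in, "UTF-8")
  // toList forces the lazy iterator before the source is closed.
  try source.getLines().map(_.trim).filter(_.nonEmpty).toList
  finally source.close()
}

// Stand-in for getClass.getResourceAsStream(path); the sample data is made up.
val sample = "{\"name\":\"James\"}\n{\"name\":\"John\"}\n\n{\"name\":\"Jane\"}\n"
val lines = readJsonLines(new ByteArrayInputStream(sample.getBytes("UTF-8")))

lines.foreach(println)
```

In spark-shell the resulting Seq[String] can then be turned into a Dataset with lines.toDS and passed to spark.read.schema(sch).json(...), which accepts a Dataset[String] in Spark 2.2 and later and parses each element as its own row.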
Upvotes: 1