Sachin Doiphode

Reputation: 433

Not able to create a DataFrame from a multi-line JSON string or JSONL string using Spark

I have been trying to create a DataFrame from a JSONL string. The DataFrame is created, but only the first row is read and the others are ignored.
Here is what I tried in spark-shell:

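import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Example string holding several JSON objects.
val jsonEx = "{\"name\":\"James\"}{\"name\":\"John\"}{\"name\":\"Jane\"}"

// Schema for it
val sch = new StructType().add("name", StringType)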

val ds = Seq(jsonEx).toDS()

// 1st attempt -- using multiline and spark.json
spark.read.option("multiLine", true).schema(sch).json(ds).show
+-----+
| name|
+-----+
|James|
+-----+

// 2nd attempt -- using from_json
ds.withColumn("json", from_json(col("value"), sch)).select("json.*").show
+-----+
| name|
+-----+
|James|
+-----+

// 3rd attempt -- using from_json in a slightly different way
ds.select(from_json(col("value"), sch) as "json").select("json.*").show
+-----+
| name|
+-----+
|James|
+-----+

I even tried changing the string to
val jsonEx = "{\"name\":\"James\"}\n{\"name\":\"John\"}\n{\"name\":\"Jane\"}"
and
val jsonEx = "{\"name\":\"James\"}\n\r{\"name\":\"John\"}\n\r{\"name\":\"Jane\"}"

But the result was the same.

Does anyone know what I am missing here?

In case someone is wondering why I am not reading from a file instead of a string: I have a JSONL config file on the resources path. When I try to read it using getClass.getResource, Scala gives me an error, while getClass.getResourceAsStream works and I am able to read the data.

val configPath = "/com/org/example/data_sources_config.jsonl"
for(line <- Source.fromInputStream(getClass.getResourceAsStream(configPath)).getLines) { print(line)}
{"name":"james"} ...

But when I do
for(line <- Source.fromFile(getClass.getResource(configPath).getPath).getLines) { print(line)}
java.io.FileNotFoundException: file:/Users/sachindoiphode/workspace/dap-links-datalake-jobs/target/dap-links-datalake-jobs-0.0.65.jar!/com/org/example/data_sources_config.jsonl (No such file or directory)

Upvotes: 0

Views: 463

Answers (1)

Chaitanya Chandurkar

Reputation: 2187

Even if jsonEx is multi-line JSON, it is still a single element of the Dataset. You need to split it into rows first:

val ds = jsonEx.split("\n").toSeq.toDS
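
A minimal sketch of that idea, assuming a local SparkSession (in spark-shell, `spark` and its implicits are already available) and the newline-separated variant of `jsonEx` from the question. Each line becomes its own Dataset element, so `spark.read.json` should parse one record per element:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("jsonl-example").getOrCreate()
import spark.implicits._

val jsonEx = "{\"name\":\"James\"}\n{\"name\":\"John\"}\n{\"name\":\"Jane\"}"
val sch = new StructType().add("name", StringType)

// One Dataset element per JSONL line; trim removes stray "\r", blank lines are dropped.
val lines = jsonEx.split("\n").map(_.trim).filter(_.nonEmpty).toSeq
val df = spark.read.schema(sch).json(lines.toDS())
df.show()
```

The `multiLine` option is not needed here, because every element already holds one complete JSON object. Note that the first `jsonEx` in the question has no newlines between the objects, so splitting on `"\n"` would still leave a single element; this sketch only covers the newline-separated variants.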

To read the multi-line JSON file, you could try something like this:

import scala.io.Source

val path = "/com/org/example/data_sources_config.jsonl"
val source = Source.fromFile(getClass.getResource(path).getPath)
// Keep the line breaks so the content can be split into rows again.
val content = source.getLines.mkString("\n")

Then do content.split("\n").toSeq.toDF if you want to create a DataFrame out of it.
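
The `FileNotFoundException` shown in the question arises because a resource packaged inside a jar is not a file on the file system, so `Source.fromFile` cannot open a `file:...jar!/...` path. A sketch of an alternative, built on the `getResourceAsStream` call that the question reports as working; `readLines` is a hypothetical helper name, not part of any library:

```scala
import java.io.InputStream
import scala.io.Source

// Hypothetical helper: read all non-empty lines from a stream, then close it.
def readLines(in: InputStream): Seq[String] = {
  val source = Source.fromInputStream(in, "UTF-8")
  // toList forces the lazy iterator before the source is closed.
  try source.getLines().map(_.trim).filter(_.nonEmpty).toList
  finally source.close()
}

// Usage with the resource path from the question (assumes `spark`, `sch`
// and `spark.implicits._` are in scope, as in spark-shell):
// val lines = readLines(getClass.getResourceAsStream("/com/org/example/data_sources_config.jsonl"))
// val df = spark.read.schema(sch).json(lines.toDS())
```

Because the helper returns one string per JSONL line, the result can be turned into a Dataset directly, with no need to join the lines and split them again.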

Upvotes: 1
