Reputation: 2075
My data looks like this:
{"id":"1","time":123,"sth":100}
{"id":"2","sth":456}
{"id":"3","time":789,"sth":300}
I define my schema as:
StructType(
  Array(
    StructField("id", StringType, false),
    StructField("time", StringType, false),
    StructField("sth", StringType, true)
  )
)
And I read my data with:
val df = spark.read.schema(buildSchema()).json(path)
I want the DataFrame to skip any line that has no "time" value, so the result should be:
| id | time | sth |
| 1 | 123 | 100 |
| 3 | 789 | 300 |
However, even though I set nullable to false in the StructField, Spark still reads the second line, {"id":"2","sth":456}, into my table, and I have to spend extra time dropping the rows with null values after reading. Is there a more efficient way to do this?
Upvotes: 1
Views: 2139
Reputation: 344
You can try this:
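// Build a one-element RDD holding the sample records as a JSON array
val otherPeopleRDD = spark.sparkContext.makeRDD(
  """[{"id":"1","time":123,"sth":100},
      {"id":"2","sth":456},
      {"id":"3","time":789,"sth":300}]""" :: Nil)
// na.drop() removes every row that has a null in any column;
// use na.drop(Seq("time")) to check only the "time" column
val otherPeople = spark.read.json(otherPeopleRDD).na.drop()
otherPeople.show()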
+---+---+----+
| id|sth|time|
+---+---+----+
|  1|100| 123|
|  3|300| 789|
+---+---+----+
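To combine this with the explicit schema from the question, here is a minimal sketch. It assumes Spark 2.2 or later (for reading JSON from a Dataset[String]) and a local session. The object name DropMissingTime and the helper readWithTime are made up for illustration. As far as I know, Spark treats nullable = false as a hint for file-based sources rather than as a row filter, so the rows without "time" still have to be removed explicitly. Filtering on that one column keeps rows where only "sth" is missing.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical names: DropMissingTime and readWithTime are illustrative only.
object DropMissingTime {
  // Same schema as in the question; nullable = false is not enforced on read.
  val schema: StructType = StructType(Array(
    StructField("id", StringType, nullable = false),
    StructField("time", StringType, nullable = false),
    StructField("sth", StringType, nullable = true)
  ))

  // Parse one JSON record per string, then keep only rows that have a "time" value.
  def readWithTime(spark: SparkSession, lines: Dataset[String]): DataFrame =
    spark.read.schema(schema).json(lines).filter(col("time").isNotNull)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-missing-time")
      .getOrCreate()
    import spark.implicits._

    val lines = Seq(
      """{"id":"1","time":123,"sth":100}""",
      """{"id":"2","sth":456}""",
      """{"id":"3","time":789,"sth":300}"""
    ).toDS()

    readWithTime(spark, lines).show()
    spark.stop()
  }
}
```

With a file on disk, the same filter can be chained directly onto the original call: spark.read.schema(buildSchema()).json(path).filter(col("time").isNotNull).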
Upvotes: 3