Reputation: 1305
I have a JSON file in the following format (one object per line):
{"_t":1480647647,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1480647676,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1483161958,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1483162393,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1483499947,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1505361824,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
{"_t":1505362047,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
{"_t":1505362372,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
{"_t":1505362854,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
{"_t":1505366071,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
I'm using Apache Spark in my Java application to read this JSON file and save it in Parquet format.
If I don't use a schema definition, the file is parsed without any problem. Here is my code:
Dataset<Row> dataset = spark.read().json(pathToFile);
dataset.show(100);
And here is my console output:
+-------------+------------------+----------+-------+-------+-----------+
| _n| _p| _t|account|channel|device_type|
+-------------+------------------+----------+-------+-------+-----------+
| app_loaded| [email protected]|1480647647| null| null| desktop|
| app_loaded| [email protected]|1480647676| null| null| desktop|
| app_loaded| [email protected]|1483161958| null| null| desktop|
| app_loaded| [email protected]|1483162393| null| null| desktop|
| app_loaded| [email protected]|1483499947| null| null| desktop|
|added_to_team| [email protected]|1505361824| 1234| null| null|
|added_to_team| [email protected]|1505362047| 1234| null| null|
...
When I use a schema definition like this:
StructType schema = new StructType();
schema.add("_n", StringType, true);
schema.add("_p", StringType, true);
schema.add("_t", TimestampType, true);
schema.add("account", StringType, true);
schema.add("channel", StringType, true);
schema.add("device_type", StringType, true);
// Read data from file
Dataset<Row> dataset = spark.read().schema(schema).json(pathToFile);
dataset.show(100);
I get this console output:
++
||
++
||
||
||
||
...
What's wrong with my schema definition?
Reputation: 35249
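StructType is immutable: add does not modify the object it is called on, it returns a new StructType that contains the extra field. Your code never uses those return values, so every addition is discarded. If you print the schema with
schema.printTreeString();
you'll see it doesn't contain any fields: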
root
Chain the calls instead, so that each add is applied to the result of the previous one:
StructType schema = new StructType()
    .add("_n", StringType, true)
    .add("_p", StringType, true)
    .add("_t", TimestampType, true)
    .add("account", StringType, true)
    .add("channel", StringType, true)
    .add("device_type", StringType, true);
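To see why the original version ends up with an empty schema, here is a minimal stand-alone sketch that needs no Spark dependency. The Schema class is a hypothetical stand-in written for this illustration, not Spark's StructType; it only assumes the same contract, namely that add returns a new instance and leaves the receiver unchanged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ImmutableSchemaDemo {

    // Hypothetical stand-in for an immutable schema type: add() never
    // mutates the receiver, it returns a new instance with one more field.
    public static final class Schema {
        private final List<String> fields;

        public Schema() {
            this(Collections.emptyList());
        }

        private Schema(List<String> fields) {
            this.fields = Collections.unmodifiableList(fields);
        }

        public Schema add(String name) {
            List<String> copy = new ArrayList<>(fields);
            copy.add(name);
            return new Schema(copy);
        }

        public List<String> fieldNames() {
            return fields;
        }
    }

    // Mirrors the code in the question: every result of add() is thrown away,
    // so the variable still refers to the original empty schema.
    public static Schema discarded() {
        Schema schema = new Schema();
        schema.add("_n");
        schema.add("_p");
        schema.add("_t");
        return schema;
    }

    // Mirrors the fix: each add() is called on the result of the previous one.
    public static Schema chained() {
        return new Schema()
                .add("_n")
                .add("_p")
                .add("_t");
    }

    public static void main(String[] args) {
        System.out.println(discarded().fieldNames()); // prints []
        System.out.println(chained().fieldNames());   // prints [_n, _p, _t]
    }
}
```

If you prefer to keep one statement per field, reassigning the variable each time (schema = schema.add(...);) should have the same effect as chaining, because either way the returned object is the one you keep.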