Reputation: 89
I have a JSON file with the following structure:
{
  "name": {
    "0": "name1",
    "1": "name2",
    "2": "name3"
  },
  "id": {
    "0": "00001",
    "1": "00002",
    "2": "00013"
  }
}
When I read this JSON file into a Spark DataFrame (with Python), I get a DataFrame with a dictionary in each column:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("id", StringType(), True)
])
spark_df = spark.read.schema(schema).json('path_to_json_file', multiLine=True)
spark_df.show()
+-------------------------------------+-------------------------------------+
| name | id |
+-------------------------------------+-------------------------------------+
|{"0":"name1","1":"name2","2":"name3"}|{"0":"00001","1":"00002","2":"00013"}|
+-------------------------------------+-------------------------------------+
How do I explode each column so that I keep only the values:
+-----+-----+
| name|   id|
+-----+-----+
|name1|00001|
|name2|00002|
|name3|00013|
+-----+-----+
I tried to use the explode function but received an error:
from pyspark.sql import functions as f
spark_df.select('*', f.explode('id').alias('id')).show()
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "cannot resolve 'explode(`id`)' due to data type mismatch: input to function explode should be array or map type, not string;;\n'Project [name#860, id#861, explode(id#860) AS id#947]\n+- Relation[name#860,id#861] json\n"
I also tried the from_json function, but for that I have to define an inner schema, which I cannot do because the number of values is unknown. I tried this schema but received only nulls:
schema = StructType([StructField('key1', StringType(), True)])
Basically, all I know is the top-level key names (which should become the field names); the number of records I will get is unknown.
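For reference, the from_json attempt looked roughly like this (a reconstructed sketch; the exact call isn't shown above):
# Hedged reconstruction of the failing attempt: none of the real keys
# ("0", "1", "2") match 'key1', so every parsed field is null.
inner_schema = StructType([StructField('key1', StringType(), True)])
spark_df.select(f.from_json('id', inner_schema).alias('parsed')).show(truncate=False)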
Upvotes: 1
Views: 897
Reputation: 89
Thank you @Steven and @Alex Ott! This is what worked for me based on your suggestions:
from pyspark.sql import functions as F, types as T

schema = T.StructType(
    [
        T.StructField("name", T.MapType(T.StringType(), T.StringType()), True),
        T.StructField("id", T.MapType(T.StringType(), T.StringType()), True),
    ]
)
spark_df = spark.read.schema(schema).json("path_to_json_file", multiLine=True)

# Explode the name map once into (key, name) pairs, then look the id up by
# key -- exploding name and id separately would cross-join the two maps and
# produce 9 rows instead of 3.
spark_df.select(
    F.explode("name").alias("key", "name"), "id"
).select(
    "name", F.col("id").getItem(F.col("key")).alias("id")
).show()
+-----+-----+
| name| id|
+-----+-----+
|name1|00001|
|name2|00002|
|name3|00013|
+-----+-----+
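A positional variant is also possible (my own hedged addition, not from the answers): zip the two value arrays and explode once. This assumes map_values returns the values of both maps in matching order, which Spark does not guarantee in general, so the key-based lookup above is the safer choice.
# Hedged alternative: pair the values by position instead of by key.
vals = spark_df.select(
    F.map_values("name").alias("names"), F.map_values("id").alias("ids")
)
vals.select(F.explode(F.arrays_zip("names", "ids")).alias("z")).select(
    F.col("z.names").alias("name"), F.col("z.ids").alias("id")
).show()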
Upvotes: 0
Reputation: 15258
First of all, your input schema is wrong. Change it to use a MapType:
from pyspark.sql import functions as F, types as T

schm = T.StructType(
    [
        T.StructField("name", T.MapType(T.StringType(), T.StringType()), True),
        T.StructField("id", T.MapType(T.StringType(), T.StringType()), True),
    ]
)
df = spark.read.schema(schm).json("path_to_json_file", multiLine=True)
df.printSchema()
root
 |-- name: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- id: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
Then, assuming both name and id have the same set of keys:
df.withColumn("key", F.explode(F.map_keys("json.name"))).select(
F.col("json.name").getItem(F.col("key")).alias("name"),
F.col("json.id").getItem(F.col("key")).alias("id"),
).show()
+-----+-----+
| name| id|
+-----+-----+
|name1|00001|
|name2|00002|
|name3|00013|
+-----+-----+
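If you want to verify the same-keys assumption first, one possible check (my addition, purely illustrative) compares the sorted key sets of both maps:
# Returns true only when both maps contain exactly the same keys.
df.select(
    (F.sort_array(F.map_keys("name")) == F.sort_array(F.map_keys("id"))).alias("same_keys")
).show()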
Upvotes: 1