Reputation: 61
How can I create a PySpark DataFrame from 2 JSON files: one holding the data (file1) and one holding the schema (file2)?
file1
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}
file2
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},"name":"DESCR","nullable":true,"type":"string"},{"metadata":{},"name":"DESCRSHORT","nullable":true,"type":"string"}],"type":"struct"}]
Upvotes: 2
Views: 1644
Reputation: 32720
You have to first read the schema file using Python's json.load, then convert it to a StructType using StructType.fromJson.
import json
from pyspark.sql.types import StructType
with open("/path/to/file2.json") as f:
    json_schema = json.load(f)

# file2 contains a one-element list, hence the [0]
schema = StructType.fromJson(json_schema[0])
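As a quick sanity check (a sketch assuming the schema variable built above), you can print the parsed schema before using it:
print(schema.simpleString())
#struct<RESIDENCY:string,EFFDT:string,EFF_STATUS:string,DESCR:string,DESCRSHORT:string>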
Now just pass that schema to the DataFrameReader:
df = spark.read.schema(schema).json("/path/to/file1.json")
df.show()
#+---------+----------+----------+-------------------+----------+
#|RESIDENCY| EFFDT|EFF_STATUS| DESCR|DESCRSHORT|
#+---------+----------+----------+-------------------+----------+
#| AUS|01-01-1900| A|Australian Resident|Australian|
#+---------+----------+----------+-------------------+----------+
EDIT:
If the file containing the schema is located in GCS, you can use the Spark or Hadoop APIs to get the file content. Here is an example using Spark:
# Read the schema file line by line, then join the lines back into a
# single JSON string
file_content = spark.read.text("/path/to/file2.json").rdd.map(
    lambda r: " ".join([str(elt) for elt in r])
).reduce(
    lambda x, y: "\n".join([x, y])
)
json_schema = json.loads(file_content)
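A simpler alternative sketch (assuming the schema file is small enough to read as a single record) is SparkContext.wholeTextFiles, which returns (path, content) pairs, so no line-joining is needed:
# Each record is (file path, full file content); take the content of the first file
file_content = spark.sparkContext.wholeTextFiles("/path/to/file2.json").values().first()
json_schema = json.loads(file_content)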
Upvotes: 4
Reputation: 61
I found the GCSFS package for accessing files in GCS buckets:
pip install gcsfs
import json
import gcsfs

fs = gcsfs.GCSFileSystem(project='your GCP project name')
with fs.open('path/toread/sample.json', 'rb') as f:
    json_schema = json.load(f)
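From here the flow is the same as in the answer above (the bucket path below is a placeholder):
from pyspark.sql.types import StructType

schema = StructType.fromJson(json_schema[0])
df = spark.read.schema(schema).json('gs://your-bucket/path/to/file1.json')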
Upvotes: 0