Reputation: 841
I am trying to load the following data.json file into a Spark DataFrame:
{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}
{"positionmessage":{"callsign": "PPH2", "name": 0.0, "mmsi": 200}}
{"positionmessage":{"callsign": "PPH3", "name": 0.0, "mmsi": 300}}
using the following code:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType
appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"
# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()
# Create a schema for the dataframe
schema = StructType([
    StructField('callsign', StringType(), True),
    StructField('name', StringType(), True),
    StructField('mmsi', IntegerType(), True)
])
# Create data frame
json_file_path = "data.json"
df = spark.read.json(json_file_path, schema, multiLine=True)
print(df.schema)
print(df.head(3))
It prints: [Row(callsign=None, name=None, mmsi=None)]. What am I doing wrong? I have set my environment variables in the system settings.
Upvotes: 0
Views: 167
Reputation: 31460
Your JSON records have a top-level positionmessage struct field, and that field is missing from your schema.
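The nesting is easy to confirm with plain Python before touching Spark. The sketch below is only illustrative: the `lines` list is a hand-copied stand-in for data.json, and the variable names are mine, not part of the question. It shows that the only top-level key is positionmessage, so a flat schema that asks for callsign, name and mmsi at the top level finds nothing to match and fills every column with null.

```python
import json

# Stand-in for data.json: one JSON object per line, copied from the question.
lines = [
    '{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}',
    '{"positionmessage":{"callsign": "PPH2", "name": 0.0, "mmsi": 200}}',
    '{"positionmessage":{"callsign": "PPH3", "name": 0.0, "mmsi": 300}}',
]
records = [json.loads(line) for line in lines]

# Keys visible at the top level of every record.
top_level_keys = sorted({key for record in records for key in record})
print(top_level_keys)

# The fields the flat schema asked for sit one level down.
nested_keys = sorted(records[0]["positionmessage"])
print(nested_keys)
```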
Change the schema to include the struct field, as shown below:
schema = StructType([
    StructField("positionmessage", StructType([
        StructField('callsign', StringType(), True),
        StructField('name', StringType(), True),
        StructField('mmsi', IntegerType(), True)
    ]))
])
spark.read.schema(schema).json("<path>").\
    select("positionmessage.*").\
    show()
#+--------+----+----+
#|callsign|name|mmsi|
#+--------+----+----+
#|    PPH1| 0.0| 100|
#|    PPH2| 0.0| 200|
#|    PPH3| 0.0| 300|
#+--------+----+----+
Upvotes: 1