Reputation: 2439
I have the following file, which was supposed to be a JSON file, but it has a string slapped right before the actual JSON content (they are separated by a tab!):
string_smth\t{id:"str", num:0}
string_smth1\t{id:"str2", num:1}
string_smth2\t{id:"str3", num:2}
string_smth3\t{id:"str4", num:3}
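To double-check, even plain Python's json.loads rejects these lines, both because of the leading string and because the keys are unquoted (a quick sketch using the first line of the file):

```python
import json

line = 'string_smth\t{id:"str", num:0}'      # first line of the file

try:
    json.loads(line)                          # the prefix makes it invalid
except json.JSONDecodeError as e:
    print(e.msg)                              # Expecting value

try:
    json.loads(line.split('\t', 1)[1])        # still invalid: unquoted keys
except json.JSONDecodeError as e:
    print(e.msg)                              # Expecting property name enclosed in double quotes
```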
Doing the following returns null for all columns:
import pyspark.sql
from pyspark.sql.types import *
schema = StructType([
    StructField("id", StringType()),
    StructField("num", IntegerType())
])
df = spark.read.json("hdfs:///path/files.json/*", schema=schema)
df.show()
+----+----+
|  id| num|
+----+----+
|null|null|
|null|null|
|null|null|
|null|null|
+----+----+
Any way of fixing that during the spark.read.json call? If not, what are my options?
Upvotes: 1
Views: 1761
Reputation: 41987
Reading JSON this way does not require an explicit schema; Spark can infer it from the parsed strings.
You can use sparkContext's textFile api to read the file as plain text and parse each line to get valid json strings:
rdd = sc.textFile("path to the file") \
    .map(lambda line: line.split("\t", 1)[1]
         .replace("id:", "\"id\":")
         .replace("num:", "\"num\":"))
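The per-line transformation inside that map can be checked in plain Python, without a cluster (to_json_string is a hypothetical helper name of my own, wrapping the same split/replace logic):

```python
import json

def to_json_string(line):
    # drop the tab-separated prefix, then quote the two known keys
    payload = line.split("\t", 1)[1]
    return payload.replace("id:", "\"id\":").replace("num:", "\"num\":")

fixed = to_json_string('string_smth\t{id:"str", num:0}')
print(fixed)                  # {"id":"str", "num":0}
print(json.loads(fixed))      # {'id': 'str', 'num': 0}
```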
Then convert the RDD of valid json strings to a dataframe:
df = sqlContext.read.json(rdd)
which should give you
+----+---+
|id |num|
+----+---+
|str |0 |
|str2|1 |
|str3|2 |
|str4|3 |
+----+---+
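A slightly more general sketch (my own variation, not part of the answer above): instead of hard-coding the id: and num: replacements, quote every bare identifier key with a regex. This assumes the keys are plain identifiers and never appear unquoted inside string values:

```python
import json
import re

def quote_keys(payload):
    # insert double quotes around any bare key following '{' or ','
    return re.sub(r'([{,]\s*)(\w+)\s*:', r'\1"\2":', payload)

line = 'string_smth\t{id:"str", num:0}'
fixed = quote_keys(line.split('\t', 1)[1])
print(fixed)                  # {"id":"str", "num":0}
print(json.loads(fixed))      # {'id': 'str', 'num': 0}
```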
Upvotes: 1
Reputation: 15283
I can see several issues in your file, but maybe they are just artifacts of your example. I created an RDD:
a = sc.parallelize(['string_smth\t{"id":"str","num":0}',
                    'string_smth1\t{"id":"str2","num":1}',
                    'string_smth2\t{"id":"str3","num":2}',
                    'string_smth3\t{"id":"str4","num":3}'])
In your case, replace this sc.parallelize with sc.textFile(path_to_file) to read the file you need.
As you can see, the keys are enclosed in double quotes; that is how JSON is supposed to look in string form. Also, technically, there is no space after the comma. What does your original file look like exactly?
Then, just do this :
import json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", StringType()),
    StructField("num", IntegerType())
])
a.map(lambda x : json.loads(x.split('\t')[1])).toDF(schema).show()
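The parsing step inside that map can be checked line-by-line in plain Python, using one element of the RDD above:

```python
import json

# one element of the rdd above (note the double-quoted keys)
x = 'string_smth\t{"id":"str","num":0}'

# the same transformation the lambda applies
row = json.loads(x.split('\t')[1])
print(row)    # {'id': 'str', 'num': 0}
```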
+----+---+
| id|num|
+----+---+
| str| 0|
|str2| 1|
|str3| 2|
|str4| 3|
+----+---+
Upvotes: 1
Reputation: 1037
A potential solution would be to split on the '{' character for each line :
json_lin = '{' + 'string_smth\t{id:"str", num:0}'.split('{')[-1]
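Applied to one of the lines from the question, this gives the payload back, but note the keys are still unquoted, so a strict JSON parser would still reject it without further fixing:

```python
line = 'string_smth\t{id:"str", num:0}'

# keep everything from the last '{' onward, re-attaching the brace
json_lin = '{' + line.split('{')[-1]
print(json_lin)    # {id:"str", num:0}
```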
Upvotes: 0