Reputation: 2439

Filtering trash from a JSON file before reading it into PySpark DataFrame

I have the following file, which was supposed to be a JSON file, but it has a string slapped right before the actual JSON content (they are separated by a tab!):

string_smth\t{id:"str", num:0}
string_smth1\t{id:"str2", num:1}
string_smth2\t{id:"str3", num:2}
string_smth3\t{id:"str4", num:3}

Doing the following returns null for all columns:

import pyspark.sql
from pyspark.sql.types import * 
schema = StructType([
       StructField("id", StringType()),
       StructField("num", IntegerType())
])

df = spark.read.json("hdfs:///path/files.json/*", schema=schema)

df.show()

+--+---+
|id|num|
+--+---+
|null|null|
|null|null|
|null|null|
|null|null|

Any way of fixing that during the spark.read.json call? If not, what are my options?

Upvotes: 1

Answers (3)

Ramesh Maharjan

Reputation: 41987

json, struct and case class don't need schema to be created.

You can use sparkContext's textFile api to read the text file and parse the lines to get the valid json strings

rdd = sc.textFile("path to the csv file")\
    .map(lambda line: line.split("\t", 1)[1].replace("id:", "\"id\":").replace("num:", "\"num\":"))

Then finally convert the valid json rdds to dataframe

df = sqlContext.read.json(rdd)

which should give you

+----+---+
|id  |num|
+----+---+
|str |0  |
|str2|1  |
|str3|2  |
|str4|3  |
+----+---+

Upvotes: 1

Steven

Reputation: 15283

I can see several issues in your file, but maybe it is just a problem related to your example.

I created a rdd :

a = sc.parallelize(['string_smth\t{"id":"str","num":0}',
'string_smth1\t{"id":"str2","num":1}',
'string_smth2\t{"id":"str3","num":2}',
'string_smth3\t{"id":"str4","num":3}'])

In your case, replace this sc.parallelize with sc.textFile(path_to_file) to acquire the file you need. As you can see, the id is enclosed with double quotes. That is how a json is supposed to be in a string format. And also, technically, you do not have space after the comma. How is your original file exactly ?

Then, just do this :

import json
schema = StructType([
       StructField("id", StringType()),
       StructField("num", IntegerType())
])
a.map(lambda x : json.loads(x.split('\t')[1])).toDF(schema).show()

+----+---+
|  id|num|
+----+---+
| str|  0|
|str2|  1|
|str3|  2|
|str4|  3|
+----+---+

Upvotes: 1

Chuk Ultima

Reputation: 1037

A potential solution would be to split on the '{' character for each line :

json_lin = '{' + 'string_smth {id:"str", num:0}'.split('{')[-1]

Upvotes: 0

Filtering trash from a JSON file before reading it into PySpark DataFrame

Answers (3)

Related Questions