Kalyan

Reputation: 1940

How to save JSON data fetched from URL in PySpark?

I have fetched some JSON data from an API:

import urllib2
test = urllib2.urlopen('url')
print test

This returns the response object with the JSON data fetched from the URL.

How can I save it as a table or a DataFrame? I am using Spark 2.0.

Upvotes: 2

Views: 6794

Answers (4)

Bhgyalaxmi Patel

Reputation: 1

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Project").getOrCreate()

# click on Raw and then copy the URL
zip_url = "https://raw.githubusercontent.com/spark-examples/spark-scala-examples/master/src/main/resources/zipcodes.json"

spark.sparkContext.addFile(zip_url)
zip_df = spark.read.json("file://" + SparkFiles.get("zipcodes.json"))

Upvotes: -1

ZygD

Reputation: 24478

This is how I succeeded in importing JSON data from the web into a DataFrame:

from pyspark.sql import SparkSession
from urllib.request import urlopen

spark = SparkSession.builder.getOrCreate()

url = 'https://web.url'
jsonData = urlopen(url).read().decode('utf-8')
rdd = spark.sparkContext.parallelize([jsonData])
df = spark.read.json(rdd)

Upvotes: 3

Yaron

Reputation: 10450

Adding to Rakesh Kumar's answer, the way to do it in Spark 2.0 is described here:

http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#data-sources

As an example, the following creates a DataFrame based on the content of a JSON file:

# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()

Note that the file that is offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see the JSON Lines text format, also called newline-delimited JSON. As a consequence, a regular multi-line JSON file will most often fail.
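The JSON Lines point can be illustrated in plain Python, with no Spark needed: Spark parses the file line by line, so each line must be a complete JSON object on its own, whereas a pretty-printed multi-line document is only valid as a whole.

```python
import json

# JSON Lines: one self-contained object per line -- what spark.read.json expects
json_lines = '{"name": "Alice"}\n{"name": "Bob"}'
records = [json.loads(line) for line in json_lines.splitlines()]

# A pretty-printed, multi-line JSON document: valid JSON as a whole,
# but its individual lines (e.g. just '{') are not parseable on their own
multi_line = '{\n  "name": "Alice"\n}'
try:
    json.loads(multi_line.splitlines()[0])  # mimics line-at-a-time parsing
    line_ok = True
except json.JSONDecodeError:
    line_ok = False
```

Later Spark releases (2.2+) added `spark.read.option("multiLine", True).json(path)` to read regular multi-line JSON files directly.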

Upvotes: 0

Rakesh Kumar

Reputation: 4420

For this you can do some research and try using sqlContext. Here is sample code:

>>> # test is a file-like response, so wrap its contents in an RDD first
>>> rdd = sc.parallelize([test.read()])
>>> df2 = sqlContext.jsonRDD(rdd)
>>> df2.first()

Moreover, visit the link and check for more details: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html

Upvotes: 0
