Alex Raj Kaliamoorthy

Reputation: 2095

Reading url via pyspark in Databricks notebook

I am unable to read the contents of a URL with PySpark in a Databricks notebook (version 8.3, Spark 3.1.1). I have tried several approaches but cannot pin down the exact problem. Here is my code:

from pyspark import SparkFiles
url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
spark.sparkContext.addFile(url)
df1 = spark.read.text("file://"+SparkFiles.get('8028d38a.tps'))
df1.show()

Here is the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43) (10.139.64.4 executor 0): com.databricks.sql.io.FileReadException: Error while reading file file:/local_disk0/spark-95887d0f-a955-4075-86ac-520a51f0c64e/userFiles-9204e03a-a0fd-4999-9f40-9d9c3cc599a6/8028d38a.tps. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.

I used "reading data from URL using spark databricks platform" as a reference. Has anyone faced a similar problem?

Upvotes: 0

Views: 1121

Answers (2)

As a workaround, you can read the URL into a pandas DataFrame and convert it to a PySpark DataFrame for further processing.

url = 'https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps'
import pandas as pd
df = spark.createDataFrame(pd.read_csv(url))
display(df)


If the first row is invalid, you can skip it when reading the file.
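A minimal sketch of skipping the first row, assuming the file parses as comma-separated text. It relies on the `skiprows` parameter of `pd.read_csv`; the helper name `load_skipping_first_row` and the in-memory sample data are made up for illustration and are not from the original answer.

```python
import io

import pandas as pd


def load_skipping_first_row(source):
    """Read CSV-like text, ignoring the first line of the input.

    `source` may be a URL, a local path, or a file-like object.
    """
    # skiprows=1 drops the first line, so the second line becomes the header.
    return pd.read_csv(source, skiprows=1)


# In-memory stand-in for the remote file (made-up data, not the real .tps content).
sample = io.StringIO("invalid first line\na,b\n1,2\n3,4\n")
pdf = load_skipping_first_row(sample)
print(pdf)

# In a Databricks notebook, where `spark` and `url` are already defined:
# df = spark.createDataFrame(load_skipping_first_row(url))
```

Whether one skipped row is enough depends on the actual layout of the file, so it is worth inspecting the first few lines before choosing the value.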

Upvotes: 0

ATee

Reputation: 1

This is the best approach I have found. It comes from the "pyspark for everyone" playlist on YouTube:

!curl "https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps" -o 8028d38a.tps
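If you would rather stay in Python than shell out, the same download can be sketched with the standard library. This is only an illustration: the helper name `download` is hypothetical, and the demonstration uses a local temporary file as a stand-in for the remote URL so that it runs without network access. The sketch overwrites the destination on every run, so re-running a notebook cell does not duplicate the file's contents.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(url, dest):
    """Fetch `url` and write it to `dest`, overwriting any existing file."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Offline demonstration: a local file stands in for the remote URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.tps"
source.write_text("line 1\nline 2\n")

copy = download(source.as_uri(), workdir / "8028d38a.tps")
download(source.as_uri(), copy)  # a second run overwrites instead of appending
print(copy.read_text())

# With network access, the real file could be fetched the same way:
# download("https://pds-atmospheres.nmsu.edu/PDS/data/mors_1101/tps/1998_028/8028d38a.tps", "8028d38a.tps")
```

This only places the file on the machine that runs the cell. How Spark can then read it depends on where that path sits relative to the cluster's storage, which the sketch does not address.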

Upvotes: 0
