sizo_abe

Reputation: 501

Load only the first few XML files (e.g. 10) from a directory containing 100 files into a PySpark DataFrame

I want to load the first 10 XML files in each iteration from a directory containing 100 files, and then move each XML file that has already been read to another directory.

What I have tried so far in PySpark:

li = ["/mnt/dev/tmp/xml/100_file/M800143.xml","/mnt/dev/tmp/xml/100_file/M8001422.xml"]
df1 = spark.read.format("com.databricks.spark.xml").option("rowTag","Quality").load(li) 
df1.show()

But I am getting this error: IllegalArgumentException: 'path' must be specified for XML data.

Is there any way to read the files after storing the full paths of the XML files in a list? Or please suggest another approach.
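
For illustration, here is a rough sketch of the batching loop described above. It is not a confirmed fix. The helper names (iter_batches, process_in_batches, read_batch) and the "done" directory are hypothetical, and the sketch assumes the files sit on a path that plain Python file APIs can see; on a Databricks mount, dbutils.fs.ls and dbutils.fs.mv would likely be the equivalents. It also assumes the XML reader accepts a single comma-separated path string where it rejects a Python list, which is a commonly reported workaround but may depend on the spark-xml version.

```python
import os
import shutil
from typing import Callable, Iterator, List


def iter_batches(paths: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


def process_in_batches(
    src_dir: str,
    done_dir: str,
    read_batch: Callable[[str], None],
    batch_size: int = 10,
) -> int:
    """Read XML files batch by batch, moving each batch to done_dir afterwards.

    read_batch is a hypothetical callback that receives the batch as one
    comma-separated path string. Returns the number of batches processed.
    """
    os.makedirs(done_dir, exist_ok=True)

    # List once, sorted by name, so "first 10" is deterministic and the
    # moves below do not change what the loop iterates over.
    xml_paths = sorted(
        os.path.join(src_dir, name)
        for name in os.listdir(src_dir)
        if name.lower().endswith(".xml")
    )

    batches = 0
    for batch in iter_batches(xml_paths, batch_size):
        # Assumption: the reader takes "a.xml,b.xml,..." as a single path.
        read_batch(",".join(batch))

        # Move files only after the callback has finished with them.
        for path in batch:
            shutil.move(path, os.path.join(done_dir, os.path.basename(path)))
        batches += 1
    return batches


# Hypothetical Spark callback (needs a live SparkSession named `spark`):
#
# def read_batch(joined_paths):
#     df = (spark.read.format("com.databricks.spark.xml")
#           .option("rowTag", "Quality")
#           .load(joined_paths))
#     df.show()
#
# process_in_batches("/mnt/dev/tmp/xml/100_file", "/mnt/dev/tmp/xml/done", read_batch)
```

Because Spark evaluates lazily, the callback should fully consume each batch (for example, write it out) before returning; otherwise later actions on the DataFrame may fail once the source files have been moved.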

Upvotes: 1

Views: 142

Answers (0)
