Reputation: 2344
I am doing NLP (Natural Language Processing) on my data. The data is in the form of files that can be of type PDF/Text/Word/HTML. These files are stored in a nested directory structure on local disk.
My standalone Java-based NLP parser can read the input files, extract text from them, and run the NLP processing on the extracted text.
I am converting my Java-based NLP parser to run on my Spark cluster. I know that Spark can read multiple text files from a directory and convert them into RDDs for further processing. However, my input data is not only in text files, but in a multitude of different file formats.
My question is: how can I efficiently read the input files (PDF/Text/Word/HTML) in my Java-based Spark program so that these files can be processed on the Spark cluster?
Upvotes: 2
Views: 2183
Reputation: 71
For PDF documents there is now a custom open-source PDF data source for Apache Spark:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11") \
    .getOrCreate()

df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "200") \
    .option("pagePerPartition", "2") \
    .option("reader", "pdfBox") \
    .option("ocrConfig", "psm=11") \
    .load("path to the pdf file(s)")

df.select("path", "document").show()
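Since the question is about a Java-based Spark program, a rough Java equivalent of the same read might look like the sketch below (the same spark-pdf package is assumed to be on the classpath, and the input path is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkPdfJava {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("SparkPdf")
                .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11")
                .getOrCreate();

        // Same options as the PySpark example above
        Dataset<Row> df = spark.read().format("pdf")
                .option("imageType", "BINARY")
                .option("resolution", "200")
                .option("pagePerPartition", "2")
                .option("reader", "pdfBox")
                .option("ocrConfig", "psm=11")
                .load("path to the pdf file(s)");

        df.select("path", "document").show();
    }
}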
Upvotes: 1
Reputation: 7207
Files can be read with
sparkContext.binaryFiles()
and then processed by your existing parser.
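A minimal Java sketch of that approach is shown below; the input/output paths and the extractText placeholder are assumptions standing in for your own NLP parser, not part of any library:

import java.nio.charset.StandardCharsets;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class BinaryFilesExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("NlpBinaryFiles");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read every file under the input directory as (path, byte stream) pairs.
        // The path and wildcard are placeholders; adjust to your nested layout.
        JavaPairRDD<String, PortableDataStream> files =
                sc.binaryFiles("hdfs:///input/docs/*");

        // Hand the raw bytes of each file to the parser.
        JavaRDD<String> texts = files.map(pair -> {
            byte[] content = pair._2().toArray();
            return extractText(pair._1(), content);
        });

        texts.saveAsTextFile("hdfs:///output/extracted-text");
        sc.stop();
    }

    // Placeholder for the existing standalone parser's PDF/Text/Word/HTML extraction.
    static String extractText(String path, byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}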
Upvotes: 2