Tariq

Reputation: 2344

Reading PDF/text/word file efficiently with Spark

I am doing NLP (Natural Language Processing) on my data. The data is in the form of files that can be PDF, text, Word, or HTML. These files are stored in a nested directory structure on local disk.

My standalone Java-based NLP parser can read the input files, extract text from them, and run NLP processing on the extracted text.

I am converting my Java-based NLP parser to run on my Spark cluster. I know that Spark can read multiple text files from a directory and convert them into RDDs for further processing. But my input data is not only in text files; it comes in a multitude of different file formats.

My question is: how can I efficiently read the input files (PDF/Text/Word/HTML) in my Java-based Spark program so that these files can be processed on the Spark cluster?

Upvotes: 2

Views: 2183

Answers (2)

Mykola Melnyk

Reputation: 71

For PDF documents there is now a custom open-source PDF data source for Apache Spark:

from pyspark.sql import SparkSession

# The spark-pdf package is pulled in via spark.jars.packages; the artifact
# must match your Spark and Scala versions (here Spark 3.5 / Scala 2.12).
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("SparkPdf") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf-spark35_2.12:0.1.11") \
    .getOrCreate()

df = (
    spark.read.format("pdf")
    .option("imageType", "BINARY")      # color model of the rendered page image
    .option("resolution", "200")        # rendering resolution in DPI
    .option("pagePerPartition", "2")    # number of pages per Spark partition
    .option("reader", "pdfBox")         # PDF reader backend
    .option("ocrConfig", "psm=11")      # Tesseract page-segmentation config
    .load("path to the pdf file(s)")
)

df.select("path", "document").show()

Upvotes: 1

pasha701

Reputation: 7207

Files can be read with

sparkContext.binaryFiles()

which returns (path, content) pairs, one per input file, that can then be processed by the parser.
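As a minimal sketch in Java, assuming the existing parser exposes a method like MyNlpParser.extractText(path, bytes) (a hypothetical name standing in for your own parser API, as is the input path):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class BinaryFilesExample {

    // Placeholder for the existing standalone parser (hypothetical API).
    static class MyNlpParser {
        static String extractText(String path, byte[] bytes) {
            // A real implementation would dispatch on the file extension in
            // `path` to pick the PDF/Text/Word/HTML handler.
            return "";
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("NlpBinaryFiles");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // binaryFiles returns (path, stream) pairs, one per input file.
        // Glob patterns such as base/*/* can reach nested subdirectories.
        JavaPairRDD<String, PortableDataStream> files =
                sc.binaryFiles("file:///data/docs/*/*");

        // Hand each file's raw bytes to the parser.
        JavaRDD<String> texts = files.map(pair ->
                MyNlpParser.extractText(pair._1(), pair._2().toArray()));

        texts.take(5).forEach(System.out::println);
        sc.stop();
    }
}

Unlike textFile(), which splits files on line boundaries, binaryFiles() keeps each file intact as a single record, which is what a PDF or Word parser needs.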

Upvotes: 2
