AbtPst

Reputation: 8018

Best way to stream PDF Files

What would be a good way to stream PDF files through a messaging queue?

Would it be a good idea to do this in Kafka?

Here is what I have in mind:

  1. Pick up the PDF files from a file drop location.
  2. Stream the files through Kafka.
  3. Parse the files for some low-level information retrieval and cleanup. This will probably be done in a Storm topology or in Spark, or maybe in some custom MapReduce code.
  4. Finally, I want to run some machine learning algorithms on these documents.

Note that the steps mentioned above are just possibilities. If you have a better implementation, please suggest it.

Upvotes: 1

Views: 3891

Answers (1)

Chris Gerken

Reputation: 16390

I'd break that into three problems:

  1. Ingestion
  2. Parsing
  3. Analytics

That way you can do ingestion once but iterate on the parsing and analytics as your understanding of both the data and the problem evolves.

For ingestion, I'd push the actual file to a widely accessible location, such as HDFS or an HTTP server, and then send a short message via Kafka saying that a file at a given location has just been added and is ready for parsing. Once the file has been parsed, store that information in a database so that you can iterate again over the entire set of ingested files if your parsing algorithm changes.
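To make the ingestion idea concrete, here is a minimal Python sketch of one way it could look. Everything named here is an assumption for illustration: a local directory stands in for HDFS or an HTTP server, `send` is any callable that publishes bytes (for example a thin wrapper around a Kafka producer), the message fields are made up, and SQLite stands in for whatever database you use. It also takes one reading of "store that information": record which parser version handled each file, so a changed parser can find everything it needs to redo.

```python
import hashlib
import json
import shutil
import sqlite3
from pathlib import Path


def ingest(pdf_path, store_dir, send):
    """Copy the PDF to shared storage, then publish a short pointer message."""
    pdf_path = Path(pdf_path)
    store_dir = Path(store_dir)  # stand-in for HDFS or an HTTP server
    store_dir.mkdir(parents=True, exist_ok=True)

    # Name the stored copy by its content hash so re-ingesting is harmless.
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    target = store_dir / f"{digest}.pdf"
    shutil.copyfile(pdf_path, target)

    # The message carries only the location, never the PDF bytes.
    message = {"location": str(target), "sha256": digest}
    send(json.dumps(message).encode("utf-8"))
    return message


def _ensure_table(db):
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "location TEXT PRIMARY KEY, sha256 TEXT, parser_version TEXT)"
    )


def mark_parsed(db, message, parser_version):
    """Record that a file was parsed, and by which parser version."""
    _ensure_table(db)
    db.execute(
        "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
        (message["location"], message["sha256"], parser_version),
    )
    db.commit()


def needs_reparse(db, current_version):
    """List stored files last parsed by a different parser version."""
    _ensure_table(db)
    rows = db.execute(
        "SELECT location FROM files WHERE parser_version != ? ORDER BY location",
        (current_version,),
    )
    return [row[0] for row in rows]
```

With the kafka-python package, `send` could be something like `lambda payload: producer.send("pdf-ingested", payload)`, where `producer` is a `KafkaProducer` and the topic name is hypothetical. The point of the split is that Kafka only ever carries small messages, while the parser fetches the file itself from shared storage.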

Upvotes: 3
