Reputation: 23

How to scrape information from PDFs?

I am using Mozenda (Mozenda.com) to scrape an online database, but some of the data is in PDF files. Mozenda does not appear to support scraping these files, so I am looking for another solution.

I have two questions:

1. What is the appropriate XPath syntax to select the URL from a link? It is not clear how to do this in Mozenda, and I need the PDF URLs to implement a third-party solution.
2. What is a good tool to convert large numbers of PDFs online into HTML or, better yet, scrape them directly?

Any helpful suggestions are most certainly appreciated. I am happy to clarify...just ask.

Upvotes: 1

Answers (2)

TravisChambers

Reputation: 626

I recognize this is a LATE answer, but Mozenda added the ability to convert PDFs to HTML and scrape from them. It's pretty easy.

https://www.mozenda.com/faqs

Upvotes: 1

Chirag Parmar

Reputation: 11

You can create the XPath in Mozenda itself: create any action, choose Refine Action, and put `.` in the XPath field. You can then take whatever data you want from CaptureDefination.
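On the XPath syntax itself: in standard XPath 1.0, `//a/@href` selects the `href` attribute of every link, and `//a[contains(@href, '.pdf')]/@href` narrows that to links whose URL contains `.pdf`. Whether Mozenda's XPath field accepts attribute selection like this is not something confirmed here. If you end up collecting the PDF URLs outside Mozenda, the idea can be sketched in Python with only the standard library. This is a rough illustration, not a Mozenda feature: the sample markup, the base URL, and the `pdf_links` helper are all made up, and `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so the attribute is read with `.get("href")` rather than `/@href`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical page fragment (an assumption for illustration only).
# ElementTree requires well-formed markup; real-world HTML may need
# a more forgiving parser.
SAMPLE_HTML = """
<div>
  <a href="/docs/report-2011.pdf">Annual report</a>
  <a href="/about.html">About</a>
  <a href="https://example.com/files/Summary.PDF">Summary</a>
  <a name="top">Anchor without href</a>
</div>
"""


def pdf_links(markup, base_url):
    """Return absolute URLs of all links whose href ends in .pdf."""
    root = ET.fromstring(markup)
    urls = []
    # ".//a[@href]" matches every <a> element that has an href attribute.
    for anchor in root.findall(".//a[@href]"):
        href = anchor.get("href")
        # Case-insensitive check so ".PDF" is also picked up.
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the page they came from.
            urls.append(urljoin(base_url, href))
    return urls


if __name__ == "__main__":
    for url in pdf_links(SAMPLE_HTML, "https://example.com/db/page1"):
        print(url)
```

The resulting list of absolute URLs is what you would hand to whichever third-party PDF converter you choose.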

Upvotes: 0
