user1043070

Reputation: 23

How to scrape information from PDFs?

I am using Mozenda (Mozenda.com) to scrape an online database, but some of the data is in PDF files. Mozenda does not appear to support scraping these files, so I am looking for another solution.

I have two questions:

  1. What is the appropriate XPath syntax to select the URL from a link? It is not clear how to do this with Mozenda, and I need the PDF URLs to implement a third-party solution.

  2. What is a good tool to convert large numbers of online PDFs into HTML or, better yet, to scrape them directly?

Any suggestions are appreciated. I am happy to clarify; just ask.

Upvotes: 1

Views: 641

Answers (2)

TravisChambers

Reputation: 626

I recognize this is a late answer, but Mozenda has since added the ability to convert PDFs to HTML and scrape them. It's pretty easy.

https://www.mozenda.com/faqs

Upvotes: 1

Chirag Parmar

Reputation: 11

You can create the XPath in Mozenda itself. Create any action > Refine action > put "." in the XPath field, then take whatever data you want from the CaptureDefinition.
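For the first question, in standard XPath 1.0 the expression //a/@href selects the href attribute of every link. Whether Mozenda's XPath field accepts an attribute step like that is not confirmed here. Outside Mozenda, the same idea can be sketched in Python. This is only a rough illustration: the sample markup, the base URL, and the pdf_links helper are all made up, and because the standard library's ElementTree supports only a subset of XPath (it cannot return attributes directly), the sketch selects the a elements that have an href and then reads the attribute.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample markup standing in for a scraped page.
SAMPLE_PAGE = """
<html><body>
  <a href="/docs/report-2012.pdf">Annual report</a>
  <a href="/about.html">About</a>
  <a href="files/Data.PDF">Data sheet</a>
  <a>Anchor without a link</a>
</body></html>
"""


def pdf_links(markup, base_url):
    """Return absolute URLs of all links whose href ends in .pdf."""
    root = ET.fromstring(markup)
    urls = []
    # ".//a[@href]" matches every <a> element that has an href attribute.
    for anchor in root.findall(".//a[@href]"):
        href = anchor.get("href")
        # Case-insensitive check so ".PDF" is picked up as well.
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the page's base URL.
            urls.append(urljoin(base_url, href))
    return urls


if __name__ == "__main__":
    # The base URL is an assumption for illustration only.
    for url in pdf_links(SAMPLE_PAGE, "http://example.com/db/"):
        print(url)
```

Real pages are often not well-formed XML, so a forgiving HTML parser would probably be needed in practice. The resulting list of URLs is what you would hand to whichever PDF conversion tool you settle on.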

Upvotes: 0
