Andreas

Reputation: 240

Scraping structured information from hundreds of Word documents?

I've been tasked with extracting some structured information from hundreds of human-readable documents (mostly MS Word) and putting it into a database. The data is mostly embedded in tables throughout each document, but there's a lot of text between the tables, and although the documents are very similar in structure, there are a few differences. The documents change fairly often (we get an updated version every few months).

So far the only viable option I can think of is to manually go through all the documents and insert/update the information, but I thought I'd ask here whether anyone thinks it's possible to scrape the documents in some way?

Oh, and the data has to be fairly correct...

Upvotes: 3

Views: 3827

Answers (1)

CharlesB

Reputation: 90496

I did similar work (without tables though) using a converter from RTF to FO.

You have to convert the docs to RTF, and then to FO, which gives you a nice XML structure of the document. You can then easily parse it and scrape the data.
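As a rough illustration of the parsing step, here is a minimal sketch in Python. It assumes that "FO" means XSL-FO and that the converter emits tables as `fo:table` / `fo:table-row` / `fo:table-cell` elements in the standard XSL-FO namespace; the actual output of any given RTF-to-FO converter may differ. The `SAMPLE` document and the `extract_tables` helper are made up for this example.

```python
import xml.etree.ElementTree as ET

# Standard XSL-FO namespace; assumed to be what the converter emits.
FO = "http://www.w3.org/1999/XSL/Format"

# Hypothetical, heavily simplified converter output used only for illustration.
SAMPLE = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block>Some text between tables.</fo:block>
  <fo:table>
    <fo:table-body>
      <fo:table-row>
        <fo:table-cell><fo:block>Part</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>Quantity</fo:block></fo:table-cell>
      </fo:table-row>
      <fo:table-row>
        <fo:table-cell><fo:block>Bolt M8</fo:block></fo:table-cell>
        <fo:table-cell><fo:block>12</fo:block></fo:table-cell>
      </fo:table-row>
    </fo:table-body>
  </fo:table>
</fo:root>"""


def extract_tables(fo_xml):
    """Return every table as a list of rows, each row a list of cell strings.

    Nested tables are not handled specially: cells of an inner table would
    also show up in the rows of the outer one.
    """
    root = ET.fromstring(fo_xml)
    tables = []
    for table in root.iter(f"{{{FO}}}table"):
        rows = []
        for row in table.iter(f"{{{FO}}}table-row"):
            cells = []
            for cell in row.iter(f"{{{FO}}}table-cell"):
                # Join all text inside the cell and collapse whitespace.
                text = " ".join("".join(cell.itertext()).split())
                cells.append(text)
            rows.append(cells)
        tables.append(rows)
    return tables


if __name__ == "__main__":
    for table in extract_tables(SAMPLE):
        for row in table:
            print(row)
```

Since the documents differ slightly from one another, it is probably worth checking each extracted table against the headers you expect before writing anything to the database, and flagging mismatches for manual review.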

Upvotes: 2
