Reputation: 5575
I am using Azure Search to automatically index Word documents that are uploaded to Blob storage. The only reason I am using Search is to extract the text from the Word or PDF document (it's free and works well). From that point on, I read the text from the index and remove it.
The problem I have is that the Search indexer can only run every 5 minutes, but I need it to run ASAP after a blob upload. So I either need to run it on demand (triggered every time a new blob is added) or figure out how to insert the Word/PDF document into the index myself (or how to extract the text from it).
The flow is therefore:
So my questions are:
A. Is there a better way of extracting text natively from a Word/PDF document using Azure? (In which case question B is void.)
B. How can I use the .NET SDK to invoke the indexer to run? I could not find a Run method here, although several places mention that you can run it on demand with the SDK.
Upvotes: 0
Views: 589
Reputation: 4671
If you only need Azure Search for document cracking, and don't need the rest of the search and enrichment functionalities, it may be simpler to do document cracking directly in an Azure Function. There are many OSS and commercial libraries for document parsing, e.g. Apache Tika.
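To give a feel for how little is needed when only the text matters, here is a minimal sketch of cracking a .docx file with nothing but the Python standard library. It relies on the fact that a .docx is a ZIP archive whose main body lives in `word/document.xml`, with the text held in `w:t` elements. The function names (`extract_docx_text`, `make_sample_docx`) are made up for this illustration, and the sketch does not cover PDF, legacy .doc, headers, footers, or text boxes; a library such as Tika is the better fit for those.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs) -> bytes:
    """Build a minimal in-memory .docx (hypothetical helper, demo only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx(["Hello", "World"])
    print(extract_docx_text(sample))
```

In a blob-triggered function you would pass the uploaded blob's bytes to `extract_docx_text` instead of the generated sample.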
One of our team members has written an example of using Tika from an Azure Function.
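If you do want to keep the indexer and trigger it on demand (question B), the service exposes a Run Indexer operation over REST: a `POST` to `/indexers/{name}/run` with an admin `api-key` header. The sketch below only shows the shape of that call using the Python standard library. The service name, indexer name, key, and the `2020-06-30` API version are placeholders and assumptions, not values from the question, and the helper names are invented for this example. The .NET SDK wraps the same operation (for example `SearchIndexerClient.RunIndexer` in Azure.Search.Documents), though the exact method name depends on the SDK version you use.

```python
import urllib.request


def build_run_indexer_request(service_name, indexer_name, admin_key,
                              api_version="2020-06-30"):
    """Build (but do not send) the Run Indexer REST request."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexers/{indexer_name}/run?api-version={api_version}"
    )
    # The operation takes no body; an admin key is sent in the api-key header.
    return urllib.request.Request(
        url, data=b"", method="POST", headers={"api-key": admin_key}
    )


def run_indexer(service_name, indexer_name, admin_key):
    """Send the request and return the HTTP status code."""
    request = build_run_indexer_request(service_name, indexer_name, admin_key)
    with urllib.request.urlopen(request) as response:
        return response.status
```

The service answers with 202 Accepted when the run has been queued; the run itself is asynchronous, so the extracted text is not in the index the moment the call returns.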
Upvotes: 1