saddam

Reputation: 829

Extract large file with Apache Tika

I'm using Apache Tika with Go to extract content from any type of file (.txt, .docx, .pdf, etc.) with the code below.

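file, err := os.Open("foo.docx")
if err != nil {
    log.Fatal(err) // stop here: a nil file cannot be parsed
}
defer file.Close()

client := tika.NewClient(nil, "http://localhost:9998/")
body, err := client.Parse(context.Background(), file)
if err != nil {
    log.Fatal(err)
}
fmt.Println(body)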

It extracts content well, but when the file is large it can fail with a memory error (memory out of bounds). What I want is to pass the file to the Apache Tika server in chunks, so that it extracts the content in chunks.

Upvotes: 0

Views: 897

Answers (1)

marek.kapowicki

Reputation: 732

  1. You can change the timeout with the request header X-Tika-OCRtimeout: xxx (600).
  2. A PDF document can be split into pages with PDFBox; see org.apache.pdfbox.multipdf.Splitter (Apache Tika also uses PDFBox under the hood). So instead of sending one big PDF file, you can split the document into pages and send each page to Tika.

Upvotes: 2
