Vakhtang

Reputation: 431

Apache Tika: parsing docx files via REST in Java

I'm using Apache Tika in server mode. I need to develop a Java REST client for parsing files. For PDF file upload I'm using this code:

fileBody = new FileBody(file, "application/pdf");
multiPartEntity.addPart("uploaded_file", fileBody);
pdfPutRequest.setEntity(multiPartEntity);
response = client.execute(pdfPutRequest);

This uses the apache.http library. Now I'm trying to develop the docx part, but I don't know which MIME type I need to provide (application/docx gives me an error). Without a MIME type I get an "Unsupported Media Type" exception from the Tika server. So which type do I need to provide, and do I need to make any other changes?

Solved!

Upvotes: 0

Views: 1184

Answers (2)

Vakhtang

Reputation: 431

I found the solution:

HttpPost docxPutRequest = new HttpPost(url);
docxPutRequest.setHeader("Accept", "text/plain");
MultipartEntity multiPartEntity = new MultipartEntity();
FileBody fileBody = new FileBody(file);
multiPartEntity.addPart("uploaded_file", fileBody);
docxPutRequest.setEntity(multiPartEntity);
response = client.execute(docxPutRequest);

Maybe this will help someone.

Upvotes: 1

Gagravarr

Reputation: 48346

The official mime type for .docx files is

application/vnd.openxmlformats-officedocument.wordprocessingml.document

If you use the Tika CLI tool in --detect mode, it can tell you that.

Alternately, the Tika Server has a detection mode available as documented in the Tika Server wiki.

Finally, Tika will auto-detect the MIME type for you if none is given. See the text extraction part of the Tika Server docs for info on giving or not giving a MIME type hint with your file.
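
As a rough illustration of sending the file with or without the type hint, here is a minimal sketch. It uses the JDK's built-in java.net.http client (Java 11+) rather than the apache.http library from the question, and sends the raw file bytes with PUT instead of a multipart form. The class name TikaDocxRequest, the buildRequest helper and the URL http://localhost:9998/tika are assumptions for the example; adjust the URL to wherever your Tika Server is actually listening.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaDocxRequest {
    // Official MIME type for .docx files.
    static final String DOCX_MIME =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    // Hypothetical helper: builds a PUT request carrying the raw document bytes.
    // When sendMimeHint is false, no Content-Type header is set, which leaves
    // type detection to the server.
    static HttpRequest buildRequest(String tikaUrl, byte[] document, boolean sendMimeHint) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(tikaUrl))
            .header("Accept", "text/plain") // ask for plain-text output
            .PUT(HttpRequest.BodyPublishers.ofByteArray(document));
        if (sendMimeHint) {
            builder.header("Content-Type", DOCX_MIME);
        }
        return builder.build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // args[0] is the path to a .docx file; the URL below is an assumed local server.
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        HttpRequest request = buildRequest("http://localhost:9998/tika", document, true);
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Passing false as the last argument to buildRequest drops the hint, so you can compare both behaviours against your own server.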

Upvotes: 0
