Reputation: 3049
I am using python3, urllib3 and tika-server-1.13 to extract text from different types of files. This is my Python code:
def get_text(self, input_file_path, text_output_path, content_type):
    global config

    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type

    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }

    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub("[\[][^\]]+[\]]", "", data)
            final_text = re.sub("(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1

    return True
This code works for HTML files, but when I try to extract text from docx files it doesn't work.
The server returns HTTP error code 422: Unprocessable Entity.
Following the tika-server documentation, I tried curl to check whether it works:
curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"
and it worked.
The tika-server docs say:
422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc
This is the correct mime-type (I also checked it with Tika's detect system), it's supported, and the file is not encrypted.
I believe this is related to how I upload the file to the Tika server. What am I doing wrong?
Upvotes: 0
Views: 733
Reputation: 1556
I believe you can make this much easier by using the tika-python module with Client Only Mode.
If you still insist on rolling your own client, there may be some clues in the source code of this module showing how it handles all these different mime types. If you're having a problem with *.docx, you will probably have issues with others.
Upvotes: 1
Reputation: 28807
You're not uploading the data in the same way. --data-binary in curl simply uploads the binary data as it is, with no encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data message. On top of that, by setting headers['Content-Type'] yourself you're preventing urllib3 from setting that header (including the multipart boundary) on the request, so Tika can't understand the body. Either stop updating headers['Content-Type'] or simply pass body=input_file.read() instead of fields.
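As a rough illustration of the second option, here is a minimal sketch of sending the raw bytes with body= so the request mirrors the curl command. The helper name put_raw_file and its parameters are hypothetical, not part of urllib3 or Tika, and pool is assumed to be a urllib3 connection pool pointed at the Tika server.

```python
def put_raw_file(pool, input_file_path, mime_type, url="/tika"):
    """PUT a file's bytes unmodified, like `curl --data-binary`.

    `pool` is assumed to be a urllib3 pool (anything exposing a
    compatible .request() method); this helper is a hypothetical sketch.
    """
    headers = {}
    if mime_type:
        # With a raw body, the Content-Type describes the file itself.
        headers["Content-Type"] = mime_type

    with open(input_file_path, "rb") as input_file:
        payload = input_file.read()

    # body= sends the bytes as-is; fields= would wrap them in a
    # multipart message instead.
    return pool.request("PUT", url, headers=headers, body=payload)
```

With a real server this might be called as put_raw_file(urllib3.HTTPConnectionPool("localhost", 9998), "test.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"), after which response.status and response.data can be checked as in the question's code.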
Upvotes: 3