Space X

Reputation: 97

Docker python tika

I would like to create a Dockerfile that installs all the components needed to run python-tika inside a Docker container.

So far this is my Dockerfile:

### Get Python
FROM python:3

RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas

RUN mkdir scripts

ADD runner.py /scripts/

CMD [ "python", "./scripts/runner.py" ]

I build the image and run it:

docker build -t docker-tika .

docker run docker-tika

But it complains with the following error:

[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 13:49:52,528 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 13:50:09,742 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 13:50:10,133 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 13:50:10,134 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
2020-05-08 13:50:10,271 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 13:50:10,271 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.

The runner.py script is as below:

import tika
tika.initVM()

I have the following two questions:

1. I read that the tika-server jar needs to be downloaded.
2. I read that a call to initVM() inside the Python script starts the tika-server in the background.

I don't know what I'm missing in the Dockerfile. I'd appreciate any help!

I have updated the Dockerfile to install Java as well, but it still complains about Java:

### 1. Get Linux
FROM alpine:3.7

### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre

ENV JAVA_HOME=/opt/java/openjdk \
    PATH="/opt/java/openjdk/bin:$PATH"

### 3. Get Python
FROM python:3

RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas

RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output

ADD runner2.py /scripts/
ADD sample.pdf .

CMD [ "python", "./scripts/runner2.py" ]

cat runner2.py:

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])
print(parsed["content"])

[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika

2020-05-08 14:40:23,183 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 14:41:00,480 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 14:41:02,324 [MainThread  ] [ERROR]  Unable to run java; is it installed?
2020-05-08 14:41:02,324 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.

Upvotes: 3

Views: 3349

Answers (3)

Noumenon

Reputation: 6452

I'm reposting @anapaulagomes' comment as an answer because it's what I was Googling for -- running Tika as a Docker container:

I managed to solve this by using Tika as a separate service (which had better performance than having it in the same image). But instead of running Tika's jar, I consume its API. You only need to configure the environment variables TIKA_CLIENT_ONLY: 1 and TIKA_SERVER_ENDPOINT: tika:9998. You can see the Dockerfile and docker-compose.yml here: https://github.com/DadosAbertosDeFeira/maria-quiteria

You can start the Tika service with

docker run --rm -t -d --name my_tika --net my-network \
         -p 9998:9998 apache/tika:1.27

or by adding this to your docker-compose.yml:

tika:
    image: apache/tika
    ports:
        - "9998:9998"

This allows you to call from tika import parser and parse without ever calling tika.initVM().
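
A minimal sketch of the client side, assuming tika-python reads TIKA_CLIENT_ONLY and TIKA_SERVER_ENDPOINT from the environment when it is first imported, and assuming the endpoint should be a full URL (the library's default is http://localhost:9998). The helper names `configure_tika_client` and `parse_file`, and the host name `tika`, are illustrative and match the docker-compose service name above:

```python
import os


def configure_tika_client(host="tika", port=9998):
    """Point tika-python at an already running Tika server.

    Assumption: the variables must be set before `tika` is imported,
    because the library reads them at import time.
    """
    endpoint = f"http://{host}:{port}"
    os.environ["TIKA_CLIENT_ONLY"] = "1"          # do not download or start the jar
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint  # where the Tika container listens
    return endpoint


def parse_file(path):
    """Parse a file via the remote Tika server (needs `pip install tika`)."""
    from tika import parser  # imported late so the env vars above are already set
    return parser.from_file(path)


endpoint = configure_tika_client()
print("Using Tika server at", endpoint)
```

Inside the Python container you would then call `parse_file("sample.pdf")`. Setting the same two variables with `environment:` in docker-compose.yml or `ENV` in the Dockerfile should have the same effect without touching the script.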

Upvotes: 1

Niklas

Reputation: 1630

I don't have reputation to comment, so posting here.

It seems that your updated Dockerfile is now a multi-stage build. Java is no longer in the final stage: the earlier Alpine stage is discarded, so only what is installed after the last FROM ends up in the image.
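
To make that concrete, here is a small illustrative Python sketch (it is not how Docker itself works internally, and `split_stages` is a hypothetical helper). It groups an abbreviated copy of the question's Dockerfile by FROM line and checks whether the Java install appears in the last stage:

```python
def split_stages(dockerfile_text):
    """Group Dockerfile lines into build stages; each FROM starts a new stage."""
    stages = []
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped.upper().startswith("FROM "):
            stages.append([])
        if stages:
            stages[-1].append(stripped)
    return stages


# Abbreviated version of the Dockerfile from the question.
QUESTION_DOCKERFILE = """\
### 1. Get Linux
FROM alpine:3.7
RUN apk add --no-cache openjdk8-jre
### 3. Get Python
FROM python:3
RUN pip3 install python-docx tika numpy pandas
CMD [ "python", "./scripts/runner2.py" ]
"""

stages = split_stages(QUESTION_DOCKERFILE)
final_stage = stages[-1]
java_in_final = any("openjdk" in line for line in final_stage)

print("number of stages:", len(stages))
print("final stage base:", final_stage[0])
print("java installed in final stage:", java_in_final)
```

The final stage starts from python:3 and contains no Java install, which matches the "Unable to run java" error.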

As Giga Kokaia and others stated earlier, Java is the problem. It seems that you want to do this with a single Dockerfile. One way is to keep Alpine as the base image, but you will need some additional dependencies to be able to install Python and the required pip packages. Alpine might not be the best base for Python when many libraries are used, as it uses musl libc rather than glibc. However, here is a very roughly updated Dockerfile:

### 1. Get Linux
FROM alpine:3.7

### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre \
&& apk add python3 python3-dev gcc g++ gfortran musl-dev libxml2-dev libxslt-dev

ENV JAVA_HOME=/opt/java/openjdk \
    PATH="/opt/java/openjdk/bin:$PATH"


RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx wheel tika numpy 
RUN pip3 install pandas

RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output

ADD runner2.py /scripts/
ADD sample.pdf .

CMD [ "python3", "./scripts/runner2.py"  ]

Upvotes: 5

Giga Kokaia

Reputation: 939

From tika-python's GitHub page:

To use this library, you need to have Java 7+ installed on your system as tika-python starts up the Tika REST server in the background.

So you need Java, but there is no Java in the python:3 image. There are several solutions:

  1. Find a Docker image that has both Python and Tika installed
  2. Use separate images
  3. Manually install Java on python:3 by adding the Java installation commands to your Dockerfile
  4. Install Python on a Java image
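
Whichever option you pick, a quick check at the top of the runner script can confirm that Java is actually visible inside the container before tika tries to start its server. This is a minimal sketch using only the standard library; `java_version` is a hypothetical helper name:

```python
import shutil
import subprocess


def java_version():
    """Return the `java -version` banner, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` normally writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


version = java_version()
if version is None:
    print("java not found on PATH; tika cannot start its local server")
else:
    print("found java:", version.splitlines()[0] if version else "(no banner)")
```

Run in the original python:3 based image, this should report that java was not found, which is the same condition behind the "Unable to run java; is it installed?" log line.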

Upvotes: 1
