Reputation: 97
I would like to create a Dockerfile that installs all the components needed to run python-tika inside a Docker container.
So far this is my Dockerfile:
###Get python
FROM python:3
RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas
RUN mkdir scripts
ADD runner.py /scripts/
CMD [ "python", "./scripts/runner.py" ]
I build the image and run it:
docker build -t docker-tika .
docker run docker-tika
But it complains with the following error:
[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 13:49:52,528 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 13:50:09,742 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 13:50:10,133 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 13:50:10,134 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
2020-05-08 13:50:10,271 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 13:50:10,271 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
The runner.py script is as below:
import tika
tika.initVM()
From what I have read, two things are required: 1. The tika-server jar needs to be downloaded. 2. The Python script needs to call initVM(), which starts the tika-server in the background.
I don't know what I'm missing in the Dockerfile. I'd appreciate any help!
I have updated the Dockerfile to install Java as well, and it is still complaining about Java:
### 1. Get Linux
FROM alpine:3.7
### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre
ENV JAVA_HOME=/opt/java/openjdk \
PATH="/opt/java/openjdk/bin:$PATH"
### 3. Get Python
FROM python:3
RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx tika numpy pandas
RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output
ADD runner2.py /scripts/
ADD sample.pdf .
CMD [ "python", "./scripts/runner2.py" ]
cat runner2.py:
#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('sample.pdf')
print(parsed["metadata"])
print(parsed["content"])
[~/Documents/BERT_DV/Docker_Parser] $ docker run docker-tika
2020-05-08 14:40:23,183 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2020-05-08 14:41:00,480 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2020-05-08 14:41:02,324 [MainThread ] [ERROR] Unable to run java; is it installed?
2020-05-08 14:41:02,324 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
Upvotes: 3
Views: 3349
Reputation: 6452
I'm reposting @anapaulagomes' comment as an answer because it's what I was Googling for -- running Tika as a Docker container:
I managed to solve this by using Tika as a separate service (which had better performance than having it in the same image). But instead of running Tika's jar, I consume its API. You only need to configure the environment variables TIKA_CLIENT_ONLY: 1 and TIKA_SERVER_ENDPOINT: tika:9998. You can see the Dockerfile and docker-compose.yml here: https://github.com/DadosAbertosDeFeira/maria-quiteria
You can start the Tika service with
docker run --rm -t -d --name my_tika --net my-network \
-p 9998:9998 apache/tika:1.27
or by adding this to your docker-compose.yml:
tika:
  image: apache/tika
  ports:
    - "9998:9998"
This allows you to use from tika import parser and parse files without ever calling tika.initVM().
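A minimal sketch of what the client side could look like, assuming the Tika container is reachable under the hostname tika on port 9998 (the compose service name above). The helper names configure_tika_client and parse_file are hypothetical, and the http:// prefix on the endpoint is my assumption. As far as I can tell, tika-python reads these environment variables when the module is imported, so they are set before the import.

```python
import os


def configure_tika_client(endpoint="http://tika:9998"):
    """Point tika-python at an already running Tika server.

    The default endpoint is an assumption: a docker-compose service
    named "tika" listening on port 9998.
    """
    # Set these before `import tika`, since the library appears to read
    # them at import time.
    os.environ["TIKA_CLIENT_ONLY"] = "1"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint
    return endpoint


def parse_file(path, endpoint="http://tika:9998"):
    """Parse a file through the remote Tika server (no local Java needed)."""
    configure_tika_client(endpoint)
    # Imported late so the environment variables above are already set.
    from tika import parser
    return parser.from_file(path, serverEndpoint=endpoint)
```

Inside the application container, parse_file("sample.pdf") would then return the same dictionary as before, with the "metadata" and "content" keys, while the Java process lives only in the Tika container.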
Upvotes: 1
Reputation: 1630
I don't have reputation to comment, so posting here.
It seems that your Dockerfile is now doing a multi-stage build: the second FROM starts a new stage, so Java is not in the final stage anymore, and the earlier stage is discarded from the resulting image.
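To illustrate the point, here is a small Python sketch that splits a Dockerfile into stages at each FROM line and looks only at the last one. It is a deliberately naive illustration, not a real Dockerfile parser (it ignores line continuations, for instance), and the shortened Dockerfile text in it is my own reduction of the one in the question.

```python
def split_stages(dockerfile_text):
    """Group Dockerfile instructions into build stages, one per FROM line."""
    stages = []
    for raw in dockerfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.upper().startswith("FROM "):
            stages.append([line])  # every FROM begins a new stage
        elif stages:
            stages[-1].append(line)
    return stages


# Shortened version of the Dockerfile from the question (assumption:
# only the lines relevant to Java and Python are kept).
DOCKERFILE = """
FROM alpine:3.7
RUN apk add --no-cache openjdk8-jre
FROM python:3
RUN pip3 install tika
CMD [ "python", "./scripts/runner2.py" ]
"""

stages = split_stages(DOCKERFILE)
final_stage = stages[-1]

# Only the last stage ends up in the image that `docker run` starts.
print("number of stages:", len(stages))
print("final base image:", final_stage[0])
print("java installed in final stage:",
      any("openjdk" in line for line in final_stage))
```

The Java installation sits in the first stage only, so the container that actually runs is a plain python:3 image without it.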
As Giga Kokaia and others stated earlier, Java is the problem. It seems that you want to do this with a single Dockerfile. This can be achieved, for example, by keeping Alpine as the base image, but you will need some additional dependencies to be able to install Python and the required pip packages. Alpine might not be the best base for Python when many libraries are involved, as it uses musl rather than glibc. However, here is a very roughly updated Dockerfile:
### 1. Get Linux
FROM alpine:3.7
### 2. Get Java via the package manager
RUN apk update \
&& apk upgrade \
&& apk add --no-cache bash \
&& apk add --no-cache --virtual=build-dependencies unzip \
&& apk add --no-cache curl \
&& apk add --no-cache openjdk8-jre \
&& apk add python3 python3-dev gcc g++ gfortran musl-dev libxml2-dev libxslt-dev
ENV JAVA_HOME=/opt/java/openjdk \
PATH="/opt/java/openjdk/bin:$PATH"
RUN pip3 install --upgrade pip requests
RUN pip3 install python-docx wheel tika numpy
RUN pip3 install pandas
RUN mkdir scripts
RUN mkdir pdfs
RUN mkdir output
ADD runner2.py /scripts/
ADD sample.pdf .
CMD [ "python3", "./scripts/runner2.py" ]
Upvotes: 5
Reputation: 939
From tika's GitHub:
To use this library, you need to have Java 7+ installed on your system as tika-python starts up the Tika REST server in the background.
So you need to have Java, but there is no Java in the python:3 image.
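A quick way to confirm this from inside the container is to check for the java executable before touching Tika. This is only a diagnostic sketch using the standard library; the function names java_available and java_version are my own.

```python
import shutil
import subprocess


def java_available():
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


def java_version():
    """Return the output of `java -version`, or None if Java is missing."""
    if not java_available():
        return None
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run(["java", "-version"],
                            capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if java_available():
    print("java found:", java_version())
else:
    print("java not found on PATH; tika-python cannot start its local server")
```

Run in the plain python:3 image, I would expect this to take the second branch, which matches the "Unable to run java; is it installed?" error in the logs.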
There are some solutions.
Upvotes: 1