I'm encountering an issue while trying to upload documents to Solr via the /update/extract endpoint.
I run Solr 8.5.2 and ZooKeeper 3.5.8 in Docker and could previously index data via
...
solr.add(solr_documents)
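For context, `solr.add(...)` here presumably comes from pysolr, which serializes a list of dicts to JSON and POSTs it to the collection's /update endpoint. A minimal stdlib-only sketch of that request, where the collection name `test10`, the host, and the document fields are all assumptions:

```python
import json

# Hypothetical documents in the shape pysolr's add() accepts:
# a list of dicts, one dict per Solr document.
solr_documents = [
    {"id": "doc1", "title": "First document", "content": "Hello Solr"},
    {"id": "doc2", "title": "Second document", "content": "Hello again"},
]

# pysolr POSTs the JSON-serialized list to the collection's /update
# endpoint; collection name and host are assumptions here.
update_url = "http://localhost:8983/solr/test10/update?commit=true"
payload = json.dumps(solr_documents)

print(update_url)
print(payload)
```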
My setup:
The filesystem (the django folder is not relevant to the problem)
The files in solr
The file in solr-config
I use this docker-compose.yaml (the django image isn't relevant to the problem):
version: "3"
services:
solr:
build:
context: solr/.
dockerfile: Dockerfile
container_name: aips-solr
hostname: aips-solr
ports:
- 8983:8983
environment:
- ZK_HOST=aips-zk:2181
- SOLR_HOST=aips-solr
networks:
- zk-solr
- solr-django
restart: unless-stopped
depends_on:
- zookeeper
volumes:
- ./solr/solr-config:/opt/solr/server/solr/configsets/_default/conf
zookeeper:
image: zookeeper:3.5.8
container_name: aips-zk
hostname: aips-zk
ports:
- 2181:2181
networks:
- zk-solr
- solr-django
restart: unless-stopped
django:
build:
context: django/.
dockerfile: Dockerfile
container_name: django
hostname: django
ports:
- 4000:4000
depends_on:
- solr
volumes:
- ./django/app:/app
networks:
- solr-django
networks:
zk-solr:
solr-django:
The Dockerfile contains:
FROM solr:8.5.2
USER root
ADD run_solr_w_ltr.sh ./run_solr_w_ltr.sh
RUN chown solr:solr run_solr_w_ltr.sh
RUN chmod u+x run_solr_w_ltr.sh
RUN chown -R solr:solr /opt/solr/
USER solr
ENTRYPOINT ["./run_solr_w_ltr.sh"]
The run_solr_w_ltr.sh contains (it copies the Learning to Rank plugin into Solr):
#!/bin/sh
mkdir -p /var/solr/data/lib/
cp dist/solr-ltr-*.jar /var/solr/data/lib/
ls /var/solr/data/lib
solr-foreground -Dsolr.ltr.enabled=true
The launch_solr.sh builds the image with
#!/bin/sh
docker build . -t aips-solr
Solr runs successfully and the admin UI can be accessed via http://localhost:8983/solr/#/
I followed the instructions at https://solr.apache.org/guide/8_5/uploading-data-with-solr-cell-using-apache-tika.html
I created a file called solrconfig.xml in the solr subfolder.
Its content is:
<lib dir="/opt/solr/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="/opt/solr/dist/" regex="solr-cell-\d.*\.jar" />
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
</lst>
</requestHandler>
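The `regex` attributes on the `<lib>` directives are Java regular expressions matched against the whole file name in `dir`. A quick sketch of which jar names the `solr-cell-\d.*\.jar` pattern would pick up, using Python's `re` (whose syntax is close enough for this pattern; the file names are made-up examples):

```python
import re

# Same pattern as the <lib> directive: solr-cell-\d.*\.jar
# fullmatch() mirrors Java's matches(), which tests the entire name.
pattern = re.compile(r"solr-cell-\d.*\.jar")

candidates = [
    "solr-cell-8.5.2.jar",  # matches: solr-cell- followed by a digit
    "solr-core-8.5.2.jar",  # no match: wrong prefix
    "solr-cell.jar",        # no match: no version digit after the dash
]

matches = [name for name in candidates if pattern.fullmatch(name)]
print(matches)
```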
I checked that the solr folder exists and contains the files.
I created a new index in the Solr admin UI.
It should be using the config from the directory
/opt/solr/server/solr/configsets/_default/conf
right?
I set the volume via
volumes:
- ./solr/solr-config:/opt/solr/server/solr/configsets/_default/conf
so the config should be this solrconfig.xml:
<lib dir="/opt/solr/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="/opt/solr/dist/" regex="solr-cell-\d.*\.jar" />
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
</lst>
</requestHandler>
right?
Setting the parser-specific properties is optional, if I understand correctly.
If I call the /update/extract endpoint of the collection via the admin UI,
I get:
If I use Postman
with a POST request to the URI http://localhost:8983/solr/test10/update/extract
and these key/value pairs:
| Key | Value |
| --- | --- |
| extractOnly | true |
| wt | json |
| stream.file | Zertifikate.pdf |
| stream.body | xaAgikF464R9gR7Jz7ACA0... (base64 string) |
I get the same result.
The same happens if I use an adjusted curl command like the one in the docs:
curl "http://localhost:8983/solr/gettingstarted/update/extract?literal.id=doc6&defaultField=text&commit=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'
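For reference, the query string of that curl call can also be assembled programmatically; a small sketch with Python's stdlib, where the collection name `test10` is an assumption (the curl example above uses `gettingstarted`):

```python
from urllib.parse import urlencode

base = "http://localhost:8983/solr/test10/update/extract"
params = {
    "literal.id": "doc6",    # becomes the indexed document's id field
    "defaultField": "text",  # field that unmapped extracted content lands in
    "commit": "true",
    "wt": "json",
}
extract_url = f"{base}?{urlencode(params)}"

# The document itself would be sent as the POST body, with a matching
# Content-Type header (e.g. application/pdf for a PDF).
print(extract_url)
```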
What I tried so far
I changed the paths of the solr folders to relative paths in
solrconfig.xml:
<lib dir="../../../../../solr/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../../../../../solr/dist/" regex="solr-cell-\d.*\.jar" />
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
</lst>
</requestHandler>
I checked that the solr folder contains the .jars.
I checked that I can access the collection.
I checked that the user solr has the right permissions.
My setup must be wrong but I can't find any other clues on how to find and solve the error.
Any help or advice would be greatly appreciated.
Based on MatsLindh's comment, I have made the following further changes.
According to the admin interface you're running Solr in cloud mode - that means that you have to explicitly upload your config set to the running ZooKeeper instance. See solr.apache.org/guide/solr/latest/deployment-guide/… - you might want to run it as a single instance instead of using the built-in cluster support if you want to just have a single node and supply the configuration on the file system instead. By MatsLindh
I uploaded the config with the following steps:
docker-compose up
docker-compose exec solr solr zk upconfig -n newconfig -d /opt/solr/server/solr/configsets/_default/conf -z zookeeper:2181
This uploads the configuration from that folder. Afterwards, the file solrconfig.xml had to be adapted as follows:
<config>
<luceneMatchVersion>8.5.2</luceneMatchVersion>
<lib dir="/opt/solr/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="/opt/solr/dist/" regex="solr-cell-\d.*\.jar" />
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.content">content</str>
</lst>
</requestHandler>
</config>
A schema.xml also needed to be created. I used the schema:
<?xml version="1.0" encoding="UTF-8" ?>
<schema>
<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fields>
<field name="title" type="text_general" indexed="true"
stored="true"/>
<field name="content" type="text_general" indexed="true"
stored="true"/>
</fields>
</schema>
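The `content` field above is where `fmap.content` sends the extracted body text; other Tika metadata names are first normalized by `lowernames=true` before being matched against the schema. A rough stdlib approximation of that documented mapping (lowercase, runs of non-alphanumerics become underscores) - this is an illustration of the behavior, not Solr's actual code:

```python
import re

def lowernames(field_name: str) -> str:
    """Approximate Solr Cell's lowernames=true mapping: lowercase the
    incoming Tika field name and replace runs of non-alphanumeric
    characters with a single underscore."""
    return re.sub(r"[^a-z0-9]+", "_", field_name.lower())

for name in ["Content-Type", "Last-Modified", "X-Parsed-By"]:
    print(name, "->", lowernames(name))
```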
Because of the schema, the two text files
synonyms.txt and stopwords.txt had to be created.
After the changes my folder structure looks like this:
After all the changes I get the following error when I try to create a new collection with the configset:
Possibly unhandled rejection:
{
  "data": {
    "responseHeader": { "status": 400, "QTime": 620 },
    "failure": {
      "aips-solr:8983_solr": "org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://aips-solr:8983/solr: Error CREATEing SolrCore 'test_upload_3_shard1_replica_n1': Unable to create core [test_upload_3_shard1_replica_n1] Caused by: null"
    },
    "Operation create caused exception:": "org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Underlying core creation failed while creating collection: test_upload_3",
    "exception": {
      "msg": "Underlying core creation failed while creating collection: test_upload_3",
      "rspCode": 400
    },
    "error": {
      "metadata": ["error-class", "org.apache.solr.common.SolrException", "root-error-class", "org.apache.solr.common.SolrException"],
      "msg": "Underlying core creation failed while creating collection: test_upload_3",
      "code": 400
    }
  },
  "status": 400,
  "config": {
    "method": "GET",
    "transformRequest": [null],
    "transformResponse": [null],
    "jsonpCallbackParam": "callback",
    "url": "admin/collections",
    "params": {
      "wt": "json",
      "_": 1687760309417,
      "action": "CREATE",
      "name": "test_upload_3",
      "router.name": "compositeId",
      "numShards": 1,
      "collection.configName": "newconfig",
      "replicationFactor": 1,
      "maxShardsPerNode": 1,
      "autoAddReplicas": "false"
    },
    "headers": {
      "Accept": "application/json, text/plain, */*",
      "X-Requested-With": "XMLHttpRequest"
    },
    "timeout": 10000
  },
  "statusText": "Bad Request",
  "xhrStatus": "complete",
  "resource": {}
}
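The interesting parts of that response are buried in `data.failure` and `data.error.msg`; a small sketch that digs the root cause out of such a payload (the JSON string below is a trimmed stand-in for the real response):

```python
import json

# Trimmed stand-in for the admin UI's error response shown above.
response = json.loads("""
{
  "data": {
    "responseHeader": {"status": 400, "QTime": 620},
    "failure": {
      "aips-solr:8983_solr": "RemoteSolrException: Error CREATEing SolrCore 'test_upload_3_shard1_replica_n1': Unable to create core [test_upload_3_shard1_replica_n1] Caused by: null"
    },
    "error": {"msg": "Underlying core creation failed while creating collection: test_upload_3", "code": 400}
  },
  "status": 400
}
""")

# Per-node failures first, then the summarized error message.
for node, reason in response["data"]["failure"].items():
    print(f"{node}: {reason}")
print(response["data"]["error"]["msg"])
```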
I think it has to do with a network or firewall issue. The guess is based on this Stack Overflow post: Failed to create collection.
I will check it this evening on another PC.