Reputation: 55
I have Solr installed and running on Windows, and I am following the Quick Start tutorial from the Solr website. Using the post.jar file, I tried to index the documents listed under /solr/docs and got the following error:
ERROR - 2016-05-11 16:35:16.772; [c:gettingstarted s:shard2 r:core_node1 x:gettingstarted_shard2_replica1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)
I then tried to index one file at a time, starting with a PDF and then an HTML file. Below are the commands I used and the exceptions I see:
java -Dc=gettingstarted -Dtype=application/pdf -jar example/exampledocs/post.jar scandocs/
ERROR - 2016-05-16 16:17:55.992; [c:gettingstarted s:shard2 r:core_node1 x:gettingstarted_shard2_replica1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Unsupported ContentType: application/pdf Not in: [application/xml, application/csv, application/json, text/json, text/csv, text/xml, application/javabin]
java -Dc=gettingstarted -Dtype=text/html -jar example/exampledocs/post.jar scandocs/
ERROR - 2016-05-16 16:19:03.601; [c:gettingstarted s:shard2 r:core_node1 x:gettingstarted_shard2_replica1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Unsupported ContentType: text/html Not in: [application/xml, application/csv, application/json, text/json, text/csv, text/xml, application/javabin]
All I have under the /scandocs folder is an HTML file.
It seems that my Solr instance is not configured to read HTML/PDF documents. But the tutorial talks about indexing a bunch of rich documents without mentioning anything about such configuration.
I would really appreciate it if anyone could help me with the configuration I need here.
Upvotes: 0
Views: 3148
Reputation: 909
I just had a similar issue myself. The problem was that the post.jar tool, which you have to use on Windows, posts to the /update handler by default (as MatsLindh mentioned). That handler is very restrictive: it only accepts certain content types (the ones listed in your error message) and requires specific formatting. Instead, I used the -Durl parameter to make it post to /update/extract, which worked. The command looked like this:
java -Durl=http://localhost:8983/solr/gettingstarted/update/extract -jar example\exampledocs\post.jar "C:\documents to index"
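If you want to check that /update/extract accepts your file independently of post.jar, a rough sketch with curl is below. It follows the pattern from the Solr Cell documentation (a multipart upload plus a literal.id parameter). The file name scandocs/example.html, the id doc1, and the helper names extract_url and post_file are made up for illustration, and it assumes curl and a POSIX shell (for example Git Bash on Windows) are available. By default it only prints the command so you can inspect it first.

```shell
#!/bin/sh
# Sketch only: send a single file to Solr's extracting handler with curl.
# Assumptions (not from the original post): Solr listens on localhost:8983,
# and the file path and document id used below are hypothetical.

SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"
# 1 = only print the curl command, 0 = actually send the request.
DRY_RUN="${DRY_RUN:-1}"

# Build the /update/extract URL for a collection ($1) and document id ($2).
# The id is not URL-encoded here, so keep it free of spaces and '&'.
extract_url() {
  printf '%s/%s/update/extract?literal.id=%s&commit=true\n' "$SOLR_BASE" "$1" "$2"
}

# Upload file $3 into collection $1 under document id $2.
post_file() {
  url=$(extract_url "$1" "$2")
  if [ "$DRY_RUN" = "1" ]; then
    echo "curl '$url' -F 'myfile=@$3'"
  else
    curl "$url" -F "myfile=@$3"
  fi
}

post_file gettingstarted doc1 scandocs/example.html
```

Run it with DRY_RUN=0 to actually send the file. If that succeeds, the collection is able to parse rich documents and the earlier errors came from posting to /update. The usage text of post.jar also lists a -Dauto option that is meant to pick the handler based on file type; I have not verified it against this exact version, so check the help output (java -jar example\exampledocs\post.jar -h) before relying on it.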
Upvotes: 1