srb

Reputation: 55

SOLR - Missing configuration: Unsupported ContentType: text/html; Unsupported ContentType: application/pdf

I have SOLR installed and running on Windows, and I am following the Quick Start tutorial from the SOLR website. Using the post.jar file, I tried to index the documents listed under /solr/docs and got the following error:

ERROR - 2016-05-11 16:35:16.772; [c:gettingstarted s:shard2 r:core_node1 x:gettingstarted_shard2_replica1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)

I then tried to index one file at a time, starting with a PDF and then an HTML file. Below are the commands I used and the exceptions I saw:

java -Dc=gettingstarted -Dtype=application/pdf -jar example/exampledocs/post.jar scandocs/

ERROR - 2016-05-16 16:17:55.992; [c:gettingstarted s:shard2 r:core_node1 x:gettingstarted_shard2_replica1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Unsupported ContentType: application/pdf  Not in: [application/xml, application/csv, application/json, text/json, text/csv, text/xml, application/javabin]

java -Dc=gettingstarted -Dtype=text/html -jar example/exampledocs/post.jar scandocs/

ERROR - 2016-05-16 16:19:03.601; [c:gettingstarted s:shard2 r:core_node1 x:gettingstarted_shard2_replica1] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Unsupported ContentType: text/html  Not in: [application/xml, application/csv, application/json, text/json, text/csv, text/xml, application/javabin]

All I have under the /scandocs folder is an HTML file. It seems my SOLR instance is not configured to read HTML/PDF documents, but the tutorial talks about indexing a bunch of rich documents without mentioning any such configuration.

I would really appreciate it if anyone could help me with the configuration I need here.

Upvotes: 0

Views: 3148

Answers (1)

Matt Wanchap

Reputation: 909

I just had a similar issue myself. The problem was that the post.jar tool you have to use on Windows sends documents only to the /update handler by default (as MatsLindh mentioned). That handler is restrictive: it accepts only certain content types and requires specific formatting. Instead, I used the -Durl parameter to point post.jar at /update/extract, which worked. The command looked like this:

java -Durl=http://localhost:8983/solr/gettingstarted/update/extract -jar example\exampledocs\post.jar "C:\documents to index"
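If you would rather not go through post.jar at all, the same /update/extract handler can be called directly over HTTP. Below is a minimal Python sketch of that idea, not something from the tutorial. The function name build_extract_request is made up for illustration, and it assumes the default localhost:8983 address, the gettingstarted collection from the question, and that using the file name as the document id (via the literal.id parameter) is acceptable for your schema.

```python
import mimetypes
import sys
import urllib.parse
import urllib.request
from pathlib import Path


def build_extract_request(path, collection="gettingstarted",
                          base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST of one file to /update/extract.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    path = Path(path)
    # literal.id supplies the unique key; commit=true commits right after indexing.
    params = urllib.parse.urlencode({"literal.id": path.name, "commit": "true"})
    url = f"{base_url}/{collection}/update/extract?{params}"
    # Guess the MIME type from the extension, falling back to a generic binary type.
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return urllib.request.Request(
        url,
        data=path.read_bytes(),
        headers={"Content-Type": content_type},
        method="POST",
    )


if __name__ == "__main__":
    # Send every file named on the command line to the running Solr instance.
    for name in sys.argv[1:]:
        with urllib.request.urlopen(build_extract_request(name)) as resp:
            print(name, resp.status)
```

Saved under a name of your choice and run with one or more file paths as arguments, it posts each file to the extract handler. The key point is the same as with -Durl: the request goes to /update/extract rather than /update, so the content type no longer has to be one of the XML/JSON/CSV types listed in the error message.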

Upvotes: 1
