How to retain HTML coding while indexing HTML documents to Apache Solr?

Question

I am indexing HTML documents into Solr with the SimplePostTool from the command line:

post -c core0 /mnt/Vancouver/programming/datasci/solr/test/d*.html

Despite various edits to solrconfig.xml and schema.xml (solr.HTMLStripCharFilterFactory etc.), Solr will not retain HTML content (URLs) present in the HTML source documents.

A new UC Riverside study shows flame retardants ...

appears in Solr as

"p":[" A new https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes UC Riverside study shows ...

It appears that Apache Tika is stripping the HTML coding from the content within HTML elements before it is passed to Solr.

https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html#key-solr-cell-concepts
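If Tika is indeed where the markup is lost, one possible workaround is to skip Solr Cell and send each file's raw HTML to Solr's JSON update handler as an ordinary field value. The sketch below is only an illustration under several assumptions: Solr is reachable at localhost:8983, the core is core0 as in the command above, and the schema defines a field named html_raw (a hypothetical name) that is stored but not indexed and has no docValues, so that large documents are not rejected for exceeding the term size limit. The helper names build_doc and post_docs are likewise made up for this example.

```python
import json
import urllib.request
from pathlib import Path

# Assumed local Solr instance; the core name "core0" comes from the question.
SOLR_UPDATE_URL = "http://localhost:8983/solr/core0/update?commit=true"


def build_doc(path):
    """Return a Solr document that keeps the file's HTML markup verbatim."""
    html = Path(path).read_text(encoding="utf-8")
    return {
        "id": str(path),    # file path used as the unique key
        "html_raw": html,   # hypothetical field; it must exist in the schema
    }


def post_docs(docs, url=SOLR_UPDATE_URL):
    """POST documents to the JSON update handler, bypassing Solr Cell/Tika."""
    body = json.dumps(docs).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling post_docs([build_doc(p) for p in Path("/mnt/Vancouver/programming/datasci/solr/test").glob("d*.html")]) would then send the same files as the post command above. Because the markup travels as a plain JSON string, nothing parses it on the way in; any text analysis (for example with solr.HTMLStripCharFilterFactory) would have to be applied to a separate, indexed copy of the field.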

Rendered web page (note, e.g., A new https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes UC Riverside study ... in first document)

How to retain HTML coding while indexing HTML documents to Apache Solr?

