I am indexing HTML documents into Solr with the SimplePostTool on the command line:
post -c core0 /mnt/Vancouver/programming/datasci/solr/test/d*.html
Despite various edits to solrconfig.xml and schema.xml (solr.HTMLStripCharFilterFactory etc.), Solr does not retain the HTML markup (the <a href> links) present in the HTML source documents. For example, this source text
A new <a href="https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes">UC Riverside study</a> shows flame retardants ...
appears in Solr as
"p":[" A new https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes UC Riverside study shows ...
It appears that Apache Tika strips the HTML markup from the document content before that content is passed to Solr.
Update: here is a workaround.
url_process.sh
#!/bin/bash
# Mask the anchor tags so Tika does not strip them, then post each file.
cd /mnt/Vancouver/programming/datasci/solr/test/url_test/ || exit 1
for FILE in *.html
do
  sed 's/<a href/LEFTANGLEBRACKETa href/g ; s%</a>%LEFTANGLEBRACKET/a>%g' "$FILE" > tmp
  post -c core0 tmp
done
solrconfig.xml
<updateRequestProcessorChain
processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date">
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">p</str>
<str name="pattern">LEFTANGLEBRACKET</str>
<str name="replacement">&lt;</str>
<bool name="literalReplacement">true</bool>
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Explanation
I subvert Apache Tika's preprocessing in two steps:
First, a Bash script preprocesses the HTML source documents, replacing the < in every <a href="...">...</a> with an alphabetic placeholder string (LEFTANGLEBRACKET). This hides those links from Tika.
Then, upon indexing, a RegexReplaceProcessorFactory processor in solrconfig.xml swaps the placeholder back to <, regenerating the links.
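The round trip can be checked without Solr. The sketch below is only an illustration: it applies the same sed expressions as url_process.sh to one sample line, then uses a second sed call as a stand-in for the RegexReplaceProcessorFactory step. That stand-in is an assumption for demonstration, not how Solr performs the replacement internally, and the variable names are hypothetical.

```shell
#!/bin/bash
# Sample line taken from the question (shortened).
sample='A new <a href="https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes">UC Riverside study</a> shows flame retardants'

# Step 1: mask the anchor tags, using the same sed expressions as url_process.sh.
masked=$(printf '%s\n' "$sample" | sed 's/<a href/LEFTANGLEBRACKETa href/g ; s%</a>%LEFTANGLEBRACKET/a>%g')

# Step 2: stand-in for the Solr processor: swap the placeholder back to "<".
restored=$(printf '%s\n' "$masked" | sed 's/LEFTANGLEBRACKET/</g')

printf '%s\n' "$masked"
printf '%s\n' "$restored"
```

If the second printed line matches the original sample, the placeholder survives the round trip. Note that the placeholder must not occur naturally in the documents, otherwise the reverse step would insert a stray <.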
Result
Solr:
"p":[" A new <a href=\"https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes\">UC Riverside study</a> shows flame retardants ..."],
A working hyperlink now appears in the Ajax-rendered web page.