Reputation: 3094
TLDR
How do I configure the Solr Data Import Handler (DIH) so that it imports HTML the way Solr's "post" utility does?
Context
We're doing a small project in which code exports a set of pages from a wiki (Confluence) to plain HTML, so that they are available in a DR data center (plain HTML pages do not depend on a database, etc.).
We want to index the html pages in solr.
We have it working using the "post" utility that ships with Solr:
post -c OPERATIONS -recursive -0 -host solr $(find . -name '*.html')
This works, but we would like to use the Data Import Handler (DIH) instead, i.e. replace the shell command with a single HTTP call to the DIH endpoint ("/dataimport").
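For illustration, the "single HTTP call" could look roughly like the sketch below. It assumes Solr listens on its default port 8983, that the core is named OPERATIONS and the host is "solr" (both taken from the post command above), and that the /dataimport handler is already registered for that core. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_dataimport_url(host, core, port=8983, clean=True, commit=True):
    """Build the URL that asks the DIH to run a full import.

    Assumption: the handler is registered at /dataimport for this core.
    """
    params = urlencode({
        "command": "full-import",       # DIH command that re-imports everything
        "clean": str(clean).lower(),    # delete existing documents first
        "commit": str(commit).lower(),  # commit once the import finishes
        "wt": "json",
    })
    return f"http://{host}:{port}/solr/{core}/dataimport?{params}"


def trigger_full_import(host, core):
    """Send the request; this needs a running Solr instance to succeed."""
    with urlopen(build_dataimport_url(host, core)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_dataimport_url("solr", "OPERATIONS"))
```

Only the URL builder runs without a server; trigger_full_import is the part that would replace the shell command.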
Question
How do I configure the Tika "data config XML" file to get functionality similar to the Solr "post" command?
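Correction: I had originally written '"id" and "title" field...'; the document imported through the DIH contains only the "id" field:
"id":"database_operations_2019.html",
"_version_":1650836000296927232},
whereas the same page indexed with the post utility has many more fields: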
"id":"/usr/local/html/OPERATIONS_2019_1119_1500/./database_operations_2019.html",
"stream_size":[54115],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"stream_content_type":["text/html"],
"dc_title":["Database Operations 2019 Guidebook"],
"content_encoding":["UTF-8"],
"content_type_hint":["text/html; charset=UTF-8"],
"resourcename":["/usr/local/html/OPERATIONS_2019_1119_1500/./database_operations_2019.html"],
"title":["Database Operations 2019 Guidebook"],
"content_type":["text/html; charset=UTF-8"],
"_version_":1650834641083432960},
Some Points
Data Config Xml File
<dataConfig>
  <dataSource type="BinFileDataSource"/>
  <document>
    <entity name="file" processor="FileListEntityProcessor"
            dataSource="null"
            htmlMapper="true"
            format="html"
            baseDir="/usr/local/var/www/confluence/OPERATIONS"
            fileName=".*html"
            rootEntity="false">
      <field column="file" name="id"/>
      <entity name="html" processor="TikaEntityProcessor"
              url="${file.fileAbsolutePath}" format="text">
        <field column="title" name="title" meta="true"/>
        <field column="dc:format" name="format" meta="true"/>
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
Upvotes: 0
Views: 374
I ended up writing a few lines of code that parse the HTML files with jsoup, and ditched the Solr Data Import Handler (DIH).
It was very straightforward using Spring, Solr, and the jsoup HTML parser.
One caveat: the Java "bean" object that stores the Solr fields needed a "text" field for the out-of-the-box default search field to work (i.e. with the Solr Docker instance).
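As a rough illustration of that approach (not the actual Spring/jsoup code, which is not shown here), the sketch below uses Python's standard-library html.parser in place of jsoup. It pulls the title and visible text out of a page and builds a document with "id", "title" and "text" fields. The class and function names are hypothetical; the "text" field is the one the caveat above refers to.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collect the <title> and the visible text of an HTML page."""

    SKIPPED = {"script", "style"}  # content that should not be indexed

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def html_to_solr_doc(doc_id, html):
    """Return a dict shaped like the bean: id, title, and a 'text' field."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return {
        "id": doc_id,
        "title": "".join(parser.title_parts).strip(),
        "text": " ".join(parser.text_parts),  # feeds the default search field
    }
```

A list of such dictionaries could then be sent as JSON to the core's /update endpoint, which is roughly what the bean-based Java code does through a Solr client.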
Upvotes: 0