W.P. McNeill

Reputation: 17066

Import CSV with HTML values into Solr

I have a CSV file I want to import into Solr. It has a column, HTMLText, which contains English text with HTML markup.

How should I write my schema.xml to properly import this column? I'm working from the sample schema XML, in which I see general purpose text field types and English field types, but I don't see a field type for HTML.

I know the post command allows you to post whole HTML documents, so presumably there's a field parser to handle this, but I don't know what it is.

Is there a parser type for HTML built into Solr, or should I strip the HTML tags out of my HTMLText column?

Upvotes: 0

Views: 74

Answers (1)

MatsLindh

Reputation: 52892

There's an HTMLStripCharFilterFactory that you can apply to a field type's analyzer; it strips the HTML markup before any tokenization happens.

It also drops comments and tag attributes, so whether it's suitable depends on what you expect the end result to be.
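As a rough sketch of how that could look in schema.xml: the field type name text_html_en is made up for illustration, and the tokenizer and lowercase filter are just an assumed minimal chain, so swap in whatever your English field type already uses. Only the charFilter line is the part this answer is about.

```xml
<!-- Hypothetical field type: the name "text_html_en" is an assumption -->
<fieldType name="text_html_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before the tokenizer sees it -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Map the CSV column onto the field type above -->
<field name="HTMLText" type="text_html_en" indexed="true" stored="true"/>
```

Note that the char filter only affects what gets indexed. With stored="true" the stored value is still the original text including the markup, so if you want clean text returned in results you'd have to strip the tags before importing.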

Upvotes: 1
