tkaramp

Reputation: 21

Skip faulty lines when using Solr's CSV handler

I want to parse a CSV file using Solr's CSV handler. The problem is that my file might contain problematic lines (lines with unescaped encapsulators). When Solr encounters such a line, it fails with the following message and stops:

<str name="msg">CSVLoader: input=null, line=1941,can't read line: 1941
    values={NO LINES AVAILABLE}</str><int name="code">400</int>

I understand that in that case the parser cannot fix the problematic line, and that is OK for me. I just want to skip the faulty line and continue with the rest of the file.

I tried using the TolerantUpdateProcessorFactory in my processor chain but the result was the same.

I use Solr 6.5.1, and the curl command I run looks like this:

    curl '<path>/update?update.chain=tolerant&maxErrors=10&commit=true&fieldnames=<my fields are provided>,&skipLines=1' --data-binary @my_file.csv -H 'Content-type:application/csv'

Finally, this is what I put in my solrconfig.xml:

 <updateRequestProcessorChain name="tolerant">
   <processor class="solr.TolerantUpdateProcessorFactory">
     <int name="maxErrors">10</int>
   </processor>
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

Upvotes: 1

Views: 941

Answers (1)

Jeeppp

Reputation: 1573

I would suggest that you pre-process and clean the data using UpdateRequestProcessors.

This is a mechanism to transform the documents that are submitted to Solr for indexing.

Read more about UpdateRequestProcessors.
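
For illustration, a custom UpdateRequestProcessor can inspect each document before it is indexed and drop or clean the ones that fail a check. The sketch below is a minimal example, assuming you compile it against your Solr version, put the jar in the core's lib directory, and register the factory in an updateRequestProcessorChain; the package name, class name, and the "id"-based check are hypothetical placeholders, not part of Solr itself.

    package com.example.solr; // hypothetical package name

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    // Hypothetical processor: silently drops documents that fail a simple
    // validity check instead of passing them down the chain.
    public class DropInvalidDocProcessorFactory extends UpdateRequestProcessorFactory {

      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                                SolrQueryResponse rsp,
                                                UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            // Example check only: skip documents with no "id" value.
            // Replace with whatever cleaning/validation your data needs.
            if (doc.getFieldValue("id") == null) {
              return; // do not forward this document to the rest of the chain
            }
            super.processAdd(cmd);
          }
        };
      }
    }

You would then list this factory in a processor chain in solrconfig.xml, before solr.RunUpdateProcessorFactory, the same way the "tolerant" chain above registers its processors.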

Upvotes: 1
