Reputation: 616
We are using Solr 5.0.0. The delta-import configuration is very simple, just like the example on the Apache wiki.
We have set up a cron job to run delta-imports every 30 minutes; the setup is simple as well:
0,30 * * * * /usr/bin/wget http://<solr_host>:8983/solr/<core_name>/dataimport?command=delta-import
Now, what happens if a running delta-import sometimes takes longer than the interval before the next scheduled cron job?
Does Solr launch the next delta-import in a parallel thread, or does it ignore the new request until the previous one is done?
Extending the interval in the cron scheduler isn't an option, as the same problem could reappear as the number of users and documents grows over time...
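Independently of how Solr behaves, one way to guarantee the jobs never overlap on the client side (assuming a Linux host where util-linux's flock is available; the lock-file path is illustrative) is to serialize the cron entry with a lock:

```shell
# Skip this run entirely (-n = non-blocking) if the previous delta-import
# is still holding the lock file.
0,30 * * * * /usr/bin/flock -n /tmp/solr-delta-import.lock /usr/bin/wget -q -O /dev/null "http://<solr_host>:8983/solr/<core_name>/dataimport?command=delta-import"
```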
Upvotes: 0
Views: 623
Reputation: 900
Solr will simply ignore the next import request until the first one has finished, and it will not queue the second request. I have observed this behaviour and I've read it documented somewhere, but I can't find the reference now.
In fact, I'm dealing with the same problem. I tried to optimize the queries:
deltaImportQuery="select * from Assests where ID='${dih.delta.ID}'"
deltaQuery="select [ID] from Assests where date_created > '${dih.last_index_time}' "
The deltaQuery retrieves only the ID field first, and the deltaImportQuery then fetches the intended document.
You may also list your fields explicitly instead of using '*'. Since I query a view, that doesn't apply in my case.
I will update this answer if I find another solution.
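For context, a minimal data-config entity wiring these two queries together might look like this (a sketch only; the pk and table name are taken from the queries above, everything else is illustrative):

```xml
<entity name="Assests" pk="ID"
        query="select * from Assests"
        deltaQuery="select [ID] from Assests where date_created &gt; '${dih.last_index_time}'"
        deltaImportQuery="select * from Assests where ID='${dih.delta.ID}'">
</entity>
```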
Update: beyond the query changes above, I made one more change that sped up my indexing process about 10 times. I had two big entities nested, one inside the other, like this:
<entity name="TableA" query="select * from TableA">
    <entity name="TableB" query="select * from TableB where TableB.TableA_ID='${TableA.ID}'"/>
</entity>
This yields multi-valued TableB fields, but for every row of TableA a separate database request is made for TableB. I changed my view to use a WITH clause combined with a comma-separated field value, let Solr's field mapping split the value, and indexed it into a multi-valued field.
My whole indexing process went from hours to minutes. Below are my view and Solr mapping config.
WITH tableb_with AS (SELECT * FROM TableB)
SELECT *, STUFF( (SELECT ',' + REPLACE( fieldb1, ',', ';') FROM tableb_with WHERE tableb_with.TableA_ID = TableA.ID
    FOR XML PATH(''), TYPE).value('.', 'varchar(max)') , 1, 1, '') AS field1WithComma,
STUFF( (SELECT ',' + REPLACE( fieldb2, ',', ';') FROM tableb_with WHERE tableb_with.TableA_ID = TableA.ID
    FOR XML PATH(''), TYPE).value('.', 'varchar(max)') , 1, 1, '') AS field2WithComma,
All the fancy joins and unions go into the WITH clause for TableB, and there are also a lot of joins on TableA. This view holds about 200 fields in total.
The Solr mapping goes like this:
<field column="field1WithComma" name="field1" splitBy=","/>
I hope it helps someone.
Upvotes: 0
Reputation: 8678
I had a similar problem on my end.
Here is how I worked around it.
Note: my Solr setup uses multiple cores.
I have one table that keeps per-core information about Solr: core name, last re-index date, a re-indexing-required flag, and current_status.
I have written a scheduler that checks, against that table, which cores need re-indexing (delta-import) and starts the re-index.
Re-indexing requests are sent every 20 minutes (in your case it's 30).
When I start the re-indexing, I also update the table and mark the status for that specific core as "inprogress".
After ten minutes I fire a request to check whether the re-indexing has completed.
For checking the re-indexing status I use this request:
final URL url = new URL(SOLR_INDEX_SERVER_PROTOCOL, SOLR_INDEX_SERVER_IP, Integer.valueOf(SOLR_INDEX_SERVER_PORT),
"/solr/"+ core_name +"/select?qt=/dataimport&command=status");
I check the reported status for "Committed" or "Idle", then consider the re-indexing complete and mark the core's status as "idle" in the table.
That way the re-indexing scheduler won't pick cores that are in "inprogress" status.
It also considers only those cores that have pending updates (identified by the "re-indexing-required" flag).
Re-indexing is invoked only if re-indexing-required is true and the current status is idle.
If there are pending updates (re-indexing-required is true) but current_status is "inprogress", the scheduler won't pick that core for re-indexing.
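The two checks above can be sketched as plain Java helpers (a sketch with assumed names, not the exact implementation from my project; the XML snippet matches DIH's default status response format):

```java
public class ReindexScheduler {

    // DIH's default XML response reports the importer state as
    // <str name="status">idle</str> (or "busy" while an import is running).
    static boolean isIdle(String dihStatusXml) {
        return dihStatusXml.contains("<str name=\"status\">idle</str>");
    }

    // A core is picked for re-indexing only when it has pending updates
    // and is not already being re-indexed.
    static boolean shouldReindex(boolean reindexingRequired, String currentStatus) {
        return reindexingRequired && "idle".equalsIgnoreCase(currentStatus);
    }

    public static void main(String[] args) {
        System.out.println(isIdle("<str name=\"status\">busy</str>"));   // false
        System.out.println(shouldReindex(true, "idle"));                 // true
        System.out.println(shouldReindex(true, "inprogress"));           // false
    }
}
```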
I hope this helps.
Note: I used the DIH for both indexing and re-indexing.
Upvotes: 0