Reputation: 2232
Here is an example in which I try to index a large collection with Whoosh:
from whoosh.fields import Schema, TEXT, ID, KEYWORD
from whoosh.index import create_in
from whoosh.writing import BufferedWriter
from multiprocessing import Pool

schema = Schema(name=TEXT(stored=True), m=ID(stored=True), content=KEYWORD(stored=True))
ix = create_in("indexdir", schema)

jobs = []
writer = BufferedWriter(ix, period=15, limit=512, writerargs={"limitmb": 512})
for item in cursor:
    if len(jobs) < 1024:
        jobs.append(item)
    else:
        p = Pool(8)
        p.map(create_barrel, jobs)
        p.close()
        p.join()
        jobs = []
        writer.commit()
The create_barrel function ultimately does the following:
writer.add_document(name = name, m = item['_id'], content = " ".join(some_processed_data))
Yet after a few hours of running, the index is empty and the only file in indexdir is the lock file _MAIN_0.toc.
The code above kind of works when I switch to AsyncWriter, but for some reason AsyncWriter misses around 90% of the commits, and the standard writer is too slow for me.
Why does BufferedWriter miss commits?
Upvotes: 2
Views: 424
Reputation: 491
The code looks a little problematic for cases where the cursor iterator does not yield an exact multiple of 1024 items.
When the for-loop exits, the jobs list will still hold fewer than 1024 unprocessed items. Do you handle this remainder after the loop?
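To illustrate the remainder issue: a batching loop has to flush whatever is left over once the iterator is exhausted, otherwise the tail of the cursor is silently dropped. A minimal sketch of the pattern (a generic generator, not the asker's exact code; the Pool/create_barrel step would consume each yielded batch):

```python
def batched_consume(cursor, batch_size=1024):
    """Yield full batches from an iterable, then the final partial batch."""
    jobs = []
    for item in cursor:
        jobs.append(item)
        if len(jobs) >= batch_size:
            yield jobs
            jobs = []
    if jobs:           # remainder: fewer than batch_size items left
        yield jobs     # without this, the last < batch_size items are lost

# every item reaches the processing step exactly once
total = sum(len(batch) for batch in batched_consume(range(2500), 1024))
```

Here 2500 items produce batches of 1024, 1024, and 452; dropping the final `if jobs` flush would lose those last 452 documents.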
Besides that: which Whoosh version are you using?
Did you try the latest 2.4.x branch and the default branch code from the repo?
Upvotes: 1