Moonwalker

Reputation: 2232

Python-Whoosh BufferedWriter does not commit to the disk

Here is an example in which I try to index a large collection with Whoosh:

from whoosh.fields import Schema, TEXT, ID, KEYWORD
from whoosh.index import create_in
from whoosh.writing import BufferedWriter
from multiprocessing import Pool

schema = Schema(name=TEXT(stored=True), m=ID(stored=True), content=KEYWORD(stored=True))
ix = create_in("indexdir", schema)
jobs = []

writer = BufferedWriter(ix, period=15, limit=512, writerargs={"limitmb": 512})
for item in cursor:
    if len(jobs) < 1024:
        jobs.append(item)
    else:
        p = Pool(8)
        p.map(create_barrel, jobs)
        p.close()
        p.join()
        jobs = []
        writer.commit()

At its end, the create_barrel function does the following:

writer.add_document(name=name, m=item['_id'], content=" ".join(some_processed_data))
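
For context, here is a hypothetical sketch of how such a create_barrel function might be laid out; the module-level writer, the item fields, and the processing helper are assumptions for illustration, not code from the question:

def create_barrel(item):
    # Runs inside a Pool worker process; `writer` is assumed to be the
    # module-level BufferedWriter created above.
    some_processed_data = process(item)  # hypothetical processing helper
    name = item['name']                  # hypothetical field lookup
    writer.add_document(name=name, m=item['_id'],
                        content=" ".join(some_processed_data))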

Yet after a few hours of running, the index is empty and the only file in indexdir is the lock file _MAIN_0.toc.

The code above more or less works when I switch to AsyncWriter, but for some reason AsyncWriter misses around 90% of the commits, and the standard writer is too slow for me.

Why does BufferedWriter miss commits?
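
For reference, the usual single-process BufferedWriter pattern keeps the add_document calls in the same process that owns the writer. This is only a minimal sketch under the assumption that cursor yields dicts with 'name', '_id', and a pre-tokenised field; those names are hypothetical and do not come from the question:

from whoosh.fields import Schema, TEXT, ID, KEYWORD
from whoosh.index import create_in
from whoosh.writing import BufferedWriter

schema = Schema(name=TEXT(stored=True), m=ID(stored=True), content=KEYWORD(stored=True))
ix = create_in("indexdir", schema)

# Buffers documents in memory and flushes every `period` seconds or after
# `limit` buffered documents, whichever comes first.
writer = BufferedWriter(ix, period=15, limit=512)
try:
    for item in cursor:  # cursor and the item fields are hypothetical
        writer.add_document(name=item['name'],
                            m=item['_id'],
                            content=" ".join(item['tokens']))
finally:
    writer.close()  # commits any remaining buffered documents and stops the timer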

Upvotes: 2

Views: 424

Answers (1)

Thomas Waldmann

Reputation: 491

The code looks a little problematic for cases where the cursor iterator does not yield an exact multiple of 1024 items.

At the end you will then be left with fewer than 1024 items in the jobs list when the for-loop exits. Do you handle this remainder after the for-loop?
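
To illustrate, here is a minimal sketch of such a final flush, reusing the loop and the names (cursor, jobs, create_barrel, writer, Pool) from the question; the exact handling of the remainder is an assumption, not the poster's code:

for item in cursor:
    if len(jobs) < 1024:
        jobs.append(item)
    else:
        p = Pool(8)
        p.map(create_barrel, jobs)
        p.close()
        p.join()
        jobs = []
        writer.commit()

# Flush whatever is left over when cursor does not yield an exact
# multiple of 1024 items.
if jobs:
    p = Pool(8)
    p.map(create_barrel, jobs)
    p.close()
    p.join()
    writer.commit()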

Besides that: which Whoosh version are you using?

Did you try the latest 2.4.x branch and the default branch code from the repository?

Upvotes: 1
