Edward Grefenstette

Reputation: 603

Efficiently generating a document index for a large number of small documents in a large file

Goal

I have a very large corpus of the following format:

<entry id=1>
Some text
...
Some more text
</entry>

...

<entry id=k>
Some text
...
Some more text
</entry>

There are tens of millions of entries in this corpus, and even more in the other corpora I want to deal with.

I want to treat each entry as a separate document and have a mapping from words of the corpus to the list of documents they occur in.
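Concretely, the mapping I'm after would look something like the following toy Python example (the entry ids here are made up for illustration):

    # Desired inverted index: each word maps to the ids of the entries it occurs in.
    index = {
        "some": [1, 2, "k"],
        "text": [1, 2, "k"],
        "more": [1, "k"],
    }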

Problem

Ideally, I would just split the file into separate files for each entry and run something like a Lucene indexer over the directory with all the files. However, creating millions and millions of files seems to crash my lab computer.

Question

Is there a relatively simple way of solving this problem? Should I keep all the entries in a single file? If so, how can I track where each entry is in the file for use in an index? Or should I be using something other than separate files for each entry?

If it's relevant, I do most of my coding in Python, but solutions in another language are welcome.

Upvotes: 2

Views: 533

Answers (1)

hymloth

Reputation: 7045

Well, keeping all the entries in a single file is not a good idea. You can process your big file entry by entry using generators, so as to avoid memory issues, and I'd recommend storing each entry in a database. During that process you can dynamically construct all the relevant structures, such as term frequencies, document frequencies, posting lists, etc., which you can also save in the database.
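A minimal sketch of that approach, assuming entries are delimited by the <entry id=...> / </entry> tags from the question; the corpus file name, the whitespace tokenizer, and the SQLite table layout are placeholders you'd adapt:

    import re
    import sqlite3
    from collections import defaultdict

    ENTRY_OPEN = re.compile(r'<entry id=(\w+)>')

    def entries(path):
        """Yield (entry_id, text) pairs one at a time, so the whole
        corpus never has to fit in memory."""
        entry_id, lines = None, []
        with open(path, encoding='utf-8') as f:
            for line in f:
                m = ENTRY_OPEN.match(line)
                if m:
                    entry_id, lines = m.group(1), []
                elif line.startswith('</entry>'):
                    yield entry_id, ' '.join(lines)
                elif entry_id is not None:
                    lines.append(line.strip())

    def build_index(path, db_path='index.db'):
        postings = defaultdict(set)            # word -> set of entry ids
        for entry_id, text in entries(path):
            for word in text.lower().split():  # swap in a real tokenizer
                postings[word].add(entry_id)
        # For a corpus with tens of millions of entries you would flush
        # to the database in batches instead of keeping it all in memory.
        con = sqlite3.connect(db_path)
        con.execute('CREATE TABLE IF NOT EXISTS postings (word TEXT, doc_id TEXT)')
        con.executemany('INSERT INTO postings VALUES (?, ?)',
                        ((w, d) for w, docs in postings.items() for d in docs))
        con.commit()
        con.close()

    if __name__ == '__main__':
        build_index('corpus.txt')              # hypothetical corpus file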

This question might have some useful info.

Also take a look at this to get an idea.

Upvotes: 2
