Reputation: 1184
I am working on a language database, and for my work I often need to load 2 entries out of a 10+ GB text file containing over 500k entries. This was manageable when the file was 1 GB, but now a test run takes nearly 4 minutes. The file has a simple 2-field format: the first field (the primary key) is a word (e.g. "apple" or "parking lot"), and the second field is a large text blob. File format:
column 1 | column 2
a dictionary word | a large text blob
I have considerable SQL experience, and I'm aware that I could load the content into sqlite or index it with Lucene, Xapian, etc. I've done all that for other parts of my project, but for now I'd really like to keep my big flat file. So instead of this:
grep term1 bigFlatFileDB.txt
grep term2 bigFlatFileDB.txt
which takes 4+ minutes each time, I'd like to build an index (or an inverted index, if you prefer) once, basically a low-level B-tree, and then search fast. I'd love something like this:
buildIndex bigFlatFileDB.txt bigFlatFileDB.index
alt-grep term1 bigFlatFileDB.txt bigFlatFileDB.index
alt-grep term2 bigFlatFileDB.txt bigFlatFileDB.index
etc.
In the above, "alt-grep" refers to some other method of searching the big file.
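Concretely, I imagine "alt-grep" working something like the rough, untested sketch below, though I don't know how robust it is: record each headword's byte offset in one pass, then seek straight to the matching line. (This assumes a bare "|" delimiter with no padding, GNU awk/coreutils, and the placeholder file names from above.)

# buildIndex: one pass over the big file, writing "headword <TAB> byte offset of its line"
LC_ALL=C awk -F'|' 'BEGIN { off = 0 } { print $1 "\t" off; off += length($0) + 1 }' bigFlatFileDB.txt > bigFlatFileDB.index

# alt-grep: find the offset in the small index, then seek directly into the big file
altgrep() {
    off=$(LC_ALL=C awk -F'\t' -v key="$1" '$1 == key { print $2; exit }' "$3")
    [ -n "$off" ] && tail -c +"$((off + 1))" "$2" | head -n 1
}

altgrep term1 bigFlatFileDB.txt bigFlatFileDB.index
altgrep term2 bigFlatFileDB.txt bigFlatFileDB.index

The index file is tiny next to the 10 GB data file, so even a linear scan of it is near-instant; it could also be sorted once and binary-searched with look(1).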
I've seen ideas for KinoSearch (a Perl/C adaptation of Lucene), Namazu, Swish-e, Flat Fire, and Apache CouchDB, but it's hard to know what's best, or whether these are even plausible. Maybe there's some classic old hack I don't know about: indexing line numbers and using sed to pull the lines I need, etc. (I know I could split my file into, say, 1000 sub-files, but that adds complexity I don't want, and I lose the tie to my main big file.) I can't do a live memory map (mmap), because my computer only has 16 GB of memory and that would still require loading the whole file.
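The line-number flavour of that hack would presumably look much the same as the sketch above, except the seek is replaced by sed reading up to line N; it can at least quit there instead of scanning the whole 10 GB. Another untested sketch, same placeholder names and delimiter assumption:

# index line numbers instead of byte offsets
awk -F'|' '{ print $1 "\t" NR }' bigFlatFileDB.txt > bigFlatFileDB.lineIndex

# pull line N and stop; still a sequential read up to N, so slower than seeking by byte offset
n=$(awk -F'\t' -v key="term1" '$1 == key { print $2; exit }' bigFlatFileDB.lineIndex)
[ -n "$n" ] && sed -n "${n}{p;q;}" bigFlatFileDB.txt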
I've tried to search for ideas on StackOverflow, but many responses are like "use Lucene", or "use X variety of SQL". My situation is simple, and I only have one type of query. I see lots of dead or ignored requests on StackOverflow for inverted indexes.
Any methods or workflows that you've found work well to match a single line/row/entry, based on a unique first field, from a huge text file?
Upvotes: 2
Views: 172
Reputation: 1184
Here's an example of how to accomplish this through sqlite. It only takes a few lines of code. Here, the input file is /path/to/input.txt, and the column delimiter is a tilde "~".
rm -f sample.db   # start from a clean database (-f: no error if it doesn't exist yet)
sqlite3 sample.db << EOF
create table entries(headword TEXT, contents TEXT);
.separator "~"
.import /path/to/input.txt entries
EOF
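One detail worth adding: the import above leaves entries without any index, so every SELECT scans the whole table. That's still quick for 500k rows, but an explicit index on the headword column keeps lookups fast as the file grows (the index name here is arbitrary):

sqlite3 sample.db 'CREATE INDEX IF NOT EXISTS idx_headword ON entries(headword);'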
And get matches:
sqlite3 sample.db "SELECT * FROM entries WHERE headword IN ('#apple', '#banana');"
#banana|....
#apple|....
But still, this keeps the data one step removed from my original file, and I wonder if there's a smart way to do the same without a typical database like sqlite.
Upvotes: 0