George

Reputation: 15571

How can I index a lot of txt files? (Java/C/C++)

I need to index a lot of text. The search results must give me the names of the files containing the query and all of the positions where the query matched in each file, so that I don't have to load the whole file to find the matching portion. What libraries can you recommend for doing this?

Update: Lucene has been suggested. Can you give me some info on how I should use Lucene to achieve this? (I have seen examples where the search query returned only the matching files.)

Upvotes: 4

Views: 3382

Answers (8)

Fabian Steeg

Reputation: 45754

I'm aware you asked for a library; I just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).
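
To make the concept concrete, here is a minimal sketch of a positional inverted index in plain Java. It is only an illustration, not what any particular library does: the class name `PositionalIndex`, the `\w+` tokenizer, and the choice to store character offsets are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class: term -> (file name -> start offsets of the term).
public class PositionalIndex {
    private static final Pattern WORD = Pattern.compile("\\w+");

    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    /** Tokenizes the text and records the character offset where each word starts. */
    public void add(String fileName, String text) {
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(fileName, f -> new ArrayList<>())
                    .add(m.start());
        }
    }

    /** Returns file name -> offsets for a single-word query (empty map if the word is absent). */
    public Map<String, List<Integer>> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.add("a.txt", "the quick brown fox");
        index.add("b.txt", "The fox and the hound");
        System.out.println(index.search("fox")); // {a.txt=[16], b.txt=[4]}
        System.out.println(index.search("the")); // {a.txt=[0], b.txt=[0, 12]}
    }
}
```

Because the offsets are stored with each posting, a query returns both the matching files and the positions inside them without rereading the files.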

Upvotes: 0

Nemanja Trifunovic

Reputation: 24559

Also take a look at Lemur Toolkit.

Upvotes: 2

dance2die

Reputation: 36985

Lucene - Java

It's open source as well, so you are free to use and deploy it in your application.

As far as I know, the Eclipse IDE help system is powered by Lucene, so it has been tested by millions.

Upvotes: 2

Paul Whelan

Reputation: 16799

Have a look at Compass (http://www.compass-project.org/). It can be seen as a wrapper on top of Lucene: it simplifies common Lucene usage patterns such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub-indexes). Compass also has built-in optimizations for concurrent commits and merges.

The overview has more information: http://www.compass-project.org/overview.html

I integrated this into a Spring project in no time. It is really easy to use and gives your users what they will see as Google-like results.

Upvotes: 2

Yuval F

Reputation: 20621

I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store word position information in the index in order to get the snippets you are looking for. The Token API may help.

Upvotes: 2

Benoît

Reputation: 17014

Why don't you try to construct a state machine by reading all the files? Transitions between states will be letters, and states will be either final (some files contain the word spelled out so far, in which case the list of those files is available there) or intermediate.

As for multiple-word lookups, you'll have to look up each word independently and then intersect the results.

I believe the Boost::Statechart library may be of some help for that matter.
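
As a rough illustration of this idea, here is a sketch in Java rather than with Boost::Statechart. It models the state machine as a trie; the class name `WordTrie`, its methods, and the `\W+` tokenizer are hypothetical choices for this example, not part of any library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical example class: each node is a state, each edge is a letter.
public class WordTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        // Non-empty only for final states, i.e. states that complete a word.
        final Set<String> files = new TreeSet<>();
    }

    private final Node root = new Node();

    /** Follows (and creates) one transition per letter, then marks the final state. */
    public void add(String word, String fileName) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.files.add(fileName);
    }

    /** Splits the text into lowercase words and adds each of them for this file. */
    public void addFile(String fileName, String text) {
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) add(w, fileName);
        }
    }

    /** Returns a copy of the file set at the state reached by the word (empty if none). */
    public Set<String> lookup(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.get(c);
            if (node == null) return new TreeSet<>();
        }
        return new TreeSet<>(node.files);
    }

    /** Multi-word lookup: look up each word independently, then intersect the results. */
    public Set<String> lookupAll(String... words) {
        if (words.length == 0) return new TreeSet<>();
        Set<String> result = lookup(words[0]);
        for (int i = 1; i < words.length; i++) {
            result.retainAll(lookup(words[i]));
        }
        return result;
    }
}
```

This variant only answers which files contain a word. To report match positions as well, each final state would need to hold positions per file instead of a plain set of file names.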

Upvotes: 0

dirkgently

Reputation: 111316

It all depends on how you are going to access it and, of course, on how many users are going to access it. Read up on MapReduce.

If you are going to roll your own, you will need to create an index file, which is essentially a map from unique words to tuples like (file, line, offset). Of course, you can also consider in-memory data structures such as a trie (prefix tree), a Judy array, and the like.
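
A minimal in-memory sketch of such a word-to-(file, line, offset) map might look like the following. The names `LineIndex` and `Hit` are made up for this example, "offset" is interpreted here as the 0-based column within the line, and persisting the map to an index file is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class: maps each unique word to (file, line, offset) tuples.
public class LineIndex {
    /** One occurrence of a word: file name, 1-based line number, 0-based column. */
    public static final class Hit {
        public final String file;
        public final int line;
        public final int offset;

        Hit(String file, int line, int offset) {
            this.file = file;
            this.line = line;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return file + ":" + line + ":" + offset;
        }
    }

    private static final Pattern WORD = Pattern.compile("\\w+");

    private final Map<String, List<Hit>> index = new HashMap<>();

    /** Reads the file line by line (UTF-8) and records a Hit for every word. */
    public void addFile(Path path) throws IOException {
        String name = path.getFileName().toString();
        List<String> lines = Files.readAllLines(path);
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = WORD.matcher(lines.get(i));
            while (m.find()) {
                String word = m.group().toLowerCase(Locale.ROOT);
                index.computeIfAbsent(word, k -> new ArrayList<>())
                     .add(new Hit(name, i + 1, m.start()));
            }
        }
    }

    /** Returns all recorded occurrences of the word, in the order they were indexed. */
    public List<Hit> find(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyList());
    }
}
```

With the line and column in hand, a caller can jump straight to the matching portion of a file instead of scanning it again.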

Some 3rd party solutions are listed here.

Upvotes: 2

Jared

Reputation: 39913

For Java, try Lucene.

Upvotes: 8
