vjain27
vjain27

Reputation: 3674

creating a simple index on a text file in java

I need to implement a simple indexing scheme for a big text file. The text file contains key value pairs and I need to read back a specific key value pair without loading the complete file in memory. The text file is huge and contains millions of entries and the keys are not sorted. Different key-value pairs need to be read depending on user-input. So I don't want the complete file to be read every time. Please let me know the exact classes and methods in java file handling api that would help to implement this in a simple and efficient way.I want to do this without using an external library such as lucene.

Upvotes: 3

Views: 19561

Answers (3)

darw
darw

Reputation: 1057

Let's say we have file:

Rust ➡️ if it works, fix it anyway
Ruby ➡️ easy to understand, especially after the first five years
JavaScript ➡️ car is to carpet as Java is JavaScript
C ➡️ I don't care what you think, I'm faster than you

In-memory

Scan file once to build an index:

String pattern = "(.*) ➡️ (.*)";
var file = Files.newByteChannel(Paths.get("file"));

Map<String, Integer> index = new Scanner(file)
        .findAll(pattern)

and store it in RAM:

        .collect(toMap(record -> record.group(1), MatchResult::start));

then, if the user enters JavaScript, look it up in the index:

int offset = index.get("JavaScript");
new Scanner(file.position(offset))
        .findAll(pattern)
        .map(MatchResult::group)
        .findFirst()
        .ifPresent(System.out::println);

― will print:

JavaScript ➡️ I promise to call you back

Persistent

We need a data structure optimized for disk. Luckily, it is already implemented for us by the file system. Given a file path, the file system finds the file on a disk.

Hence, if we encode our keys in terms of file paths and our offsets in terms of files, we can delegate the job to the file system.

Just store the index on a disk instead of in RAM:

        .forEach(record -> Files.writeString(Paths.get(record.group(1)),
                                        String.valueOf(record.start()))); // try catch

and adjust the lookup accordingly:

int offset = Integer.parseInt(Files.readString(Paths.get("JavaScript")));

Upvotes: 0

DJClayworth
DJClayworth

Reputation: 26856

As the comments pointed out, you're going to need to do a linear search of the entire file in worst case, and half of it on average. But fortunately there are some tricks you can do.

If the file doesn't change much, then create a copy of the file in which the entries are sorted. Ideally make records in the copy the same length, so that you can go straight to the Nth entry in the sorted file.

If you don't have the disk space for that, then create an index file, which has all the keys in the original file as key and the offset into the original file as the value. Again used fixed length records. Or better, make this index file a database. Or load the original file into a database. In either case, disk storage is very cheap.

EDIT: To create the index file, open the main file using RandomAccessFile and read it sequentially. Use the 'getFilePointer()' method at the start of each entry to read the position in the file, and store that plus the key in the index file. When looking up something read the file pointer from the index file and use the 'seek(long)' method to jump to the point in the original file.

Upvotes: 5

TMN
TMN

Reputation: 3070

I'd recommend building an index file. Scan the input file and write every key and its offset into a List, then sort the list and write it to the index file. Then, whenever you want to look up a key, you read in the index file and do a binary search on the list. Once you find the key you need, open the data file as a RandomAccessFile and seek to the position of the key. Then you can read the key and the value.

Upvotes: 4

Related Questions