Optimal on-disk data structure for searching a file?

I've spent a couple of hours reading posts that were related to the question in a bid to try and come up with a solution, but I wasn't really successful in coming up with one.

So here goes: I was once asked in an interview which data structure I would use to search a if a particular word existed in a file. The file is also supposedly big enough to not be able to fit in the memory and the interviewer was really looking for an on-disk solution.

Is the B-Tree an on-disk data structure?

A Binary Search Tree is an in-memory data structure isn't it?

Upvotes: 2

Answers (3)

Anon.

Reputation: 59983

There are really two different possible questions here:

Given a massive file, and a word, how do you check if the word exists in the file?
Given a massive file, how do you build an index so that you can efficiently check if an arbitrary word exists in the file?

The first problem is efficiently solved with Boyer-Moore and a linear search through the file. If you're only searching once, building an index is a complete waste of time.

Regarding the second problem, it sounds like the interviewer is really pushing B-Trees.

Upvotes: 4

corsiKa

Reputation: 82579

You want to use a data structure that maps one node to one page of disk space. This will minimize disk activity.

Because a B-Tree is often used for this. See http://en.wikipedia.org/wiki/B-tree, specifically the section "Time to search a sorted file".

Upvotes: 2

Aryabhatta

Reputation:

Both are just data-structures and can be both on-disk or in-memory. It depends on how you choose to use them.

btw, B-Trees were motivated by a need to have on-disk structures. Binary search trees are just a special case of B-trees, in one way.

Upvotes: 2

Optimal on-disk data structure for searching a file?

Answers (3)

Related Questions