anand mishra
anand mishra

Reputation: 125

How to decide which collection will best suited for your requirement

Recently in an interview, I was asked a question -

Let's we have billion of data in a file system [Assuming data is already being fetched by existing code] what my task was to find the Person name using the email id as search criteria with best case complexity. Also, what collection is best suited for this?

public class Person
 {
     public string Name {get;set;}
     public string Email {get;set;}
 }

Upvotes: 0

Views: 122

Answers (3)

Scott Hannen
Scott Hannen

Reputation: 29217

You can't (or really, really shouldn't) plan on searching billions of files to find one with a matching email address. That's like reading every book in the library to find out which ones were written by certain author. What you need (just as they have at the library) is an index. You might have to do all of the work to read and parse all of that content once to build the index, but then when you need a particular file or files you search the index, not the files.

You might read each file and save a record to a database containing elements such as the email address and perhaps other details about that document, and then with that record store a pointer (a path) to the file itself.

That way when you need to perform a search you're executing a SQL query, not scanning billions of files.

I disagree with using a Dictionary. Where is that dictionary going to come from? If you're using an index (like a SQL table or tables) then you'll query that. There would be no reason to query the tables and build a gigantic in-memory dictionary. What if you also wanted the files to be queryable by some other attribute. Then what - create another dictionary?

And the great big hole in that - it assumes that there will be one file for each email address. What if two contain the same email address? Then you'd have duplicate keys.

If you wanted that much data stored in memory for some reason (like extremely fast performance) it still doesn't change the solution. Newer versions of SQL Server load data into memory. But it's still on the SQL server which can handle queries much more efficiently.

The question about which collection to use was prefaced with "also", suggesting it's not central to the problem. That's good, because I don't think it's relevant at all. If the query returns references to more than one document you could return the results in an IEnumerable<T> - the underlying type (List<T>, array, etc.) doesn't matter very much.

Upvotes: 0

paparazzo
paparazzo

Reputation: 45096

Definitely Dictionary with email as the key
It is O(1) lookup by key
And email would hash very nicely

For value you can use Name or Person

There is also KeyedCollection that is O(1) but that would almost be showing off.

Upvotes: 2

pijemcolu
pijemcolu

Reputation: 2605

Dictionary<string,string> would be my answer.

Arguing about the overhead of the class is irelevant. Under the hood, keys are implemented as a hash table. Retrieval by key is approaching O(1) complexity.

In your case the unique key would be the email address and person's name would be value.

Upvotes: 0

Related Questions