Reputation: 125
Recently in an interview, I was asked a question -
Let's we have billion of data in a file system [Assuming data is already being fetched by existing code] what my task was to find the Person name using the email id as search criteria with best case complexity. Also, what collection is best suited for this?
public class Person
{
public string Name {get;set;}
public string Email {get;set;}
}
Upvotes: 0
Views: 122
Reputation: 29217
You can't (or really, really shouldn't) plan on searching billions of files to find one with a matching email address. That's like reading every book in the library to find out which ones were written by certain author. What you need (just as they have at the library) is an index. You might have to do all of the work to read and parse all of that content once to build the index, but then when you need a particular file or files you search the index, not the files.
You might read each file and save a record to a database containing elements such as the email address and perhaps other details about that document, and then with that record store a pointer (a path) to the file itself.
That way when you need to perform a search you're executing a SQL query, not scanning billions of files.
I disagree with using a Dictionary
. Where is that dictionary going to come from? If you're using an index (like a SQL table or tables) then you'll query that. There would be no reason to query the tables and build a gigantic in-memory dictionary. What if you also wanted the files to be queryable by some other attribute. Then what - create another dictionary?
And the great big hole in that - it assumes that there will be one file for each email address. What if two contain the same email address? Then you'd have duplicate keys.
If you wanted that much data stored in memory for some reason (like extremely fast performance) it still doesn't change the solution. Newer versions of SQL Server load data into memory. But it's still on the SQL server which can handle queries much more efficiently.
The question about which collection to use was prefaced with "also", suggesting it's not central to the problem. That's good, because I don't think it's relevant at all. If the query returns references to more than one document you could return the results in an IEnumerable<T>
- the underlying type (List<T>
, array, etc.) doesn't matter very much.
Upvotes: 0
Reputation: 45096
Definitely Dictionary with email as the key
It is O(1) lookup by key
And email would hash very nicely
For value you can use Name or Person
There is also KeyedCollection that is O(1) but that would almost be showing off.
Upvotes: 2
Reputation: 2605
Dictionary<string,string>
would be my answer.
Arguing about the overhead of the class is irelevant. Under the hood, keys are implemented as a hash table. Retrieval by key is approaching O(1) complexity.
In your case the unique key would be the email address and person's name would be value.
Upvotes: 0