Reputation: 552
I have about 100GB of data saved in ~10MB .csv files. How can I optimize lookup speed for several thousand queries to this data? Specifically, I don't know what technologies to consider or how to estimate relative performance.
Each file is unique to a date, and contains data for multiple people, for example:
...
2005-07-03, "Daffy Duck", ...
2005-07-03, "Daffy Duck", ...
2005-07-03, "Mickey Mouse", ...
2005-07-03, "Mickey Mouse", ...
...
I want to pull all of the info corresponding to a given date/name pair, for several thousand date/name pairs. The equivalent SQL query would be: SELECT * FROM myDB WHERE Date='2005-07-03' AND Name='Mickey Mouse'.
Currently I haven't loaded the data into a database. To execute my "queries", I locate the appropriate date file and filter the lines by the name I am looking for. Would I get performance improvements storing the data in a relational database, noSQL database, or in any other way? If so why and by how much?
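Roughly, the current approach looks like this sketch (assume one data/&lt;date&gt;.csv file per date; the layout and naming here are illustrative, not my exact setup):

from pathlib import Path

def lookup(date: str, name: str) -> list[str]:
    # Locate the per-date file and keep only the lines mentioning the name.
    # The data/<date>.csv layout is an assumption for illustration.
    path = Path("data") / f"{date}.csv"
    with open(path) as f:
        return [line.rstrip("\n") for line in f if name in line]

print(lookup("2005-07-03", "Mickey Mouse"))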
Upvotes: 3
Views: 5290
Reputation:
Would I get performance improvements storing the data in a relational database, noSQL database, or in any other way?
Yes (I'd recommend a 'normal' RDBMS).
If so why…
That's one of the things indexes are for.
…and by how much?
Lots.
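For instance, a minimal sqlite3 sketch (the table and column names here are made up, not from the question) shows the planner switching from a full scan to an index search for exactly this kind of query:

import sqlite3

# Illustrative sketch: table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (date TEXT, name TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_date_name ON records (date, name)")

# EXPLAIN QUERY PLAN reports an index search rather than a full table scan.
for row in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM records WHERE date = '2005-07-03' AND name = 'Mickey Mouse'"
):
    print(row)  # e.g. (..., 'SEARCH records USING INDEX idx_date_name (date=? AND name=?)')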
Upvotes: 10
Reputation: 20215
I would definitely use a database, but picking the right one for the problem would require a bit more information, especially about the format of the data. Here are my recommendations, with some details about when I'd choose one over the other:
Relational:
If all of your data fits the same schema (has all the same fields), then relational would make sense. From your question, it sounds like you only need two indexed columns: date and name.
Assuming you have a lot of other data for each entry, an SQL database would make a lot of sense (using something like your query).
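As a rough sketch of what that could look like in Python with SQLite (the file layout, table name, and column split are assumptions, not from the question):

import csv
import glob
import sqlite3

# Sketch only: path pattern, table name, and column layout are assumptions.
conn = sqlite3.connect("lookups.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (date TEXT, name TEXT, payload TEXT)")

for path in glob.glob("data/*.csv"):
    with open(path, newline="") as f:
        reader = csv.reader(f, skipinitialspace=True)
        # Keep date and name as columns; flatten the rest for illustration.
        rows = ((r[0], r[1], ",".join(r[2:])) for r in reader)
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

# One composite index serves the date+name lookups.
conn.execute("CREATE INDEX IF NOT EXISTS idx_date_name ON records (date, name)")
conn.commit()

cur = conn.execute(
    "SELECT * FROM records WHERE date = ? AND name = ?",
    ("2005-07-03", "Mickey Mouse"),
)
print(cur.fetchall())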
NoSQL:
If your data doesn't fit the same schema (a lot of different keys with only a couple shared keys), a document store would make more sense. Since your data is still somewhat relational (every record shares the date and name fields), MongoDB would make a ton of sense.
I would use documents shaped roughly like this:
{
    "name": "Mickey Mouse",
    "date": ...,
    other fields...
}
I would set name and date as indexed fields, just like in the SQL example. MongoDB is fast, and documents don't take up space for keys they don't have.
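A minimal pymongo sketch of this setup (the database and collection names here are placeholders):

from pymongo import ASCENDING, MongoClient

# Sketch only: the database/collection names are made up for illustration.
client = MongoClient("mongodb://localhost:27017")
records = client["toon_data"]["records"]

# Composite index mirroring the (date, name) lookup from the SQL example.
records.create_index([("date", ASCENDING), ("name", ASCENDING)])

for doc in records.find({"date": "2005-07-03", "name": "Mickey Mouse"}):
    print(doc)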
Conclusion:
Both are good approaches, but it really depends on what exactly the data looks like. In general, databases are really good at querying; filesystems aren't, especially as the data gets big.
I would personally go the NoSQL route, but I would really need more information about the dataset and usage patterns. If the data needs to scale, then this is likely the best option.
I'm not really an expert, but I just don't like working with SQL that much. If the data is embarrassingly relational, then SQL makes tons of sense, but it seems that everything you're doing would fit in one or two tables.
Upvotes: 1
Reputation: 52107
Others have already provided some good points, let me just talk a bit about physical database structure...
If you can, pick a DBMS that supports clustering [1] and make a clustered (aka index-organized) table whose PK is {Date, Name, No} [2]. Your SELECT can then be satisfied with a simple index range scan and no heap access at all (the table heap doesn't even exist), so you don't have to worry about a bad clustering factor. The practical performance should be excellent and scale well even to much more data than you currently have.
If your DBMS supports leading-edge index compression, turn it on to eliminate the storage (and cache) cost of repeating values in the B-Tree structure of this composite primary/clustering index.
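For a small-scale but concrete illustration, SQLite also qualifies: its WITHOUT ROWID tables are index-organized. A minimal sketch (table and column names are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")

# WITHOUT ROWID makes the table index-organized: rows live inside the
# primary-key B-tree itself, so there is no separate table heap.
# Table/column names are illustrative; "no" disambiguates duplicate rows.
conn.execute("""
    CREATE TABLE records (
        date TEXT NOT NULL,
        name TEXT NOT NULL,
        no   INTEGER NOT NULL,
        payload TEXT,
        PRIMARY KEY (date, name, no)
    ) WITHOUT ROWID
""")

# This lookup is a single index range scan over the clustering key.
cur = conn.execute(
    "SELECT * FROM records WHERE date = ? AND name = ?",
    ("2005-07-03", "Mickey Mouse"),
)
print(cur.fetchall())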
[1] E.g. Oracle, MS SQL Server, MySQL/InnoDB...
[2] Where No distinguishes between multiple rows on the same Date with the same Name. Alternatively, just make the Date more granular (e.g. precise to a second), modify the query to SELECT * FROM myDB WHERE Name='Mickey Mouse' AND Date >= '2005-07-03' AND Date < '2005-07-04', and reverse the order of the PK fields to {Name, Date} to satisfy the modified query.
Upvotes: 2
Reputation: 89661
I'm going to go out on a devil's advocate limb here and say that you might not get that much better performance with a relational database or any other database "system" for this particular operation relative to the work required to put all this data into a database.
As much as I would recommend loading the data into some kind of database (i.e. a full-blown codified data management system), your files are tiny. From your question, it sounds like you can identify the file needed in constant time and then only have to read and filter (using a regex perhaps?) at most 10MB of data, so why would you need a relational database?
Just identify the file and pipe it through grep and you're done, right? That's pretty efficient.
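In Python, that pipeline could look like this sketch (the data/&lt;date&gt;.csv naming is an assumption for illustration):

import subprocess
from pathlib import Path

def lookup(date: str, name: str) -> list[str]:
    # Locate the per-date file, then let grep do the filtering.
    # grep exits with status 1 when nothing matches, so don't raise on that.
    path = Path("data") / f"{date}.csv"
    result = subprocess.run(
        ["grep", "-F", name, str(path)],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()

print(lookup("2005-07-03", "Mickey Mouse"))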
A relational database with appropriate indexing (on date, name) is only going to make the second step more efficient, and even then the data set is fairly small: a few thousand rows in each 10MB file?
I know keeping everything in the text files sounds like a very rough way to solve the problem, but it keeps things simple. With a database, you would have to manage the parsing, validation, and loading of the data, and then manage the additional storage of the data in database form, etc.
You haven't given any information about how frequently you need to carry out this search, what you do with the data you obtain as a result or any other performance and operational requirements.
If you needed to carry out this particular operation many times a second, wanted the flexibility to address the data in more creative ways, or wanted to perform any kind of analysis across data that currently sits in separate files, then a relational database would immediately present itself as the best option for data management.
Upvotes: 5