
mysql nosql cassandra

Michel

Reputation: 11

NoSQL vs MySQL - Going schemaless with Cassandra

Here are the facts:

We have a lot (L O T) of data coming in every day.
Each file we receive is in CSV format, and while a couple of headers recur more often than others, there is no real standard.
Normalizing each file for upload into a MySQL database is highly time-consuming and often forces us to change the schema (a new field appears in one file that did not exist before).
While the primary key is unique, anything else can be duplicated.
These are customer records (e.g. email, firstname, lastname, city, state, address, etc.).
We could have multiple emails for the same individual.
We read 70% of the time and write 30% of the time.
Scalability could become a concern but is not one right now; availability, though, is key.
Speed is what we are looking for. MySQL is too slow to answer queries on tables of over 50 million records. Even well optimized, we have too many speed issues. Breaking up the tables has become an organizational concern. Schema-less NoSQL seemed attractive. What would you recommend, and what did you implement? (Please do not answer "optimize MySQL"; that is pointless and off topic.)

Upvotes: 1

Answers (1)

Gates VP

Reputation: 45287

Let's go over the points:

We have a lot (L O T) of data coming in every day.

NoSQL solutions (Riak, MongoDB, Cassandra, etc.) were basically all created to scale to large data volumes.

... headers that recur more often than others, there is not really a standard... The normalization of each file to be uploaded into a MySQL database is highly time-consuming and often pushes us to change the schema

NoSQL definitely fits this model. Many of these databases are "schema-less", so it's easy to store those extra fields. This will, however, cost you extra space, as the field names are typically stored with each document.
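To make the "extra fields" point concrete, here is a minimal sketch of turning CSV rows with differing headers into per-record documents. The feeds, the field names, and the `rows_to_documents` helper are all made up for illustration; only the standard-library `csv` module is used, and no particular database is assumed.

```python
import csv
import io

def rows_to_documents(csv_text):
    """Turn each CSV row into a dict, keeping only fields that have a value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        doc = {}
        for name, value in row.items():
            # DictReader uses a None key for surplus cells and a None value
            # for missing cells; skip both.
            if name is None or value is None:
                continue
            value = value.strip()
            if value:
                # Lower-case header names so "Email" and "email" line up.
                doc[name.strip().lower()] = value
        documents.append(doc)
    return documents

# Two hypothetical feeds whose headers do not match.
feed_a = "email,firstname,lastname\nann@example.com,Ann,Lee\n"
feed_b = "Email,City,State,Twitter\nbob@example.com,Austin,TX,\n"

docs = rows_to_documents(feed_a) + rows_to_documents(feed_b)
```

Each document carries its own field names, which is why no schema change is needed when a new column shows up, and also why the field names take up space in every record.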

While the primary key is unique, anything else can be duplicated

"Document-oriented" and "key-value" databases are a good fit for this as long as the key is provided. If you have to run duplicate checks on other fields, most key-value databases are ill-equipped. "Document-oriented" databases might be slightly better equipped, but not by much.
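As a rough illustration of why duplicate checks are awkward in a plain key-value store, the sketch below maintains a secondary index from email to primary key by hand. The `CustomerStore` class is hypothetical and uses in-memory dicts as a stand-in for the store; a real system would also have to keep the two structures consistent under concurrent writes.

```python
class CustomerStore:
    """In-memory stand-in for a key-value store plus a hand-kept email index."""

    def __init__(self):
        self.records = {}      # primary key -> record
        self.email_index = {}  # normalized email -> primary key

    def put(self, key, record):
        """Store a record; return the key of another record sharing an email, if any."""
        duplicate_of = None
        for email in record.get("emails", []):
            owner = self.email_index.get(email.strip().lower())
            if owner is not None and owner != key:
                duplicate_of = owner
                break
        self.records[key] = record
        for email in record.get("emails", []):
            # The first record to claim an email keeps it in the index.
            self.email_index.setdefault(email.strip().lower(), key)
        return duplicate_of

store = CustomerStore()
first = store.put("c1", {"emails": ["Ann@Example.com"]})
second = store.put("c2", {"emails": ["ann@example.com", "al@example.com"]})
```

Here `second` reports that `c2` shares an email with `c1`. The store itself only knows about primary keys, so every duplicate check on another field needs an index like this one.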

We could have multiple emails for the same individual

Most of these databases have some notion of "arrays as a basic type". CouchDB and MongoDB both store objects as JSON-style documents, so it's easy to see how a customer could have an array of e-mails without the need for a "join table". MongoDB also provides "atomic update" operators like "$addToSet" that play nicely with arrays.
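A small sketch of what that looks like in practice. The update document uses MongoDB's `$addToSet` operator, but `apply_add_to_set` is only a local emulation written for this example so it runs without a server; the customer record and helper names are made up.

```python
def add_email_update(customer_id, email):
    """Build the filter and update documents for a MongoDB-style $addToSet."""
    return {"_id": customer_id}, {"$addToSet": {"emails": email}}

def apply_add_to_set(document, update):
    """Emulate $addToSet locally: append a value only if it is not already present."""
    for field, value in update.get("$addToSet", {}).items():
        values = document.setdefault(field, [])
        if value not in values:
            values.append(value)
    return document

# Hypothetical customer document with an array of e-mails.
customer = {"_id": 42, "firstname": "Ann", "emails": ["ann@example.com"]}

query, update = add_email_update(42, "ann@work.example")
apply_add_to_set(customer, update)
apply_add_to_set(customer, update)  # applying it twice adds nothing new
```

With a driver such as pymongo, the same two documents would typically be passed to `update_one(query, update)` on the collection, and the server would apply the change atomically to the matching document.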

We read 70% of the time and we write 30% of the time. Scalability could be a concern but it is not right now, though availability is key

The major NoSQL databases are all designed to scale both reads and writes.

The only way to get availability is through hardware and locational redundancy (no different than with MySQL or other databases). Despite their low version numbers, many of these databases are being used in production by very big companies, so many of the simple cases are covered. It's still new territory, but we're also past the "randomly crashes when nothing has changed" phase.

Speed is what we are looking for... Schema less noSQL seemed attractive. What would you recommend, what did you implement?

We have hundreds of millions of flexible user records in MongoDB. Performance on individual seeks is really awesome.

However, you have to be wary about the type of queries you're running.

If you need to run queries that bring back several users at once, you're going to have speed issues with basically any of these key-value or document-oriented databases. You may want to look at a graph database or some other fancy solution. However, if your use cases all center around one user at a time, then take a look at MongoDB.

MongoDB also supports native map-reduce, so you'll be able to scale "non-real-time" queries.
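For readers unfamiliar with the pattern, here is a minimal pure-Python sketch of a map-reduce job that counts customers per state. It only shows the shape of the computation; it is not MongoDB's API, and the sample records and function names are invented for the example.

```python
from collections import defaultdict

def map_customer(doc):
    """Map step: emit (state, 1) for every document that has a state field."""
    if "state" in doc:
        yield doc["state"], 1

def reduce_counts(key, values):
    """Reduce step: add up all the counts emitted for one key."""
    return sum(values)

def map_reduce(documents, mapper, reducer):
    """Group the mapper's output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}

# Hypothetical customer documents; one has no state at all.
customers = [
    {"email": "a@example.com", "state": "TX"},
    {"email": "b@example.com", "state": "CA"},
    {"email": "c@example.com"},
    {"email": "d@example.com", "state": "TX"},
]

per_state = map_reduce(customers, map_customer, reduce_counts)
```

Because the map step looks at one document at a time and the reduce step only needs the values for one key, the work can be split across machines, which is what makes it suitable for batch-style queries over the whole data set.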

Upvotes: 3
