Reputation: 4457
I'm scraping a series of products from websites and storing them in a DB. At the moment, I'm using MySQL, with two tables:
products (product_id, site, product_description, etc)
e.g. (1234, "xyz.com", "nice product", etc)
product_history (product_id, scrape_timestamp)
e.g. (1234, 2012-01-02 10:53:09)
(1234, 2012-01-03 11:02:09)
This schema means that I can get a snapshot of which products were listed at any time in the past.
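For reference, the "snapshot" lookup against this schema is just a join over the two tables. A minimal sketch in Python with mysql-connector-python, where the connection details, database name, and example date are placeholders:

```python
# Minimal sketch of the "snapshot" query against the MySQL schema above.
# Connection details and database name are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="scraper",
                               password="secret", database="scraper_db")
cur = conn.cursor()

# Products that were seen (scraped) during a given day.
cur.execute("""
    SELECT DISTINCT p.product_id, p.site, p.product_description
    FROM products p
    JOIN product_history h ON h.product_id = p.product_id
    WHERE h.scrape_timestamp >= %s AND h.scrape_timestamp < %s
""", ("2012-01-02 00:00:00", "2012-01-03 00:00:00"))

for row in cur.fetchall():
    print(row)
```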
I'm considering a move to NoSQL because (i) I'm often getting different fields from different sites and (ii) the data volume will be around 5-10 million products per year. I can see two ways of structuring the schema:
Option 1: "Embedded list"
product_id, site, product_description, scrape_timestamp
1234, "xyz.com", "test product", {2012-01-02 10:53:09, 2012-01-03 11:02:09}
Option 2: "Secondary index"
keep both tables as above with the relational schema
I think that option 1 would work in something like MongoDB, where secondary indexes are allowed, whereas option 2 would work in either MongoDB or DynamoDB. (I'm considering these two because I want something hosted in the cloud with zero maintenance overhead for me.)
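Sketching Option 2 the same way: two collections mirroring the two tables, with secondary indexes on the history collection (names are again only placeholders):

```python
# Rough sketch of Option 2: keep the two "tables" as separate collections
# and add secondary indexes on the history collection. Names are placeholders.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["scraper"]
db.products.insert_one(
    {"_id": 1234, "site": "xyz.com", "product_description": "nice product"})
db.product_history.insert_one(
    {"product_id": 1234, "scrape_timestamp": datetime(2012, 1, 3, 11, 2, 9)})

# Secondary indexes so snapshot queries by time (or lookups by product) are cheap.
db.product_history.create_index([("scrape_timestamp", 1)])
db.product_history.create_index([("product_id", 1)])
```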
Am I on the right track with these approaches? Which would work best? I'm entirely unfamiliar with NoSQL, so any advice would be appreciated.
Upvotes: 2
Views: 2190
Reputation: 9036
You say you're trying to solve the following data storage problems:
10 million or so historical data points per year
As was indicated in the comments, 10 million data points per year is not a big data set. Most databases, including MySQL, will be able to support this, as long as you know how to tune them, so don't switch technologies just to solve this problem. If you have specific questions about how to tune MySQL, then I suggest you post a separate question.
Regarding a MongoDB-based solution to your historical data problem, you'd probably have a collection for the product_history data, and you just have to decide how many entries to put in each document. Some options: one document per product; one per product/timestamp; one per product/timestamp within a specific time period, e.g. one day or one month. A few more details:
{ "0" : [ 123, 456 ], "1" : [ 789 ] }
If you have a ton of data that doesn't fit into memory on one server, and you need to be able to read historical data really quickly, then you may get some benefit from MongoDB's support for sharding. Basically you'll be able to split your data up such that a read doesn't have to hit disk. But it's unclear from your question whether read performance matters and what types of reads you do, so please post a new question if you want more info.
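If sharding did become necessary, enabling it for the history collection looks roughly like this, assuming a sharded cluster reached through mongos; the hashed product_id shard key is just one plausible choice, not a recommendation:

```python
# Rough sketch of enabling sharding on the history collection, assuming a
# sharded cluster reachable through mongos. Names and shard key are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")
client.admin.command("enableSharding", "scraper")
client.admin.command("shardCollection", "scraper.product_history",
                     key={"product_id": "hashed"})
```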
heterogeneous data points
Regarding an RDBMS solution to your problem, here are the most common approaches that I've seen:
(Note that there are other RDBMS-based solutions out there. For example, some people like to put XML documents into DB2.)
You're trying the last approach, so it's no wonder that you're unhappy! If you're going to stick with an RDBMS, then I'd recommend switching to one of the other approaches that I listed.
Now, back to your question about NoSQL. A document-based NoSQL database such as MongoDB is an attractive solution to your "heterogeneous data points" problem because it's schema-less. You can list a product's spec in one document, and when you want to add new attributes then, well, you just do so. MongoDB does allow you to index the attributes, so if you want to find a product with attribute A, then you can do so quickly via an index. (There's no magic here, though: as with any database, indexes can be expensive to maintain, so create them sparingly.)
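A small pymongo illustration of that point, with invented attribute names: documents in the same collection can carry different fields, and an index on any one attribute speeds up queries that filter on it:

```python
# Documents in one collection can carry different fields; an index on an
# optional attribute lets queries on it use the index. Field names invented.
from pymongo import MongoClient

products = MongoClient()["scraper"]["products"]

products.insert_many([
    {"_id": 1, "site": "xyz.com", "colour": "red", "weight_kg": 1.2},
    {"_id": 2, "site": "abc.com", "voltage": 230, "warranty_years": 2},
])

# Index one of the optional attributes and query on it.
products.create_index("voltage")
print(list(products.find({"voltage": 230})))
```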
Upvotes: 2