Reputation: 4457
I'm scraping a series of products from websites and storing them in a DB. At the moment, I'm using MySQL, with two tables:
products (product_id, site, product_description, etc)
e.g. (1234, "xyz.com", "nice product", etc)
product_history (product_id, scrape_timestamp)
e.g. (1234, 2012-01-02 10:53:09)
(1234, 2012-01-03 11:02:09)
This schema means that I can get a snapshot of which products were listed at any time in the past.
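For reference, the "snapshot" lookup against this schema is just a join over the two tables. A minimal sketch in Python with mysql-connector-python, where the connection details, database name, and example date are placeholders:

```python
# Minimal sketch of the "snapshot" query against the MySQL schema above.
# Connection details and database name are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="scraper",
                               password="secret", database="scraper_db")
cur = conn.cursor()

# Products that were seen (scraped) during a given day.
cur.execute("""
    SELECT DISTINCT p.product_id, p.site, p.product_description
    FROM products p
    JOIN product_history h ON h.product_id = p.product_id
    WHERE h.scrape_timestamp >= %s AND h.scrape_timestamp < %s
""", ("2012-01-02 00:00:00", "2012-01-03 00:00:00"))

for row in cur.fetchall():
    print(row)
```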
I'm considering a move to NoSQL because (i) I'm often getting different fields from different sites and (ii) the data volume will be around 5-10 million products per year. I can see two ways of structuring the schema:
Option 1: "Embedded list"
product_id, site, product_description, scrape_timestamp
1234, "xyz.com", "test product", {2012-01-02 10:53:09, 2012-01-03 11:02:09}
Option 2: "Secondary index"
keep both tables as above with the relational schema
I think that option 1 would work in something like MongoDB, where secondary indexes are allowed, whereas option 2 would work in either MongoDB or DynamoDB. (I'm considering these two because I want something hosted in the cloud with zero maintenance overhead for me.)
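Sketching Option 2 the same way: two collections mirroring the two tables, with secondary indexes on the history collection (names are again only placeholders):

```python
# Rough sketch of Option 2: keep the two "tables" as separate collections
# and add secondary indexes on the history collection. Names are placeholders.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["scraper"]
db.products.insert_one(
    {"_id": 1234, "site": "xyz.com", "product_description": "nice product"})
db.product_history.insert_one(
    {"product_id": 1234, "scrape_timestamp": datetime(2012, 1, 3, 11, 2, 9)})

# Secondary indexes so snapshot queries by time (or lookups by product) are cheap.
db.product_history.create_index([("scrape_timestamp", 1)])
db.product_history.create_index([("product_id", 1)])
```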
Am I on the right track with these approaches? Which would work best? I'm entirely unfamiliar with NoSQL, so any advice would be appreciated.
Upvotes: 2
Views: 2190
Reputation: 9036
You say you're trying to solve the following data storage problems:
10 million or so historical data points per year
As was indicated in the comments, 10 million data points per year is not a big data set. Most databases, including MySQL, will be able to support this, as long as you know how to tune them, so don't switch technologies just to solve this problem. If you have specific questions about how to tune MySQL, then I suggest you post a separate question.
Regarding a MongoDB-based solution to your historical data problem, you'd probably have a collection for the product_history data, and you just have to decide how many entries to put in each document. Some options: one document per product; one per product/timestamp; one per product/timestamp within a specific time period, e.g. one day or one month. A few more details:
{ "0" : [ 123, 456 ], "1" : [ 789 ] }
If you have a ton of data that doesn't fit into memory on one server, and you need to be able to read historical data really quickly, then you may get some benefit from MongoDB's support for sharding. Basically you'll be able to split your data up such that a read doesn't have to hit disk. But it's unclear from your question whether read performance matters and what types of reads you do, so please post a new question if you want more info.
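If sharding did become necessary, enabling it for the history collection looks roughly like this, assuming a sharded cluster reached through mongos; the hashed product_id shard key is just one plausible choice, not a recommendation:

```python
# Rough sketch of enabling sharding on the history collection, assuming a
# sharded cluster reachable through mongos. Names and shard key are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")
client.admin.command("enableSharding", "scraper")
client.admin.command("shardCollection", "scraper.product_history",
                     key={"product_id": "hashed"})
```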
heterogeneous data points
Regarding an RDBMS solution to your problem, here are the most common approaches that I've seen:
(Note that there are other RDBMS-based solutions out there. For example, some people like to put XML documents into DB2.)
You're trying the last approach, so it's no wonder that you're unhappy! If you're going to stick with an RDBMS, then I'd recommend switching to one of the other approaches that I listed.
Now, back to your question about NoSQL. A document-based NoSQL database such as MongoDB is an attractive solution to your "heterogeneous data points" problem because it's schema-less. You can list a product's spec in one document, and when you want to add new attributes then, well, you just do so. MongoDB does allow you to index the attributes, so if you want to find a product with attribute A, then you can do so quickly via an index. (There's no magic here, though: as with any database, indexes can be expensive to maintain, so create them sparingly.)
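A small pymongo illustration of that point, with invented attribute names: documents in the same collection can carry different fields, and an index on any one attribute speeds up queries that filter on it:

```python
# Documents in one collection can carry different fields; an index on an
# optional attribute lets queries on it use the index. Field names invented.
from pymongo import MongoClient

products = MongoClient()["scraper"]["products"]

products.insert_many([
    {"_id": 1, "site": "xyz.com", "colour": "red", "weight_kg": 1.2},
    {"_id": 2, "site": "abc.com", "voltage": 230, "warranty_years": 2},
])

# Index one of the optional attributes and query on it.
products.create_index("voltage")
print(list(products.find({"voltage": 230})))
```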
Upvotes: 2