Reputation: 579
I am working on a big data project in which large amounts of product information (prices, titles, sellers and so on; 30+ data points per item) are gathered from different online sellers.
In general, there are 2 use cases for the project:
I initially decided to use MongoDB so that I can scale horizontally, as the data stored for the project is expected to be in the range of hundreds of GBs and MongoDB can shard it dynamically across many instances.
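For example, sharding could be enabled roughly like this in the mongo shell (just a sketch; the database name 'crawler', the collection name 'products' and the hashed shard key are placeholders, not my actual setup):

// sketch: enable sharding for the (placeholder) database and collection
sh.enableSharding("crawler")
// a hashed shard key spreads writes evenly; the key choice is only an example
sh.shardCollection("crawler.products", { product_id: "hashed" })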
The 30+ data points per product won't be collected at once, but at different times: for example, one crawler collects the prices, and a couple of days later another one collects the product description. However, some data points might overlap, because both crawlers collect e.g. the product title (see the merge sketch after the two example documents). The result could look something like this:
Document 1:
{
    '_id': 1,
    'time': ISODate('2016-05-01'),
    'price': 15.00,
    'title': 'PlayStation4',
    'description': 'Some description'
}
Document 2:
{
    '_id': 1,
    'time': ISODate('2016-05-02'),
    'price': 16.99,
    'title': 'PlayStation4',
    'color': 'black'
}
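Just to illustrate the overlap: if I instead kept a single document per product, each crawler run could be merged into it with an upsert, roughly like this (only a sketch, and not necessarily what I want, since it throws away the history; the product_id field and the collection name are placeholders):

// sketch: merge one crawler's output into a single document per product
// (product_id is an assumed lookup key, not shown in the documents above)
db.products.updateOne(
    { product_id: 1 },
    { $set: {
        price: 16.99,
        title: 'PlayStation4',
        color: 'black',
        last_seen: new Date('2016-05-02')
    } },
    { upsert: true }
)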
Therefore I initially came up with the following idea (Idea 1):
Thus, I was thinking about a different solution (Idea 2):
For example:
{
    '_id': 1,
    'timestamp': ISODate('2016-05-04'),
    'type': 'price',
    'value': 15.00
}
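With that layout, reconstructing the current state of a product means picking the latest document per type, e.g. with an aggregation roughly like this (again only a sketch; it assumes a product_id field linking data points to a product, which the example document above does not have):

// sketch: latest value per data point type for one product
// (product_id is an assumption, not part of the example above)
db.datapoints.aggregate([
    { $match: { product_id: 1 } },
    { $sort: { timestamp: -1 } },
    { $group: {
        _id: '$type',
        value: { $first: '$value' },
        timestamp: { $first: '$timestamp' }
    } }
])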
Therefore, I am struggling to find the right data model and/or database for this project. To sum up, these are the requirements:
I would be very grateful for any ideas (data model / architecture, different database, ...) that might help me advance the project. Thanks a lot in advance!
Upvotes: 0
Views: 213
Reputation: 843
Are the fields / data points already known and specified? I.e., do you have a fixed schema? If so, then you can consider relational databases as well.
DB2 has what it calls temporal tables. In the 'system' form, the DB handles versioning transparently: inserts are automatically timestamped, and whenever you update a row, the previous version is automatically migrated to a history table (keeping its old timestamp). You can then run SQL queries as of any given point in time, and DB2 will return the data as it was at the time (or in the time range) specified.
They also have an 'application' form, in which you specify the time period a row is valid for when you insert it (e.g. if prices are valid for a specific period of time), but the SQL queries still work the same way. What's nice is that either way, all the time complexity is managed by the database and you can write relatively clean SQL queries.
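A rough sketch of the system form (table, column names and the timestamp are just placeholders; check the DB2 docs for the exact syntax):

-- sketch: a system-period temporal table plus its history table
CREATE TABLE product_price (
    product_id INT NOT NULL,
    price      DECIMAL(10,2),
    sys_start  TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
    sys_end    TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
    trans_id   TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
    PERIOD SYSTEM_TIME (sys_start, sys_end)
);
CREATE TABLE product_price_history LIKE product_price;
ALTER TABLE product_price ADD VERSIONING USE HISTORY TABLE product_price_history;

-- query the data as it was at a given point in time
SELECT product_id, price
FROM product_price
    FOR SYSTEM_TIME AS OF TIMESTAMP '2016-05-01 00:00:00'
WHERE product_id = 1;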
You can check out more at their DeveloperWorks site.
I know that other relational DBs like Oracle also have special capabilities for time-series data that manage the versioning/timestamping for you.
As far as space efficiency and scale, I'm not sure as I don't run any databases that big :-)
(OTOH, if you don't have a fixed schema, or you know you'll have multiple schemas for the different data inputs and you can't model it with sparse tables, then a document DB like mongo might be your best bet)
Upvotes: 1