Reputation: 579
I am working on a big data project in which large amounts of product information (prices, titles, sellers and so on; 30+ data points per item) are gathered from different online sellers.
In general, there are 2 use cases for the project:
I initially decided to use MongoDB so that I can scale horizontally, as the data stored for the project is expected to be in the range of hundreds of GBs and MongoDB can shard it dynamically across many instances.
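For example, sharding could be enabled roughly like this in the mongo shell (just a sketch; the database name 'crawler', the collection name 'products' and the hashed shard key are placeholders, not my actual setup):

// sketch: enable sharding for the (placeholder) database and collection
sh.enableSharding("crawler")
// a hashed shard key spreads writes evenly; the key choice is only an example
sh.shardCollection("crawler.products", { product_id: "hashed" })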
The 30+ data points per product won't be collected at once, but at different times: for example, one crawler collects the prices, and a couple of days later another one collects the product description. However, some data points might overlap, because both crawlers collect e.g. the product title (see the merge sketch after the two example documents). The result could look something like this:
Document 1:
{
    '_id': 1,
    'time': ISODate('2016-05-01'),
    'price': 15.00,
    'title': 'PlayStation4',
    'description': 'Some description'
}
Document 2:
{
    '_id': 1,
    'time': ISODate('2016-05-02'),
    'price': 16.99,
    'title': 'PlayStation4',
    'color': 'black'
}
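Just to illustrate the overlap: if I instead kept a single document per product, each crawler run could be merged into it with an upsert, roughly like this (only a sketch, and not necessarily what I want, since it throws away the history; the product_id field and the collection name are placeholders):

// sketch: merge one crawler's output into a single document per product
// (product_id is an assumed lookup key, not shown in the documents above)
db.products.updateOne(
    { product_id: 1 },
    { $set: {
        price: 16.99,
        title: 'PlayStation4',
        color: 'black',
        last_seen: new Date('2016-05-02')
    } },
    { upsert: true }
)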
Therefore I initially came up with the following idea (Idea 1):
Thus, I was thinking about a different solution (Idea 2):
For example:
{
    '_id': 1,
    'timestamp': ISODate('2016-05-04'),
    'type': 'price',
    'value': 15.00
}
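With that layout, reconstructing the current state of a product means picking the latest document per type, e.g. with an aggregation roughly like this (again only a sketch; it assumes a product_id field linking data points to a product, which the example document above does not have):

// sketch: latest value per data point type for one product
// (product_id is an assumption, not part of the example above)
db.datapoints.aggregate([
    { $match: { product_id: 1 } },
    { $sort: { timestamp: -1 } },
    { $group: {
        _id: '$type',
        value: { $first: '$value' },
        timestamp: { $first: '$timestamp' }
    } }
])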
Therefore, I am struggling to find the right data model and/or database for this project. To sum up, these are the requirements:
I would be very grateful for any ideas (data model / architecture, different database, ...) that might help me advance the project. Thanks a lot in advance!
Upvotes: 0
Views: 213
Reputation: 843
Are the fields / data points already known and specified? I.e., do you have a fixed schema? If so, then you can consider relational databases as well.
DB2 has what it calls temporal tables. In the 'system' form, the DB handles versioning transparently: inserts are automatically timestamped, and whenever you update a row, the previous version is automatically migrated to a history table (keeping its old timestamp). You can then run SQL queries as of any given point in time, and DB2 will return the data as it was at the time (or in the time range) specified.
They also have an 'application' form, in which you specify the time period a row is valid for when you insert it (e.g. if prices are valid for a specific period of time), but the SQL queries still work the same way. What's nice is that either way, all the time complexity is managed by the database and you can write relatively clean SQL queries.
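A rough sketch of the system form (table, column names and the timestamp are just placeholders; check the DB2 docs for the exact syntax):

-- sketch: a system-period temporal table plus its history table
CREATE TABLE product_price (
    product_id INT NOT NULL,
    price      DECIMAL(10,2),
    sys_start  TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
    sys_end    TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
    trans_id   TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
    PERIOD SYSTEM_TIME (sys_start, sys_end)
);
CREATE TABLE product_price_history LIKE product_price;
ALTER TABLE product_price ADD VERSIONING USE HISTORY TABLE product_price_history;

-- query the data as it was at a given point in time
SELECT product_id, price
FROM product_price
    FOR SYSTEM_TIME AS OF TIMESTAMP '2016-05-01 00:00:00'
WHERE product_id = 1;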
You can check out more at their DeveloperWorks site.
I know that other relational DBs like Oracle also have special capabilities for time-series data that manage the versioning/timestamping for you.
As far as space efficiency and scale, I'm not sure as I don't run any databases that big :-)
(OTOH, if you don't have a fixed schema, or you know you'll have multiple schemas for the different data inputs and you can't model it with sparse tables, then a document DB like mongo might be your best bet)
Upvotes: 1