Reputation: 3923
Currently working on a POC to migrate from an RDBMS and evaluating MongoDB. I am in the process of designing the schema and would like to make sure it meets our needs.
Some background:
- We process roughly 100,000 files per day
- About 10 different file types (which would go into different collections)
- 100+ million records per day
- Files are pipe-delimited text
- Will use Python to load the data
- Each file contains between 100 and 250 fields
Current setup: each file type gets loaded into its own table (partitioned by day), e.g. file type A goes into table FileA (partition 20120701), file type B into FileB (partition 20120701), etc.
I have included a trimmed sample of the source data (fields and records removed), as well as 4 different options for structuring the data. My current design would be:
- A collection per day for each file type (detailed records), e.g. FileA_20120701, FileB_20120701
- A collection per month for hourly summarized data, covering all file types, e.g. Hourly_usage_201207
- A collection per year for daily summarized data, covering all file types, e.g. Daily_usage_2012
So: map-reduce for the hourly summaries, then map-reduce for the daily summaries.
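Since the plan is to load with Python, here is a minimal, hedged sketch of what the loader could look like for the per-day detail collections (e.g. FileA_20120701). It assumes pymongo, header-less pipe-delimited files, and a FIELDS list matching the file layout; load_file and the "usage" database name are made up for illustration.

from pymongo import MongoClient

# Trimmed field list; the real files carry 100-250 fields, in file order.
FIELDS = ["SERIAL", "TIMESTAMP", "CUSTOMER_ID", "RESERVED01", "PRODUCT_ID"]  # ...

def load_file(path, file_type, day, batch_size=1000):
    """Load one pipe-delimited file into its per-day collection, e.g. FileA_20120701."""
    client = MongoClient()
    coll = client["usage"][f"{file_type}_{day}"]
    batch = []
    with open(path) as fh:
        for line in fh:
            values = line.rstrip("\n").split("|")
            batch.append(dict(zip(FIELDS, values)))
            if len(batch) >= batch_size:
                coll.insert_many(batch)
                batch = []
    if batch:
        coll.insert_many(batch)

load_file("FileA_20120701.txt", "FileA", "20120701")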
Right now I am focusing on how to store the detailed records; please take a look at the raw data provided below:
SERIAL TIMESTAMP CUSTOMER_ID RESERVED01 PRODUCT_ID CUSTOMER_TYPE CUSTOMER_STATE ChargeAmount_OF_UNITS ChargeAmount_OF_FUND ChargeAmount_FROM_ACCOUNT1 ACCOUNT1_BALANCE ChargeAmount_FROM_ACCOUNT2 ACCOUNT2_BALANCE WalletType1 UnitType1 ChargeAmount1 WalletBalance1 WalletType2 UnitType2 ChargeAmount2 WalletBalance2 WalletType3 UnitType3 ChargeAmount3 WalletBalance3 Bonus1 Bonus2 Bonus3 AddtionaInfo
379120186 20120701235122 1345567 0 555 0 1000000 0 0 5 664 0 0 200 1 5 664 0 0 0 0 0 0 0 0 1234
379120190 20120701235124 1345568 0 1 0 1000000 0 0 4 108 0 0 200 1 4 108 0 0 0 0 0 0 0 0
379120197 20120701235132 1345569 0 4 0 1000000 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
379120203 20120701235136 1345570 0 1 0 1000000 0 0 0 346 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XXX
379120207 20120701235138 1345571 0 10 0 1000000 0 0 0 0 3 100 200 1 3 100 500 2 10 0 610 2 20 80 10
379120208 20120701235138 1345571 0 1 0 1000000 0 0 3 306 0 0 200 1 3 306 0 0 0 0 0 0 0 0
379120211 20120701235141 1345573 0 1 0 1000000 0 0 0 181 0 0 0 0 0 0 0 0 0 0 0 0 0 0
379120213 20120701235143 3101015742206 0 349 1 1000001 0 0 0 274 0 0 0 0 0 0 0 0 0 0 0 0 0 0
379120214 20120701235144 3101015742206 0 349 1 1000001 0 0 0 0 120 680 210 1 37 0 200 18 120 680 0 0 0 0
379120215 20120701235147 3101015742206 0 349 1 1000001 0 0 0 992 0 0 0 0 0 0 0 0 0 0 0 0 0 0
379120217 20120701235147 3101015742206 0 349 1 1000001 0 2 0 1 0 0 400 3 2 1766 0 0 0 0 0 0 0 0
379120223 20120701235149 3101015742206 0 349 1 1000001 0 0 11 196 0 0 200 1 11 196 0 0 0 0 0 0 0 0
379120229 20120701235153 1345579 0 349 3 1000000 0 0 40 707 0 0 200 1 40 707 0 0 0 0 0 0 0 0 20 5 XXX
379120230 20120701235153 3101015742206 0 349 1 1000001 0 0 9 1702 0 0 200 1 9 1702 0 0 0 0 0 0 0 0
379120232 20120701235153 1345581 0 349 2 1000000 0 0 150 59 0 0 200 1 150 59 0 0 0 0 0 0 0 0
379120237 20120701235158 1345582 0 1 2 1000000 0 0 3 303 0 0 200 1 3 303 0 0 0 0 0 0 0 0
379120241 20120701235202 538552582 0 14 0 1000000 0 0 0 779 10 777 210 1 150 200 0 0 0 0 0 0 0 0 YYY
379120245 20120701235206 538552582 0 14 0 1000000 0 0 3 300 0 0 200 1 3 300 0 0 0 0 0 0 0 0
379120248 20120701235206 538552582 0 14 0 1000000 0 0 155 202 0 0 200 1 155 202 0 0 0 0 0 0 0 0
379120250 20120701235208 538552582 0 14 0 1000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
data1.json structure: just loads the data from the file without any transformation (no removal of nulls, etc.). This would not be ideal, but it has the quickest processing time.
[
{
'SERIAL': 379120186,
'TIMESTAMP': 20120701235122,
'CUSTOMER_ID': 1345567,
'RESERVED01': 0,
'PRODUCT_ID': 555,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 5,
'ACCOUNT1_BALANCE': 664,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'WalletType1': 200,
'UnitType1': 1,
'ChargeAmount1': 5,
'WalletBalance1': 664,
'WalletType2': 0,
'UnitType2': 0,
'ChargeAmount2': 0,
'WalletBalance2': 0,
'WalletType3': 0,
'UnitType3': 0,
'ChargeAmount3': 0,
'WalletBalance3': 0,
'Bonus1': '',
'Bonus2': '',
'Bonus3': '',
'AddtionaInfo': '1234'
},
{
'SERIAL': 379120203,
'TIMESTAMP': 20120701235136,
'CUSTOMER_ID': 1345570,
'RESERVED01': 0,
'PRODUCT_ID': 1,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 346,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'WalletType1': 0,
'UnitType1': 0,
'ChargeAmount1': 0,
'WalletBalance1': 0,
'WalletType2': 0,
'UnitType2': 0,
'ChargeAmount2': 0,
'WalletBalance2': 0,
'WalletType3': 0,
'UnitType3': 0,
'ChargeAmount3': 0,
'WalletBalance3': 0,
'Bonus1': '',
'Bonus2': '',
'Bonus3': '',
'AddtionaInfo': 'XXX'
},
{
'SERIAL': 379120207,
'TIMESTAMP': 20120701235138,
'CUSTOMER_ID': 1345571,
'RESERVED01': 0,
'PRODUCT_ID': 10,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 0,
'ChargeAmount_FROM_ACCOUNT2': 3,
'ACCOUNT2_BALANCE': 100,
'WalletType1': 200,
'UnitType1': 1,
'ChargeAmount1': 3,
'WalletBalance1': 100,
'WalletType2': 500,
'UnitType2': 2,
'ChargeAmount2': 10,
'WalletBalance2': 0,
'WalletType3': 610,
'UnitType3': 2,
'ChargeAmount3': 20,
'WalletBalance3': 80,
'Bonus1': 10,
'Bonus2': '',
'Bonus3': '',
'AddtionaInfo': ''
}
]
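As a rough illustration, documents in the data1 shape could be produced with something like the following. This is a hedged sketch: it assumes each raw line is pipe-delimited, that fields holds the column names in file order, and the numeric conversion is only a guess at the intended typing.

def parse_data1(line, fields):
    """Turn one raw pipe-delimited line into a data1-style document, keeping every field."""
    values = line.rstrip("\n").split("|")
    doc = {}
    for name, raw in zip(fields, values):
        # Numeric-looking values become ints; blanks and text ('XXX') stay as strings.
        doc[name] = int(raw) if raw.isdigit() else raw
    return doc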
data2.json structure: removes empty values. Charging fields (WalletType1-3) where no charging occurred (wallet ID = 0) are removed; only the wallets that were affected are kept.
[
{
'SERIAL': 379120186,
'TIMESTAMP': 20120701235122,
'CUSTOMER_ID': 1345567,
'PRODUCT_ID': 555,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 5,
'ACCOUNT1_BALANCE': 664,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'WalletType1': 200,
'UnitType1': 1,
'ChargeAmount1': 5,
'WalletBalance1': 664,
'AddtionaInfo': '1234'
},
{
'SERIAL': 379120203,
'TIMESTAMP': 20120701235136,
'CUSTOMER_ID': 1345570,
'RESERVED01': 0,
'PRODUCT_ID': 1,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 346,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'AddtionaInfo': 'XXX'
},
{
'SERIAL': 379120207,
'TIMESTAMP': 20120701235138,
'CUSTOMER_ID': 1345571,
'RESERVED01': 0,
'PRODUCT_ID': 10,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 0,
'ChargeAmount_FROM_ACCOUNT2': 3,
'ACCOUNT2_BALANCE': 100,
'WalletType1': 200,
'UnitType1': 1,
'ChargeAmount1': 3,
'WalletBalance1': 100,
'WalletType2': 500,
'UnitType2': 2,
'ChargeAmount2': 10,
'WalletBalance2': 0,
'WalletType3': 610,
'UnitType3': 2,
'ChargeAmount3': 20,
'WalletBalance3': 80,
'Bonus1': 10,
}
]
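A hedged sketch of the data2-style cleanup, assuming the document already has the data1 shape with numeric fields parsed as ints (the helper name to_data2 and the max_wallets default are made up):

def to_data2(doc, max_wallets=20):
    """Drop empty strings, and drop wallet groups where no charging occurred (WalletTypeN == 0)."""
    out = {k: v for k, v in doc.items() if v != ''}
    for n in range(1, max_wallets + 1):
        if out.get(f"WalletType{n}", 0) == 0:
            for prefix in ("WalletType", "UnitType", "ChargeAmount", "WalletBalance"):
                out.pop(f"{prefix}{n}", None)
    return out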
data3.json structure: removes empty strings. Charging values are inserted as a list under the key 'CHARGING'; the wallet type is used as the key, and the related info for that wallet type is nested one level down. Only wallets that were affected are inserted.
[
{
'SERIAL': 379120186,
'TIMESTAMP': 20120701235122,
'CUSTOMER_ID': 1345567,
'PRODUCT_ID': 555,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 5,
'ACCOUNT1_BALANCE': 664,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'CHARGING':
[
{
'200':
{
'UnitType': 1,
'ChargeAmount': 5,
'WalletBalance': 664
}
}
] ,
'AddtionaInfo': '1234'
},
{
'SERIAL': 379120203,
'TIMESTAMP': 20120701235136,
'CUSTOMER_ID': 1345570,
'RESERVED01': 0,
'PRODUCT_ID': 1,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 346,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'CHARGING' : [],
'AddtionaInfo': 'XXX'
},
{
'SERIAL': 379120207,
'TIMESTAMP': 20120701235138,
'CUSTOMER_ID': 1345571,
'RESERVED01': 0,
'PRODUCT_ID': 10,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 0,
'ChargeAmount_FROM_ACCOUNT2': 3,
'ACCOUNT2_BALANCE': 100,
'CHARGING':
[
{
'200':
{
'UnitType': 1,
'ChargeAmount': 3,
'WalletBalance': 100,
}
},
{
'500':
{
'UnitType': 2,
'ChargeAmount': 10,
'WalletBalance': 0,
}
},
{
'610':
{
'UnitType': 2,
'ChargeAmount': 20,
'WalletBalance': 80
}
}
],
'Bonus1': 10,
}
]
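For comparison, a hedged sketch of how the data3 nesting could be built from a data1-style document (again assuming ints for the numeric fields; to_data3 is a made-up helper name):

def to_data3(doc, max_wallets=20):
    """Move affected wallets into CHARGING as one-key dicts keyed by the wallet ID string."""
    out = {k: v for k, v in doc.items() if v != ''}
    charging = []
    for n in range(1, max_wallets + 1):
        wallet = out.pop(f"WalletType{n}", 0)
        unit = out.pop(f"UnitType{n}", None)
        amount = out.pop(f"ChargeAmount{n}", None)
        balance = out.pop(f"WalletBalance{n}", None)
        if wallet:  # wallet ID 0 means this slot was not charged
            charging.append({str(wallet): {
                "UnitType": unit,
                "ChargeAmount": amount,
                "WalletBalance": balance,
            }})
    out["CHARGING"] = charging
    return out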
data4.json structure: removes empty strings. Charging values are again put into a 'CHARGING' list, but the subdocuments stay flat and the wallet ID is stored under the key 'WalletType'; once again only the wallets that were affected are kept. The order of the list gives the order in which they were charged.
[
{
'SERIAL': 379120186,
'TIMESTAMP': 20120701235122,
'CUSTOMER_ID': 1345567,
'PRODUCT_ID': 555,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 5,
'ACCOUNT1_BALANCE': 664,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'CHARGING':
[
{
'WalletType': '200',
'UnitType': 1,
'ChargeAmount': 5,
'WalletBalance': 664
}
] ,
'AddtionaInfo': '1234'
},
{
'SERIAL': 379120203,
'TIMESTAMP': 20120701235136,
'CUSTOMER_ID': 1345570,
'RESERVED01': 0,
'PRODUCT_ID': 1,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 346,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'CHARGING' : [],
'AddtionaInfo': 'XXX'
},
{
'SERIAL': 379120207,
'TIMESTAMP': 20120701235138,
'CUSTOMER_ID': 1345571,
'RESERVED01': 0,
'PRODUCT_ID': 10,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 0,
'ChargeAmount_FROM_ACCOUNT2': 3,
'ACCOUNT2_BALANCE': 100,
'CHARGING':
[
{
'WalletType': '200',
'UnitType': 1,
'ChargeAmount': 3,
'WalletBalance': 100,
},
{
'WalletType': '500',
'UnitType': 2,
'ChargeAmount': 10,
'WalletBalance': 0,
},
{
'WalletType': '610',
'UnitType': 2,
'ChargeAmount': 20,
'WalletBalance': 80
}
],
'Bonus1': 10,
}
]
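And a hedged sketch of the data4 shape, which differs from data3 only in keeping the wallet subdocuments flat with an explicit WalletType field (to_data4 is a made-up helper name):

def to_data4(doc, max_wallets=20):
    """Move affected wallets into CHARGING as flat subdocuments; list order = charge order."""
    out = {k: v for k, v in doc.items() if v != ''}
    charging = []
    for n in range(1, max_wallets + 1):
        wallet = out.pop(f"WalletType{n}", 0)
        unit = out.pop(f"UnitType{n}", None)
        amount = out.pop(f"ChargeAmount{n}", None)
        balance = out.pop(f"WalletBalance{n}", None)
        if wallet:
            charging.append({
                "WalletType": wallet,
                "UnitType": unit,
                "ChargeAmount": amount,
                "WalletBalance": balance,
            })
    out["CHARGING"] = charging
    return out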
Note: When we summarize the data, we would like to see something similar to:
Date | Hour | Wallet 200 Total Charge | Wallet 500 Total Charge | Wallet 610 Total Charge
I have trimmed the sample, but the actual file has WalletType1-20.
The number in the field name WalletType(1-20) refers to the order in which the wallets were charged (the wallet IDs are the field values), i.e. WalletType1 = 200 means wallet 200 was charged first, etc.
The order in which they were charged is relevant when viewing the detailed records (not for the summaries).
Any of the wallet IDs can appear in any of the wallet fields, i.e. wallet ID 200 can be populated in WalletType1 or WalletType2, etc.; this depends on priority rules.
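To make the summary requirement concrete, here is a hedged sketch of an hourly roll-up over a data4-style detail collection using the aggregation framework (as an alternative to map-reduce). It assumes TIMESTAMP is stored as the string YYYYMMDDHHMMSS shown in the files (if stored as a number it would need converting first); the database and collection names are placeholders.

from pymongo import MongoClient

db = MongoClient()["usage"]  # placeholder database name

# Group charges by day, hour and wallet ID across the unwound CHARGING entries.
pipeline = [
    {"$unwind": "$CHARGING"},
    {"$group": {
        "_id": {
            "date": {"$substr": ["$TIMESTAMP", 0, 8]},   # YYYYMMDD
            "hour": {"$substr": ["$TIMESTAMP", 8, 2]},   # HH
            "wallet": "$CHARGING.WalletType",
        },
        "total_charge": {"$sum": "$CHARGING.ChargeAmount"},
    }},
    {"$sort": {"_id.date": 1, "_id.hour": 1, "_id.wallet": 1}},
]
hourly = list(db["FileA_20120701"].aggregate(pipeline))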
Questions:
Which of the document structures would you recommend, and why?
What would be the pitfalls of the structures provided?
Do you have any further suggestions / other possible structures?
Any sharding/partitioning recommendations?
Upvotes: 2
Views: 763
Reputation: 369
Sharding would probably not be necessary to process +-100,000 files per day.
Upvotes: 1
Reputation: 24007
The final schema (data4.json) looks the most compact and queryable to me.
Avoid long key names -- key names are stored in each document, and their storage size can add up. If the SERIAL field is unique, you can set _id to that value instead of storing SERIAL separately.
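A hedged illustration of both suggestions, assuming a data4-shaped document and that SERIAL is unique within a detail collection; the short key names below are invented for the example, not a prescribed mapping.

def compact(doc):
    """Shorten repeated key names and reuse the unique SERIAL as _id."""
    return {
        "_id": doc["SERIAL"],          # unique SERIAL doubles as the document id
        "ts": doc["TIMESTAMP"],
        "cust": doc["CUSTOMER_ID"],
        "prod": doc["PRODUCT_ID"],
        "chg": [
            {"w": c["WalletType"], "u": c["UnitType"],
             "a": c["ChargeAmount"], "b": c["WalletBalance"]}
            for c in doc.get("CHARGING", [])
        ],
    }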
Upvotes: 2