Reputation: 3923
Currently working on a POC to migrate from an RDBMS and evaluating MongoDB. I am in the process of designing the schema and would like to make sure it meets our needs.
Some background:
- We process roughly 100,000 files per day
- About 10 different file types (which would go into different collections)
- 100+ million records per day
- Files are pipe-delimited text
- Will use Python to load the data
- Each file contains between 100 and 250 fields
Current setup: each file type gets loaded into its own table (partitioned by day), e.g. file type A goes into table FileA (partition 20120701), file type B into FileB (partition 20120701), etc.
I have included a trimmed sample of the source data (fields and records removed), as well as 4 different options for structuring the data. My current design would be:
- A collection per day for each file type (detailed records), e.g. FileA_20120701, FileB_20120701
- A collection per month for hourly summarized data, covering all file types, e.g. Hourly_usage_201207
- A collection per year for daily summarized data, covering all file types, e.g. Daily_usage_2012
So: map-reduce for the hourly summaries, then map-reduce for the daily summaries.
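Since the plan is to load with Python, here is a minimal, hedged sketch of what the loader could look like for the per-day detail collections (e.g. FileA_20120701). It assumes pymongo, header-less pipe-delimited files, and a FIELDS list matching the file layout; load_file and the "usage" database name are made up for illustration.

from pymongo import MongoClient

# Trimmed field list; the real files carry 100-250 fields, in file order.
FIELDS = ["SERIAL", "TIMESTAMP", "CUSTOMER_ID", "RESERVED01", "PRODUCT_ID"]  # ...

def load_file(path, file_type, day, batch_size=1000):
    """Load one pipe-delimited file into its per-day collection, e.g. FileA_20120701."""
    client = MongoClient()
    coll = client["usage"][f"{file_type}_{day}"]
    batch = []
    with open(path) as fh:
        for line in fh:
            values = line.rstrip("\n").split("|")
            batch.append(dict(zip(FIELDS, values)))
            if len(batch) >= batch_size:
                coll.insert_many(batch)
                batch = []
    if batch:
        coll.insert_many(batch)

load_file("FileA_20120701.txt", "FileA", "20120701")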
Right now I am focusing on how to store the detailed records; please take a look at the raw data provided below:
SERIAL TIMESTAMP CUSTOMER_ID RESERVED01 PRODUCT_ID CUSTOMER_TYPE CUSTOMER_STATE ChargeAmount_OF_UNITS ChargeAmount_OF_FUND ChargeAmount_FROM_ACCOUNT1 ACCOUNT1_BALANCE ChargeAmount_FROM_ACCOUNT2 ACCOUNT2_BALANCE WalletType1 UnitType1 ChargeAmount1 WalletBalance1 WalletType2 UnitType2 ChargeAmount2 WalletBalance2 WalletType3 UnitType3 ChargeAmount3 WalletBalance3 Bonus1 Bonus2 Bonus3 AddtionaInfo
379120186 20120701235122 1345567 0 555 0 1000000 0 0 5 664 0 0 200 1 5 664 0 0 0 0 0 0 0 0 1234
379120190 20120701235124 1345568 0 1 0 1000000 0 0 4 108 0 0 200 1 4 108 0 0 0 0 0 0 0 0
379120197 20120701235132 1345569 0 4 0 1000000 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
379120203 20120701235136 1345570 0 1 0 1000000 0 0 0 346 0 0 0 0 0 0 0 0 0 0 0 0 0 0 XXX
379120207 20120701235138 1345571 0 10 0 1000000 0 0 0 0 3 100 200 1 3 100 500 2 10 0 610 2 20 80 10
379120208 20120701235138 1345571 0 1 0 1000000 0 0 3 306 0 0 200 1 3 306 0 0 0 0 0 0 0 0
379120211 20120701235141 1345573 0 1 0 1000000 0 0 0 181 0 0 0 0 0 0 0 0 0 0 0 0 0 0
379120213 20120701235143 3101015742206 0 349 1 1000001 0 0 0 274 0 0 0 0 0 0 0 0 0 0 0 0 0 0
379120214 20120701235144 3101015742206 0 349 1 1000001 0 0 0 0 120 680 210 1 37 0 200 18 120 680 0 0 0 0
379120215 20120701235147 3101015742206 0 349 1 1000001 0 0 0 992 0 0 0 0 0 0 0 0 0 0 0 0 0 0
379120217 20120701235147 3101015742206 0 349 1 1000001 0 2 0 1 0 0 400 3 2 1766 0 0 0 0 0 0 0 0
379120223 20120701235149 3101015742206 0 349 1 1000001 0 0 11 196 0 0 200 1 11 196 0 0 0 0 0 0 0 0
379120229 20120701235153 1345579 0 349 3 1000000 0 0 40 707 0 0 200 1 40 707 0 0 0 0 0 0 0 0 20 5 XXX
379120230 20120701235153 3101015742206 0 349 1 1000001 0 0 9 1702 0 0 200 1 9 1702 0 0 0 0 0 0 0 0
379120232 20120701235153 1345581 0 349 2 1000000 0 0 150 59 0 0 200 1 150 59 0 0 0 0 0 0 0 0
379120237 20120701235158 1345582 0 1 2 1000000 0 0 3 303 0 0 200 1 3 303 0 0 0 0 0 0 0 0
379120241 20120701235202 538552582 0 14 0 1000000 0 0 0 779 10 777 210 1 150 200 0 0 0 0 0 0 0 0 YYY
379120245 20120701235206 538552582 0 14 0 1000000 0 0 3 300 0 0 200 1 3 300 0 0 0 0 0 0 0 0
379120248 20120701235206 538552582 0 14 0 1000000 0 0 155 202 0 0 200 1 155 202 0 0 0 0 0 0 0 0
379120250 20120701235208 538552582 0 14 0 1000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
data1.json structure: just loads the data from the file without any transformation (no removal of nulls, etc.). This would not be ideal, but it has the quickest processing time.
[
{
'SERIAL': 379120186,
'TIMESTAMP': 20120701235122,
'CUSTOMER_ID': 1345567,
'RESERVED01': 0,
'PRODUCT_ID': 555,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 5,
'ACCOUNT1_BALANCE': 664,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'WalletType1': 200,
'UnitType1': 1,
'ChargeAmount1': 5,
'WalletBalance1': 664,
'WalletType2': 0,
'UnitType2': 0,
'ChargeAmount2': 0,
'WalletBalance2': 0,
'WalletType3': 0,
'UnitType3': 0,
'ChargeAmount3': 0,
'WalletBalance3': 0,
'Bonus1': '',
'Bonus2': '',
'Bonus3': '',
'AddtionaInfo': '1234'
},
{
'SERIAL': 379120203,
'TIMESTAMP': 20120701235136,
'CUSTOMER_ID': 1345570,
'RESERVED01': 0,
'PRODUCT_ID': 1,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 346,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'WalletType1': 0,
'UnitType1': 0,
'ChargeAmount1': 0,
'WalletBalance1': 0,
'WalletType2': 0,
'UnitType2': 0,
'ChargeAmount2': 0,
'WalletBalance2': 0,
'WalletType3': 0,
'UnitType3': 0,
'ChargeAmount3': 0,
'WalletBalance3': 0,
'Bonus1': '',
'Bonus2': '',
'Bonus3': '',
'AddtionaInfo': 'XXX'
},
{
'SERIAL': 379120207,
'TIMESTAMP': 20120701235138,
'CUSTOMER_ID': 1345571,
'RESERVED01': 0,
'PRODUCT_ID': 10,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 0,
'ChargeAmount_FROM_ACCOUNT2': 3,
'ACCOUNT2_BALANCE': 100,
'WalletType1': 200,
'UnitType1': 1,
'ChargeAmount1': 3,
'WalletBalance1': 100,
'WalletType2': 500,
'UnitType2': 2,
'ChargeAmount2': 10,
'WalletBalance2': 0,
'WalletType3': 610,
'UnitType3': 2,
'ChargeAmount3': 20,
'WalletBalance3': 80,
'Bonus1': 10,
'Bonus2': '',
'Bonus3': '',
'AddtionaInfo': ''
}
]
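As a rough illustration, documents in the data1 shape could be produced with something like the following. This is a hedged sketch: it assumes each raw line is pipe-delimited, that fields holds the column names in file order, and the numeric conversion is only a guess at the intended typing.

def parse_data1(line, fields):
    """Turn one raw pipe-delimited line into a data1-style document, keeping every field."""
    values = line.rstrip("\n").split("|")
    doc = {}
    for name, raw in zip(fields, values):
        # Numeric-looking values become ints; blanks and text ('XXX') stay as strings.
        doc[name] = int(raw) if raw.isdigit() else raw
    return doc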
data2.json structure: removes empty values. Charging fields (WalletType1-3) where no charging occurred (wallet ID = 0) are removed; only the wallets that were affected are kept.
[
{
'SERIAL': 379120186,
'TIMESTAMP': 20120701235122,
'CUSTOMER_ID': 1345567,
'PRODUCT_ID': 555,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 5,
'ACCOUNT1_BALANCE': 664,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'WalletType1': 200,
'UnitType1': 1,
'ChargeAmount1': 5,
'WalletBalance1': 664,
'AddtionaInfo': '1234'
},
{
'SERIAL': 379120203,
'TIMESTAMP': 20120701235136,
'CUSTOMER_ID': 1345570,
'RESERVED01': 0,
'PRODUCT_ID': 1,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 346,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'AddtionaInfo': 'XXX'
},
{
'SERIAL': 379120207,
'TIMESTAMP': 20120701235138,
'CUSTOMER_ID': 1345571,
'RESERVED01': 0,
'PRODUCT_ID': 10,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 0,
'ChargeAmount_FROM_ACCOUNT2': 3,
'ACCOUNT2_BALANCE': 100,
'WalletType1': 200,
'UnitType1': 1,
'ChargeAmount1': 3,
'WalletBalance1': 100,
'WalletType2': 500,
'UnitType2': 2,
'ChargeAmount2': 10,
'WalletBalance2': 0,
'WalletType3': 610,
'UnitType3': 2,
'ChargeAmount3': 20,
'WalletBalance3': 80,
'Bonus1': 10,
}
]
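A hedged sketch of the data2-style cleanup, assuming the document already has the data1 shape with numeric fields parsed as ints (the helper name to_data2 and the max_wallets default are made up):

def to_data2(doc, max_wallets=20):
    """Drop empty strings, and drop wallet groups where no charging occurred (WalletTypeN == 0)."""
    out = {k: v for k, v in doc.items() if v != ''}
    for n in range(1, max_wallets + 1):
        if out.get(f"WalletType{n}", 0) == 0:
            for prefix in ("WalletType", "UnitType", "ChargeAmount", "WalletBalance"):
                out.pop(f"{prefix}{n}", None)
    return out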
data3.json structure: removes empty strings. Charging values are inserted as a list under the key 'CHARGING'; the wallet type is used as the key, and the related info for that wallet type is nested one level down. Only wallets that were affected are inserted.
[
{
'SERIAL': 379120186,
'TIMESTAMP': 20120701235122,
'CUSTOMER_ID': 1345567,
'PRODUCT_ID': 555,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 5,
'ACCOUNT1_BALANCE': 664,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'CHARGING':
[
{
'200':
{
'UnitType': 1,
'ChargeAmount': 5,
'WalletBalance': 664
}
}
] ,
'AddtionaInfo': '1234'
},
{
'SERIAL': 379120203,
'TIMESTAMP': 20120701235136,
'CUSTOMER_ID': 1345570,
'RESERVED01': 0,
'PRODUCT_ID': 1,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 346,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'CHARGING' : [],
'AddtionaInfo': 'XXX'
},
{
'SERIAL': 379120207,
'TIMESTAMP': 20120701235138,
'CUSTOMER_ID': 1345571,
'RESERVED01': 0,
'PRODUCT_ID': 10,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 0,
'ChargeAmount_FROM_ACCOUNT2': 3,
'ACCOUNT2_BALANCE': 100,
'CHARGING':
[
{
'200':
{
'UnitType': 1,
'ChargeAmount': 3,
'WalletBalance': 100,
}
},
{
'500':
{
'UnitType': 2,
'ChargeAmount': 10,
'WalletBalance': 0,
}
},
{
'610':
{
'UnitType': 2,
'ChargeAmount': 20,
'WalletBalance': 80
}
}
],
'Bonus1': 10,
}
]
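For comparison, a hedged sketch of how the data3 nesting could be built from a data1-style document (again assuming ints for the numeric fields; to_data3 is a made-up helper name):

def to_data3(doc, max_wallets=20):
    """Move affected wallets into CHARGING as one-key dicts keyed by the wallet ID string."""
    out = {k: v for k, v in doc.items() if v != ''}
    charging = []
    for n in range(1, max_wallets + 1):
        wallet = out.pop(f"WalletType{n}", 0)
        unit = out.pop(f"UnitType{n}", None)
        amount = out.pop(f"ChargeAmount{n}", None)
        balance = out.pop(f"WalletBalance{n}", None)
        if wallet:  # wallet ID 0 means this slot was not charged
            charging.append({str(wallet): {
                "UnitType": unit,
                "ChargeAmount": amount,
                "WalletBalance": balance,
            }})
    out["CHARGING"] = charging
    return out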
data4.json structure: removes empty strings. Charging values are again put into a 'CHARGING' list, but the subdocuments stay flat and the wallet ID is stored under the key 'WalletType'; once again only the wallets that were affected are kept. The order of the list gives the order in which they were charged.
[
{
'SERIAL': 379120186,
'TIMESTAMP': 20120701235122,
'CUSTOMER_ID': 1345567,
'PRODUCT_ID': 555,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 5,
'ACCOUNT1_BALANCE': 664,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'CHARGING':
[
{
'WalletType': '200',
'UnitType': 1,
'ChargeAmount': 5,
'WalletBalance': 664
}
] ,
'AddtionaInfo': '1234'
},
{
'SERIAL': 379120203,
'TIMESTAMP': 20120701235136,
'CUSTOMER_ID': 1345570,
'RESERVED01': 0,
'PRODUCT_ID': 1,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 346,
'ChargeAmount_FROM_ACCOUNT2': 0,
'ACCOUNT2_BALANCE': 0,
'CHARGING' : [],
'AddtionaInfo': 'XXX'
},
{
'SERIAL': 379120207,
'TIMESTAMP': 20120701235138,
'CUSTOMER_ID': 1345571,
'RESERVED01': 0,
'PRODUCT_ID': 10,
'CUSTOMER_TYPE': 0,
'CUSTOMER_STATE': 1000000,
'ChargeAmount_OF_UNITS': 0,
'ChargeAmount_OF_FUND': 0,
'ChargeAmount_FROM_ACCOUNT1': 0,
'ACCOUNT1_BALANCE': 0,
'ChargeAmount_FROM_ACCOUNT2': 3,
'ACCOUNT2_BALANCE': 100,
'CHARGING':
[
{
'WalletType': '200',
'UnitType': 1,
'ChargeAmount': 3,
'WalletBalance': 100,
},
{
'WalletType': '500',
'UnitType': 2,
'ChargeAmount': 10,
'WalletBalance': 0,
},
{
'WalletType': '610',
'UnitType': 2,
'ChargeAmount': 20,
'WalletBalance': 80
}
],
'Bonus1': 10,
}
]
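And a hedged sketch of the data4 shape, which differs from data3 only in keeping the wallet subdocuments flat with an explicit WalletType field (to_data4 is a made-up helper name):

def to_data4(doc, max_wallets=20):
    """Move affected wallets into CHARGING as flat subdocuments; list order = charge order."""
    out = {k: v for k, v in doc.items() if v != ''}
    charging = []
    for n in range(1, max_wallets + 1):
        wallet = out.pop(f"WalletType{n}", 0)
        unit = out.pop(f"UnitType{n}", None)
        amount = out.pop(f"ChargeAmount{n}", None)
        balance = out.pop(f"WalletBalance{n}", None)
        if wallet:
            charging.append({
                "WalletType": wallet,
                "UnitType": unit,
                "ChargeAmount": amount,
                "WalletBalance": balance,
            })
    out["CHARGING"] = charging
    return out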
Note: When we summarize the data, we would like to see something similar to:
Date | Hour | Wallet 200 Total Charge | Wallet 500 Total Charge | Wallet 610 Total Charge
I have trimmed the sample, but the actual file has WalletType1-20.
The number in the field name WalletType(1-20) refers to the order in which the wallets were charged (the wallet IDs are the field values), i.e. WalletType1 = 200 means wallet 200 was charged first, etc.
The order in which they were charged is relevant when viewing the detailed records (not for the summaries).
Any of the wallet IDs can appear in any of the wallet fields, i.e. wallet ID 200 can be populated in WalletType1 or WalletType2, etc.; this depends on priority rules.
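To make the summary requirement concrete, here is a hedged sketch of an hourly roll-up over a data4-style detail collection using the aggregation framework (as an alternative to map-reduce). It assumes TIMESTAMP is stored as the string YYYYMMDDHHMMSS shown in the files (if stored as a number it would need converting first); the database and collection names are placeholders.

from pymongo import MongoClient

db = MongoClient()["usage"]  # placeholder database name

# Group charges by day, hour and wallet ID across the unwound CHARGING entries.
pipeline = [
    {"$unwind": "$CHARGING"},
    {"$group": {
        "_id": {
            "date": {"$substr": ["$TIMESTAMP", 0, 8]},   # YYYYMMDD
            "hour": {"$substr": ["$TIMESTAMP", 8, 2]},   # HH
            "wallet": "$CHARGING.WalletType",
        },
        "total_charge": {"$sum": "$CHARGING.ChargeAmount"},
    }},
    {"$sort": {"_id.date": 1, "_id.hour": 1, "_id.wallet": 1}},
]
hourly = list(db["FileA_20120701"].aggregate(pipeline))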
Questions:
Which of the document structures would you recommend, and why?
What would be the pitfalls of the structures provided?
Do you have any further suggestions / other possible structures?
Any sharding/partitioning recommendations?
Upvotes: 2
Views: 763
Reputation: 369
Sharding would probably not be necessary to process +-100,000 files per day.
Upvotes: 1
Reputation: 24007
The final schema (data4.json) looks the most compact and queryable to me.
Avoid long key names -- key names are stored in each document, and their storage size can add up. If the SERIAL field is unique, you can set _id to that value instead of storing SERIAL separately.
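A hedged illustration of both suggestions, assuming a data4-shaped document and that SERIAL is unique within a detail collection; the short key names below are invented for the example, not a prescribed mapping.

def compact(doc):
    """Shorten repeated key names and reuse the unique SERIAL as _id."""
    return {
        "_id": doc["SERIAL"],          # unique SERIAL doubles as the document id
        "ts": doc["TIMESTAMP"],
        "cust": doc["CUSTOMER_ID"],
        "prod": doc["PRODUCT_ID"],
        "chg": [
            {"w": c["WalletType"], "u": c["UnitType"],
             "a": c["ChargeAmount"], "b": c["WalletBalance"]}
            for c in doc.get("CHARGING", [])
        ],
    }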
Upvotes: 2