Reputation: 11
I need some guidance on setting up a data lake. The data from our source looks like this:
{
  "version": "3.0",
  "name": "application_name",
  ...
  "request":
  {
    "startdate": "",
    "enddate": "",
    "records": 1000
    ...
  },
  "ranking":
  {
    "90":
    {
      "name": "somename",
      "class": "someclass"
      ..
    },
    "98":
    {
      "name": "somename",
      "class": "someclass"
      ..
    },
    "86":
    {
      "name": "somename",
      "class": "someclass"
      ..
    }
  }
}
We are planning to store this information in a data lake, and I have the following questions:
A. We plan to write the data under the directory structure [YEAR] - [MONTH] - [FILE_.json]. What is the advantage of splitting the directory by [YEAR] and [MONTH] compared to storing [FILE_.json] in the root directory?
B. If we extract data multiple times (say, every hour), should I overwrite [FILE_.json], or is it better to append to the file?
C. Is it advisable to store the data as it comes from the source, or is it better to store only the required data? In this case, that is the data under the "ranking" attribute.
D. Suppose I notice that data for the last 10 days hasn't been extracted. If I then extract those 10 days in one go, all of the data will be stored in a single file [FILE_.json]. Is that a good way of storing it? To get data from the source I have to pass startdate and enddate, so it would be easier to fetch all 10 days in a single call rather than making 10 separate calls.
E. Is it advisable to add additional attributes to the JSON file, such as ingestdt or tags?
F. In the future, if we extract data from another source, should I combine the data and write it to a single JSON file, or keep a separate JSON file per source? The JSON structure will differ between sources. Which approach would be easier for downstream consumption?
I would be grateful if someone could point us in the right direction. Any input will be helpful.
Thank you
Upvotes: 0
Views: 141
Reputation: 4552
Below are the answers to all your questions.
A. Splitting the data by [YEAR] and [MONTH] makes it easier to manage. You will be able to locate the data for a given month, date, or hour when you need it later. If you load all the data into a single file, it becomes difficult to process further. Organizing the data this way is therefore the recommended approach.
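As a minimal sketch of the layout in answer A, the helper below builds a year/month path for a file. The root folder `raw/application_name`, the function name `partitioned_path`, and the dated file name are assumptions made for illustration; they do not come from the question.

```python
from datetime import date
from pathlib import PurePosixPath


def partitioned_path(root: str, day: date, filename: str) -> str:
    """Return root/YYYY/MM/filename for the given day."""
    return str(
        PurePosixPath(root) / f"{day.year:04d}" / f"{day.month:02d}" / filename
    )


# Hypothetical root folder and file name, used only to show the shape of the path.
print(partitioned_path("raw/application_name", date(2021, 3, 9), "FILE_20210309.json"))
# raw/application_name/2021/03/FILE_20210309.json
```

With this shape, a job that only needs one month can read a single folder instead of scanning every file under the root.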
B. Whether to overwrite or append depends entirely on your use case. If your application may need previous data in the future, it is better to append. If the previous data has no further use, simply overwrite it and save storage space.
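The two options in answer B can be sketched as follows. This is only an illustration on a local file system: the function name `write_extract` is hypothetical, and the append branch assumes a JSON Lines layout (one JSON document per line), which is a choice made here and not something the question specifies.

```python
import json
from pathlib import Path


def write_extract(path: Path, payload: dict, mode: str = "append") -> None:
    """Write one extract; 'overwrite' keeps only the latest, 'append' keeps history."""
    path.parent.mkdir(parents=True, exist_ok=True)
    if mode == "overwrite":
        # The file always holds exactly one JSON document: the latest extract.
        path.write_text(json.dumps(payload), encoding="utf-8")
    elif mode == "append":
        # Assumed JSON Lines layout: each extract is added as its own line,
        # so earlier extracts remain available.
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(payload) + "\n")
    else:
        raise ValueError(f"unknown mode: {mode}")
```

Note that an appended file is no longer a single JSON document, so downstream readers would need to parse it line by line.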
C. It is advisable to store only the required data; storing data you do not need uses storage unnecessarily. Clean your data with an Azure Data Factory data flow and store the output in the data lake.
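To show what keeping only the required data could mean for this payload, here is a small sketch in plain Python. It stands in for the Azure Data Factory data flow mentioned in answer C and is not an ADF configuration. The function name `keep_ranking` is hypothetical, and the sample document drops the elided fields from the question.

```python
import json


def keep_ranking(document: dict) -> dict:
    """Return only the 'ranking' part of a source document (empty dict if absent)."""
    return document.get("ranking", {})


# Sample shaped like the payload in the question, without the elided fields.
sample = json.loads("""
{
  "version": "3.0",
  "name": "application_name",
  "request": {"startdate": "", "enddate": "", "records": 1000},
  "ranking": {
    "90": {"name": "somename", "class": "someclass"},
    "98": {"name": "somename", "class": "someclass"}
  }
}
""")

trimmed = keep_ranking(sample)
print(sorted(trimmed))  # ['90', '98']
```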
D. You should keep the same hierarchy described in answer A so that the data stays manageable. Split your data on a per-day basis for easy processing.
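One way to follow answer D when backfilling is to turn the missed range into one request window per day and write one file per window. The sketch below only generates the windows. It assumes the source accepts the same date for startdate and enddate to mean a single day, which the question does not confirm, and `daily_windows` is a hypothetical helper name.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def daily_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield one (startdate, enddate) pair per calendar day, with end inclusive."""
    day = start
    while day <= end:
        # Assumption: startdate == enddate requests exactly one day from the source.
        yield day, day
        day += timedelta(days=1)


windows = list(daily_windows(date(2021, 3, 1), date(2021, 3, 10)))
print(len(windows))  # 10
```

Each window can then be passed to the extraction call and its result stored under that day's file, so a backfill produces the same layout as the regular daily runs.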
E. This is not mandatory, but you can add them if you want to. As you are already splitting the data on a per-day basis, no extra tags are required.
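If you do decide to add the attributes from question E, a sketch could look like this. The attribute names ingestdt and tags come from the question; the function name `with_ingest_metadata` and the choice of a UTC ISO 8601 timestamp are assumptions.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def with_ingest_metadata(payload: dict, tags: Optional[Iterable[str]] = None) -> dict:
    """Return a copy of payload with 'ingestdt' and, if given, 'tags' added."""
    enriched = dict(payload)  # shallow copy, so the caller's dict is not modified
    # Assumed format: current UTC time as an ISO 8601 string.
    enriched["ingestdt"] = datetime.now(timezone.utc).isoformat()
    if tags:
        enriched["tags"] = list(tags)
    return enriched
```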
F. When extracting data from a different source, always keep it in a separate JSON file so that it is easy to identify and there is no confusion. Create a separate folder structure for that source and store its data there.
Upvotes: 0