Guidance needed for setting up a data lake

Question

I need some guidance on setting up a data lake:

We are pulling data from a source (a REST API) that returns a JSON file. A sample structure is given below.

{
    "version": "3.0",
    "name": "application_name",
    ...
    "request":
    {
        "startdate": "",
        "enddate": "",
        "records": 1000
        ...
    },
    "ranking":
    {
        "90":
        {
            "name": "somename",
            "class": "someclass"
            ..
        },
        "98":
        {
            "name": "somename",
            "class": "someclass"
            ..
        },
        "86":
        {
            "name": "somename",
            "class": "someclass"
            ..
        }   
    }
}

We are planning to store this information in a data lake, and I have the following questions:

A. We are planning to dump the data under the directory structure [YEAR] - [MONTH] - [FILE_.json]. What is the advantage of splitting the directory by [YEAR] and [MONTH] compared to storing [FILE_.json] in the root directory?
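
To make the layout in question A concrete, here is a minimal sketch of how a year/month path could be built for each extract. The root `datalake/raw`, the per-source folder, and the timestamped file name are assumptions chosen for illustration; none of them come from the source API.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partitioned_path(root: str, source: str, extracted_at: datetime) -> PurePosixPath:
    """Build <root>/<source>/<YYYY>/<MM>/<source>_<timestamp>.json (hypothetical layout)."""
    # Timestamp in the file name keeps each extract as its own file.
    stamp = extracted_at.strftime("%Y%m%dT%H%M%SZ")
    return (
        PurePosixPath(root)
        / source
        / f"{extracted_at:%Y}"
        / f"{extracted_at:%m}"
        / f"{source}_{stamp}.json"
    )


# Example extract time; the value is made up for the sketch.
example = partitioned_path(
    "datalake/raw",
    "application_name",
    datetime(2024, 3, 5, 14, 0, tzinfo=timezone.utc),
)
print(example)
```

With a layout like this, a consumer that only needs one month can list a single directory instead of every file under the root.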

B. If we extract data multiple times (say, every hour), should I overwrite [FILE_.json], or is appending to the file better?

C. Is it advisable to store the data as it comes from the source, or is it better to store only the required data (in this case, the data under the "ranking" attribute)?

D. Suppose I notice that data for the last 10 days hasn't been extracted. If I then extract the last 10 days in one go, all of that data will be stored in a single file [FILE_.json]. Is this a good way of storing the data? To get data from the source I have to pass startdate and enddate, so it would be easier to fetch all 10 days in a single call rather than making 10 separate calls.
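
For comparison with the single-call approach in question D, here is a rough sketch of how a missed date range could be broken into one (startdate, enddate) pair per day. The helper `daily_windows` is hypothetical, and it assumes the API treats both dates as inclusive, which the sample response does not confirm.

```python
from datetime import date, timedelta


def daily_windows(start: date, end: date):
    """Yield one (startdate, enddate) pair per calendar day, both ends inclusive."""
    current = start
    while current <= end:
        # Assumption: passing the same day as start and end returns that day's data.
        yield current, current
        current += timedelta(days=1)


# Example range; the dates are made up for the sketch.
windows = [
    (s.isoformat(), e.isoformat())
    for s, e in daily_windows(date(2024, 3, 1), date(2024, 3, 3))
]
for startdate, enddate in windows:
    print(startdate, enddate)
```

Each pair could then drive one API call and one output file, at the cost of more requests than a single 10-day call.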

E. Is it advisable to add additional attributes to the JSON file, such as ingestdt and tags?
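
As an illustration of question E, one possible shape is to wrap the source response in an envelope so the added attributes sit next to the original data rather than inside it. The `payload` key, the helper name `wrap_with_metadata`, and the tag value are assumptions made up for this sketch.

```python
import json
from datetime import datetime, timezone


def wrap_with_metadata(payload: dict, tags: list[str], ingested_at: datetime) -> dict:
    """Return a new dict holding ingestion metadata next to the untouched source payload."""
    return {
        "ingestdt": ingested_at.isoformat(),
        "tags": list(tags),
        "payload": payload,  # hypothetical key; the source response is stored as-is
    }


# Trimmed-down version of the sample response from the question.
source_response = {
    "version": "3.0",
    "name": "application_name",
    "ranking": {"90": {"name": "somename", "class": "someclass"}},
}

record = wrap_with_metadata(
    source_response,
    ["hourly"],
    datetime(2024, 3, 5, 14, 0, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

Keeping the response under its own key means the stored payload still matches what the API returned.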

F. In the future, if we extract data from another source, should I combine the data and dump it into a single JSON file, or keep a separate JSON file per source? The JSON structure will differ between sources. Which approach would be easier for downstream consumption?

I would be grateful if someone could guide us in the right direction. Any input will be helpful.

Thank you
