Username784792

Reputation: 135

What is the best way to run a report in a notebook when connected to Snowflake with the Python connector?

My last couple of questions have been about how to connect to Snowflake and add and read data with the Python connector in an IPython notebook. However, I am having trouble with the next step: creating a report from the data I want to visualize.

I would like to upload all of the data, store it, then analyze it, kind of like a homemade dashboard.

So what I have done so far is a small version of this:

  1. Stage my data from a local file, and append new data each time I open the notebook
  2. Use the Python connector to pull any data I need from storage
  3. Create visualizations with numpy objects in the local notebook (a rough sketch of steps 2 and 3 is below)
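For example, a rough sketch of steps 2 and 3 using the Python connector and matplotlib might look like the following. The connection parameters, table and column names are placeholders I made up for illustration, not my actual setup:

    import snowflake.connector
    import matplotlib.pyplot as plt

    # Placeholder credentials and objects -- substitute your own account, database and table.
    conn = snowflake.connector.connect(
        user="MY_USER",
        password="MY_PASSWORD",
        account="MY_ACCOUNT",
        warehouse="MY_WH",
        database="MY_DB",
        schema="PUBLIC",
    )
    cur = conn.cursor()

    # Pull an aggregated slice of the stored data; the aggregation happens in the SQL call.
    cur.execute("""
        SELECT event_date, SUM(amount) AS total_amount
        FROM daily_data
        GROUP BY event_date
        ORDER BY event_date
    """)
    rows = cur.fetchall()
    cur.close()
    conn.close()

    # Visualize locally in the notebook.
    dates, totals = zip(*rows)
    plt.plot(dates, totals)
    plt.title("Daily totals")
    plt.show()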

My data will start out very small, but over time I would imagine I would have to move computation to the cloud to minimize the memory used locally for the small dashboard.

My question is: my data is called from an API that returns JSON files, and new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls on the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?

Upvotes: 0

Views: 183

Answers (2)

doyouevendata

Reputation: 179

My question is: my data is called from an API that returns JSON files, and new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls on the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?

I would flatten your data in Python or Snowflake - depending on which you feel more comfortable using or how complex the data is. You can just do everything on the straight JSON, although I would rarely look to design something that way myself (it's going to be the slowest to query).
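A rough sketch of flattening inside Snowflake from the Python connector - the stage, table and field names here are made up for illustration, so adjust them to your actual JSON shape:

    import snowflake.connector

    # Placeholder connection -- use your own credentials.
    conn = snowflake.connector.connect(user="MY_USER", password="MY_PASSWORD", account="MY_ACCOUNT")
    cur = conn.cursor()

    # Assumes the raw JSON landed in a single VARIANT column called V.
    # LATERAL FLATTEN turns the nested array into rows; the ::casts pull out typed columns.
    cur.execute("""
        CREATE OR REPLACE TABLE daily_data AS
        SELECT r.v:event_date::date  AS event_date,
               f.value:name::string  AS name,
               f.value:amount::float AS amount
        FROM raw_json r,
             LATERAL FLATTEN(input => r.v:records) f
    """)
    cur.close()
    conn.close()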

As far as aggregating the data goes, I'd always do that in Snowflake. If you would like to slice and dice the data in various ways, you may look to design a data mart data model and have your dashboard simply aggregate data on the fly via queries. Snowflake should be pretty good at that, but for additional speed, aggregating it up to the monthly level may be a good idea too.
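Something like this would roll the flattened data up to month level entirely in Snowflake (again, the table and column names are just placeholders):

    import snowflake.connector

    # Placeholder connection -- substitute your own credentials.
    conn = snowflake.connector.connect(user="MY_USER", password="MY_PASSWORD", account="MY_ACCOUNT")
    cur = conn.cursor()

    # Build a small summary table so the dashboard only ever reads pre-aggregated rows.
    cur.execute("""
        CREATE OR REPLACE TABLE monthly_summary AS
        SELECT DATE_TRUNC('month', event_date) AS month,
               SUM(amount)                     AS total_amount,
               COUNT(*)                        AS row_count
        FROM daily_data
        GROUP BY 1
        ORDER BY 1
    """)
    cur.close()
    conn.close()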

You can probably mature your process from being driven by a local Python script to something like a serverless Lambda that is event-driven or run on a scheduler as well.

Upvotes: 0

Trevor

Reputation: 4860

Put the raw data into Snowflake. Use tasks and procedures to aggregate it and store the result. Or better yet, don't do any aggregations except when you want the data - let Snowflake do the aggregations in real time off the raw data.
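As a rough illustration of the tasks approach - the object names, schedule and SQL below are assumptions for the sake of example, not something from your question:

    import snowflake.connector

    # Placeholder connection -- substitute your own credentials, warehouse and database.
    conn = snowflake.connector.connect(user="MY_USER", password="MY_PASSWORD", account="MY_ACCOUNT")
    cur = conn.cursor()

    # A task that rebuilds a monthly summary at 02:00 UTC on the 1st of every month.
    cur.execute("""
        CREATE OR REPLACE TASK refresh_monthly_summary
            WAREHOUSE = MY_WH
            SCHEDULE = 'USING CRON 0 2 1 * * UTC'
        AS
            CREATE OR REPLACE TABLE monthly_summary AS
            SELECT DATE_TRUNC('month', event_date) AS month,
                   SUM(amount)                     AS total_amount
            FROM daily_data
            GROUP BY 1
    """)

    # Tasks are created suspended; resume it so the schedule actually runs.
    cur.execute("ALTER TASK refresh_monthly_summary RESUME")
    cur.close()
    conn.close()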

I think what you might be asking is whether you should ETL your data or ELT your data:

  • ETL: Extract, Transform, Load (in that order) - Extract data from your API. Transform it locally on your computer. Load it into Snowflake.
  • ELT: Extract, Load, Transform (in that order) - Extract data from your API. Load it into Snowflake. Transform it after it's in Snowflake.

Both ETL and ELT are valid. Many companies use both approaches with Snowflake interchangeably. But Snowflake was built to be, in a sense, your data lake - the idea being, "Just throw all your data up here and then use our awesome compute and storage resources to transform it quickly and easily."
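A bare-bones ELT sketch with the Python connector looks like this - the stage, table and file path are placeholders, and the point is simply that the raw JSON lands in Snowflake untouched:

    import snowflake.connector

    # Placeholder connection and object names, for illustration only.
    conn = snowflake.connector.connect(user="MY_USER", password="MY_PASSWORD", account="MY_ACCOUNT",
                                       warehouse="MY_WH", database="MY_DB", schema="PUBLIC")
    cur = conn.cursor()

    # E and L: land the raw API output in a single VARIANT column, no local transformation.
    cur.execute("CREATE TABLE IF NOT EXISTS raw_json (v VARIANT)")
    cur.execute("CREATE STAGE IF NOT EXISTS api_stage FILE_FORMAT = (TYPE = 'JSON')")
    cur.execute("PUT file:///tmp/api_dump.json @api_stage")
    cur.execute("COPY INTO raw_json FROM @api_stage")

    # T: transform later, inside Snowflake, with SQL (flattening, casting, aggregating).
    cur.close()
    conn.close()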

Do a Google search on "Snowflake ELT" or "ELT vs ETL" for more information.

Here are some considerations either way off the top of my head:

Tools you're using: Some tools like SSIS were built w/ ETL in mind - transformation of the data before you store it in your warehouse. That's not to say you can't ELT, but it wasn't built w/ ELT in mind. More modern tools - like Fivetran or even Snowpipe - assume you're going to consolidate all your data in Snowflake and then transform it once it's up there. I really like the ELT paradigm - i.e., just get your data into the cloud and transform it quickly once it's up there.

Size and growth of your data: If your data is growing, it becomes harder and harder to manage it on local resources. It might not matter when your data is in gigabytes or millions of rows. But as you get into billions of rows or terabytes of data, the scalability of the cloud can't be matched. If you feel like this might happen and you think putting it into the cloud isn't a premature optimization, I'd load your raw data into Snowflake and transform it after it's up there.

Compute and Storage Capacity: Maybe you have a massive amount of storage and compute at your fingertips. Maybe you have an on-prem cluster you can provision resources from at the drop of a hat. Most people don't have that.

Short-Term Compute and Storage Cost: Maybe you have some modest resources you can use today and you'd rather not pay Snowflake while your modest resources can do the job. Having said that, it sounds like the compute to transform this data will be pretty minimal, and you'll only be doing it once a day or once a month. If that's the case, the compute cost will be very minimal.

Data Security or Privacy: Maybe you have a need to anonymize data before moving it to the public cloud. If this is important to you, you should look into Snowflake's security features; but if you're in an organization where it's super difficult to get a security review and you need to move forward with something, transforming it on-prem while waiting for the security review is a good alternative.

Data Structure: Do you have duplicates in your data? Do you need access to other data in Snowflake to join on in order to perform your transformations? As you start putting more and more data into Snowflake, it makes sense to transform it after it's in Snowflake - that's where all your data is, and you'll find it easier to join, query and transform in the cloud where all your other data already lives.

Upvotes: 1
