How did I learn Snowflake within 30 days and use it in production?

Sid
7 min read · Apr 10, 2024

In this guide, I’ll show you how to start with Snowflake and confidently scale it for production-level workloads. This approach has helped me scale numerous startups that use Snowflake to manage and optimize their operations while keeping costs in check.

The first step is simple: sign up for a Snowflake account to get $400 in free credits, which is plenty to explore all of its services. Choose the Enterprise edition, then select your preferred cloud provider (GCP, AWS, or Azure) and region. Selecting the right cloud provider is crucial, since most companies use Snowflake primarily as a data warehouse alongside other cloud services such as AWS or Google Cloud.

After setting up your account, take some time to familiarize yourself with the intuitive user interface.

With the basics out of the way and assuming you’re comfortable with SQL, you’re ready to dive deeper. Snowflake provides ample sample data to experiment with all its features.

STEP-1 : Understanding the big picture

The very first step when learning any cloud technology is to understand how billing works, what the different billing components are, and what the high-level architecture looks like. Below are the three major components in Snowflake that contribute to the bill (a short sketch follows the list):

1 — Virtual Warehouses (different sizes)

2 — Storage costs

3 — Snowpipe objects (For automated data ingestion pipelines)
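
To make the first two components concrete, here is a minimal sketch: an X-Small warehouse with auto-suspend (so you only pay for compute while queries actually run) and a credit-usage check against the account usage views. The warehouse name learn_wh is just an example.

```sql
-- Warehouse name and settings below are illustrative, not a recommendation.
CREATE WAREHOUSE IF NOT EXISTS learn_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60       -- suspend after 60 idle seconds to stop burning credits
  AUTO_RESUME    = TRUE;

-- Credits consumed per warehouse over the last 7 days
-- (ACCOUNT_USAGE views have a latency of up to a few hours).
SELECT warehouse_name,
       SUM(credits_used) AS credits_used
FROM   snowflake.account_usage.warehouse_metering_history
WHERE  start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_used DESC;
```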

STEP-2 : Understanding the different table types

Understanding the types of tables that Snowflake supports is extremely important, since storage contributes to the overall cost. Snowflake supports three types of tables, each with its own purpose:

1 — Transient tables

2 — Temporary tables

3 — Permanent tables (Default)

Besides the above, also understand when to use views as opposed to tables. At a high level, Snowflake supports three kinds of views:

1 — Views

2 — Materialized views

3 — Secure views

Note: You can create all of the above table and view types in your own database/schema using the default sample data in Snowflake, as in the sketch below.
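
As a rough sketch (assuming a trial account where the SNOWFLAKE_SAMPLE_DATA share is available; all object names below are arbitrary):

```sql
CREATE DATABASE IF NOT EXISTS practice_db;
USE SCHEMA practice_db.public;

-- Permanent table (default): full Time Travel plus Fail-safe, highest storage cost.
CREATE TABLE customer_perm AS
SELECT * FROM snowflake_sample_data.tpch_sf1.customer LIMIT 1000;

-- Transient table: no Fail-safe, cheaper for staging or easily re-creatable data.
CREATE TRANSIENT TABLE customer_transient AS
SELECT * FROM snowflake_sample_data.tpch_sf1.customer LIMIT 1000;

-- Temporary table: exists only for the current session.
CREATE TEMPORARY TABLE customer_temp AS
SELECT * FROM snowflake_sample_data.tpch_sf1.customer LIMIT 1000;

-- Standard view vs. secure view over the same data.
CREATE VIEW v_rich_customers AS
SELECT c_name, c_acctbal FROM customer_perm WHERE c_acctbal > 5000;

CREATE SECURE VIEW sv_rich_customers AS
SELECT c_name, c_acctbal FROM customer_perm WHERE c_acctbal > 5000;

-- Materialized view (Enterprise edition): results are pre-computed and auto-maintained.
CREATE MATERIALIZED VIEW mv_rich_customers AS
SELECT c_custkey, c_name, c_acctbal FROM customer_perm WHERE c_acctbal > 5000;
```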

STEP-3 : Running queries on different virtual warehouses

By now you should have figured out that in Snowflake, compute is separated from storage. This means you can run complex queries on a warehouse (X-Small, Small, Large, X-Large, etc.) and suspend the warehouse as soon as execution completes, while your data resides separately in the storage layer.

Try running the same queries on warehouses of different sizes to understand how warehouse size affects SQL execution and performance.

Note: Try running queries against one of the sample databases/tables to get a good grasp of the performance on different warehouse sizes.
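
For example (warehouse names are arbitrary; the result cache is disabled so the second run is a fair comparison):

```sql
-- Two throwaway warehouses of different sizes.
CREATE WAREHOUSE IF NOT EXISTS wh_xs WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE WAREHOUSE IF NOT EXISTS wh_l  WAREHOUSE_SIZE = 'LARGE'  AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- Avoid the result cache returning the second run instantly.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Same aggregation on each size; compare timings in Query History / Query Profile.
USE WAREHOUSE wh_xs;
SELECT l_returnflag, l_linestatus, SUM(l_extendedprice) AS revenue
FROM   snowflake_sample_data.tpch_sf10.lineitem
GROUP BY 1, 2;

USE WAREHOUSE wh_l;
SELECT l_returnflag, l_linestatus, SUM(l_extendedprice) AS revenue
FROM   snowflake_sample_data.tpch_sf10.lineitem
GROUP BY 1, 2;

-- Suspend the big warehouse as soon as you are done.
ALTER WAREHOUSE wh_l SUSPEND;
```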

STEP-4 : Query optimization, clustering and partitioning

Once you are done setting up warehouses of different sizes and running some queries on them, you will notice that the larger the warehouse, the faster the execution. But this does not mean you should always beef up your warehouse size and configuration just to make queries execute faster. Keep in mind: the bigger the warehouse, the more you pay. It is therefore extremely important to understand your SQL execution plan, do the basic checks, and optimize your SQL queries as much as possible.

Data scanning is one of the core aspects of distributed data processing and can make or break your entire pipeline (both ETL and analytical workloads). To ensure your SQL queries are optimized on Snowflake, start by making sure they only read the data that is absolutely needed. Below are the key concepts around query optimization in Snowflake (a short sketch follows the list):

1 — Table Partitions

2 — Clustering Keys

3 — Micro-partitions (concept)

4 — How to select the right clustering keys

5 — Caching data for reusability

6 — SQL Query profile

Also, understand what the Search Optimization feature is and how it helps queries that return one or a few rows of data (also known as “point lookup” queries).

Note: You can implement all of the above features in your trial account with the same data that comes by default.
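
As a rough illustration of clustering, pruning and search optimization (the table name and key choice are just examples, and the clustering/search-optimization services themselves consume credits):

```sql
-- Copy of the sample data so we can control clustering ourselves.
CREATE TABLE orders_clustered AS
SELECT * FROM snowflake_sample_data.tpch_sf10.orders;

-- Cluster on a column that appears in most filters; automatic clustering
-- will consume credits in the background to keep the table well clustered.
ALTER TABLE orders_clustered CLUSTER BY (o_orderdate);

-- Inspect how well the micro-partitions overlap on the clustering key.
SELECT SYSTEM$CLUSTERING_INFORMATION('orders_clustered', '(o_orderdate)');

-- A selective filter on the clustering key should prune micro-partitions:
-- check "Partitions scanned" vs "Partitions total" in the Query Profile.
SELECT COUNT(*), SUM(o_totalprice)
FROM   orders_clustered
WHERE  o_orderdate BETWEEN '1995-01-01' AND '1995-01-31';

-- Search optimization (Enterprise edition) speeds up point-lookup queries.
ALTER TABLE orders_clustered ADD SEARCH OPTIMIZATION;
```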

STEP-5 : Data ingestion into Snowflake

Now we get to the part where you will likely want to ingest data into Snowflake from another cloud provider such as AWS or GCP.

The first step in ingesting data from an external source is to create an integration object, which gives Snowflake access to the cloud storage (S3 or GCS buckets) where the data will reside before being ingested. This step involves creating IAM roles and granting access policies to those roles.

Once the integration object is created, try ingesting data in different file formats such as CSV, JSON and Parquet into your Snowflake tables by manually running COPY commands.

Next, move on to automated data ingestion pipelines using “Snowpipe”, a Snowflake object specifically used to ingest data as it arrives.

Implementing Snowpipe is quite easy, but what is really important here is to understand the cost of running Snowpipe.

Finally, you will also want to figure out how to unload/extract data from Snowflake to an external storage layer such as an S3 or GCS bucket, which is also done using COPY commands.

Note — You can find a plethora of sample data online which you can use to ingest into Snowflake tables.
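
Putting STEP-5 together, a skeleton of the ingestion path might look like the following, assuming AWS S3; every name, ARN, bucket path and the target table my_table are placeholders to replace with your own.

```sql
-- Integration object: lets Snowflake assume an IAM role with access to the bucket.
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake_access_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/raw/');

-- Reusable file format and an external stage pointing at the bucket.
CREATE FILE FORMAT csv_ff TYPE = 'CSV' SKIP_HEADER = 1;
CREATE STAGE raw_stage
  URL = 's3://my-bucket/raw/'
  STORAGE_INTEGRATION = s3_int
  FILE_FORMAT = csv_ff;

-- One-off manual load.
COPY INTO my_table FROM @raw_stage FILE_FORMAT = (FORMAT_NAME = 'csv_ff');

-- Automated ingestion: Snowpipe loads files as they land in the bucket
-- (requires S3 event notifications pointing at the pipe's SQS queue).
CREATE PIPE raw_pipe AUTO_INGEST = TRUE AS
  COPY INTO my_table FROM @raw_stage FILE_FORMAT = (FORMAT_NAME = 'csv_ff');

-- Unloading data back out to the bucket also uses COPY.
COPY INTO @raw_stage/exports/ FROM my_table FILE_FORMAT = (FORMAT_NAME = 'csv_ff');
```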

STEP-6 : Snowflake Tasks

One of the most helpful features in any technology is the ability to schedule or trigger pipelines or query execution based on an event or at a particular frequency. This is where Snowflake Tasks come into the picture. You cannot learn Snowflake without understanding how tasks work and how they can help with query scheduling. Tasks can also be nested with dependencies.

Note: Executing a task just means running a few lines of simple SQL. Think of some simple or advanced logic using the provided sample data, build one task for each piece of logic, and chain them together using Snowflake Tasks. You can implement both standalone scheduled tasks and a dependent tree of tasks.
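
A minimal two-task chain could look like the sketch below; the task names, the staging_orders and daily_summary tables, and the learn_wh warehouse are assumptions carried over from the earlier sketches.

```sql
-- Root task: runs on a schedule.
CREATE OR REPLACE TASK load_task
  WAREHOUSE = learn_wh
  SCHEDULE  = '60 MINUTE'
AS
  INSERT INTO staging_orders
  SELECT * FROM snowflake_sample_data.tpch_sf1.orders LIMIT 100;

-- Dependent task: runs only after load_task succeeds.
CREATE OR REPLACE TASK transform_task
  WAREHOUSE = learn_wh
  AFTER load_task
AS
  INSERT INTO daily_summary
  SELECT o_orderdate, SUM(o_totalprice) FROM staging_orders GROUP BY o_orderdate;

-- Tasks are created suspended; resume children before the root.
ALTER TASK transform_task RESUME;
ALTER TASK load_task RESUME;
```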

STEP-7 : Snowflake Streams and CDC

“Streams” and “Change Data Capture (CDC)” are key features that enable real-time data processing and analytics.

A Snowflake Stream is a feature that tracks data changes (INSERTs, UPDATEs, and DELETEs) on tables within Snowflake.

Change Data Capture (CDC) refers to the process of identifying and capturing changes made to the data in a database, and then delivering or applying these changes to another system or database. In Snowflake, CDC can be implemented using Streams.

In Snowflake, Streams and CDC work hand-in-hand to provide a robust solution for data replication, synchronization, and real-time analytics. Here’s how they can work together:

1: Create a Stream on a Snowflake table to monitor it for changes.

2: Consume changes from the stream, which captures all DML changes (INSERTs, UPDATEs, and DELETEs).

3: Apply Changes to a target table or system using the captured changes, effectively implementing CDC.
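
One common way to wire these three steps together is a stream feeding a MERGE statement, sketched below with hypothetical staging_orders (source) and orders_target (target) tables.

```sql
-- Step 1: create the stream on the source table.
CREATE OR REPLACE STREAM orders_stream ON TABLE staging_orders;

-- Steps 2 and 3: consuming the stream in a DML statement applies the captured
-- changes to the target and advances the stream offset in the same transaction.
-- The filter drops the DELETE half of update pairs so each target row matches
-- at most one source row.
MERGE INTO orders_target t
USING (
  SELECT * FROM orders_stream
  WHERE NOT (metadata$action = 'DELETE' AND metadata$isupdate)
) s
ON t.o_orderkey = s.o_orderkey
WHEN MATCHED AND s.metadata$action = 'DELETE' THEN DELETE
WHEN MATCHED AND s.metadata$action = 'INSERT' THEN
  UPDATE SET t.o_totalprice = s.o_totalprice
WHEN NOT MATCHED AND s.metadata$action = 'INSERT' THEN
  INSERT (o_orderkey, o_totalprice) VALUES (s.o_orderkey, s.o_totalprice);
```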

STEP-8 : User-defined functions and stored procedures using Snowflake and Snowpark

User-Defined Functions (UDFs) and Stored Procedures are powerful features that extend the capabilities of SQL and allow more complex, procedural logic to be executed within Snowflake.

These features become even more potent with the introduction of Snowpark, Snowflake’s developer framework for building sophisticated data pipelines and applications using familiar programming languages.

User-Defined Functions (UDFs) in Snowflake

UDFs allow you to define custom functions that can be used in SQL queries, just like native functions. These can be written in SQL or JavaScript and are used to encapsulate frequently used logic or complex calculations that are not easily accomplished with standard SQL functions.

Key Points About UDFs:

1: Custom Logic: UDFs let you implement business-specific logic or complex calculations that can be reused across multiple queries and reports.

2: Language Support: Snowflake supports UDFs written in SQL for set-based operations and JavaScript for scalar operations, giving you flexibility in how you define your functions.

3: Performance: UDFs in Snowflake can leverage the platform’s massive parallel processing (MPP) capabilities, ensuring that even complex functions execute efficiently at scale.
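
As a small sketch (function names and logic are purely illustrative): a SQL UDF for a reusable calculation and a JavaScript UDF for string manipulation.

```sql
-- SQL UDF: reusable business calculation.
CREATE OR REPLACE FUNCTION net_price(price FLOAT, discount FLOAT)
RETURNS FLOAT
AS
$$
  price * (1 - discount)
$$;

-- JavaScript UDF: note that arguments are referenced in upper case inside the body.
CREATE OR REPLACE FUNCTION initials(full_name STRING)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  return FULL_NAME.split(' ').map(function (w) { return w[0]; }).join('');
$$;

-- Used like any built-in function.
SELECT net_price(100, 0.2), initials('Ada Lovelace');
```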

Stored Procedures in Snowflake

Stored Procedures in Snowflake allow for even more complex processing by supporting procedural logic, such as loops and conditional statements, and can execute multiple SQL statements in a single call. These are written in JavaScript and can be invoked directly from SQL statements.

Key Points About Stored Procedures:

1: Procedural Logic: Stored Procedures support complex procedural logic, enabling the execution of a series of SQL statements with conditional logic, loops, and transaction control.

2: Transactions: Unlike UDFs, Stored Procedures can manage transactions (commit and rollback), making them suitable for complex data manipulation and business logic that requires atomicity.

3: Interactivity: Stored Procedures can return results to the caller and even support dynamic SQL execution, providing flexibility in how data is processed and returned.
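
A minimal sketch of a JavaScript stored procedure, assuming the hypothetical staging_orders table from earlier: it wraps a parameterised DELETE in an explicit transaction and rolls back on error.

```sql
CREATE OR REPLACE PROCEDURE purge_old_rows(days FLOAT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // Argument names are referenced in upper case (DAYS) inside the JavaScript body.
  try {
    snowflake.execute({sqlText: "BEGIN"});
    var stmt = snowflake.createStatement({
      sqlText: "DELETE FROM staging_orders " +
               "WHERE o_orderdate < DATEADD('day', -" + DAYS + ", CURRENT_DATE())"
    });
    stmt.execute();
    snowflake.execute({sqlText: "COMMIT"});
    return "Deleted rows older than " + DAYS + " days";
  } catch (err) {
    snowflake.execute({sqlText: "ROLLBACK"});
    return "Failed: " + err.message;
  }
$$;

CALL purge_old_rows(30);
```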

How Snowpark Enhances UDFs and Stored Procedures

Snowpark is Snowflake’s developer framework that allows for building data pipelines, machine learning models, and applications using familiar programming languages such as Scala, Java, and Python. Snowpark introduces new ways to extend the capabilities of UDFs and Stored Procedures:

  • Language Expansion: With Snowpark, developers can write UDFs in languages like Scala and Python, expanding the possibilities for data transformation and analysis directly within Snowflake.
  • Dataframe API: Snowpark provides a Dataframe API that can be used within UDFs and Stored Procedures for more complex data manipulations, leveraging the power of Snowflake’s computing engine.
  • Integrated Pipelines: Snowpark allows for seamless integration of UDFs and Stored Procedures within data pipelines, enabling sophisticated data processing workflows that can be executed entirely within Snowflake.
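
For instance, Snowpark's language expansion also surfaces in plain SQL DDL: the Python UDF below (illustrative name and logic) runs on the same Python runtime Snowpark uses, and an equivalent function could be registered from a client session with Snowpark's session.udf.register.

```sql
CREATE OR REPLACE FUNCTION mask_email(email STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
HANDLER = 'mask'
AS
$$
def mask(email):
    # Keep the first character of the local part, hide the rest.
    if email is None or '@' not in email:
        return email
    local, domain = email.split('@', 1)
    return local[0] + '***@' + domain
$$;

SELECT mask_email('jane.doe@example.com');  -- j***@example.com
```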

We have covered 8 steps in total. If you have thoroughly learned and implemented all of the above features in Snowflake while keeping the cost implications in mind, you have learned 75–80% of Snowflake.

There are some other very important features in Snowflake which I will cover separately; these mainly involve:

  • Data sharing, security and governance
  • Integrating Kafka with Snowflake
  • Snowpark for data science


Sid

Passionate data expert & Udemy instructor with 20k+ students, helping startups scale and derive value from data.