Friday, September 4, 2020

Clusters

Snowflake cloud data warehouse produces create clustered tables by default. However, as the table size grows and DML occurs on the table, the data in some table rows may no longer cluster optimally on desired dimensions. In this article, we will check how to create Snowflake clustered tables to improve the DML query performance.

Snowflake Clustered Tables

When you create a table and insert records into the Snowflake tables, inserted rows are grouped into continuous storage such as micro-partitions. But, as the table size grows, data in some column may no longer cluster optimally. Clustering is not intended for all tables, you will add cluster key to improve the clustering of the underlying table micro-partitions.

The cluster columns can be a subset of the table columns, or an expression of the tables. A table with a clustering key defined is considered to be clustered table.

What is Snowflake Clustering Key?

A clustering key is a subset of columns in a table or an expression that are explicitly used to co-locate the data in the table in the same micro-partition.

This is very useful for the large tables (multi-TB) where ordering of the column was not optimal or extensive DML operation on the table has caused the table’s natural clustering to degrade.

So, which column(s) should be clustering key?

Following are the recommended columns as a clustering key.

Cluster columns that are most actively used in selective filters. For example, many fact tables involves the date column as a filter condition (for example “WHERE invoice_date = X” ), choosing the date column is a good idea.
Consider columns frequently used in join predicates.

The number of distinct values in a column/expression is a critical aspect of selecting it as a clustering key.

You should identify a clustering key that has:

A large enough number of distinct values to enable effective pruning on the table.
A small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.

Note that, if you define two or more columns/expressions as the clustering key for a table, the order has an impact on how the data is clustered in micro-partitions. Snowflake recommends ordering the columns from lowest cardinality to highest cardinality.

Snowflake Clustered Table Syntax

You can use following syntax to create clustered tables in Snowflake.

CREATE TABLE <name> ... CLUSTER BY ( <expr1> [ , <expr2> ... ] )

Snowflake Clustered Table Examples

Consider following example to create table with clustering key.

create or replace table sn_clustered_table

(c1 date, c2 string, c3 number)

cluster by (c1, c2);

Alter Snowflake Table to Add Clustering Key

You can add the clustering key while creating table or use ALTER TABLE syntax to add a clustering key to existing tables.

Following is the syntax to add a clustering key to existing Snowflake table.

ALTER TABLE <name> CLUSTER BY ( <expr1> [ , <expr2> ... ] )

For example,

alter table sn_clustered_table2 cluster by (c1, c3);

Dropping the Clustering Keys in Snowflake

If the query performance is not as per expectation, you can drop the clustering key.

Following is the syntax to drop the clustering key on Snowflake table.

ALTER TABLE <name> DROP CLUSTERING KEY

For example,

ALTER TABLE sn_clustered_table2 DROP CLUSTERING KEY

Reclustering in Snowflake

Reclustering in Snowflake is automatic; no maintenance is needed. During reclustering, Snowflake uses the clustering key for a clustered table to reorganize the column data, so that related records are relocated to the same micro-partition. This DML operation deletes the affected records and re-inserts them, grouped according to the clustering key.

Limitation of Clustering in Snowflake

Following are some of the limitations on clustered Tables.

An existing Clustering key is not propagated when using CREATE TABLE … LIKE operation.
An existing clustering key is not supported when a table is created using CREATE TABLE … AS SELECT; however, you can use the ALTER TABLE command to define a clustering key after the table is created.
You cannot use a VARIANT column in clustering key. However, you can specify the expression to extract value in clustering key.
CREATE TABLE … CLONE will retain the clustering key.

Conclusion :

Snowflake does not support indexes, though it does support "clustering" for performance improvements of I/O.

Snowflake does not use indexes. This is one of the things that makes Snowflake scale so well for arbitrary queries. Instead, Snowflake calculates statistics about columns and records in files that you load, and uses those statistics to figure out what parts of what tables/records to actually load to execute a query

Snowflake doesn't support indexes, it keeps data in micro partition or in another sense it breaks data sets in small files and format rows to column and compress them. Snowflake metadata manager in service layer will have all the information about each micro partition like which partition have which data. Each partition will have information about itself in header like max value, min value, cardinality etc. this is much better then indexes as compare to conventional databases.

3 comments:

chayan $October 16, 2021 at 6:31 PM

Thanks for sharing this blog this content is very significant for me I really appreciate you.

click here now

ReplyDelete
Replies
akshay @123October 19, 2021 at 9:16 AM
Very good information given on his new blogs and famous and trendy blogs with different categories.
Read the article
ReplyDelete
Replies
chayan $November 21, 2021 at 3:23 PM
The tutorial is very helpful for a first timer like me, by the way if you could have given a
tutorial on creating a free blog, it would have been very helpful for newbies like me.
click here now

ReplyDelete
Replies

Add comment