Friday, October 2, 2020

Identify and Remove Duplicate Records from Snowflake Table.

Snowflake supports primary key, foreign key and unique key, but, does not enforce them. Snowflake table allows you to insert duplicate rows.

Remove Duplicate Records from Snowflake Table

The following methods can be used to remove duplicate records in Snowflake table.

·         Using DISTINCT Keyword.

·         ALTER TABLE using SWAP WITH method.

·         Using ROW_NUMBER Analytic function.

·         Using GROUP BY Clause to Remove Duplicate Records.

Using Intermediate and Snowflake DISTINCT Keyword

This is one of the easiest methods to remove the duplicate records from the table. 

This method is to create another intermediate table using the DISTINCT keyword while selecting from the original table.

Create table sample_tab1 as select distinct * from sample_tab;

Alter table sample_tab rename to sample_tab_old;

Alter table sample_tab1 rename to sample_tab1;

Drop table sample_tab_old;

This method is pretty simple and worked perfectly fine if table size is small.

ALTER TABLE using SWAP – Used Along with DISTINCT

The SWAP WITH method swaps all content and metadata between two specified tables. The SWAP will rename two tables in a single transaction.

Create or replace table sample_tab1_dup like sample_tab1;

Insert into sample_tab1_dup select distinct * from sample_tab1;

Alter table sample sample_tab1_dup swap with sample_tab1;

Drop table sample_tab1_dup;

Use ROW_NUMBER Analytic Function To Remove Duplicate

The row_number Snowflake analytic function is used to rank or numbering the rows. Here, the row_number function is used to rank the rows for each group of records and then select only one record from that group.

Select col1,col2,col3 from(

Select col1,col2,col3,row_number() over(partition by col1,col2,col3 order by

Col1,col2,col3) RNO from sample_tab1) where RNO=1;

GROUP BY Clause to Remove Duplicate

 GROUP BY clause as an alternative to DISTINCT keyword.

For example, consider below Snowflake query to get unique records from a table using GROUP BY.

Select col1,col2,col3 from sample_tab1 group by col1,col2,col3;

This method is usually faster on small tables.

 

 

1 comment: