Snowflake Data Lake Tips for Supporting Data Ingestion
by Mauricio Rojas, on Mar 29, 2022 11:01:54 AM
Snowflake is the Data Cloud, but does that mean that I can also use Snowflake as a Data Lake? Well, yes. Snowflake could be your Data Lake, your Data Warehouse, your Data Analytics pipeline... all of it. That's what makes Snowflake so versatile as a platform.
But with all of that versatility, are we really taking advantage of it? Consider using External Tables in Snowflake. We sometimes forget a super feature in Snowflake External Tables that is the ability to create partitions automatically from partition columns expressions. Let's dive into a few scenarios where that could support your data ingestion into Snowflake.
In simple English, we can create expressions which can come from the file name and use those same expressions. When the query is run, Snowflake will determine which set of files to consume automatically.As stated in the documentation, this feature can help if your data files are organized in cloud storage with a structure like logs/YYYY/MM/DD/HH24
or logs/YYYY/MM/DD/HH24/MINUTE. If that is the case then you can then create an external table like this:
Notice the date_part expression that is created based on the filename. When querying the external table, filter the data by the partition column (date_part) using a WHERE clause. Snowflake only scans the files in the specified partitions that match the filter condition.
And it does not need to be limited to one partition. You can use several partitions if you want:
This is great because you can use this for incrementally loading files. Specifically, if you have a scenario where you have a lot of small files (which as a side note, would not be a good scenario for a Snowpipe).
Change tracking can be used, but the documentation also warns about its limitations:
Supported for streams on external tables only. An insert-only stream tracks row inserts only; they do not record delete operations that remove rows from an inserted set (i.e. no-ops).
This approach can be used to ingest files. A Serverless Task can be schedule to run each X-min, and you can query the files in the elapsed time.
You can create a stream with a statement like this:
If you have AUTO_REFRESH=FALSE then you need to run:
You do not need to use data streams. You can just register on a table the last YYYY,MM,DD,HH24,MINUTE that were used. But this opens up some very interesting approaches for data-ingestion.