Cloud Data Warehouse Explained
by John Browne, on Sep 1, 2021 8:17:24 AM
Possibly the single-most wildly anticipated and over-hyped IPO of 2020 was Snowflake (SNOW:NYSE), a hot provider of cloud data warehouse software. (Snowflake, in keeping with all hot young startups, likes to re-invent the standard terminology in their space so they have decided that "data warehouse" is out and "data cloud" is in.)
If you're late to the party or didn't get the memo, you might be asking yourself what all the fuss is about. As someone who got to the part way late--all the peeled shrimp had already been eaten--let me try to help you out.
Data warehouse, data lakes, data clouds: where does it end?
There's a straight line from the wadded up grocery list in my pocket to a data cloud, and it looks something like this: from a list of things to a table (things with properties) to a database (multiple tables) to a data warehouse (multiple databases) to a data lake (data warehouses plus the kitchen sink) to data cloud (who needs hardware?).
Application software developers are usually intimately familiar with databases, whether Access, SQL Server, Oracle, MySQL, or whatever. It's hard to write a useful application that doesn't rely on some kind of data, and probably 90+ percent of business applications are just front ends for databases, allowing CRUD (create, replace, update, and delete) functions with associated business rules and logic. Our Salmon King Seafood demo app is exactly that.
But as businesses grow, they tend to have multiple databases--one for finance, one for HR, one for sales, one for manufacturing, and on and on. While reporting from a single database is relatively easy (that's what SQL is for), reporting across multiple databases can be tricky, especially if the schemas aren't properly normalized across different tables in different databases. Suppose, for example, the sales database tracks contract dates as DDMMYYY and the finance database tracks receipt dates as MMDDYYYY: any report that wants to examine the lag between a signed contract and receipt of funds will have to do some process to normalize the dates from the two sources. Multiply that trivial example by multiple databases, often in different platforms with different data type definitions, SQL extensions, and software versions and it can get messy fast.
But that's not all. Many production databases are heavily transaction focused--think of a sales database for a large e-commerce site. If your website is taking orders 24x7 because you sell worldwide, when can you run big queries without hurting responsiveness for customers trying to make a purchase?
A data warehouse is not a database
The answer is a data warehouse, which pulls together data from multiple databases, normalizes it, cleanses it, and allows for queries across different data sources using a common data model. This not only reduces the hit on production databases, but, because the ETL (extract, transform, and load) processes create normalized data across all the tables in the warehouse, information consumers (management, business analysts, and even partners) can produce rich reports with minimal impact on the IT organization.
Since their widespread adoption dating back to the 80s and 90s, enterprise data warehouses (EDWs) have become de rigueur in larger organizations. Companies like IBM, Oracle, and Teradata gobbled up market share with high-performance appliances for ingesting, processing, and running queries on enterprise data.
Enter the cloud
Like a tornado suddenly developing from a dark cloud and that proceeds to rip apart everything in its path, the gathering storm of IoT and public cloud upended a lot of EDW strategies.
Seemingly overnight organizations have petabytes of data flooding in from connected devices all over the world. Drawing insights from that data requires both a high-performance computing platform as well as vast capacities in both structured and unstructured data. Data lakes allow for extremely large storage of unstructured data for consumption by AI models to create business intelligence. Cloud storage and elastic cloud processing slashes the investment and direct costs of storing and processing these vast rivers of raw data.
In the "old" model you might have to build up capacity by adding more appliances and disk storage to handle your peak loads, with a public cloud EDW the compute capacity can expand or contract with usage while maintaining acceptable performance for running big queries or scripts. Plus, BLOB (binary large object) storage costs for AWS or Azure are quite inexpensive compared to arrays of on-prem disk drives.
The only real downside of public cloud-based EDWs was stickiness--if your warehouse was in AWS is was difficult to move it to Azure, and vice versa.
Along comes Snowflake
Last year Snowflake was the hottest thing no one had ever heard of. Overnight the CEO becomes a newly minted billionaire (who hasn't gone into space yet, but it's early days) and the stock price went through the roof on day 1.
Well, hype mainly. Yet the hype had some basis in reality. Let's look at the things that make the Snowflake cloud data warehouse different:
- Cloud agnostic: You can run your Snowflake instance in Azure, AWS, or wherever you want. You can easily move it from one cloud provider to a different one; something that isn't possible if you're using Azure or Amazon Redshift. In general, customers are wary of vendor lock-ins.
- Only pay for what you use: Snowflake's pricing model is focused on your compute usage, rather than storage. And unlike many other EDW solutions, you only pay for what you use. If you've got a ginormous query you only run once a month, you don't have to pay the rest of the month for that surge capacity.
- Shared data: Snowflake dramatically reduced the complexity or time it takes to share data among disparate sources. They're building a public data marketplace with some government databases already in place. Doesn't really matter if your data is in AWS and the partner's data you want to consume is in their private cloud: Snowflake makes it really easy to get it.
- Super fast performance: They claim queries run much much faster on Snowflake than on other systems. YMMV.
- Mix and match data: whether it's a data lake of unstructured data from IoT devices or highly structured data from finance systems, Snowflake makes it cheap and easy to store, access, and understand that data. Queries across disparate data systems, housed in different architectures and sources, are simple.
Enterprise data warehouses are the domain of very large companies, which are often the acquirers in merger and acquisition (M&A) deals. If BigCo running an EDW in AWS acquires SmallerCo who in turn is running their EDW in Azure, then it will be hard to combine those EDWs without Snowflake. In this situation, a Snowflake data cloud will work seamlessly across both AWS and Azure, eliminating the need for BigCo to convert the SmallerCo data lake and data warehouse to its own system. Further, the Snowflake architecture makes it easy to query very large data sets without the need to pay for the capacity to do so even when it's idle most of the time.
If Snowflake's so great, why isn't everyone using it?
Recently I was watching a show featuring an architect in Dublin, Ireland and in one episode he's redoing a house for a family that moved from California; they get a shipping container delivered with all their furniture and appliances from the US. My immediate thought was that those appliances weren't going to be much use in a country with 220v and different plug standards.
Changing platforms or systems can be hard. And changing data warehouses can be really hard.
We'll get into this in more detail in a subsequent post, but basically consider the things you have to do for a migration from, say Teradata to Snowflake:
- Create the new Snowflake data warehouse, training the existing user base on the new platform
- Duplicate all the data in the existing Teradata instance and move it into Snowflake
- Extract the code (stored procedures, BTEQ scripts, and so forth) from the Teradata instance
- Identify and transfer all the associated processes like ETL and BI reporting/analysis
- Run the new system in parallel with the existing Teradata system and compare all the results for a thorough QA review.
- When everything checks out, turn off Teradata and enjoy the new cloud data warehouse.