From On-Prem to Cloud: 9 Common Cloud Data Warehousing Anti-Patterns and How to Avoid Them

Moving from an on-premises data warehouse to the cloud? You’re not alone. We see many companies making this move. These days, there are plenty of great options for cloud data warehouses, such as Snowflake, BigQuery, Redshift, Firebolt, Synapse, and others. Although cloud data warehouses reduce the operational and administrative overhead of an on-premises data warehouse, there are some key things you need to know to make your transition much easier. Data warehousing in the cloud has some fundamental differences compared to on-premises data warehousing. We are often contacted by teams that assumed their on-premises knowledge would translate with little effort to their new cloud data warehouse, only to run into unanticipated (and avoidable) problems.

If there’s one piece of advice to take away from this article, it’s this: get to know your data warehouse. Read the documentation, spend time with it, take a few courses, talk to people with experience in the new tool, and even get a certification or two. There is nothing more important than understanding how your data warehouse operates: why certain patterns and structures are used, how billing works, and how to leverage the separation of compute and storage. This understanding will allow you to create best-practice patterns that are optimized for these features.

At the same time, there are several anti-patterns we’ve seen over and over again. As you transition from on-prem to cloud, recognizing these anti-patterns will save you significant time and money. 

The cloud data warehouse anti-patterns fall into three main categories: (1) general anti-patterns, (2) loading and storage anti-patterns, and (3) querying and processing anti-patterns. 

General Cloud Data Warehouse Anti-Patterns

#1: Treating your cloud data warehouse like your on-premises data warehouse

Your on-premises data warehouse is not the same as a cloud data warehouse. We’ve seen more people get tripped up by this anti-pattern than anything else. It has far-reaching implications, and it will show up a few more times in this article in more specific variations.

Unfortunately, it is also the snafu we encounter most frequently during a data warehouse migration. On-prem and cloud data warehousing have fundamental differences, such as the separation of compute from storage, different pricing models, and the gory technical details of how the platforms behave. We understand that it’s difficult to shake off habits and routines from your on-prem days, but it is essential to set aside your past conditioning.

Which brings us to the next anti-pattern.

#2: Not understanding how your cloud data warehouse works

Some of the aforementioned differences between a cloud data warehouse and an on-prem one are also what make a cloud data warehouse so powerful. Take the time to read the cloud data warehouse’s documentation. Really understand how it wants to be treated. Without a good understanding of your data warehouse’s unique attributes, you are not leveraging its full power, and you are potentially making costly mistakes.

#3: Embracing denormalization by default, and outright rejecting dimensional modeling 

Dimensional modeling, such as Kimball modeling, applies business logic to data structures. Though dimensional modeling has been a common practice in data warehousing for decades, some vendors and articles push for extreme denormalization. Columnar data warehouses allow us to relax some assumptions of traditional data modeling, such as normalization. While this can have performance benefits, it also means your business logic can become muddled. Consider your use cases and whether denormalization is warranted. 

Just know that outright rejecting dimensional modeling can lead to confusion and data mismatches down the line. If you’re not consistent, how do you know you’re defining a critical metric, e.g. customer engagement, in a way that conforms to business standards?
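To make this concrete, here is a minimal sketch (Python holding a SQL string) of a conformed “customer engagement” metric defined once against a star schema. The fact_events and dim_customer tables, their columns, and the “active days” definition are hypothetical placeholders, not a prescription.

```python
# Sketch: a conformed "customer engagement" metric defined once against a
# star schema. Table and column names are hypothetical placeholders.
CUSTOMER_ENGAGEMENT_SQL = """
SELECT
    d.customer_id,
    d.customer_segment,
    COUNT(f.event_id)                 AS total_events,
    COUNT(DISTINCT DATE(f.event_ts))  AS active_days  -- the agreed definition of engagement
FROM fact_events AS f
JOIN dim_customer AS d
  ON f.customer_key = d.customer_key  -- surrogate key from the conformed dimension
GROUP BY d.customer_id, d.customer_segment
"""
```

Every report that joins through dim_customer inherits the same definition, instead of re-deriving engagement slightly differently in each denormalized, report-specific table.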

Loading and Storage Anti-patterns

#4: Treating your OLAP cloud data warehouse like an OLTP database

Repeat after me: an OLAP data warehouse is NOT an OLTP database. We often see teams performing streaming direct inserts, frequent merges, and more in their cloud data warehouse. These are common practice in an OLTP database, but bad ideas in a cloud data warehouse. The result is poor performance, slow queries, bottlenecked resources, and frustrated end users.

Learn how to optimize for the columnar structure of a cloud data warehouse. Perform bulk loads, use microbatch loading, and minimize massive updates and merges. If possible, create an insert-only workflow in conjunction with a change data capture system. Definitely read, and re-read, the best practices for loading data into your cloud data warehouse. Obey these rules at all costs, and don’t try to second-guess how data should be loaded.
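As one illustration of what “bulk, not row-by-row” can look like, here is a minimal sketch using the BigQuery Python client to load a microbatch of rows as a single load job instead of streaming individual inserts. The project, dataset, and table names are placeholders, and the same idea applies to COPY-style bulk loads in other warehouses.

```python
# Sketch: microbatch loading with the BigQuery Python client instead of
# row-by-row streaming inserts. Project/dataset/table names are placeholders.
import io
import json
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.analytics.events"  # placeholder

def load_microbatch(rows):
    """Bulk load a batch of dicts as newline-delimited JSON in one load job."""
    buf = io.BytesIO("\n".join(json.dumps(r) for r in rows).encode("utf-8"))
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # insert-only
    )
    job = client.load_table_from_file(buf, TABLE_ID, job_config=job_config)
    job.result()  # wait for the load job to complete

# Buffer events (for example, from a change data capture stream) and call
# load_microbatch every few minutes, rather than inserting each row on arrival.
```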

#5: Obsessing over data size in storage

A central element of a cloud data warehouse is the separation of compute and storage. In the cloud, storage is cheap and effectively unlimited. Whereas an on-premises data warehouse had fixed data size limitations, those constraints are largely eliminated in a cloud data warehouse. It’s important to know that storage is much cheaper than compute, which makes it practical to precompute results for frequently run queries.
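For example, here is a minimal sketch of trading cheap storage for expensive compute by materializing a frequently requested rollup into its own table. The raw.orders and reporting.daily_revenue names are hypothetical; run the statement with whichever client your warehouse uses.

```python
# Sketch: precompute a frequently run rollup into a summary table, trading
# cheap storage for repeated compute. Table names are hypothetical.
DAILY_REVENUE_ROLLUP = """
CREATE OR REPLACE TABLE reporting.daily_revenue AS
SELECT
    DATE(order_ts)   AS order_date,
    region,
    SUM(order_total) AS revenue
FROM raw.orders
GROUP BY order_date, region
"""
# Dashboards then read reporting.daily_revenue on every refresh instead of
# re-aggregating raw.orders each time.
```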

Querying and Processing Anti-patterns

#6: Not clustering your data

Since cloud data warehouses are typically column-based and don’t have indexes, querying unpartitioned data triggers a full scan of the selected columns. Get in the habit of using clustering and partitioning, which organize data by keys. Organizing data this way allows the query optimizer to limit the scan size based on query predicates.

Terminology: In Redshift, data is clustered by “sort key.” In Snowflake, clustering is automatic, with the option to manually set cluster keys based on columns that are frequently used in predicates. BigQuery supports both partitioning and clustering as different tiers of data organization.
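As an example of what this looks like in practice, here is a minimal BigQuery-flavored sketch that partitions a table by day and clusters it by the columns most often used in predicates. The table and column names are hypothetical placeholders; Redshift sort keys and Snowflake cluster keys express the same idea with different syntax.

```python
# Sketch: BigQuery DDL that partitions by day and clusters by the columns
# most often used in filters. Table and column names are hypothetical.
CREATE_CLUSTERED_EVENTS = """
CREATE TABLE analytics.events
(
    event_ts   TIMESTAMP,
    user_id    STRING,
    event_type STRING,
    payload    STRING
)
PARTITION BY DATE(event_ts)      -- prune whole partitions on date predicates
CLUSTER BY user_id, event_type   -- co-locate rows for common filters
"""
# A query that filters on DATE(event_ts) and user_id scans only the matching
# partitions and clustered blocks rather than the entire table.
```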

#7: Using ETL instead of ELT

ETL is the traditional method of loading data into a data warehouse. While it still has a place, it’s no longer necessary with cloud data warehouses. Since we no longer need to worry about data size in a cloud data warehouse, the downsides of ETL outweigh the benefits: it imposes modeling constraints that presuppose business logic is static, reduces analytical flexibility, and relies on a potentially resource-constrained transformation layer.

Instead, simply extract data from your source systems, load it into a staging area in your cloud data warehouse, then transform it within the cloud data warehouse. Working with an ELT workflow leverages the scalability made possible by the separation of compute and storage.
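Here is a minimal sketch of that flow using the BigQuery Python client: load raw files straight into a staging table, then apply business logic with a SQL transformation that runs on the warehouse’s own compute. The bucket, dataset, and table names are placeholders, as is the specific transformation.

```python
# Sketch of an ELT flow: land raw data in staging, then transform in-warehouse.
# Bucket, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Extract + Load: copy raw exports straight into a staging table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/orders_*.json",   # placeholder bucket
    "my-project.staging.orders_raw",          # placeholder table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    ),
)
load_job.result()

# 2. Transform: apply business logic using the warehouse's own compute.
client.query("""
    CREATE OR REPLACE TABLE analytics.orders AS
    SELECT
        order_id,
        CAST(order_ts AS TIMESTAMP) AS order_ts,
        LOWER(status)               AS status,
        total_amount
    FROM staging.orders_raw
    WHERE total_amount IS NOT NULL
""").result()
```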

#8: Using SELECT *

SELECT * is an extremely common way to preview all of the columns in traditional databases. Everyone does it. However, in a column-based cloud data warehouse, SELECT * will scan every column. Depending on the pricing model of your cloud data warehouse (some charge by the amount of compute utilized in a query, others by the number of bytes scanned), this is a quick way to rack up huge charges.

Here’s a quick story. At one of our clients, an analyst created a report that ran a query using SELECT *, which scanned several terabytes of data. Users of this report ran variations of the query, so the results were never cached for reuse. At the end of the month, the client received a bill equivalent to several months of normal billing. Needless to say, the analyst very quickly rewrote the query to bring its cost down.

To avoid a surprise bill, get in the habit of using the ‘preview’ capability of your data warehouse: in BigQuery, you can preview a table for free; in Snowflake you need to have a warehouse running, but it is still significantly cheaper. Also, only query the columns you need. Instead of running SELECT *, select the specific columns your report uses. This significantly reduces the amount of data you scan and uses fewer compute resources behind the scenes.
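Here is a minimal sketch of that habit with the BigQuery Python client: estimate the scan size with a dry run, and select only the columns the report needs. The table and column names are placeholders; other warehouses expose similar plan or cost estimates.

```python
# Sketch: estimate scan size with a BigQuery dry run before executing, and
# select only the needed columns. Table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT customer_id, order_date, total_amount   -- only the columns we need
FROM analytics.orders
WHERE order_date >= DATE '2023-01-01'
"""

# A dry run validates the query and reports bytes scanned without running it.
dry_run = client.query(
    sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
)
print(f"Query would scan {dry_run.total_bytes_processed / 1e9:.2f} GB")

# Only execute once the estimated scan size looks reasonable.
rows = client.query(sql).result()
```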

Other tips: review the query plan if you have a query that’s consuming inordinate amounts of resources, and set billing alerts for your cloud data warehouse so you’re notified when sudden spikes occur.

#9: Assuming bottlenecks are about data size

Coming from the on-prem world, it is understandable to attribute slow responses and lags to too much data and not enough space. This assumption does not translate to the cloud data warehouse. In our experience, bottlenecks are rarely related to storage space. Rather, they are due to badly optimized queries, poor schema choices, inadequate compute, and the like. As indicated above, look at your query plan and understand how the query is executing. Is there something you can tweak in your query to make it run better? Do this before scaling up your cloud data warehouse’s compute resources, which can be costly and unnecessary. Only scale up the size of the warehouse as a last line of defense.

Snowflake pro tip: If you’re using Snowflake, you can get a lot of mileage out of a small warehouse. Before scaling up, consider rethinking your queries, precomputing common queries, and modifying your data structures.
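For instance, here is a minimal sketch of checking a query plan in Snowflake with EXPLAIN before reaching for a bigger warehouse. The connection parameters and table names are placeholders.

```python
# Sketch: inspect a Snowflake query plan with EXPLAIN before resizing the
# warehouse. Connection parameters and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # placeholder credentials
    user="my_user",
    password="...",
    warehouse="XSMALL_WH",
)
cur = conn.cursor()

# EXPLAIN returns the plan (joins, partition pruning, scan estimates)
# without executing the query itself.
cur.execute("""
    EXPLAIN
    SELECT c.customer_segment, SUM(f.order_total) AS revenue
    FROM fact_orders AS f
    JOIN dim_customer AS c ON f.customer_key = c.customer_key
    GROUP BY c.customer_segment
""")
for row in cur.fetchall():
    print(row)
```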

Summary

To recap: get to know your data warehouse. Understand how to optimize the separation of storage and compute, query a columnar database, work with modern data structures, and more. We guarantee that spending a few weeks getting comfortable with the architecture of your data warehouse before jumping all the way in will save you money and time down the road.

For more information, check out our video on Cloud Data Warehouse Anti-patterns.

Maike Wells