Understanding BigQuery Pricing

TL;DR

  • Queries in Google BigQuery are typically several times more expensive than equivalent queries in Amazon Athena.
  • We discuss BigQuery cost, performance, and ecosystem advantages that can help offset these higher query costs.

BigQuery is one of our favorite cloud big data warehouses. It has virtually limitless data storage capacity and delivers extraordinary performance on large data sets. That said, pricing can be problematic if you’re not careful. We frequently hear that query costs for BigQuery and AWS Athena are the same. In this article, we address that misconception and explain how to calculate prices for BigQuery storage and queries; we’ll also discuss ways to leverage BigQuery cost, performance, and ecosystem advantages to help offset these higher costs.

Unlike AWS Redshift, but similar to AWS Athena, BigQuery is serverless in the sense that you don’t need to reserve or spin up resources to run queries. Instead, you pay for data storage and for the amount of data scanned in each query you run. On the surface, Athena and BigQuery appear to be priced almost identically. As one comparison article puts it,

“Google also charges by the amount of data scanned, and the price is the same as for Athena. The storage is $0.02 per GB, which is more or less the same in AWS (the price tiers depends on the overall amount stored).” https://logz.io/blog/bigquery-vs-athena/

This discussion of prices ignores some key details. In practice, prices for queries and storage in Athena are several times cheaper than equivalent prices in BigQuery. Let’s dig into these pricing differences, call out other ancillary costs specific to Athena, and explain why we still generally recommend BigQuery to our clients.

Athena vs BigQuery – A Pricing Comparison

Both Athena and BigQuery bill at $5/TB queried. However, not all terabytes are created equal. Athena charges for bytes read from S3; compressing data to reduce its size thus saves costs for both queries and storage. The Athena pricing documentation (https://aws.amazon.com/athena/pricing/) mentions this strategy specifically: “Compressing your data allows Athena to scan less data.”
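
To make the rate concrete, here is a minimal Python sketch of the arithmetic, using an illustrative 5:1 compression ratio (an assumption for this example, not a benchmark):

# Approximate on-demand query cost at $5 per TB scanned, treating 1 TB as 2**40 bytes.
PRICE_PER_TB_USD = 5.0
TB = 1024 ** 4

def query_cost_usd(bytes_scanned):
    """Rough cost of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TB * PRICE_PER_TB_USD

uncompressed = 1 * TB            # BigQuery bills on uncompressed bytes
compressed = uncompressed // 5   # Athena bills on what it reads from S3 (here, 5:1 compression)

print(f"BigQuery scan of 1 TB uncompressed:  ${query_cost_usd(uncompressed):.2f}")   # $5.00
print(f"Athena scan of the same data at 5:1: ${query_cost_usd(compressed):.2f}")     # $1.00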

Behind the scenes, BigQuery also compresses data, but this is invisible to the user. Here’s the big difference from Athena: BigQuery bills storage and queries by uncompressed bytes (https://cloud.google.com/bigquery/pricing#data), and the difference in calculated data size can be dramatic. For example, I recently encountered a column in some client data stored in BigQuery in which every row contained the same repeated value.

Running the query

SELECT activitytype FROM `sample-table`

in the BigQuery console editor reports an estimated query size of 1.3 GB. The table contains over 200 million rows, so the raw data size makes sense, but a quick experiment with gzip compression achieves a compression ratio of better than 100 to 1.
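
The console isn’t the only way to get that estimate. The sketch below, which assumes the google-cloud-bigquery client library and uses a placeholder table name, asks BigQuery for a dry-run estimate of bytes scanned and then reproduces the gzip experiment on a synthetic column containing a single repeated value:

import gzip

from google.cloud import bigquery

# Dry-run estimate: BigQuery reports the bytes it would scan without running
# the query or billing for it. The table name below is a placeholder.
client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT activitytype FROM `my-project.my_dataset.sample_table`",
    job_config=job_config,
)
print(f"Estimated bytes scanned: {job.total_bytes_processed:,}")

# Reproduce the compression experiment on synthetic data: one short value
# repeated a million times compresses at well over 100 to 1 under gzip.
raw = ("DEFAULT_ACTIVITY\n" * 1_000_000).encode("utf-8")
compressed = gzip.compress(raw)
print(f"Raw size: {len(raw):,} bytes, gzipped: {len(compressed):,} bytes, "
      f"ratio {len(raw) / len(compressed):.0f}:1")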

Low-cardinality columns, i.e., highly compressible columns containing a small number of distinct values repeated many times, are common in real-world data sets. Noisy numeric data, also very common in practice, doesn’t compress nearly as well, but we typically see compression ratios of 5 to 1 or better for full tables.

To look at another example, we turn to a Bureau of Labor Statistics table in the BigQuery public datasets (bigquery-public-data:bls.employment_hours_earnings). The reported table size in BigQuery is 1.16 GB. Exporting from BigQuery in CSV format with .gz compression yields 141 MB of files. Using Qubole Spark to convert the table into Parquet columnar format with Snappy compression yields an output size of 222 MB. While the csv.gz format gives us a higher compression ratio, the Parquet/Snappy format yields substantially better query performance in Athena and allows us to further limit scan size by paying only for data in the columns that we query. To put this example in pricing terms, querying the full table in BigQuery is roughly five times more expensive than using Athena to query the data in compressed Parquet format.
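
As a back-of-the-envelope check on that factor of five, here is the arithmetic in Python using the sizes above (actual bills also depend on rounding and per-query scan minimums):

# BLS table: BigQuery billable size vs. Parquet/Snappy size scanned by Athena.
PRICE_PER_TB_USD = 5.0
GB_PER_TB = 1024

bigquery_scan_gb = 1.16   # uncompressed size reported by BigQuery
athena_scan_gb = 0.222    # Parquet/Snappy files scanned by Athena

bq_cost = bigquery_scan_gb / GB_PER_TB * PRICE_PER_TB_USD
athena_cost = athena_scan_gb / GB_PER_TB * PRICE_PER_TB_USD

print(f"BigQuery full-table scan: ${bq_cost:.4f}")
print(f"Athena full-table scan:   ${athena_cost:.4f}")
print(f"Cost ratio: {bq_cost / athena_cost:.1f}x")   # roughly 5x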

BigQuery Freebies

Query and storage pricing tell only part of the cost story for BigQuery and Athena. BigQuery gives away a number of services that entail extra charges with AWS/Athena.

  • Data import – BigQuery imports data from Google Cloud Storage to native tables for free. Athena can directly query data from S3 in a variety of formats, but for data that will be queried repeatedly it is a best practice to convert it to a compressed columnar format such as Parquet with Snappy compression. Format conversion in AWS requires running a workload in Glue, Athena or another tool (a minimal conversion sketch follows this list).
  • Data export – BigQuery exports native tables to compressed Avro, CSV and JSON files for free. (Note that BigQuery can read Parquet files, but inconveniently does not export directly to Parquet.)
  • Schema management – standard practice is to manage Athena schemas in the Glue Data Catalog, using a Glue crawler for schema detection. BigQuery can auto-detect schemas for free as part of the ingest process.
  • Query caching – when you run a duplicate query in BigQuery within 24 hours, the database will return cached results at no additional charge.
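
For the format conversion mentioned in the first bullet, any number of tools will do; here is a minimal pandas/pyarrow sketch for small extracts (the file paths are placeholders). At scale, Glue, an Athena CTAS query or Spark is a better fit:

import pandas as pd

# Convert a raw CSV extract to Snappy-compressed Parquet before it is
# queried repeatedly with Athena. Paths are placeholders.
df = pd.read_csv("raw/employment.csv")
df.to_parquet(
    "curated/employment.parquet",
    engine="pyarrow",
    compression="snappy",
    index=False,
)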

Other BigQuery Features We Love

Here are some reasons why we frequently recommend BigQuery to our clients.

  • Ease of management – BigQuery’s project, dataset and table hierarchy makes it exceptionally easy to organize data and manage access. Data management and administration overhead is a key cost driver for any large organization. Athena is an extremely powerful and flexible query engine, but data organization and management is substantially simpler with BigQuery.
  • Flat-rate pricing – organizations querying more than 2 petabytes a month, or 20 petabytes a year, can save by signing up for BigQuery flat-rate pricing, starting at $10,000 per month or $100,000 a year with an annual commitment (the break-even math is sketched after this list).
  • Data sharing – we expect data sharing to grow in importance over the next decade. This can be in the form of open data sets, data shared for collaboration and paid data access. Cloud platforms offer many methods of data sharing, for example ‘requester pays’  S3 buckets and AWS Data Exchange. That being said, we’ve found data sharing in BigQuery to be exceptionally easy. Consumers simply run queries against shared data from their own Google Cloud projects, and can easily union or join with data from other sources. (This is also a key strength of Snowflake.)
  • Integration – BigQuery features deep integration into various Google Cloud Platform data services. We anticipate continued growth of the ecosystem in 2020.
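
The break-even arithmetic behind the flat-rate thresholds mentioned above is straightforward (a sketch that ignores the monthly free tier and storage costs):

# Break-even query volume for BigQuery flat-rate pricing vs. $5/TB on-demand.
PRICE_PER_TB_USD = 5.0
MONTHLY_FLAT_RATE_USD = 10_000
ANNUAL_FLAT_RATE_USD = 100_000

monthly_breakeven_tb = MONTHLY_FLAT_RATE_USD / PRICE_PER_TB_USD  # 2,000 TB ~ 2 PB per month
annual_breakeven_tb = ANNUAL_FLAT_RATE_USD / PRICE_PER_TB_USD    # 20,000 TB ~ 20 PB per year

print(f"Monthly break-even: {monthly_breakeven_tb:,.0f} TB (~{monthly_breakeven_tb / 1000:.0f} PB)")
print(f"Annual break-even:  {annual_breakeven_tb:,.0f} TB (~{annual_breakeven_tb / 1000:.0f} PB)")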

Your Mileage May Vary

Innovation from cloud big data warehouse vendors is driving a sea change in business analytics and data science. Google BigQuery, Amazon Athena, Snowflake, Redshift and Azure Synapse Analytics all offer remarkable technology and performance. The choice of a cloud data warehouse is just one component of an organizational cloud strategy. Also weigh factors such as the operational cost of managing data pipelines and the other technologies that support the data warehouse.

About the Author

Matt Housley is one of the cofounders of Ternary Data, a consulting firm specializing in data engineering and processes. Before founding the company, he worked as a math professor, data scientist, and data engineer. He holds certifications on both GCP and AWS.

Keys To Starting Successful Data Science Projects

It’s still early days in data science, and it’s difficult to know how to get started. Indeed, a 2018 O’Reilly survey finds that half of organizations are still in the “exploring” phase. Given the hype around data science, many companies approach it as a solution looking for a problem. Without a clear use case, very little business value will be gained.

If you’re just starting out on the data science journey, here are some keys to starting successful data science projects.

Step back. Get perspective. What is your true north? How does your data science problem relate to this?

You should always strive to solve problems aligned with your business’s true north. This is easier said than done, especially given the transformational potential of data technologies. You will be tempted to do data science for the sake of data science. Resist that temptation and focus instead on core business problems that advance your true north.

Start with the end in mind. Identify ONE problem (preferably small) that needs to be solved.

Data science projects work best when you focus on a single problem, preferably something small and discrete. This lets you obtain, analyze and model the data without contaminating the results.

Starting with a small, single objective also gives your data team the design patterns and experience needed for more sophisticated projects down the road.

What does “done” look like? What are good results?

Establish a finish line for initial success. Define “done” clearly and succinctly. Your first data science project doesn’t need to be a big win. For example, you may decide to create a simple model to predict customer churn. Decide in advance the metrics you will use to determine if this model is performing well. Without a notion of “done”, it’s easy to wander aimlessly.

Get executive and organizational buy-in.

Lack of executive and organizational buy-in is a major reason that data projects fail. Get support from the top-down and bottom-up, and rally your team around the problem. Communicate regularly with stakeholders and give them skin in the game.

As executives and the organization gain more trust in the business’s ability to execute on data projects, a virtuous cycle of support and data value is created.

Assess your data team, capabilities and resources. Tell them the WHAT. Let the team figure out the HOW.

A good data team will support the functions of data science, data engineering, data analysis, and data architecture. If the team is small, team members may need to play multiple roles, but data engineers, data analysts, and data architects bring as much value as data scientists. It’s all about building a cohesive team with all the essential moving parts.

Once you’re confident that you have the team and resources to execute on your data goals, make sure the team is aligned on WHAT you’re trying to achieve. Let the data team figure out HOW to accomplish this goal. This is as much management 101 as anything else. Make sure that you consult stakeholders who will use the tools day to day before making technology decisions.

Invest in building a solid data foundation.

It’s tempting to jump into a data science project with nothing more than a Python notebook, some libraries (scikit-learn, TensorFlow, PyTorch, etc.), and boundless enthusiasm. For exploratory data analysis or toy data projects, this is fine. But for the business to get real and sustained value from its data initiative, you will need to establish a solid data foundation.

The data science hierarchy of needs is an insightful blueprint for realizing value from your data.

Please understand that it takes substantial investments of organizational effort, capital and time to build a data foundation, and that it is an ongoing journey of continuous improvement. Given the lack of a clear timeline for data ROI, it’s tempting to cut corners or become impatient. But in a world where winning with data is a key competitive advantage, can you afford to take this investment lightly?

Start simple. Use simple algorithms before moving to advanced approaches.

Avoid starting with sophisticated approaches. Just because deep learning is all the rage doesn’t mean it’s the first approach you should take. In many cases, simple approaches like linear or logistic regression work great and are far easier to implement and interpret.
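
As an illustration, a logistic regression churn baseline takes only a few lines with scikit-learn. The sketch below runs on synthetic data; your features, labels and evaluation metric will differ:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for customer features and a churned/retained label.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A simple, interpretable baseline -- establish this before reaching for deep learning.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Baseline ROC AUC: {auc:.3f}")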

Plan for production from the beginning of the project.

Will your model be exposed through a web service? Should it trigger emails to customers? In and of themselves, models do not solve business problems. Decide early what it means to put your models into production and make sure that your team has the resources to deploy. Data science done in isolation on laptops does not add value to the business. A production solution should include automated data pipelines, model training and serving, as well as monitoring and alerting for the model, pipelines and production service.

Watch out for resume-driven development.

New data tools pop up on a daily basis, and it’s easy for data teams to get shiny object syndrome for the latest and coolest data tool. This is a dangerous temptation. A dirty secret of engineers and data scientists is that they may choose tools to improve their resumes, not because those tools are the best fit for the problem at hand. Resume-driven development is a real thing, and it can have unintended consequences. Make sure your data team has a process for vetting and implementing data tools that align with the objectives and problems being solved. Committing up front to best-fit tools helps you avoid a suboptimal data stack down the road.

Avoid technology for the sake of technology. Use the simplest solution that will deliver value to the business. See what you can glean from regression before applying more sophisticated ML. If you need to forecast website traffic, start with Facebook Prophet rather than building a sophisticated bespoke time series model. Use a fully managed cloud ML platform like AWS SageMaker instead of standing up a TensorFlow cluster on Kubernetes. Pursue custom, cutting edge solutions only after you’ve identified clear business value beyond a simple solution. Take into account long term operational and maintenance costs, not just the upfront resources required for a proof of concept.
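
For instance, a first-pass website traffic forecast with Prophet is only a few lines. This sketch assumes the prophet package (formerly fbprophet) and a hypothetical daily_visits.csv file with the ds/y columns Prophet expects:

import pandas as pd
from prophet import Prophet

# Daily traffic with a 'ds' (date) column and a 'y' (visit count) column.
# The file name is a placeholder for your own data.
df = pd.read_csv("daily_visits.csv", parse_dates=["ds"])

model = Prophet()  # sensible defaults: trend plus weekly and yearly seasonality
model.fit(df)

# Forecast the next 90 days and inspect the point forecast and uncertainty interval.
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())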

In Conclusion.

Your data science journey will create value for your business if you take a smart, methodical approach. Build a solid data foundation. Experiment, implement, learn, and iterate. Best of all, have fun. You’re a pioneer in the early days of data science.

If you are working on data science projects and want some advice on taking it into production, don’t hesitate to reach out.