Queries in Google BigQuery are several times more expensive than the same queries in Amazon Athena.
We also discuss BigQuery cost, performance, and ecosystem advantages that can offset these higher query costs.
BigQuery is one of our favorite cloud big data warehouses. It has virtually limitless data storage capacity and delivers extraordinary performance on large data sets. That said, pricing can be a bit problematic if you’re not careful. We frequently hear that query costs for BigQuery and AWS Athena are the same. In this article, we address that misconception and show how to calculate prices for BigQuery storage and queries; we also discuss ways to leverage BigQuery’s cost, performance, and ecosystem advantages to help offset these higher query costs.
Unlike AWS Redshift, but similar to AWS Athena, BigQuery is serverless in the sense that you don’t need to reserve or spin up resources to run queries. Instead, you pay for data storage and for the amount of data scanned in each query you run. On the surface, Athena and BigQuery appear to be priced almost identically. As one comparison article puts it,
“Google also charges by the amount of data scanned, and the price is the same as for Athena. The storage is $0.02 per GB, which is more or less the same in AWS (the price tiers depends on the overall amount stored).” (https://logz.io/blog/bigquery-vs-athena/)
This discussion of prices ignores some key details. In practice, prices for queries and storage in Athena are several times cheaper than equivalent prices in BigQuery. Let’s dig into these pricing differences, call out other ancillary costs specific to Athena, and explain why we still generally recommend BigQuery to our clients.
Athena vs BigQuery – A Pricing Comparison
Both Athena and BigQuery bill at $5/TB queried. However, not all terabytes are created equal. Athena charges for bytes read from S3; compressing data to reduce its size thus saves costs for both queries and storage. The Athena pricing documentation (https://aws.amazon.com/athena/pricing/) mentions this strategy specifically: “Compressing your data allows Athena to scan less data.”
Behind the scenes, BigQuery also compresses data, but this is invisible to the user. Here’s the big difference from Athena: storage and queries are billed by decompressed bytes (https://cloud.google.com/bigquery/pricing#data), and the difference in calculated data size can be dramatic. For example, I recently encountered a column in some client data stored in BigQuery that consists of a single value repeated in every row. Typing the query
SELECT activitytype FROM `sample-table`
in the BigQuery console editor returns a query size of 1.3 GB. The table contains over 200 million rows, so the raw data size makes sense, but a quick experiment with gzip compression achieves a ratio of over 100 to 1.
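The compression experiment above is easy to reproduce. Here is a minimal sketch, scaled down from 200 million rows and using a made-up value for the repeated column, showing how dramatically a single-value column shrinks under gzip:

```python
# Sketch of the compression experiment described above: a column containing
# one repeated value compresses dramatically with gzip. Row count and the
# "PageView" value are illustrative, not from the client data.
import gzip

rows = 1_000_000
raw = ("PageView\n" * rows).encode("utf-8")  # stand-in for the activitytype column

compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)
print(f"raw: {len(raw):,} bytes, gzipped: {len(compressed):,} bytes, "
      f"ratio: {ratio:.0f}:1")
```

On repetitive data like this, the ratio comfortably exceeds 100 to 1, consistent with what we saw on the full table.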
Low-cardinality columns, i.e., highly compressible columns containing a small number of distinct values repeated many times, are common in real-world data sets. Noisy numeric data, also very common in practice, doesn’t compress nearly as well, but we typically see compression ratios of 5 to 1 or better for full tables.
To look at another example, we turn to a Bureau of Labor Statistics table in the BigQuery public datasets. (bigquery-public-data:bls.employment_hours_earnings). The reported table size in BigQuery is 1.16 GB. Exporting from BigQuery in CSV format with .gz compression yields 141 MB of files. Using Qubole Spark to convert the table into Parquet columnar format with Snappy compression yields an output size of 222 MB. While csv.gz format gives us a higher compression ratio, Parquet/Snappy format yields substantially better query performance in Athena and allows us to further limit scan size by only paying for data in columns that we query. To put this example in pricing terms, querying the full table in BigQuery is roughly five times more expensive than using Athena to query the data in compressed Parquet format.
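To make the pricing claim concrete, here is the back-of-envelope arithmetic for the BLS table above, at the shared $5/TB rate. The sizes come from the article; the Athena figure assumes the Parquet/Snappy copy:

```python
# Back-of-envelope pricing for the BLS example: $5 per TB scanned.
# BigQuery bills on the 1.16 GB logical (uncompressed) table size;
# Athena bills on the 222 MB Parquet/Snappy files in S3.
PRICE_PER_TB = 5.00
TB = 1024 ** 4  # bytes per terabyte (binary)

bigquery_bytes = 1.16 * 1024 ** 3
athena_bytes = 222 * 1024 ** 2

bq_cost = bigquery_bytes / TB * PRICE_PER_TB
athena_cost = athena_bytes / TB * PRICE_PER_TB
print(f"BigQuery full scan: ${bq_cost:.5f}")
print(f"Athena full scan:   ${athena_cost:.5f}")
print(f"ratio: {bq_cost / athena_cost:.1f}x")
```

The ratio works out to roughly 5x, matching the “five times more expensive” figure above.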
Query and storage pricing tell only part of the cost story for BigQuery and Athena. BigQuery includes for free a number of services that entail extra charges in the AWS/Athena ecosystem.
Data import – BigQuery imports data from Google Cloud Storage to native tables for free. Athena can directly query data from S3 in a variety of formats, but it is a best practice to convert data to a compressed columnar format like Parquet with Snappy compression for data that will be queried repeatedly. Format conversion in AWS requires running a workload in Glue, Athena or another tool.
Data export – BigQuery exports native tables to compressed Avro, CSV and JSON files for free. (Note that BigQuery can read Parquet files, but inconveniently does not export directly to Parquet.)
Schema management – standard practice is to manage Athena schemas in Glue Catalog, using a Glue crawler scan for detection. BigQuery can auto-detect schemas for free as part of the ingest process.
Query caching – when you run a duplicate query in BigQuery within 24 hours, the database will return cached results at no additional charge.
Other BigQuery Features We Love
Here are some reasons why we frequently recommend BigQuery to our clients.
Ease of management – BigQuery’s project, dataset and table hierarchy makes it exceptionally easy to organize data and manage access. Data management and administration overhead is a key cost driver for any large organization. Athena is an extremely powerful and flexible query engine, but data organization and management is substantially simpler with BigQuery.
Flat rate pricing – organizations querying more than 2 petabytes a month, or 20 petabytes a year, can save by signing up for BigQuery flat rate pricing, starting at $10,000 per month or $100,000 a year with an annual commitment.
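The break-even points above fall straight out of the on-demand rate:

```python
# Break-even check for flat-rate pricing: at $5/TB on-demand, how much
# scanning justifies the $10,000/month or $100,000/year flat rate?
ON_DEMAND_PER_TB = 5.00
monthly_flat = 10_000
annual_flat = 100_000

breakeven_tb_per_month = monthly_flat / ON_DEMAND_PER_TB  # 2,000 TB ~ 2 PB
breakeven_tb_per_year = annual_flat / ON_DEMAND_PER_TB    # 20,000 TB ~ 20 PB
print(breakeven_tb_per_month, breakeven_tb_per_year)
```

That is, the monthly plan pays for itself at about 2 PB scanned per month, and the annual commitment at about 20 PB per year.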
Data sharing – we expect data sharing to grow in importance over the next decade. This can be in the form of open data sets, data shared for collaboration and paid data access. Cloud platforms offer many methods of data sharing, for example ‘requester pays’ S3 buckets and AWS Data Exchange. That being said, we’ve found data sharing in BigQuery to be exceptionally easy. Consumers simply run queries against shared data from their own Google Cloud projects, and can easily union or join with data from other sources. (This is also a key strength of Snowflake.)
Integration – BigQuery features deep integration into various Google Cloud Platform data services. We anticipate continued growth of the ecosystem in 2020.
Your Mileage May Vary
Innovation from cloud big data warehouse vendors is driving a sea change in business analytics and data science. Google BigQuery, Amazon Athena, Snowflake, Redshift and Azure Synapse Analytics all offer remarkable technology and performance. The choice of a cloud data warehouse is just one component of an organizational cloud strategy. Consider other factors like operational costs to manage data pipelines and other technologies that support the data warehouse.
About the Author
Matt Housley is one of the cofounders of Ternary Data, a consulting firm specializing in data engineering and processes. Before founding the company, he worked as a math professor, data scientist, and data engineer. He holds certifications on both GCP and AWS.
The Ternary crew spent a crazy week in San Francisco, attending Google Next 2019 with almost 40,000 of our closest friends. Given the 122+ announcements, we obviously didn’t catch everything.
Here’s our roundup of our favorite data-related announcements at Next.
AI is Google’s sweet spot. While other cloud vendors are making great progress with their AI offerings, Google stands on its own. One area where Google is pushing boundaries is AutoML. For the uninitiated, think of AutoML as “feed your data, get predictions”. No data scientist required.
How is this working out? Well, during Next, Google’s AutoML earned second place in a one-day invitational Kaggle competition. Not bad for a self-service product in a domain where PhD-level machine learning skills were once required.
At Ternary, we truly believe that AutoML is the wave of the future. It greatly simplifies the ceremony around most data science use cases. After all, it’s just math 😉
The other impressive shift is Google’s move toward AI-powered business solutions. In the past, Google tended to pitch their services as “we know AI, here’s what you should use too”. With new products such as Contact Center AI, Recommendations, and much more, it’s clear that Google has wised up and is moving toward solving real-world business problems with managed services.
AutoML Tables (Beta)
Machine learning on structured data is notoriously difficult. We will go out on a limb and say it’s one of the harder areas of Automated ML. Why? Because you’re usually dealing with data input by humans in some janky user interface. Humans make errors. For example, Utah may be spelled “UT”, “Utah”, “UTAH”, “Beehivestate”, etc.
Needless to say, structured data is a mess. It’s one thing to make a custom model by hand. The capability to do machine learning on any structured dataset is in a different universe of difficulty. Some of us have been part of a few serious endeavors to solve this, and it hasn’t gotten easier over the years. Most important, structured data is what much of the business world depends on, whether it’s ERP, BI, or other critical systems.
Enter AutoML Tables, Google’s own high-powered solution for machine learning on structured tabular data. The process is pretty straightforward – point Tables at your data source, give it a target feature, and let it predict. It offers classification and regression; we even noticed it allows for time series forecasting, which is a nice boon. When we chatted with an engineer on the Tables team, he indicated that it uses deep learning behind the scenes. Since that’s the case, more data is better. If you’re dealing with a super small dataset (less than 100K rows), you may be better off using Sklearn or similar classical ML packages.
As a beta product, Tables has a few drawbacks we hope get corrected soon. Training costs $19.32 per hour. Not cheap, though certainly not as expensive as a data scientist. The issue we noticed is that (as far as we can tell) AutoML spins up 92 n1-standard-4 servers, regardless of data size. This may be overkill for smaller datasets. Again, if you have a small dataset, it may be easier to use a classical ML library.
The QuickStart example dataset took 2 hours to train on a dataset of 45K rows and 16 columns. That’s quite a long time. You can preset a time limit for training, but this must be done before training starts. After training starts, we couldn’t find a way to stop the process. Depending on your dataset, it may take… a while… for training to complete. This could get expensive.
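For a sense of scale, the cost of that QuickStart run is simple arithmetic on the hourly rate quoted above:

```python
# Rough cost of the QuickStart training run: 2 hours at the $19.32/hour
# AutoML Tables training rate quoted above.
RATE_PER_HOUR = 19.32
hours = 2
cost = RATE_PER_HOUR * hours
print(f"${cost:.2f}")  # $38.64
```

Cheap for a one-off experiment, but a long run on a large dataset with no stop button could add up quickly.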
Despite these quirks, Tables produces a great model, complete with diagnostic metrics such as F-scores, a confusion matrix, and more. Like all AutoML products, Google does a lot of the hard work for you to set up API endpoints against a deployed model. Overall, Tables is going to be a force to be reckoned with in the hot space of automated machine learning on tabular data.
AI Hub – Google now offers a “one stop shop for everything AI”, greatly simplifying users’ ability to collaborate and access AI tools.
AI Platform – In the past, Google’s AI tools were scattered around ML Engine and Datalab. The new AI Platform consolidates everything – AI Hub, data labeling, Kubeflow, notebooks, jobs, models – under one roof. No more searching around for various parts to build an ML pipeline.
Kubeflow – Deploy ML workflows on Kubernetes. This is now included in AI Platform, allowing you to write your ML pipeline once and run it anywhere.
Recommendations AI – Personalized recommendations is a huge area of business data science. There’s not an app that won’t benefit from recommendations. In the past, data scientists would need to build recommendation engines from scratch. Now, just give Recommendations AI your product and customer data and let it set up a prediction API for you. We recommend you try this (ok, bad joke).
The Ternary crew are huge BigQuery fans. The number of announcements around BigQuery was downright mind-blowing. It’s clear that Google wants BigQuery to be the hub for all of your data. Here are several highlights.
BI Engine (Beta)
Oftentimes, the value of analytics depends on query execution times. Slow-running queries increase the time to insight and reduce value to the business. Fast, preferably real-time, analytics is quickly becoming an expectation.
BI Engine is a “fast, in-memory analysis service”. No longer do you need to wait for slow running queries. Now you can get highly concurrent sub-second queries in BigQuery. You can even hook up BI Engine to BigQuery Streaming to get genuine real-time dashboards on streaming data.
Want to write a streaming or batch data pipeline, but don’t know how to write Python or Java? If you know SQL, you can now take advantage of Dataflow SQL. Simply write a SQL query in BigQuery, and a Dataflow job is set up in the process. This seems like a great way for analysts to dive in and write their own pipelines. Given the opinions around whether data scientists and analysts should write their own ETL, it will be interesting to see where Dataflow SQL goes.
Connecting data is a pain point for data pipeline developers. But that’s part of the job. BigQuery’s Data Transfer Service has been a great tool for importing Google Ad and Analytics data into BigQuery. And while you could still connect your apps using Fivetran and other services, it still required a few steps. Thankfully, Google announced 100+ SaaS connectors for BigQuery. Connect to your favorite apps via Supermetrics and Fivetran. This will definitely ease the ceremony of setting up data connections to BigQuery, allowing you to focus on analysis.
BigQuery’s flat rate pricing got a whole lot more attractive. Instead of $40,000 USD for 2000 slots, now you can pay $10,000 for 500 slots. This makes flat rate usage of BigQuery a lot more affordable.
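It’s worth noting what actually changed here. A quick check shows the per-slot price is the same; it’s the minimum commitment that dropped fourfold:

```python
# Per-slot comparison of the old and new flat-rate minimums quoted above.
old_price, old_slots = 40_000, 2000
new_price, new_slots = 10_000, 500

print(old_price / old_slots, new_price / new_slots)  # 20.0 20.0 per slot
```

In other words, flat rate isn’t cheaper per slot, but the much lower entry point puts it within reach of far more organizations.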
Let’s face it. There are a ton of great third party cloud ETL tools like Matillion, Talend, and Informatica. We haven’t seen a cloud provider offer their own decent ETL tool, except for Azure’s Data Factory. And if you’re in Google Cloud, your options are pretty slim. Until now.
Google Cloud’s Data Fusion is the result of Google’s acquisition of Cask Data last year, and is based on the open source CDAP framework. It offers GCP users a full-fledged ETL/ELT tool that works with both streaming and batch workloads. We’ve tested the service (still in beta), and are mostly impressed. Right now it runs on Google’s Dataproc, but we suspect it will also run on Dataflow at some point. Pricing is tiered: a basic edition (batch only) and an enterprise edition (batch and streaming). It features 100+ connectors, so you can be productive with your favorite data sources and sinks in no time. You can also wrangle your data (much like Trifacta/Dataprep), which takes the headache out of nobody’s favorite chore – data cleaning.
We will start recommending Data Fusion to Google Cloud users as a full-fledged ETL/ELT tool. It’s a complete tool that will satisfy the pipeline needs of nearly every company. Dataprep is still available for ad-hoc data cleaning, and Dataflow/Dataproc remain options if you’d rather code these solutions yourself.
Got a couple of questions for you – do you know where all of your data assets are stored, and what they mean for your business? And do you practice data governance? Most companies don’t have good answers to these questions besides “no” or “maybe”.
Meanwhile, data continues to grow. Data sprawl is a reality for many companies. In an age of increasing data compliance and protections, data governance is no longer just a nice to have.
Google Cloud’s Data Catalog is a managed data catalog and governance tool that “offers a unified view of all datasets”. It tracks metadata about your data, allows users to tag data, and offers excellent search and discovery. Data Catalog also integrates with Google’s DLP for enhanced data privacy tracking.
As this product becomes more fully baked, we expect to recommend it to our clients. Ternary takes data cataloging and governance very seriously. So should you.
The first day’s Next keynote kicked off with Anthos, Google’s hybrid/multi-cloud platform. Now, you can run your apps in Google Kubernetes Engine (GKE), AWS, Azure, or your on-prem environment with unified management in GCP. This truly opens up the world for applications to be “write once, run anywhere”.
We’re particularly excited for Anthos in the Rockies. We see two types of customers – those on AWS, and those with on-prem infrastructure. Many of the companies we see in our area are still on-prem, figuring out how to containerize their applications, and plan to move to the cloud very soon. Something like Anthos makes this transition to hybrid cloud a lot easier. We’re looking forward to helping our customers use Anthos to migrate their data workloads to any cloud they desire.
Also announced during the first day was Cloud Run, which fills a gap between Cloud Functions and App Engine. Just point Cloud Run at your hosted container, deploy, and you get a fully functioning application with an API endpoint (for up to 15 minutes of runtime). I heard a rumor on a recent Google Cloud podcast that somebody implemented the video game Doom on Cloud Run.
Another new product is Cloud Code (get used to the theme of Cloud this, Cloud that), making it easier to build and deploy Kubernetes applications from your IDE. For anyone writing apps for Kubernetes, this will definitely help with productivity.
Qubole, Ternary’s favorite 3rd party self service data platform, just announced a partnership with Google Cloud. We’re excited because this gives Google Cloud users a one-stop shop for a unified data science and data engineering experience, complete with a clean and simple UX. You can start today with Qubole in the Google Cloud marketplace.
While some companies are getting bad marks for their behavior around open source, Google Cloud is making open source-centric companies first class citizens in the Google ecosystem. This is a great move that both makes excellent business sense for all parties and keeps the open source ecosystem moving forward. We support open source whenever possible, and so should you.
Sorry, bad pun. But seriously. Google Cloud came a long way with its offerings this year. Their data tools and ecosystem is very robust, and will add a ton of value to many companies. The Ternary crew can’t wait to get these tools into the hands of our customers. Can’t wait to see you all at Next 2020!
Lately, we’re seeing a lot of interest in Google Cloud’s BigQuery. Whether it’s the pay-as-you-go, serverless model, the built-in machine learning capabilities, or the seamless integration with Google Analytics 360 and other advertising data sources, BigQuery is a complete data warehouse solution.
If your company is interested in learning more about how BigQuery can take analytics to the next level, please contact us to arrange a FREE workshop and training.
Our friends at Google Cloud just announced several new certifications. This is welcome news, especially since Google Cloud is getting a lot of traction both globally and here in Utah.
Ternary Data are big fans of cloud certifications (AWS, Google Cloud, Azure). Earning any of these certifications is no joke. Across the board, the exams are notoriously difficult. If you can earn a cloud certification, it’s clear that you are competent and vetted in that area.
Ternary Data will soon be offering a crash course for both the Professional Cloud Architect and Professional Data Engineer certifications. We hope that people – especially in Utah – jump on getting Google Cloud certified.
Click here for more information on how you can jumpstart your skills with a new Google Cloud certification.
A couple of days ago, we gave a cloud-agnostic talk at the SLC Google Office about the various data technologies out there. Needless to say, there are a lot. We split the universe of data flow into four components.
Process and Prepare
Analyze and Automate
Each phase of the above flow will be a topic for future articles, as there’s quite a bit to cover.
We want to provide our cheat sheet of the data technologies that address each phase of the flow above.