Queries in Google BigQuery are several times more expensive than the same queries in Amazon Athena.
We discuss other BigQuery cost, performance and ecosystem advantages that can offset these higher costs.
BigQuery is one of our favorite cloud big data warehouses. It has virtually limitless data storage capacity and delivers extraordinary performance on large data sets. That said, pricing can be a bit problematic if you’re not careful. We frequently hear that query costs for BigQuery and AWS Athena are the same. In this article, we will address that misconception and understand how to calculate prices for BigQuery storage and queries; we’ll also discuss ways to leverage BigQuery cost, performance, and ecosystem advantages to help offset these higher costs.
Unlike AWS Redshift, but similar to AWS Athena, BigQuery is serverless in the sense that you don’t need to reserve or spin up resources to run queries. Instead, you pay for data storage and for the amount of data scanned in each query you run. On the surface, Athena and BigQuery appear to be priced almost identically. As one comparison article puts it,
“Google also charges by the amount of data scanned, and the price is the same as for Athena. The storage is $0.02 per GB, which is more or less the same in AWS (the price tiers depends on the overall amount stored).”https://logz.io/blog/bigquery-vs-athena/
This discussion of prices ignores some key details. In practice, prices for queries and storage in Athena are several times cheaper than equivalent prices in BigQuery. Let’s dig into these pricing differences, call out other ancillary costs specific to Athena, and explain why we still generally recommend BigQuery to our clients.
Athena vs BigQuery – A Pricing Comparison
Both Athena and BigQuery bill at $5/TB queried. However, not all terabytes are created equal. Athena charges for bytes read from S3; compressing data to reduce its size thus saves costs for both queries and storage. The Athena pricing documentation (https://aws.amazon.com/athena/pricing/) mentions this strategy specifically: “Compressing your data allows Athena to scan less data.”
Behind the scenes, BigQuery also compresses data, but this is invisible to the user. Here’s the big difference with Athena – storage and queries are billed by decompressed bytes (https://cloud.google.com/bigquery/pricing#data), and the difference in calculated data size can be dramatic. For example, I recently encountered a column like this in some client data stored in BigQuery.
In fact, this entire column consists of a single repeated value. Typing the query
SELECT activitytype FROM `sample-table`
in the BigQuery console editor returns a query size of 1.3 GB. The table contains over 200 million rows, so the raw data size makes sense, but a quick experiment with gzip compression achieves a ratio of over 100 to 1.
Low cardinality columns, i.e., highly compressible columns containing a small number of distinct values repeated many times, are common in real world data sets. Noisy numeric data, also very common in practical data, doesn’t compress nearly so well, but we typically see compression ratios of 5 to 1 or better for full tables.
To look at another example, we turn to a Bureau of Labor Statistics table in the BigQuery public datasets. (bigquery-public-data:bls.employment_hours_earnings). The reported table size in BigQuery is 1.16 GB. Exporting from BigQuery in CSV format with .gz compression yields 141 MB of files. Using Qubole Spark to convert the table into Parquet columnar format with Snappy compression yields an output size of 222 MB. While csv.gz format gives us a higher compression ratio, Parquet/Snappy format yields substantially better query performance in Athena and allows us to further limit scan size by only paying for data in columns that we query. To put this example in pricing terms, querying the full table in BigQuery is roughly five times more expensive than using Athena to query the data in compressed Parquet format.
Query and storage pricing tell only part of the cost story for BigQuery and Athena. BigQuery gives away a number of services that entail extra charges with AWS/Athena.
Data import – BigQuery imports data from Google Cloud Storage to native tables for free. Athena can directly query data from S3 in a variety of formats, but it is a best practice to convert data to a compressed columnar format like Parquet with Snappy compression for data that will be queried repeatedly. Format conversion in AWS requires running a workload in Glue, Athena or another tool.
Data export – BigQuery exports native tables to compressed Avro, CSV and JSON files for free. (Note that BigQuery can read Parquet files, but inconveniently does not export directly to Parquet.)
Schema management – standard practice is to manage Athena schemas in Glue Catalog, using a Glue crawler scan for detection. BigQuery can auto-detect schemas for free as part of the ingest process.
Query caching – when you run a duplicate query in BigQuery within 24 hours, the database will return cached results at no additional charge.
Other BigQuery Features We Love.
Here are some reasons why we frequently recommend BigQuery to our clients.
Ease of management – BigQuery’s project, dataset and table hierarchy makes it exceptionally easy to organize data and manage access. Data management and administration overhead is a key cost driver for any large organization. Athena is an extremely powerful and flexible query engine, but data organization and management is substantially simpler with BigQuery.
Flat rate pricing – organizations querying more than 2 petabytes a month, or 20 petabytes a year, can save by signing up for BigQuery flat rate pricing, starting at $10,000 per month or $100,000 a year with an annual commitment.
Data sharing – we expect data sharing to grow in importance over the next decade. This can be in the form of open data sets, data shared for collaboration and paid data access. Cloud platforms offer many methods of data sharing, for example ‘requester pays’ S3 buckets and AWS Data Exchange. That being said, we’ve found data sharing in BigQuery to be exceptionally easy. Consumers simply run queries against shared data from their own Google Cloud projects, and can easily union or join with data from other sources. (This is also a key strength of Snowflake.)
Integration – BigQuery features deep integration into various Google Cloud Platform data services. We anticipate continued growth of the ecosystem in 2020.
Your Mileage May Vary
Innovation from cloud big data warehouse vendors is driving a sea change in business analytics and data science. Google BigQuery, Amazon Athena, Snowflake, Redshift and Azure Synapse Analytics all offer remarkable technology and performance. The choice of a cloud data warehouse is just one component of an organizational cloud strategy. Consider other factors like operational costs to manage data pipelines and other technologies that support the data warehouse.
About the Author
Matt Housley is one of the cofounders of Ternary Data, a consulting firm specializing in data engineering and processes. Before founding the company, he worked as a math professor, data scientist, and data engineer. He holds certifications on both GCP and AWS.
Ternary Data heavily focuses on data engineering. So, it makes sense that we are keenly interested where things are heading for the field in 2019. Instead of doing the typical new year ritual of making crazy predictions, we consider several big themes for data engineering in 2019.
Data will still be hard for most companies
Serverless will gain stronger adoption
Containerization will accelerate multi/hybrid cloud
DataOps will become more widely adopted
Data lineage will grow in importance
Ethics issues from data science will become a data engineering problem
The reasons vary, but the theme for failure is mostly the same. Chief among these reasons are lack of executive and organizational buy-in, misaligned expectations, weak data culture, lack of resources, and the talent gap. Data is hard because change management is hard. Fundamentally change management is a people problem, not a technology problem. Therefore, we expect most companies to be challenged with data well past 2019.
There are ways to win with data. We advise our clients to get back to basics with data, build a solid foundation, and start small. We wrote some tips on starting successful data projects, which you can find here.
Serverless will gain stronger adoption
Serverless is rightfully red hot. The benefits of serverless are very clear – less undifferentiated heavy lifting and quicker time to value. The role of the data engineer is changing, from babysitting on prem Hadoop and Spark clusters to managing a plethora of interconnecting services in the cloud. Allowing data engineers to focus on code, DataOps, and integrating with data scientists is game changing and adds a ton of value.
The big cloud providers are transforming toward serverless offerings. AWS and Google Cloud have both been at the forefront of serverless for over a decade (notably, AWS launched S3 in 2006, Google Cloud launched App Engine in 2008). Azure is keeping pace. Where the big clouds go, so goes the future of cloud infrastructure. Although clouds will still continue offering managed solutions like AWS EMR and GCP Dataproc, the future direction is clearly toward serverless.
If your company is still on the fence about serverless (or managed services), now is the time to experiment. Quick smoke tests and proof of concepts are extremely cheap and easy to deploy. Get started today.
Containerization will accelerate the move to multi/hybrid cloud
For the last several years, IT has basically operated in two worlds – public cloud on AWS and on-prem. AWS’s early start in public cloud gave it a huge first-mover advantage. We have a running joke at Ternary that “AWS is the new on-prem”, meaning it’s the strong, incumbent platform.
Expect the power dynamics to shift. Kubernetes and Docker have become the de-facto standard services for containerization, making it easy to “write once, deploy anywhere”. AWS, Azure, and Google Cloud are all heavily investing in managed Kubernetes services. We believe containerization is a central strategy to Azure’s and Google Cloud’s success against AWS.
What does this mean for end-users? Gone are the days of special snowflake infrastructure. Companies using containerized applications now have the liberty to easily migrate their apps between on-prem and cloud (hybrid cloud) and between clouds (multi-cloud). Expect to see more companies using a mix of hybrid and multi cloud. Managed services from AWS such as Outposts and EKS, Azure’s AKS, and Google Cloud’s GKE will make Kubernetes a no-brainer for easily managing application clusters.
DataOps comes into its own
Compared with the adoption of continuous delivery and DevOps in application development, the time to value with data moves at a slow crawl. Making a change to a database field can take weeks or months. Such a painfully slow cadence is unacceptable in today’s fast paced, data-driven world. Time to value is a key competitive advantage for any company.
As data engineering adopts its own version of DevOps – DataOps – companies will see an improved time to value with their data. DataOps applies the disciplines of agile, continuous delivery, DevOps, and statistical process control to data. The result is that data becomes more reliable, consistent, high quality, and quickly available to downstream consumers – analysts, data scientists, automation, and models.
Data lineage will grow in importance
Questions about data commonly revolve around what and where.
“What is the origin of this data?”
“What does this field mean?”
“Where is data for X?”
Answering these questions is surprisingly difficult. Think of your data pipeline as a supply chain of physical goods. Raw data comes in, processed into a variety of outputs, and consumed. In physical supply chains, assets are tracked as they move from raw materials to finished goods. Everything is tracked as it moves through the supply chain.
Similarly, data lineage tracks the “what” and “where” of data. Data is traced from inception to end-use, and gives full visibility into where raw data ingredients are consumed. Another benefit is the ability to quickly identify data errors when they arise, and be able to address the root cause of the error. Data lineage enables better DataOps.
A few things will push data lineage to greater importance for data engineers.
First – and most obvious – as data grows in size and complexity, data will morph into various downstream processes. The result are data stockpiles. Similar to how physical supply chains track raw materials through their finished goods, tracking data assets from their raw input to end-uses will help companies keep track of their data inventory.
Second, GDPR and similar data protections now make data asset tracking a compliance necessity. Companies are now legally required to find records – and their variants – and promptly handle this data in accordance with particular laws. Data lineage is a necessary component for tracking data assets in a company’s data supply chain.
Lastly, good machine learning requires quality features. Fundamentally, machine learning models are intertwined with their training sets; each model is a soup of features and data. If the data change, applying the same algorithm will yield a different model. Data lineage allows you to know what data sources were used for what features.
Ethics issues from data science will become a data engineering problem
If 2018 was about AI and data becoming more tightly interwoven in the fabric of everyday life, it was also a very tough year. The public beca. Facebook infamously suffered from self-inflicted problems due to inappropriate data collection and sharing. China came under scrutiny for its use of AI in controlling its population. GDPR went into effect. A good summary of AI’s problems in 2018 is in the Quartz article, “This year the world woke up to the problems with AI everywhere”. Needless to say, 2018 was a doozy for the data world.
In the above article, Rumman Chowdhury, Accenture’s lead for responsible AI says “We were finding that 25% of companies were having to do a complete overhaul of their system at least once a month because of inconsistent outcomes or bias, or they were just unsure. I’m calling that ethical debt”. 2019 is the year when data engineers help solve the “ethical debt” problem in data and models.
As a side note, the IEEE is defining a new initiative called P7000, or “The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems”. We’ve been involved in discussions with this group, and are definitely excited about the direction things are going. If the issue of “ethical debt” is not addressed, we predict tough times ahead for AI and data. Our suggestion – get involved. Shape the future.
2019 will be an exciting year where data engineering continues to have an outsized impact. The field of data engineering is poised to create a lot of downstream value for data scientists and business in general.