Queries in Google BigQuery are several times more expensive than the same queries in Amazon Athena.
We discuss other BigQuery cost, performance and ecosystem advantages that can offset these higher costs.
BigQuery is one of our favorite cloud big data warehouses. It has virtually limitless data storage capacity and delivers extraordinary performance on large data sets. That said, pricing can be a bit problematic if you’re not careful. We frequently hear that query costs for BigQuery and AWS Athena are the same. In this article, we will address that misconception and understand how to calculate prices for BigQuery storage and queries; we’ll also discuss ways to leverage BigQuery cost, performance, and ecosystem advantages to help offset these higher costs.
Unlike AWS Redshift, but similar to AWS Athena, BigQuery is serverless in the sense that you don’t need to reserve or spin up resources to run queries. Instead, you pay for data storage and for the amount of data scanned in each query you run. On the surface, Athena and BigQuery appear to be priced almost identically. As one comparison article puts it,
“Google also charges by the amount of data scanned, and the price is the same as for Athena. The storage is $0.02 per GB, which is more or less the same in AWS (the price tiers depends on the overall amount stored).”https://logz.io/blog/bigquery-vs-athena/
This discussion of prices ignores some key details. In practice, prices for queries and storage in Athena are several times cheaper than equivalent prices in BigQuery. Let’s dig into these pricing differences, call out other ancillary costs specific to Athena, and explain why we still generally recommend BigQuery to our clients.
Athena vs BigQuery – A Pricing Comparison
Both Athena and BigQuery bill at $5/TB queried. However, not all terabytes are created equal. Athena charges for bytes read from S3; compressing data to reduce its size thus saves costs for both queries and storage. The Athena pricing documentation (https://aws.amazon.com/athena/pricing/) mentions this strategy specifically: “Compressing your data allows Athena to scan less data.”
Behind the scenes, BigQuery also compresses data, but this is invisible to the user. Here’s the big difference with Athena – storage and queries are billed by decompressed bytes (https://cloud.google.com/bigquery/pricing#data), and the difference in calculated data size can be dramatic. For example, I recently encountered a column like this in some client data stored in BigQuery.
In fact, this entire column consists of a single repeated value. Typing the query
SELECT activitytype FROM `sample-table`
in the BigQuery console editor returns a query size of 1.3 GB. The table contains over 200 million rows, so the raw data size makes sense, but a quick experiment with gzip compression achieves a ratio of over 100 to 1.
Low cardinality columns, i.e., highly compressible columns containing a small number of distinct values repeated many times, are common in real world data sets. Noisy numeric data, also very common in practical data, doesn’t compress nearly so well, but we typically see compression ratios of 5 to 1 or better for full tables.
To look at another example, we turn to a Bureau of Labor Statistics table in the BigQuery public datasets. (bigquery-public-data:bls.employment_hours_earnings). The reported table size in BigQuery is 1.16 GB. Exporting from BigQuery in CSV format with .gz compression yields 141 MB of files. Using Qubole Spark to convert the table into Parquet columnar format with Snappy compression yields an output size of 222 MB. While csv.gz format gives us a higher compression ratio, Parquet/Snappy format yields substantially better query performance in Athena and allows us to further limit scan size by only paying for data in columns that we query. To put this example in pricing terms, querying the full table in BigQuery is roughly five times more expensive than using Athena to query the data in compressed Parquet format.
Query and storage pricing tell only part of the cost story for BigQuery and Athena. BigQuery gives away a number of services that entail extra charges with AWS/Athena.
Data import – BigQuery imports data from Google Cloud Storage to native tables for free. Athena can directly query data from S3 in a variety of formats, but it is a best practice to convert data to a compressed columnar format like Parquet with Snappy compression for data that will be queried repeatedly. Format conversion in AWS requires running a workload in Glue, Athena or another tool.
Data export – BigQuery exports native tables to compressed Avro, CSV and JSON files for free. (Note that BigQuery can read Parquet files, but inconveniently does not export directly to Parquet.)
Schema management – standard practice is to manage Athena schemas in Glue Catalog, using a Glue crawler scan for detection. BigQuery can auto-detect schemas for free as part of the ingest process.
Query caching – when you run a duplicate query in BigQuery within 24 hours, the database will return cached results at no additional charge.
Other BigQuery Features We Love.
Here are some reasons why we frequently recommend BigQuery to our clients.
Ease of management – BigQuery’s project, dataset and table hierarchy makes it exceptionally easy to organize data and manage access. Data management and administration overhead is a key cost driver for any large organization. Athena is an extremely powerful and flexible query engine, but data organization and management is substantially simpler with BigQuery.
Flat rate pricing – organizations querying more than 2 petabytes a month, or 20 petabytes a year, can save by signing up for BigQuery flat rate pricing, starting at $10,000 per month or $100,000 a year with an annual commitment.
Data sharing – we expect data sharing to grow in importance over the next decade. This can be in the form of open data sets, data shared for collaboration and paid data access. Cloud platforms offer many methods of data sharing, for example ‘requester pays’ S3 buckets and AWS Data Exchange. That being said, we’ve found data sharing in BigQuery to be exceptionally easy. Consumers simply run queries against shared data from their own Google Cloud projects, and can easily union or join with data from other sources. (This is also a key strength of Snowflake.)
Integration – BigQuery features deep integration into various Google Cloud Platform data services. We anticipate continued growth of the ecosystem in 2020.
Your Mileage May Vary
Innovation from cloud big data warehouse vendors is driving a sea change in business analytics and data science. Google BigQuery, Amazon Athena, Snowflake, Redshift and Azure Synapse Analytics all offer remarkable technology and performance. The choice of a cloud data warehouse is just one component of an organizational cloud strategy. Consider other factors like operational costs to manage data pipelines and other technologies that support the data warehouse.
About the Author
Matt Housley is one of the cofounders of Ternary Data, a consulting firm specializing in data engineering and processes. Before founding the company, he worked as a math professor, data scientist, and data engineer. He holds certifications on both GCP and AWS.
Google Cloud Next produced a flurry of news in the data space, but we were particularly excited about the announced partnership between Google and Qubole. Qubole on GCP will utilize the new BigQuery storage API, allowing seamless, performant integration between the database and other big data tooling such as Apache Spark.
This resolves a major pain point for data engineers with substantial data assets stored in BigQuery. Until recently, BigQuery did not support direct network export to a Hadoop cluster. Before Spark can read the data, the BigQuery connector must export it to Google Cloud Storage. This adds substantial processing time – in my experiments, exporting a 5 TB table took between 5 and 10 minutes. This is reasonable for batch processing, but absolutely glacial for interactive analysis.
In February, Google quietly introduced a new BigQuery connector that eliminates these compromises. The BigQuery Storage API allows compute clusters to read data in Apache Avro format using parallel streams with low latency and high bandwidth.
A new connector is only as good as its integrations. Dataproc, Dataflow and open source Apache Beam provide support, and clients are available in several programming languages (Python, Go, etc.) More integrations should be forthcoming since the API leverages Avro.
Qubole is one of the first third party services to support the BigQuery Storage API. Qubole is a cloud optimized big data engine, supporting Spark and various Hadoop tools. In the GCP ecosystem, Qubole is an alternative to Dataproc, but built around a more modern containerized approach for better resource utilization and scaling. In a future post, we’ll dig into connecting Spark to BigQuery through the storage API.
Anyone who handles data on a a daily basis knows the pain and tedium of cleansing data – it’s a time consuming, mundane, and thankless task. Trifacta comes to the rescue of bleary-eyed data professionals with its all-in-one data wrangling tool. In this post, I’ll walk through a demo of cleaning up survey data with Google Cloud Dataprep, Google’s fully managed Trifacta service.
I was initially skeptical of a proprietary tool with a GUI, but Trifacta is wonderful for dealing with the kind of dirty, poorly typed data that we encounter in the real world. When I’ve prepared data with Python, I’ve had to spend most of my time writing code to deal with numerous edge cases in each column. Trifacta’s recommended transforms make quick work of these issues, and you can also write custom transforms. You can always combine Dataprep with other jobs if you need the full power of a programming language.
For this post, I’ve created some fake survey data inspired by real survey data that I needed to munge last week. What appears to be personal information (SSN, age) was randomly generated.
Open Dataprep in your GCP account. If this is your first time using Dataprep within your project, you’ll need to go through an authorization process – Trifacta is hosted outside your account and requires access to do its magic. Once you’ve completed this, you’ll be able to create a flow.
Create a new flow. Click Add Datasets, then Import Datasets. Drag and drop the survey data into the window. Click Import, Add to Flow.
We now see survey_data.csv in the flow. Click the recipe button to add a new recipe. In the panel that appears to the right, click Edit Recipe.
Note the horizontal bars at the top of each column. The turquoise section indicates valid data, the red portion mismatched values and the gray missing values. The first thing we notice is that we have several columns with no valid values. First and Last name, address and city have all been redacted. You can shift + click to select all four columns. In the right panel, you’ll see several suggested transforms. Choose Delete columns. Click Add. This step is now part of your recipe.
Next, look at the first column, finished. This is a boolean indicating whether the participant finished the survey. Let’s discard unfinished surveys. At the top of the column, you’ll see vertical bars indicating proportions of data categories for the column.
The short bar indicates false. If you click on it, you’ll be given options for this category in the right panel. Select Delete rows and add the step.
Next, look at the SSN4 column. In a future post, I’ll use this as a join key to combine pre/post data for statistical analysis, so we need valid values here. Let’s get rid of rows that are missing a value. Hover over the gray portion of the horizontal bar for the column and click. Choose the Delete rows transform and add. Next, select the red portion of the bar. If you hover over the Keep rows transform in the right panel, you’ll see a pop up with some of the mismatched values.
Click Delete rows, Add. We’re not quite done with this column – if we look at the distribution at the top, we notice some suspiciously large values.
Scrolling through the data, you’ll notice that there are numbers with too many digits. Click on the bar on the far right and then click the Delete rows transform, but instead of adding, click edit. You should see a range condition (65000000 <= SSN4) && (SSN4 < 70000000). Change this to 10000 <= SSN4 to catch all the bad values. Click add.
You can scroll down to make sure the data match what you expect.
Let’s move on to the age column. We have mismatched data; if you scroll down, you’ll see a field with 37 years. Right click on the column header and select extract, numbers. Adding the transform creates a new column called age1. Delete the age column and rename the new column to age. We still have missing values, but in this case we’ll keep the corresponding rows.
The remainder of the columns are Q1 through Q22. These contain seven options ranging from strongly disagree to strongly agree. We’re going to convert these to integers for statistical analysis. This step could be a little less painful, but it’s not bad. I suspect this would be easier with Trifacta custom types, but they are not supported in Dataprep. Hopefully, this feature will be added in the future.
To start the transform process, select all the question columns with shift + click. Right click on a column header and select replace, text or patterns. In the Find box, type ‘Strongly agree’. In the Replace with box, type 3. Add the transform. We have to repeat these steps six more times. The mapping is agree to 2, somewhat agree to 1, neither agree nor disagree to 0, somewhat disagree to -1, disagree to -2 and strongly disagree to -3. After you’ve completed this, the remaining mismatched values are spaces.
One last transform: we don’t need the finished column anymore. Go ahead and delete it.
Now, we want to save out the data.
Because our data is small, Trifacta has processed the full set without sampling, but GCP forces us to run the job. This takes a couple of minutes because Dataprep must spin up processing resources. Click the Run Job button and then run the job on the next page. In a few minutes your job should finish. Note that you will incur costs of a few cents.
If you handle data on a regular basis, if you’re an analyst, data scientist, engineer or ETL developer, Cloud Dataprep could cut years off your data preparation time. Try working through this example by downloading the survey data above. I’ve also attached my final output and recipe in the zip file below.
The move to cloud offers unprecedented opportunities for organizations to deploy new technologies, accelerate time to value and save on compute costs. This last claim defies tech community folklore. Isn’t cloud more expensive than traditional servers? The answer is yes… and no.
The fundamental problem is what I call the curse of familiarity. When we begin experimenting with cloud, we start with the basic assets of compute, storage and networking. These look and feel like the systems we’re used to. Virtual machines expose CPU cores and memory, and run Linux or Windows. To connect and install software, we SSH or RDP, just like we would with our own hardware. We can lift and shift workloads; the web app we had installed on a Linux blade can be moved to a Compute Engine instance.
Everything is so familiar, and herein lies the problem. While lift and shift is a good way to begin the journey, the familiar way is frequently the wrong way to engineer for the cloud.
Cost Engineering vs Performance Engineering
In the on-prem world, engineering can have an indirect impact on operational and hardware costs. Optimizing a Java web application to improve CPU efficiency by 50% will reduce hardware requirements to handle projected peak load. Combined with other optimizations, this will allow delay of hardware purchases and a smaller hardware refresh cycle. This might have other knock on effects such as deferral of a data center expansion and a smaller headcount for the data center team.
In the cloud, costs are immediate and measurable. With reduced requirements to handle peak load, the engineering team can reduce VM capacity allocated to this web app by half. They will see a 50% reduction on compute costs as soon as the changes are deployed.
As the team gains experience and sophistication, they realize that there are many additional knobs to turn. Instead of running enough servers to handle peak load 24 hours a day, they refactor the app to be stateless and ephemeral so they can utilize autoscaling. They learn that not all CPU cycles are created equal. The price of CPU core and memory resources varies dramatically depending on how these resources are consumed. A move to spot instances delivers a potential savings of 80%. With additional refactoring, AWS Lambda functions are much cheaper again.
I refer to the traditional paradigm as performance engineering and the new problems of cloud cost management as cost engineering. In fact, cost engineering subsumes performance engineering; improvements in application performance will almost always reduce cloud spend in a very direct way. But myopically focusing on performance and CPU/memory efficiency misses the massive savings available through other mechanisms.
Enter AWS Graviton
AWS recently introduced an additional cost saving mechanism by offering new processor types. AMD based instances came on the scene a few months ago; a1 instances, running Graviton processors utilizing the ARM instruction set, were announced at re:Invent in November and are available now through the AWS console.
Since AMD and Intel processors use essentially the same AMD64 instruction set, applications can be transitioned with minimal testing, but the cost savings are modest, about 10% for an instance with the same number of virtual cores. AWS is promising much greater savings with a1 instances, up to 45% for suitable workloads; this can be combined with savings from autoscaling and spot instances.
Of course, the devil’s in the details; Graviton processors have dramatically different performance characteristics from Intel and AMD architectures. While we could deep dive into a technical discussion of memory bandwidth, single threaded performance and floating point units, the best way to determine the cost of running your application on the new architecture is to rebuild it, run with a test load and measure directly.
Code in the Linux / open source ecosystem can be ported with modest effort. Python is a widely used language in the data science / data engineering space where we generally work. We look forward to working with clients on a full migration analysis with cost profiling, but this tutorial will stick to the basics of building and installing a Python library.
Running a Python Library on Graviton
Lately, I’ve been working with the Google Ads Python Client, so I’ll walk through the process of getting this up and running. Go to the EC2 console and click the button to launch a new instance. On the AMI screen, we’ll use Amazon Linux 2, but you’ll notice that you now have a choice of x86 or ARM architectures. Choose ARM and select. You’ll see the new a1 instance types at the top of the next screen. Select a1.medium and click review and launch at the bottom of the screen. SSH to the instance so you can begin installing software.
I prefer to use Python 3 to make my code more future proof, so let’s start by getting that set up. [ec2-user@ip-172-31-36-228 ~]$ python --version
[ec2-user@ip-172-31-36-228 ~]$ python3
-bash: python3: command not found
[ec2-user@ip-172-31-36-228 ~]$ sudo yum install python3
We can now run pip in Python 3 for package installation.
We use the –user switch to avoid permission errors in accessing system Python packages. From here, the install would go smoothly on an Intel EC2 instance, but goes off the rails with ARM.
error: command 'gcc' failed with exit status 1
Looking through the output, we see where the failure occurs.
Running setup.py install for PyYAML ... done
Running setup.py install for suds-jurko ... done
Running setup.py install for lxml ... error
Google Ads uses a legacy SOAP API, so data is encoded in XML. XML has a general reputation for slow performance, and this is especially true with native Python text serialization, a huge headache when pushing large amounts of data to the API. The lxml library relies instead on C libraries to speed up serialization.
This brings us to one of the key strengths and pain points of Python. Python has a reputation for being slow, but that isn’t necessarily the case. Python lives in the C ecosystem and is performant so long as the computationally intensive parts of an application are handled by C code; required C toolchains and dependencies must be present for packages to build, creating complications not present with JVM languages.
In this case, there is no prebuilt ARM binary Wheels package in the PyPI repositories, so pip attempts to build the binaries – we’re on our own to resolve the dependencies. Looking at the end of the error message, we find some additional clues.
unable to execute 'gcc': No such file or directory
Compile failed: command 'gcc' failed with exit status 1
This is a familiar song and dance if you’ve ever had to compile Python package dependencies. We need to install GCC; we also need the Python developer tools which contain required C headers.
sudo yum install gcc python3-devel
This time, we can just try installing lxml since that’s the package causing the failure when we install googleads.
pip3 install lxml --user
Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
error: command 'gcc' failed with exit status 1
Let’s investigate libxml2. First, we try installing with yum.
[ec2-user@ip-172-31-36-228 ~]$ sudo yum install libxml2
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Package libxml2-2.9.1-6.amzn2.3.2.aarch64 already installed and latest version
Nothing to do
List other relevant packages.
[ec2-user@ip-172-31-41-134 ~]$ yum list available | grep libxml2
libxml2-devel.aarch64 2.9.1-6.amzn2.3.2 amzn2-core
libxml2-static.aarch64 2.9.1-6.amzn2.3.2 amzn2-core
Now install lxml with pip. Be warned that the command will take a long time – the compilation process is slow. Once that completes, you should be able to install googleads. You can find instructions on utilizing the client to interact with GoogleAds here.
The potential of ARM servers as a counterweight to Intel’s dominance has for years been a topic of discussion in the tech industry, but the Graviton announcement at re:Invent 2018 may well be remembered as the beginning of a sea change in server architecture. AWS makes these new VMs available with a few button clicks, allowing engineers to evaluate and get jobs up and running immediately without an expensive long term commitment to unfamiliar hardware.
If your application dependencies are open source, you can port with a couple hours of work and begin measuring costs. I expect the process to be much easier a year from now as the ARM server ecosystem grows.