Google Cloud Next produced a flurry of news in the data space, but we were particularly excited about the announced partnership between Google and Qubole. Qubole on GCP will use the new BigQuery Storage API, enabling seamless, performant integration between BigQuery and other big data tooling such as Apache Spark.
This resolves a major pain point for data engineers with substantial data assets stored in BigQuery. Until recently, BigQuery did not support direct network export to a Hadoop cluster. Before Spark could read the data, the BigQuery connector first had to export it to Google Cloud Storage. This added substantial processing time – in my experiments, exporting a 5 TB table took between 5 and 10 minutes. That overhead is reasonable for batch processing, but absolutely glacial for interactive analysis.
In February, Google quietly introduced a new BigQuery connector that eliminates these compromises. The BigQuery Storage API allows compute clusters to read data in Apache Avro format using parallel streams with low latency and high bandwidth.
A new connector is only as good as its integrations. Dataproc, Dataflow, and open-source Apache Beam already provide support, and client libraries are available in several programming languages (Python, Go, etc.). More integrations should be forthcoming, since the API serves data in the widely supported Avro format.
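To make the read path concrete, here is a minimal sketch using the Python client (`google-cloud-bigquery-storage`): open a read session for a table, request several parallel streams, and iterate over the decoded Avro rows. The project, dataset, and table names are placeholders, and running the read itself requires GCP credentials; treat this as an illustration of the API shape rather than production code.

```python
# Sketch of reading a BigQuery table through the Storage API.
# Assumes the google-cloud-bigquery-storage package (and fastavro,
# for Avro decoding) are installed; names below are placeholders.

def table_path(project, dataset, table):
    """Build the fully qualified table name the Storage API expects."""
    return f"projects/{project}/datasets/{dataset}/tables/{table}"

def read_rows(project, dataset, table, max_streams=4):
    # Imported inside the function so the pure helper above stays
    # usable even without the client library installed.
    from google.cloud import bigquery_storage

    client = bigquery_storage.BigQueryReadClient()
    session = client.create_read_session(
        parent=f"projects/{project}",
        read_session=bigquery_storage.types.ReadSession(
            table=table_path(project, dataset, table),
            data_format=bigquery_storage.types.DataFormat.AVRO,
        ),
        # Ask for multiple streams; each can be consumed by a
        # separate worker for parallel, high-bandwidth reads.
        max_stream_count=max_streams,
    )
    # For simplicity this sketch drains the streams sequentially.
    for stream in session.streams:
        for row in client.read_rows(stream.name).rows(session):
            yield row
```

In a distributed engine, each stream would be handed to a different executor; that per-stream parallelism is what gives the Storage API its bandwidth advantage over the old export-to-GCS path.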
Qubole is one of the first third-party services to support the BigQuery Storage API. Qubole is a cloud-optimized big data engine, supporting Spark and various Hadoop tools. In the GCP ecosystem, Qubole is an alternative to Dataproc, but built around a more modern containerized approach for better resource utilization and scaling. In a future post, we’ll dig into connecting Spark to BigQuery through the storage API.
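As a preview, the Spark side of that integration is pleasantly small: the spark-bigquery-connector (which reads through the Storage API) plugs into Spark's standard data source interface. The sketch below is hedged: the table name is a placeholder, the `filter` option reflects the connector's filter-pushdown support as I understand it, and the helper functions are mine, not part of any library.

```python
# Sketch: reading a BigQuery table into a Spark DataFrame via the
# spark-bigquery-connector. Table names and filters are placeholders.

def bigquery_read_options(table, filter_expr=None):
    """Build the option map for spark.read.format("bigquery")."""
    opts = {"table": table}
    if filter_expr:
        # The connector can push this predicate down to the Storage
        # API, so only matching rows cross the network.
        opts["filter"] = filter_expr
    return opts

def read_bigquery_table(spark, table, filter_expr=None):
    """Return a DataFrame backed by a BigQuery Storage API read."""
    reader = spark.read.format("bigquery")
    for key, value in bigquery_read_options(table, filter_expr).items():
        reader = reader.option(key, value)
    return reader.load()
```

With the old connector, a call like this would have triggered an export job to Cloud Storage first; with the Storage API it streams rows directly into the executors.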