The Ternary crew spent a crazy week in San Francisco, attending Google Next 2019 with almost 40,000 of our closest friends. Given the 122+ announcements, we obviously didn’t catch everything.
Here’s a roundup of our favorite data-related announcements from Next.
AI is Google’s sweet spot. While other cloud vendors are making great progress with their AI offerings, Google stands on its own. One area where Google is pushing boundaries is AutoML. For the uninitiated, think of AutoML as “feed your data, get predictions”. No data scientist required.
How is this working out? Well, during Next, Google’s AutoML earned second place in a one-day invitational Kaggle competition. Not bad for a self-service product in a field where PhD-level machine learning skills were once required.
At Ternary, we truly believe that AutoML is the wave of the future. It greatly simplifies the ceremony around most data science use cases. After all, it’s just math 😉
The other impressive shift is Google’s move toward AI-powered business solutions. In the past, Google tended to pitch their services as “we know AI, here’s what you should use too”. With new products such as Contact Center AI, Recommendations, and much more, it’s clear that Google has wised up and is moving toward solving real-world business problems with managed services.
AutoML Tables (Beta)
Machine learning on structured data is notoriously difficult. We will go out on a limb and say it’s one of the harder areas of automated ML. Why? Because you’re usually dealing with data input by humans in some janky user interface. Humans make errors. For example, Utah may be spelled “UT”, “Utah”, “UTAH”, “Beehivestate”, etc.
Needless to say, structured data is a mess. It’s one thing to make a custom model by hand. The capability to do machine learning on any structured dataset is in a different universe of difficulty. Some of us have been part of a few serious endeavors to solve this, and it hasn’t gotten easier over the years. Most importantly, structured data is what much of the business world depends on, whether it’s ERP, BI, or other critical systems.
Enter AutoML Tables, Google’s own high-powered solution for machine learning on structured tabular data. The process is pretty straightforward – point Tables at your data source, give it a target feature, and let it predict. It offers classification or regression; we even noticed it allows for time series forecasting, which is a nice boon. When we chatted with an engineer on the Tables team, he indicated that it uses deep learning behind the scenes. Since that’s the case, more data is better. If you’re dealing with a super small dataset (less than 100K rows), you may be better off using scikit-learn or similar classical ML packages.
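For those smaller datasets, a classical model is quick to stand up. Here’s a minimal sketch with scikit-learn, using a built-in toy dataset as a stand-in for your own tables:

```python
# A minimal classical-ML sketch for small tabular datasets,
# using scikit-learn instead of AutoML Tables.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for your own structured data (rows = records, columns = features).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train and score a simple classifier -- no 92-node cluster required.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

For datasets of this size, the whole run finishes in seconds on a laptop, which is a useful baseline before reaching for a managed service.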
As a beta product, Tables has a few drawbacks we hope get corrected soon. Training costs $19.32 per hour – not cheap, though certainly not as expensive as a data scientist. The issue we noticed is that (as far as we can tell) AutoML spins up 92 n1-standard-4 servers regardless of data size, which may be overkill for smaller datasets. Again, if you have a small dataset, it may be easier to use a classical ML library.
The QuickStart example, a dataset of 45K rows and 16 columns, took 2 hours to train. That’s quite a long time. You can preset a time limit for training, but this must be done before training starts; once training is underway, we couldn’t find a way to stop the process. Depending on your dataset, it may take… a while… for training to complete. This could get expensive.
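Some back-of-the-envelope math on what that run costs, using the figures above (the hourly rate and training time are the numbers we observed during beta, so treat them as a snapshot, not a quote):

```python
# Rough cost estimate for an AutoML Tables training run,
# based on the beta-era figures quoted above.
PRICE_PER_HOUR = 19.32   # USD per hour of Tables training (observed at beta)
QUICKSTART_HOURS = 2     # training time we saw for the 45K-row QuickStart

quickstart_cost = PRICE_PER_HOUR * QUICKSTART_HOURS
print(f"QuickStart run: ~${quickstart_cost:.2f}")
```

Roughly $39 for a 45K-row toy dataset – fine for a demo, but worth budgeting for before pointing Tables at something bigger.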
Despite these quirks, Tables produces a great model, complete with diagnostic metrics such as F-scores, a confusion matrix, and more. Like all AutoML products, Google does a lot of the hard work for you to set up API endpoints against a deployed model. Overall, Tables is going to be a force to be reckoned with in the hot space of automated machine learning on tabular data.
Too Many Awesome AI Announcements!
Honestly, there were too many awesome AI announcements. Here are some others we enjoyed checking out.
AutoML Video Intelligence – In-video classification.
AI Hub – Google now offers a “one stop shop for everything AI”, greatly simplifying users’ ability to collaborate and access AI tools.
AI Platform – In the past, Google’s AI tools were scattered around ML Engine and Datalab. The new AI Platform consolidates everything – AI Hub, data labeling, Kubeflow, notebooks, jobs, models – under one roof. No more searching around for various parts to build an ML pipeline.
Kubeflow – Deploy ML workflows on Kubernetes. This is now included in AI Platform, allowing you to write your ML pipeline once and run it anywhere.
Recommendations AI – Personalized recommendations are a huge area of business data science. There’s not an app that won’t benefit from recommendations. In the past, data scientists would need to build recommendation engines from scratch. Now, just give Recommendations AI your product and customer data and let it set up a prediction API for you. We recommend you try this (ok, bad joke).
The Ternary crew are huge BigQuery fans. The number of announcements around BigQuery was downright mind-blowing. It’s clear that Google wants BigQuery to be the hub for all of your data. Here are several highlights.
BI Engine (Beta)
Analytics often lives or dies by query execution time. Slow-running queries increase the time to insight and reduce value to the business. Fast, preferably real-time, analytics is quickly becoming an expectation.
BI Engine is a “fast, in-memory analysis service”. No longer do you need to wait for slow-running queries; you can now get highly concurrent, sub-second queries in BigQuery. You can even hook up BI Engine to BigQuery Streaming to get genuine real-time dashboards on streaming data.
BigQuery ML (GA)
The ability to use machine learning algorithms inside your data warehouse is pretty cool. Analysts with SQL skills can accomplish an impressive amount of data science with BigQuery ML.
There were lots of announcements for BigQuery ML. Here are some highlights.
- BigQuery ML now in general availability
- TensorFlow in BigQuery (!) for regression and classification
- K-Means Clustering
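To give a flavor of what BigQuery ML looks like in practice, here’s a sketch of a k-means model defined entirely in SQL. The dataset, table, and column names are hypothetical, and you’d need credentials and the google-cloud-bigquery client to actually submit it:

```python
# Sketch of a BigQuery ML k-means model, built as a SQL string.
# `mydataset`, `visits`, and the feature columns are hypothetical names.
create_model_sql = """
CREATE OR REPLACE MODEL `mydataset.visitor_segments`
OPTIONS(model_type = 'kmeans', num_clusters = 4) AS
SELECT
  total_pageviews,
  session_duration_seconds,
  days_since_last_visit
FROM `mydataset.visits`
"""

# To run it for real (requires GCP credentials and google-cloud-bigquery):
# from google.cloud import bigquery
# bigquery.Client().query(create_model_sql).result()
```

That one statement handles training end to end – no Python modeling code, no cluster to manage – which is exactly why SQL-savvy analysts are excited.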
Dataflow SQL (Public Alpha)
Want to write a streaming or batch data pipeline, but don’t know Python or Java? If you know SQL, you can now take advantage of Dataflow SQL. Simply write a SQL query in BigQuery, and a Dataflow job is created for you. This seems like a great way for analysts to dive in and write their own pipelines. Given the debate over whether data scientists and analysts should write their own ETL, it will be interesting to see where Dataflow SQL goes.
DTS SaaS connectors
Connecting data sources is a pain point for data pipeline developers. But that’s part of the job. BigQuery’s Data Transfer Service has been a great tool for importing Google Ads and Analytics data into BigQuery. And while you could already connect your apps using Fivetran and other services, it still required a few steps. Thankfully, Google announced 100+ SaaS connectors for BigQuery. Connect to your favorite apps via Supermetrics and Fivetran. This will definitely ease the ceremony of setting up data connections to BigQuery, allowing you to focus on analysis.
Flat rate pricing
BigQuery’s flat rate pricing got a whole lot more attractive. Instead of a $40,000 USD minimum for 2,000 slots, you can now start at $10,000 for 500 slots. This makes flat rate usage of BigQuery a lot more affordable.
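The quick math on why this matters: the per-slot price stays the same, but the minimum commitment drops fourfold.

```python
# Old vs. new BigQuery flat rate tiers (USD per month, from the announcement).
old_cost, old_slots = 40_000, 2_000
new_cost, new_slots = 10_000, 500

old_per_slot = old_cost / old_slots   # 20.0 USD per slot
new_per_slot = new_cost / new_slots   # 20.0 USD per slot

# Same per-slot price; the entry point is 4x lower.
assert old_per_slot == new_per_slot == 20.0
assert old_cost / new_cost == 4.0
```

In other words, this isn’t a discount so much as a lower on-ramp – mid-sized shops can now justify flat rate without committing to 2,000 slots.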
Data Fusion (Beta)
Let’s face it: there are a ton of great third-party cloud ETL tools like Matillion, Talend, and Informatica. But we haven’t seen a cloud provider offer its own decent ETL tool, except for Azure’s Data Factory. And if you’re on Google Cloud, your options have been pretty slim. Until now.
Google Cloud’s Data Fusion is the result of Google’s acquisition of Cask Data last year, and is based on the open source CDAP framework. It offers GCP users a full-fledged ETL/ELT tool that works with both streaming and batch workloads. We’ve tested the service (still in beta), and are mostly impressed. Right now it runs on Google’s Dataproc, but we suspect it will also run on Dataflow at some point. Pricing comes in two tiers: a basic edition (batch only) and an enterprise edition (batch and streaming). It features 100+ connectors, so you can be productive with your favorite data sources and sinks in no time. You can also wrangle your data (much like Trifacta/Dataprep), which takes the headache out of nobody’s favorite chore – data cleaning.
We will start recommending Data Fusion to Google Cloud users as a full-fledged ETL/ELT tool. It’s a complete tool that will satisfy nearly every company that needs to pipeline its data. Dataprep is still available for ad-hoc data cleaning, and Dataflow/Dataproc remain options if you’d rather code these solutions yourself.
Data Catalog (Beta – coming soon)
Got a couple of questions for you: do you know where all of your data assets are stored, and what they mean for your business? And do you practice data governance? Most companies can’t answer these with much more than “no” or “maybe”.
Meanwhile, data continues to grow, and data sprawl is a reality for many companies. In an age of increasing data compliance and protection requirements, data governance is no longer just a nice-to-have.
Google Cloud’s Data Catalog is a managed data catalog and governance tool that “offers a unified view of all datasets”. It tracks metadata about your data, allows users to tag data, and offers excellent search and discovery. Data Catalog also integrates with Google’s DLP for enhanced data privacy tracking.
As this product becomes more fully baked, we expect to recommend it to our clients. Ternary takes data cataloging and governance very seriously. So should you.
The first day’s Next keynote kicked off with Anthos, Google’s hybrid/multi-cloud platform. Now, you can run your apps in Google Kubernetes Engine (GKE), AWS, Azure, or your on-prem environment with unified management in GCP. This truly opens up the world for applications to be “write once, run anywhere”.
We’re particularly excited for Anthos in the Rockies. We see two types of customers – those on AWS, and those with on-prem infrastructure. Many of the companies in our area are still on-prem, figuring out how to containerize their applications and planning to move to the cloud soon. Something like Anthos makes this transition to hybrid cloud a lot easier. We’re looking forward to helping our customers use Anthos to migrate their data workloads to any cloud they desire.
Cloud Run (Beta)
Also announced on the first day was Cloud Run, which fills a gap between Cloud Functions and App Engine. Just point Cloud Run at your hosted container, deploy, and you get a fully functioning application with an API endpoint (for up to 15 minutes of runtime per request). We heard a rumor on a recent Google Cloud podcast that somebody got the video game Doom running on Cloud Run.
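The contract Cloud Run asks of your container is simple: serve HTTP on the port given in the PORT environment variable (defaulting to 8080). A minimal sketch using only the Python standard library, no framework required:

```python
# Minimal HTTP service in the shape Cloud Run expects:
# listen on the port from the PORT env var, default 8080.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def get_port(env):
    """Cloud Run injects the listening port via the PORT env var."""
    return int(env.get("PORT", 8080))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello from Cloud Run!\n")


# To serve (locally or inside the container), uncomment:
# HTTPServer(("", get_port(os.environ)), Handler).serve_forever()
```

Package that in a container image, push it, point Cloud Run at it, and you have a public HTTPS endpoint that scales to zero when idle.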
Another new product is Cloud Code (get used to the theme of Cloud this, Cloud that), making it easier to build and deploy Kubernetes applications from your IDE. For anyone writing apps for Kubernetes, this will definitely help with productivity.
Qubole and Google Cloud
Qubole, Ternary’s favorite 3rd party self service data platform, just announced a partnership with Google Cloud. We’re excited because this gives Google Cloud users a one-stop shop for a unified data science and data engineering experience, complete with a clean and simple UX. You can start today with Qubole in the Google Cloud marketplace.
While some companies are getting bad marks for their behavior around open source, Google Cloud is making open source-centric companies first class citizens in the Google ecosystem. This is a great move that both makes excellent business sense for all parties and keeps the open source ecosystem moving forward. We support open source whenever possible, and so should you.
Until Next time…
Sorry, bad pun. But seriously: Google Cloud has come a long way with its offerings this year. Its data tools and ecosystem are robust and will add a ton of value for many companies. The Ternary crew can’t wait to get these tools into the hands of our customers. See you all at Next 2020!