Big Themes For Data Engineering In 2019

Ternary Data heavily focuses on data engineering. So, it makes sense that we are keenly interested in where things are heading for the field in 2019. Instead of doing the typical new year ritual of making crazy predictions, we consider several big themes for data engineering in 2019.


TL;DR

  • Data will still be hard for most companies
  • Serverless will gain stronger adoption
  • Containerization will accelerate multi/hybrid cloud
  • DataOps will become more widely adopted
  • Data lineage will grow in importance
  • Ethics issues from data science will become a data engineering problem

Data will still be hard for most companies

Despite the hype of “AI everywhere” and “data is the new oil”, let’s get real. Data is hard. The dirty secret is that most data initiatives fail. Gartner analyst Nick Heudecker puts the failure rate for big data projects at “closer to 85%”.

The reasons vary, but the failure themes are mostly the same. Chief among them are lack of executive and organizational buy-in, misaligned expectations, weak data culture, lack of resources, and the talent gap. Data is hard because change management is hard. Fundamentally, change management is a people problem, not a technology problem. Therefore, we expect most companies to be challenged with data well past 2019.

There are ways to win with data. We advise our clients to get back to basics with data, build a solid foundation, and start small.  We wrote some tips on starting successful data projects, which you can find here.


Serverless will gain stronger adoption

Serverless is rightfully red hot. The benefits of serverless are very clear – less undifferentiated heavy lifting and quicker time to value. The role of the data engineer is changing, from babysitting on-prem Hadoop and Spark clusters to managing a plethora of interconnected services in the cloud. Allowing data engineers to focus on code, DataOps, and integrating with data scientists is game changing and adds a ton of value.

The big cloud providers are transforming toward serverless offerings. AWS and Google Cloud have both been at the forefront of serverless for over a decade (notably, AWS launched S3 in 2006 and Google Cloud launched App Engine in 2008). Azure is keeping pace. Where the big clouds go, so goes the future of cloud infrastructure. Although the clouds will continue offering managed solutions like AWS EMR and GCP Dataproc, the future direction is clearly toward serverless.

If your company is still on the fence about serverless (or managed services), now is the time to experiment. Quick smoke tests and proof of concepts are extremely cheap and easy to deploy. Get started today.
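
To make that concrete, here is a minimal sketch of a Python AWS Lambda handler that could serve as a serverless smoke test. The event shape assumes an S3 object-created notification, and the processing step is just a placeholder to adapt to whatever pipeline you want to trial.

import json

def lambda_handler(event, context):
    # AWS invokes this entry point; "event" carries the trigger payload,
    # here assumed to be an S3 object-created notification.
    records = event.get("Records", [])
    keys = [r["s3"]["object"]["key"] for r in records if "s3" in r]
    # Real processing (parse, validate, load to a warehouse) would go here.
    return {
        "statusCode": 200,
        "body": json.dumps({"processed_objects": keys}),
    }

Wiring this up to a test bucket takes minutes, and the pay-per-invocation bill for a smoke test like this is effectively zero.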


Containerization will accelerate the move to multi/hybrid cloud

For the last several years, IT has basically operated in two worlds – public cloud on AWS and on-prem. AWS’s early start in public cloud gave it a huge first-mover advantage. We have a running joke at Ternary that “AWS is the new on-prem”, meaning it’s the strong, incumbent platform.

Expect the power dynamics to shift. Docker and Kubernetes have become the de facto standards for containerization and container orchestration, making it easy to “write once, deploy anywhere”. AWS, Azure, and Google Cloud are all heavily investing in managed Kubernetes services. We believe containerization is a central strategy to Azure’s and Google Cloud’s success against AWS.

What does this mean for end-users? Gone are the days of special snowflake infrastructure. Companies using containerized applications now have the liberty to easily migrate their apps between on-prem and cloud (hybrid cloud) and between clouds (multi-cloud). Expect to see more companies using a mix of hybrid and multi cloud. Managed services from AWS such as Outposts and EKS, Azure’s AKS, and Google Cloud’s GKE will make Kubernetes a no-brainer for easily managing application clusters.
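
To illustrate the portability angle, the sketch below uses the official Kubernetes Python client to list deployments. Because EKS, AKS, GKE, and on-prem clusters all expose the same API, the same code runs against any of them; switching targets is just a matter of switching the active kubeconfig context. Treat it as a sketch, not a migration recipe.

from kubernetes import client, config

# Credentials come from ~/.kube/config; pointing at a different cluster
# (EKS, AKS, GKE, or on-prem) is just a matter of switching contexts.
config.load_kube_config()

apps = client.AppsV1Api()
for dep in apps.list_deployment_for_all_namespaces().items:
    ready = dep.status.ready_replicas or 0
    print(f"{dep.metadata.namespace}/{dep.metadata.name}: {ready}/{dep.spec.replicas} ready")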


DataOps comes into its own

Compared with the adoption of continuous delivery and DevOps in application development, the time to value with data moves at a slow crawl. Making a change to a database field can take weeks or months. Such a painfully slow cadence is unacceptable in today’s fast paced, data-driven world. Time to value is a key competitive advantage for any company.

As data engineering adopts its own version of DevOps – DataOps – companies will see an improved time to value with their data. DataOps applies the disciplines of agile, continuous delivery, DevOps, and statistical process control to data. The result is that data becomes more reliable, consistent, high quality, and quickly available to downstream consumers – analysts, data scientists, automation, and models.
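
As a flavor of what this looks like in practice, here is a sketch of the kind of automated data test a DataOps pipeline might run on every load before publishing a table to downstream consumers. The table, columns, and thresholds are hypothetical.

import pandas as pd

def validate_orders(df: pd.DataFrame) -> list:
    """Return the failed checks for a hypothetical orders table."""
    failures = []
    if df.empty:
        failures.append("no rows loaded")
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        failures.append("negative order amounts")
    null_rate = df["customer_id"].isna().mean()
    if null_rate > 0.01:  # a statistical-process-control style threshold
        failures.append(f"customer_id null rate too high: {null_rate:.1%}")
    return failures

failures = validate_orders(pd.read_parquet("orders.parquet"))
if failures:
    raise SystemExit("Data checks failed: " + "; ".join(failures))

Checks like this, versioned and run in CI/CD alongside the transformation code, are what turn a weeks-long change cycle into a same-day deploy made with confidence.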


Data lineage will grow in importance

Questions about data commonly revolve around what and where.

“What is the origin of this data?”

“What does this field mean?”

“Where is data for X?”

Answering these questions is surprisingly difficult. Think of your data pipeline as a supply chain of physical goods. Raw data comes in, is processed into a variety of outputs, and is consumed. In physical supply chains, assets are tracked as they move from raw materials to finished goods; everything is accounted for as it moves through the supply chain.

Similarly, data lineage tracks the “what” and “where” of data. Data is traced from inception to end use, giving full visibility into where raw data ingredients are consumed. Another benefit is the ability to quickly identify data errors when they arise and address their root cause. Data lineage enables better DataOps.
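
Dedicated lineage tools exist, but the core idea is simple enough to sketch in a few lines of Python: for every output a pipeline step produces, record which inputs and which transformation produced it. The paths and names below are hypothetical.

import hashlib, json
from datetime import datetime, timezone

def record_lineage(registry, output_path, input_paths, transform_name):
    """Append one lineage record: which inputs and transform produced an output."""
    registry.append({
        "output": output_path,
        "inputs": sorted(input_paths),
        "transform": transform_name,
        "run_at": datetime.now(timezone.utc).isoformat(),
        # Fingerprinting the input list makes upstream changes detectable later.
        "input_fingerprint": hashlib.sha256(
            json.dumps(sorted(input_paths)).encode()).hexdigest(),
    })

lineage = []
record_lineage(lineage, "warehouse/daily_revenue",
               ["raw/orders", "raw/refunds"], "aggregate_revenue_v2")

With records like these, you can walk backward from a bad report to the exact raw inputs and transformation run that fed it.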

A few things will push data lineage to greater importance for data engineers.

First – and most obvious – as data grows in size and complexity, it morphs through various downstream processes, and the result is data stockpiles. Just as physical supply chains track raw materials through to finished goods, tracking data assets from raw inputs to end uses will help companies keep track of their data inventory.

Second, GDPR and similar data protections now make data asset tracking a compliance necessity. Companies are now legally required to find records – and their variants – and promptly handle this data in accordance with particular laws. Data lineage is a necessary component for tracking data assets in a company’s data supply chain.

Lastly, good machine learning requires quality features. Fundamentally, machine learning models are intertwined with their training sets; each model is a soup of features and data. If the data change, applying the same algorithm will yield a different model. Data lineage allows you to know what data sources were used for what features.
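
A lightweight way to apply this to machine learning is to fingerprint the exact training data and store the fingerprint alongside the model artifact. The file paths and metadata layout below are illustrative, not a prescribed format.

import hashlib, json

def fingerprint_training_data(paths):
    """Hash the exact training files so a model can be traced back to its data."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        with open(path, "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()

training_files = ["features/usage.parquet", "features/billing.parquet"]
model_card = {
    "model": "churn_model_v3",
    "training_data": training_files,
    "data_fingerprint": fingerprint_training_data(training_files),
}
with open("churn_model_v3.lineage.json", "w") as f:
    json.dump(model_card, f, indent=2)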


Ethics issues from data science will become a data engineering problem

If 2018 was about AI and data becoming more tightly interwoven in the fabric of everyday life, it was also a very tough year. The public woke up to the problems that come with data and AI everywhere. Facebook infamously suffered from self-inflicted problems due to inappropriate data collection and sharing. China came under scrutiny for its use of AI in controlling its population. GDPR went into effect. A good summary of AI’s problems in 2018 is in the Quartz article, “This year the world woke up to the problems with AI everywhere”. Needless to say, 2018 was a doozy for the data world.

In the above article, Rumman Chowdhury, Accenture’s lead for responsible AI, says “We were finding that 25% of companies were having to do a complete overhaul of their system at least once a month because of inconsistent outcomes or bias, or they were just unsure. I’m calling that ethical debt”. 2019 is the year when data engineers help solve the “ethical debt” problem in data and models.

As a side note, the IEEE is developing a new series of standards, P7000, under “The IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems”. We’ve been involved in discussions with this group, and are definitely excited about the direction things are going. If the issue of “ethical debt” is not addressed, we predict tough times ahead for AI and data. Our suggestion – get involved. Shape the future.


In conclusion

2019 will be an exciting year where data engineering continues to have an outsized impact. The field of data engineering is poised to create a lot of downstream value for data scientists and business in general.


Porting to AWS Graviton: Cost Engineering with New CPU Architectures

Cloud and the Curse of Familiarity

The move to cloud offers unprecedented opportunities for organizations to deploy new technologies, accelerate time to value and save on compute costs. This last claim defies tech community folklore. Isn’t cloud more expensive than traditional servers? The answer is yes… and no.

The fundamental problem is what I call the curse of familiarity. When we begin experimenting with cloud, we start with the basic assets of compute, storage and networking. These look and feel like the systems we’re used to. Virtual machines expose CPU cores and memory, and run Linux or Windows. To connect and install software, we SSH or RDP, just like we would with our own hardware. We can lift and shift workloads; the web app we had installed on a Linux blade can be moved to a Compute Engine instance.

Everything is so familiar, and herein lies the problem. While lift and shift is a good way to begin the journey, the familiar way is frequently the wrong way to engineer for the cloud.

Cost Engineering vs Performance Engineering

In the on-prem world, engineering can have an indirect impact on operational and hardware costs. Optimizing a Java web application to improve CPU efficiency by 50% will reduce the hardware required to handle projected peak load. Combined with other optimizations, this allows the team to delay hardware purchases and shrink the next hardware refresh. This might have other knock-on effects, such as deferring a data center expansion and keeping the data center team’s headcount smaller.

In the cloud, costs are immediate and measurable. With reduced requirements to handle peak load, the engineering team can reduce the VM capacity allocated to this web app by half. They will see a 50% reduction in compute costs as soon as the changes are deployed.

As the team gains experience and sophistication, they realize that there are many additional knobs to turn. Instead of running enough servers to handle peak load 24 hours a day, they refactor the app to be stateless and ephemeral so they can use autoscaling. They learn that not all CPU cycles are created equal: the price of CPU core and memory resources varies dramatically depending on how those resources are consumed. A move to spot instances delivers potential savings of 80%. With additional refactoring, AWS Lambda functions are cheaper still.
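
The arithmetic behind these decisions is worth making explicit. The numbers below are purely illustrative (hypothetical hourly rates and fleet sizes), but they show how autoscaling and spot pricing stack on top of each other before any code-level tuning.

# Illustrative numbers only; not real AWS pricing.
on_demand_rate = 0.10            # $/hour per VM (hypothetical)
hours_per_month = 730
peak_vms, avg_vms = 20, 8        # fleet sized for peak vs. autoscaled average
spot_discount = 0.80             # spot at roughly 80% off on-demand

always_on_peak  = peak_vms * on_demand_rate * hours_per_month
autoscaled      = avg_vms * on_demand_rate * hours_per_month
autoscaled_spot = autoscaled * (1 - spot_discount)

print(f"Sized for peak, 24/7:  ${always_on_peak:,.0f}/month")
print(f"Autoscaled on-demand:  ${autoscaled:,.0f}/month")
print(f"Autoscaled on spot:    ${autoscaled_spot:,.0f}/month")

On these assumptions, the same workload drops from roughly $1,460 a month to about $117.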

I refer to the traditional paradigm as performance engineering and the new problems of cloud cost management as cost engineering. In fact, cost engineering subsumes performance engineering; improvements in application performance will almost always reduce cloud spend in a very direct way. But myopically focusing on performance and CPU/memory efficiency misses the massive savings available through other mechanisms.

Enter AWS Graviton

AWS recently introduced an additional cost saving mechanism by offering new processor types. AMD-based instances came on the scene a few months ago; a1 instances, running Graviton processors built on the ARM instruction set, were announced at re:Invent in November and are available now through the AWS console.

Since AMD and Intel processors use essentially the same AMD64 instruction set, applications can be transitioned with minimal testing, but the cost savings are modest, about 10% for an instance with the same number of virtual cores. AWS is promising much greater savings with a1 instances, up to 45% for suitable workloads; this can be combined with savings from autoscaling and spot instances.

Of course, the devil’s in the details; Graviton processors have dramatically different performance characteristics from Intel and AMD architectures. While we could deep dive into a technical discussion of memory bandwidth, single threaded performance and floating point units, the best way to determine the cost of running your application on the new architecture is to rebuild it, run with a test load and measure directly.

Code in the Linux / open source ecosystem can be ported with modest effort. Python is a widely used language in the data science / data engineering space where we generally work. We look forward to working with clients on a full migration analysis with cost profiling, but this tutorial will stick to the basics of building and installing a Python library.

Running a Python Library on Graviton

Lately, I’ve been working with the Google Ads Python Client, so I’ll walk through the process of getting it up and running. Go to the EC2 console and click the button to launch a new instance. On the AMI screen, we’ll use Amazon Linux 2, but you’ll notice that you now have a choice of x86 or ARM architectures. Choose the ARM architecture and select the AMI. You’ll see the new a1 instance types at the top of the next screen. Select a1.medium and click Review and Launch at the bottom of the screen. SSH to the instance so you can begin installing software.

I prefer to use Python 3 to make my code more future proof, so let’s start by getting that set up.
[ec2-user@ip-172-31-36-228 ~]$ python --version
Python 2.7.14
[ec2-user@ip-172-31-36-228 ~]$ python3
-bash: python3: command not found
[ec2-user@ip-172-31-36-228 ~]$ sudo yum install python3
...
Complete!

We can now run pip in Python 3 for package installation.

[ec2-user@ip-172-31-36-228 ~]$ pip3 install googleads --user

We use the --user switch to avoid permission errors in accessing system Python packages. From here, the install would go smoothly on an Intel EC2 instance, but goes off the rails with ARM.

error: command 'gcc' failed with exit status 1

Looking through the output, we see where the failure occurs.

Running setup.py install for PyYAML ... done
Running setup.py install for suds-jurko ... done
Running setup.py install for lxml ... error

Google Ads uses a legacy SOAP API, so data is encoded in XML. XML has a general reputation for slow performance, and this is especially true with native Python text serialization, a huge headache when pushing large amounts of data to the API. The lxml library relies instead on C libraries to speed up serialization.

This brings us to one of the key strengths and pain points of Python. Python has a reputation for being slow, but that isn’t necessarily the case. Python lives in the C ecosystem and is performant so long as the computationally intensive parts of an application are handled by C code. The flip side is that the required C toolchains and dependencies must be present for packages to build, creating complications not present with JVM languages.

In this case, there is no prebuilt ARM binary wheel in the PyPI repositories, so pip attempts to build the binaries from source – we’re on our own to resolve the dependencies. Looking at the end of the error message, we find some additional clues.

unable to execute 'gcc': No such file or directory
Compile failed: command 'gcc' failed with exit status 1

This is a familiar song and dance if you’ve ever had to compile Python package dependencies. We need to install GCC; we also need the Python developer tools which contain required C headers.

sudo yum install gcc python3-devel

This time, we can just try installing lxml since that’s the package causing the failure when we install googleads.

pip3 install lxml --user

*********************************************************************************
Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
*********************************************************************************
error: command 'gcc' failed with exit status 1

Let’s investigate libxml2. First, we try installing with yum.

[ec2-user@ip-172-31-36-228 ~]$ sudo yum install libxml2
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Package libxml2-2.9.1-6.amzn2.3.2.aarch64 already installed and latest version
Nothing to do

List other relevant packages.

[ec2-user@ip-172-31-41-134 ~]$ yum list available | grep libxml2
libxml2-devel.aarch64 2.9.1-6.amzn2.3.2 amzn2-core
libxml2-static.aarch64 2.9.1-6.amzn2.3.2 amzn2-core

Install libxml2-devel.

[ec2-user@ip-172-31-36-228 ~]$ sudo yum install libxml2-devel

Trying to install lxml gives more errors, but we find an additional hint in the output.

ERROR: b'/bin/sh: xslt-config: command not found\n'
** make sure the development packages of libxml2 and libxslt are installed **

[ec2-user@ip-172-31-36-228 ~]$ yum list available | grep libxslt
libxslt.aarch64 1.1.28-5.amzn2.0.2 amzn2-core
libxslt-devel.aarch64 1.1.28-5.amzn2.0.2 amzn2-core
libxslt-python.aarch64 1.1.28-5.amzn2.0.2 amzn2-core

This looks promising. Let’s install all three.

[ec2-user@ip-172-31-36-228 ~]$ sudo yum install libxslt-devel libxslt-python libxslt

Now install lxml with pip. Be warned that the command will take a long time – the compilation process is slow. Once that completes, you should be able to install googleads. You can find instructions on using the client to interact with Google Ads here.
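
As a quick sanity check that the native extension actually built and works on ARM, and as a first step toward the direct measurement argued for above, you can import lxml and time a small parse. The document size and iteration count here are arbitrary.

import platform, timeit
from lxml import etree

print(platform.machine())   # should report aarch64 on an a1 instance

doc = "<ads>" + "".join(
    f"<ad id='{i}'><name>test</name></ad>" for i in range(1000)) + "</ads>"
seconds = timeit.timeit(lambda: etree.fromstring(doc), number=100)
print(f"parsed the document 100 times in {seconds:.3f}s")

Running the same script on a comparable Intel instance gives you a first cost-per-unit-of-work comparison between the two architectures.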

Conclusion

The potential of ARM servers as a counterweight to Intel’s dominance has for years been a topic of discussion in the tech industry, but the Graviton announcement at re:Invent 2018 may well be remembered as the beginning of a sea change in server architecture. AWS makes these new VMs available with a few button clicks, allowing engineers to evaluate them and get jobs up and running immediately, without an expensive long-term commitment to unfamiliar hardware.

If your application dependencies are open source, you can port with a couple hours of work and begin measuring costs. I expect the process to be much easier a year from now as the ARM server ecosystem grows.

Data Technology Cheatsheet – Multi Cloud and Open Source

A couple of days ago, we gave a cloud-agnostic talk at the SLC Google office about the various data technologies out there. Needless to say, there are a lot. We split the universe of data flow into four components.

  1. Ingest
  2. Store
  3. Process and Prepare
  4. Analyze and Automate

Each phase of this flow will be the topic of a future article, as there’s quite a bit there.

In the meantime, we want to share our cheatsheet of the data technologies that address each phase of the data flow above.

Here’s the link.