Cloud Data Warehouse - Nicholas and Company

 
Project Type:  Data Warehouse  |  Data Pipelines

 
 

Nicholas and Company is a privately owned, full-service food distributor based in Salt Lake City, Utah. Founded in 1939, it now serves thousands of valued customers in seven Western states.

 
 
 

Summary of Results

  • Real-time change data capture from on-premises ERP systems.

  • Near-real-time business analytics using AWS S3, Glue, and Redshift/Redshift Spectrum.

  • Robust, future-proof, mostly serverless architecture that decouples storage from compute.

  • Infrastructure and licensing costs reduced to less than 1% of the previous data platform.

 

The Challenges

Nicholas and Company experienced information starvation. Key business reports were delivered with data that was at least one day out of date, forcing decision makers to operate without critical insights. The main bottleneck was an archaic, monolithic data pipeline that grew ever more bloated over time, with ETL processes taking more than a day to finish. Even worse, process failures frequently broke the pipeline, leaving leaders to make decisions completely blind.

The legacy data pipeline was tightly coupled to the on-premises mainframe ERP system. This put enormous load on the ERP system, causing it to perform poorly for other business applications.

In addition, the whole system was extremely expensive and difficult to maintain, consuming a massive number of operations hours, especially when instability brought the system down and created a crisis. In practice, much of the system was kept barely functional. The operations team was oversubscribed keeping the existing system up and running, leaving little time to address tech debt. Data quality issues went unaddressed, and the system and its data were poorly documented.

 
 
 
For years, we struggled to implement business intelligence at Nicholas and Company. Nobody could implement a BI architecture that reliably worked and scaled with the business.

In only a few months, Ternary Data architected and developed a simple, reliable and cost-effective real-time BI architecture on AWS: a data lake using S3, serverless data cataloging with Glue, and a data warehouse using Redshift and Redshift Spectrum. The result is a scalable data architecture that we can use not just for business intelligence, but also a whole host of other analytical problems in our company, such as IoT, data science and AI. The new data architecture is truly game-changing for our business.
— Murali Penubothu, CIO, Nicholas and Company
 
 
 

The Requirements

Nicholas engaged Ternary Data to help design and build a greenfield data platform that met the following requirements.

  • The core of the data platform would be engineered to last for at least 10 years.

  • Storage and compute would be decoupled.

  • Serverless technologies would be used wherever possible to reduce cost and infrastructure maintenance.

  • Flexible and modular. As new technologies and trends emerged, the platform would be flexible enough to swap out components without disrupting the rest of the system.

  • New data arriving in the system would be available for analysis within 5 minutes.

  • Near 100% uptime.

  • Immediate use case - Business Intelligence (BI)

  • Future use cases - IoT, AI, etc.

  • Cost effective. As much as possible, the platform would be pay-as-you-go.

 

The Solution

Because this was a greenfield project, we evaluated several options, ultimately choosing a combination of off-the-shelf third-party tools and a mostly serverless architecture on AWS.

Why Serverless?

A key differentiator of the next generation Nicholas data platform was its serverless-first approach. Wherever possible, serverless AWS technologies were adopted. Why serverless? Serverless allows the cloud provider (in this case, AWS) to manage the underlying resources: servers, infrastructure maintenance, upgrades, and so on. This freed the Nicholas data team to focus on mission-critical activities such as data analysis, modeling, and reporting.

ERP change data capture

Transactional data from on-prem ERP systems needs to be available for actionable insights within minutes. This requirement was a perfect use case for change data capture (CDC), which delivers both new and changed data from a transactional system to a target system. Also, the delicate nature of the on-prem ERP systems meant we needed something low-touch and lightweight. We chose HVR’s change data capture tool for this task, as it was reasonably priced and would not impede performance of the ERP systems.
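To make the idea concrete, here is a minimal sketch of what a generic CDC record might look like as it lands in the data lake. The field names and payload shape are illustrative assumptions, not HVR’s actual output format:

```python
import json
from datetime import datetime, timezone

def make_cdc_event(table, operation, key, before, after):
    """Build a generic CDC record. Field names are illustrative
    assumptions, not HVR's actual payload format."""
    return {
        "table": table,
        "op": operation,        # "insert", "update", or "delete"
        "key": key,             # primary key of the affected row
        "before": before,       # row image before the change (None for inserts)
        "after": after,         # row image after the change (None for deletes)
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }

# Example: a price update on a hypothetical ERP item row
event = make_cdc_event(
    table="ITEM_MASTER",
    operation="update",
    key={"item_id": 10452},
    before={"item_id": 10452, "price": 19.99},
    after={"item_id": 10452, "price": 21.49},
)
print(json.dumps(event, indent=2))
```

Capturing both the before and after row images is what allows downstream consumers to reconstruct the full history of a record, not just its latest state.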

Data Lake

The requirement to decouple storage from compute led to a data lake architecture. We built the data lake in S3 to capture all CDC events, as well as offer flexibility for data from other sources and of varying types, sizes, and velocities.

To preserve data assets, we enabled cross-region replication and a lifecycle policy that archives old data to Glacier.
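As a rough sketch, the lifecycle and replication settings can be expressed as boto3-style request bodies like the following. Bucket names, the IAM role, the prefix, and the 90-day retention period are all hypothetical placeholders, not the actual configuration:

```python
# Lifecycle rule: transition older data-lake objects to Glacier.
# All names and the 90-day threshold are illustrative assumptions.
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "archive-old-cdc-data",
            "Filter": {"Prefix": "cdc/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }
    ]
}

# Cross-region replication of all objects to a replica bucket.
REPLICATION_CONFIG = {
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # hypothetical
    "Rules": [
        {
            "ID": "cross-region-replication",
            "Status": "Enabled",
            "Prefix": "",  # replicate everything
            "Destination": {"Bucket": "arn:aws:s3:::example-datalake-replica"},
        }
    ],
}

# These would be applied with boto3, e.g.:
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="example-datalake", LifecycleConfiguration=LIFECYCLE_CONFIG)
#   s3.put_bucket_replication(
#       Bucket="example-datalake", ReplicationConfiguration=REPLICATION_CONFIG)
```

In practice such settings are usually managed through infrastructure-as-code rather than ad hoc API calls, but the shape of the configuration is the same.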

Serverless Data Lake Tracking with AWS Lambda and DynamoDB

The contents of a data lake have little value if they can’t be tracked and cataloged. We created a serverless system to track data movement using Lambda and DynamoDB. Any time data moves in the data lake, the source, destination, and associated metadata are captured in DynamoDB. This means that data movement can be replayed and tracked.
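A minimal sketch of such a tracking Lambda is shown below. It parses an S3 object-created event and builds one tracking item per object; the table name and item attributes are assumptions for illustration, not the actual schema:

```python
import urllib.parse
from datetime import datetime, timezone

def lambda_handler(event, context):
    """Triggered by S3 object events; builds one tracking item per object.
    Attribute names and the DynamoDB table are illustrative assumptions."""
    items = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys are URL-encoded
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        items.append({
            "object_key": key,
            "bucket": bucket,
            "size_bytes": record["s3"]["object"].get("size", 0),
            "event_name": record["eventName"],
            "tracked_at": datetime.now(timezone.utc).isoformat(),
        })
    # In the real Lambda, each item would be persisted with something like:
    #   boto3.resource("dynamodb").Table("data-lake-movements").put_item(Item=item)
    return items

# Sample S3 event, trimmed to the fields the handler reads
sample_event = {
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-datalake"},
            "object": {"key": "cdc/ITEM_MASTER/part-0001.json", "size": 2048},
        },
    }]
}
result = lambda_handler(sample_event, None)
```

Because every movement lands as an append-only item keyed by object and timestamp, the lineage of any object in the lake can be replayed after the fact.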

Serverless Data Catalog and Tables in AWS Glue

AWS Glue is a flexible data tool that provides both serverless ETL and a data catalog. Because data was stored in a data lake, we needed a way to create schemas and tables over it, and Glue’s data catalog solved this problem. Glue tables are created and updated by a scheduled Glue Crawler, which maintains schemas of objects in the data lake and stores them in Glue’s data catalog. These Glue tables were used to populate schemas and enable queries in both Redshift Spectrum and Athena.
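A scheduled crawler of this kind can be defined with a request like the one below (to be passed to Glue’s `create_crawler` API). The crawler name, IAM role, database, S3 path, and 15-minute schedule are hypothetical placeholders:

```python
# Illustrative Glue create_crawler request; all names, the role ARN,
# and the schedule are assumptions, not the actual configuration.
CRAWLER_REQUEST = {
    "Name": "datalake-cdc-crawler",
    "Role": "arn:aws:iam::123456789012:role/glue-crawler-role",  # hypothetical
    "DatabaseName": "datalake_catalog",
    "Targets": {"S3Targets": [{"Path": "s3://example-datalake/cdc/"}]},
    "Schedule": "cron(0/15 * * * ? *)",  # run every 15 minutes
    "SchemaChangePolicy": {
        # Update the catalog table when the crawler detects schema drift
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        # Log (rather than delete) tables whose source objects disappear
        "DeleteBehavior": "LOG",
    },
}

# Applied with boto3, e.g.:
#   boto3.client("glue").create_crawler(**CRAWLER_REQUEST)
```

The `SchemaChangePolicy` is what keeps catalog tables in sync as new columns appear in the lake without manual intervention.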

Data Warehouse

At present, there are many great data warehouse options available on AWS. We evaluated Redshift and Snowflake, but ultimately settled on Redshift as it was easier to set up, fit Nicholas’s initial scale, and is pay-as-you-grow.

For reporting and analysis, a traditional star schema was chosen as the data model. The star schema fit the type of reporting needed at Nicholas - sales by customer, sales by product, and similar - as well as the skillset of the data team, who were trained in traditional data warehouse techniques. Redshift Spectrum held the fact tables; physical Redshift tables housed dimensions. 
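The split between external fact tables and physical dimensions can be sketched with DDL like the following (held here as Python strings). All table names, columns, the catalog database, and the IAM role are illustrative assumptions, not Nicholas’s actual model:

```python
# Hedged sketch of the Spectrum/Redshift split. Names and schemas are
# hypothetical; the real star schema is not shown in this case study.

# Expose the Glue data catalog to Redshift as an external schema.
CREATE_EXTERNAL_SCHEMA = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'datalake_catalog'
IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role';
"""

# Fact table: data stays in S3, queried through Redshift Spectrum.
CREATE_FACT_SALES = """
CREATE EXTERNAL TABLE spectrum.fact_sales (
    sale_id     BIGINT,
    customer_id INT,
    product_id  INT,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
STORED AS PARQUET
LOCATION 's3://example-datalake/curated/fact_sales/';
"""

# Dimension table: small, frequently joined, stored physically in Redshift.
CREATE_DIM_CUSTOMER = """
CREATE TABLE dim_customer (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(256),
    region      VARCHAR(64)
);
"""
```

Keeping the large, append-heavy facts in S3 while the small dimensions live in Redshift is what lets storage grow independently of the cluster’s compute.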

Most data warehouses need a good ETL tool, and ETL tools come in a wide range of costs and complexities. We came across Matillion during the search for a cost-effective and robust ETL tool. Matillion operated as an AWS AMI, meaning it only charged while running; it could easily be stopped, incurring no charges. It was the perfect tool for Nicholas’s stage, where a blend of affordability and effectiveness was crucial.

Business Intelligence

Nicholas already had a subscription to Looker, a business intelligence tool. Thankfully, Ternary Data are huge fans of Looker. Looker provided a robust data modeling language, useful for keeping business logic consistent across reports, and it integrated well with AWS. Both Athena and Redshift were connected to Looker, enabling near-real-time dashboards.

 
Nicholas and Company’s next generation data platform


 
 
 

 

About Ternary Data

Ternary Data is a specialty data architecture and consulting firm based in Salt Lake City, Utah. Ternary Data advises and coaches companies on discovering the value in their data through cloud services and best practices.

Ternary Data is partnered with Google Cloud and many more best-in-class data technology companies.