Building a Data Platform on AWS from Scratch – Part 1
The CIO of a Chicago-based global consultancy contacted SPR to build out a data platform. While the core platform was to be built by the corporate IT organization run by the CIO, the platform was to be used by organizations outside corporate IT as well. It would also centralize data assets and data analysis across the company using a common architecture, with each organization building on top of it to meet its own use case needs.
From the perspective of the client CIO, we were to build a data insights platform: a platform that enables data analysis leading to actionable, data-driven findings that create business value. This goes a step beyond analytics alone, which makes sense of data by uncovering meaningful trends but does not necessarily lead to business value. From our team's perspective, then, the purpose of this platform was to enable such insights to take place.
As with many established firms, use of Microsoft Excel was commonplace, with spreadsheets commonly distributed within and across organizations, as well as between the firm and external clients. Additionally, business units and consultancy practices had become siloed, each making use of disparate processes and tooling. So in addition to centralizing data assets and data analysis, another goal was to implement the concept of data ownership, and enable the sharing of data across organizations in a secure, consistent manner.
As for technology, the client provided the following key guidelines:
- Prefer to use AWS services
- Make use of Azure Repos (Git) and Azure Pipelines (both Azure DevOps components) for source control and automated builds and deployments to AWS
These guidelines were based on the firm's previous decisions to adopt AWS as its public cloud, with a key exception to make use of Azure for its Git repositories and DevOps pipelines.
Before working through the product selection process for the architecture, the team prepared a conceptual architecture to outline the vision for the data platform, based on client requirements and team experience. As this effort was expected to be sizable, and we were building everything from scratch, SPR initially met with the client for an Inception, a phase focused on initial requirements, analysis, and architecture, just enough to create the starting point for a product backlog, and to calculate estimated cost and determine team makeup for an MVP (minimum viable product).
The conceptual architecture can be broken down into the following components. Keep in mind that this architecture concerned run-time only; AWS CloudFormation was used throughout to build out the infrastructure via Azure DevOps. Because we didn't want implementation choices to cloud this architecture, we kept some areas split out that we expected might be combined when it came time for implementation, such as data pipelines and orchestration. While we sought to use general industry terms, many traditional data professionals do not have software development backgrounds (and vice versa), so to cater this article to a more diverse audience, an attempt has been made here to explain each area from the context of both disciplines.
Unlike bronze data, silver data can rightly have different connotations, although in all these cases this data exists in one or more intermediary stages between bronze and gold that we called "staging" data. Again, our definition of silver data closely aligned with Databricks', although we distinguished between "readable" and "clean" data. In our case, readable data was only slightly removed from the original raw data: only minor modifications were made to ensure readability, and the data was then stored in Apache Parquet format to enable more performant subsequent processing. Clean data, on the other hand, was intended to be the result of optionally passing this data through general and business domain specific rules. I've adjusted Teradata's interpretation in the above diagram to read "minimum viable data" rather than "minimal viable data", in keeping with the spirit of an MVP (the "M" stands for "minimum", not "minimal"). That said, Teradata is going in the right direction, as this stage of the data provides a baseline for subsequent processing.
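As a concrete sketch of the "readable" stage, the snippet below applies the kind of minor, schema-preserving fixes described above (normalized headers, trimmed values) to raw CSV data; on the platform the result was then written out as Parquet. The column names and the `to_readable` helper are illustrative, not the platform's actual code.

```python
import csv
import io

def normalize_header(name):
    """Lower-case a raw column header and join its words with underscores."""
    return "_".join(name.strip().lower().split())

def to_readable(raw_csv_text):
    """Apply minimal, reversible fixes so raw data becomes readable:
    normalized headers and trimmed cell values. On the platform, the
    output of this stage was stored as Apache Parquet to enable more
    performant subsequent processing."""
    rows = list(csv.reader(io.StringIO(raw_csv_text)))
    header = [normalize_header(h) for h in rows[0]]
    return [dict(zip(header, (cell.strip() for cell in row)))
            for row in rows[1:]]

records = to_readable("Expense ID, Amount \n 101 , 42.50\n")
```

Note that no values are reinterpreted or dropped at this stage; anything beyond readability fixes belongs to the "clean" stage's rules.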
Gold data doesn't need much of an explanation. While clean data provides a baseline, the same business domain objects can originate from multiple places, with gold data providing a synthesis. At this point the data becomes trusted, as described by Teradata. Perhaps more importantly, it becomes a single version of the truth (when not used as one-off or self-service data). And while this data does often involve aggregates, as suggested by Databricks, that depends on the purpose of the data, so it isn't always the case; we broke this data down into both "canonical" and "denormalized" data.
In the past, it was common to describe components for loading data from disparate data sources as "ETL" (extract, transform, and load). The data lake revolution later swapped the order of transformation and loading to make this process "ELT", as organizations realized the importance of being able to perform analyses on raw data before it is transformed downstream. Perhaps because of the connotations these two terms carry from their origins in enterprise software, the continued diversity in approaches to moving data from point A to point B while performing operations on it en route, and the industry's move to perform this work via code rather than specialized tools, these enterprisey-sounding terms evolved into "data pipelines".
It's not easy to find how these terms evolved, or even what a "data pipeline" is all about. But this article summarizes the data pipeline well: "The data pipeline is an ideal mix of software technologies that automate the management, analysis and visualization of data from multiple sources, making it available for strategic use...Data pipelines are not miraculous insight and functionality machines either, but instead are the best end-to-end solution to meet the real-world expectations of business leaders...By developing and implementing data pipelines, data scientists and BI specialists benefit from multiple viable options regarding data preparation, management, processing, and data visualization. Data pipelines are an incredibly fruitful way of tackling technology experimentation and data exploration."
Orchestration and platform configuration
Because data pipelines are composed of multiple components working together, across multiple data stages, these need to be orchestrated to work together, executed via either events or schedules. We ended up referring to each end-to-end process for a given use case as a "data pipeline", and to each portion of a pipeline between source and destination data stages as a data pipeline "segment." In order for pipelines to execute successfully across each of their segments, the flow of data and the processing of that data need to be orchestrated so that each step is performed at the right time, and only data that is ready for the next data stage is permitted to move forward.
In addition to the aforementioned technology guidelines provided by our client, the CIO also wanted us to organize the data platform by something he referred to as "insight zones." A given insight zone is owned by a subset of the business, and we implemented insight zones as one or more data pipelines designed for their specific needs, along with consideration for other insight zones that may want to make use of data it may be requested to publish, such as master reference data. As such, insight zone specific configuration needed to be stored somewhere, with each corresponding data pipeline using this data to determine how to execute, both for data pipeline segments developed to be reused across insight zones, as well as data pipeline segments unique to a given insight zone.
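To illustrate the idea, the sketch below shows insight zone configuration as a plain dictionary and a lookup a pipeline segment might perform to decide how to execute. On the actual platform this configuration lived in a shared store rather than in code, and the zone, dataset, and field names here are hypothetical.

```python
# Hypothetical insight zone configuration. Each zone owns its datasets'
# expected schemas, plus any data it publishes for other zones to use.
INSIGHT_ZONE_CONFIG = {
    "corporate_finance": {
        "datasets": {
            "expenses": {"fields": {"expense_id": "int", "amount": "decimal"}},
        },
        "publishes": ["master_reference"],
    },
}

def dataset_config(insight_zone, dataset):
    """Look up the configuration a data pipeline segment uses to determine
    how to execute for a given insight zone's dataset. Reusable segments
    read this rather than hard-coding any zone's specifics."""
    zone = INSIGHT_ZONE_CONFIG.get(insight_zone)
    if zone is None:
        raise KeyError(f"unknown insight zone: {insight_zone}")
    return zone["datasets"][dataset]
```

Keeping this lookup as the only zone-specific input is what lets the same segment code be reused across insight zones.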
Landing and staging data stores
The first data pipeline segment ended in what we referred to as the "landing" data store. Data ingestion was to be triggered via events indicating the presence of new data in the ingress data store external to the platform. Ingestion was to be simply a copy of data onto the platform, along with cataloging of this data to indicate that it was now available. While this data store would preserve the original, raw data so that it would be available for reprocessing, we also determined that this data should not be available to be read. Since data was simply being copied, the potential existed for unapproved data to be present, especially with respect to sensitive data. As such, no one should have access to this data until it was first approved.
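The event-handling logic for this segment can be sketched as below: parse the S3 event signaling new ingress data and plan the copies into the locked-down landing store. The bucket names, key layout, and function names are assumptions for illustration; the actual object copy and catalog update (e.g. via boto3) are omitted.

```python
def landing_key(ingress_key, insight_zone):
    """Map an object key in the external ingress store to its location
    in the platform's landing data store (hypothetical key layout)."""
    return f"{insight_zone}/landing/{ingress_key}"

def plan_ingestion(s3_event, insight_zone):
    """Given an S3 event notification indicating new ingress data, return
    the (source, destination) copies to perform. The copy itself and the
    catalog entry marking the data as available are omitted here."""
    copies = []
    for record in s3_event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        copies.append(((bucket, key),
                       ("landing-bucket", landing_key(key, insight_zone))))
    return copies

# A minimal S3 event notification, reduced to the fields used above.
event = {"Records": [{"s3": {"bucket": {"name": "ingress-bucket"},
                             "object": {"key": "expenses/2019-09-01.csv"}}}]}
copies = plan_ingestion(event, "corporate_finance")
```

Because ingestion is a pure copy, reprocessing is always possible from landing, while read access stays closed until approval.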
Approval was to be automated during the subsequent data pipeline segment, which ended in the "staging" data store. Data ingested for each insight zone needed to abide by the configuration created for it, namely the fields configured to be present in the data and the data types of each of these fields. In order for a given dataset to be routed to staging, it had to be compared to this configuration, with its schema matching the configuration set up for its associated insight zone. Datasets not passing these tests were to result in aborted data pipelines, with the expectation that the associated insight zone would correct any issues and reattempt ingestion. In addition, data stored in staging should be readable in a performant manner, with minimal modifications made to achieve this, whether by users doing exploratory work, users comparing it with corresponding data in the "ingress" data store, or the next pipeline segment that processes this data.
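The schema check that gates promotion to staging can be sketched as a pure comparison between the ingested dataset's schema and the insight zone's configured fields. The field names and type labels here are illustrative.

```python
def matches_config(dataset_schema, configured_fields):
    """Compare an ingested dataset's schema (column name -> type) against
    the fields configured for its insight zone. Any missing field, extra
    field, or type mismatch fails the check, which aborts the pipeline
    rather than promoting questionable data to staging."""
    missing = set(configured_fields) - set(dataset_schema)
    extra = set(dataset_schema) - set(configured_fields)
    wrong_type = {name
                  for name in set(configured_fields) & set(dataset_schema)
                  if dataset_schema[name] != configured_fields[name]}
    return not (missing or extra or wrong_type)

configured = {"expense_id": "int", "amount": "decimal"}
```

On failure, the insight zone corrects the source data (or its configuration) and reattempts ingestion, as described above.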
Operational and analytical data stores
The next stage would end in what we referred to as the "operational" data store (ODS). For this platform the purpose was to store ingested data into a canonical data model so business users could perform short-running ad hoc queries against data structures created with relatable business domain terms. Additionally, canonical data models are intended to provide data in standardized structures, with any differences resolved when the same data objects are provided by different source data systems. This data store could also be referred to as "transactional," in that it is data normalized across varying degrees of parent / child relationships at a level provided by source systems.
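A minimal sketch of the canonical mapping follows: the same business object arriving from two source systems is resolved onto one standardized structure. The source field names and mappings are invented for illustration.

```python
def to_canonical(record, field_map):
    """Map a source system record onto the canonical data model, resolving
    naming differences when different systems provide the same object."""
    return {canonical: record[source] for canonical, source in field_map.items()}

# Two hypothetical source systems providing the same expense object
# under different field names:
system_a = to_canonical({"DocNo": "101", "Amt": "42.50"},
                        {"expense_id": "DocNo", "amount": "Amt"})
system_b = to_canonical({"EntryId": "101", "PostedAmount": "42.50"},
                        {"expense_id": "EntryId", "amount": "PostedAmount"})
```

Once both systems land in the same shape, business users can query the ODS with relatable domain terms regardless of which system the data came from.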
In contrast with the operational data store is the subsequent data store intended for analytical purposes, such as reporting. While the operational data store is normalized to enable ad hoc business consumption, it is not designed for analytics, which often involve long-running queries over aggregates intended to answer business questions for a given time period. As such, ingested data needs to be denormalized for this purpose, providing the ability to slice and dice data in a performant manner. Traditionally, analytical data structures are represented by star schemas. But there are now options for where this data can be hosted, satisfactory query performance can be achieved via other data structures, and this data might be used for purposes other than reporting, such as machine learning models, so a star schema isn't necessarily the right choice.
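The denormalization step can be sketched as joining dimension attributes onto each fact row, producing the flat, read-optimized records that analytical queries slice and dice. The table and column names are hypothetical.

```python
# Normalized, ODS-style data: a fact table keyed to a dimension.
expenses = [
    {"expense_id": 1, "dept_id": 10, "amount": 42.50},
    {"expense_id": 2, "dept_id": 10, "amount": 7.25},
]
departments = {10: {"dept_id": 10, "dept_name": "Consulting"}}

def denormalize(facts, dims):
    """Join dimension attributes onto each fact row, so analytical queries
    need no joins at read time. In a star schema these would be the fact
    and dimension tables; here the flattening itself is the point."""
    return [{**fact, "dept_name": dims[fact["dept_id"]]["dept_name"]}
            for fact in facts]

flat = denormalize(expenses, departments)
```

The trade-off is the usual one: redundancy in storage in exchange for performant aggregation and filtering.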
Tech Stack Selection Process
We worked through the tech stack selection process in parallel with determining the conceptual architecture, narrowing implementation options down to two, which we presented in May 2019, keeping in mind our client's guideline to use AWS services whenever it made sense to do so:
- An AWS "all-in" architecture
- An architecture we referred to as an AWS / Databricks Unified Data Analytics Platform "hybrid" architecture
While this first option would solely comprise AWS services, this second option would additionally make use of Databricks from AWS Marketplace, as well as other third party components as needed such as Apache Airflow. We recommended the first option, making use of a tech stack comprising the following key components for the MVP: Amazon S3, AWS Lambda, AWS Lake Formation, AWS Glue jobs (Python shell and Apache Spark), AWS Glue crawlers, AWS Glue Data Catalog, Amazon Athena (Presto), and Amazon Aurora.
Keeping in mind the need to build the MVP by September 2019, the risks we identified at that time can be summarized as follows, with the given that availability of AWS services differs across regions:
- General availability (GA) of AWS Lake Formation was not expected until June or July 2019
- AWS recommended we not use AWS Lake Formation for production until sometime after it goes GA, due to expected instability
- Our team's ability to execute depended on AWS executing on its roadmap
- AWS Glue jobs were limited to an older version of Apache Spark (2.2.1, released in December 2017), with limited options for Spark nodes
- Data lineage functionality was not expected to become available until sometime after AWS Lake Formation goes GA
The risks we identified for the recommended tech stack outnumbered those of the second option, but they appeared to be of lower magnitude, keeping in mind that Delta Lake had only been open sourced by Databricks in late April 2019. The AWS Lake Formation ecosystem appeared promising, providing out-of-the-box conveniences for a jump start. We advised that the products included in this tech stack were not comprehensive, since platform use cases were not yet well defined, and so the platform would likely need to evolve beyond AWS services as the limitations of this option became known.
To mitigate risks, many AWS services are shared between these two options, including AWS Glue Data Catalog and AWS Glue crawlers, which were expected to provide core platform conveniences. Additionally, use of Apache Spark was key to both architectures, enabling migration across tech stacks if needed down the road. In a nutshell, the recommended option was expected to enable narrow functionality for initial use cases more quickly, barring GA release dependencies, while also allowing us to fail fast should any of the risks materialize.
Since we were following an agile approach, however, the platform evolved over the course of the project even though we initially moved forward with the chosen tech stack: we started with proofs of concept (POCs) used to create the prototypes on which the rest of the platform was built, and continued to prove everything out in an evolutionary manner during the remaining implementation.
The platform MVP was successfully released to production on time and within budget in September 2019, making the first use case built on top of the platform available to business users from the client's corporate finance department. In addition to the data pipelines and data stores included as part of the platform, a canonical data model was created for corporate expenses, as was a machine learning (ML) model for anomaly detection using Amazon SageMaker, and a Power BI report implemented in Azure that accesses data in AWS via Power BI Gateway.
In subsequent months, all deployments from Azure DevOps to the new platform were fully automated for insight zones, with much of the code base genericized so that insight zones can be configured non-programmatically according to their needs. Exceptions included insight zone specific Spark code, data models, ML models, and reports and visualizations, since these depend on the data being processed by each insight zone.
In addition to the platform source code, the SPR team prepared extensive documentation and held knowledge transfer sessions with the client development teams. These sessions covered everything from day-to-day tasks and the setting up of new insight zones, to walkthroughs of code and how to model, and architecture guiding principles and recommendations. In parallel with the build effort, SPR also led a data governance team that provided guidance on a breadth of areas such as data quality and data security.
At the time of posting this case study, work is under way to build proofs of concept for a potential migration of the platform to AWS Lake Formation. Additionally, the due diligence performed during this project around expanded use of Amazon Athena (Presto), and potentially Amazon Redshift, will continue as the client better understands its use cases and as the data, and the usage of this data, becomes better known.
Technologies used during this effort included Amazon S3, AWS Lambda, AWS Identity and Access Management (IAM), AWS Secrets Manager, Amazon CloudWatch, AWS CloudTrail, AWS Single Sign-On (SSO), Python 3.6, AWS Glue jobs (Python shell and Apache Spark), AWS Glue Data Catalog, Amazon Athena (Presto), Amazon Aurora (PostgreSQL 10.7), Amazon SageMaker, Azure Power BI, Power BI Gateway, Azure DevOps Repos (Git), Azure DevOps Pipelines, Amazon EC2, AWS CloudFormation, PowerShell, Liquibase, DBeaver, AWS Glue development endpoints (Amazon SageMaker notebooks), Visual Studio Code.