Building a Data Platform on AWS from Scratch – Part 1
The CIO of a Chicago-based global consultancy contacted SPR to build out a data platform. While the core platform was to be built by the corporate IT organization run by the CIO, the platform was to be used by organizations outside corporate IT as well. It would also centralize data assets and data analysis across the company using a common architecture, with each organization building on top of it to meet its own use case needs.
From the perspective of the client CIO, we were to build a data insights platform: a platform that enables data analysis leading to actionable, data-driven findings that create business value. This goes a step beyond analytics alone, which makes sense of data by uncovering meaningful trends but does not necessarily lead to business value. From our team's perspective, then, the purpose of this platform was to enable such insights to take place.
As with many established firms, use of Microsoft Excel was commonplace, with spreadsheets commonly distributed within and across organizations, as well as between the firm and external clients. Additionally, business units and consultancy practices had become siloed, each making use of disparate processes and tooling. So in addition to centralizing data assets and data analysis, another goal was to implement the concept of data ownership, and enable the sharing of data across organizations in a secure, consistent manner.
As for technology, the client provided the following key guidelines:
- Prefer to use AWS services
- Make use of Azure Repos (Git) and Azure Pipelines (both Azure DevOps components) for source control and automated builds and deployments to AWS
These guidelines were based on the firm's previous decisions to adopt AWS as its public cloud, with a key exception to make use of Azure for its Git repositories and DevOps pipelines.
Before working through the product selection process for the architecture, the team prepared a conceptual architecture to outline the vision for the data platform, based on client requirements and team experience. As this effort was expected to be sizable, and we were building everything from scratch, SPR initially met with the client for an Inception, a phase focused on initial requirements, analysis, and architecture, just enough to create the starting point for a product backlog, and to calculate estimated cost and determine team makeup for an MVP (minimum viable product).
The conceptual architecture can be broken down into the following components. Keep in mind that this architecture concerned run-time only; AWS CloudFormation was used throughout to build out the infrastructure via Azure DevOps. Because we didn't want implementation choices to cloud this architecture, we kept some areas split out that we expected might be combined when it came time for implementation, such as data pipelines and orchestration. While we sought to use general industry terms, many traditional data professionals do not have software development backgrounds (and vice versa), so to cater this article to a more diverse audience, an attempt has been made here to explain each area from the context of both disciplines.
Unlike bronze data, silver data can rightly have different connotations, although in all these cases this data exists in one or more intermediary stages between bronze and gold that we called "staging" data. Again, our definition of silver data closely aligned with Databricks', although we distinguished between "readable" and "clean" data. In our case, readable data was only slightly removed from the original raw data: only minor modifications were made to ensure readability, and the data was then stored in Apache Parquet format to enable more performant subsequent processing. Clean data, on the other hand, was intended to be the result of optionally passing this data through general and business domain specific rules. I've adjusted Teradata's interpretation in the above diagram to read "minimum viable data" rather than "minimal viable data", in keeping with the spirit of an MVP (the "M" stands for "minimum", not "minimal"). That said, Teradata is going in the right direction, as this stage of the data provides a baseline for subsequent processing.
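As a concrete sketch of the "readable" stage, the snippet below applies the kind of minor, schema-preserving fixes described above (normalized headers, trimmed values) to raw CSV data; on the platform the result was then written out as Parquet. The column names and the `to_readable` helper are illustrative, not the platform's actual code.

```python
import csv
import io

def normalize_header(name):
    """Lower-case a raw column header and join its words with underscores."""
    return "_".join(name.strip().lower().split())

def to_readable(raw_csv_text):
    """Apply minimal, reversible fixes so raw data becomes readable:
    normalized headers and trimmed cell values. On the platform, the
    output of this stage was stored as Apache Parquet to enable more
    performant subsequent processing."""
    rows = list(csv.reader(io.StringIO(raw_csv_text)))
    header = [normalize_header(h) for h in rows[0]]
    return [dict(zip(header, (cell.strip() for cell in row)))
            for row in rows[1:]]

records = to_readable("Expense ID, Amount \n 101 , 42.50\n")
```

Note that no values are reinterpreted or dropped at this stage; anything beyond readability fixes belongs to the "clean" stage's rules.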
Gold data doesn't need much of an explanation. While clean data provides a baseline, the same business domain objects can originate from multiple places, with gold data providing a synthesis. At this point the data becomes trusted, as described by Teradata. Perhaps more importantly, it becomes a single version of the truth (when not used as one-off or self-service data). And while this data does often involve aggregates, as suggested by Databricks, that depends on the purpose of the data, so it isn't always the case; we broke this data down into both "canonical" and "denormalized" data.
In the past, it was common to describe components for loading data from disparate data sources as "ETL" (extract, transform, and load). The data lake revolution later swapped the order of transformation and loading to make this process "ELT", as organizations realized the importance of being able to perform analyses on raw data before it is transformed downstream. Perhaps because of the connotations these two terms carry from their origins in enterprise software, the continued diversity in approaches to moving data from point A to point B while performing operations on it en route, and the industry's move to perform this work via code rather than specialized tools, these enterprisey-sounding terms evolved into "data pipelines".
It's not easy to find how these terms evolved, or even what a "data pipeline" is all about. But this article summarizes the data pipeline well: "The data pipeline is an ideal mix of software technologies that automate the management, analysis and visualization of data from multiple sources, making it available for strategic use...Data pipelines are not miraculous insight and functionality machines either, but instead are the best end-to-end solution to meet the real-world expectations of business leaders...By developing and implementing data pipelines, data scientists and BI specialists benefit from multiple viable options regarding data preparation, management, processing, and data visualization. Data pipelines are an incredibly fruitful way of tackling technology experimentation and data exploration."
Orchestration and platform configuration
Because data pipelines are composed of multiple components working together, across multiple data stages, these need to be orchestrated to work together, executed via either events or schedules. We ended up referring to each end-to-end process for a given use case as a "data pipeline", and to each portion of a pipeline between source and destination data stages as a data pipeline "segment." In order for pipelines to execute successfully across each of their segments, the flow of data and the processing of that data need to be orchestrated so that each step is performed at the right time, and only data that is ready for the next data stage is permitted to move forward.
In addition to the aforementioned technology guidelines provided by our client, the CIO also wanted us to organize the data platform by something he referred to as "insight zones." A given insight zone is owned by a subset of the business, and we implemented insight zones as one or more data pipelines designed for their specific needs, along with consideration for other insight zones that may want to make use of data it may be requested to publish, such as master reference data. As such, insight zone specific configuration needed to be stored somewhere, with each corresponding data pipeline using this data to determine how to execute, both for data pipeline segments developed to be reused across insight zones, as well as data pipeline segments unique to a given insight zone.
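To illustrate the idea, the sketch below shows insight zone configuration as a plain dictionary and a lookup a pipeline segment might perform to decide how to execute. On the actual platform this configuration lived in a shared store rather than in code, and the zone, dataset, and field names here are hypothetical.

```python
# Hypothetical insight zone configuration. Each zone owns its datasets'
# expected schemas, plus any data it publishes for other zones to use.
INSIGHT_ZONE_CONFIG = {
    "corporate_finance": {
        "datasets": {
            "expenses": {"fields": {"expense_id": "int", "amount": "decimal"}},
        },
        "publishes": ["master_reference"],
    },
}

def dataset_config(insight_zone, dataset):
    """Look up the configuration a data pipeline segment uses to determine
    how to execute for a given insight zone's dataset. Reusable segments
    read this rather than hard-coding any zone's specifics."""
    zone = INSIGHT_ZONE_CONFIG.get(insight_zone)
    if zone is None:
        raise KeyError(f"unknown insight zone: {insight_zone}")
    return zone["datasets"][dataset]
```

Keeping this lookup as the only zone-specific input is what lets the same segment code be reused across insight zones.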
Landing and staging data stores
The first data pipeline segment ended in what we referred to as the "landing" data store. Data ingestion was to be triggered via events indicating the presence of new data in the ingress data store external to the platform. Ingestion was to be simply a copy of data onto the platform, along with cataloging of this data to indicate that it was now available. While this data store would preserve the original, raw data so that it would be available for reprocessing, we also determined that this data should not be available to be read. Since data was simply being copied, the potential existed for unapproved data to be present, especially with respect to sensitive data. As such, no one should have access to this data until it was first approved.
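The event-handling logic for this segment can be sketched as below: parse the S3 event signaling new ingress data and plan the copies into the locked-down landing store. The bucket names, key layout, and function names are assumptions for illustration; the actual object copy and catalog update (e.g. via boto3) are omitted.

```python
def landing_key(ingress_key, insight_zone):
    """Map an object key in the external ingress store to its location
    in the platform's landing data store (hypothetical key layout)."""
    return f"{insight_zone}/landing/{ingress_key}"

def plan_ingestion(s3_event, insight_zone):
    """Given an S3 event notification indicating new ingress data, return
    the (source, destination) copies to perform. The copy itself and the
    catalog entry marking the data as available are omitted here."""
    copies = []
    for record in s3_event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        copies.append(((bucket, key),
                       ("landing-bucket", landing_key(key, insight_zone))))
    return copies

# A minimal S3 event notification, reduced to the fields used above.
event = {"Records": [{"s3": {"bucket": {"name": "ingress-bucket"},
                             "object": {"key": "expenses/2019-09-01.csv"}}}]}
copies = plan_ingestion(event, "corporate_finance")
```

Because ingestion is a pure copy, reprocessing is always possible from landing, while read access stays closed until approval.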
Approval was to be automated during the subsequent data pipeline segment, which ended in the "staging" data store. Data ingested for each insight zone needed to abide by the configuration created for it, namely the fields configured to be present in the data and the data types of each of these fields. In order for a given dataset to be routed to staging, it had to be compared to this configuration, with its schema matching the configuration set up for its associated insight zone. Datasets not passing these tests were to result in aborted data pipelines, with the expectation that the associated insight zone would correct any issues and reattempt ingestion. In addition, data stored in staging should be readable in a performant manner, with minimal modifications made to achieve this, whether by users doing exploratory work, users comparing it with corresponding data in the "ingress" data store, or the next pipeline segment that processes this data.
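The schema check that gates promotion to staging can be sketched as a pure comparison between the ingested dataset's schema and the insight zone's configured fields. The field names and type labels here are illustrative.

```python
def matches_config(dataset_schema, configured_fields):
    """Compare an ingested dataset's schema (column name -> type) against
    the fields configured for its insight zone. Any missing field, extra
    field, or type mismatch fails the check, which aborts the pipeline
    rather than promoting questionable data to staging."""
    missing = set(configured_fields) - set(dataset_schema)
    extra = set(dataset_schema) - set(configured_fields)
    wrong_type = {name
                  for name in set(configured_fields) & set(dataset_schema)
                  if dataset_schema[name] != configured_fields[name]}
    return not (missing or extra or wrong_type)

configured = {"expense_id": "int", "amount": "decimal"}
```

On failure, the insight zone corrects the source data (or its configuration) and reattempts ingestion, as described above.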
Operational and analytical data stores
The next stage would end in what we referred to as the "operational" data store (ODS). For this platform the purpose was to store ingested data into a canonical data model so business users could perform short-running ad hoc queries against data structures created with relatable business domain terms. Additionally, canonical data models are intended to provide data in standardized structures, with any differences resolved when the same data objects are provided by different source data systems. This data store could also be referred to as "transactional," in that it is data normalized across varying degrees of parent / child relationships at a level provided by source systems.
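A minimal sketch of the canonical mapping follows: the same business object arriving from two source systems is resolved onto one standardized structure. The source field names and mappings are invented for illustration.

```python
def to_canonical(record, field_map):
    """Map a source system record onto the canonical data model, resolving
    naming differences when different systems provide the same object."""
    return {canonical: record[source] for canonical, source in field_map.items()}

# Two hypothetical source systems providing the same expense object
# under different field names:
system_a = to_canonical({"DocNo": "101", "Amt": "42.50"},
                        {"expense_id": "DocNo", "amount": "Amt"})
system_b = to_canonical({"EntryId": "101", "PostedAmount": "42.50"},
                        {"expense_id": "EntryId", "amount": "PostedAmount"})
```

Once both systems land in the same shape, business users can query the ODS with relatable domain terms regardless of which system the data came from.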
In contrast with the operational data store is the subsequent data store intended for analytical purposes, such as reporting. While the operational data store is normalized to enable ad hoc business consumption, it is not designed for analytics, which often involve long-running queries over aggregates intended to answer business questions for a given time period. As such, ingested data needs to be denormalized for this purpose, providing the ability to slice and dice data in a performant manner. Traditionally, analytical data structures are represented by star schemas. But there are now options for where this data can be hosted, satisfactory query performance can be achieved via other data structures, and this data might be used for purposes other than reporting, such as machine learning models, so a star schema isn't necessarily the right choice.
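The denormalization step can be sketched as joining dimension attributes onto each fact row, producing the flat, read-optimized records that analytical queries slice and dice. The table and column names are hypothetical.

```python
# Normalized, ODS-style data: a fact table keyed to a dimension.
expenses = [
    {"expense_id": 1, "dept_id": 10, "amount": 42.50},
    {"expense_id": 2, "dept_id": 10, "amount": 7.25},
]
departments = {10: {"dept_id": 10, "dept_name": "Consulting"}}

def denormalize(facts, dims):
    """Join dimension attributes onto each fact row, so analytical queries
    need no joins at read time. In a star schema these would be the fact
    and dimension tables; here the flattening itself is the point."""
    return [{**fact, "dept_name": dims[fact["dept_id"]]["dept_name"]}
            for fact in facts]

flat = denormalize(expenses, departments)
```

The trade-off is the usual one: redundancy in storage in exchange for performant aggregation and filtering.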
Tech Stack Selection Process
We worked through the tech stack selection process in parallel with determining the conceptual architecture, narrowing implementation options down to two, which we presented in May 2019, keeping in mind our client's guideline to use AWS services whenever it made sense to do so:
- An AWS "all-in" architecture
- An architecture we referred to as an AWS / Databricks Unified Data Analytics Platform "hybrid" architecture
While this first option would solely comprise AWS services, this second option would additionally make use of Databricks from AWS Marketplace, as well as other third party components as needed such as Apache Airflow. We recommended the first option, making use of a tech stack comprising the following key components for the MVP: Amazon S3, AWS Lambda, AWS Lake Formation, AWS Glue jobs (Python shell and Apache Spark), AWS Glue crawlers, AWS Glue Data Catalog, Amazon Athena (Presto), and Amazon Aurora.
Keeping in mind the need to build the MVP by September 2019, the risks we identified at that time can be summarized as follows, with the given that availability of AWS services differs across regions:
- General availability (GA) of AWS Lake Formation was not expected until June or July 2019
- AWS recommended we not use AWS Lake Formation for production until sometime after it goes GA, due to expected instability
- Our team's ability to execute depended on AWS executing on its roadmap
- AWS Glue jobs were limited to an older version of Apache Spark (2.2.1, released in December 2017), with limited options for Spark nodes
- Data lineage functionality was not expected to become available until sometime after AWS Lake Formation goes GA
The risks we identified for the recommended tech stack outnumbered those of the second option, but they appeared to be of lower magnitude, keeping in mind that Delta Lake had only been open sourced by Databricks in late April 2019. The AWS Lake Formation ecosystem appeared promising, providing out-of-the-box conveniences for a jump start. We advised that the products included in this tech stack were not comprehensive, since platform use cases were not yet well defined, and so the platform would likely need to evolve beyond AWS services as the limitations of this option became known.
To mitigate risks, many AWS services are shared between these two options, including AWS Glue Data Catalog and AWS Glue crawlers, which were expected to provide core platform conveniences. Additionally, use of Apache Spark was key to both architectures, enabling migration across tech stacks if needed down the road. In a nutshell, the recommended option was expected to enable narrow functionality for initial use cases more quickly, barring GA release dependencies, while also allowing us to fail fast should any of the risks materialize.
Since we were following an agile approach, however, the platform evolved over the course of the project even though we initially moved forward with the chosen tech stack: we started with proofs of concept (POCs) used to create the prototypes on which the rest of the platform was built, and continued to prove everything out in an evolutionary manner during the remaining implementation.
The platform MVP was successfully released to production on time and within budget in September 2019, making the first use case built on top of the platform available to business users from the client's corporate finance department. In addition to the data pipelines and data stores included as part of the platform, a canonical data model was created for corporate expenses, as was a machine learning (ML) model for anomaly detection using Amazon SageMaker, and a Power BI report implemented in Azure that accesses data in AWS via Power BI Gateway.
In subsequent months, all deployments from Azure DevOps to the new platform were fully automated for insight zones, with much of the code base genericized so that insight zones can be configured non-programmatically according to their needs. Exceptions included insight zone specific Spark code, data models, ML models, and reports and visualizations, since these depend on the data being processed by each insight zone.
In addition to the platform source code, the SPR team prepared extensive documentation and held knowledge transfer sessions with the client development teams. These sessions covered everything from day-to-day tasks and the setting up of new insight zones, to walkthroughs of code and how to model, and architecture guiding principles and recommendations. In parallel with the build effort, SPR also led a data governance team that provided guidance on a breadth of areas such as data quality and data security.
At the time of posting this case study, work is under way to build proofs of concept for a potential migration of the platform to AWS Lake Formation. Additionally, the due diligence performed during this project around expanded use of Amazon Athena (Presto), and potentially Amazon Redshift, will continue as the client better understands its use cases and as the data, and the usage of this data, becomes better known.
Technologies used during this effort included Amazon S3, AWS Lambda, AWS Identity and Access Management (IAM), AWS Secrets Manager, Amazon CloudWatch, AWS CloudTrail, AWS Single Sign-On (SSO), Python 3.6, AWS Glue jobs (Python shell and Apache Spark), AWS Glue Data Catalog, Amazon Athena (Presto), Amazon Aurora (PostgreSQL 10.7), Amazon SageMaker, Azure Power BI, Power BI Gateway, Azure DevOps Repos (Git), Azure DevOps Pipelines, Amazon EC2, AWS CloudFormation, PowerShell, Liquibase, DBeaver, AWS Glue development endpoints (Amazon SageMaker notebooks), Visual Studio Code.