The Great Big Sausage Machine in the Cloud: Part I

7 November 2025

Photo by LikeMeat on Unsplash

I previously wrote a post here (perhaps more of a rant) about why a sucky tech stack probably isn't the problem. As time passes, though, I increasingly think the real issue is a lack of clarity on the overall aims and complexities of a data estate.

It's complex AND complicated

There is a lot going on in any given data estate, and I think the problem is often that the estate can be both complicated and complex at once. The distinction is that complicated systems, while they may have many small, moving pieces, are linear, repeatable and predictable: they will do the same thing every time. This reflects the movement of data between layers of the architecture, or deployment processes. By contrast, data estates are often also complex; they are systems that regularly face new, emergent problems and are generally managed in a responsive way. Even if future problems are anticipated early, during the design phase, they still require a responsive approach and, in some cases, a new 'section' of the estate to be spun up.

Increasing the complexity problem is the emergence of new governance approaches (yes, I did say 'increasing'). While many of these approaches are designed to ensure greater robustness and ostensibly to deliver better outcomes, they are, themselves, an increase in the overall information cost of knowing how the estate works. It used to just be layers of data coming through. Now that same information is being stored, its metadata is being managed, the data are being purged on a schedule, and appropriate quality checks are in place (which should always have been the case, but often wasn't in my earlier years as an engineer).

Ultimately, it's a lot of 'stuff' - all with a valuable job and place, but still difficult to keep track of nonetheless. Especially if you're not vendor-centric and trying to keep tabs on a whole industry of tools and approaches.

What might a whole landscape approach look like?

Just for fun (as I type that, I do read it back to myself, look to the ceiling, and sigh), I decided it might be an interesting exercise to try and create the 'perfect' architecture. A world where 99% of use cases are resolved and nobody cares that you're using Python and SQL interchangeably (yes, yes, I know that inherently makes it imperfect - go away). How can we look at the wide world of tools and suggest an estate able to meet all the extreme needs of many end users, while being clear and consistent?

I've been trying to produce a perfect version of this diagram for years, so I'm excited to fail again.

Overall architecture

So, what am I planning to include in my big, fat diagram of all that is brilliant in the (data) world? Well:

  • Batch engineering layer
  • Near-real-time (NRT) engineering layer
    • A layer designed for reporting at the earliest feasible levels of the medallion architecture where data are, at the very least, 'clean'.
  • Streaming analytics layer
    • A layer designed for reporting high-frequency stream data, particularly for anomaly detection or stream analytics.
  • Logging and auditing
    • Includes solutions for quality checking, data availability, metadata management, failure alerting and lineage outputs.
  • Deployment lifecycles
    • How the code is managed and deployed across the full range of technologies.
  • Governance processes
    • How business decisions are made, particularly on the performance of the pipeline, the availability of information, or on the content of the output.
  • Self-service mechanisms
    • Where and how end users might be able to self-serve to meet their own needs.
  • Reporting outputs
    • For users who do not self-serve, through technical incapability, lack of engagement or otherwise, how are the reporting outputs served?
  • ML use cases
    • Once a centralised source of information is created, how might users apply statistical or machine learning principles to the information to collect insights?
  • AI use cases
    • How do we pivot our data outputs to allow AI agents to interrogate them, improving how users interact with the data?
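To make the medallion reference in the NRT bullet above a little more concrete, here is a minimal, purely illustrative sketch in plain Python (no real engine, and every name is hypothetical): raw events land in a 'bronze' layer, are cleaned into 'silver', and are aggregated into 'gold'. The point of the NRT layer is that reporting can hang off silver, the earliest layer where the data are at least 'clean', rather than waiting for gold.

```python
# Toy medallion flow: bronze (raw) -> silver (clean) -> gold (aggregated).
# All function and field names here are illustrative, not a real API.

def to_silver(bronze_rows):
    """Clean: drop rows with missing fields and normalise reading to float."""
    return [
        {"sensor": r["sensor"], "reading": float(r["reading"])}
        for r in bronze_rows
        if r.get("sensor") and r.get("reading") is not None
    ]

def to_gold(silver_rows):
    """Aggregate: average reading per sensor."""
    totals = {}
    for r in silver_rows:
        s, n = totals.get(r["sensor"], (0.0, 0))
        totals[r["sensor"]] = (s + r["reading"], n + 1)
    return {sensor: s / n for sensor, (s, n) in totals.items()}

bronze = [
    {"sensor": "a", "reading": "1.0"},
    {"sensor": "a", "reading": "3.0"},
    {"sensor": None, "reading": "9.9"},  # bad row, filtered out at silver
]

silver = to_silver(bronze)  # clean enough for NRT reporting
gold = to_gold(silver)      # batch-style aggregate for downstream reporting
```

In a real estate these transforms would be pipeline stages in whatever engine you've chosen, but the shape is the same: each layer adds guarantees, and the NRT layer trades some of those guarantees for freshness.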

It turns out, this is a series...

I'll attack this in a series of blog posts, designed to ensure that I cover each bit with appropriate detail - but I will include a TL;DR version for people who have neither the time, nor the inclination, to listen to me bore on and on and on and on and on and on and on and on and on and on about this. Because I can. And I do. It's basically who I am at this stage.

I read that back. I look at the ceiling. I sigh.