
Three steps to build trust in your data and analytics


CPCS data expert Brent Tucker outlines three workflow habits to make your analytics projects more appealing to clients.

Back in November, I delivered a presentation on PostGIS applications at a virtual conference held by CrunchyData Solutions.

Since then, I’ve been receiving questions on the principles of data observability and how to apply them in day-to-day analytics.

This article describes three steps for developing analytical data pipelines and visualization solutions. Most of these workflows process large datasets through a multistage data flow and handle data such as:

  • Parcel tracking events
  • Automatic identification system (AIS) vessel positioning data
  • Global positioning system (GPS) truck logs
  • Retail point of sale transactions
  • Various other data feeds generated by the movement of goods, people and information

In my experience, building these three steps into your workflow will help close the trust gap between analytics practitioners and their clients.

1. Write code (or find someone to write code)

Analytics is code. Don’t base the entire analytical workflow on point-and-click graphical user interfaces (GUIs). GUIs make collaboration difficult and don’t integrate well with other tools. Adopt a code-first approach.

Typical code from one process in a larger data pipeline
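The code in that screenshot isn’t reproduced here, but one process in such a pipeline often looks something like the sketch below (pandas-based, with hypothetical file and column names), which cleans raw AIS vessel positions before the next stage consumes them:

    # Hypothetical example of one small, code-first step in a larger pipeline:
    # clean raw AIS vessel positions before the next stage consumes them.
    # File paths and column names are illustrative, not from the original post.
    import pandas as pd


    def clean_ais_positions(raw_path: str, out_path: str) -> pd.DataFrame:
        """Drop malformed AIS records and persist the cleaned result."""
        df = pd.read_csv(raw_path, parse_dates=["timestamp"])

        # Keep only rows with plausible coordinates and known vessel IDs.
        cleaned = df[
            df["latitude"].between(-90, 90)
            & df["longitude"].between(-180, 180)
            & df["mmsi"].notna()
        ].drop_duplicates(subset=["mmsi", "timestamp"])

        cleaned.to_parquet(out_path, index=False)
        return cleaned


    if __name__ == "__main__":
        clean_ais_positions("ais_raw.csv", "ais_clean.parquet")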

2. Automate everything 

Use a modern data-pipelining application that can wire your code together and give you a horizontal view of all the data assets it produces.

Learn from modern software engineering best practices and implement automated testing to profile, validate and document data at key points in the pipeline process.
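As a minimal sketch of such an automated data test (assuming a pandas-based pipeline; the column names and checks are illustrative, not from the original post), a validation step can simply assert expectations about the data before it flows downstream:

    # Minimal sketch of an automated data test run at a key pipeline stage.
    # Column names and thresholds are illustrative assumptions.
    import pandas as pd


    def validate_positions(df: pd.DataFrame) -> pd.DataFrame:
        """Fail fast if the cleaned AIS data violates basic expectations."""
        assert not df.empty, "No rows survived cleaning"
        assert df["mmsi"].notna().all(), "Found positions with no vessel ID"
        assert df["latitude"].between(-90, 90).all(), "Latitude out of range"
        assert df["longitude"].between(-180, 180).all(), "Longitude out of range"
        # Duplicate pings usually indicate an upstream ingestion problem.
        assert not df.duplicated(subset=["mmsi", "timestamp"]).any(), "Duplicate pings"
        return df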

Lastly, try to use a scheduler that can both run pipelines on a fixed interval and initiate runs based on external events (e.g., uploading of a file).
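The article doesn’t name a particular scheduler, but the two trigger types can be sketched with nothing more than the Python standard library: run the pipeline on a fixed interval, and also whenever a new file lands in a watched folder. A real orchestrator would replace this loop; the folder name and interval below are assumptions:

    # Simplified stand-in for a real scheduler: run the pipeline every hour,
    # or as soon as a new CSV file appears in a watched inbox directory.
    import time
    from pathlib import Path

    INBOX = Path("inbox")       # hypothetical folder where files are uploaded
    INTERVAL_SECONDS = 3600     # fixed-interval trigger (hourly)


    def run_pipeline(trigger: str) -> None:
        # Placeholder: call the real pipeline entry point here.
        print(f"Running pipeline (trigger: {trigger})")


    def main() -> None:
        INBOX.mkdir(exist_ok=True)
        seen = set(INBOX.glob("*.csv"))
        last_run = time.monotonic()
        while True:
            new_files = set(INBOX.glob("*.csv")) - seen
            if new_files:                                          # event-based trigger
                run_pipeline(f"new files: {sorted(f.name for f in new_files)}")
                seen |= new_files
                last_run = time.monotonic()
            elif time.monotonic() - last_run >= INTERVAL_SECONDS:  # interval trigger
                run_pipeline("schedule")
                last_run = time.monotonic()
            time.sleep(5)


    if __name__ == "__main__":
        main()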

Screen capture of a running data pipeline

3. Peek inside the “black box” (link your data assets to computations)

For a moment, forget everything described in steps 1 and 2, because clients and data consumers don’t care what programming language or fancy data-pipelining tool you use. They only care about the assets (tables, files, reports and visualizations) produced by these data flows. As such, these pipelines should be observable, documented and not composed solely of “black box” computations.

You should also consider linking your data assets to the code that produced them. This is useful for debugging a complex algorithm or tracing data quality through key stages of the pipeline.
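One lightweight way to make that link (a sketch only, not a prescription of any particular tool; the decorator and manifest file below are hypothetical) is to record, next to every asset the pipeline writes, which function produced it and when:

    # Sketch of lightweight lineage: whenever the pipeline writes an asset,
    # record which function produced it and when, in a sidecar manifest.
    # A real orchestrator or data catalog would do this more robustly.
    import functools
    import json
    import time
    from pathlib import Path

    MANIFEST = Path("asset_manifest.jsonl")


    def tracked_asset(out_path: str):
        """Decorator that links an output file to the code that produced it."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                result = func(*args, **kwargs)
                record = {
                    "asset": out_path,
                    "produced_by": f"{func.__module__}.{func.__qualname__}",
                    "produced_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
                }
                with MANIFEST.open("a") as f:
                    f.write(json.dumps(record) + "\n")
                return result
            return wrapper
        return decorator


    @tracked_asset("ais_clean.parquet")
    def build_clean_positions():
        ...  # call the cleaning step from earlier and write ais_clean.parquet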

Some examples of intermediate data assets automatically generated from a complex analytical pipeline:

  • Automated, interactive data profiles generated at key stages of a data pipeline
  • Animation created as an output of a data processing flow for vessel movement
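The profiles and animation themselves aren’t reproduced here, but generating a basic profile at each stage can be as simple as summarizing the data and writing the summary alongside the asset. This minimal sketch assumes a pandas DataFrame; a dedicated profiling library would produce a richer, interactive report:

    # Minimal sketch of an automated profile written at a key pipeline stage.
    import json
    import pandas as pd


    def profile_stage(df: pd.DataFrame, stage_name: str) -> None:
        """Write a small JSON profile describing the data at this stage."""
        profile = {
            "stage": stage_name,
            "rows": int(len(df)),
            "columns": {
                col: {
                    "dtype": str(df[col].dtype),
                    "null_fraction": float(df[col].isna().mean()),
                    "distinct": int(df[col].nunique()),
                }
                for col in df.columns
            },
        }
        with open(f"profile_{stage_name}.json", "w") as f:
            json.dump(profile, f, indent=2)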

This type of workflow has served me well, though your mileage may vary. Let me know what you think at btucker@cpcs.ca.


This article was originally published on Brent’s LinkedIn. 
