How are you detecting data quality issues? | C2C Community
Solved

Hi,

We are building an open source data observability tool, dqo.ai, and we are wondering which kinds of data quality issues we should focus on. So far we can monitor timeliness (whether the data is fresh), validity (values in columns that do not meet requirements), consistency (anomalies in how the number of rows grows over time), and uniqueness.
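To make the four dimensions above concrete, here is a minimal sketch in plain Python. It is not dqo.ai's API: the function names, the 24-hour freshness window, and the 50% row-count tolerance are all hypothetical choices made for illustration only.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(latest_ts, now, max_age):
    """Timeliness: the newest record is no older than max_age."""
    return now - latest_ts <= max_age


def invalid_fraction(values, is_valid):
    """Validity: share of values that fail the given predicate."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if not is_valid(v)) / len(values)


def row_count_anomaly(daily_counts, tolerance=0.5):
    """Consistency: True if the latest daily row count deviates from the
    mean of the earlier days by more than `tolerance` (as a fraction)."""
    if len(daily_counts) < 2:
        raise ValueError("need at least one historical count plus the latest")
    *history, latest = daily_counts
    baseline = sum(history) / len(history)
    return abs(latest - baseline) > tolerance * baseline


def duplicate_count(values):
    """Uniqueness: number of values that repeat an earlier value."""
    values = list(values)
    return len(values) - len(set(values))


# Example with made-up numbers.
now = datetime(2022, 7, 8, 12, 0, tzinfo=timezone.utc)
fresh = is_fresh(now - timedelta(hours=2), now, timedelta(hours=24))
bad_share = invalid_fraction([1, -2, 3, 4], lambda v: v >= 0)
spike = row_count_anomaly([100, 110, 90, 300])
dupes = duplicate_count(["a", "b", "a", "a"])
```

A real tool would compute these inside the warehouse with SQL rather than in application code; the sketch only shows what each dimension measures.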

I would appreciate feedback on the typical data quality issues that are worth monitoring continuously.

The code is open source on GitHub: https://github.com/dqoai/dqo

 

Best Regards, 

Piotr

Best answer by YuriGrinshteyn 8 July 2022, 11:27

4 replies

Two dimensions worth looking at are accuracy and completeness. 
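As a rough illustration of those two dimensions, the sketch below measures completeness as the share of rows with a non-missing value, and accuracy as the share of rows that agree with a trusted reference. The function names, the sample rows, and the idea of a reference lookup keyed by `id` are assumptions made for this example, not part of any specific tool.

```python
def completeness(rows, column):
    """Fraction of rows where `column` is present and not None or empty."""
    if not rows:
        return 1.0
    present = sum(1 for r in rows if r.get(column) not in (None, ""))
    return present / len(rows)


def accuracy(rows, reference, key, column):
    """Fraction of rows whose `column` value equals the trusted reference
    value for the same key. Rows with no reference entry are skipped.
    Returns None when nothing could be compared."""
    compared = 0
    matched = 0
    for r in rows:
        k = r[key]
        if k in reference:
            compared += 1
            if r.get(column) == reference[k]:
                matched += 1
    return matched / compared if compared else None


# Made-up sample data.
rows = [
    {"id": 1, "country": "PL"},
    {"id": 2, "country": None},
    {"id": 3, "country": "DE"},
    {"id": 4, "country": "US"},
]
trusted = {1: "PL", 3: "FR", 4: "US"}  # hypothetical source of truth

complete_share = completeness(rows, "country")
accurate_share = accuracy(rows, trusted, "id", "country")
```

Accuracy is the harder of the two to monitor continuously, since it needs some source of truth to compare against.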

Hi @piotrczarnas and welcome to C2C!

 

First things first, can I ask you to come over to the C2C Lounge and introduce yourself using this template? That way our community can meet you and get to know you a bit better! :)

 

Concerning your question, let me see whether tagging @nazrul@cloudaeye.com and @YuriGrinshteyn helps, as I remember them talking about observability in the past.

Perhaps @Alfons or @ilias could help us out here? :)

Thanks for thinking of me!

we are wondering which kinds of data quality issues we should focus on

I would recommend thinking of this as a data pipeline. Essentially, you have the ingestion stage, the processing stage, and the querying stage. You can then consider specific SLOs (service level objectives) or reliability/quality requirements for each stage.

So far we can monitor timeliness (whether the data is fresh), validity (values in columns that do not meet requirements), consistency (anomalies in how the number of rows grows over time), and uniqueness.

These seem like great SLOs for the data/processing stages. We generally guide folks to consider correctness, freshness, and throughput as SLOs for data pipelines, and you seem to have most of those covered already.

I would encourage you to consider latency/availability for the ingestion and querying stages, as well, if those are applicable to what you’re building (I haven’t looked at your code).
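One way to picture per-stage SLOs is as a small table of targets checked against observed measurements. The sketch below is only an illustration: the `SLO` class, the metric names, and every threshold are hypothetical and would need to come from your own requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    stage: str              # "ingestion", "processing", or "querying"
    metric: str             # hypothetical metric name
    target: float           # illustrative threshold, not a recommendation
    higher_is_better: bool

    def is_met(self, observed):
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


SLOS = [
    SLO("ingestion", "availability_ratio", 0.999, True),
    SLO("processing", "freshness_minutes", 60, False),
    SLO("processing", "correct_record_ratio", 0.99, True),
    SLO("processing", "throughput_rows_per_sec", 1000, True),
    SLO("querying", "p95_latency_ms", 500, False),
]


def violations(slos, observed):
    """Return (stage, metric) pairs whose SLO is missed or unmeasured."""
    missed = []
    for slo in slos:
        value = observed.get((slo.stage, slo.metric))
        if value is None or not slo.is_met(value):
            missed.append((slo.stage, slo.metric))
    return missed


# Made-up measurements; throughput is deliberately left unmeasured.
observed = {
    ("ingestion", "availability_ratio"): 0.9995,
    ("processing", "freshness_minutes"): 90,
    ("processing", "correct_record_ratio"): 0.995,
    ("querying", "p95_latency_ms"): 320,
}
missed = violations(SLOS, observed)
```

Treating a missing measurement as a violation is a design choice in this sketch: a metric that stops reporting is usually worth an alert in its own right.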

I would appreciate feedback on the typical data quality issues that are worth monitoring continuously.

I think you’re off to a great start here.   

Hi @YuriGrinshteyn, thanks for jumping in, appreciate it! :)

 

@piotrczarnas did @YuriGrinshteyn’s answer perhaps help you? :)