Re-Thinking Data Quality Dimensions www.dataqualitypro.com


1. The Problem with Data Quality Dimensions

1.1. A Common Approach

1.1.1. Select data that is deemed important

1.1.2. Select a subset of dimensions that are relevant to the scope of data

1.1.3. Perform rule discovery (typically using a data quality tool)

1.1.3.1. Structure Rules

1.1.3.2. Value Rules

1.1.3.3. Relationship Rules

1.1.4. Perform DQ assessment

1.1.4.1. Completeness %

1.1.4.1.1. Is the field null?

1.1.4.1.2. Does it have empty values?

1.1.4.1.3. Does it have inferred nulls?

1.1.4.2. Validity %

1.1.4.2.1. Are the post codes valid?

1.1.4.2.2. Does the NI Number follow the right pattern?

1.1.4.2.3. Is the patient age in an acceptable range?

1.1.4.3. Consistency %

1.1.4.3.1. Does the data in attribute x reflect the value in attribute y?

1.1.4.4. Uniqueness %

1.1.4.4.1. Are there any duplicates?

1.1.4.5. Referential integrity %

1.1.4.5.1. Are there any orphan records?

1.1.4.5.2. Does every detail record have a master?

1.1.4.6. etc...
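As a sketch of what these percentage metrics look like in practice, here is a minimal Python version of the completeness and validity checks. The postcode and NI number patterns are simplified assumptions for illustration, not the official rules:

```python
import re

# Simplified, illustrative patterns -- real UK postcode and NI number
# validation is more involved than these assumed regexes.
POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")
NI_NUMBER = re.compile(r"^[A-Z]{2}\d{6}[A-D]$")

def completeness_pct(values):
    """% of values that are populated (inferred nulls not handled here)."""
    filled = [v for v in values if v is not None and str(v).strip() != ""]
    return 100.0 * len(filled) / len(values) if values else 0.0

def validity_pct(values, pattern):
    """% of populated values matching an agreed pattern."""
    present = [v for v in values if v]
    if not present:
        return 0.0
    return 100.0 * sum(bool(pattern.match(v)) for v in present) / len(present)

postcodes = ["SW1A 1AA", "M1 1AE", None, "", "not a postcode"]
print(completeness_pct(postcodes))        # 3 of 5 populated -> 60.0
print(validity_pct(postcodes, POSTCODE))  # 2 of the 3 present values are valid
```

The same shape of check works for the NI number pattern or an age-range rule; only the predicate changes.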

1.1.5. Aggregate up to create a scorecard across data domains

1.1.5.1. Customer Data - Quality Score %

1.1.5.2. Equipment Data - Quality Score %

1.1.5.3. Location Data - Quality Score %

1.1.6. Attempt to link to business impact

1.1.7. An example - Housing Database

1.2. Who Cares?

1.2.1. NOT Management. They are not interested in facts; they want impacts and actions.

1.2.2. NOT Business users. They don't care because they only relate to THEIR vocabulary, not this 'DQ statistical stuff'.

1.2.3. NOT Technicians. They don't care because they want to work through connected processes, not isolated facts.

1.3. The Hammer and Nail Syndrome

1.3.1. Data Quality tools are a common starting point

1.3.2. Subset of Data Quality dimensions are supported by tools

1.3.3. Subset approach is regarded as a useful starting point

1.3.4. But is this subset of dimensions and associated rules going to measure the real problem?

1.3.5. REMEMBER: Data Quality is fitness for ALL purposes

1.4. The BT Story

1.4.1. When we walked into the client we (mistakenly) talked about...

1.4.1.1. Data Profiling

1.4.1.2. Rule Discovery

1.4.1.3. Data Quality Assessment

1.4.2. What they had already achieved...

1.4.2.1. Integrated data quality management reporting platform (linked business impact to data impact)

1.4.2.2. Produced one metric: non-transacted, high-value transmission assets

1.4.2.3. Spanned multiple systems and business units

1.4.2.4. Used a wide variety of data quality rules - all mapped to the process:

1.4.2.4.1. Completeness

1.4.2.4.2. Validity

1.4.2.4.3. Consistency

1.4.2.4.4. etc.

1.4.2.5. Created an end-to-end monitoring and reporting environment (using Ab Initio if you want to get technical)

1.4.2.6. All the complexity was hidden, the output was simple for the users

1.4.2.7. Simple dashboard

1.4.2.7.1. Assets recovered

1.4.2.7.2. Assets not recovered

1.4.2.7.3. Location of fault

1.4.2.7.4. Quality over time

1.4.2.7.5. Simple metrics of quality

1.4.2.8. Simple value proposition

1.4.2.8.1. "Yes, this Data Quality monitoring environment cost a fair amount to build, but now saves us millions in wasted Capex, every year"

1.5. The Moral

1.5.1. Dimensions represent 20+ years of research and development, i.e. accumulated knowledge

1.5.2. By understanding ALL dimensions you can gain a much broader understanding of where data quality can go wrong

1.5.3. They have incredible value in building a common language within the trenches

1.5.4. They provide a common means of 'tagging' the issues found within complex data quality processes

1.5.5. But...they represent the wrong starting point

2. The History of Data Quality Dimensions

2.1. The Founding Fathers

2.1.1. Tom Redman

2.1.1.1. AT&T Bell Labs

2.1.1.1.1. 1993 Submission: Quality Dimensions of a Conceptual View

2.1.1.2. Data Quality - Management and Technology: 1994

2.1.1.3. Data Quality for the Information Age: 1996

2.1.1.4. Data Quality: The Field Guide: 2001

2.1.1.4.1. Accessibility/Delivery

2.1.1.4.2. Quality of Content

2.1.1.4.3. Quality of Values

2.1.1.4.4. Presentation Quality

2.1.1.4.5. Flexibility

2.1.1.4.6. Improvement

2.1.1.4.7. Privacy

2.1.1.4.8. Commitment

2.1.1.4.9. Architecture

2.1.2. Wang and Strong

2.1.2.1. Richard Wang

2.1.2.2. Diane Strong

2.1.2.3. Beyond Accuracy: What Data Quality Means to Data Consumers: 1996

2.1.2.3.1. Intrinsic Data Quality

2.1.2.3.2. Contextual Data Quality

2.1.2.3.3. Representational Data Quality

2.1.2.3.4. Accessibility Data Quality

2.1.3. Larry English

2.1.3.1. Improving Data Warehouse and Business Information Quality: 1999

2.1.3.1.1. Inherent

2.1.3.1.2. Pragmatic

2.1.3.2. Information Quality Applied: 2009

2.1.3.2.1. Information Content

2.1.3.2.2. Information Presentation

2.2. Next-Generation Practitioners

2.2.1. David Loshin

2.2.1.1. Enterprise Knowledge Management: The Data Quality Approach: 2001

2.2.1.1.1. Dimension Categories

2.2.1.1.2. Strategy

2.2.1.2. The Practitioner's Guide to Data Quality Improvement: 2010

2.2.1.2.1. Intrinsic

2.2.1.2.2. Contextual

2.2.2. Arkady Maydanchik: 2007

2.2.2.1. Data Quality Assessment

2.2.2.1.1. Attribute Domain Constraints

2.2.2.1.2. Relational Integrity Rules

2.2.2.1.3. Historical Data Rules

2.2.2.1.4. State-Dependent Rules

2.2.2.1.5. Attribute Dependency Rules

2.2.3. Danette McGilvray: 2008

2.2.3.1. Executing Data Quality Projects

2.2.3.1.1. Data Specifications

2.2.3.1.2. Data Integrity Fundamentals

2.2.3.1.3. Duplication

2.2.3.1.4. Accuracy

2.2.3.1.5. Consistency and Synchronization

2.2.3.1.6. Timeliness and Availability

2.2.3.1.7. Ease of Use and Maintainability

2.2.3.1.8. Data Coverage

2.2.3.1.9. Presentation Quality

2.2.3.1.10. Perception, Relevance and Trust

2.2.3.1.11. Data Decay

2.2.3.1.12. Transactability

2.2.3.1.13. Detailed List of Dimensions...

2.2.4. Laura Sebastian-Coleman: 2013

2.2.4.1. Measuring Data Quality for Ongoing Improvement

2.2.4.1.1. Completeness

2.2.4.1.2. Timeliness

2.2.4.1.3. Validity

2.2.4.1.4. Consistency

2.2.4.1.5. Integrity

2.2.4.2. Appendix B

2.3. Industry Bodies/Research

2.3.1. DAMA (UK)

2.3.1.1. Completeness

2.3.1.2. Uniqueness

2.3.1.3. Timeliness

2.3.1.4. Validity

2.3.1.5. Accuracy

2.3.1.6. Consistency

2.3.2. EDM Council

2.3.2.1. Completeness

2.3.2.2. Coverage

2.3.2.3. Conformity

2.3.2.4. Consistency

2.3.2.5. Accuracy

2.3.2.6. Duplication

2.3.2.7. Timeliness

2.3.3. Dan Myers

2.3.3.1. Dimensions of Data Quality

2.3.3.2. Accuracy

2.3.3.3. Completeness

2.3.3.4. Consistency

2.3.3.5. Validity

2.3.3.6. Timeliness

2.3.3.7. Integrity

2.3.3.8. Accessibility

2.3.3.9. Precision

2.3.3.10. Lineage

2.3.3.11. Currency

2.3.3.12. Representation

3. How to Utilise Data Quality Dimensions

3.1. What are you measured on?

3.1.1. Speed

3.1.2. Risk

3.1.3. Income

3.1.4. Expenditure

3.1.5. Customer Churn

3.1.6. Profit

3.1.7. Growth

3.2. Not enough values? Take a look at Bain's 40 elements of value that B2B companies are focused on...

3.3. ...and here's an example of the elements impacting insurance customers, so you can get a feel for how data quality could impact responsiveness, for example

3.4. What are the most critical functions that impact your measure? e.g.

3.4.1. Find new customers

3.4.2. Onboard new customers

3.4.3. Bill new customers

3.5. What anecdotes can guide your focus?

3.6. Where are the greatest opportunities for data quality improvement?

3.6.1. Reduce costs

3.6.2. Increase income

3.6.3. Decrease lead times

3.7. Create a series of ACCURATE models that help define the business processes under investigation

3.7.1. Function Catalogue

3.7.2. Logical Model

3.7.3. Process Model

3.7.4. Procedure Model

3.7.5. Information Chain/Information Flow

3.7.6. Physical Model

3.8. Discover & document the data quality rules required for each function / process stage

3.8.1. Method

3.8.1.1. Workshops

3.8.1.2. Discovery & profiling tools

3.8.1.3. Visual inspection

3.8.1.4. System documentation

3.8.2. Document combinations of data quality rules using Dimensions as a common language (if it makes sense in the business context) e.g.

3.8.2.1. Completeness

3.8.2.2. Validity

3.8.2.3. Uniqueness

3.8.2.4. Consistency

3.8.2.5. Timeliness

3.8.2.6. State Transition

3.8.2.7. etc...

3.9. Build your data quality rules into an assessment, monitoring and improvement environment

3.9.1. Typically a data quality platform, not always

3.9.2. Rules should be shareable and conform to standards, i.e. the library of dimension names should be consistent
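One way to keep rules shareable with a consistent dimension vocabulary is a small rule registry. The dimension list and the `Rule` shape below are illustrative assumptions, not a prescribed design:

```python
from dataclasses import dataclass
from typing import Callable

# Consistent dimension vocabulary shared across the rule library
# (an assumed list -- agree yours organisation-wide).
DIMENSIONS = {"completeness", "validity", "uniqueness",
              "consistency", "timeliness", "state_transition"}

@dataclass
class Rule:
    name: str
    dimension: str
    check: Callable  # takes a record, returns True when the rule passes

    def __post_init__(self):
        # Reject rules tagged with a dimension outside the shared vocabulary.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")

rule = Rule("rack_id_populated", "completeness",
            lambda rec: bool(rec.get("rack_id")))
print(rule.check({"rack_id": "RACK-001"}))  # True
```

Because every rule carries a dimension tag from one controlled list, results can be aggregated and compared across teams without each team inventing its own vocabulary.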

3.10. Provide granular metrics for data quality/stewardship roles but business-focused metrics for stakeholders and operational users

4. Case Study: Inside Plant

4.1. The Inside Plant Topography

4.1.1. Site

4.1.1.1. Building 01

4.1.1.1.1. Floor 01

4.1.1.1.2. Floor 02

4.1.1.1.3. Floor 03

4.1.1.2. Building 02

4.1.1.3. Building 03

4.1.2. System A

4.1.2.1. Power Units

4.1.3. System B

4.1.3.1. Transmission Units

4.1.4. System C

4.1.4.1. Cooling Units

4.1.5. System D

4.1.5.1. Provisioning

4.1.6. System E

4.1.6.1. Maintenance

4.1.7. System F

4.1.7.1. Layouts/Asset Location

4.2. Different business functions have different needs

4.2.1. Infrastructure

4.2.2. Maintenance

4.2.3. Risk

4.2.4. Security

4.3. Each need can span multiple systems, so DQ dimensions can't be reduced to simple column stats

4.3.1. Provision New Customer = Power + Transmission + Cooling + Billing

4.4. Reminder: The Hammer and Nail Syndrome

4.4.1. Data Quality tools are a common starting point

4.4.2. Subset of Data Quality dimensions are supported by tools

4.4.2.1. Completeness: NULLs

4.4.2.2. Validity: Low pattern frequency

4.4.2.3. Uniqueness: Duplicate keys

4.4.2.4. etc...

4.4.3. Subset approach is regarded as a useful starting point

4.4.4. But is this subset of dimensions and associated rules going to measure the real problem?

4.4.5. REMEMBER: Data Quality is fitness for ALL purposes

4.5. Data Quality Dimensions Workflow

4.5.1. What are you measured on?

4.5.1.1. Speed

4.5.1.2. Risk

4.5.1.3. Income

4.5.1.4. Expenditure

4.5.1.5. Customer Churn

4.5.1.6. Profit

4.5.1.7. Growth

4.5.2. What are the most critical functions that impact this measure?

4.5.2.1. Mean Time to Repair

4.5.2.2. Speed of Provisioning

4.5.2.3. Risk of service failure

4.5.2.4. Health and Safety/Compliance

4.5.2.5. Security

4.5.3. What anecdotes can guide your focus?

4.5.3.1. "It often takes us hours to find the right floor let alone the right rack"

4.5.3.2. "Engineers are incentivised to fix problems quickly, not record good quality data"

4.5.4. Where are the greatest opportunities for data quality improvement?

4.5.4.1. Reduce costs

4.5.4.2. Increase income

4.5.4.3. Decrease lead times

4.5.5. Create a series of ACCURATE models that help define the business processes under investigation

4.5.5.1. Function Catalogue

4.5.5.2. Logical Model

4.5.5.3. Process Model

4.5.5.4. Procedure Model

4.5.5.5. Information Chain/Information Flow

4.5.5.6. Physical Model

4.5.6. Discover & document the data quality rules required for each function / process stage

4.5.6.1. Method

4.5.6.1.1. Workshops

4.5.6.1.2. Discovery & profiling tools

4.5.6.1.3. Visual inspection

4.5.6.1.4. System documentation

4.5.6.2. Document combinations of data quality rules using Dimensions as a common language e.g.

4.5.6.2.1. Completeness

4.5.6.2.2. Validity

4.5.6.2.3. Uniqueness

4.5.6.2.4. Consistency

4.5.6.2.5. Timeliness

4.5.6.2.6. State Transition

4.5.6.2.7. etc...

4.5.7. Build your data quality rules into an assessment, monitoring and improvement environment

4.5.7.1. Typically a data quality platform, not always

4.5.7.2. Rules should be shareable and conform to standards, i.e. the library of dimension names should be consistent

4.5.8. Provide granular metrics for data quality/stewardship roles but business-focused metrics for stakeholders and operational users

4.6. Dimensions

4.6.1. Completeness: Do I have sufficient information with which to perform my business function?

4.6.1.1. I want to service a customer

4.6.1.1.1. Cooling Required

4.6.1.1.2. Transmission Required

4.6.1.1.3. Power Required

4.6.1.2. Don't just do a COUNT of NULL values vs NON NULL values

4.6.1.2.1. TBA / UK / ??? / --- / "Jeff to enter" - These all equate to NULL!

4.6.1.3. Need to understand: What is the completeness rule for a serviceable connection?

4.6.1.4. In order to provision a circuit, each port must have an associated card, slot, shelf, rack, suite, floor, building and site, plus other information to activate the service

4.6.1.5. Metric: How many active ports have missing information (anywhere in the topography) vs. those with complete information?
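The completeness rule above can be sketched in Python. The placeholder-token set is taken from the examples in this map; the field names are assumptions for illustration:

```python
# Placeholder tokens that equate to NULL -- build the real set
# from profiling the actual data, not from this assumed list.
INFERRED_NULLS = {"", "TBA", "UK", "???", "---", "JEFF TO ENTER"}

def is_missing(value):
    return value is None or str(value).strip().upper() in INFERRED_NULLS

# Hypothetical fields a port needs before a circuit can be provisioned.
REQUIRED = ["card", "slot", "shelf", "rack", "suite", "floor", "building", "site"]

def port_is_complete(port: dict) -> bool:
    """Completeness rule for a serviceable connection: every level of the
    topography must hold a real value, not a placeholder."""
    return all(not is_missing(port.get(f)) for f in REQUIRED)

ports = [
    {"card": "C1", "slot": "S2", "shelf": "SH1", "rack": "R4",
     "suite": "A", "floor": "2", "building": "B01", "site": "LDN"},
    {"card": "TBA", "slot": "S1", "shelf": "SH1", "rack": "R4",
     "suite": "A", "floor": "2", "building": "B01", "site": "LDN"},
]
incomplete = sum(not port_is_complete(p) for p in ports)
print(f"{incomplete} of {len(ports)} active ports have missing information")
```

Note the second port would pass a naive NULL count: "TBA" only fails once inferred nulls are treated as missing.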

4.6.2. Validity: Is my equipment clearly identifiable based on agreed standards?

4.6.2.1. Define equipment naming standards for racks, shelves, cards and ports

4.6.2.2. Build a library of agreed standards based on equipment types

4.6.2.3. Create data quality rules to validate definition against standard

4.6.2.4. Rules must be assessed by equipment type; different equipment types have different naming conventions, e.g. Power / Cooling / Transmission

4.6.2.5. Perform regular site-based audits to ensure validity of physical assets
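A hedged sketch of validity-by-standard: the naming patterns below are invented for illustration; real ones would come from the agreed standards library:

```python
import re

# Assumed naming conventions per equipment type -- in practice these
# come from the standards library, not from this sketch.
NAMING_STANDARDS = {
    "power":        re.compile(r"^PWR-\d{4}$"),
    "cooling":      re.compile(r"^COOL-\d{4}$"),
    "transmission": re.compile(r"^TX-[A-Z]{2}\d{3}$"),
}

def name_is_valid(equipment_type: str, name: str) -> bool:
    """Validate an equipment name against the standard for its type."""
    pattern = NAMING_STANDARDS.get(equipment_type)
    return bool(pattern and pattern.match(name))

print(name_is_valid("power", "PWR-0042"))         # True
print(name_is_valid("transmission", "PWR-0042"))  # False: wrong standard for this type
```

The key point from 4.6.2.4 is in the lookup: the same name can be valid for one equipment type and invalid for another, so validity must be assessed per type.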

4.6.3. Validity: Does my equipment have a valid active status across the topography?

4.6.3.1. If Port = Active THEN

4.6.3.1.1. Card must be active

4.6.3.1.2. Slot must be active

4.6.3.1.3. Shelf/Chassis must be active

4.6.3.1.4. Rack must be powered

4.6.3.1.5. Suite line must be powered
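The active-status cascade can be expressed as a single rule; the field names here are hypothetical, and a real check would join status records across the inventory systems:

```python
def port_status_is_valid(port: dict) -> bool:
    """If the port is active, everything above it in the topography
    must be active/powered too."""
    if port["port_status"] != "active":
        return True  # the rule only constrains active ports
    return (port["card_status"] == "active"
            and port["slot_status"] == "active"
            and port["shelf_status"] == "active"
            and port["rack_powered"]
            and port["suite_powered"])

good = {"port_status": "active", "card_status": "active",
        "slot_status": "active", "shelf_status": "active",
        "rack_powered": True, "suite_powered": True}
bad = dict(good, shelf_status="decommissioned")  # active port on a dead shelf
print(port_status_is_valid(good), port_status_is_valid(bad))
```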

4.6.4. Consistency: Is a rack of equipment clearly identifiable across all data repositories?

4.6.4.1. Cooling + Power + Transmission

4.6.4.2. Start with your business function e.g. Power servicing

4.6.4.3. Is the equivalent Power Rack equipment found in each of the required systems?

4.6.4.3.1. Not found in at least one system

4.6.4.3.2. Found but not recorded consistently

4.6.4.3.3. Found in all 3 systems

4.6.4.4. You need to build an integrated model and repository to verify consistency
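A minimal sketch of the cross-repository consistency check, with invented extracts standing in for the real Power, Cooling and Transmission systems:

```python
# Hypothetical extracts: rack ID -> location as recorded in each system.
power_system        = {"RACK-001": "Suite A", "RACK-002": "Suite B", "RACK-003": "Suite D"}
cooling_system      = {"RACK-001": "Suite A", "RACK-002": "Suite C"}
transmission_system = {"RACK-001": "Suite A", "RACK-002": "Suite B"}

def classify_rack(rack_id):
    """Bucket each rack into the three outcomes above."""
    values = [s.get(rack_id) for s in
              (power_system, cooling_system, transmission_system)]
    if any(v is None for v in values):
        return "not found in at least one system"
    if len(set(values)) > 1:
        return "found but not recorded consistently"
    return "found in all 3 systems"

print(classify_rack("RACK-001"))  # recorded identically everywhere
print(classify_rack("RACK-002"))  # present everywhere, but locations disagree
print(classify_rack("RACK-003"))  # missing from cooling and transmission
```

In practice the per-system extracts would be loaded into the integrated repository mentioned in 4.6.4.4, with matching keys agreed up front.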

4.6.5. Uniqueness: Do we have duplicate entries for vital transmission chassis?

4.6.5.1. Conventional Wisdom:

4.6.5.1.1. Count the unique values / total values * 100

4.6.5.2. Smarter Approach

4.6.5.2.1. Apply validity tests first

4.6.5.2.2. Clean up data

4.6.5.2.3. Perform duplication tests once data has been validated and cleaned

4.6.5.2.4. Case study: a naive count reported 1% duplication; once the data was validated and cleaned, the true figure was 10%

5. Key Lessons for Operational Data Quality Dimensions

5.1. Stop thinking in terms of simple attribute rules, it's far more complex than that

5.2. Do your functional modelling and business discovery workshops first (or at least in tandem with the data quality tools discovery)

5.3. Create dimensions that the business can use to deliver real change against things they care deeply about e.g. Revenue / Lead Times / Customer Complaints / Wasted Effort

5.4. Create the data quality rules first and then identify which dimension best fits, don't start with the dimension then try to retrofit problems to fit the dimension

5.5. Set different severity levels for issues based on their impact, e.g. don't just aggregate up basic completeness failures in one part of the business and compare them with another - it's meaningless and confuses the hell out of everyone

5.6. You don't always need a data quality tool to measure dimensions, use whatever you have to hand e.g. ETL tools, BI tools - get creative

6. Want to learn more Data Quality and Data Governance Techniques? Visit the Virtual Summit...