This document describes testing
guidelines and steps for verifying data, ETL processes, and SQL during the
construction, unit testing, system and integration testing of an application’s data
warehouse operational tables and data mart.
1.)
Verify and Maintain the Data Low Level Design (LLD)
A first level of testing and validation begins with the
formal acceptance of the logical data model and “low level design” (LLD). All
further testing and validation will be based on the understanding of each of
the data elements in the model.
Data elements that are created through a transformation or
summary process must be clearly identified and calculations for each of these
data elements must be clear and easily interpreted.
During the LLD reviews and updates, special consideration should
be given to typical modeling scenarios that exist in the project. Examples
follow:
1.
Verify that many-to-many
attribute relationships are clarified and resolved.
2.
Verify the types of keys that are
used: surrogate keys versus natural keys.
3.
Verify that the business
analyst / DBA reviewed with ETL architect and developers (application) the
lineage and business rules for extracting, transforming, and loading the data
warehouse?
4.
Verify that all transformation
rules, summarization rules, and matching and consolidation rules have clear
specifications.
5. Verify
that specified transformations, business rules and cleansing specified in LLD
and other application logic specs have been coded correctly in ETL, JAVA, and
SQL used for data loads.
6. Verify
that procedures are documented to monitor and control data extraction,
transformation and loading. The procedures should describe how to handle
exceptions and program failures.
7. Verify
that data consolidation of duplicate or merged data was properly handled.
8. Verify
that samplings of domain transformations will be taken to verify they are
properly changed.
9. Compare
unique values of key fields between source data and data loaded to the
warehouse. This is a useful technique that points out a variety of possible
data errors without doing a full validation on all fields.
10. Validate
that target data types are as specified in the design and/or the data model.
11.
Verify how sub-class/super-class
attributes depicted?
12.
Verify that data field types
and formats are specified.
13.
Verify that defaults are
specified for fields where needed.
14.
Verify that processing for
invalid field values in source are defined
15.
Verify that expected ranges of
field contents are specified where known.
16.
Verify that keys generated by
the “sequence generator” are identified.
17.
Verify that slowly changing
dimensions are described?
2.) Analyze Source Data
Before & After Extraction to Staging
Testers should extract representative
data from each source file (before or after extract to staging tables) and
confirm that the data is consistent with its definition; QA can discover any
anomalies in how the data is represented and write defect reports where necessary.
The objective is to discover data that does not meet “data quality factors” as described
in specifications. See list below and Table 1.
This verification process will be used
for temp tables used in a step process for data transformations, cleaning, etc.
·
Verify that the
scope of values in each column are within specifications
·
Identify unexpected
values in each field
·
Verify relationships
between fields
·
Identify frequencies
of values in columns and whether these frequencies make sense?
Inputs: Application
source data models and low level data design, data dictionaries, data attribute
sources.
Outputs:
Newly discovered attributes, undefined business rules, data anomalies such as
fields used for multiple purposes.
Techniques and Tools: Data extraction software, business rule discovery software, data
analysis tools.
Process Description:
1.
Extract
representative samples of data from each source or staging table.
2.
Parse
the data for the purpose of profiling.
3.
Verify
that not-null fields are populated and populated as expected.
4.
Structure
discovery – Does the data match the corresponding metadata? Do field attributes
of the data match expected patterns? Does the data adhere to appropriate
uniqueness and null value rules?
5.
Data
discovery – Are the data values complete, accurate and unambiguous?
6.
Relationship
discovery – Does the data adhere to specified required key relationships across
columns and tables? Are there inferred relationships across columns, tables or
databases? Is there redundant data?
7.
Verify
that all required data from the source was extracted. Verify that extraction
process did not extract more or less data source than it should have.
8.
Verify
or write defects for exceptions and errors discovered during the ETL process.
9.
Verify
that extraction process did not extract duplicate data from the source (usually
this happens in repeatable processes where at point zero we need to extract all
data from the source file, but the during the next intervals we only need to
capture the modified, and new rows.).
10. Validate that no data
truncation occurred during staging.
11. Utilize a data
profiling tool or methods that show the range and value distributions of fields
in the source data. This is used to identify any data anomalies from source
systems that may be missed even when the data movement is correct.
12. Validation &
Certification Method: it is sufficient to identify the requirements and count
(via SQL) the number of rows that should be extracted from the source systems.
The QA team will also count the number of rows in the result / target sets and
match the two for validation. The QA team will maintain a set of SQL statements
that are automatically run at this stage to validate that no duplicate data
have been extracted from the source systems.
Table 1:
Data Quality Factors
|
FACTOR |
DESCRIPTION |
EXAMPLE |
|
Data Consistency Issues: |
||
|
Varying Data Definitions |
The data type and length for a particular attribute
may vary in files or tables though the semantic definition is the same. |
Account
number may be defined as: Number (9) in one field or table and
Varchar2(11) in another table |
|
Misuse of Integrity Constraints |
When referential integrity constraints are misused,
foreign key values may be left “dangling” or inadvertently deleted. |
An account record is missing but dependent records
are not deleted. |
|
Nulls |
Nulls when field defined as “not-null”. |
The company has been entered as a null value for a
business. A report of all companies would not list the business. |
|
Data Completeness Issues: |
||
|
Missing data |
Data elements are missing due to a lack of integrity
constraints or nulls that are inadvertently not updated. |
An account date of estimated arrival is null thus
impacting an assessment of variances in estimated/actual account data. |
|
Inaccessible Data |
Inaccessible records due to missing or redundant
identifier values. |
Business numbers are used to identify a customer
record. Because uniqueness was not enforced, the business ID (45656)
identifies more than one customer. |
|
Missing Integrity Constraints |
Missing constraints can cause data errors due to
nulls, non-uniqueness, or missing relationships. |
Account records with a business identifier exist in
the database but cannot be matched to an existing business. |
|
Data Correctness Issues: |
||
|
Loss Projection |
Tables that are joined over non key attributes will
produce non existent data that is shown to the user. |
Lisa Evans works in the LA office in the Accounting
department. When a report is
generated, it shows her working in IT department. |
|
Incorrect Data Values |
Data that is misspelled or inaccurately recorded. |
123 Maple Street is recorded with a spelling mistake
and a street abbreviation (123 Maple St) |
|
Inappropriate Use of Views |
Data is updated incorrectly through views. |
A view contains non key attributes from base
tables. When the view is used to
update the database, null values are entered into the key columns of the
base tables. |
|
Disabled Integrity Constraints |
Null, non unique, or out of range data may be stored
when the integrity constraints are disabled. |
The primary key constraint is disabled during an
import function. Data is entered into
the existing data with null unique identifiers. |
|
Non-duplication |
Testing should be conducted to determine if there’s
duplication of data where there should not be. |
Duplicate rows or column data. |
|
Misuse of Integrity Constraints |
Check whether null or foreign key constraints are
inappropriate or too restrictive. |
Check constraint only allows hard coded values of “C”,
“A”, “X”, and “Z”. But a new code “B”
cannot be entered. |
|
Data
Comprehension Issues: |
||
|
Data Aggregation |
Aggregated data is used to represent a set of data elements. |
One name field is used to store surname, first name,
middle initial, and last name (e.g.,
John, Hanson, Mr.). |
|
Cryptic Object Definitions |
Database object (e.g., column) has a cryptic,
unidentifiable name. |
Customer table with a column labeled, “c_avd”. There is no documentation as to what the
column might contain. |
|
Unknown or Cryptic Data |
Cryptic data stored as codes, abbreviations,
truncated, or with no apparent meaning. |
Shipping codes used to represent various parts of
the customer base (‘01’, ‘02’, ‘03’).
No supporting document to explain the meaning of the codes. |
|
Accuracy |
Data will be matched against business rules. |
Boundary values (low, high’s) will be identified for
relevant fields and compared with expectations. |
|
Completeness |
Data will be assessed to verify that all required is
present. Missing rows will be identified; Null values will be identified in
data elements where a value is expected. |
|
|
Precision |
Precision testing is conducted to evaluate the level
of data not sufficiently precise based on specifications. |
|
This step works to improve the quality of
existing data in source files or “defects” that meet source specs but must be
corrected before load.
Inputs:
§ Files or tables (staging) that require cleansing; data
definition and business rule documents, data map of source files and fields;
business rules, data anomalies discovered in earlier steps of this process.
§ Fixes for data defects that will result in data that does
not meet specifications for the application DW.
Outputs:
Defect reports, cleansed data, rejected or uncorrectable data
Techniques and Tools: Data reengineering, transformation, and cleansing tools, MS Access,
Excel filtering.
Process Description: In this step, data with missing values, known errors, and suspect
data is corrected. Automated tools may
be identified to best to locate, clean / correct large volumes of data.
1. Document the type of
data cleansing approach taken for each data type in the repository.
2. Determine how
“uncorrectable” or suspect data is processed, rejected, maintained for
corrective action. SME’s and stakeholders should be involved in the decision.
3. Review ETL defect
reports to assess rejected data excluded from source files or information group
targeted for the warehouse.
4. Determine if data not
meeting quality rules was accepted.
5. Document in defect
reports, records and important fields that cannot be easily corrected.
6. Document records that
were corrected and how corrected.
Certification
Method: Validation of
data cleansing processes could be a tricky proposition, but certainly doable.
All data cleansing requirements should be clearly identified. The QA team should
learn all data cleansing tools available and their methods. QA should create
various conditions as specified in the requirements for the data cleansing tool
to support and validate its results. QA will run a volume of real data through
each tool to validate accuracy as well as performance.
There are often ETL processes where data
has been consolidated from various files into a single occurrence of
records. The cleaned and consolidated
data can be assessed to very matched and consolidated data.
Much of the ETL heavy lifting occurs in
the transform step where combined data, data with quality issues, updated data,
surrogate keys, build aggregates, are processed.
Inputs:
Analysis of all files or databases for each entity type
Outputs:
Report
of matched, consolidated, related data that is suspect or in error
List
of duplicate data records or fields
List
of duplicate data suspects.
Techniques and Tools: Data matching techniques or tools; data cleansing software with
matching and merging capabilities.
Process Description:
1.
Establish
match criteria for data. Select attributes to become the basis for possible
duplicate occurrences (e.g., names, account numbers).
2.
Determine
the impact of incorrectly consolidated records. If the negative impact of
consolidating two different occurrences such as different customers into a
single customer record exists, submit defect reports. The fix should be higher
controls to help avoid such consolidations in the future.
3.
Determine
the matching techniques to be used: Exact character match in two corresponding
fields such as wild card match, key words, close match, etc.
4.
Compare
match criteria for specific record with all other records within a given file
to look for intra-file duplicate records.
5.
Compare
match criteria for a specific record with all records in another file to seek
inter-file duplicate records.
6.
Evaluate
potential matched occurrences to assure they are, in fact, duplicate.
7.
Verify
that consolidated data into single occurrences is correct.
8.
Examine
and re-relate data related to old records being consolidated to new occurrence-of-reference
record. Validate that no related data was overlooked.
At this stage, base data is being
prepared for loading into the Application operational tables and the data mart.
This includes converting and formatting cleansed, consolidated data into the
new data architecture and possibly enhancing internal operational data with
external data licensed from service providers.
The objective is to successfully map the
cleaned, corrected and consolidated data into the DW environment.
Inputs:
Cleansed, consolidated data; external data from service providers; business
rules governing the source data; business rules governing the target DW data;
transformation rules governing the transformation process; DW or target data
architecture; data map of source data to standardized data.
Output: Transformed,
calculated, enhanced data; updated data map of source data to standardized
data; data map of source data to target data architecture
Techniques and Tools: Data transformation software; external or online or public
databases.
Process Description:
1.
Verify
that the data warehouse construction team is using the data map of source data
to the DW standardized data, verify the mapping.
2.
Verify
that the data transformation rules and routines are correct.
3.
Verify
the data transformations to the DW and assure that the processes were performed
according to specifications.
4.
Verify
that data loaded in the operational tables and data mart meets the definition
of the data architecture including data types, formats, accuracy, etc.
5.
Develop
scenarios to be covered in Load Integration Testing
6.
Count
Validation: Record Count Verification DWH backend/Reporting queries against
source and target as an initial check.
7.
Dimensional
Analysis: Data integrity exists between the various source tables and parent /
child relationships.
8.
Statistical
Analysis: Validation for various
calculations.
9.
Data
Quality Validation: - Check for missing data, negatives and consistency. Field-by-field
data verification will be done to check the consistency of source and target
data.
10. Granularity: Validate at the lowest granular level
possible (lowest in the hierarchy E.g. Country-City-Sector– start with test
cases).
11. Dynamic Transformation
Rules & Tables: such methods need to be checked continuously to ensure the
correct transformation routines are executed. Verify that dynamic mapping
tables and dynamic mapping rules provide an easy, documented, and automated way
for transforming values from one or more sources into a standard value
presented in the DW.
12. Verification Method:
The QA team will identify the detailed requirements as they relate to
transformation and validate the dynamic transformation rules and tables against
DW records. Utilizing SQL and related tools, the team will identify unique
values in source data files that are subject to transformation. The QA team
identifies the results from the transformation process and validate that such
transformation have accurately taken place.
End user reporting is a major component of the Application Project. The report code may
run aggregate SQL queries against the data stored in the data mart and/or the
operational tables then display results in a suitable format either in a Web
browser or on a client application interface. Once the initial view is
rendered, the reporting tool interface provides various ways of manipulating
the information such as sorting, pivoting, computing subtotals, and adding view
filters to slice-and-dice the information further. Special considerations such
as those below will be prepared while testing the reports:
1.
The ETL process should be
complete, the data mart must be populated and data quality testing should be
largely completed.
2.
The front-end will use a SQL
engine which will generate the SQL based on the how the dimension and fact
tables are mapped. Additionally, there may be global or report-specific
parameters set to handle very large database (VLDB)-related optimization
requirements. As such, testing of the front-end will concentrate on validating
the SQL generated; this in turn validates the dimensional model and the report
specification vis-à-vis the design.
3.
Unit testing of the reports will
be conducted to verify the layout format per the design mockup, style sheets,
prompts and filters, attributes and metrics on the report.
4.
Unit testing will be executed
both in the desktop and Web environment.
5.
System testing of the reports
will concentrate on various report manipulation techniques like the drilling,
sorting and export functions of the reports in the Web environment.
6.
Reports and/or documents need
special consideration for testing because they are high visibility reports used
by the top analysts and because they have various charts, gauges and data
points to provide a visual insight to the performance of the organization in
question.
7.
There may be some trending
reports, or more specifically called comp reports, that compare the performance
of an organizational unit over multiple time periods. Testing these reports
needs special consideration especially if a fiscal calendar is used instead of
an English calendar for time period comparison.
8.
For reports containing derived
metrics special focus should be paid to any subtotals. The subtotal row should
use a "smart-total," i.e., do the aggregation first and then do the
division instead of adding up the individual cost per click of each row in the
report.
9.
Reports with
"non-aggregate-able" metrics (e.g., inventory at hand) also need
special attention to the subtotal row. It should not, for example, add up the
inventory for each week and show the inventory of the month.
10. During unit testing, all data formats will be verified against the
standard. For example, metrics with monetary value should show the proper
currency symbol, decimal point precision (at least two places) and the
appropriate positive or negative. For example, negative numbers should be shown
in red and enclosed in braces.
11. During system testing, while testing the drill-down capability of
reports, care will be taken to verify that the subtotal at the drill-down
report matches with the corresponding row of the summary report. At times, it
is desirable to carry the parent attribute to the drill-down report; verify the
requirements for this.
12. When testing reports containing conditional metrics, care will be
taken to check for "outer join condition;" i.e., nonexistence of one
condition is reflected appropriately with the existence of the other condition.
13. Reports with multilevel sorting will get special attention for
testing especially if the multilevel sorting includes both attributes and
metrics to be sorted.
14. Reports containing metrics at different dimensionality and with
percent-to-total metrics and/or cumulative metrics needs will get special
attention to check that the subtotals are hierarchy-aware (i.e., they
"break" or "re-initialized" at the appropriate levels).
- Session Completions: All workflow sessions completed successfully using the Log
Viewer.
- Source to Target Counts: This process verifies that the number of records in the source
system matches the number of records received, and ultimately processed,
into the data warehouse. If Look-up’s are involved in the ETL process, the
count between source and target will not match. The ETL Session log and
target table counts are compared.
- Source to Target Data
Verification: The process verifies that
all source and reference tables have data before running ETLs. We verify
that all target tables were truncated before the load unless target tables
are updated. This process verifies
that the source field threshold is not subject to truncation during the
transformation or loading of data.
- Field to Field Verification:
This process verifies the field values
from the source system to target. This process ensures that the data
mapping from the source system to the target is correct, and that data
sent has been loaded accurately.
- ETL Exception Processing: Exception
processing verification looks for serious data errors that would cause
system processing failures or data corruption. An Exception report
verifying the number and types of errors encountered is produced and
reviewed for additional processing and / or reporting to the customer.
There are two types of Exception process:
1. Database Exception:
·
Not Null - Source column is
null while target is not null
·
Reference Key - The records
coming from the source data do not have a corresponding parent key in the
parent table.
·
Unique Key - The record already
exists in the target table.
·
Check Constraint - CHECK
constraints enforce domain integrity by limiting the values that are accepted
by a column
2. Business Exception
These are the exceptions thrown based on certain business rules
defined for specific data elements or group of data elements
·
ETL process utilizes a single
Exception Table to capture the exceptions from various ETL sessions and an
Error Lookup table which has various error codes and their description.
·
We check the Exception process
using the Session Log and Exception Table.
1.
No critical defects unfixed; No
more than 3 high severity defects.
2.
80% or more of build
functionality can be tested – functionality might fail because of JAVA / report code.
3.
Platform performance is such
that test team can productively work to schedule
4.
Fewer than 15% of build fixes
failed
- ETL,
- Data,
- Test,
- Guidelines,
- Large,
- Applications

 
 




satishkumarksk
neeqa

Thanks