Data Quality CSV Format
- Rules (folder with Data Quality information)
- Folder.csv (recreates a hierarchy of folders used to organize DQ rules)
- Id (folder unique identifier)
- Name (folder name)
- Parent Folder Id (identifier of the parent folder, optional)
- Description (folder description, optional)
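A minimal Folder.csv might look like the following sketch. The identifiers, names, and the assumption that the fields above form the header row are all illustrative:

```csv
Id,Name,Parent Folder Id,Description
F1,Data Quality,,Top-level folder for all DQ rules
F2,Customer Rules,F1,Rules that validate customer data
```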
- Rule.csv (provides DQ rule declarations)
- Id (rule unique identifier)
- Name (rule name; must be unique within the parent folder)
- Folder Id (identifier of the parent folder)
- Description (rule description, optional)
- Operation (rule operation, optional)
- Dimension (rule dimension name; see the list of possible values below)
- Url (hyperlink to the rule in the original DQ tool, optional)
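A hypothetical Rule.csv row, again assuming the fields above form the header. The Operation and Dimension values are only examples; Dimension must match the list of possible values referenced above:

```csv
Id,Name,Folder Id,Description,Operation,Dimension,Url
R1,SSN Format Check,F2,SSN must match the 999-99-9999 pattern,Regular Expression,Validity,https://dq.example.com/rules/R1
```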
- Rule Application.csv (declares DQ rule instances and associates them with individual columns)
- Id (rule application unique identifier)
- Rule Id (identifier of the associated DQ rule)
- Type (column path, connection type; required to interpret and correctly stitch the referenced column in MM. See the list of possible values below.)
- Server (column path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
- Catalog (column path, catalog name: for example, a Snowflake or SQL Server database. Leave empty for Oracle and file systems.)
- Schema (column path, schema name: a database schema or a file folder path such as /dir1/dir2)
- Table (column path, table name; can also be a file or partitioned directory name)
- Column (column path, column name)
- Description (rule application description, optional)
- Threshold (rule application threshold: a floating-point value used to compute the DQ Status in MM (Good, Acceptable, or Bad). Use a comma as the delimiter if more than one value is required)
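A sketch of Rule Application.csv, reusing the sample path from the Sampling section below (Type JDBC, server 172.0.0.1, database FinanceDWDimensional); whether JDBC is a valid Type value depends on the list referenced above. Note that a multi-value Threshold must be quoted, since a comma serves as its internal delimiter:

```csv
Id,Rule Id,Type,Server,Catalog,Schema,Table,Column,Description,Threshold
RA1,R1,JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,SSN,SSN format check on Customer.SSN,"0.8,0.95"
```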
- Rule Application Column.csv (provides additional associations between DQ Rule Applications and the corresponding columns when a Rule Application refers to two or more columns)
- Rule Application Id (rule application identifier)
- Type (column path, connection type; required to interpret and correctly stitch the referenced column in MM. See the list of possible values below.)
- Server (column path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
- Catalog (column path, catalog name: for example, a Snowflake or SQL Server database. Leave empty for Oracle and file systems.)
- Schema (column path, schema name: a database schema or a file folder path such as /dir1/dir2)
- Table (column path, table name; can also be a file or partitioned directory name)
- Column (column path, column name)
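If the hypothetical RA1 above referred to a second column, that column could be attached via Rule Application Column.csv. This is a sketch; the column name SSN_Hash is invented for illustration:

```csv
Rule Application Id,Type,Server,Catalog,Schema,Table,Column
RA1,JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,SSN_Hash
```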
- Rule Measure.csv (provides rule execution results)
- Id (rule measure unique identifier)
- Rule Application Id (rule application identifier)
- Date (the execution date when the values were collected; see the supported date/time formats below)
- Rows Total (the total number of processed rows)
- Rows Failed (the number of rows that failed the DQ rule)
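A sample Rule Measure.csv row for the rule application above. The date is written in an ISO-like style here and must use one of the supported date/time formats described below; all values are hypothetical:

```csv
Id,Rule Application Id,Date,Rows Total,Rows Failed
M1,RA1,2023-06-01 04:15:00,19972,28
```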
- Profiling (folder with Data Profiling information, optional)
- Table.csv (table level profiling information)
- Type (table path, connection type; required to interpret and correctly stitch the referenced table in MM. See the list of possible values below.)
- Server (table path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
- Catalog (table path, catalog name: for example, a Snowflake or SQL Server database. Leave empty for Oracle and file systems.)
- Schema (table path, schema name: a database schema or a file folder path such as /dir1/dir2)
- Table (table path, table name; can also be a file or partitioned directory name)
- Creation Date (the date/time when the data profiling information was collected; see the supported date/time formats below)
- Data Update Date (the date/time when data in the source table was last updated; see the supported date/time formats below)
- Records All (the number of records in the source table)
- Records Processed (the number of records that have been processed by the profiler)
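A hypothetical table-level profiling row, again using the sample JDBC path; the dates must follow one of the supported date/time formats described below:

```csv
Type,Server,Catalog,Schema,Table,Creation Date,Data Update Date,Records All,Records Processed
JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,2023-06-01 04:15:00,2023-05-31 23:50:00,19972,19972
```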
- Column.csv (column level profiling information)
- Type (column path, connection type; required to interpret and correctly stitch the referenced column in MM. See the list of possible values below.)
- Server (column path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
- Catalog (column path, catalog name: for example, a Snowflake or SQL Server database. Leave empty for Oracle and file systems.)
- Schema (column path, schema name: a database schema or a file folder path such as /dir1/dir2)
- Table (column path, table name; can also be a file or partitioned directory name)
- Column (column path, column name)
- Creation Date (the date/time when the data profiling information was collected; see the supported date/time formats below)
- Count (KPI with the count of all values in the column. Long value)
- Distinct (KPI with the counter for distinct values. Long value)
- Duplicate (KPI with the counter for duplicate values. Long value)
- Blank (KPI with the counter for blank string values. Long value)
- Zero (KPI with the counter for zero numeric values. Long value)
- Null (KPI with the counter for NULL values. Long value)
- Min (KPI with the minimum value in the column. Usually applicable to numeric values.)
- Max (KPI with the maximum value in the column. Usually applicable to numeric values.)
- Mean (KPI with the mean value in the column. Usually applicable to numeric values.)
- Variance (KPI with the variance of the values in the column. Usually applicable to numeric values.)
- Median (KPI with the median value in the column. Usually applicable to numeric values.)
- Lower quantile (KPI with the lower quantile value in the column. Usually applicable to numeric values.)
- Upper quantile (KPI with the upper quantile value in the column. Usually applicable to numeric values.)
- Avg length (KPI with the average value length in the column. Usually applicable to string values. Double value)
- Min length (KPI with the minimum value length in the column. Usually applicable to string values. Long value)
- Max length (KPI with the maximum value length in the column. Usually applicable to string values. Long value)
- Valid (KPI with the counter for valid values. Usually computed based on the column’s data type. Long value)
- Invalid (KPI with the counter for invalid values. Usually computed based on the column’s data type. Long value)
- Data Type Invalid Values (serialized list of the invalid values, usually computed based on the column’s data type. Use a semicolon ‘;’ as the delimiter and a backslash ‘\’ as the escape character to serialize multiple values)
- Data Type Frequency (serialized list of data type frequencies in the format “[value],[count],[ratio]”, e.g. INTEGER,6,60.00;STRING,4,40.00. Use a semicolon ‘;’ as the delimiter and a backslash ‘\’ as the escape character to serialize multiple values)
- Pattern Frequency (serialized list of pattern frequencies in the format “[value],[count],[ratio]”, e.g. 999-99-9999,19972,1.00000. Use a semicolon ‘;’ as the delimiter and a backslash ‘\’ as the escape character to serialize multiple values)
- Value Frequency (serialized list of value frequencies in the format “[value],[count],[ratio]”, e.g. 987-66-3365,1,0.00005;987-66-5400,1,0.00005. Use a semicolon ‘;’ as the delimiter and a backslash ‘\’ as the escape character to serialize multiple values)
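A sketch of a column-level profiling row for a string column. The numeric KPIs (Zero, and Min through Upper quantile) are left empty because they do not apply, and all values are hypothetical; the frequency fields reuse the example payloads shown above and are quoted because they contain commas:

```csv
Type,Server,Catalog,Schema,Table,Column,Creation Date,Count,Distinct,Duplicate,Blank,Zero,Null,Min,Max,Mean,Variance,Median,Lower quantile,Upper quantile,Avg length,Min length,Max length,Valid,Invalid,Data Type Invalid Values,Data Type Frequency,Pattern Frequency,Value Frequency
JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,SSN,2023-06-01 04:15:00,19972,19944,28,0,,0,,,,,,,,11.0,11,11,19970,2,N/A;UNKNOWN,"STRING,19972,100.00","999-99-9999,19972,1.00000","987-66-3365,1,0.00005;987-66-5400,1,0.00005"
```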
- Sampling (folder with Data Sampling information, optional)
- Data Sampling.csv (a mapping file used to look up Data Sampling information for individual tables/files. It resolves special characters that may appear in table names but are not allowed as file names in the file system)
- *[file path recreated from ‘Data Sampling.csv’] (e.g. "\JDBC\172.0.0.1\FinanceDWDimensional\dbo\Customer.csv")
- Table Url.csv (specifies a hyperlink to the original DQ source for the given table, optional)
- Type (table path, connection type; required to interpret and correctly stitch the referenced table in MM. See the list of possible values below.)
- Server (table path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
- Catalog (table path, catalog name: for example, a Snowflake or SQL Server database. Leave empty for Oracle and file systems.)
- Schema (table path, schema name: a database schema or a file folder path such as /dir1/dir2)
- Table (table path, table name; can also be a file or partitioned directory name)
- Url (a hyperlink to the corresponding table record in the original DQ tool)
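A sample Table Url.csv row pointing the Customer table back at its record in the originating DQ tool; the URL is hypothetical:

```csv
Type,Server,Catalog,Schema,Table,Url
JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,https://dq.example.com/tables/customer
```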