
Data Quality CSV Format

  • Rules (folder with Data Quality information)
    • Folder.csv (lets you recreate a hierarchy of folders to organize DQ rules; sample rows follow the column list)
      • Id (folder unique identifier)
      • Name (folder name)
      • Parent Folder Id (identifier of the parent folder, optional)
      • Description (folder description, optional)
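      For illustration, a minimal Folder.csv could look like the following; all identifiers, names, and descriptions are hypothetical:
        Id,Name,Parent Folder Id,Description
        F1,Finance,,Top-level folder for finance DQ rules
        F2,Customer,F1,Rules that validate customer master data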
    • Rule.csv (provides DQ rule declarations)
      • Id (rule unique identifier)
      • Name (rule name; it must be unique within its parent folder)
      • Folder Id (identifier of the parent folder)
      • Description (rule description, optional)
      • Operation (rule operation, optional)
      • Dimension (rule dimension name. See the list of possible values below)
      • Url (hyperlink to the rule in the original DQ tool, optional)
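      A sketch of a Rule.csv with a single rule follows; the folder id matches the Folder.csv sketch above, while the REGEX_MATCH operation, the Validity dimension, and the URL are stand-in values, not taken from the value lists referenced below:
        Id,Name,Folder Id,Description,Operation,Dimension,Url
        R1,SSN Format Check,F2,SSN must match the 999-99-9999 pattern,REGEX_MATCH,Validity,https://dq.example.com/rules/R1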
    • Rule Application.csv (specifies DQ rule instances and associates them with individual columns)
      • Id (rule application unique identifier)
      • Rule Id (identifier of the associated DQ rule)
      • Type (column path, connection type. Required to interpret and correctly stitch the referenced column in MM. See the list of possible values below.)
      • Server (column path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
      • Catalog (column path, catalog name. It can be a Snowflake or SQL Server database. Leave it empty for Oracle and file systems.)
      • Schema (column path, schema name. It can be a database schema or a file folder path such as /dir1/dir2)
      • Table (column path, table name. It can also be a file or partitioned directory name)
      • Column (column path, column name)
      • Description (rule application description, optional)
      • Threshold (rule application threshold: a floating-point value used to compute the DQ Status in MM (Good, Acceptable, or Bad). Use a comma as the delimiter if more than one value is required)
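      As a sketch, one rule application binding rule R1 to a hypothetical Customer.SSN column might read as follows; JDBC is the connection type used in the sampling example later on this page, and the two-value Threshold is quoted because its values are themselves comma-delimited:
        Id,Rule Id,Type,Server,Catalog,Schema,Table,Column,Description,Threshold
        RA1,R1,JDBC,172.0.0.1:1433,FinanceDWDimensional,dbo,Customer,SSN,SSN format check on the Customer table,"0.9,0.7"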
    • Rule Application Column.csv (provides additional associations between DQ Rule Applications and the corresponding columns when a Rule Application refers to two or more columns)
      • Rule Application Id (rule application identifier)
      • Type (column path, connection type. Required to interpret and correctly stitch the referenced column in MM. See the list of possible values below.)
      • Server (column path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
      • Catalog (column path, catalog name. It can be a Snowflake or SQL Server database. Leave it empty for Oracle and file systems.)
      • Schema (column path, schema name. It can be a database schema or a file folder path such as /dir1/dir2)
      • Table (column path, table name. It can also be a file or partitioned directory name)
      • Column (column path, column name)
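      For example, a hypothetical rule application RA2 that compares two columns of the same table would contribute one row per additional column:
        Rule Application Id,Type,Server,Catalog,Schema,Table,Column
        RA2,JDBC,172.0.0.1:1433,FinanceDWDimensional,dbo,Customer,FirstName
        RA2,JDBC,172.0.0.1:1433,FinanceDWDimensional,dbo,Customer,LastName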
    • Rule Measure.csv (provides rule execution results)
      • Id (rule measure unique identifier)
      • Rule Application Id (rule application identifier)
      • Date (the execution date when the values were collected. See the supported date/time format below)
      • Rows Total (the total number of processed rows)
      • Rows Failed (the number of rows that failed the DQ rule)
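      A sample Rule Measure.csv row for the rule application above might look like this; the yyyy-MM-dd HH:mm:ss date shown is only an assumption, so confirm it against the supported date/time format referenced above:
        Id,Rule Application Id,Date,Rows Total,Rows Failed
        M1,RA1,2023-01-15 00:00:00,19972,28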
  • Profiling (folder with Data Profiling information, optional)
    • Table.csv (table level profiling information)
      • Type (table path, connection type. Required to interpret and correctly stitch the referenced table in MM. See the list of possible values below.)
      • Server (table path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
      • Catalog (table path, catalog name. It can be a Snowflake or SQL Server database. Leave it empty for Oracle and file systems.)
      • Schema (table path, schema name. It can be a database schema or a file folder path such as /dir1/dir2)
      • Table (table path, table name. It can also be a file or partitioned directory name)
      • Creation Date (the date/time when the data profiling information was collected. See the supported date/time format below)
      • Data Update Date (the date/time when data was last updated in the source table. See the supported date/time format below)
      • Records All (the number of records in the source table)
      • Records Processed (the number of records that have been processed by the profiler)
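      A hypothetical Table.csv row for the same Customer table could look like this, again assuming a yyyy-MM-dd HH:mm:ss date/time form:
        Type,Server,Catalog,Schema,Table,Creation Date,Data Update Date,Records All,Records Processed
        JDBC,172.0.0.1:1433,FinanceDWDimensional,dbo,Customer,2023-01-15 02:00:00,2023-01-14 23:10:00,19972,19972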
    • Column.csv (column level profiling information)
      • Type (column path, connection type. Required to interpret and correctly stitch the referenced column in MM. See the list of possible values below.)
      • Server (column path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
      • Catalog (column path, catalog name. It can be a Snowflake or SQL Server database. Leave it empty for Oracle and file systems.)
      • Schema (column path, schema name. It can be a database schema or a file folder path such as /dir1/dir2)
      • Table (column path, table name. It can also be a file or partitioned directory name)
      • Column (column path, column name)
      • Creation Date (the date/time when the data profiling information was collected. See the supported date/time format below)
      • Count (KPI with the count of all the values in the column. Long value)
      • Distinct (KPI with the counter for distinct values. Long value)
      • Duplicate (KPI with the counter for duplicate values. Long value)
      • Blank (KPI with the counter for blank string values. Long value)
      • Zero (KPI with the counter for zero numeric values. Long value)
      • Null (KPI with the counter for NULL values. Long value)
      • Min (KPI with the minimum value in the column. Usually applicable to numeric values.)
      • Max (KPI with the maximum value in the column. Usually applicable to numeric values.)
      • Mean (KPI with the mean value in the column. Usually applicable to numeric values.)
      • Variance (KPI with the variance of the values in the column. Usually applicable to numeric values.)
      • Median (KPI with the median value in the column. Usually applicable to numeric values.)
      • Lower quantile (KPI with the lower quantile value in the column. Usually applicable to numeric values.)
      • Upper quantile (KPI with the upper quantile value in the column. Usually applicable to numeric values.)
      • Avg length (KPI with the average value length in the column. Usually applicable to string values. Double value)
      • Min length (KPI with the minimum value length in the column. Usually applicable to string values. Long value)
      • Max length (KPI with the maximum value length in the column. Usually applicable to string values. Long value)
      • Valid (KPI with the counter for valid values. Usually computed based on the column's data type. Long value)
      • Invalid (KPI with the counter for invalid values. Usually computed based on the column's data type. Long value)
      • Data Type Invalid Values (serialized list of invalid values, usually computed based on the column's data type. Use the semicolon character ';' as a delimiter and the backslash character '\' as an escape character when serializing multiple values)
      • Data Type Frequency (serialized list of data type frequencies in the format "[value],[count],[ratio]" (e.g. INTEGER,6,60.00;STRING,4,40.00). Use the semicolon character ';' as a delimiter and the backslash character '\' as an escape character when serializing multiple values)
      • Pattern Frequency (serialized list of pattern frequencies in the format "[value],[count],[ratio]" (e.g. 999-99-9999,19972,1.00000). Use the semicolon character ';' as a delimiter and the backslash character '\' as an escape character when serializing multiple values)
      • Value Frequency (serialized list of value frequencies in the format "[value],[count],[ratio]" (e.g. 987-66-3365,1,0.00005;987-66-5400,1,0.00005). Use the semicolon character ';' as a delimiter and the backslash character '\' as an escape character when serializing multiple values)
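      Because Column.csv carries many KPI columns, the sketch below shows only a subset of them; a real file would include the full header. The Pattern Frequency value reuses the example above and is quoted because its [value],[count],[ratio] triple contains commas:
        Type,Server,Catalog,Schema,Table,Column,Creation Date,Count,Distinct,Null,Pattern Frequency
        JDBC,172.0.0.1:1433,FinanceDWDimensional,dbo,Customer,SSN,2023-01-15 02:00:00,19972,19971,0,"999-99-9999,19972,1.00000"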
  • Sampling (folder with Data Sampling information, optional)
    • Data Sampling.csv (mapping file used to look up Data Sampling information for individual tables/files. It resolves special characters that may occur in table names but are not allowed in file names on the file system)
    • *[file path recreated from 'Data Sampling.csv'] (e.g. "\JDBC\172.0.0.1\FinanceDWDimensional\dbo\Customer.csv")
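      Taken together, a Sampling folder might be laid out as follows; the per-table CSV mirrors the example path above and is assumed to contain plain sample rows (a header line followed by data lines):
        Sampling
          Data Sampling.csv
          \JDBC\172.0.0.1\FinanceDWDimensional\dbo\Customer.csv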
    • Table Url.csv (specifies a hyperlink to the original DQ source for a given table, optional)
      • Type (table path, connection type. Required to interpret and correctly stitch the referenced table in MM. See the list of possible values below.)
      • Server (table path, server name: a database host:port, a Snowflake warehouse, or a file service such as gs://bucket)
      • Catalog (table path, catalog name. It can be a Snowflake or SQL Server database. Leave it empty for Oracle and file systems.)
      • Schema (table path, schema name. It can be a database schema or a file folder path such as /dir1/dir2)
      • Table (table path, table name. It can also be a file or partitioned directory name)
      • Url (a hyperlink to the corresponding table record in the original DQ tool)
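      A single-row Table Url.csv for the hypothetical Customer table might look like this:
        Type,Server,Catalog,Schema,Table,Url
        JDBC,172.0.0.1:1433,FinanceDWDimensional,dbo,Customer,https://dq.example.com/tables/Customer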
