Multi-models or Catalogs of Models
A model in the Talend Data Catalog repository is a term with multiple layers of meaning. For imported content, a model is the minimal unit of change when importing or re-importing (versioning due to import), when including it in a configuration, or when referencing it in a semantic mapping (as source or target). It is the repository object that appears in the Repository Manager tree (with its associated versions) and in the Configuration Manager list.
A model may consist of only one self-contained model, as with a logical data model, a DDL import, or an individual CSV or JSON file import. This is a straightforward arrangement in which one model (the import scope) includes one organizational (internal) model.
However, in many cases the import scope includes multiple self-contained models. Examples include an import from a data modeling tool with a logical model and multiple related physical models; an RDBMS source with multiple databases and/or schemas (each a self-contained model) organized by a catalog (another self-contained model); and a file system bridge that imports multiple files (each a self-contained model) organized in a hierarchy or directory structure (another self-contained model). In these scenarios, a given import produces many contained models plus a directory or catalog model that organizes them. Together these comprise what is referred to as a multi-model.
Multi-models are very convenient, as they allow for:
- Incremental harvesting of only those databases, schemas, packages, reports or files which have changed since the last harvest, dramatically improving import performance.
- Storage of only what has changed in the repository. If only one or two of the contained models changed across the entire import scope, then new versions of only those contained models need to be created and stored.
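As a rough illustration of incremental harvesting, the sketch below compares what a source currently reports against a cache from the last import. How Talend Data Catalog actually detects change is not described here, so the fingerprint strings, the function name `plan_incremental_harvest`, and the cache layout are all assumptions made for this example.

```python
def plan_incremental_harvest(source_fingerprints: dict, cache: dict):
    """Split contained models into those to import, those unchanged, and those removed.

    source_fingerprints: contained model name -> fingerprint reported by the source now
    cache: contained model name -> fingerprint recorded at the last import
    (Both structures are hypothetical stand-ins for whatever the bridge really uses.)
    """
    to_import = sorted(
        name for name, fp in source_fingerprints.items() if cache.get(name) != fp
    )
    unchanged = sorted(
        name for name, fp in source_fingerprints.items() if cache.get(name) == fp
    )
    removed = sorted(name for name in cache if name not in source_fingerprints)
    return to_import, unchanged, removed


# Last import saw three schemas; now schema2 differs and schema3 is gone.
cache = {"schema1": "f1", "schema2": "f2", "schema3": "f3"}
source = {"schema1": "f1", "schema2": "f2-changed"}
to_import, unchanged, removed = plan_incremental_harvest(source, cache)
print(to_import, unchanged, removed)
```

Only the names in `to_import` would need to be read from the source in full, which is where the performance benefit comes from.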
In this way, multi-models are highly efficient. In particular, when storing a multi-model, Talend Data Catalog attempts to minimize the number of versions of individual contained models, reusing them wherever possible and retaining only the subset required to “reconstitute” and present the specific model versions (imported scope) that are harvested and currently maintained.
To store a multi-model, the multi-model is first harvested (itself a process that imports new versions of contained models only if they differ from the cache of the last import). Each contained model is then compared with every previously harvested (and still stored) version of that same contained model in the repository. If a matching version already exists, the full multi-model version is associated with that existing contained model version. If the imported contained model is truly new, a new version of that contained model is written to the repository and the full multi-model version is associated with that new contained model version.
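The store-or-reuse decision described above can be sketched as a small content-addressed store. This is only an illustrative model, not the product's implementation: the `Repository` class, the use of a SHA-256 hash over JSON as the comparison, and the version identifiers are all assumptions.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(content: dict) -> str:
    """Stable hash of a contained model's content (hypothetical representation)."""
    canonical = json.dumps(content, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class Repository:
    # contained model name -> {fingerprint: stored contained version id}
    stored: dict = field(default_factory=dict)
    # multi-model version id -> {contained model name: contained version id}
    multi_model_versions: dict = field(default_factory=dict)

    def store_harvest(self, version_id: str, harvested: dict) -> list:
        """Store one multi-model version; return the contained versions newly written."""
        written = []
        association = {}
        for name, content in harvested.items():
            versions = self.stored.setdefault(name, {})
            fp = fingerprint(content)
            if fp not in versions:
                # Truly new content: write a new contained model version.
                versions[fp] = f"{name}@{version_id}"
                written.append(versions[fp])
            # In both cases, associate the multi-model version with that version.
            association[name] = versions[fp]
        self.multi_model_versions[version_id] = association
        return written


repo = Repository()
first = repo.store_harvest(
    "v1", {"schema1": {"tables": ["a"]}, "schema2": {"tables": ["b"]}}
)
second = repo.store_harvest(
    "v2", {"schema1": {"tables": ["a"]}, "schema2": {"tables": ["b", "c"]}}
)
print(first)   # ['schema1@v1', 'schema2@v1']
print(second)  # ['schema2@v2']
print(repo.multi_model_versions["v2"])
```

In this sketch the second multi-model version points at `schema1@v1` rather than storing a duplicate, which mirrors the association step in the text.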
To present the proper version of a multi-model, search results, browsing, API results, and so on are all based on the versions of the contained models that are associated with the version of the imported model included in the current configuration. This sounds complex, but the complexity is managed by the backend and is transparent to the user.
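To make that resolution step concrete, here is a minimal sketch of looking up contained model versions through a configuration. The `MultiModelVersion` class, the `visible_contained_versions` function, the model name `SalesDB`, and the version labels are hypothetical; they only illustrate the lookup chain of configuration, imported model version, and contained model versions.

```python
from dataclasses import dataclass


@dataclass
class MultiModelVersion:
    label: str
    contained: dict  # contained model name -> contained model version id


def visible_contained_versions(configuration: dict, versions: dict) -> dict:
    """Return the contained model versions a search or browse request would see.

    configuration: imported model name -> version label included in the configuration
    versions: imported model name -> {version label: MultiModelVersion}
    """
    visible = {}
    for model_name, label in configuration.items():
        version = versions[model_name][label]
        for contained_name, contained_version in version.contained.items():
            visible[(model_name, contained_name)] = contained_version
    return visible


versions = {
    "SalesDB": {
        "h2": MultiModelVersion("h2", {"schema1": "schema1@h1", "schema2": "schema2@h2"}),
        "h3": MultiModelVersion("h3", {"schema1": "schema1@h3", "schema2": "schema2@h2"}),
    }
}
configuration = {"SalesDB": "h2"}  # the configuration pins one imported model version
result = visible_contained_versions(configuration, versions)
print(result)
```

Changing the configuration to point at `h3` would change which version of `schema1` is visible without touching any stored contained model.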
For example, in the picture above, we have four different harvests of an imported model (a relational database in this illustration), and only some of the schemas prove to be updated at each harvest.
For the first harvest on 1 May 2023, we see there are three schemas (1, 2 and 3), and as this is the first harvest all three become versions of contained models.
For the second harvest on 15 May 2023, we see:
- Schema1 is unchanged and thus no new contained model version is created for it.
- Schema2 has changes from the previous version imported and thus a new version of the contained model for schema2 is imported and included in the new version of the imported model.
- Schema3 has been deleted and thus is no longer included in the new version of the imported model.
For the third harvest on 30 May 2023, we see:
- Schema1 has changes from the previous version imported and thus a new version of the contained model for schema1 is imported and included in the new version of the imported model.
- Schema2 is unchanged and thus no new contained model version is created for it.
- Schema4 has been added to the database, and a first new version of the contained model for schema4 is imported and included in the new version of the imported model.
For the fourth harvest on 30 May 2023, we see:
- Schema1 has changes from the previous version, but these changes merely restore it to the 20230501 version of that contained model. The schema has thus reverted to an earlier state, and that earlier contained model version is reused in the new version of the imported model.
- Schema2 has changes from the previous version imported and thus a new version of the contained model for schema2 is imported and included in the new version of the imported model.
- Schema4 is unchanged and thus no new contained model version is created for it.
Reuse of older versions of contained models because of reverted changes to the harvested model is common, and it is implemented efficiently by repointing the new imported model version to that earlier contained model version.
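The four harvests above can be replayed in a few lines to see which contained model versions are stored and which are reused. This is a simplified sketch: the state labels `A`, `B`, and `C` are hypothetical stand-ins for each schema's content, harvests are identified by ordinal rather than date, and every earlier version is assumed to still be stored.

```python
# Each harvest maps schema name -> a label standing in for that schema's content.
harvests = [
    {"schema1": "A", "schema2": "A", "schema3": "A"},  # 1: first harvest
    {"schema1": "A", "schema2": "B"},                  # 2: schema2 changed, schema3 deleted
    {"schema1": "B", "schema2": "B", "schema4": "A"},  # 3: schema1 changed, schema4 added
    {"schema1": "A", "schema2": "C", "schema4": "A"},  # 4: schema1 reverted, schema2 changed
]


def replay(harvests):
    """Return the stored versions and, per harvest, which harvest each schema points to."""
    stored = {}  # (schema, state) -> number of the harvest that first stored it
    report = []
    for number, harvest in enumerate(harvests, start=1):
        row = {}
        for schema, state in harvest.items():
            key = (schema, state)
            if key not in stored:
                stored[key] = number  # new contained model version written
            row[schema] = stored[key]  # otherwise repoint to the existing version
        report.append(row)
    return stored, report


stored, report = replay(harvests)
for number, row in enumerate(report, start=1):
    print(number, row)
print(len(stored), "contained model versions stored")
```

Under these assumptions the fourth harvest points schema1 back at the version stored by the first harvest, and 7 contained model versions are stored in total, compared with 11 if every harvest stored every schema it contained.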