Data Models
Data models and enums for terminology mapping workflows.
This module defines the core structured data models used throughout the AATM package to represent source concepts, retrieved expressions, selected mapping results, translations, and task configurations. It also provides supporting enumerations for expression provenance and standard vocabularies, along with utility methods for loading, validating, serializing, and transforming these objects.
The models are primarily implemented with Pydantic and are designed to support terminology mapping pipelines that involve retrieval, selection, reranking, and export of standardized concepts.
Notes
The models in this module include validation helpers for coercing concept identifiers to strings, validating date fields, loading records from CSV files, reading task configurations from JSON or YAML, and serializing results back to plain dictionaries or config files.
ExpressionOrigin
Bases: Enum
Enumerate the possible origins of an expression in the mapping pipeline.
This enum identifies whether an expression comes directly from a standard concept, from a synonym of a standard concept, from a mapped non-standard concept, or from a synonym of a mapped non-standard concept.
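The four origins described above could be modelled roughly as follows. This is only a sketch: the class name, member names, and string values are illustrative assumptions, not the actual definitions in the module.

```python
from enum import Enum


class ExpressionOriginSketch(Enum):
    """Hypothetical stand-in for ExpressionOrigin; names and values are assumed."""

    STANDARD_CONCEPT = "standard_concept"
    STANDARD_CONCEPT_SYNONYM = "standard_concept_synonym"
    MAPPED_NONSTANDARD_CONCEPT = "mapped_nonstandard_concept"
    MAPPED_NONSTANDARD_CONCEPT_SYNONYM = "mapped_nonstandard_concept_synonym"


def is_synonym_origin(origin: ExpressionOriginSketch) -> bool:
    """Return True when the expression comes from a synonym rather than a concept name."""
    return origin in {
        ExpressionOriginSketch.STANDARD_CONCEPT_SYNONYM,
        ExpressionOriginSketch.MAPPED_NONSTANDARD_CONCEPT_SYNONYM,
    }
```

A helper such as `is_synonym_origin` (also hypothetical) shows how downstream code might branch on provenance without comparing raw strings.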
Source code in aatm\data_models.py
StandardVocabulary
Bases: Enum
Enumerate the supported standard vocabularies.
This enum defines the target standardized vocabularies currently supported by the terminology mapping workflow.
Source code in aatm\data_models.py
ExpressionMetadata
Bases: BaseModel
Represent metadata for a terminology expression.
This model stores information about an expression, its originating
concept, and its associated standard concept metadata. A
deterministic expression_id is generated after initialization from
the main identifying fields.
Attributes:
| Name | Type | Description |
|---|---|---|
| `expression_id` | `Optional[str]` | Deterministic identifier generated from the expression metadata. |
| `expression` | `Optional[str]` | Original text of the expression. |
| `expression_concept_id` | `Optional[str]` | Identifier of the source concept associated with the expression. |
| `expression_origin` | `Optional[ExpressionOrigin]` | Origin of the expression in relation to standard or non-standard concepts. |
| `std_concept_id` | `Optional[str]` | Identifier of the mapped standard concept. |
| `std_concept_name` | `Optional[str]` | Name of the mapped standard concept. |
| `std_vocabulary_id` | `Optional[StandardVocabulary]` | Standard vocabulary to which the mapped concept belongs. |
| `std_vocabulary_code` | `Optional[str]` | Code of the mapped concept in the standard vocabulary. |
| `std_domain_id` | `Optional[str]` | Domain associated with the mapped standard concept. |
Source code in aatm\data_models.py
validate_concept_id(value)
classmethod
Convert concept-related values to strings before validation.
This validator normalizes concept identifier fields by coercing the
incoming value to str. It is applied before standard Pydantic
validation for selected identifier fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `value` | `Any` | Raw value provided for a concept-related field. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The normalized string representation of the input value. |
Source code in aatm\data_models.py
model_post_init(*args, **kwargs)
Generate a deterministic expression identifier after initialization.
This hook runs after model initialization and populates
expression_id using a deterministic hash derived from the main
expression metadata fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | Positional arguments passed by Pydantic's post-init lifecycle. | `()` |
| `**kwargs` | `Any` | Keyword arguments passed by Pydantic's post-init lifecycle. | `{}` |
Returns:
| Type | Description |
|---|---|
| `None` | None |
Source code in aatm\data_models.py
to_dict()
Convert the model to a dictionary with enum values serialized.
This method returns the model data as a plain dictionary and
replaces enum instances in expression_origin and
std_vocabulary_id with their raw .value representations.
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | A dictionary representation of the model with enum fields serialized as plain values. |
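The enum-to-value conversion described for `to_dict()` can be sketched with the standard library alone. The dataclass below is a simplified stand-in for the Pydantic model, and the enum members are hypothetical placeholders, since the real members of `ExpressionOrigin` and `StandardVocabulary` are not listed here.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Optional


class OriginSketch(Enum):
    # Hypothetical member; the real ExpressionOrigin values are not shown in this page.
    STANDARD = "standard"


class VocabularySketch(Enum):
    # Hypothetical member; the real StandardVocabulary values are not shown in this page.
    EXAMPLE_VOCAB = "example_vocab"


@dataclass
class ExpressionSketch:
    """Simplified stand-in for ExpressionMetadata (dataclass instead of Pydantic)."""

    expression: Optional[str] = None
    expression_origin: Optional[OriginSketch] = None
    std_vocabulary_id: Optional[VocabularySketch] = None

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        # Replace enum instances with their raw values; None is left untouched.
        for key in ("expression_origin", "std_vocabulary_id"):
            value = data[key]
            if isinstance(value, Enum):
                data[key] = value.value
        return data
```

The resulting dictionary contains only plain values, which is convenient when rows are later written to CSV or JSON.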
Source code in aatm\data_models.py
Translation
Bases: BaseModel
Represent a translated text value.
Attributes:
| Name | Type | Description |
|---|---|---|
| `text` | `str` | Translated text string. |
Source code in aatm\data_models.py
SourceConcept
Bases: BaseModel
Represent a source concept to be mapped.
This model stores the original source terminology fields and related validity metadata. Extra fields are ignored to allow loading from broader tabular inputs.
Attributes:
| Name | Type | Description |
|---|---|---|
| `source_code` | `Optional[str]` | Code of the source concept. |
| `source_concept_id` | `Optional[str]` | Identifier of the source concept. |
| `source_vocabulary_id` | `Optional[str]` | Vocabulary identifier of the source concept. |
| `source_code_description` | `Optional[str]` | Human-readable description of the source concept. |
| `valid_start_date` | `Optional[str]` | Start date of validity in `YYYY-MM-DD` format. |
| `valid_end_date` | `Optional[str]` | End date of validity in `YYYY-MM-DD` format. |
| `invalid_reason` | `Optional[str]` | Reason why the source concept is invalid, when applicable. |
Source code in aatm\data_models.py
validate_strings(value)
classmethod
Convert source identifier fields to strings before validation.
This validator coerces selected source concept fields to str
before standard Pydantic validation is applied.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `value` | `Any` | Raw value provided for a source identifier field. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The normalized string representation of the input value. |
Source code in aatm\data_models.py
validate_yyyy_mm_dd(v)
classmethod
Validate that a date string follows the YYYY-MM-DD format.
This validator accepts empty strings unchanged and otherwise checks that the provided value is a string matching the expected date format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `v` | `str` | Raw date value to validate. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The validated date string. |
Raises:
| Type | Description |
|---|---|
| `TypeError` | If the provided value is not a string. |
| `ValueError` | If the string is not empty and does not match the `YYYY-MM-DD` format. |
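The validation rules above (empty strings pass, non-strings raise `TypeError`, malformed dates raise `ValueError`) might look like the following standalone function. It is a sketch, not the module's validator: the function name is hypothetical, and the extra calendar check via `datetime.strptime` is an assumption that goes beyond the documented format check.

```python
import re
from datetime import datetime

# Four digits, two digits, two digits, separated by hyphens.
_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_yyyy_mm_dd_sketch(value: str) -> str:
    """Hypothetical standalone version of the date check described above."""
    if not isinstance(value, str):
        raise TypeError(f"expected a string, got {type(value).__name__}")
    if value == "":
        # Empty strings are accepted unchanged.
        return value
    if not _DATE_PATTERN.fullmatch(value):
        raise ValueError(f"{value!r} does not match YYYY-MM-DD")
    # Assumption: also reject impossible calendar dates such as 2024-02-30.
    datetime.strptime(value, "%Y-%m-%d")
    return value
```

The regular expression is applied before `strptime` because `strptime` on its own tolerates unpadded values such as `2024-1-5`.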
Source code in aatm\data_models.py
from_csv(path)
classmethod
Load source concepts from a CSV file.
This class method reads a CSV file into a pandas DataFrame, replaces
missing values with empty strings, and constructs one
SourceConcept instance per row.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Path to the CSV file containing source concept records. | required |
Returns:
| Type | Description |
|---|---|
| `List[SourceConcept]` | A list of `SourceConcept` instances, one per CSV row. |
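The loading behaviour described for `from_csv` (one object per row, missing values as empty strings, extra columns ignored) can be illustrated without pandas. The sketch below deliberately swaps in the standard-library `csv` module and a dataclass stand-in, so it shows the idea rather than the module's implementation; the class and function names are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass
class SourceConceptSketch:
    # Field names follow the attribute table above; the class itself is a stand-in.
    source_code: str = ""
    source_concept_id: str = ""
    source_vocabulary_id: str = ""
    source_code_description: str = ""


def load_source_concepts(handle) -> list[SourceConceptSketch]:
    """Build one SourceConceptSketch per CSV row from an open text handle."""
    known = {f.name for f in fields(SourceConceptSketch)}
    concepts = []
    for row in csv.DictReader(handle):
        # Keep only known columns (extras are ignored) and turn missing cells into "".
        cleaned = {key: (value or "") for key, value in row.items() if key in known}
        concepts.append(SourceConceptSketch(**cleaned))
    return concepts


csv_text = (
    "source_code,source_code_description,extra_column\n"
    "A01,Typhoid fever,ignored\n"
    "B02,,ignored\n"
)
concepts = load_source_concepts(io.StringIO(csv_text))
```

Here `extra_column` is dropped silently, mirroring the "extra fields are ignored" behaviour noted for `SourceConcept`.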
Source code in aatm\data_models.py
MappedSourceConcept
Bases: SourceConcept
Represent a source concept together with its mapped target concept.
This model extends SourceConcept by including the selected target
standardized concept and related vocabulary metadata.
Attributes:
| Name | Type | Description |
|---|---|---|
| `target_concept_id` | `Optional[str]` | Identifier of the mapped target concept. |
| `target_vocabulary_id` | `Optional[StandardVocabulary]` | Standard vocabulary of the mapped target concept. |
| `target_vocabulary_code` | `Optional[str]` | Code of the mapped target concept in the target vocabulary. |
| `domain_id` | `Optional[str]` | Domain of the mapped target concept. |
| `confidence_score` | `Optional[float]` | Confidence score of the mapping. |
| `source_code_description_original` | `Optional[str]` | Source code description before translation. |
Source code in aatm\data_models.py
validate_strings(value)
classmethod
Convert selected source and target fields to strings before validation.
This validator coerces selected identifier and vocabulary fields to
str when a value is provided. None values are preserved.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `value` | `Any` | Raw value provided for a validated field. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The normalized string representation of the input value, or `None` when the input is `None`. |
Source code in aatm\data_models.py
from_csv(path)
classmethod
Load mapped source concepts from a CSV file.
This class method reads a CSV file into a pandas DataFrame, replaces
missing values with empty strings, and constructs one
MappedSourceConcept instance per row.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Path to the CSV file containing mapped source concept records. | required |
Returns:
| Type | Description |
|---|---|
| `List[MappedSourceConcept]` | A list of `MappedSourceConcept` instances loaded from the CSV file. |
Source code in aatm\data_models.py
from_selector_results(source_concepts, results, translated_source_code_descriptions=None)
classmethod
Build mapped source concepts from source concepts and selector results.
This class method combines each SourceConcept with its
corresponding selected result and produces a list of
MappedSourceConcept instances containing both source and mapped
target fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `source_concepts` | `List[SourceConcept]` | Source concepts to be combined with mapping selections. | required |
| `results` | `SelectorResults` | Selection results containing the chosen standardized concept for each source concept. | required |
| `translated_source_code_descriptions` | `Optional[List[Translation]]` | Optional list of translations for the source code descriptions. | `None` |
Returns:
| Type | Description |
|---|---|
| `List[MappedSourceConcept]` | A list of `MappedSourceConcept` instances combining the source concepts and selector results. |
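The combination step can be pictured as a positional zip of source concepts and selections. The sketch below uses plain dictionaries instead of the Pydantic models, and several details are assumptions rather than documented behaviour: the pairing by list position, the length checks, the correspondence between `std_*` and target fields, and the way translations replace the working description. The function name is hypothetical.

```python
from typing import Any, Optional


def combine_with_selections(
    source_concepts: list[dict[str, Any]],
    selections: list[dict[str, Any]],
    translations: Optional[list[str]] = None,
) -> list[dict[str, Any]]:
    """Merge each source concept with the selection at the same list position."""
    # Assumption: inputs are aligned one-to-one by position.
    if len(source_concepts) != len(selections):
        raise ValueError("expected one selection per source concept")
    if translations is not None and len(translations) != len(source_concepts):
        raise ValueError("expected one translation per source concept")

    mapped = []
    for index, (source, selected) in enumerate(zip(source_concepts, selections)):
        row = dict(source)  # copy so the input dictionaries are not mutated
        # Assumed correspondence: std_* fields of the selection feed the target fields.
        row["target_concept_id"] = selected.get("std_concept_id")
        row["target_vocabulary_id"] = selected.get("std_vocabulary_id")
        row["target_vocabulary_code"] = selected.get("std_vocabulary_code")
        row["domain_id"] = selected.get("std_domain_id")
        if translations is not None:
            # Assumption: keep the original text and use the translation as the description.
            row["source_code_description_original"] = source.get("source_code_description")
            row["source_code_description"] = translations[index]
        mapped.append(row)
    return mapped
```

An empty selection (all fields `None`) simply yields `None` for every target field in this sketch.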
Source code in aatm\data_models.py
to_dict()
Convert the model to a dictionary with enum values serialized.
This method returns the model data as a plain dictionary and
replaces target_vocabulary_id with its raw .value
representation when present.
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | A dictionary representation of the model with enum fields serialized as plain values. |
Source code in aatm\data_models.py
RetrievedExpressionMetadata
Bases: ExpressionMetadata
Represent retrieved expression metadata with ranking information.
This model extends ExpressionMetadata with fields produced during
retrieval and reranking steps. Extra fields are ignored to support
flexible loading from retrieval outputs.
Attributes:
| Name | Type | Description |
|---|---|---|
| `distance` | `Optional[float]` | Retrieval distance or similarity-derived distance for the expression. |
| `rerank_score` | `Optional[float]` | Score assigned during reranking. |
Source code in aatm\data_models.py
to_prompt_object(*args, **kwargs)
Convert retrieved metadata to a prompt-friendly dictionary.
This method returns a reduced dictionary representation containing the main standardized concept fields needed for prompt construction in downstream components.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `*args` | `Any` | Additional positional arguments accepted for interface compatibility. | `()` |
| `**kwargs` | `Any` | Additional keyword arguments accepted for interface compatibility. | `{}` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, str]` | A dictionary containing the expression identifier, expression text, standard concept name, standard vocabulary identifier, standard vocabulary code, and standard domain identifier. |
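A reduced, prompt-oriented view of a retrieved record could be produced along these lines. The sketch works on a plain dictionary; the output key names are assumed to match the model's field names, and skipping `None` values is an additional assumption not stated in the description above.

```python
# Fields named in the return description; the output keys are assumed to match.
PROMPT_FIELDS = (
    "expression_id",
    "expression",
    "std_concept_name",
    "std_vocabulary_id",
    "std_vocabulary_code",
    "std_domain_id",
)


def to_prompt_object_sketch(metadata: dict) -> dict[str, str]:
    """Keep only the prompt-relevant fields, dropping retrieval-only ones."""
    # Assumption: fields that are None are omitted rather than rendered as "None".
    return {
        name: str(metadata[name])
        for name in PROMPT_FIELDS
        if metadata.get(name) is not None
    }
```

Ranking fields such as `distance` and `rerank_score` never reach the prompt in this version, which keeps the prompt payload small.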
Source code in aatm\data_models.py
SelectedExpressionMetadata
Bases: RetrievedExpressionMetadata
Represent the expression selected from a retrieval result list.
This model extends RetrievedExpressionMetadata by recording the
index of the selected item in the original candidate list.
Attributes:
| Name | Type | Description |
|---|---|---|
| `result_list_index` | `int` | Index of the selected result in the candidate list. |
Source code in aatm\data_models.py
EmptySelectionMetadata
Bases: BaseModel
Represent an empty selection result.
This placeholder model is used when no expression is selected. All
fields are explicitly set to None to preserve the expected schema.
Attributes:
| Name | Type | Description |
|---|---|---|
| `expression_id` | `None` | Always `None`. |
| `expression` | `None` | Always `None`. |
| `expression_concept_id` | `None` | Always `None`. |
| `expression_origin` | `None` | Always `None`. |
| `std_concept_id` | `None` | Always `None`. |
| `std_concept_name` | `None` | Always `None`. |
| `std_vocabulary_id` | `None` | Always `None`. |
| `std_vocabulary_code` | `None` | Always `None`. |
| `std_domain_id` | `None` | Always `None`. |
| `result_list_index` | `None` | Always `None`. |
| `distance` | `None` | Always `None`. |
Source code in aatm\data_models.py
RetrieverResults
Bases: BaseModel
Represent the output of a retrieval step.
Attributes:
| Name | Type | Description |
|---|---|---|
| `results` | `List[List[RetrievedExpressionMetadata]]` | Nested list of retrieved expression metadata grouped by query. |
| `queries` | `List[str]` | List of query strings used in retrieval. |
Source code in aatm\data_models.py
SelectorResults
Bases: BaseModel
Represent the output of a selection step.
Attributes:
| Name | Type | Description |
|---|---|---|
| `results` | `List[SelectedExpressionMetadata]` | List of selected expression metadata, one per query or source item. |
| `queries` | `List[str]` | List of query strings associated with the selections. |
Source code in aatm\data_models.py
SelectedResult
Bases: BaseModel
Represent a minimal selected result reference.
Attributes:
| Name | Type | Description |
|---|---|---|
| `expression_id` | `Optional[str]` | Identifier of the selected expression. |
Source code in aatm\data_models.py
TerminologyMappingTask
Bases: BaseModel
Represent the configuration for a terminology mapping task.
This model defines the inputs and execution parameters required to run a terminology mapping workflow. It also provides convenience methods for loading task definitions from JSON or YAML files and for saving them back to disk.
Attributes:
| Name | Type | Description |
|---|---|---|
| `input_file` | `Path` | Path to the input file containing source concepts. |
| `output_dir` | `Optional[Path]` | Directory where mapping outputs will be written. |
| `translator_id` | `Optional[str]` | Identifier of the translator component to use. |
| `retriever_id` | `Optional[str]` | Identifier of the retriever component to use. |
| `selector_id` | `Optional[str]` | Identifier of the selector component to use. |
| `reranker_id` | `Optional[str]` | Identifier of the reranker component to use. |
| `batch_size` | `Optional[int]` | Batch size for processing source concepts. |
| `rate_limit` | `Optional[int]` | Rate limit applied during processing. |
| `column_mapping` | `Optional[dict]` | Optional mapping describing how input columns correspond to expected fields. |
| `limit_to` | `Optional[int]` | Optional limit on the number of source concepts to process. |
Source code in aatm\data_models.py
validate_paths(value)
Convert path-like input values to Path objects before validation.
This validator normalizes the input_file and output_dir fields by
converting incoming string values to Path instances. Existing
Path objects are returned unchanged.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `value` | `Any` | Raw value provided for a path field. | required |
Returns:
| Type | Description |
|---|---|
| `Path` | A normalized `Path` object. |
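The coercion performed by this validator amounts to a small normalization function. The version below is a sketch with a hypothetical name; passing other values (such as `None` for an unset `output_dir`) through unchanged is an assumption, since only strings and `Path` objects are mentioned above.

```python
from pathlib import Path
from typing import Any


def coerce_to_path(value: Any) -> Any:
    """Turn strings into Path objects and leave existing Path objects alone."""
    if isinstance(value, Path):
        return value
    if isinstance(value, str):
        return Path(value)
    # Assumption: anything else (for example None) is left for later validation.
    return value
```

In a Pydantic model, a function like this would typically run in "before" mode so that type validation sees the converted value.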
Source code in aatm\data_models.py
from_json(path)
classmethod
Load a terminology mapping task from a JSON configuration file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Path to a JSON file containing the task configuration. | required |
Returns:
| Type | Description |
|---|---|
| `TerminologyMappingTask` | A `TerminologyMappingTask` instance. |
Source code in aatm\data_models.py
from_yaml(path)
classmethod
Load a terminology mapping task from a YAML configuration file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Path to a YAML file containing the task configuration. | required |
Returns:
| Type | Description |
|---|---|
| `TerminologyMappingTask` | A `TerminologyMappingTask` instance. |
Source code in aatm\data_models.py
from_config_file(path)
classmethod
Load a terminology mapping task from a supported config file.
This method dispatches to from_json or from_yaml based on the
file extension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Path to a supported configuration file. | required |
Returns:
| Type | Description |
|---|---|
| `TerminologyMappingTask` | A `TerminologyMappingTask` instance. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the file extension is not supported. |
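Extension-based dispatch of this kind usually reduces to a lookup on the file suffix. The sketch below only decides which loader would be called; the set of accepted extensions (`.json`, `.yaml`, `.yml`) and the case-insensitive matching are assumptions, and the function name is hypothetical.

```python
from pathlib import Path

# Assumed extension table; the actual supported suffixes are not listed on this page.
_FORMATS = {".json": "json", ".yaml": "yaml", ".yml": "yaml"}


def detect_config_format(path) -> str:
    """Return "json" or "yaml" for a config path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported config file extension: {suffix!r}") from None
```

A caller would then route `"json"` to `from_json` and `"yaml"` to `from_yaml`.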
Source code in aatm\data_models.py
save_to_disk(path)
Save the task configuration to a JSON or YAML file.
This method serializes the current model and writes it to disk based on the output file extension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str \| Path` | Destination path for the serialized configuration file. | required |
Returns:
| Type | Description |
|---|---|
| `None` | None |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the file extension is not supported. |
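For the JSON branch, saving a configuration mostly means converting non-JSON types such as `Path` to strings before writing. The sketch below covers JSON only (YAML would need a third-party library) and operates on a plain dictionary rather than the model; the function name and the `default=str` conversion are assumptions about one reasonable approach.

```python
import json
import tempfile
from pathlib import Path


def save_config_as_json(config: dict, path) -> None:
    """Write a config dictionary to a .json file, stringifying Path values."""
    path = Path(path)
    if path.suffix.lower() != ".json":
        raise ValueError(f"Unsupported config file extension: {path.suffix!r}")
    # Path objects are not JSON serializable, so default=str converts them.
    path.write_text(json.dumps(config, indent=2, default=str), encoding="utf-8")


# Round trip through a temporary directory to show the on-disk representation.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "task.json"
    save_config_as_json({"input_file": Path("concepts.csv"), "batch_size": 8}, target)
    reloaded = json.loads(target.read_text(encoding="utf-8"))
```

After the round trip, `input_file` comes back as a string, which is why a path-coercing validator is useful when the file is loaded again.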
Source code in aatm\data_models.py
deterministic_id_from_strings(strings, digest_size=16)
Generate a deterministic id from a list of strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `strings` | `list[str]` | A list of strings to generate the id from. | required |
| `digest_size` | `int` | The size of the digest in bytes. | `16` |
Returns:
| Type | Description |
|---|---|
| `str` | A deterministic id as a hexadecimal string. |
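One way to derive such an identifier is to hash the strings with a fixed-size digest. The sketch below assumes BLAKE2b (whose `digest_size` parameter is also expressed in bytes), UTF-8 encoding, and a length prefix per string; none of these choices is confirmed by the documentation above, and the function name is hypothetical.

```python
import hashlib


def deterministic_id_sketch(strings: list[str], digest_size: int = 16) -> str:
    """Hash a list of strings into a stable hexadecimal identifier."""
    # Assumption: BLAKE2b is used; the real function may rely on another hash.
    hasher = hashlib.blake2b(digest_size=digest_size)
    for item in strings:
        encoded = item.encode("utf-8")
        # Length prefix so that ["ab", "c"] and ["a", "bc"] hash differently.
        hasher.update(len(encoded).to_bytes(8, "big"))
        hasher.update(encoded)
    return hasher.hexdigest()
```

With the default `digest_size` of 16 bytes, the hexadecimal result is 32 characters long, and the same input list always produces the same identifier.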