Data Models

Data models and enums for terminology mapping workflows.

This module defines the core structured data models used throughout the AATM package to represent source concepts, retrieved expressions, selected mapping results, translations, and task configurations. It also provides supporting enumerations for expression provenance and standard vocabularies, along with utility methods for loading, validating, serializing, and transforming these objects.

The models are primarily implemented with Pydantic and are designed to support terminology mapping pipelines that involve retrieval, selection, reranking, and export of standardized concepts.

Notes

The models in this module include validation helpers for coercing concept identifiers to strings, validating date fields, loading records from CSV files, reading task configurations from JSON or YAML, and serializing results back to plain dictionaries or config files.

ExpressionOrigin

Bases: Enum

Enumerate the possible origins of an expression in the mapping pipeline.

This enum identifies whether an expression comes directly from a standard concept, from a synonym of a standard concept, from a mapped non-standard concept, or from a synonym of a mapped non-standard concept.

Source code in aatm\data_models.py
class ExpressionOrigin(Enum):
    """Enumerate the possible origins of an expression in the mapping pipeline.

    This enum identifies whether an expression comes directly from a
    standard concept, from a synonym of a standard concept, from a
    mapped non-standard concept, or from a synonym of a mapped
    non-standard concept.
    """

    STANDARD_CONCEPT = "standard_concept"
    STANDARD_CONCEPT_SYNONYM = "standard_concept_synonym"
    NON_STANDARD_CONCEPT = "mapped_non_standard_concept"
    NON_STANDARD_CONCEPT_SYNONYM = "mapped_non_standard_concept_synonym"
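As a brief illustration of how members relate to their stored values, the sketch below re-declares the enum locally (only so the snippet runs without the package installed) and looks a member up by value. Note that the two `NON_STANDARD_*` member names do not carry the `mapped_` prefix that their values do.

```python
from enum import Enum


class ExpressionOrigin(Enum):
    """Local copy of the enum above, so this snippet runs on its own."""

    STANDARD_CONCEPT = "standard_concept"
    STANDARD_CONCEPT_SYNONYM = "standard_concept_synonym"
    NON_STANDARD_CONCEPT = "mapped_non_standard_concept"
    NON_STANDARD_CONCEPT_SYNONYM = "mapped_non_standard_concept_synonym"


# Look a member up by its stored value, e.g. when reading a serialized record.
origin = ExpressionOrigin("mapped_non_standard_concept")

# The member name and the stored value differ by the "mapped_" prefix.
origin_name = origin.name
origin_value = origin.value

# A value that is not defined on the enum raises ValueError.
try:
    ExpressionOrigin("non_standard_concept")
    unknown_accepted = True
except ValueError:
    unknown_accepted = False
```

Lookup by value is what Pydantic performs when a plain string is passed to an enum-typed field, so serialized records should store the values rather than the member names.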

StandardVocabulary

Bases: Enum

Enumerate the supported standard vocabularies.

This enum defines the target standardized vocabularies currently supported by the terminology mapping workflow.

Source code in aatm\data_models.py
class StandardVocabulary(Enum):
    """Enumerate the supported standard vocabularies.

    This enum defines the target standardized vocabularies currently
    supported by the terminology mapping workflow.
    """

    SNOMED = "SNOMED"
    RXNORM = "RxNorm"
    LOINC = "LOINC"

ExpressionMetadata

Bases: BaseModel

Represent metadata for a terminology expression.

This model stores information about an expression, its originating concept, and its associated standard concept metadata. A deterministic expression_id is generated after initialization from the main identifying fields.

Attributes:

Name Type Description
expression_id Optional[str]

Deterministic identifier generated from the expression metadata.

expression Optional[str]

Original text of the expression.

expression_concept_id Optional[str]

Identifier of the source concept associated with the expression.

expression_origin Optional[ExpressionOrigin]

Origin of the expression in relation to standard or non-standard concepts.

std_concept_id Optional[str]

Identifier of the mapped standard concept.

std_concept_name Optional[str]

Name of the mapped standard concept.

std_vocabulary_id Optional[StandardVocabulary]

Standard vocabulary to which the mapped concept belongs.

std_vocabulary_code Optional[str]

Code of the mapped concept in the standard vocabulary.

std_domain_id Optional[str]

Domain associated with the mapped standard concept.

Source code in aatm\data_models.py
class ExpressionMetadata(BaseModel):
    """Represent metadata for a terminology expression.

    This model stores information about an expression, its originating
    concept, and its associated standard concept metadata. A
    deterministic `expression_id` is generated after initialization from
    the main identifying fields.

    Attributes:
        expression_id: Deterministic identifier generated from the
            expression metadata.
        expression: Original text of the expression.
        expression_concept_id: Identifier of the source concept
            associated with the expression.
        expression_origin: Origin of the expression in relation to
            standard or non-standard concepts.
        std_concept_id: Identifier of the mapped standard concept.
        std_concept_name: Name of the mapped standard concept.
        std_vocabulary_id: Standard vocabulary to which the mapped
            concept belongs.
        std_vocabulary_code: Code of the mapped concept in the standard
            vocabulary.
        std_domain_id: Domain associated with the mapped standard
            concept.
    """

    expression_id: Optional[str] = None
    expression: Optional[str]
    expression_concept_id: Optional[str]
    expression_origin: Optional[ExpressionOrigin]
    std_concept_id: Optional[str]
    std_concept_name: Optional[str]
    std_vocabulary_id: Optional[StandardVocabulary]
    std_vocabulary_code: Optional[str]
    std_domain_id: Optional[str]

    @field_validator(
        "expression_concept_id", "std_concept_id", "std_vocabulary_code", mode="before"
    )
    @classmethod
    def validate_concept_id(cls, value: Any) -> str:
        """Convert concept-related values to strings before validation.

        This validator normalizes concept identifier fields by coercing the
        incoming value to `str`. It is applied before standard Pydantic
        validation for selected identifier fields.

        Args:
            value: Raw value provided for a concept-related field.

        Returns:
            The normalized string representation of the input value.
        """
        return str(value)

    def model_post_init(self, *args: Any, **kwargs: Any) -> None:
        """Generate a deterministic expression identifier after initialization.

        This hook runs after model initialization and populates
        `expression_id` using a deterministic hash derived from the main
        expression metadata fields.

        Args:
            *args: Positional arguments passed by Pydantic's post-init
                lifecycle.
            **kwargs: Keyword arguments passed by Pydantic's post-init
                lifecycle.

        Returns:
            None
        """
        self.expression_id = deterministic_id_from_strings(
            [
                self.expression,
                self.expression_concept_id,
                self.expression_origin.value if self.expression_origin else None,
                self.std_concept_id,
                self.std_concept_name,
                self.std_vocabulary_id.value if self.std_vocabulary_id else None,
                self.std_vocabulary_code,
            ]
        )

    def to_dict(self) -> dict[str, Any]:
        """Convert the model to a dictionary with enum values serialized.

        This method returns the model data as a plain dictionary and
        replaces enum instances in `expression_origin` and
        `std_vocabulary_id` with their raw `.value` representations.

        Returns:
            A dictionary representation of the model with enum fields
            serialized as plain values.
        """
        model_dict = self.model_dump()
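        # Both enum fields are Optional, so guard against None before
        # reading `.value`.
        model_dict["expression_origin"] = (
            model_dict["expression_origin"].value
            if model_dict["expression_origin"]
            else None
        )
        model_dict["std_vocabulary_id"] = (
            model_dict["std_vocabulary_id"].value
            if model_dict["std_vocabulary_id"]
            else None
        )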
        return model_dict

validate_concept_id(value) classmethod

Convert concept-related values to strings before validation.

This validator normalizes concept identifier fields by coercing the incoming value to str. It is applied before standard Pydantic validation for selected identifier fields.

Parameters:

Name Type Description Default
value Any

Raw value provided for a concept-related field.

required

Returns:

Type Description
str

The normalized string representation of the input value.

Source code in aatm\data_models.py
@field_validator(
    "expression_concept_id", "std_concept_id", "std_vocabulary_code", mode="before"
)
@classmethod
def validate_concept_id(cls, value: Any) -> str:
    """Convert concept-related values to strings before validation.

    This validator normalizes concept identifier fields by coercing the
    incoming value to `str`. It is applied before standard Pydantic
    validation for selected identifier fields.

    Args:
        value: Raw value provided for a concept-related field.

    Returns:
        The normalized string representation of the input value.
    """
    return str(value)
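The following is a minimal sketch of the `mode="before"` validator pattern, assuming Pydantic v2. `ConceptRef` and `coerce_to_str` are hypothetical names used only for illustration and are not part of the package. In Pydantic v2 a `str` field rejects an `int` input by default, which is why the coercion has to run before the built-in type check. Unlike the validator above, this sketch passes `None` through unchanged; that is a choice of the sketch.

```python
from typing import Any, Optional

from pydantic import BaseModel, field_validator


class ConceptRef(BaseModel):
    """Hypothetical model, used only to illustrate the validator pattern."""

    concept_id: Optional[str] = None

    @field_validator("concept_id", mode="before")
    @classmethod
    def coerce_to_str(cls, value: Any) -> Optional[str]:
        # Runs before Pydantic's own type check, so an int can be accepted.
        return str(value) if value is not None else None


# An integer identifier (as typically produced by a CSV reader) becomes a str.
from_int = ConceptRef(concept_id=45543186)

# An explicit None stays None instead of becoming the string "None".
from_none = ConceptRef(concept_id=None)
```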

model_post_init(*args, **kwargs)

Generate a deterministic expression identifier after initialization.

This hook runs after model initialization and populates expression_id using a deterministic hash derived from the main expression metadata fields.

Parameters:

Name Type Description Default
*args Any

Positional arguments passed by Pydantic's post-init lifecycle.

()
**kwargs Any

Keyword arguments passed by Pydantic's post-init lifecycle.

{}

Returns:

Type Description
None

None

Source code in aatm\data_models.py
def model_post_init(self, *args: Any, **kwargs: Any) -> None:
    """Generate a deterministic expression identifier after initialization.

    This hook runs after model initialization and populates
    `expression_id` using a deterministic hash derived from the main
    expression metadata fields.

    Args:
        *args: Positional arguments passed by Pydantic's post-init
            lifecycle.
        **kwargs: Keyword arguments passed by Pydantic's post-init
            lifecycle.

    Returns:
        None
    """
    self.expression_id = deterministic_id_from_strings(
        [
            self.expression,
            self.expression_concept_id,
            self.expression_origin.value if self.expression_origin else None,
            self.std_concept_id,
            self.std_concept_name,
            self.std_vocabulary_id.value if self.std_vocabulary_id else None,
            self.std_vocabulary_code,
        ]
    )
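The implementation of `deterministic_id_from_strings` is not shown on this page. Purely as an illustration of the idea, the sketch below derives a stable identifier from a list of optional strings by hashing an unambiguous JSON encoding of the list. The function name `sketch_deterministic_id` and the choice of SHA-256 are assumptions of this sketch, not the package's actual behavior.

```python
import hashlib
import json
from typing import Optional, Sequence


def sketch_deterministic_id(parts: Sequence[Optional[str]]) -> str:
    """Hypothetical stand-in for `deterministic_id_from_strings`."""
    # JSON keeps element boundaries and encodes None as null, so
    # ["ab", "c"] and ["a", "bc"] produce different payloads.
    payload = json.dumps(list(parts), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


fields = ["Cerebral infarction", "45543186", "standard_concept", None]

# The same inputs always give the same identifier.
first = sketch_deterministic_id(fields)
second = sketch_deterministic_id(list(fields))

# Changing any field gives a different identifier.
changed = sketch_deterministic_id(fields[:-1] + ["I63"])
```

Because the identifier depends only on field values, two `ExpressionMetadata` instances built from identical data can be deduplicated by comparing `expression_id`.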

to_dict()

Convert the model to a dictionary with enum values serialized.

This method returns the model data as a plain dictionary and replaces enum instances in expression_origin and std_vocabulary_id with their raw .value representations.

Returns:

Type Description
dict[str, Any]

A dictionary representation of the model with enum fields serialized as plain values.

Source code in aatm\data_models.py
def to_dict(self) -> dict[str, Any]:
    """Convert the model to a dictionary with enum values serialized.

    This method returns the model data as a plain dictionary and
    replaces enum instances in `expression_origin` and
    `std_vocabulary_id` with their raw `.value` representations.

    Returns:
        A dictionary representation of the model with enum fields
        serialized as plain values.
    """
    model_dict = self.model_dump()
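    # Both enum fields are Optional, so guard against None before
    # reading `.value`.
    model_dict["expression_origin"] = (
        model_dict["expression_origin"].value
        if model_dict["expression_origin"]
        else None
    )
    model_dict["std_vocabulary_id"] = (
        model_dict["std_vocabulary_id"].value
        if model_dict["std_vocabulary_id"]
        else None
    )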
    return model_dict

Translation

Bases: BaseModel

Represent a translated text value.

Attributes:

Name Type Description
text str

Translated text string.

Source code in aatm\data_models.py
class Translation(BaseModel):
    """Represent a translated text value.

    Attributes:
        text: Translated text string.
    """

    text: str

SourceConcept

Bases: BaseModel

Represent a source concept to be mapped.

This model stores the original source terminology fields and related validity metadata. Extra fields are ignored to allow loading from broader tabular inputs.

Attributes:

Name Type Description
source_code Optional[str]

Code of the source concept.

source_concept_id Optional[str]

Identifier of the source concept.

source_vocabulary_id Optional[str]

Vocabulary identifier of the source concept.

source_code_description Optional[str]

Human-readable description of the source concept.

valid_start_date Optional[str]

Start date of validity in YYYY-MM-DD format.

valid_end_date Optional[str]

End date of validity in YYYY-MM-DD format.

invalid_reason Optional[str]

Reason why the source concept is invalid, when applicable.

Source code in aatm\data_models.py
class SourceConcept(BaseModel):
    """Represent a source concept to be mapped.

    This model stores the original source terminology fields and related
    validity metadata. Extra fields are ignored to allow loading from
    broader tabular inputs.

    Attributes:
        source_code: Code of the source concept.
        source_concept_id: Identifier of the source concept.
        source_vocabulary_id: Vocabulary identifier of the source
            concept.
        source_code_description: Human-readable description of the
            source concept.
        valid_start_date: Start date of validity in `YYYY-MM-DD` format.
        valid_end_date: End date of validity in `YYYY-MM-DD` format.
        invalid_reason: Reason why the source concept is invalid, when
            applicable.
    """

    # ignore extra fields
    model_config = ConfigDict(extra="ignore")

    # fields
    source_code: Optional[str] = Field(None, examples=["I63"])
    source_concept_id: Optional[str] = Field(None, examples=["45543186"])
    source_vocabulary_id: Optional[str] = Field(None, examples=["ICD10"])
    source_code_description: Optional[str] = Field(
        None, examples=["Cerebral infarction"]
    )
    valid_start_date: Optional[str] = Field(None, examples=["1990-05-01"])
    valid_end_date: Optional[str] = Field(None, examples=["2099-12-31"])
    invalid_reason: Optional[str] = Field(None, examples=[None])

    @field_validator(
        "source_code", "source_concept_id", "source_vocabulary_id", mode="before"
    )
    @classmethod
    def validate_strings(cls, value: Any) -> str:
        """Convert source identifier fields to strings before validation.

        This validator coerces selected source concept fields to `str`
        before standard Pydantic validation is applied.

        Args:
            value: Raw value provided for a source identifier field.

        Returns:
            The normalized string representation of the input value.
        """
        return str(value) if value is not None else value

    @field_validator("valid_start_date", "valid_end_date", mode="before")
    @classmethod
    def validate_yyyy_mm_dd(cls, v: str) -> str:
        """Validate that a date string follows the `YYYY-MM-DD` format.

        This validator accepts `None` and empty strings unchanged and
        otherwise checks that the provided value is a string matching the
        expected date format.

        Args:
            v: Raw date value to validate.

        Returns:
            The validated date string.

        Raises:
            TypeError: If the provided value is not a string.
            ValueError: If the string is not empty and does not match the
                `YYYY-MM-DD` format.
        """
        if v is None:
            return v

        if not isinstance(v, str):
            raise TypeError("date must be a string in YYYY-MM-DD format")

        elif v == "":
            return v

        try:
            # Strict format check
            datetime.strptime(v, "%Y-%m-%d")
        except ValueError:
            raise ValueError("date must be in YYYY-MM-DD format")

        return v

    @classmethod
    def from_csv(cls, path: str | Path) -> List["SourceConcept"]:
        """Load source concepts from a CSV file.

        This class method reads a CSV file into a pandas DataFrame, replaces
        missing values with empty strings, and constructs one
        `SourceConcept` instance per row.

        Args:
            path: Path to the CSV file containing source concept records.

        Returns:
            A list of `SourceConcept` instances loaded from the CSV file.
        """
        if isinstance(path, str):
            path = Path(path)
        df = pd.read_csv(path).fillna("")

        return [cls(**row) for row in df.to_dict("records")]

validate_strings(value) classmethod

Convert source identifier fields to strings before validation.

This validator coerces selected source concept fields to str before standard Pydantic validation is applied.

Parameters:

Name Type Description Default
value Any

Raw value provided for a source identifier field.

required

Returns:

Type Description
str

The normalized string representation of the input value.

Source code in aatm\data_models.py
@field_validator(
    "source_code", "source_concept_id", "source_vocabulary_id", mode="before"
)
@classmethod
def validate_strings(cls, value: Any) -> str:
    """Convert source identifier fields to strings before validation.

    This validator coerces selected source concept fields to `str`
    before standard Pydantic validation is applied.

    Args:
        value: Raw value provided for a source identifier field.

    Returns:
        The normalized string representation of the input value.
    """
    return str(value) if value is not None else value

validate_yyyy_mm_dd(v) classmethod

Validate that a date string follows the YYYY-MM-DD format.

This validator accepts None and empty strings unchanged and otherwise checks that the provided value is a string matching the expected date format.

Parameters:

Name Type Description Default
v str

Raw date value to validate.

required

Returns:

Type Description
str

The validated date string.

Raises:

Type Description
TypeError

If the provided value is not a string.

ValueError

If the string is not empty and does not match the YYYY-MM-DD format.

Source code in aatm\data_models.py
@field_validator("valid_start_date", "valid_end_date", mode="before")
@classmethod
def validate_yyyy_mm_dd(cls, v: str) -> str:
    """Validate that a date string follows the `YYYY-MM-DD` format.

    This validator accepts `None` and empty strings unchanged and
    otherwise checks that the provided value is a string matching the
    expected date format.

    Args:
        v: Raw date value to validate.

    Returns:
        The validated date string.

    Raises:
        TypeError: If the provided value is not a string.
        ValueError: If the string is not empty and does not match the
            `YYYY-MM-DD` format.
    """
    if v is None:
        return v

    if not isinstance(v, str):
        raise TypeError("date must be a string in YYYY-MM-DD format")

    elif v == "":
        return v

    try:
        # Strict format check
        datetime.strptime(v, "%Y-%m-%d")
    except ValueError:
        raise ValueError("date must be in YYYY-MM-DD format")

    return v
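One detail worth knowing about the `strptime` check used above: Python's `%m` and `%d` directives also accept values that are not zero-padded, so a string such as `2024-1-5` parses successfully. If zero-padding must be enforced, a pattern check can be combined with `strptime`. The helper `is_strict_yyyy_mm_dd` below is a hypothetical sketch and not part of the package.

```python
import re
from datetime import datetime

# Exactly four, two, and two ASCII digits separated by hyphens.
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_strict_yyyy_mm_dd(value: str) -> bool:
    """Hypothetical helper: require zero-padding and a real calendar date."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        # strptime rejects impossible dates such as February 30.
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


# strptime alone parses a date whose month and day are not zero-padded.
parsed_unpadded = datetime.strptime("2024-1-5", "%Y-%m-%d")

padded_ok = is_strict_yyyy_mm_dd("1990-05-01")
unpadded_ok = is_strict_yyyy_mm_dd("2024-1-5")
impossible_ok = is_strict_yyyy_mm_dd("2023-02-30")
```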

from_csv(path) classmethod

Load source concepts from a CSV file.

This class method reads a CSV file into a pandas DataFrame, replaces missing values with empty strings, and constructs one SourceConcept instance per row.

Parameters:

Name Type Description Default
path str | Path

Path to the CSV file containing source concept records.

required

Returns:

Type Description
List[SourceConcept]

A list of SourceConcept instances loaded from the CSV file.

Source code in aatm\data_models.py
@classmethod
def from_csv(cls, path: str | Path) -> List["SourceConcept"]:
    """Load source concepts from a CSV file.

    This class method reads a CSV file into a pandas DataFrame, replaces
    missing values with empty strings, and constructs one
    `SourceConcept` instance per row.

    Args:
        path: Path to the CSV file containing source concept records.

    Returns:
        A list of `SourceConcept` instances loaded from the CSV file.
    """
    if isinstance(path, str):
        path = Path(path)
    df = pd.read_csv(path).fillna("")

    return [cls(**row) for row in df.to_dict("records")]
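To show what the rows look like before they reach the model, the sketch below repeats the same loading steps on an in-memory CSV. The sample data is made up for illustration. It also shows why the string-coercing validators matter: pandas infers a numeric dtype for identifier columns, so the raw value is not a `str`.

```python
import io

import pandas as pd

# Made-up sample; the second row has no description.
csv_text = (
    "source_code,source_concept_id,source_vocabulary_id,source_code_description\n"
    "I63,45543186,ICD10,Cerebral infarction\n"
    "I64,45543187,ICD10,\n"
)

# Same loading steps as from_csv, but reading from an in-memory buffer.
df = pd.read_csv(io.StringIO(csv_text)).fillna("")
rows = df.to_dict("records")

# The identifier column is parsed as a number, not as text.
raw_id = rows[0]["source_concept_id"]
coerced_id = str(raw_id)

# The missing description was replaced by an empty string.
missing_description = rows[1]["source_code_description"]
```

Numeric inference also drops leading zeros from codes such as `007`. Passing `dtype=str` to `pd.read_csv` is one way to keep such codes intact, at the cost of reading every column as text.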

MappedSourceConcept

Bases: SourceConcept

Represent a source concept together with its mapped target concept.

This model extends SourceConcept by including the selected target standardized concept and related vocabulary metadata.

Attributes:

Name Type Description
target_concept_id Optional[str]

Identifier of the mapped target concept.

target_vocabulary_id Optional[StandardVocabulary]

Standard vocabulary of the mapped target concept.

target_vocabulary_code Optional[str]

Code of the mapped target concept in the target vocabulary.

domain_id Optional[str]

Domain of the mapped target concept.

confidence_score Optional[float]

Confidence score of the mapping.

source_code_description_original Optional[str]

Source code description before translation.

Source code in aatm\data_models.py
class MappedSourceConcept(SourceConcept):
    """Represent a source concept together with its mapped target concept.

    This model extends `SourceConcept` by including the selected target
    standardized concept and related vocabulary metadata.

    Attributes:
        target_concept_id: Identifier of the mapped target concept.
        target_vocabulary_id: Standard vocabulary of the mapped target
            concept.
        target_vocabulary_code: Code of the mapped target concept in the
            target vocabulary.
        domain_id: Domain of the mapped target concept.
        confidence_score: Confidence score of the mapping.
        source_code_description_original: Source code description before translation.
    """

    target_concept_id: Optional[str]
    target_vocabulary_id: Optional[StandardVocabulary]
    target_vocabulary_code: Optional[str]
    domain_id: Optional[str]
    confidence_score: Optional[float]
    source_code_description_original: Optional[str]

    @field_validator(
        "source_code",
        "source_concept_id",
        "source_vocabulary_id",
        "target_concept_id",
        "target_vocabulary_id",
        mode="before",
    )
    @classmethod
    def validate_strings(cls, value: Any) -> str:
        """Convert selected source and target fields to strings before validation.

        This validator coerces selected identifier and vocabulary fields to
        `str` when a value is provided. `None` values are preserved.

        Args:
            value: Raw value provided for a validated field.

        Returns:
            The normalized string representation of the input value, or
            `None` when the input value is `None`.
        """
        if value is not None:
            return str(value)

    @classmethod
    def from_csv(cls, path: str | Path) -> List["MappedSourceConcept"]:
        """Load mapped source concepts from a CSV file.

        This class method reads a CSV file into a pandas DataFrame, replaces
        missing values with empty strings, and constructs one
        `MappedSourceConcept` instance per row.

        Args:
            path: Path to the CSV file containing mapped source concept
                records.

        Returns:
            A list of `MappedSourceConcept` instances loaded from the CSV
            file.
        """
        if isinstance(path, str):
            path = Path(path)
        df = pd.read_csv(path).fillna("")

        return [cls(**row) for row in df.to_dict("records")]

    @classmethod
    def from_selector_results(
        cls,
        source_concepts: List[SourceConcept],
        results: "SelectorResults",
        translated_source_code_descriptions: Optional[List[Translation]] = None,
    ) -> List["MappedSourceConcept"]:
        """Build mapped source concepts from source concepts and selector results.

        This class method combines each `SourceConcept` with its
        corresponding selected result and produces a list of
        `MappedSourceConcept` instances containing both source and mapped
        target fields.

        Args:
            source_concepts: Source concepts to be combined with mapping
                selections.
            results: Selection results containing the chosen standardized
                concept for each source concept.
            translated_source_code_descriptions: Optional list of translations for the source code descriptions.

        Returns:
            A list of `MappedSourceConcept` instances built from the source
            concepts and selector results.
        """
        mapped_source_concepts = []
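        # zip_longest needs an iterable, so fall back to an empty list when
        # no translations are provided.
        for source_concept, selected_result, translation in itertools.zip_longest(
            source_concepts,
            results.results,
            translated_source_code_descriptions or [],
        ):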
            # Add type annotations
            source_concept: SourceConcept
            selected_result: SelectedExpressionMetadata
            translation: Optional[Translation]

            # Create MappedSourceConcept
            mapped_source_concepts.append(
                cls(
                    source_code=source_concept.source_code,
                    source_concept_id=source_concept.source_concept_id,
                    source_vocabulary_id=source_concept.source_vocabulary_id,
                    source_code_description=translation.text
                    if translation
                    else source_concept.source_code_description,
                    target_concept_id=selected_result.std_concept_id,
                    target_vocabulary_id=selected_result.std_vocabulary_id.value
                    if selected_result.std_vocabulary_id
                    else None,
                    domain_id=selected_result.std_domain_id,
                    valid_start_date=source_concept.valid_start_date,
                    valid_end_date=source_concept.valid_end_date,
                    invalid_reason=source_concept.invalid_reason,
                    target_vocabulary_code=selected_result.std_vocabulary_code,
                    confidence_score=1 - selected_result.distance,
                    source_code_description_original=source_concept.source_code_description
                    if translation
                    else None,
                )
            )

        return mapped_source_concepts

    def to_dict(self) -> dict[str, Any]:
        """Convert the model to a dictionary with enum values serialized.

        This method returns the model data as a plain dictionary and
        replaces `target_vocabulary_id` with its raw `.value`
        representation when present.

        Returns:
            A dictionary representation of the model with enum fields
            serialized as plain values.
        """
        model_dict = self.model_dump()
        model_dict["target_vocabulary_id"] = (
            model_dict["target_vocabulary_id"].value
            if model_dict["target_vocabulary_id"]
            else None
        )
        return model_dict

validate_strings(value) classmethod

Convert selected source and target fields to strings before validation.

This validator coerces selected identifier and vocabulary fields to str when a value is provided. None values are preserved.

Parameters:

Name Type Description Default
value Any

Raw value provided for a validated field.

required

Returns:

Type Description
str

The normalized string representation of the input value, or None when the input value is None.

Source code in aatm\data_models.py
@field_validator(
    "source_code",
    "source_concept_id",
    "source_vocabulary_id",
    "target_concept_id",
    "target_vocabulary_id",
    mode="before",
)
@classmethod
def validate_strings(cls, value: Any) -> str:
    """Convert selected source and target fields to strings before validation.

    This validator coerces selected identifier and vocabulary fields to
    `str` when a value is provided. `None` values are preserved.

    Args:
        value: Raw value provided for a validated field.

    Returns:
        The normalized string representation of the input value, or
        `None` when the input value is `None`.
    """
    if value is not None:
        return str(value)

from_csv(path) classmethod

Load mapped source concepts from a CSV file.

This class method reads a CSV file into a pandas DataFrame, replaces missing values with empty strings, and constructs one MappedSourceConcept instance per row.

Parameters:

Name Type Description Default
path str | Path

Path to the CSV file containing mapped source concept records.

required

Returns:

Type Description
List[MappedSourceConcept]

A list of MappedSourceConcept instances loaded from the CSV file.

Source code in aatm\data_models.py
@classmethod
def from_csv(cls, path: str | Path) -> List["MappedSourceConcept"]:
    """Load mapped source concepts from a CSV file.

    This class method reads a CSV file into a pandas DataFrame, replaces
    missing values with empty strings, and constructs one
    `MappedSourceConcept` instance per row.

    Args:
        path: Path to the CSV file containing mapped source concept
            records.

    Returns:
        A list of `MappedSourceConcept` instances loaded from the CSV
        file.
    """
    if isinstance(path, str):
        path = Path(path)
    df = pd.read_csv(path).fillna("")

    return [cls(**row) for row in df.to_dict("records")]

from_selector_results(source_concepts, results, translated_source_code_descriptions=None) classmethod

Build mapped source concepts from source concepts and selector results.

This class method combines each SourceConcept with its corresponding selected result and produces a list of MappedSourceConcept instances containing both source and mapped target fields.

Parameters:

Name Type Description Default
source_concepts List[SourceConcept]

Source concepts to be combined with mapping selections.

required
results SelectorResults

Selection results containing the chosen standardized concept for each source concept.

required
translated_source_code_descriptions Optional[List[Translation]]

Optional list of translations for the source code descriptions.

None

Returns:

Type Description
List[MappedSourceConcept]

A list of MappedSourceConcept instances built from the source concepts and selector results.

Source code in aatm\data_models.py
@classmethod
def from_selector_results(
    cls,
    source_concepts: List[SourceConcept],
    results: "SelectorResults",
    translated_source_code_descriptions: Optional[List[Translation]] = None,
) -> List["MappedSourceConcept"]:
    """Build mapped source concepts from source concepts and selector results.

    This class method combines each `SourceConcept` with its
    corresponding selected result and produces a list of
    `MappedSourceConcept` instances containing both source and mapped
    target fields.

    Args:
        source_concepts: Source concepts to be combined with mapping
            selections.
        results: Selection results containing the chosen standardized
            concept for each source concept.
        translated_source_code_descriptions: Optional list of translations for the source code descriptions.

    Returns:
        A list of `MappedSourceConcept` instances built from the source
        concepts and selector results.
    """
    mapped_source_concepts = []
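    # zip_longest needs an iterable, so fall back to an empty list when
    # no translations are provided.
    for source_concept, selected_result, translation in itertools.zip_longest(
        source_concepts,
        results.results,
        translated_source_code_descriptions or [],
    ):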
        # Add type annotations
        source_concept: SourceConcept
        selected_result: SelectedExpressionMetadata
        translation: Optional[Translation]

        # Create MappedSourceConcept
        mapped_source_concepts.append(
            cls(
                source_code=source_concept.source_code,
                source_concept_id=source_concept.source_concept_id,
                source_vocabulary_id=source_concept.source_vocabulary_id,
                source_code_description=translation.text
                if translation
                else source_concept.source_code_description,
                target_concept_id=selected_result.std_concept_id,
                target_vocabulary_id=selected_result.std_vocabulary_id.value
                if selected_result.std_vocabulary_id
                else None,
                domain_id=selected_result.std_domain_id,
                valid_start_date=source_concept.valid_start_date,
                valid_end_date=source_concept.valid_end_date,
                invalid_reason=source_concept.invalid_reason,
                target_vocabulary_code=selected_result.std_vocabulary_code,
                confidence_score=1 - selected_result.distance,
                source_code_description_original=source_concept.source_code_description
                if translation
                else None,
            )
        )

    return mapped_source_concepts
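
The pairing logic above relies on `itertools.zip_longest`, which pads the shorter inputs with `None` instead of stopping early. The following standalone sketch uses plain strings in place of `SourceConcept`, `SelectedExpressionMetadata`, and `Translation` objects (the sample values and the `rows` list are made up for illustration) to show how a missing translation falls back to the original description.

```python
import itertools

# Illustrative stand-ins for the real model objects (hypothetical values).
source_descriptions = ["Fieber", "Husten", "Kopfschmerz"]
selected_concept_ids = ["C1", "C2", "C3"]
translations = ["Fever", "Cough"]  # the third translation is missing

rows = []
for description, concept_id, translation in itertools.zip_longest(
    source_descriptions, selected_concept_ids, translations
):
    rows.append(
        {
            # Prefer the translated text when one is available.
            "source_code_description": translation if translation else description,
            "target_concept_id": concept_id,
            # Keep the untranslated text only when a translation replaced it.
            "source_code_description_original": description if translation else None,
        }
    )

for row in rows:
    print(row)
```

Because `zip_longest` pads rather than truncates, a `results` list shorter than `source_concepts` would yield `None` for `selected_result`, so callers may want to check that both lists have the same length first.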

to_dict()

Convert the model to a dictionary with enum values serialized.

This method returns the model data as a plain dictionary and replaces target_vocabulary_id with its raw .value representation when present.

Returns:

Type Description
dict[str, Any]

A dictionary representation of the model with enum fields serialized as plain values.

Source code in aatm\data_models.py
def to_dict(self) -> dict[str, Any]:
    """Convert the model to a dictionary with enum values serialized.

    This method returns the model data as a plain dictionary and
    replaces `target_vocabulary_id` with its raw `.value`
    representation when present.

    Returns:
        A dictionary representation of the model with enum fields
        serialized as plain values.
    """
    model_dict = self.model_dump()
    model_dict["target_vocabulary_id"] = (
        model_dict["target_vocabulary_id"].value
        if model_dict["target_vocabulary_id"]
        else None
    )
    return model_dict
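
As a rough illustration of why the `.value` replacement matters, the sketch below uses only the standard library: a plain `Enum` member cannot be encoded by `json.dumps`, while its `.value` can. `Vocabulary`, `serialize_enum_field`, and the record contents are hypothetical stand-ins, not part of the AATM package.

```python
import json
from enum import Enum
from typing import Any


class Vocabulary(Enum):
    """Hypothetical stand-in for a vocabulary enum."""

    EXAMPLE = "example_vocabulary"


def serialize_enum_field(record: dict[str, Any], field: str) -> dict[str, Any]:
    """Return a copy of `record` with the enum in `field` replaced by its value."""
    serialized = dict(record)
    member = serialized.get(field)
    serialized[field] = member.value if member is not None else None
    return serialized


record = {"target_concept_id": "C1", "target_vocabulary_id": Vocabulary.EXAMPLE}

try:
    json.dumps(record)
    raw_is_serializable = True
except TypeError:
    # Plain Enum members are not JSON serializable.
    raw_is_serializable = False

plain = serialize_enum_field(record, "target_vocabulary_id")
print(raw_is_serializable, json.dumps(plain))
```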

RetrievedExpressionMetadata

Bases: ExpressionMetadata

Represent retrieved expression metadata with ranking information.

This model extends ExpressionMetadata with fields produced during retrieval and reranking steps. Extra fields are ignored to support flexible loading from retrieval outputs.

Attributes:

Name Type Description
distance Optional[float]

Retrieval distance or similarity-derived distance for the expression.

rerank_score Optional[float]

Score assigned during reranking.

Source code in aatm\data_models.py
class RetrievedExpressionMetadata(ExpressionMetadata):
    """Represent retrieved expression metadata with ranking information.

    This model extends `ExpressionMetadata` with fields produced during
    retrieval and reranking steps. Extra fields are ignored to support
    flexible loading from retrieval outputs.

    Attributes:
        distance: Retrieval distance or similarity-derived distance for
            the expression.
        rerank_score: Score assigned during reranking.
    """

    # ignore extra fields
    model_config = ConfigDict(extra="ignore")

    # fields
    distance: Optional[float] = None
    rerank_score: Optional[float] = None

    def to_prompt_object(self, *args: Any, **kwargs: Any) -> Dict[str, str]:
        """Convert retrieved metadata to a prompt-friendly dictionary.

        This method returns a reduced dictionary representation containing
        the main standardized concept fields needed for prompt construction
        in downstream components.

        Args:
            *args: Additional positional arguments accepted for interface
                compatibility.
            **kwargs: Additional keyword arguments accepted for interface
                compatibility.

        Returns:
            A dictionary containing the expression identifier, expression
            text, standard concept name, standard vocabulary identifier,
            standard vocabulary code, and standard domain identifier.
        """
        return {
            "expression_id": self.expression_id,
            "expression": self.expression,
            "standard_concept_name": self.std_concept_name,
            "standard_vocabulary_id": self.std_vocabulary_id.value,
            "standard_vocabulary_code": self.std_vocabulary_code,
            "standard_domain_id": self.std_domain_id,
        }

to_prompt_object(*args, **kwargs)

Convert retrieved metadata to a prompt-friendly dictionary.

This method returns a reduced dictionary representation containing the main standardized concept fields needed for prompt construction in downstream components.

Parameters:

Name Type Description Default
*args Any

Additional positional arguments accepted for interface compatibility.

()
**kwargs Any

Additional keyword arguments accepted for interface compatibility.

{}

Returns:

Type Description
Dict[str, str]

A dictionary containing the expression identifier, expression text, standard concept name, standard vocabulary identifier, standard vocabulary code, and standard domain identifier.

Source code in aatm\data_models.py
def to_prompt_object(self, *args: Any, **kwargs: Any) -> Dict[str, str]:
    """Convert retrieved metadata to a prompt-friendly dictionary.

    This method returns a reduced dictionary representation containing
    the main standardized concept fields needed for prompt construction
    in downstream components.

    Args:
        *args: Additional positional arguments accepted for interface
            compatibility.
        **kwargs: Additional keyword arguments accepted for interface
            compatibility.

    Returns:
        A dictionary containing the expression identifier, expression
        text, standard concept name, standard vocabulary identifier,
        standard vocabulary code, and standard domain identifier.
    """
    return {
        "expression_id": self.expression_id,
        "expression": self.expression,
        "standard_concept_name": self.std_concept_name,
        "standard_vocabulary_id": self.std_vocabulary_id.value,
        "standard_vocabulary_code": self.std_vocabulary_code,
        "standard_domain_id": self.std_domain_id,
    }

SelectedExpressionMetadata

Bases: RetrievedExpressionMetadata

Represent the expression selected from a retrieval result list.

This model extends RetrievedExpressionMetadata by recording the index of the selected item in the original candidate list.

Attributes:

Name Type Description
result_list_index int

Index of the selected result in the candidate list.

Source code in aatm\data_models.py
class SelectedExpressionMetadata(RetrievedExpressionMetadata):
    """Represent the expression selected from a retrieval result list.

    This model extends `RetrievedExpressionMetadata` by recording the
    index of the selected item in the original candidate list.

    Attributes:
        result_list_index: Index of the selected result in the candidate
            list.
    """

    result_list_index: int

EmptySelectionMetadata

Bases: BaseModel

Represent an empty selection result.

This placeholder model is used when no expression is selected. All fields are explicitly set to None to preserve the expected schema.

Attributes:

Name Type Description
expression_id None

Always None.

expression None

Always None.

expression_concept_id None

Always None.

expression_origin None

Always None.

std_concept_id None

Always None.

std_concept_name None

Always None.

std_vocabulary_id None

Always None.

std_vocabulary_code None

Always None.

std_domain_id None

Always None.

result_list_index None

Always None.

distance None

Always None.

Source code in aatm\data_models.py
class EmptySelectionMetadata(BaseModel):
    """Represent an empty selection result.

    This placeholder model is used when no expression is selected. All
    fields are explicitly set to `None` to preserve the expected schema.

    Attributes:
        expression_id: Always `None`.
        expression: Always `None`.
        expression_concept_id: Always `None`.
        expression_origin: Always `None`.
        std_concept_id: Always `None`.
        std_concept_name: Always `None`.
        std_vocabulary_id: Always `None`.
        std_vocabulary_code: Always `None`.
        std_domain_id: Always `None`.
        result_list_index: Always `None`.
        distance: Always `None`.
    """

    expression_id: None = None
    expression: None = None
    expression_concept_id: None = None
    expression_origin: None = None
    std_concept_id: None = None
    std_concept_name: None = None
    std_vocabulary_id: None = None
    std_vocabulary_code: None = None
    std_domain_id: None = None
    result_list_index: None = None
    distance: None = None

RetrieverResults

Bases: BaseModel

Represent the output of a retrieval step.

Attributes:

Name Type Description
results List[List[RetrievedExpressionMetadata]]

Nested list of retrieved expression metadata grouped by query.

queries List[str]

List of query strings used in retrieval.

Source code in aatm\data_models.py
class RetrieverResults(BaseModel):
    """Represent the output of a retrieval step.

    Attributes:
        results: Nested list of retrieved expression metadata grouped by
            query.
        queries: List of query strings used in retrieval.
    """

    results: List[List[RetrievedExpressionMetadata]]
    queries: List[str]

SelectorResults

Bases: BaseModel

Represent the output of a selection step.

Attributes:

Name Type Description
results List[SelectedExpressionMetadata]

List of selected expression metadata, one per query or source item.

queries List[str]

List of query strings associated with the selections.

Source code in aatm\data_models.py
class SelectorResults(BaseModel):
    """Represent the output of a selection step.

    Attributes:
        results: List of selected expression metadata, one per query or
            source item.
        queries: List of query strings associated with the selections.
    """

    results: List[SelectedExpressionMetadata]
    queries: List[str]

SelectedResult

Bases: BaseModel

Represent a minimal selected result reference.

Attributes:

Name Type Description
expression_id Optional[str]

Identifier of the selected expression.

Source code in aatm\data_models.py
class SelectedResult(BaseModel):
    """Represent a minimal selected result reference.

    Attributes:
        expression_id: Identifier of the selected expression.
    """

    expression_id: Optional[str] = None

TerminologyMappingTask

Bases: BaseModel

Represent the configuration for a terminology mapping task.

This model defines the inputs and execution parameters required to run a terminology mapping workflow. It also provides convenience methods for loading task definitions from JSON or YAML files and for saving them back to disk.

Attributes:

Name Type Description
input_file Path

Path to the input file containing source concepts.

output_dir Optional[Path]

Directory where mapping outputs will be written.

translator_id Optional[str]

Identifier of the translator component to use.

retriever_id Optional[str]

Identifier of the retriever component to use.

selector_id Optional[str]

Identifier of the selector component to use.

reranker_id Optional[str]

Identifier of the reranker component to use.

batch_size Optional[int]

Batch size for processing source concepts.

rate_limit Optional[int]

Rate limit applied during processing.

column_mapping Optional[dict]

Optional mapping describing how input columns correspond to expected fields.

limit_to Optional[int]

Optional limit on the number of source concepts to process.

Source code in aatm\data_models.py
class TerminologyMappingTask(BaseModel):
    """Represent the configuration for a terminology mapping task.

    This model defines the inputs and execution parameters required to
    run a terminology mapping workflow. It also provides convenience
    methods for loading task definitions from JSON or YAML files and for
    saving them back to disk.

    Attributes:
        input_file: Path to the input file containing source concepts.
        output_dir: Directory where mapping outputs will be written.
        translator_id: Identifier of the translator component to use.
        retriever_id: Identifier of the retriever component to use.
        selector_id: Identifier of the selector component to use.
        reranker_id: Identifier of the reranker component to use.
        batch_size: Batch size for processing source concepts.
        rate_limit: Rate limit applied during processing.
        column_mapping: Optional mapping describing how input columns
            correspond to expected fields.
        limit_to: Optional limit on the number of source concepts to
            process.
    """

    input_file: Path
    output_dir: Optional[Path] = None
    translator_id: Optional[str] = None
    retriever_id: Optional[str] = None
    selector_id: Optional[str] = None
    reranker_id: Optional[str] = None
    batch_size: Optional[int] = None
    rate_limit: Optional[int] = None
    column_mapping: Optional[dict] = None
    limit_to: Optional[int] = None

    @field_validator("input_file", "output_dir", mode="before")
    @classmethod
    def validate_paths(cls, value: Any) -> Optional[Path]:
        """Convert path-like input values to `Path` objects before validation.

        This validator normalizes the `input_file` and `output_dir` fields by
        converting incoming string values to `Path` instances. `None` values
        and existing `Path` objects are returned unchanged.

        Args:
            value: Raw value provided for a path field.

        Returns:
            A normalized `Path` instance, or `None` if no value was provided.
        """
        # Pass None through so an explicit `output_dir=None` does not reach
        # `Path(None)`, which raises TypeError.
        if value is None or isinstance(value, Path):
            return value

        return Path(value)

    @classmethod
    def from_json(cls, path: str | Path) -> "TerminologyMappingTask":
        """Load a terminology mapping task from a JSON configuration file.

        Args:
            path: Path to a JSON file containing the task configuration.

        Returns:
            A `TerminologyMappingTask` instance created from the JSON file.
        """
        if isinstance(path, str):
            path = Path(path)
        return cls(**json.loads(path.read_text()))

    @classmethod
    def from_yaml(cls, path: str | Path) -> "TerminologyMappingTask":
        """Load a terminology mapping task from a YAML configuration file.

        Args:
            path: Path to a YAML file containing the task configuration.

        Returns:
            A `TerminologyMappingTask` instance created from the YAML file.
        """
        if isinstance(path, str):
            path = Path(path)
        return cls(**yaml.safe_load(path.read_text()))

    @classmethod
    def from_config_file(cls, path: str | Path) -> "TerminologyMappingTask":
        """Load a terminology mapping task from a supported config file.

        This method dispatches to `from_json` or `from_yaml` based on the
        file extension.

        Args:
            path: Path to a supported configuration file.

        Returns:
            A `TerminologyMappingTask` instance created from the provided
            configuration file.

        Raises:
            ValueError: If the file extension is not supported.
        """
        if isinstance(path, str):
            path = Path(path)
        if path.suffix == ".json":
            return cls.from_json(path)
        elif path.suffix == ".yaml":
            return cls.from_yaml(path)
        else:
            raise ValueError(f"Unsupported config file format: '{path.suffix}'")

    def save_to_disk(self, path: str | Path) -> None:
        """Save the task configuration to a JSON or YAML file.

        This method serializes the current model and writes it to disk based
        on the output file extension.

        Args:
            path: Destination path for the serialized configuration file.

        Returns:
            None

        Raises:
            ValueError: If the file extension is not supported.
        """
        if isinstance(path, str):
            path = Path(path)
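        # mode="json" converts Path values to strings so that both the JSON
        # and the YAML serializer can handle them.
        if path.suffix == ".json":
            path.write_text(json.dumps(self.model_dump(mode="json")))
        elif path.suffix == ".yaml":
            path.write_text(yaml.safe_dump(self.model_dump(mode="json")))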
        else:
            raise ValueError(f"Unsupported config file format: '{path.suffix}'")

validate_paths(value)

Convert path-like input values to Path objects before validation.

This validator normalizes the input_file and output_dir fields by converting incoming string values to Path instances. None values and existing Path objects are returned unchanged.

Parameters:

Name Type Description Default
value Any

Raw value provided for a path field.

required

Returns:

Type Description
Optional[Path]

A normalized Path instance, or None if no value was provided.

Source code in aatm\data_models.py
@field_validator("input_file", "output_dir", mode="before")
@classmethod
def validate_paths(cls, value: Any) -> Optional[Path]:
    """Convert path-like input values to `Path` objects before validation.

    This validator normalizes the `input_file` and `output_dir` fields by
    converting incoming string values to `Path` instances. `None` values
    and existing `Path` objects are returned unchanged.

    Args:
        value: Raw value provided for a path field.

    Returns:
        A normalized `Path` instance, or `None` if no value was provided.
    """
    # Pass None through so an explicit `output_dir=None` does not reach
    # `Path(None)`, which raises TypeError.
    if value is None or isinstance(value, Path):
        return value

    return Path(value)
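
The normalization performed by this validator can be tried out without Pydantic. The standalone sketch below (the function name `normalize_path` is hypothetical) converts strings to `Path`, leaves existing `Path` objects untouched, and passes `None` through so that an optional field such as `output_dir` can stay unset.

```python
from pathlib import Path
from typing import Any, Optional


def normalize_path(value: Any) -> Optional[Path]:
    """Coerce a raw path value to `Path`, leaving `None` and `Path` as-is."""
    if value is None or isinstance(value, Path):
        return value
    # `Path(None)` would raise TypeError, hence the explicit check above.
    return Path(value)


existing = Path("concepts.csv")
print(normalize_path("concepts.csv"))
print(normalize_path(existing) is existing)
print(normalize_path(None))
```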

from_json(path) classmethod

Load a terminology mapping task from a JSON configuration file.

Parameters:

Name Type Description Default
path str | Path

Path to a JSON file containing the task configuration.

required

Returns:

Type Description
TerminologyMappingTask

A TerminologyMappingTask instance created from the JSON file.

Source code in aatm\data_models.py
@classmethod
def from_json(cls, path: str | Path) -> "TerminologyMappingTask":
    """Load a terminology mapping task from a JSON configuration file.

    Args:
        path: Path to a JSON file containing the task configuration.

    Returns:
        A `TerminologyMappingTask` instance created from the JSON file.
    """
    if isinstance(path, str):
        path = Path(path)
    return cls(**json.loads(path.read_text()))

from_yaml(path) classmethod

Load a terminology mapping task from a YAML configuration file.

Parameters:

Name Type Description Default
path str | Path

Path to a YAML file containing the task configuration.

required

Returns:

Type Description
TerminologyMappingTask

A TerminologyMappingTask instance created from the YAML file.

Source code in aatm\data_models.py
@classmethod
def from_yaml(cls, path: str | Path) -> "TerminologyMappingTask":
    """Load a terminology mapping task from a YAML configuration file.

    Args:
        path: Path to a YAML file containing the task configuration.

    Returns:
        A `TerminologyMappingTask` instance created from the YAML file.
    """
    if isinstance(path, str):
        path = Path(path)
    return cls(**yaml.safe_load(path.read_text()))

from_config_file(path) classmethod

Load a terminology mapping task from a supported config file.

This method dispatches to from_json or from_yaml based on the file extension.

Parameters:

Name Type Description Default
path str | Path

Path to a supported configuration file.

required

Returns:

Type Description
TerminologyMappingTask

A TerminologyMappingTask instance created from the provided configuration file.

Raises:

Type Description
ValueError

If the file extension is not supported.

Source code in aatm\data_models.py
@classmethod
def from_config_file(cls, path: str | Path) -> "TerminologyMappingTask":
    """Load a terminology mapping task from a supported config file.

    This method dispatches to `from_json` or `from_yaml` based on the
    file extension.

    Args:
        path: Path to a supported configuration file.

    Returns:
        A `TerminologyMappingTask` instance created from the provided
        configuration file.

    Raises:
        ValueError: If the file extension is not supported.
    """
    if isinstance(path, str):
        path = Path(path)
    if path.suffix == ".json":
        return cls.from_json(path)
    elif path.suffix == ".yaml":
        return cls.from_yaml(path)
    else:
        raise ValueError(f"Unsupported config file format: '{path.suffix}'")
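
The dispatch compares `Path.suffix` against exact strings, so it is case-sensitive and does not recognize the alternative `.yml` extension. A small standalone sketch of the same check follows; the helper `detect_config_format` is hypothetical and returns a format name instead of loading the file.

```python
from pathlib import Path
from typing import Union


def detect_config_format(path: Union[str, Path]) -> str:
    """Mirror the suffix check used for dispatch (hypothetical helper)."""
    suffix = Path(path).suffix
    if suffix == ".json":
        return "json"
    if suffix == ".yaml":
        return "yaml"
    raise ValueError(f"Unsupported config file format: '{suffix}'")


print(detect_config_format("task.json"))
print(detect_config_format(Path("configs") / "task.yaml"))

# File names that the exact suffix comparison rejects.
rejected = []
for name in ("task.yml", "task.JSON", "task"):
    try:
        detect_config_format(name)
    except ValueError as error:
        rejected.append(name)
        print(error)
```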

save_to_disk(path)

Save the task configuration to a JSON or YAML file.

This method serializes the current model and writes it to disk based on the output file extension.

Parameters:

Name Type Description Default
path str | Path

Destination path for the serialized configuration file.

required

Returns:

Type Description
None

None

Raises:

Type Description
ValueError

If the file extension is not supported.

Source code in aatm\data_models.py
def save_to_disk(self, path: str | Path) -> None:
    """Save the task configuration to a JSON or YAML file.

    This method serializes the current model and writes it to disk based
    on the output file extension.

    Args:
        path: Destination path for the serialized configuration file.

    Returns:
        None

    Raises:
        ValueError: If the file extension is not supported.
    """
    if isinstance(path, str):
        path = Path(path)
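    # mode="json" converts Path values to strings so that both the JSON
    # and the YAML serializer can handle them.
    if path.suffix == ".json":
        path.write_text(json.dumps(self.model_dump(mode="json")))
    elif path.suffix == ".yaml":
        path.write_text(yaml.safe_dump(self.model_dump(mode="json")))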
    else:
        raise ValueError(f"Unsupported config file format: '{path.suffix}'")
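
In its default Python mode, Pydantic's `model_dump()` keeps `Path` fields as `Path` objects, which `json.dumps` cannot encode; `model_dump(mode="json")` converts them to strings. The stdlib-only sketch below reproduces the issue with a plain dictionary standing in for the dumped model (the values are made up) and shows the string conversion that makes it serializable.

```python
import json
from pathlib import Path

# Plain dictionary standing in for a dumped task configuration (made-up values).
config = {"input_file": Path("concepts.csv"), "output_dir": None, "batch_size": 32}

try:
    json.dumps(config)
    path_is_serializable = True
except TypeError:
    # The json module has no encoder for pathlib.Path objects.
    path_is_serializable = False

# Convert Path values to strings before encoding.
serializable = {
    key: str(value) if isinstance(value, Path) else value
    for key, value in config.items()
}
encoded = json.dumps(serializable)
decoded = json.loads(encoded)
print(path_is_serializable, decoded)
```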

deterministic_id_from_strings(strings, digest_size=16)

Generate a deterministic id from a list of strings.

Parameters:

Name Type Description Default
strings list[str]

A list of strings to generate the id from.

required
digest_size int

The size of the digest in bytes.

16

Returns:

Name Type Description
str str

A deterministic id as a hexadecimal string.

Source code in aatm\data_models.py
def deterministic_id_from_strings(strings: list[str], digest_size: int = 16) -> str:
    """
    Generate a deterministic id from a list of strings.

    Args:
        strings (list[str]): A list of strings to generate the id from.
        digest_size (int, optional): The size of the digest in bytes.

    Returns:
        str: A deterministic id as a hexadecimal string.
    """
    joined = "||".join(strings)
    return hashlib.blake2b(joined.encode("utf-8"), digest_size=digest_size).hexdigest()
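
A short usage sketch, with the function reproduced so the block runs on its own; the input strings are made up. It also shows a property that follows from joining with `"||"`: because the separator is not escaped, an input that already contains it can produce the same id as a differently split list.

```python
import hashlib


def deterministic_id_from_strings(strings: list[str], digest_size: int = 16) -> str:
    """Hash the "||"-joined strings with BLAKE2b and return the hex digest."""
    joined = "||".join(strings)
    return hashlib.blake2b(joined.encode("utf-8"), digest_size=digest_size).hexdigest()


first = deterministic_id_from_strings(["fever", "example_vocabulary"])
second = deterministic_id_from_strings(["fever", "example_vocabulary"])
short = deterministic_id_from_strings(["fever", "example_vocabulary"], digest_size=8)

# Same input, same id; the hex string has two characters per digest byte.
print(first == second, len(first), len(short))

# Caveat: the separator is not escaped, so these two inputs collide.
print(
    deterministic_id_from_strings(["a||b"])
    == deterministic_id_from_strings(["a", "b"])
)
```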