Local database utilities
Utilities for building and managing local terminology-mapping data assets.
This module provides helper functions for constructing the local resources required by the terminology-mapping workflow. These resources include a local SQLite vocabulary database, derived CSV datasets used for terminology mapping, and a local vector database for semantic retrieval.
The module also includes a simple rate-limiting helper for embedding generation workloads and interactive prompts to prevent accidental overwrites of existing artifacts.
rate_limiter(n_docs, rate_limit, next_allowed_time)
Apply a document-based rate limit and return the next allowed execution time.
This helper ensures that document processing respects a maximum throughput expressed in documents per minute. If the current time is earlier than the next permitted execution time, the function sleeps until processing is allowed. It then reserves time for the current batch and returns the updated timestamp for the next permitted request.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_docs | int | Number of documents that will be processed in the current batch. | required |
| rate_limit | int | Maximum allowed processing rate in documents per minute. | required |
| next_allowed_time | float | Monotonic timestamp representing when the next batch is allowed to start. | required |
Returns:
| Type | Description |
|---|---|
| float | The updated monotonic timestamp indicating when the next batch may be processed. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the provided rate limit is found to be invalid (for example, a non-positive value) in contexts where the caller validates it. |
Notes
This function uses time.monotonic() so that elapsed-time calculations are not affected by system clock changes.
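The behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual source: the name `rate_limiter_sketch`, the injectable `clock` and `sleep` parameters (added only so the logic can be exercised without real waiting), the explicit `ValueError` check, and the choice to reserve `n_docs * 60 / rate_limit` seconds starting from the later of "now" and `next_allowed_time` are all assumptions.

```python
import time


def rate_limiter_sketch(n_docs, rate_limit, next_allowed_time,
                        *, clock=time.monotonic, sleep=time.sleep):
    """Hypothetical sketch of a documents-per-minute rate limiter."""
    # Assumption: invalid limits are rejected here; the real helper may
    # leave this validation to its caller.
    if rate_limit <= 0:
        raise ValueError("rate_limit must be a positive number of documents per minute")

    now = clock()
    if now < next_allowed_time:
        # Too early: wait until the previously reserved window has passed.
        sleep(next_allowed_time - now)
        now = next_allowed_time

    # Reserve enough time for this batch at the configured throughput.
    seconds_for_batch = n_docs * 60.0 / rate_limit
    return now + seconds_for_batch
```

A caller would typically start with `next_allowed_time = time.monotonic()` and feed each returned value into the next call, so that consecutive batches are spaced according to their size.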
Source code in aatm\local_database_utils.py
build_local_sqlite_vocab_database(vocab_dir)
Build a local SQLite vocabulary database from OMOP vocabulary files.
This function reads vocabulary files from a directory, converts each file into a SQLite table, and stores the result in a local database file. If a database already exists, the user is prompted to either skip rebuilding it or overwrite the existing database.
CSV parsing behavior depends on the table being imported. The source_to_concept_map file is read as comma-separated, while the other vocabulary files are read as tab-separated.
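The per-table delimiter rule can be illustrated with a small standard-library sketch. It is not the module's implementation: the function names, the `*.csv` file pattern, the lowercase file stem used as the table name, and the use of `csv` plus `sqlite3` (rather than whatever parser the module actually uses) are assumptions made to keep the example self-contained.

```python
import csv
import sqlite3
from pathlib import Path

# Per the description above, only this table is comma-separated.
COMMA_SEPARATED_TABLES = {"source_to_concept_map"}


def delimiter_for(table_name: str) -> str:
    """Return the field delimiter to use for a given vocabulary table."""
    return "," if table_name.lower() in COMMA_SEPARATED_TABLES else "\t"


def load_vocab_files(vocab_dir: Path, db_path: Path) -> list[str]:
    """Hypothetical loader: one SQLite table per vocabulary file."""
    files = sorted(vocab_dir.glob("*.csv"))  # assumed file pattern
    if not files:
        raise FileNotFoundError(f"No vocabulary files found in {vocab_dir}")

    loaded = []
    conn = sqlite3.connect(db_path)
    try:
        for file in files:
            table = file.stem.lower()  # assumed table naming
            with file.open(newline="", encoding="utf-8") as fh:
                reader = csv.reader(fh, delimiter=delimiter_for(table))
                header = next(reader)
                columns = ", ".join(f'"{name}"' for name in header)
                marks = ", ".join("?" for _ in header)
                # Replace any existing table, then stream the rows in.
                conn.execute(f'DROP TABLE IF EXISTS "{table}"')
                conn.execute(f'CREATE TABLE "{table}" ({columns})')
                conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
            loaded.append(table)
        conn.commit()
    finally:
        conn.close()
    return loaded
```

Because the columns are created without declared types, every value is stored as text in this sketch; a real import would likely apply proper column types.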
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| vocab_dir | Path | Path to the directory containing the vocabulary files to be imported into the local SQLite database. | required |
Returns:
| Type | Description |
|---|---|
| None | None. |
Side Effects
Creates or overwrites the local SQLite database at .aatm/omop.db.
Prompts the user for confirmation when an existing database is found.
Writes one table per vocabulary file into the database.
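The confirmation step mentioned above could look roughly like the sketch below. The function name `confirm_overwrite`, the prompt wording, the accepted answers, and the injectable `ask` callable are all hypothetical; the module's real prompt may differ.

```python
from pathlib import Path


def confirm_overwrite(path, ask=input) -> bool:
    """Return True if the caller may (re)create the artifact at `path`.

    Hypothetical helper: nothing is asked when the path does not exist.
    """
    if not Path(path).exists():
        return True

    while True:
        answer = ask(f"{path} already exists. [s]kip or [o]verwrite? ").strip().lower()
        if answer in {"o", "overwrite"}:
            return True
        if answer in {"s", "skip", ""}:
            # Treat an empty answer as the safe choice: keep the existing file.
            return False
        # Any other input: ask again.
```

Defaulting to "skip" on empty input is a deliberate choice in this sketch, since it keeps an accidental Enter keypress from destroying an existing database.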
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the expected vocabulary files are not present in the provided directory. |
| Error | If a database connection or write operation fails. |
| ParserError | If a vocabulary file cannot be parsed. |
build_mapping_datasets(standard_vocabularies)
Generate terminology-mapping datasets from SQL templates and a local database.
This function loads SQL command templates from a packaged YAML file, formats them using the provided list of standard vocabularies, executes each query against the local OMOP SQLite database, and saves the resulting datasets as CSV files in the local datasets directory.
If mapping datasets already exist, the user is prompted to either skip regeneration or overwrite the existing files.
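The template-to-CSV flow described above can be sketched as follows. Several things here are assumptions for illustration: the in-code `SQL_TEMPLATES` dictionary stands in for the parsed contents of the packaged YAML file, the `{placeholders}` field name and the example query are invented, and the sketch binds the vocabulary names as SQL parameters rather than formatting them directly into the query text, which may not match how the module fills its templates.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical stand-in for the templates loaded from the packaged YAML file.
SQL_TEMPLATES = {
    "standard_concepts": (
        "SELECT concept_id, concept_name FROM concept "
        "WHERE vocabulary_id IN ({placeholders}) ORDER BY concept_id"
    ),
}


def build_datasets_sketch(conn, standard_vocabularies, out_dir: Path) -> list[Path]:
    """Run each templated query and write its result set to a CSV file."""
    if not standard_vocabularies:
        raise ValueError("At least one standard vocabulary is required")

    out_dir.mkdir(parents=True, exist_ok=True)
    # One "?" per vocabulary, so the values are bound rather than interpolated.
    placeholders = ", ".join("?" for _ in standard_vocabularies)

    written = []
    for name, template in SQL_TEMPLATES.items():
        cursor = conn.execute(
            template.format(placeholders=placeholders),
            list(standard_vocabularies),
        )
        path = out_dir / f"{name}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow([column[0] for column in cursor.description])
            writer.writerows(cursor)
        written.append(path)
    return written
```

Each key of the template mapping becomes one CSV file name, which matches the "one CSV file per SQL command" behavior listed for this function.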
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| standard_vocabularies | list[str] | List of standard vocabulary names used to parameterize the SQL queries that generate the mapping datasets. | required |
Returns:
| Type | Description |
|---|---|
| None | None. |
Raises:
| Type | Description |
|---|---|
| ValueError | If no standard vocabularies are provided. |
| FileNotFoundError | If the SQL command resource file cannot be located. |
| Error | If an error occurs while querying the local SQLite database. |
| YAMLError | If the SQL command YAML file cannot be parsed. |
Side Effects
Creates the .aatm/datasets directory when needed.
Writes one CSV file per SQL command to the local datasets directory.
Prompts the user before overwriting existing dataset files.
build_local_vector_database(embedding_model_name, vector_db_dir=None, rate_limit=None, batch_size=100)
Build or repair a local vector database for terminology mapping.
This function creates a persistent vector database from the generated terminology-mapping datasets. It loads records from local CSV files, converts them into structured metadata objects, avoids duplicate or already indexed entries, and stores embeddings in a ChromaDB collection using the configured embedding model.
If a vector database already exists, the user may choose to skip the operation, overwrite the existing database, or repair it. Optional rate-limiting can be applied to control embedding throughput for providers with request limits.
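The batching and "skip what is already indexed" logic can be sketched without any vector-database dependency. In the example below, `InMemoryCollection` is a hypothetical stand-in that exposes only `get(ids=...)` and `add(ids=..., documents=..., metadatas=...)`, shaped after the corresponding ChromaDB collection methods; the record layout (`id`, `text`, `meta`) and the function names are assumptions, and embedding generation and rate limiting are left out.

```python
from itertools import islice


class InMemoryCollection:
    """Hypothetical stand-in for a vector-database collection."""

    def __init__(self):
        self.store = {}

    def get(self, ids):
        # Report which of the requested ids are already stored.
        return {"ids": [i for i in ids if i in self.store]}

    def add(self, ids, documents, metadatas):
        for i, doc, meta in zip(ids, documents, metadatas):
            self.store[i] = (doc, meta)


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def index_in_batches(records, collection, batch_size=100):
    """Add only new, non-duplicate records; return how many were added."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    added = 0
    seen = set()
    for chunk in batched(records, batch_size):
        # Drop ids that already appeared earlier in the input.
        unique = [r for r in chunk if not (r["id"] in seen or seen.add(r["id"]))]
        if not unique:
            continue
        # Drop ids the collection already holds (the "repair" case).
        existing = set(collection.get(ids=[r["id"] for r in unique])["ids"])
        new = [r for r in unique if r["id"] not in existing]
        if new:
            collection.add(
                ids=[r["id"] for r in new],
                documents=[r["text"] for r in new],
                metadatas=[r["meta"] for r in new],
            )
            added += len(new)
    return added
```

Running the same input a second time adds nothing, which is the property that makes a repair pass safe to repeat after an interrupted build.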
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| embedding_model_name | str | Registry key identifying the embedding model configuration to use. | required |
| vector_db_dir | Path \| None | Optional path to the vector database directory. If not provided, the default path from the retriever model registry is used. | None |
| rate_limit | int \| None | Optional maximum number of documents to embed per minute. If not provided, the default value from the model registry is used. | None |
| batch_size | int | Number of records to process and add to the vector database in each batch. | 100 |
Returns:
| Type | Description |
|---|---|
| None | None. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
| ValueError | If |
| FileNotFoundError | If required dataset files are missing. |
| Error | If dependent local resources are unavailable or invalid. |
| Exception | If the vector database client, embedding function, or collection operations fail. |
Side Effects
Creates, repairs, or overwrites a local persistent vector database.
Reads CSV datasets from .aatm/datasets.
Prompts the user before modifying an existing vector database.
Generates embeddings and stores documents and metadata in the target collection.
Notes
The function performs lazy imports for some dependencies to reduce startup overhead for workflows that do not require vector database creation.