Configuration Documentation

This document describes the options available in the configuration file. The configuration is divided into four sections: snakemake, general, optimization, and training. Each section and its options are described below.

Snakemake Configuration

The snakemake section contains settings related to the Snakemake workflow management system.

Options

use_slurm: false
- Type: Boolean
- Description: Flag to indicate if SLURM is available (e.g., on Loewenburg).
with_gan: false
- Type: Boolean
- Description: Flag to indicate if a GAN variant should be used in addition to the normal training.
with_mtl: false
- Type: Boolean
- Description: Flag to indicate if multitask learning variants should be used in addition to the normal training.
output_dir: "reports"
- Type: String
- Description: Output directory for all generated output files.
bn: Bayesian network configuration; typically does not need to be changed.
- refactor: true
  - Type: Boolean
- cv_runs: 5
  - Type: Integer
- cv_restart: 5
  - Type: Integer
- fit: "mle-cg"
  - Type: String
- maxp: 5
  - Type: Integer
- loss: null
  - Type: Null
- score: "bic-cg"
  - Type: String
- folds: 3
  - Type: Integer
- n_bootstrap: 500
  - Type: Integer
- seed: 42
  - Type: Integer
excluded_datasets: List of datasets to be excluded from the pipeline.
- Type: List of Strings
- Description: List of datasets to be excluded from the pipeline. The datasets are defined by their folder name in the data/raw directory.
- Example: ["texas"]
exclusive_dataset: null
- Type: Null or String
- Description: Set this to run the pipeline for a single dataset only, e.g., for testing purposes.
cluster_modules:
- R: null
  - Type: Null or String
  - Description: Optional cluster module for R (e.g., "R/4.0.3"). Not required if R is available on the system or in a Conda environment.
r_env: "/path/to/R.yaml"
- Type: String
- Description: Path to the R environment file. The file is used to set up a Conda environment for R. If this option is used, Snakemake must be run with the --use-conda flag.
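
The interaction between excluded_datasets and exclusive_dataset can be illustrated with a short sketch. The helper select_datasets and the folder names dataset_a and dataset_b are hypothetical, and the precedence shown (exclusive_dataset wins when it is set) is an assumption, since the option descriptions above do not state how the two options combine.

```python
from typing import Iterable, List, Optional


def select_datasets(
    available: Iterable[str],
    excluded_datasets: Optional[List[str]] = None,
    exclusive_dataset: Optional[str] = None,
) -> List[str]:
    """Return the dataset folder names the pipeline would process.

    Assumption: exclusive_dataset, when set, takes precedence over
    excluded_datasets. The documentation does not state the precedence.
    """
    names = sorted(available)
    if exclusive_dataset is not None:
        # Keep only the single named dataset, if it exists.
        return [name for name in names if name == exclusive_dataset]
    excluded = set(excluded_datasets or [])
    return [name for name in names if name not in excluded]


# Hypothetical folder names found under data/raw.
folders = ["dataset_a", "dataset_b", "texas"]
print(select_datasets(folders, excluded_datasets=["texas"]))
print(select_datasets(folders, exclusive_dataset="dataset_b"))
```

With the example value excluded_datasets: ["texas"], the first call keeps every folder except texas; the second call keeps only the named dataset.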

General Configuration

The general section contains general settings for the application.

Options

seed: 42
- Type: Integer
- Description: Seed for reproducibility.
eval_batch_size: 64
- Type: Integer
- Description: Batch size for evaluation. It does not affect training and is limited only by the available memory.
device: "cpu"
- Type: String
- Description: Device to use for training. Use "cuda" for GPU training.
optuna_db: "postgresql://localhost/optuna"
- Type: String or Null
- Description: Database connection for Optuna. If no database is available, set this to null to use SQLite databases instead. PostgreSQL is recommended because all results are then stored in a single database.
logging:
- level: 20
  - Type: Integer
  - Description: Logging level. Default is 20 (INFO). Other options are 10 (DEBUG), 30 (WARNING), 40 (ERROR), and 50 (CRITICAL).
- mlflow:
  - use: false
    - Type: Boolean
    - Description: Flag to indicate if MLflow should be used for logging.
  - tracking_uri: "http://localhost:5000"
    - Type: String
    - Description: URI for the MLflow tracking server. Can be a local or remote server.
  - experiment_name: "VAMBN2"
    - Type: String
    - Description: Name of the MLflow experiment.
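
As a minimal sketch of how the optuna_db and logging.level options might be consumed: the numeric levels listed above match the constants of Python's standard logging module, and the SQLite fallback can be expressed as a storage URL. The helper names resolve_optuna_storage and configure_logging, and the per-study SQLite file name, are hypothetical and not taken from the application itself.

```python
import logging
from typing import Optional


def resolve_optuna_storage(optuna_db: Optional[str], study_name: str) -> str:
    """Return a storage URL for an Optuna study.

    Assumption: when optuna_db is null, each study gets its own SQLite
    file named after the study. The file naming is hypothetical.
    """
    if optuna_db is not None:
        return optuna_db
    return f"sqlite:///{study_name}.db"


def configure_logging(level: int) -> str:
    """Apply the numeric level to the root logger and return its name."""
    logging.getLogger().setLevel(level)
    return logging.getLevelName(level)


print(resolve_optuna_storage("postgresql://localhost/optuna", "demo"))
print(resolve_optuna_storage(None, "demo"))
print(configure_logging(20))
```

A single PostgreSQL URL is returned unchanged for every study, which is why all results end up in one database in that case.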

Optimization Configuration

The optimization section contains settings related to the optimization process.

Options

folds: 3
- Type: Integer
n_traditional_trials: 20
- Type: Integer
n_modular_trials: 20
- Type: Integer
s_dim_lower: 1
- Type: Integer
s_dim_upper: 5
- Type: Integer
s_dim_step: 1
- Type: Integer
fixed_s_dim: false
- Type: Boolean
y_dim_lower: 1
- Type: Integer
y_dim_upper: 5
- Type: Integer
y_dim_step: 1
- Type: Integer
fixed_y_dim: false
- Type: Boolean
latent_dim_lower: 1
- Type: Integer
latent_dim_upper: 5
- Type: Integer
latent_dim_step: 1
- Type: Integer
batch_size_lower_n: 4
- Type: Integer
batch_size_upper_n: 8
- Type: Integer
max_epochs: 2500
- Type: Integer
- Description: Maximum number of epochs. Early stopping is currently used, so training may end before this limit is reached.
learning_rate_lower: 0.0001
- Type: Float
learning_rate_upper: 0.1
- Type: Float
fixed_learning_rate: true
- Type: Boolean
lstm_layers_lower: 1
- Type: Integer
lstm_layers_upper: 4
- Type: Integer
lstm_layers_step: 1
- Type: Integer
use_relative_correlation_error_for_optimization: false
- Type: Boolean
- Description: Flag to indicate if the relative correlation error should be used as the Optuna metric.
use_auc_for_optimization: false
- Type: Boolean
- Description: Flag to indicate if the area under the ROC curve (AUC) should be used as the Optuna metric.
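
Several options above come in lower/upper/step triples (s_dim, y_dim, latent_dim, lstm_layers). The sketch below shows one plausible reading of such a triple as a set of candidate values. The helper candidate_values is hypothetical, and treating the upper bound as inclusive is an assumption; the source does not say how the application interprets the bounds.

```python
from typing import Dict, List


def candidate_values(cfg: Dict[str, int], prefix: str) -> List[int]:
    """List the integers described by <prefix>_lower/_upper/_step.

    Assumption: the upper bound is inclusive and a missing step means 1.
    """
    lower = cfg[f"{prefix}_lower"]
    upper = cfg[f"{prefix}_upper"]
    step = cfg.get(f"{prefix}_step", 1)
    if step < 1:
        raise ValueError(f"{prefix}_step must be at least 1")
    if lower > upper:
        raise ValueError(f"{prefix}_lower must not exceed {prefix}_upper")
    return list(range(lower, upper + 1, step))


# Values copied from the defaults listed above.
optimization = {
    "s_dim_lower": 1,
    "s_dim_upper": 5,
    "s_dim_step": 1,
    "lstm_layers_lower": 1,
    "lstm_layers_upper": 4,
    "lstm_layers_step": 1,
}
print(candidate_values(optimization, "s_dim"))
print(candidate_values(optimization, "lstm_layers"))
```

Under this reading, the number of distinct values per triple gives a rough sense of how large the search space is relative to n_traditional_trials and n_modular_trials.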

Training Configuration

The training section contains settings related to the training process.

Options

use_imputation_layer: true
- Type: Boolean
- Description: Flag to indicate if the imputation layer should be used.