MPMOQ Learning¶
- class morl_baselines.multi_policy.multi_policy_moqlearning.mp_mo_q_learning.MPMOQLearning(env, scalarization=<function weighted_sum>, learning_rate: float = 0.1, gamma: float = 0.9, initial_epsilon: float = 0.1, final_epsilon: float = 0.1, epsilon_decay_steps: int | None = None, weight_selection_algo: str = 'random', epsilon_ols: float | None = None, use_gpi_policy: bool = False, transfer_q_table: bool = True, dyna: bool = False, dyna_updates: int = 5, gpi_pd: bool = False, project_name: str = 'MORL-Baselines', experiment_name: str = 'MultiPolicy MO Q-Learning', wandb_entity: str | None = None, seed: int | None = None, log: bool = True)¶
Multi-policy MOQ-Learning: Outer loop version of mo_q_learning.
Paper: K. Van Moffaert, M. Drugan, and A. Nowe, "Scalarized Multi-Objective Reinforcement Learning: Novel Design Techniques," 2013. doi: 10.1109/ADPRL.2013.6615007.
Initialize the Multi-policy MOQ-learning algorithm. A minimal usage sketch is shown after the parameter list below.
- Parameters:
env – The environment to learn from.
scalarization – The scalarization function to use.
learning_rate – The learning rate.
gamma – The discount factor.
initial_epsilon – The initial epsilon value.
final_epsilon – The final epsilon value.
epsilon_decay_steps – The number of steps for epsilon decay.
weight_selection_algo – The algorithm to use for weight selection. Options: “random”, “ols”, “gpi-ls”
epsilon_ols – The epsilon value for the optimistic linear support.
use_gpi_policy – Whether to use Generalized Policy Improvement (GPI) or not.
transfer_q_table – Whether to reuse a Q-table from a previous learned policy when initializing a new policy.
dyna – Whether to use Dyna-Q or not.
dyna_updates – The number of Dyna-Q updates to perform.
gpi_pd – Whether to use the GPI-PD method to prioritize Dyna updates.
project_name – The name of the project for logging.
experiment_name – The name of the experiment for logging.
wandb_entity – The entity to use for logging.
seed – The seed to use for reproducibility.
log – Whether to log or not.
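A minimal construction sketch, assuming mo-gymnasium and its deep-sea-treasure-v0 environment are available; the hyperparameter values shown are illustrative, not tuned recommendations:

```python
import mo_gymnasium as mo_gym

from morl_baselines.multi_policy.multi_policy_moqlearning.mp_mo_q_learning import (
    MPMOQLearning,
)

# Deep Sea Treasure: a small two-objective environment (treasure value vs. time penalty).
env = mo_gym.make("deep-sea-treasure-v0")

agent = MPMOQLearning(
    env,
    learning_rate=0.1,
    gamma=0.9,
    initial_epsilon=1.0,
    final_epsilon=0.1,
    epsilon_decay_steps=10_000,
    weight_selection_algo="ols",  # or "random" / "gpi-ls"
    epsilon_ols=0.01,             # tolerance used by optimistic linear support
    use_gpi_policy=True,
    transfer_q_table=True,
    seed=42,
    log=False,                    # disable wandb logging for this sketch
)
```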
- delete_policies(delete_indx: List[int])¶
Delete the policies with the given indices.
- eval(obs: array, w: ndarray | None = None) int ¶
If use_gpi_policy is True, returns the action given by the GPI policy. Otherwise, selects the best policy for the weight vector w and follows it.
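A brief usage sketch, continuing the construction example above and assuming the agent has already been trained (see train below); the weight vector is illustrative:

```python
import numpy as np

obs, _ = env.reset(seed=42)   # Gymnasium-style reset returning (obs, info)
w = np.array([0.7, 0.3])      # illustrative preference over the two objectives

# Returns the GPI action if use_gpi_policy=True, otherwise the action of the
# single policy whose scalarized value is best for w.
action = agent.eval(obs, w)
```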
- get_config() dict ¶
Generates a dictionary of the algorithm's parameter configuration.
- Returns:
dict – The configuration dictionary.
- max_scalar_q_value(state: ndarray, w: ndarray) float ¶
Get the maximum Q-value over all policies for the given state and weights.
- train(total_timesteps: int, eval_env: Env, ref_point: ndarray, known_pareto_front: List[ndarray] | None = None, timesteps_per_iteration: int = 200000, num_eval_weights_for_front: int = 100, num_eval_episodes_for_front: int = 5, num_eval_weights_for_eval: int = 50, eval_freq: int = 1000)¶
Learn a set of policies. A training sketch is shown after the parameter list below.
- Parameters:
total_timesteps – The total number of timesteps to train for.
eval_env – The environment to use for evaluation.
ref_point – The reference point for the hypervolume calculation.
known_pareto_front – The optimal Pareto front, if known. Used for metrics.
timesteps_per_iteration – The number of timesteps per iteration.
num_eval_weights_for_front – The number of weights to use to construct a Pareto front for evaluation.
num_eval_episodes_for_front – The number of episodes to run when evaluating the policy.
num_eval_weights_for_eval – The number of weights to use when evaluating the Pareto front, e.g., for computing expected utility.
eval_freq – The frequency of evaluation.
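A hedged training sketch, continuing the construction example above; the reference point and timestep budget are assumed values for deep-sea-treasure-v0 and may need adjustment:

```python
import numpy as np

eval_env = mo_gym.make("deep-sea-treasure-v0")

agent.train(
    total_timesteps=100_000,
    eval_env=eval_env,
    ref_point=np.array([0.0, -50.0]),  # assumed reference point dominated by all returns
    timesteps_per_iteration=10_000,    # inner-loop budget for each selected weight vector
    num_eval_weights_for_front=100,
    num_eval_episodes_for_front=5,
    eval_freq=1_000,
)
```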