MPMOQLearning

class morl_baselines.multi_policy.multi_policy_moqlearning.mp_mo_q_learning.MPMOQLearning(env, scalarization=<function weighted_sum>, learning_rate: float = 0.1, gamma: float = 0.9, initial_epsilon: float = 0.1, final_epsilon: float = 0.1, epsilon_decay_steps: int | None = None, weight_selection_algo: str = 'random', epsilon_ols: float | None = None, use_gpi_policy: bool = False, transfer_q_table: bool = True, dyna: bool = False, dyna_updates: int = 5, gpi_pd: bool = False, project_name: str = 'MORL-Baselines', experiment_name: str = 'MultiPolicy MO Q-Learning', wandb_entity: str | None = None, seed: int | None = None, log: bool = True)

Multi-policy MOQ-Learning: Outer loop version of mo_q_learning.

Paper: K. Van Moffaert, M. Drugan, and A. Nowé, Scalarized Multi-Objective Reinforcement Learning: Novel Design Techniques. 2013. doi: 10.1109/ADPRL.2013.6615007.

Initialize the Multi-policy MOQ-learning algorithm.

Parameters:
  • env – The environment to learn from.

  • scalarization – The scalarization function to use.

  • learning_rate – The learning rate.

  • gamma – The discount factor.

  • initial_epsilon – The initial epsilon value.

  • final_epsilon – The final epsilon value.

  • epsilon_decay_steps – The number of steps for epsilon decay.

  • weight_selection_algo – The algorithm to use for weight selection. Options: “random”, “ols”, “gpi-ls”.

  • epsilon_ols – The epsilon value for the optimistic linear support.

  • use_gpi_policy – Whether to use Generalized Policy Improvement (GPI) or not.

  • transfer_q_table – Whether to reuse a Q-table from a previous learned policy when initializing a new policy.

  • dyna – Whether to use Dyna-Q or not.

  • dyna_updates – The number of Dyna-Q updates to perform.

  • gpi_pd – Whether to use the GPI-PD method to prioritize Dyna updates.

  • project_name – The name of the project for logging.

  • experiment_name – The name of the experiment for logging.

  • wandb_entity – The entity to use for logging.

  • seed – The seed to use for reproducibility.

  • log – Whether to log or not.
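A minimal construction sketch, assuming an MO-Gymnasium environment such as deep-sea-treasure-v0; the environment choice and hyperparameter values below are illustrative, not prescribed by the API:

    import mo_gymnasium as mo_gym  # assumed: any multi-objective Gymnasium env should work

    from morl_baselines.multi_policy.multi_policy_moqlearning.mp_mo_q_learning import MPMOQLearning

    env = mo_gym.make("deep-sea-treasure-v0")  # illustrative 2-objective environment

    agent = MPMOQLearning(
        env,
        learning_rate=0.1,
        gamma=0.9,
        initial_epsilon=1.0,          # start fully exploratory
        final_epsilon=0.1,
        epsilon_decay_steps=100_000,
        weight_selection_algo="ols",  # or "random" / "gpi-ls"
        transfer_q_table=True,        # warm-start each new policy from a previous Q-table
        seed=42,
        log=False,                    # disable wandb logging for this quick sketch
    )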

delete_policies(delete_indx: List[int])

Delete the policies with the given indices.

eval(obs: array, w: ndarray | None = None) → int

If use_gpi_policy is True, returns the action given by the GPI policy. Otherwise, chooses the best policy for w and follows it.
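A hedged sketch of action selection for a fixed weight vector, reusing the agent and env from the construction sketch above and assuming a 2-objective task (the weight values are illustrative):

    import numpy as np

    obs, _ = env.reset(seed=42)
    w = np.array([0.7, 0.3])     # illustrative preference over the two objectives

    action = agent.eval(obs, w)  # best stored policy for w, or the GPI action if use_gpi_policy=True
    obs, vec_reward, terminated, truncated, info = env.step(action)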

get_config() → dict

Generates a dictionary of the algorithm's parameter configuration.

Returns:

dict – Config

max_scalar_q_value(state: ndarray, w: ndarray) → float

Get the maximum Q-value over all policies for the given state and weights.
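Conceptually, this returns the maximum over stored policies i and actions a of w · Q_i(s, a). A rough re-implementation sketch, assuming each stored policy exposes a tabular q_table mapping a state key to an (n_actions, n_objectives) array; these attribute names are illustrative, not the library's exact internals:

    import numpy as np

    def max_scalar_q(policies, state_key, w):
        # max_i max_a w . Q_i(s, a): scalarize each policy's Q-vectors, keep the best
        return max(np.max(policy.q_table[state_key] @ w) for policy in policies)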

train(total_timesteps: int, eval_env: Env, ref_point: ndarray, known_pareto_front: List[ndarray] | None = None, timesteps_per_iteration: int = 200000, num_eval_weights_for_front: int = 100, num_eval_episodes_for_front: int = 5, num_eval_weights_for_eval: int = 50, eval_freq: int = 1000)

Learn a set of policies.

Parameters:
  • total_timesteps – The total number of timesteps to train for.

  • eval_env – The environment to use for evaluation.

  • ref_point – The reference point for the hypervolume calculation.

  • known_pareto_front – The optimal Pareto front, if known. Used for metrics.

  • timesteps_per_iteration – The number of timesteps allocated to each iteration of the outer loop (i.e., to each selected weight vector).

  • num_eval_weights_for_front – The number of weights to use to construct a Pareto front for evaluation.

  • num_eval_episodes_for_front – The number of episodes to run when evaluating the policy.

  • num_eval_weights_for_eval (int) – Number of weights used when evaluating the Pareto front, e.g., for computing expected utility.

  • eval_freq – The frequency (in timesteps) at which to run evaluations.
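A hedged end-to-end training sketch following the construction example above; the evaluation environment, reference point, and budgets are illustrative and should be adapted to the task's reward ranges:

    import numpy as np
    import mo_gymnasium as mo_gym

    eval_env = mo_gym.make("deep-sea-treasure-v0")  # separate instance for evaluation

    agent.train(
        total_timesteps=1_000_000,
        eval_env=eval_env,
        ref_point=np.array([0.0, -50.0]),  # illustrative hypervolume reference point
        timesteps_per_iteration=200_000,   # budget spent on each selected weight vector
        num_eval_weights_for_front=100,
        num_eval_episodes_for_front=5,
        eval_freq=1_000,
    )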