MO Q-Learning

class morl_baselines.single_policy.ser.mo_q_learning.MOQLearning(env, id: int | None = None, weights: numpy.ndarray = array([0.5, 0.5]), scalarization=weighted_sum, learning_rate: float = 0.1, gamma: float = 0.9, initial_epsilon: float = 0.1, final_epsilon: float = 0.1, epsilon_decay_steps: int | None = None, learning_starts: int = 0, use_gpi_policy: bool = False, dyna: bool = False, dyna_updates: int = 5, model: morl_baselines.common.model_based.tabular_model.TabularModel | None = None, gpi_pd: bool = False, min_priority: float = 0.0001, alpha: float = 0.6, parent=None, project_name: str = 'MORL-baselines', experiment_name: str = 'MO Q-Learning', wandb_entity: str | None = None, log: bool = True, seed: int | None = None, parent_rng: numpy.random.Generator | None = None)

Scalarized Q-learning for single-policy multi-objective reinforcement learning.

Maintains one Q-table per objective and relies on a scalarization function to select actions. Paper: K. Van Moffaert, M. Drugan, and A. Nowé, "Scalarized Multi-Objective Reinforcement Learning: Novel Design Techniques," 2013. doi: 10.1109/ADPRL.2013.6615007.

Initializes the MO Q-learning algorithm; a usage sketch follows the parameter list below.

Parameters:
  • env – The environment to train on.

  • id – The id of the policy.

  • weights – The weights to use for the scalarization function.

  • scalarization – The scalarization function to use.

  • learning_rate – The learning rate.

  • gamma – The discount factor.

  • initial_epsilon – The initial epsilon value.

  • final_epsilon – The final epsilon value.

  • epsilon_decay_steps – The number of steps to decay epsilon over.

  • learning_starts – The number of steps to wait before starting to learn.

  • use_gpi_policy – Whether to use Generalized Policy Improvement (GPI) or not.

  • dyna – Whether to use Dyna-Q or not.

  • dyna_updates – The number of Dyna-Q updates to perform each step.

  • model – The model to use for Dyna. If None and dyna==True, a new one is created.

  • gpi_pd – Whether to use the GPI-PD method to prioritize Dyna updates.

  • min_priority – The minimum priority to use for GPI-PD.

  • alpha – The alpha value to use to smooth GPI-PD priorities.

  • parent – The parent MPMOQLearning class in the case of multi-policy training.

  • project_name – The name of the project used for logging.

  • experiment_name – The name of the experiment used for logging.

  • wandb_entity – The entity to use for logging.

  • log – Whether to log or not.

  • seed – The seed to use for the experiment.

  • parent_rng – The random number generator to use. If None, a new one is created.
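
A minimal usage sketch, assuming MO-Gymnasium's deep-sea-treasure-v0 environment (two objectives) is installed; the environment id and the hyperparameter values shown are illustrative choices, not defaults prescribed by the class.

    import numpy as np
    import mo_gymnasium as mo_gym
    from morl_baselines.single_policy.ser.mo_q_learning import MOQLearning

    env = mo_gym.make("deep-sea-treasure-v0")      # tabular environment with two objectives
    agent = MOQLearning(
        env,
        weights=np.array([0.7, 0.3]),              # fixed trade-off between the two objectives
        learning_rate=0.1,
        gamma=0.9,
        initial_epsilon=1.0,
        final_epsilon=0.05,
        epsilon_decay_steps=50_000,
        log=False,
    )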

eval(obs: array, w: ndarray | None = None) → int

Returns the greedy (best) action for the given observation; see the rollout sketch after the return description below.

Parameters:
  • obs (np.array) – The observation.

  • w (optional np.array) – Weight vector for the scalarization function.

Returns:

np.array or int – Action
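
A rollout sketch that uses eval as a greedy policy, continuing from the construction example above; the reset/step loop assumes the standard Gymnasium API of the wrapped environment.

    import numpy as np

    w = np.array([0.7, 0.3])                        # weights used to scalarize the Q-values
    obs, info = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action = agent.eval(obs, w=w)               # greedy action under the given weights
        obs, reward, terminated, truncated, info = env.step(action)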

get_config() → dict

Generates a dictionary of the algorithm's parameter configuration.

Returns:

dict – Config

scalarized_q_values(obs, w: ndarray) → ndarray

Returns the scalarized Q-values for each action, given an observation and a weight vector.
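
Under the default weighted-sum scalarization, this amounts to a weighted combination of the per-objective Q-values of each action; the array layout below is an illustrative assumption, not the class's internal representation.

    import numpy as np

    # Hypothetical per-objective Q-values for one state: one row per action,
    # one column per objective.
    q_sa = np.array([[0.0, 1.0],
                     [5.0, -2.0],
                     [2.0, 2.0]])
    w = np.array([0.5, 0.5])

    scalarized = q_sa @ w                 # weighted sum per action, shape (n_actions,)
    greedy_action = int(scalarized.argmax())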

train(start_time, total_timesteps: int = 500000, reset_num_timesteps: bool = True, eval_env: Env | None = None, eval_freq: int = 1000)

Runs the learning loop for the agent; a call sketch follows the parameter list below.

Parameters:
  • start_time – Time at which training started.

  • total_timesteps – Maximum number of timesteps to train for.

  • reset_num_timesteps – Whether to reset the timestep counter when train is called again.

  • eval_env – Separate environment used for periodic greedy evaluations.

  • eval_freq – Number of timesteps between policy evaluations.
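
A call sketch, continuing from the construction example above; passing time.time() as start_time and a fresh copy of the environment as eval_env are assumptions consistent with the parameter descriptions, not requirements stated by the API.

    import time
    import mo_gymnasium as mo_gym

    agent.train(
        start_time=time.time(),
        total_timesteps=100_000,
        eval_env=mo_gym.make("deep-sea-treasure-v0"),   # separate env for periodic greedy evaluation
        eval_freq=1_000,
    )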

update()

Updates the Q-table.
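
A minimal sketch of the scalarized single-policy Q-learning update described by Van Moffaert et al. (2013): the greedy next action is chosen on the scalarized Q-values, and every objective's Q-value is then updated with the standard TD rule using that shared action. The function name, the q_tables layout, and the weighted-sum scalarization are illustrative assumptions, not the class's internal code.

    import numpy as np

    def scalarized_q_update(q_tables, s, a, r_vec, s_next, terminated, w,
                            gamma=0.9, learning_rate=0.1):
        """One tabular TD update after the transition (s, a, r_vec, s_next).

        q_tables maps each state to an array of shape (n_actions, n_objectives).
        """
        if terminated:
            td_target = r_vec                                   # no bootstrap at terminal states
        else:
            next_a = int((q_tables[s_next] @ w).argmax())       # greedy action on scalarized values
            td_target = r_vec + gamma * q_tables[s_next][next_a]
        q_tables[s][a] += learning_rate * (td_target - q_tables[s][a])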