Pareto Q-Learning

class morl_baselines.multi_policy.pareto_q_learning.pql.PQL(env, ref_point: ndarray, gamma: float = 0.8, initial_epsilon: float = 1.0, epsilon_decay_steps: int = 100000, final_epsilon: float = 0.1, seed: int | None = None, project_name: str = 'MORL-Baselines', experiment_name: str = 'Pareto Q-Learning', wandb_entity: str | None = None, log: bool = True)

Pareto Q-learning.

Tabular method relying on Pareto pruning. Paper: K. Van Moffaert and A. Nowé, “Multi-objective reinforcement learning using sets of pareto dominating policies,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3483–3512, 2014.

Initialize the Pareto Q-learning algorithm.

Parameters:
  • env – The environment.

  • ref_point – The reference point for the hypervolume metric.

  • gamma – The discount factor.

  • initial_epsilon – The initial epsilon value.

  • epsilon_decay_steps – The number of steps to decay epsilon.

  • final_epsilon – The final epsilon value.

  • seed – The random seed.

  • project_name – The name of the project used for logging.

  • experiment_name – The name of the experiment used for logging.

  • wandb_entity – The wandb entity used for logging.

  • log – Whether to log or not.
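
A minimal construction sketch. The environment id, reference point, and hyperparameter values below are illustrative assumptions, not defaults of this API:

  import numpy as np
  import mo_gymnasium as mo_gym
  from morl_baselines.multi_policy.pareto_q_learning.pql import PQL

  # Assumed example: a small tabular multi-objective environment from mo-gymnasium.
  env = mo_gym.make("deep-sea-treasure-v0")

  agent = PQL(
      env,
      ref_point=np.array([0.0, -50.0]),  # illustrative hypervolume reference point
      gamma=0.99,
      initial_epsilon=1.0,
      epsilon_decay_steps=100_000,
      final_epsilon=0.1,
      seed=42,
      log=False,  # disable wandb logging for this sketch
  )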

calc_non_dominated(state: int)

Get the non-dominated vectors in a given state.

Parameters:

state (int) – The current state.

Returns:

Set – A set of Pareto non-dominated vectors.
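
For intuition, a vector is non-dominated if no other vector is at least as good in every objective and strictly better in at least one (maximization). A standalone sketch of such pruning, not the library's internal implementation:

  import numpy as np

  def pareto_prune(vectors):
      """Keep only the vectors that are not Pareto-dominated by another vector."""
      vectors = [np.asarray(v, dtype=float) for v in vectors]
      keep = []
      for i, v in enumerate(vectors):
          dominated = any(
              np.all(w >= v) and np.any(w > v)
              for j, w in enumerate(vectors)
              if j != i
          )
          if not dominated:
              keep.append(v)
      return keep

  # (1, 0) and (0, 1) are incomparable; (0.2, 0.2) is dominated by (0.5, 0.5).
  print(pareto_prune([[1, 0], [0, 1], [0.5, 0.5], [0.2, 0.2]]))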

get_config() dict

Get the configuration dictionary.

Returns:

Dict – A dictionary of parameters and values.
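
A quick usage sketch, continuing from the construction example above:

  config = agent.get_config()
  print(config)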

get_local_pcs(state: int = 0)

Collect the local PCS (Pareto coverage set) in a given state.

Parameters:

state (int) – The state to get a local PCS for. (Default value = 0)

Returns:

Set – A set of Pareto optimal vectors.
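
A hedged usage sketch: after training, the local PCS of the initial state is typically the front of interest. The state index 0 is an assumption about how the wrapped environment enumerates states:

  front = agent.get_local_pcs(state=0)
  for vec in front:
      print(vec)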

get_q_set(state: int, action: int)

Compute the Q-set for a given state-action pair.

Parameters:
  • state (int) – The current state.

  • action (int) – The action.

Returns:

Set – A set of Q vectors.
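
Conceptually, following Van Moffaert and Nowé (2014), the Q-set translates the non-dominated vectors of the successor state by the average immediate reward, discounted by gamma. A schematic, self-contained sketch with illustrative numbers, not the library's internal code:

  import numpy as np

  def compute_q_set(r_avg, nd_next, gamma):
      # Terminal successor: the Q-set reduces to the average immediate reward.
      if not nd_next:
          return {tuple(np.asarray(r_avg, dtype=float))}
      # Otherwise, add the average reward to every discounted non-dominated vector.
      return {
          tuple(np.asarray(r_avg, dtype=float) + gamma * np.asarray(v, dtype=float))
          for v in nd_next
      }

  print(compute_q_set(r_avg=[1.0, -1.0], nd_next=[[3.0, -2.0], [1.0, 0.0]], gamma=0.9))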

score_hypervolume(state: int)

Compute the action scores based upon the hypervolume metric.

Parameters:

state (int) – The current state.

Returns:

ndarray – A score per action.

score_pareto_cardinality(state: int)

Compute the action scores based upon the Pareto cardinality metric.

Parameters:

state (int) – The current state.

Returns:

ndarray – A score per action.

select_action(state: int, score_func: Callable)

Select an action in the current state.

Parameters:
  • state (int) – The current state.

  • score_func (callable) – A function that returns a score per action.

Returns:

int – The selected action.
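
A usage sketch pairing select_action with either scoring method, continuing from the construction example above (the state index is illustrative):

  state = 0  # illustrative state index
  action = agent.select_action(state, score_func=agent.score_hypervolume)
  # or, with the Pareto cardinality metric instead:
  action = agent.select_action(state, score_func=agent.score_pareto_cardinality)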

track_policy(vec, env: Env, tol=0.001)

Track a policy from its return vector.

Parameters:
  • vec (array_like) – The return vector to track.

  • env (gym.Env) – The environment to track the policy in.

  • tol (float, optional) – The tolerance for the return vector. (Default value = 1e-3)
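
A hedged sketch: after training, pick one return vector from the recovered front and re-execute the greedy policy that attains it. The fresh environment copy and the state index are assumptions, continuing from the construction example above:

  front = agent.get_local_pcs(state=0)
  target = next(iter(front))  # any vector on the local Pareto front
  agent.track_policy(target, env=mo_gym.make("deep-sea-treasure-v0"), tol=1e-3)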

train(total_timesteps: int, eval_env: Env, ref_point: ndarray | None = None, known_pareto_front: List[ndarray] | None = None, num_eval_weights_for_eval: int = 50, log_every: int | None = 10000, action_eval: str | None = 'hypervolume')

Learn the Pareto front.

Parameters:
  • total_timesteps (int) – The number of timesteps to train for.

  • eval_env (gym.Env) – The environment to evaluate the policies on.

  • ref_point (ndarray, optional) – The reference point for the hypervolume metric during evaluation. If None, the training reference point is used.

  • known_pareto_front (List[ndarray], optional) – The optimal Pareto front, if known.

  • num_eval_weights_for_eval (int) – The number of weights to use when evaluating the Pareto front, e.g., for computing expected utility.

  • log_every (int, optional) – The number of timesteps between logging points. (Default value = 10000)

  • action_eval (str, optional) – The action evaluation function name. (Default value = ‘hypervolume’)

Returns:

Set – The final Pareto front.
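
Putting it together, a minimal end-to-end sketch. The environment id, reference point, and timestep budget are illustrative assumptions:

  import numpy as np
  import mo_gymnasium as mo_gym
  from morl_baselines.multi_policy.pareto_q_learning.pql import PQL

  env = mo_gym.make("deep-sea-treasure-v0")
  eval_env = mo_gym.make("deep-sea-treasure-v0")

  agent = PQL(env, ref_point=np.array([0.0, -50.0]), gamma=0.99, seed=1, log=False)
  pareto_front = agent.train(
      total_timesteps=200_000,
      eval_env=eval_env,
      log_every=10_000,
      action_eval="hypervolume",
  )
  print(pareto_front)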