Pareto Q-Learning

class morl_baselines.multi_policy.pareto_q_learning.pql.PQL(env, ref_point: ndarray, gamma: float = 0.8, initial_epsilon: float = 1.0, epsilon_decay_steps: int = 100000, final_epsilon: float = 0.1, seed: int | None = None, project_name: str = 'MORL-Baselines', experiment_name: str = 'Pareto Q-Learning', wandb_entity: str | None = None, log: bool = True)

Pareto Q-learning.

Tabular method relying on Pareto pruning. Paper: K. Van Moffaert and A. Nowé, “Multi-objective reinforcement learning using sets of pareto dominating policies,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3483–3512, 2014.

Initialize the Pareto Q-learning algorithm.

Parameters:
  • env – The environment.

  • ref_point – The reference point for the hypervolume metric.

  • gamma – The discount factor.

  • initial_epsilon – The initial epsilon value.

  • epsilon_decay_steps – The number of steps to decay epsilon.

  • final_epsilon – The final epsilon value.

  • seed – The random seed.

  • project_name – The name of the project used for logging.

  • experiment_name – The name of the experiment used for logging.

  • wandb_entity – The wandb entity used for logging.

  • log – Whether to log or not.
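
A minimal construction sketch. The environment id, reference point, and hyperparameter values below are illustrative assumptions, not defaults of this API:

  import numpy as np
  import mo_gymnasium as mo_gym
  from morl_baselines.multi_policy.pareto_q_learning.pql import PQL

  # Assumed example: a small tabular multi-objective environment from mo-gymnasium.
  env = mo_gym.make("deep-sea-treasure-v0")

  agent = PQL(
      env,
      ref_point=np.array([0.0, -50.0]),  # illustrative hypervolume reference point
      gamma=0.99,
      initial_epsilon=1.0,
      epsilon_decay_steps=100_000,
      final_epsilon=0.1,
      seed=42,
      log=False,  # disable wandb logging for this sketch
  )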

calc_non_dominated(state: int)

Get the non-dominated vectors in a given state.

Parameters:

state (int) – The current state.

Returns:

Set – A set of Pareto non-dominated vectors.
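
For intuition, a vector is non-dominated if no other vector is at least as good in every objective and strictly better in at least one (maximization). A standalone sketch of such pruning, not the library's internal implementation:

  import numpy as np

  def pareto_prune(vectors):
      """Keep only the vectors that are not Pareto-dominated by another vector."""
      vectors = [np.asarray(v, dtype=float) for v in vectors]
      keep = []
      for i, v in enumerate(vectors):
          dominated = any(
              np.all(w >= v) and np.any(w > v)
              for j, w in enumerate(vectors)
              if j != i
          )
          if not dominated:
              keep.append(v)
      return keep

  # (1, 0) and (0, 1) are incomparable; (0.2, 0.2) is dominated by (0.5, 0.5).
  print(pareto_prune([[1, 0], [0, 1], [0.5, 0.5], [0.2, 0.2]]))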

get_config() dict

Get the configuration dictionary.

Returns:

Dict – A dictionary of parameters and values.
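
A quick usage sketch, continuing from the construction example above:

  config = agent.get_config()
  print(config)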

get_local_pcs(state: int = 0)

Collect the local PCS (Pareto coverage set) in a given state.

Parameters:

state (int) – The state to get a local PCS for. (Default value = 0)

Returns:

Set – A set of Pareto optimal vectors.
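
A hedged usage sketch: after training, the local PCS of the initial state is typically the front of interest. The state index 0 is an assumption about how the wrapped environment enumerates states:

  front = agent.get_local_pcs(state=0)
  for vec in front:
      print(vec)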

get_q_set(state: int, action: int)

Compute the Q-set for a given state-action pair.

Parameters:
  • state (int) – The current state.

  • action (int) – The action.

Returns:

Set – A set of Q vectors.
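
Conceptually, following Van Moffaert and Nowé (2014), the Q-set translates the non-dominated vectors of the successor state by the average immediate reward, discounted by gamma. A schematic, self-contained sketch with illustrative numbers, not the library's internal code:

  import numpy as np

  def compute_q_set(r_avg, nd_next, gamma):
      # Terminal successor: the Q-set reduces to the average immediate reward.
      if not nd_next:
          return {tuple(np.asarray(r_avg, dtype=float))}
      # Otherwise, add the average reward to every discounted non-dominated vector.
      return {
          tuple(np.asarray(r_avg, dtype=float) + gamma * np.asarray(v, dtype=float))
          for v in nd_next
      }

  print(compute_q_set(r_avg=[1.0, -1.0], nd_next=[[3.0, -2.0], [1.0, 0.0]], gamma=0.9))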

score_hypervolume(state: int)

Compute the action scores based upon the hypervolume metric.

Parameters:

state (int) – The current state.

Returns:

ndarray – A score per action.

score_pareto_cardinality(state: int)

Compute the action scores based upon the Pareto cardinality metric.

Parameters:

state (int) – The current state.

Returns:

ndarray – A score per action.

select_action(state: int, score_func: Callable)

Select an action in the current state.

Parameters:
  • state (int) – The current state.

  • score_func (callable) – A function that returns a score per action.

Returns:

int – The selected action.
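
A usage sketch pairing select_action with either scoring method, continuing from the construction example above (the state index is illustrative):

  state = 0  # illustrative state index
  action = agent.select_action(state, score_func=agent.score_hypervolume)
  # or, with the Pareto cardinality metric instead:
  action = agent.select_action(state, score_func=agent.score_pareto_cardinality)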

track_policy(vec, env: Env, tol=0.001)

Track a policy from its return vector.

Parameters:
  • vec (array_like) – The return vector to track.

  • env (gym.Env) – The environment to track the policy in.

  • tol (float, optional) – The tolerance for the return vector. (Default value = 1e-3)
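
A hedged sketch: after training, pick one return vector from the recovered front and re-execute the greedy policy that attains it. The fresh environment copy and the state index are assumptions, continuing from the construction example above:

  front = agent.get_local_pcs(state=0)
  target = next(iter(front))  # any vector on the local Pareto front
  agent.track_policy(target, env=mo_gym.make("deep-sea-treasure-v0"), tol=1e-3)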

train(total_timesteps: int, eval_env: Env, ref_point: ndarray | None = None, known_pareto_front: List[ndarray] | None = None, num_eval_weights_for_eval: int = 50, log_every: int | None = 10000, action_eval: str | None = 'hypervolume')

Learn the Pareto front.

Parameters:
  • total_timesteps (int) – The number of timesteps to train for.

  • eval_env (gym.Env) – The environment to evaluate the policies on.

  • ref_point (ndarray, optional) – The reference point for the hypervolume metric during evaluation. If None, the training reference point is used.

  • known_pareto_front (List[ndarray], optional) – The optimal Pareto front, if known.

  • num_eval_weights_for_eval (int) – The number of weights to use when evaluating the Pareto front, e.g., for computing expected utility.

  • log_every (int, optional) – The number of timesteps between logging points. (Default value = 10000)

  • action_eval (str, optional) – The action evaluation function name. (Default value = ‘hypervolume’)

Returns:

Set – The final Pareto front.
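
Putting it together, a minimal end-to-end sketch. The environment id, reference point, and timestep budget are illustrative assumptions:

  import numpy as np
  import mo_gymnasium as mo_gym
  from morl_baselines.multi_policy.pareto_q_learning.pql import PQL

  env = mo_gym.make("deep-sea-treasure-v0")
  eval_env = mo_gym.make("deep-sea-treasure-v0")

  agent = PQL(env, ref_point=np.array([0.0, -50.0]), gamma=0.99, seed=1, log=False)
  pareto_front = agent.train(
      total_timesteps=200_000,
      eval_env=eval_env,
      log_every=10_000,
      action_eval="hypervolume",
  )
  print(pareto_front)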