Pareto Q-Learning
- class morl_baselines.multi_policy.pareto_q_learning.pql.PQL(env, ref_point: ndarray, gamma: float = 0.8, initial_epsilon: float = 1.0, epsilon_decay_steps: int = 100000, final_epsilon: float = 0.1, seed: int | None = None, project_name: str = 'MORL-Baselines', experiment_name: str = 'Pareto Q-Learning', wandb_entity: str | None = None, log: bool = True)
Pareto Q-learning.
Tabular method relying on Pareto pruning. Paper: K. Van Moffaert and A. Nowé, “Multi-objective reinforcement learning using sets of Pareto dominating policies,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3483–3512, 2014.
Initialize the Pareto Q-learning algorithm.
- Parameters:
env – The environment.
ref_point – The reference point for the hypervolume metric.
gamma – The discount factor.
initial_epsilon – The initial epsilon value.
epsilon_decay_steps – The number of steps to decay epsilon.
final_epsilon – The final epsilon value.
seed – The random seed.
project_name – The name of the project used for logging.
experiment_name – The name of the experiment used for logging.
wandb_entity – The wandb entity used for logging.
log – Whether to log or not.
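A minimal construction sketch. The environment id (from mo-gymnasium) and the reference point are illustrative choices, not values prescribed by the class; any environment with discrete states and actions and a vector-valued reward suits this tabular method.

```python
import mo_gymnasium as mo_gym
import numpy as np

from morl_baselines.multi_policy.pareto_q_learning.pql import PQL

# Illustrative setup: Deep Sea Treasure is a small two-objective, tabular benchmark.
env = mo_gym.make("deep-sea-treasure-v0")

agent = PQL(
    env,
    ref_point=np.array([0.0, -50.0]),  # reference point for the hypervolume metric (assumed values)
    gamma=0.99,
    initial_epsilon=1.0,
    epsilon_decay_steps=100_000,
    final_epsilon=0.1,
    seed=42,
    log=False,  # set to True to log to Weights & Biases
)
```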
- calc_non_dominated(state: int)
Get the non-dominated vectors in a given state.
- Parameters:
state (int) – The current state.
- Returns:
Set – A set of Pareto non-dominated vectors.
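Pareto non-dominance here means that no other vector is at least as good in every objective and strictly better in at least one. A standalone sketch of such a filter (illustrating the concept, not the class internals):

```python
import numpy as np

def pareto_non_dominated(vectors: np.ndarray) -> np.ndarray:
    """Keep only the vectors not Pareto-dominated by any other vector (maximization)."""
    keep = np.ones(len(vectors), dtype=bool)
    for i, v in enumerate(vectors):
        # v is dominated if some vector is >= v everywhere and > v somewhere
        keep[i] = not np.any(np.all(vectors >= v, axis=1) & np.any(vectors > v, axis=1))
    return vectors[keep]

candidates = np.array([[1.0, 0.0], [0.5, 0.5], [0.2, 0.2], [0.0, 1.0]])
print(pareto_non_dominated(candidates))  # drops [0.2, 0.2], dominated by [0.5, 0.5]
```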
- get_config() → dict
Get the configuration dictionary.
- Returns:
Dict – A dictionary of parameters and values.
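Inspecting the returned configuration of a constructed agent, for instance to dump it alongside local results (continuing from the construction sketch above):

```python
import json

# Hyperparameters and logging settings as plain key/value pairs.
print(json.dumps(agent.get_config(), indent=2, default=str))
```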
- get_local_pcs(state: int = 0)
Collect the local PCS (Pareto coverage set) in a given state.
- Parameters:
state (int) – The state to get a local PCS for. (Default value = 0)
- Returns:
Set – A set of Pareto optimal vectors.
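After training, the local PCS at the start state is the set of non-dominated return estimates the agent has learned there. A quick peek, assuming `agent` is a trained instance and that state 0 is the start state:

```python
import numpy as np

for vec in agent.get_local_pcs(state=0):
    print(np.round(np.asarray(vec), 2))  # one non-dominated return vector per recovered trade-off
```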
- get_q_set(state: int, action: int)
Compute the Q-set for a given state-action pair.
- Parameters:
state (int) – The current state.
action (int) – The action.
- Returns:
Set – A set of Q vectors.
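Conceptually, the Q-set of a pair (s, a) is obtained by adding the estimated average immediate reward vector to each discounted non-dominated vector of the successor state. A schematic sketch with hypothetical `avg_reward` and `nd_next` containers standing in for the class's internal bookkeeping:

```python
import numpy as np

gamma = 0.99

# Hypothetical per-(state, action) statistics such an agent maintains:
avg_reward = {(0, 1): np.array([1.0, -1.0])}     # running mean of immediate reward vectors
nd_next = {(0, 1): {(10.0, -5.0), (4.0, -3.0)}}  # non-dominated vectors of the successor state

def q_set(state: int, action: int) -> set:
    """Q-set(s, a): immediate reward vector translated onto the discounted successor set."""
    r = avg_reward[(state, action)]
    successors = nd_next[(state, action)]
    if not successors:
        return {tuple(r)}  # no successor estimates yet: only the immediate reward remains
    return {tuple(r + gamma * np.array(v)) for v in successors}

print(q_set(0, 1))  # e.g. {(10.9, -5.95), (4.96, -3.97)}, up to floating-point rounding
```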
- score_hypervolume(state: int)
Compute the action scores based upon the hypervolume metric.
- Parameters:
state (int) – The current state.
- Returns:
ndarray – A score per action.
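The hypervolume heuristic scores each action by the hypervolume of its Q-set with respect to the reference point, favouring actions whose vectors dominate a larger region. A two-objective sketch with a simple sweep-based hypervolume (the library uses its own indicator; this one is only illustrative):

```python
import numpy as np

def hypervolume_2d(ref: np.ndarray, points: list) -> float:
    """Area dominated by `points` and bounded below by `ref` (two objectives, maximization)."""
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    pts.sort(key=lambda p: p[0], reverse=True)  # sweep from best to worst first objective
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                          # stack disjoint rectangles
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

ref_point = np.array([0.0, 0.0])
q_sets = {0: [(3.0, 1.0), (1.0, 3.0)], 1: [(2.0, 2.0)]}  # hypothetical Q-sets per action
scores = np.array([hypervolume_2d(ref_point, q_sets[a]) for a in sorted(q_sets)])
print(scores)  # [5. 4.]: action 0 dominates the larger region
```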
- score_pareto_cardinality(state: int)
Compute the action scores based upon the Pareto cardinality metric.
- Parameters:
state (int) – The current state.
- Returns:
ndarray – A score per action.
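The cardinality heuristic instead pools the Q-sets of all actions in the state and scores each action by how many of its vectors survive a Pareto filter over that pool. A self-contained sketch of that counting scheme:

```python
import numpy as np

def is_dominated(v: np.ndarray, pool: np.ndarray) -> bool:
    """True if some vector in `pool` Pareto-dominates v (maximization)."""
    return bool(np.any(np.all(pool >= v, axis=1) & np.any(pool > v, axis=1)))

# Hypothetical Q-sets per action in the current state.
q_sets = {0: [np.array([3.0, 1.0]), np.array([1.0, 1.0])],
          1: [np.array([1.0, 3.0]), np.array([2.0, 2.0])]}

pool = np.array([v for vecs in q_sets.values() for v in vecs])
scores = np.array([sum(not is_dominated(v, pool) for v in q_sets[a]) for a in sorted(q_sets)])
print(scores)  # [1 2]: action 1 contributes more vectors to the state's Pareto front
```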
- select_action(state: int, score_func: Callable)
Select an action in the current state.
- Parameters:
state (int) – The current state.
score_func (callable) – A function that returns a score per action.
- Returns:
int – The selected action.
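Selection is epsilon-greedy over the per-action scores produced by the chosen score function (e.g. score_hypervolume or score_pareto_cardinality). A conceptual sketch of that rule; the random tie-breaking between equally scored actions is an assumption, not a documented detail:

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(scores: np.ndarray, epsilon: float) -> int:
    """Pick a random action with probability epsilon, otherwise a best-scoring one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(scores)))        # explore
    best = np.flatnonzero(scores == scores.max())    # exploit, break ties at random
    return int(rng.choice(best))

print(epsilon_greedy(np.array([5.0, 4.0, 5.0]), epsilon=0.1))
```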
- track_policy(vec, env: Env, tol=0.001)
Track a policy from its return vector.
- Parameters:
vec (array_like) – The return vector to track.
env (gym.Env) – The environment to track the policy in.
tol (float, optional) – The tolerance for the return vector. (Default value = 1e-3)
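Usage sketch: pick one return vector from the learned front and replay the corresponding policy in a fresh copy of the environment (assuming `agent` is a trained instance and the illustrative environment id from the construction sketch):

```python
import mo_gymnasium as mo_gym
import numpy as np

eval_env = mo_gym.make("deep-sea-treasure-v0")

# Any vector on the learned front can be handed back to the agent for execution.
target = np.array(next(iter(agent.get_local_pcs(state=0))))
agent.track_policy(target, env=eval_env, tol=1e-3)
```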
- train(total_timesteps: int, eval_env: Env, ref_point: ndarray | None = None, known_pareto_front: List[ndarray] | None = None, num_eval_weights_for_eval: int = 50, log_every: int | None = 10000, action_eval: str | None = 'hypervolume')
Learn the Pareto front.
- Parameters:
total_timesteps (int) – The number of timesteps to train for.
eval_env (gym.Env) – The environment to evaluate the policies on.
ref_point (ndarray, optional) – The reference point for the hypervolume metric during evaluation. If None, use the same reference point as in training.
known_pareto_front (List[ndarray], optional) – The optimal Pareto front, if known.
num_eval_weights_for_eval (int) – The number of weights used when evaluating the Pareto front, e.g., for computing the expected utility.
log_every (int, optional) – Log the results every this many timesteps. (Default value = 10000)
action_eval (str, optional) – The action evaluation function name. (Default value = ‘hypervolume’)
- Returns:
Set – The final Pareto front.
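A training sketch continuing from the construction example above; the evaluation environment, budget, and reference point are illustrative:

```python
import mo_gymnasium as mo_gym
import numpy as np

eval_env = mo_gym.make("deep-sea-treasure-v0")

pareto_front = agent.train(
    total_timesteps=200_000,
    eval_env=eval_env,
    ref_point=np.array([0.0, -50.0]),  # reuse the training reference point for evaluation
    log_every=10_000,
    action_eval="hypervolume",         # name of the action-evaluation heuristic
)
print(pareto_front)  # the set of non-dominated return vectors learned during training
```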