PGMORL

Some of the code for this algorithm has been adapted from the original implementation provided by the paper's authors on GitHub.

Applicability and limitations

  • Supports continuous observation and continuous action spaces.

  • Limited to 2 objectives for now.

  • The post-processing phase (Pareto analysis stage) has not been implemented yet.

Principle

PGMORL

The principle of this algorithm is to rely on multiple PPO agents to search for various trade-offs. It maintains a population of PPO agents along with their current performances. At each iteration, the algorithm selects the most promising agents in the population and assigns each of them a weight vector used for further training. The weight vectors are generated based on a prediction model fitted on historical data gathered during the learning process.
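For intuition only, here is a toy sketch in Python of that outer loop. Everything in it is hypothetical (the stubbed evaluate and train_one_iteration functions, the dictionary agents, the random weight choice); it mirrors the control flow described above, not the library's actual implementation, where weights are chosen by the prediction model presented below.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(agent):
    # Stub: a 2-objective evaluation of the agent (hypothetical).
    return agent["skill"] + rng.normal(scale=0.1, size=2)

def train_one_iteration(agent, weight):
    # Stub: training with a weight nudges the agent toward that trade-off (hypothetical).
    agent["skill"] += 0.5 * weight

population = [{"skill": np.zeros(2)} for _ in range(6)]   # toy stand-ins for PPO agents
history = []  # (weight, eval_before, eval_after) samples that would feed the prediction model

for generation in range(10):
    # Select a few promising agents; the real algorithm uses a Pareto/diversity-based selection.
    ranked = sorted(population, key=lambda a: evaluate(a).sum(), reverse=True)
    for agent in ranked[:3]:
        # In PGMORL the weight comes from the prediction model; here it is drawn at random.
        weight = rng.dirichlet(np.ones(2))
        before = evaluate(agent)
        train_one_iteration(agent, weight)
        after = evaluate(agent)
        history.append((weight, before, after))
```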

MOPPO

Our implementation of multi-objective PPO is essentially a refactor of CleanRL's PPO. The main difference is that the value network returns a multi-objective value, which is then scalarized using a weighted sum with the given weight vector.
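To make the scalarization concrete, here is a minimal sketch with illustrative values (the tensor names are placeholders, not the library's internals):

```python
import torch

value_vector = torch.tensor([1.5, -0.3])   # multi-objective value estimate V(s), one entry per objective
weights = torch.tensor([0.7, 0.3])         # weight vector assigned to this MOPPO worker
scalarized_value = torch.dot(weights, value_vector)  # weighted-sum scalarization fed to the usual PPO loss
print(scalarized_value)  # roughly 0.96
```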

Note: it might be possible to enhance this algorithm by relying on something other than PPO.

class morl_baselines.single_policy.ser.mo_ppo.MOPPO(id: int, networks: MOPPONet, weights: ndarray, envs: SyncVectorEnv, log: bool = False, steps_per_iteration: int = 2048, num_minibatches: int = 32, update_epochs: int = 10, learning_rate: float = 0.0003, gamma: float = 0.995, anneal_lr: bool = False, clip_coef: float = 0.2, ent_coef: float = 0.0, vf_coef: float = 0.5, clip_vloss: bool = True, max_grad_norm: float = 0.5, norm_adv: bool = True, target_kl: float | None = None, gae: bool = True, gae_lambda: float = 0.95, device: device | str = 'auto', seed: int = 42, rng: Generator | None = None)

PPO modified to use a multi-objective value network (returning a vector of values) and weighted-sum scalarization.

This code has been adapted from the PPO implementation of CleanRL: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py

Multi-objective PPO.

Parameters:
  • id – Policy ID

  • networks – Actor-Critic networks

  • weights – Weights of the objectives

  • envs – Vectorized environments

  • log – Whether to log

  • steps_per_iteration – Number of steps per iteration

  • num_minibatches – Number of minibatches

  • update_epochs – Number of epochs to update the network

  • learning_rate – Learning rate

  • gamma – Discount factor

  • anneal_lr – Whether to anneal the learning rate

  • clip_coef – PPO clipping coefficient

  • ent_coef – Entropy coefficient

  • vf_coef – Value function coefficient

  • clip_vloss – Whether to clip the value loss

  • max_grad_norm – Maximum gradient norm

  • norm_adv – Whether to normalize the advantage

  • target_kl – Target KL divergence

  • gae – Whether to use Generalized Advantage Estimation

  • gae_lambda – GAE lambda

  • device – Device to use

  • seed – Random seed

  • rng – Random number generator

change_weights(new_weights: ndarray)

Change the weights of the scalarization function.

Parameters:

new_weights – New weights to apply.

eval(obs: ndarray, w)

Returns the best action to perform for the given observation and weight vector.

Returns:

action as a numpy array (continuous actions)

train(start_time, current_iteration: int, max_iterations: int)

A training iteration: trains MOPPO for self.steps_per_iteration * self.num_envs environment steps.

Parameters:
  • start_time – time.time() when the training started

  • current_iteration – current iteration number

  • max_iterations – maximum number of iterations

update()

Update algorithm’s parameters (e.g. using experiences from the buffer).

Weight generator - prediction model

See section 3.3 of the paper for more details.

class morl_baselines.multi_policy.pgmorl.pgmorl.PerformancePredictor(neighborhood_threshold: float = 0.1, sigma: float = 0.03, A_bound_min: float = 1.0, A_bound_max: float = 500.0, f_scale: float = 20.0)

Performance prediction model.

Stores the performance deltas along with the weights used after each generation. These stored samples are then used in a regression to predict the performance change obtained by training a given policy with a given weight. Predicts: (weight, performance) -> performance delta.

Initialize the performance predictor.

Parameters:
  • neighborhood_threshold – The threshold for the neighborhood of an evaluation.

  • sigma – The sigma value for the prediction model.

  • A_bound_min – The minimum value for the A parameter of the prediction model.

  • A_bound_max – The maximum value for the A parameter of the prediction model.

  • f_scale – The scale value for the prediction model.

add(weight: ndarray, eval_before_pg: ndarray, eval_after_pg: ndarray) → None

Add a new sample to the performance predictor.

Parameters:
  • weight – The weight used to train the policy.

  • eval_before_pg – The evaluation before training the policy.

  • eval_after_pg – The evaluation after training the policy.

Returns:

None

predict_next_evaluation(weight_candidate: ndarray, policy_eval: ndarray) → Tuple[ndarray, ndarray]

Predict the next evaluation of the policy.

Uses a subset of the collected data (determined by the neighborhood threshold) to predict the performance obtained after training the policy whose current evaluation is policy_eval with the candidate weight.

Parameters:
  • weight_candidate – weight candidate

  • policy_eval – current evaluation of the policy

Returns:

the delta prediction, along with the predicted next evaluations
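For intuition, the following simplified stand-in shows the kind of mapping the predictor learns: store (weight, evaluation delta) samples and fit a regression to estimate how a candidate weight would move a policy's evaluation. The real model restricts itself to samples near the candidate and fits the bounded non-linear function from Section 3.3 of the paper; a plain least-squares fit with made-up numbers is used here only for illustration.

```python
import numpy as np

# Observed training outcomes (illustrative numbers): weight used and resulting evaluation delta.
weights_seen = np.array([[0.9, 0.1], [0.7, 0.3], [0.5, 0.5]])
deltas_seen = np.array([[12.0, -1.0], [8.0, 1.5], [4.0, 3.0]])   # eval_after - eval_before

# Simple least-squares regression: delta ~ weight @ coef (one column of coefficients per objective).
coef, *_ = np.linalg.lstsq(weights_seen, deltas_seen, rcond=None)

candidate_weight = np.array([0.6, 0.4])
predicted_delta = candidate_weight @ coef
current_eval = np.array([60.0, 9.0])
predicted_next_eval = current_eval + predicted_delta   # predicted evaluation after training with candidate_weight
```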

PGMORL

class morl_baselines.multi_policy.pgmorl.pgmorl.PGMORL(env_id: str, origin: ndarray, num_envs: int = 4, pop_size: int = 6, warmup_iterations: int = 80, steps_per_iteration: int = 2048, evolutionary_iterations: int = 20, num_weight_candidates: int = 7, num_performance_buffer: int = 100, performance_buffer_size: int = 2, min_weight: float = 0.0, max_weight: float = 1.0, delta_weight: float = 0.2, env=None, gamma: float = 0.995, project_name: str = 'MORL-baselines', experiment_name: str = 'PGMORL', wandb_entity: str | None = None, seed: int | None = None, log: bool = True, net_arch: List = [64, 64], num_minibatches: int = 32, update_epochs: int = 10, learning_rate: float = 0.0003, anneal_lr: bool = False, clip_coef: float = 0.2, ent_coef: float = 0.0, vf_coef: float = 0.5, clip_vloss: bool = True, max_grad_norm: float = 0.5, norm_adv: bool = True, target_kl: float | None = None, gae: bool = True, gae_lambda: float = 0.95, device: device | str = 'auto', group: str | None = None)

Prediction-Guided Multi-Objective Reinforcement Learning.

Reference: J. Xu, Y. Tian, P. Ma, D. Rus, S. Sueda, and W. Matusik, “Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control,” in Proceedings of the 37th International Conference on Machine Learning, Nov. 2020, pp. 10607–10616. Available: https://proceedings.mlr.press/v119/xu20h.html

Paper: https://people.csail.mit.edu/jiex/papers/PGMORL/paper.pdf Supplementary materials: https://people.csail.mit.edu/jiex/papers/PGMORL/supp.pdf

Initializes the PGMORL agent.

Parameters:
  • env_id – environment id

  • origin – reference point to make the objectives positive in the performance buffer

  • num_envs – number of environments to use (VectorizedEnvs)

  • pop_size – population size

  • warmup_iterations – number of warmup iterations

  • steps_per_iteration – number of steps per iteration

  • evolutionary_iterations – number of evolutionary iterations

  • num_weight_candidates – number of weight candidates

  • num_performance_buffer – number of performance buffers

  • performance_buffer_size – size of the performance buffers

  • min_weight – minimum weight

  • max_weight – maximum weight

  • delta_weight – delta weight for weight generation

  • env – environment

  • gamma – discount factor

  • project_name – name of the project. Usually MORL-baselines.

  • experiment_name – name of the experiment. Usually PGMORL.

  • wandb_entity – wandb entity, defaults to None.

  • seed – seed for the random number generator

  • log – whether to log the results

  • net_arch – number of units per layer

  • num_minibatches – number of minibatches

  • update_epochs – number of update epochs

  • learning_rate – learning rate

  • anneal_lr – whether to anneal the learning rate

  • clip_coef – coefficient for the policy gradient clipping

  • ent_coef – coefficient for the entropy term

  • vf_coef – coefficient for the value function loss

  • clip_vloss – whether to clip the value function loss

  • max_grad_norm – maximum gradient norm

  • norm_adv – whether to normalize the advantages

  • target_kl – target KL divergence

  • gae – whether to use generalized advantage estimation

  • gae_lambda – lambda parameter for GAE

  • device – device on which the code should run

  • group – The wandb group to use for logging.

get_config() → dict

Generates a dictionary of the algorithm's parameter configuration.

Returns:

dict – Config

train(total_timesteps: int, eval_env: Env, ref_point: ndarray, known_pareto_front: List[ndarray] | None = None, num_eval_weights_for_eval: int = 50)

Trains the agents.
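A hedged usage sketch based on the signatures documented above. The environment id and the origin / reference-point values are illustrative choices (not prescribed defaults), and ref_point is assumed to be the reference point used for hypervolume computation during evaluation.

```python
import numpy as np
import mo_gymnasium as mo_gym
from morl_baselines.multi_policy.pgmorl.pgmorl import PGMORL

env_id = "mo-halfcheetah-v4"  # any 2-objective continuous-control environment
agent = PGMORL(env_id=env_id, origin=np.array([0.0, -500.0]), log=False)

eval_env = mo_gym.make(env_id)
agent.train(
    total_timesteps=1_000_000,
    eval_env=eval_env,
    ref_point=np.array([0.0, -500.0]),  # assumed reference point for hypervolume computation
)
```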