PGMORL

Some of the code for this algorithm has been adapted from the original implementation provided by the paper's authors on GitHub.

Applicability and limitations

  • Supports continuous observation and continuous action spaces.

  • Limited to 2 objectives for now.

  • The post-processing phase (Pareto analysis stage) has not been implemented yet.

Principle

PGMORL

The principle of this algorithm is to rely on multiple PPO agents to search for various trade-offs. It maintains a population of PPO agents along with their current performances. At each iteration, the algorithm selects the most promising agents in the population and assigns each of them a weight vector used for further training. The weight vectors are generated based on a prediction model fitted on historical data gathered during the learning process.
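For intuition only, here is a toy sketch in Python of that outer loop. Everything in it is hypothetical (the stubbed evaluate and train_one_iteration functions, the dictionary agents, the random weight choice); it mirrors the control flow described above, not the library's actual implementation, where weights are chosen by the prediction model presented below.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(agent):
    # Stub: a 2-objective evaluation of the agent (hypothetical).
    return agent["skill"] + rng.normal(scale=0.1, size=2)

def train_one_iteration(agent, weight):
    # Stub: training with a weight nudges the agent toward that trade-off (hypothetical).
    agent["skill"] += 0.5 * weight

population = [{"skill": np.zeros(2)} for _ in range(6)]   # toy stand-ins for PPO agents
history = []  # (weight, eval_before, eval_after) samples that would feed the prediction model

for generation in range(10):
    # Select a few promising agents; the real algorithm uses a Pareto/diversity-based selection.
    ranked = sorted(population, key=lambda a: evaluate(a).sum(), reverse=True)
    for agent in ranked[:3]:
        # In PGMORL the weight comes from the prediction model; here it is drawn at random.
        weight = rng.dirichlet(np.ones(2))
        before = evaluate(agent)
        train_one_iteration(agent, weight)
        after = evaluate(agent)
        history.append((weight, before, after))
```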

MOPPO

Our implementation of multi-objective PPO is essentially a refactor of CleanRL's PPO. The main difference is that the value network returns a multi-objective value, which is then scalarized using a weighted sum with the given weight vector.
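To make the scalarization concrete, here is a minimal sketch with illustrative values (the tensor names are placeholders, not the library's internals):

```python
import torch

value_vector = torch.tensor([1.5, -0.3])   # multi-objective value estimate V(s), one entry per objective
weights = torch.tensor([0.7, 0.3])         # weight vector assigned to this MOPPO worker
scalarized_value = torch.dot(weights, value_vector)  # weighted-sum scalarization fed to the usual PPO loss
print(scalarized_value)  # roughly 0.96
```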

Note: it might be possible to enhance this algorithm by relying on something other than PPO.

class morl_baselines.single_policy.ser.mo_ppo.MOPPO(id: int, networks: MOPPONet, weights: ndarray, envs: SyncVectorEnv, log: bool = False, steps_per_iteration: int = 2048, num_minibatches: int = 32, update_epochs: int = 10, learning_rate: float = 0.0003, gamma: float = 0.995, anneal_lr: bool = False, clip_coef: float = 0.2, ent_coef: float = 0.0, vf_coef: float = 0.5, clip_vloss: bool = True, max_grad_norm: float = 0.5, norm_adv: bool = True, target_kl: float | None = None, gae: bool = True, gae_lambda: float = 0.95, device: device | str = 'auto', seed: int = 42, rng: Generator | None = None)

PPO modified to use a multi-objective value network (returning a vector of values) and weighted-sum scalarization.

This code has been adapted from the PPO implementation of CleanRL: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py

Multi-objective PPO.

Parameters:
  • id – Policy ID

  • networks – Actor-Critic networks

  • weights – Weights of the objectives

  • envs – Vectorized environments

  • log – Whether to log

  • steps_per_iteration – Number of steps per iteration

  • num_minibatches – Number of minibatches

  • update_epochs – Number of epochs to update the network

  • learning_rate – Learning rate

  • gamma – Discount factor

  • anneal_lr – Whether to anneal the learning rate

  • clip_coef – PPO clipping coefficient

  • ent_coef – Entropy coefficient

  • vf_coef – Value function coefficient

  • clip_vloss – Whether to clip the value loss

  • max_grad_norm – Maximum gradient norm

  • norm_adv – Whether to normalize the advantage

  • target_kl – Target KL divergence

  • gae – Whether to use Generalized Advantage Estimation

  • gae_lambda – GAE lambda

  • device – Device to use

  • seed – Random seed

  • rng – Random number generator

change_weights(new_weights: ndarray)

Change the weights of the scalarization function.

Parameters:

new_weights – New weights to apply.

eval(obs: ndarray, w)

Returns the best action to perform for the given observation and weight vector.

Returns:

action as a numpy array (continuous actions)

train(start_time, current_iteration: int, max_iterations: int)

A training iteration: trains MOPPO for self.steps_per_iteration * self.num_envs environment steps.

Parameters:
  • start_time – time.time() when the training started

  • current_iteration – current iteration number

  • max_iterations – maximum number of iterations

update()

Update algorithm’s parameters (e.g. using experiences from the buffer).

Weight generator - prediction model

See section 3.3 of the paper for more details.

class morl_baselines.multi_policy.pgmorl.pgmorl.PerformancePredictor(neighborhood_threshold: float = 0.1, sigma: float = 0.03, A_bound_min: float = 1.0, A_bound_max: float = 500.0, f_scale: float = 20.0)

Performance prediction model.

Stores the performance deltas along with the weights used after each generation. These stored samples are then used in a regression to predict the performance change obtained by training a given policy with a given weight. Predicts: (weight, performance) -> performance delta.

Initialize the performance predictor.

Parameters:
  • neighborhood_threshold – The threshold for the neighborhood of an evaluation.

  • sigma – The sigma value for the prediction model.

  • A_bound_min – The minimum value for the A parameter of the prediction model.

  • A_bound_max – The maximum value for the A parameter of the prediction model.

  • f_scale – The scale value for the prediction model.

add(weight: ndarray, eval_before_pg: ndarray, eval_after_pg: ndarray) → None

Add a new sample to the performance predictor.

Parameters:
  • weight – The weight used to train the policy.

  • eval_before_pg – The evaluation before training the policy.

  • eval_after_pg – The evaluation after training the policy.

Returns:

None

predict_next_evaluation(weight_candidate: ndarray, policy_eval: ndarray) → Tuple[ndarray, ndarray]

Predict the next evaluation of the policy.

Uses a subset of the collected data (determined by the neighborhood threshold) to predict the performance obtained after training the policy whose current evaluation is policy_eval with the candidate weight.

Parameters:
  • weight_candidate – weight candidate

  • policy_eval – current evaluation of the policy

Returns:

the delta prediction, along with the predicted next evaluations
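For intuition, the following simplified stand-in shows the kind of mapping the predictor learns: store (weight, evaluation delta) samples and fit a regression to estimate how a candidate weight would move a policy's evaluation. The real model restricts itself to samples near the candidate and fits the bounded non-linear function from Section 3.3 of the paper; a plain least-squares fit with made-up numbers is used here only for illustration.

```python
import numpy as np

# Observed training outcomes (illustrative numbers): weight used and resulting evaluation delta.
weights_seen = np.array([[0.9, 0.1], [0.7, 0.3], [0.5, 0.5]])
deltas_seen = np.array([[12.0, -1.0], [8.0, 1.5], [4.0, 3.0]])   # eval_after - eval_before

# Simple least-squares regression: delta ~ weight @ coef (one column of coefficients per objective).
coef, *_ = np.linalg.lstsq(weights_seen, deltas_seen, rcond=None)

candidate_weight = np.array([0.6, 0.4])
predicted_delta = candidate_weight @ coef
current_eval = np.array([60.0, 9.0])
predicted_next_eval = current_eval + predicted_delta   # predicted evaluation after training with candidate_weight
```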

PGMORL

class morl_baselines.multi_policy.pgmorl.pgmorl.PGMORL(env_id: str, origin: ndarray, num_envs: int = 4, pop_size: int = 6, warmup_iterations: int = 80, steps_per_iteration: int = 2048, evolutionary_iterations: int = 20, num_weight_candidates: int = 7, num_performance_buffer: int = 100, performance_buffer_size: int = 2, min_weight: float = 0.0, max_weight: float = 1.0, delta_weight: float = 0.2, env=None, gamma: float = 0.995, project_name: str = 'MORL-baselines', experiment_name: str = 'PGMORL', wandb_entity: str | None = None, seed: int | None = None, log: bool = True, net_arch: List = [64, 64], num_minibatches: int = 32, update_epochs: int = 10, learning_rate: float = 0.0003, anneal_lr: bool = False, clip_coef: float = 0.2, ent_coef: float = 0.0, vf_coef: float = 0.5, clip_vloss: bool = True, max_grad_norm: float = 0.5, norm_adv: bool = True, target_kl: float | None = None, gae: bool = True, gae_lambda: float = 0.95, device: device | str = 'auto', group: str | None = None)

Prediction-Guided Multi-Objective Reinforcement Learning.

Reference: J. Xu, Y. Tian, P. Ma, D. Rus, S. Sueda, and W. Matusik, “Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control,” in Proceedings of the 37th International Conference on Machine Learning, Nov. 2020, pp. 10607–10616. Available: https://proceedings.mlr.press/v119/xu20h.html

Paper: https://people.csail.mit.edu/jiex/papers/PGMORL/paper.pdf Supplementary materials: https://people.csail.mit.edu/jiex/papers/PGMORL/supp.pdf

Initializes the PGMORL agent.

Parameters:
  • env_id – environment id

  • origin – reference point to make the objectives positive in the performance buffer

  • num_envs – number of environments to use (VectorizedEnvs)

  • pop_size – population size

  • warmup_iterations – number of warmup iterations

  • steps_per_iteration – number of steps per iteration

  • evolutionary_iterations – number of evolutionary iterations

  • num_weight_candidates – number of weight candidates

  • num_performance_buffer – number of performance buffers

  • performance_buffer_size – size of the performance buffers

  • min_weight – minimum weight

  • max_weight – maximum weight

  • delta_weight – delta weight for weight generation

  • env – environment

  • gamma – discount factor

  • project_name – name of the project. Usually MORL-baselines.

  • experiment_name – name of the experiment. Usually PGMORL.

  • wandb_entity – wandb entity, defaults to None.

  • seed – seed for the random number generator

  • log – whether to log the results

  • net_arch – number of units per layer

  • num_minibatches – number of minibatches

  • update_epochs – number of update epochs

  • learning_rate – learning rate

  • anneal_lr – whether to anneal the learning rate

  • clip_coef – coefficient for the policy gradient clipping

  • ent_coef – coefficient for the entropy term

  • vf_coef – coefficient for the value function loss

  • clip_vloss – whether to clip the value function loss

  • max_grad_norm – maximum gradient norm

  • norm_adv – whether to normalize the advantages

  • target_kl – target KL divergence

  • gae – whether to use generalized advantage estimation

  • gae_lambda – lambda parameter for GAE

  • device – device on which the code should run

  • group – The wandb group to use for logging.

get_config() → dict

Generates a dictionary of the algorithm's parameter configuration.

Returns:

dict – Config

train(total_timesteps: int, eval_env: Env, ref_point: ndarray, known_pareto_front: List[ndarray] | None = None, num_eval_weights_for_eval: int = 50)

Trains the agents.
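A hedged usage sketch based on the signatures documented above. The environment id and the origin / reference-point values are illustrative choices (not prescribed defaults), and ref_point is assumed to be the reference point used for hypervolume computation during evaluation.

```python
import numpy as np
import mo_gymnasium as mo_gym
from morl_baselines.multi_policy.pgmorl.pgmorl import PGMORL

env_id = "mo-halfcheetah-v4"  # any 2-objective continuous-control environment
agent = PGMORL(env_id=env_id, origin=np.array([0.0, -500.0]), log=False)

eval_env = mo_gym.make(env_id)
agent.train(
    total_timesteps=1_000_000,
    eval_env=eval_env,
    ref_point=np.array([0.0, -500.0]),  # assumed reference point for hypervolume computation
)
```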