MORL/D¶
Multi-Objective Reinforcement Learning based on Decomposition. The idea of this framework is to decompose the multi-objective problem into a set of single-objective problems, each defined by a scalarization of the reward vector under a given weight vector. Each single-objective problem is then solved by a (nearly) standard single-objective RL algorithm. Several tricks, such as sharing experience between sub-problems, can be applied to improve sample efficiency compared to solving each single-objective problem sequentially and independently.
See the paper Multi-Objective Reinforcement Learning based on Decomposition for more details.
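For intuition, here is a minimal sketch (not the library's exact code) of the two scalarizations that the scalarization_method parameter refers to, “ws” (weighted sum) and “tch” (Tchebycheff). The utopian point and the sign convention below are illustrative assumptions:

```python
import numpy as np

def weighted_sum(vec_return: np.ndarray, weights: np.ndarray) -> float:
    # "ws": linear scalarization of the vector return.
    return float(np.dot(weights, vec_return))

def tchebycheff(vec_return: np.ndarray, weights: np.ndarray, utopian: np.ndarray) -> float:
    # "tch": negated weighted Chebyshev distance to an (assumed) utopian point;
    # maximizing this pulls the policy toward that point.
    return -float(np.max(weights * np.abs(utopian - vec_return)))

# Each sub-problem i gets its own weight vector w_i and is trained with a
# standard single-objective RL algorithm on the scalarized reward.
weights = np.array([0.3, 0.7])
vec_return = np.array([1.0, 2.0])
print(weighted_sum(vec_return, weights))                            # 1.7
print(tchebycheff(vec_return, weights, np.array([3.0, 3.0])))       # -0.7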
- class morl_baselines.multi_policy.morld.morld.MORLD(env: ~gymnasium.core.Env, scalarization_method: str = 'ws', evaluation_mode: str = 'ser', policy_name: str = 'MOSAC', policy_args: dict = {}, gamma: float = 0.995, pop_size: int = 6, seed: int = 42, rng: ~numpy.random._generator.Generator | None = None, exchange_every: int = 40000, neighborhood_size: int = 1, dist_metric: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray], float] = <function MORLD.<lambda>>, shared_buffer: bool = False, sharing_mechanism: ~typing.List[str] = [], update_passes: int = 10, weight_init_method: str = 'uniform', weight_adaptation_method: str | None = None, project_name: str = 'MORL-Baselines', experiment_name: str = 'MORL-D', wandb_entity: str | None = None, log: bool = True, device: ~torch.device | str = 'auto')¶
MORL/D implementation: a decomposition-based technique for MORL.
Initializes MORL/D. A minimal construction sketch is given after the parameter list below.
- Parameters:
env – environment
scalarization_method – scalarization method to apply: “ws” (weighted sum) or “tch” (Tchebycheff).
evaluation_mode – evaluation mode for the evaluation env: “esr” (Expected Scalarized Return) or “ser” (Scalarized Expected Return).
policy_name – name of the underlying policy to use; “MOSAC” is supported, and EUPG can easily be adapted.
policy_args – arguments for the policy
gamma – discount factor
pop_size – size of population
seed – seed for RNG
rng – RNG
exchange_every – exchange trigger (timesteps based)
neighborhood_size – size of the neighborhood, in [0, pop_size)
dist_metric – distance metric between weight vectors to determine neighborhood
shared_buffer – whether buffer should be shared or not
sharing_mechanism – list of sharing mechanisms to apply; only “transfer” is supported for now.
update_passes – number of times to update all policies after sampling from one policy.
weight_init_method – weight initialization method: “uniform” or “random”.
weight_adaptation_method – weight adaptation method. “PSA” or None.
project_name – For wandb logging
experiment_name – For wandb logging
wandb_entity – For wandb logging
log – whether to log to wandb or not
device – torch device
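A minimal construction sketch, assuming a continuous-control mo-gymnasium environment compatible with the default MOSAC policy (the environment id below is an illustrative choice, not a requirement):

```python
import mo_gymnasium as mo_gym
from morl_baselines.multi_policy.morld.morld import MORLD

# Illustrative environment choice; any multi-objective continuous-action
# environment usable by the underlying MOSAC policy should work.
env = mo_gym.make("mo-hopper-v4")
eval_env = mo_gym.make("mo-hopper-v4")

agent = MORLD(
    env=env,
    scalarization_method="ws",      # or "tch"
    evaluation_mode="ser",
    policy_name="MOSAC",
    pop_size=6,
    exchange_every=40_000,
    shared_buffer=True,
    sharing_mechanism=["transfer"],
    weight_init_method="uniform",
    log=False,                      # set True to log to wandb
    device="auto",
)
```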
- get_config() dict ¶
Generates dictionary of the algorithm parameters configuration.
- Returns:
dict – Config
- train(total_timesteps: int, eval_env: Env, ref_point: ndarray, known_pareto_front: List[ndarray] | None = None, num_eval_episodes_for_front: int = 5, num_eval_weights_for_eval: int = 50, reset_num_timesteps: bool = False)¶
Trains the algorithm.
- Parameters:
total_timesteps – total number of timesteps
eval_env – evaluation environment
ref_point – reference point for the hypervolume metric
known_pareto_front – optimal Pareto front for the problem, if known
num_eval_episodes_for_front – number of episodes for each policy evaluation
num_eval_weights_for_eval – number of weights used when evaluating the Pareto front, e.g., for computing expected utility
reset_num_timesteps – whether to reset the number of timesteps or not
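Continuing the construction sketch above, a hedged example of a training call; the reference point, timestep budget, and evaluation settings are placeholders to adapt to the target environment:

```python
import numpy as np

# Placeholder reference point for the hypervolume metric: it should be
# dominated by (worse in every objective than) the returns of interest.
ref_point = np.array([-100.0, -100.0])

agent.train(
    total_timesteps=1_000_000,
    eval_env=eval_env,
    ref_point=ref_point,
    num_eval_episodes_for_front=5,
    num_eval_weights_for_eval=50,
)
```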