MORL/D

Multi-Objective Reinforcement Learning based on Decomposition (MORL/D). The idea of this framework is to decompose the multi-objective problem into a set of single-objective problems, each defined by a weight vector and a scalarization function. Each single-objective problem is then solved by a single-objective RL algorithm (or a close variant). Several mechanisms, such as sharing a replay buffer or transferring policy parameters between neighboring problems, can improve sample efficiency compared to solving each single-objective problem sequentially.

See the paper Multi-Objective Reinforcement Learning based on Decomposition for more details.
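As a minimal illustration of the decomposition idea (not part of the MORL-Baselines API; the helper names are hypothetical), the sketch below scalarizes a reward vector with the weighted-sum and Tchebycheff functions for a fixed weight vector. Each weight vector defines one single-objective problem that MORL/D assigns to one policy in its population.

import numpy as np

def weighted_sum(reward: np.ndarray, weights: np.ndarray) -> float:
    # Linear scalarization ("ws"): project the reward vector onto the weight vector.
    return float(np.dot(weights, reward))

def tchebycheff(reward: np.ndarray, weights: np.ndarray, utopian: np.ndarray) -> float:
    # Tchebycheff scalarization ("tch"): penalize the largest weighted gap to a utopian point.
    return float(-np.max(weights * np.abs(utopian - reward)))

reward = np.array([0.7, 0.2])
weights = np.array([0.5, 0.5])
print(weighted_sum(reward, weights))                        # 0.45
print(tchebycheff(reward, weights, np.array([1.0, 1.0])))   # -0.4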

class morl_baselines.multi_policy.morld.morld.MORLD(env: ~gymnasium.core.Env, scalarization_method: str = 'ws', evaluation_mode: str = 'ser', policy_name: str = 'MOSAC', policy_args: dict = {}, gamma: float = 0.995, pop_size: int = 6, seed: int = 42, rng: ~numpy.random._generator.Generator | None = None, exchange_every: int = 40000, neighborhood_size: int = 1, dist_metric: ~typing.Callable[[~numpy.ndarray, ~numpy.ndarray], float] = <function MORLD.<lambda>>, shared_buffer: bool = False, sharing_mechanism: ~typing.List[str] = [], update_passes: int = 10, weight_init_method: str = 'uniform', weight_adaptation_method: str | None = None, project_name: str = 'MORL-Baselines', experiment_name: str = 'MORL-D', wandb_entity: str | None = None, log: bool = True, device: ~torch.device | str = 'auto')

MORL/D implementation, decomposition based technique for MORL.

Initializes MORL/D.

Parameters:
  • env – environment

  • scalarization_method – scalarization method to apply: “ws” (weighted sum) or “tch” (Tchebycheff).

  • evaluation_mode – “esr” (expected scalarized return) or “ser” (scalarized expected return), used for the evaluation environment.

  • policy_name – name of the underlying policy to use: “MOSAC” is supported; EUPG can easily be adapted.

  • policy_args – arguments for the policy

  • gamma – discount factor.

  • pop_size – size of the population (number of policies).

  • seed – seed for RNG

  • rng – RNG

  • exchange_every – number of timesteps between information exchanges among policies.

  • neighborhood_size – size of the neighborhood, in [0, pop_size).

  • dist_metric – distance metric between weight vectors to determine neighborhood

  • shared_buffer – whether buffer should be shared or not

  • sharing_mechanism – list of sharing mechanisms to apply; only “transfer” is supported for now.

  • update_passes – number of times to update all policies after sampling from one policy.

  • weight_init_method – weight initialization method: “uniform” or “random”.

  • weight_adaptation_method – weight adaptation method: “PSA” or None.

  • project_name – wandb project name.

  • experiment_name – wandb experiment name.

  • wandb_entity – wandb entity for logging.

  • log – whether to log to wandb.

  • device – torch device
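A minimal instantiation sketch is shown below. It assumes mo-gymnasium is installed and that “mo-hopper-v4” is available as a continuous-action multi-objective environment; substitute any environment id from your installation. The keyword values simply mirror the defaults documented above.

import mo_gymnasium as mo_gym
from morl_baselines.multi_policy.morld.morld import MORLD

# Environment id is an assumption; any continuous-action multi-objective
# environment compatible with MOSAC should work here.
env = mo_gym.make("mo-hopper-v4")

agent = MORLD(
    env=env,
    scalarization_method="ws",       # weighted sum; use "tch" for Tchebycheff
    evaluation_mode="ser",
    policy_name="MOSAC",
    pop_size=6,
    exchange_every=40_000,
    sharing_mechanism=["transfer"],  # policy transfer between neighbors
    weight_init_method="uniform",
    log=False,                       # set True to log to wandb
)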

get_config() dict

Generates a dictionary of the algorithm's parameter configuration.

Returns:

dict – Config
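A brief usage sketch, continuing the instantiation example above:

config = agent.get_config()   # plain dict of hyperparameters
print(config)                 # e.g., for logging or reproducibility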

train(total_timesteps: int, eval_env: Env, ref_point: ndarray, known_pareto_front: List[ndarray] | None = None, num_eval_episodes_for_front: int = 5, num_eval_weights_for_eval: int = 50, reset_num_timesteps: bool = False)

Trains the algorithm.

Parameters:
  • total_timesteps – total number of timesteps

  • eval_env – evaluation environment

  • ref_point – reference point for the hypervolume metric

  • known_pareto_front – optimal Pareto front for the problem, if known.

  • num_eval_episodes_for_front – number of episodes for each policy evaluation

  • num_eval_weights_for_eval (int) – number of weights used when evaluating the Pareto front, e.g., for computing expected utility.

  • reset_num_timesteps – whether to reset the number of timesteps or not
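A hedged training sketch, continuing the example above. The reference point must be dominated by all attainable returns and its dimensionality must match the environment's reward dimension; the values below are placeholders, not tuned ones.

import numpy as np
import mo_gymnasium as mo_gym

eval_env = mo_gym.make("mo-hopper-v4")          # separate evaluation environment
ref_point = np.array([-100.0, -100.0, -100.0])  # placeholder; match the reward dimension

agent.train(
    total_timesteps=1_000_000,
    eval_env=eval_env,
    ref_point=ref_point,
    num_eval_episodes_for_front=5,
    num_eval_weights_for_eval=50,
)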