GPI-Prioritized Dyna

class morl_baselines.multi_policy.gpi_pd.gpi_pd.GPIPD(env, learning_rate: float = 0.0003, initial_epsilon: float = 0.01, final_epsilon: float = 0.01, epsilon_decay_steps: int | None = None, tau: float = 1.0, target_net_update_freq: int = 1000, buffer_size: int = 1000000, net_arch: ~typing.List = [256, 256, 256, 256], num_nets: int = 2, batch_size: int = 128, learning_starts: int = 100, gradient_updates: int = 20, gamma: float = 0.99, max_grad_norm: float | None = None, use_gpi: bool = True, dyna: bool = True, per: bool = True, gpi_pd: bool = True, alpha_per: float = 0.6, min_priority: float = 0.01, drop_rate: float = 0.01, layer_norm: bool = True, dynamics_normalize_inputs: bool = False, dynamics_uncertainty_threshold: float = 1.5, dynamics_train_freq: ~typing.Callable = <function GPIPD.<lambda>>, dynamics_rollout_len: int = 1, dynamics_rollout_starts: int = 5000, dynamics_rollout_freq: int = 250, dynamics_rollout_batch_size: int = 25000, dynamics_buffer_size: int = 100000, dynamics_net_arch: ~typing.List = [256, 256, 256], dynamics_ensemble_size: int = 5, dynamics_num_elites: int = 2, real_ratio: float = 0.5, project_name: str = 'MORL-Baselines', experiment_name: str = 'GPI-PD', wandb_entity: str | None = None, log: bool = True, seed: int | None = None, device: ~torch.device | str = 'auto')

GPI-PD Algorithm.

Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization.
Lucas N. Alegre, Ana L. C. Bazzan, Diederik M. Roijers, Ann Nowé, and Bruno C. da Silva. AAMAS 2023.
Paper: https://arxiv.org/abs/2301.07784

Initialize the GPI-PD algorithm. A construction sketch follows the parameter list below.

Parameters:
  • env – The environment to learn from.

  • learning_rate – The learning rate.

  • initial_epsilon – The initial epsilon value.

  • final_epsilon – The final epsilon value.

  • epsilon_decay_steps – The number of steps to decay epsilon.

  • tau – The soft update coefficient.

  • target_net_update_freq – The target network update frequency.

  • buffer_size – The size of the replay buffer.

  • net_arch – The network architecture.

  • num_nets – The number of Q-networks.

  • batch_size – The batch size.

  • learning_starts – The number of steps before learning starts.

  • gradient_updates – The number of gradient updates per step.

  • gamma – The discount factor.

  • max_grad_norm – The maximum gradient norm.

  • use_gpi – Whether to use Generalized Policy Improvement (GPI) for action selection.

  • dyna – Whether to use Dyna (model-based experience augmentation).

  • per – Whether to use Prioritized Experience Replay (PER).

  • gpi_pd – Whether to use GPI-PD prioritization.

  • alpha_per – The alpha parameter for PER.

  • min_priority – The minimum priority for PER.

  • drop_rate – The dropout rate.

  • layer_norm – Whether to use layer normalization.

  • dynamics_normalize_inputs – Whether to normalize inputs to the dynamics model.

  • dynamics_uncertainty_threshold – The uncertainty threshold for the dynamics model.

  • dynamics_train_freq – The dynamics model training frequency.

  • dynamics_rollout_len – The rollout length for the dynamics model.

  • dynamics_rollout_starts – The number of steps before the first rollout.

  • dynamics_rollout_freq – The rollout frequency.

  • dynamics_rollout_batch_size – The rollout batch size.

  • dynamics_buffer_size – The size of the dynamics model buffer.

  • dynamics_net_arch – The network architecture for the dynamics model.

  • dynamics_ensemble_size – The ensemble size for the dynamics model.

  • dynamics_num_elites – The number of elites for the dynamics model.

  • real_ratio – The ratio of real transitions to sample.

  • project_name – The name of the project.

  • experiment_name – The name of the experiment.

  • wandb_entity – The name of the wandb entity.

  • log – Whether to log.

  • seed – The seed for random number generators.

  • device – The device to use.
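
A minimal construction sketch, assuming the mo_gymnasium package and its minecart-v0 environment (both assumptions used only for illustration; any multi-objective Gymnasium environment works):

    import mo_gymnasium as mo_gym

    from morl_baselines.multi_policy.gpi_pd.gpi_pd import GPIPD

    # Illustrative multi-objective environment (assumes mo-gymnasium is installed).
    env = mo_gym.make("minecart-v0")

    agent = GPIPD(
        env,
        gradient_updates=10,  # fewer gradient updates per step than the default of 20
        dyna=True,            # learn a dynamics model and augment training with simulated data
        per=True,             # prioritized experience replay
        log=False,            # disable wandb logging for this sketch
    )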

eval(obs: ndarray, w: ndarray) → int

Select an action for the given observation and weight vector.
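
A usage sketch, continuing from the construction example above and assuming a three-objective environment such as minecart-v0:

    import numpy as np

    obs, _ = env.reset()
    w = np.array([0.7, 0.2, 0.1], dtype=np.float32)  # illustrative weight vector
    action = agent.eval(obs, w)                       # action chosen for this weighting
    obs, vec_reward, terminated, truncated, info = env.step(action)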

get_config()

Return the configuration of the agent.

gpi_action(obs: Tensor, w: Tensor, return_policy_index=False, include_w=False)

Select an action using GPI.
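
A sketch of calling GPI action selection directly; that the inputs must be tensors on the agent's device (agent.device) and that return_policy_index=True yields an (action, policy index) pair are assumptions about the implementation:

    import torch as th

    obs_t = th.as_tensor(obs, dtype=th.float32).to(agent.device)  # agent.device is assumed
    w_t = th.as_tensor(w, dtype=th.float32).to(agent.device)
    # Assumed return: the action and the index of the support policy whose
    # Q-values produced it.
    action, policy_index = agent.gpi_action(obs_t, w_t, return_policy_index=True)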

load(path, load_replay_buffer=True)

Load the model parameters and the replay buffer.

max_action(obs: Tensor, w: Tensor) → int

Select the greedy action.

save(save_replay_buffer=True, save_dir='weights/', filename=None)

Save the model parameters and the replay buffer.
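
A checkpointing sketch; the directory, file name, and the .tar extension used in the load call are assumptions about how save() names its output, not guarantees of the API:

    # Save model parameters and the replay buffer under an illustrative name.
    agent.save(save_replay_buffer=True, save_dir="weights/", filename="gpi_pd_minecart")

    # Restore them later; the exact file name/extension produced by save()
    # depends on the library version, so this path is an assumption.
    agent.load(path="weights/gpi_pd_minecart.tar", load_replay_buffer=True)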

set_weight_support(weight_list: List[ndarray])

Set the weight support set.
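
An illustrative call, assuming the three-objective setting of the earlier sketches; any list of np.ndarray weight vectors can serve as the support set:

    import numpy as np

    weight_support = [
        np.array([1.0, 0.0, 0.0], dtype=np.float32),
        np.array([0.0, 1.0, 0.0], dtype=np.float32),
        np.array([0.0, 0.0, 1.0], dtype=np.float32),
        np.array([1 / 3, 1 / 3, 1 / 3], dtype=np.float32),
    ]
    agent.set_weight_support(weight_support)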

train(total_timesteps: int, eval_env, ref_point: ndarray, known_pareto_front: List[ndarray] | None = None, num_eval_weights_for_front: int = 100, num_eval_episodes_for_front: int = 5, num_eval_weights_for_eval: int = 50, timesteps_per_iter: int = 10000, weight_selection_algo: str = 'gpi-ls', eval_freq: int = 1000, eval_mo_freq: int = 10000, checkpoints: bool = True)

Train the agent. A usage sketch follows the parameter list below.

Parameters:
  • total_timesteps (int) – Number of timesteps to train for.

  • eval_env (gym.Env) – Environment to evaluate on.

  • ref_point (np.ndarray) – Reference point for hypervolume calculation.

  • known_pareto_front (Optional[List[np.ndarray]]) – Optimal Pareto front if known.

  • num_eval_weights_for_front (int) – Number of weights to evaluate for the Pareto front.

  • num_eval_episodes_for_front (int) – Number of episodes to run for each weight when evaluating the Pareto front.

  • num_eval_weights_for_eval (int) – Number of weights to use when evaluating the Pareto front, e.g., for computing expected utility.

  • timesteps_per_iter (int) – Number of timesteps to train for per iteration.

  • weight_selection_algo (str) – Weight selection algorithm to use.

  • eval_freq (int) – Number of timesteps between evaluations.

  • eval_mo_freq (int) – Number of timesteps between multi-objective evaluations.

  • checkpoints (bool) – Whether to save checkpoints.
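
An end-to-end training sketch, continuing from the construction example above; the reference point below is illustrative and must be chosen per environment:

    import numpy as np
    import mo_gymnasium as mo_gym

    eval_env = mo_gym.make("minecart-v0")

    agent.train(
        total_timesteps=100_000,
        eval_env=eval_env,
        ref_point=np.array([0.0, 0.0, -200.0]),  # illustrative, environment-specific
        num_eval_weights_for_front=100,
        timesteps_per_iter=10_000,
        weight_selection_algo="gpi-ls",
    )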

train_iteration(total_timesteps: int, weight: ndarray, weight_support: List[ndarray], change_w_every_episode: bool = True, reset_num_timesteps: bool = True, eval_env: Env | None = None, eval_freq: int = 1000, reset_learning_starts: bool = False)

Train the agent for one iteration. A usage sketch follows the parameter list below.

Parameters:
  • total_timesteps (int) – Number of timesteps to train for.

  • weight (np.ndarray) – Weight vector.

  • weight_support (List[np.ndarray]) – Weight support set.

  • change_w_every_episode (bool) – Whether to change the weight vector at the end of each episode.

  • reset_num_timesteps (bool) – Whether to reset the number of timesteps.

  • eval_env (Optional[gym.Env]) – Environment to evaluate on.

  • eval_freq (int) – Number of timesteps between evaluations.

  • reset_learning_starts (bool) – Whether to reset the learning_starts counter.
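
A direct-call sketch; in normal use train() drives this method, so the values below are purely illustrative (eval_env is reused from the train() sketch above):

    import numpy as np

    w = np.array([0.5, 0.3, 0.2], dtype=np.float32)  # illustrative weight vector
    agent.set_weight_support([w])
    agent.train_iteration(
        total_timesteps=10_000,
        weight=w,
        weight_support=[w],
        change_w_every_episode=False,
        eval_env=eval_env,
        eval_freq=1_000,
    )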

update(weight: Tensor)

Update the parameters of the networks.