GPI-Prioritized Dyna¶
- class morl_baselines.multi_policy.gpi_pd.gpi_pd.GPIPD(env, learning_rate: float = 0.0003, initial_epsilon: float = 0.01, final_epsilon: float = 0.01, epsilon_decay_steps: int | None = None, tau: float = 1.0, target_net_update_freq: int = 1000, buffer_size: int = 1000000, net_arch: ~typing.List = [256, 256, 256, 256], num_nets: int = 2, batch_size: int = 128, learning_starts: int = 100, gradient_updates: int = 20, gamma: float = 0.99, max_grad_norm: float | None = None, use_gpi: bool = True, dyna: bool = True, per: bool = True, gpi_pd: bool = True, alpha_per: float = 0.6, min_priority: float = 0.01, drop_rate: float = 0.01, layer_norm: bool = True, dynamics_normalize_inputs: bool = False, dynamics_uncertainty_threshold: float = 1.5, dynamics_train_freq: ~typing.Callable = <function GPIPD.<lambda>>, dynamics_rollout_len: int = 1, dynamics_rollout_starts: int = 5000, dynamics_rollout_freq: int = 250, dynamics_rollout_batch_size: int = 25000, dynamics_buffer_size: int = 100000, dynamics_net_arch: ~typing.List = [256, 256, 256], dynamics_ensemble_size: int = 5, dynamics_num_elites: int = 2, real_ratio: float = 0.5, project_name: str = 'MORL-Baselines', experiment_name: str = 'GPI-PD', wandb_entity: str | None = None, log: bool = True, seed: int | None = None, device: ~torch.device | str = 'auto')¶
GPI-PD Algorithm.
Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization. Lucas N. Alegre, Ana L. C. Bazzan, Diederik M. Roijers, Ann Nowé, Bruno C. da Silva. AAMAS 2023. Paper: https://arxiv.org/abs/2301.07784
Initialize the GPI-PD algorithm. A minimal usage sketch follows the parameter list below.
- Parameters:
env – The environment to learn from.
learning_rate – The learning rate.
initial_epsilon – The initial epsilon value.
final_epsilon – The final epsilon value.
epsilon_decay_steps – The number of steps to decay epsilon.
tau – The soft update coefficient.
target_net_update_freq – The target network update frequency.
buffer_size – The size of the replay buffer.
net_arch – The network architecture.
num_nets – The number of Q-networks.
batch_size – The batch size.
learning_starts – The number of steps before learning starts.
gradient_updates – The number of gradient updates per step.
gamma – The discount factor.
max_grad_norm – The maximum gradient norm.
use_gpi – Whether to use Generalized Policy Improvement (GPI) for action selection.
dyna – Whether to use Dyna (model-based experience augmentation).
per – Whether to use Prioritized Experience Replay (PER).
gpi_pd – Whether to use GPI-PD (GPI-based experience prioritization).
alpha_per – The alpha parameter for PER.
min_priority – The minimum priority for PER.
drop_rate – The dropout rate.
layer_norm – Whether to use layer normalization.
dynamics_normalize_inputs – Whether to normalize inputs to the dynamics model.
dynamics_uncertainty_threshold – The uncertainty threshold for the dynamics model.
dynamics_train_freq – The dynamics model training frequency.
dynamics_rollout_len – The rollout length for the dynamics model.
dynamics_rollout_starts – The number of steps before the first rollout.
dynamics_rollout_freq – The rollout frequency.
dynamics_rollout_batch_size – The rollout batch size.
dynamics_buffer_size – The size of the dynamics model buffer.
dynamics_net_arch – The network architecture for the dynamics model.
dynamics_ensemble_size – The ensemble size for the dynamics model.
dynamics_num_elites – The number of elites for the dynamics model.
real_ratio – The ratio of real (as opposed to model-generated) transitions in each sampled batch.
project_name – The name of the project.
experiment_name – The name of the experiment.
wandb_entity – The name of the wandb entity.
log – Whether to log metrics (e.g., to wandb).
seed – The seed for random number generators.
device – The device to use.
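For orientation, here is a minimal instantiation sketch. It assumes mo-gymnasium is installed; the environment id "minecart-v0" and the chosen keyword values are illustrative placeholders, not recommendations:

    import mo_gymnasium as mo_gym
    from morl_baselines.multi_policy.gpi_pd.gpi_pd import GPIPD

    # A multi-objective Gymnasium environment with vector-valued rewards
    # ("minecart-v0" is an assumption used only for illustration).
    env = mo_gym.make("minecart-v0")

    agent = GPIPD(
        env,
        gradient_updates=20,
        use_gpi=True,   # GPI action selection
        dyna=True,      # model-based (Dyna) experience augmentation
        per=True,       # prioritized experience replay
        gpi_pd=True,    # GPI-based experience prioritization
        log=False,
        seed=42,
    )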
- eval(obs: ndarray, w: ndarray) → int ¶
Select an action for the given obs and weight vector.
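A hedged usage sketch, continuing the instantiation example above; the weight vector below is a placeholder whose length must match the number of objectives:

    import numpy as np

    obs, _ = env.reset(seed=0)
    w = np.array([0.5, 0.3, 0.2], dtype=np.float32)  # placeholder preference weights
    action = agent.eval(obs, w)  # greedy action for this preference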
- get_config()¶
Return the configuration of the agent.
- gpi_action(obs: Tensor, w: Tensor, return_policy_index=False, include_w=False)¶
Select an action using GPI.
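For reference, the GPI action follows the definition in the paper linked above (notation lightly simplified): given Q-vectors q_{w'}(s, a) learned for each weight vector w' in the current support set M, the action chosen for preference w is

    \pi_{\mathrm{GPI}}(s; \mathbf{w}) \in \arg\max_{a} \; \max_{\mathbf{w}' \in \mathcal{M}} \; \mathbf{q}_{\mathbf{w}'}(s, a)^{\top} \mathbf{w}

i.e., the action with the largest scalarized value across all policies in the support set.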
- load(path, load_replay_buffer=True)¶
Load the model parameters and the replay buffer.
- max_action(obs: Tensor, w: Tensor) → int ¶
Select the greedy action.
- save(save_replay_buffer=True, save_dir='weights/', filename=None)¶
Save the model parameters and the replay buffer.
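A minimal save/load round trip, continuing the sketch above; the filename and the on-disk checkpoint suffix are assumptions:

    agent.save(save_dir="weights/", filename="gpi_pd_minecart", save_replay_buffer=True)
    # Later, restore the parameters (and optionally the replay buffer); the ".tar"
    # suffix is an assumption about how save() names the checkpoint file.
    agent.load(path="weights/gpi_pd_minecart.tar", load_replay_buffer=True)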
- set_weight_support(weight_list: List[ndarray])¶
Set the weight support set.
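set_weight_support expects a list of weight vectors; the three-objective corner weights below are placeholders:

    import numpy as np

    weight_support = [
        np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0]),
        np.array([0.0, 0.0, 1.0]),
    ]
    agent.set_weight_support(weight_support)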
- train(total_timesteps: int, eval_env, ref_point: ndarray, known_pareto_front: List[ndarray] | None = None, num_eval_weights_for_front: int = 100, num_eval_episodes_for_front: int = 5, num_eval_weights_for_eval: int = 50, timesteps_per_iter: int = 10000, weight_selection_algo: str = 'gpi-ls', eval_freq: int = 1000, eval_mo_freq: int = 10000, checkpoints: bool = True)¶
Train the agent. A usage sketch follows the parameter list below.
- Parameters:
total_timesteps (int) – Number of timesteps to train for.
eval_env (gym.Env) – Environment to evaluate on.
ref_point (np.ndarray) – Reference point for hypervolume calculation.
known_pareto_front (Optional[List[np.ndarray]]) – Optimal Pareto front if known.
num_eval_weights_for_front – Number of weights to evaluate for the Pareto front.
num_eval_episodes_for_front – Number of episodes to run when evaluating the policy.
num_eval_weights_for_eval (int) – Number of weights to use when evaluating the Pareto front, e.g., for computing expected utility.
timesteps_per_iter (int) – Number of timesteps to train for per iteration.
weight_selection_algo (str) – Weight selection algorithm to use.
eval_freq (int) – Number of timesteps between evaluations.
eval_mo_freq (int) – Number of timesteps between multi-objective evaluations.
checkpoints (bool) – Whether to save checkpoints.
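Continuing the instantiation sketch above (so mo_gym, np, and agent are already defined), a hedged call to train(); the evaluation environment id and the reference point are placeholders that must match the environment's number of objectives:

    eval_env = mo_gym.make("minecart-v0")        # assumed environment id
    ref_point = np.array([-1.0, -1.0, -200.0])   # placeholder hypervolume reference point

    agent.train(
        total_timesteps=100_000,
        eval_env=eval_env,
        ref_point=ref_point,
        weight_selection_algo="gpi-ls",  # default weight-selection strategy
        timesteps_per_iter=10_000,
        checkpoints=False,
    )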
- train_iteration(total_timesteps: int, weight: ndarray, weight_support: List[ndarray], change_w_every_episode: bool = True, reset_num_timesteps: bool = True, eval_env: Env | None = None, eval_freq: int = 1000, reset_learning_starts: bool = False)¶
Train the agent for one iteration.
- Parameters:
total_timesteps (int) – Number of timesteps to train for.
weight (np.ndarray) – Weight vector.
weight_support (List[np.ndarray]) – Weight support set.
change_w_every_episode (bool) – Whether to change the weight vector at the end of each episode.
reset_num_timesteps (bool) – Whether to reset the number of timesteps.
eval_env (Optional[gym.Env]) – Environment to evaluate on.
eval_freq (int) – Number of timesteps between evaluations.
reset_learning_starts (bool) – Whether to reset the learning starts counter.
- update(weight: Tensor)¶
Update the parameters of the networks.