Envelope Q-Learning

class morl_baselines.multi_policy.envelope.envelope.Envelope(env, learning_rate: float = 0.0003, initial_epsilon: float = 0.01, final_epsilon: float = 0.01, epsilon_decay_steps: int | None = None, tau: float = 1.0, target_net_update_freq: int = 200, buffer_size: int = 1000000, net_arch: List = [256, 256, 256, 256], batch_size: int = 256, learning_starts: int = 100, gradient_updates: int = 1, gamma: float = 0.99, max_grad_norm: float | None = 1.0, envelope: bool = True, num_sample_w: int = 4, per: bool = True, per_alpha: float = 0.6, initial_homotopy_lambda: float = 0.0, final_homotopy_lambda: float = 1.0, homotopy_decay_steps: int | None = None, project_name: str = 'MORL-Baselines', experiment_name: str = 'Envelope', wandb_entity: str | None = None, log: bool = True, seed: int | None = None, device: device | str = 'auto', group: str | None = None)

Envelope Q-Learning algorithm.

Envelope uses a conditioned network to embed multiple policies, taking the weight vector as an additional input. The main change compared to a scalarized conditioned-network (CN) DQN is the target update. A construction sketch follows the parameter list below. Paper: R. Yang, X. Sun, and K. Narasimhan, “A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation,” arXiv:1908.08342 [cs], Nov. 2019. Available: http://arxiv.org/abs/1908.08342.

Envelope Q-learning algorithm.

Parameters:
  • env – The environment to learn from.

  • learning_rate – The learning rate (alpha).

  • initial_epsilon – The initial epsilon value for epsilon-greedy exploration.

  • final_epsilon – The final epsilon value for epsilon-greedy exploration.

  • epsilon_decay_steps – The number of steps to decay epsilon over.

  • tau – The soft update coefficient (keep in [0, 1]).

  • target_net_update_freq – The frequency with which the target network is updated.

  • buffer_size – The size of the replay buffer.

  • net_arch – The sizes of the hidden layers of the value network.

  • batch_size – The size of the batch to sample from the replay buffer.

  • learning_starts – The number of steps before learning starts, i.e., the agent acts randomly until then.

  • gradient_updates – The number of gradient updates per step.

  • gamma – The discount factor (gamma).

  • max_grad_norm – The maximum norm for the gradient clipping. If None, no gradient clipping is applied.

  • envelope – Whether to use the envelope method.

  • num_sample_w – The number of weight vectors to sample for the envelope target.

  • per – Whether to use prioritized experience replay.

  • per_alpha – The alpha parameter for prioritized experience replay.

  • initial_homotopy_lambda – The initial value of the homotopy parameter for homotopy optimization.

  • final_homotopy_lambda – The final value of the homotopy parameter.

  • homotopy_decay_steps – The number of steps to decay the homotopy parameter over.

  • project_name – The name of the project, for wandb logging.

  • experiment_name – The name of the experiment, for wandb logging.

  • wandb_entity – The entity of the project, for wandb logging.

  • log – Whether to log to wandb.

  • seed – The seed for the random number generator.

  • device – The device to use for training.

  • group – The wandb group to use for logging.
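
A minimal construction sketch (not an official example): it assumes mo_gymnasium is installed and exposes the multi-objective environment "deep-sea-treasure-v0"; the import path for Envelope is the one documented above.

    import mo_gymnasium as mo_gym
    from morl_baselines.multi_policy.envelope.envelope import Envelope

    env = mo_gym.make("deep-sea-treasure-v0")  # any multi-objective Gymnasium env
    agent = Envelope(
        env,
        learning_rate=3e-4,
        batch_size=256,
        log=False,  # skip wandb logging for a quick local run
    )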

act(obs: Tensor, w: Tensor) int

Epsilon-greedily select an action given an observation and weight.

Parameters:
  • obs – observation

  • w – weight vector

Returns: an integer representing the action to take.
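
A hedged usage sketch: it assumes the agent exposes its torch device as agent.device and that the weight vector has one entry per objective (two here, purely for illustration).

    import torch

    obs, _ = env.reset()
    w = torch.tensor([0.7, 0.3], dtype=torch.float32, device=agent.device)
    obs_t = torch.as_tensor(obs, dtype=torch.float32, device=agent.device)
    action = agent.act(obs_t, w)  # int, epsilon-greedy w.r.t. w·Q(obs, ., w)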

ddqn_target(obs: Tensor, w: Tensor) Tensor

Computes the double DQN target for the given observation and weight.

Parameters:
  • obs – observation

  • w – weight vector.

Returns: the DQN target.
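
For reference, a sketch of the vector-valued double DQN target in standard notation (not copied from the implementation), where Q_theta is the online network, Q_theta- the target network, and all Q-values are vectors with one component per objective:

    y = \mathbf{r} + \gamma \, Q_{\theta^-}\!\left(s',\ \arg\max_{a}\ \mathbf{w}^\top Q_{\theta}(s', a, \mathbf{w}),\ \mathbf{w}\right)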

envelope_target(obs: Tensor, w: Tensor, sampled_w: Tensor) Tensor

Computes the envelope target for the given observation and weight.

Parameters:
  • obs – current observation.

  • w – current weight vector.

  • sampled_w – set of sampled weight vectors (must contain more than one vector).

Returns: the envelope target.
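
A sketch of the envelope (optimality-filtered) target from the paper, under the assumption that the maximization is taken jointly over actions and the sampled weight set W with the target network, keeping the full Q-vector that attains the maximum scalarized value:

    (a^*, \mathbf{w}'^*) = \arg\max_{a,\ \mathbf{w}' \in W}\ \mathbf{w}^\top Q_{\theta^-}(s', a, \mathbf{w}'), \qquad
    y = \mathbf{r} + \gamma \, Q_{\theta^-}(s', a^*, \mathbf{w}'^*)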

eval(obs: ndarray, w: ndarray) int

Gives the best action for the given observation and weight.

Parameters:
  • obs (np.array) – Observation

  • w (optional np.array) – weight for scalarization

Returns:

np.array or int – Action
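
A greedy-evaluation sketch (illustrative values only): eval() accepts numpy arrays directly.

    import numpy as np

    obs, _ = env.reset()
    w = np.array([0.5, 0.5], dtype=np.float32)  # one entry per objective
    action = agent.eval(obs, w)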

get_config()

Generates a dictionary of the algorithm's parameter configuration.

Returns:

dict – Config

load(path: str, load_replay_buffer: bool = True)

Load the model and the replay buffer if specified.

Parameters:
  • path – Path to the model.

  • load_replay_buffer – Whether to load the replay buffer too.

max_action(obs: Tensor, w: Tensor) int

Select the action with the highest Q-value given an observation and weight.

Parameters:
  • obs – observation

  • w – weight vector

Returns: the action with the highest Q-value.

save(save_replay_buffer: bool = True, save_dir: str = 'weights/', filename: str | None = None)

Save the model and the replay buffer if specified.

Parameters:
  • save_replay_buffer – Whether to save the replay buffer too.

  • save_dir – Directory to save the model.

  • filename – Filename to save the model under.
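
A hypothetical checkpointing round trip; the exact file name and extension written inside save_dir are assumptions here, so check the directory contents before calling load().

    agent.save(save_replay_buffer=False, save_dir="weights/", filename="envelope_dst")
    agent.load(path="weights/envelope_dst.tar", load_replay_buffer=False)  # extension is an assumption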

train(total_timesteps: int, eval_env: Env | None = None, ref_point: ndarray | None = None, known_pareto_front: List[ndarray] | None = None, weight: ndarray | None = None, total_episodes: int | None = None, reset_num_timesteps: bool = True, eval_freq: int = 10000, num_eval_weights_for_front: int = 100, num_eval_episodes_for_front: int = 5, num_eval_weights_for_eval: int = 50, reset_learning_starts: bool = False, verbose: bool = False)

Train the agent. A minimal training sketch follows the parameter list below.

Parameters:
  • total_timesteps – total number of timesteps to train for.

  • eval_env – environment to use for evaluation. If None, it is ignored.

  • ref_point – reference point for the hypervolume computation.

  • known_pareto_front – known Pareto front for the hypervolume computation.

  • weight – weight vector. If None, it is randomly sampled every episode (as done in the paper).

  • total_episodes – total number of episodes to train for. If None, it is ignored.

  • reset_num_timesteps – whether to reset the number of timesteps. Useful when training multiple times.

  • eval_freq – policy evaluation frequency (in number of steps).

  • num_eval_weights_for_front – number of weights to sample for creating the Pareto front when evaluating.

  • num_eval_episodes_for_front – number of episodes to run when evaluating the policy.

  • num_eval_weights_for_eval (int) – Number of weights used when evaluating the Pareto front, e.g., for computing expected utility.

  • reset_learning_starts – whether to reset the learning starts. Useful when training multiple times.

  • verbose – whether to print the episode info.
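
A minimal training sketch, reusing the deep-sea-treasure-v0 setup from above; the reference point below is illustrative, not a tuned value.

    import numpy as np

    eval_env = mo_gym.make("deep-sea-treasure-v0")
    agent.train(
        total_timesteps=100_000,
        eval_env=eval_env,
        ref_point=np.array([0.0, -50.0]),
        eval_freq=10_000,
    )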

update()

Update the algorithm's parameters (e.g., using experiences from the buffer).