Envelope Q-Learning

class morl_baselines.multi_policy.envelope.envelope.Envelope(env, learning_rate: float = 0.0003, initial_epsilon: float = 0.01, final_epsilon: float = 0.01, epsilon_decay_steps: int | None = None, tau: float = 1.0, target_net_update_freq: int = 200, buffer_size: int = 1000000, net_arch: List = [256, 256, 256, 256], batch_size: int = 256, learning_starts: int = 100, gradient_updates: int = 1, gamma: float = 0.99, max_grad_norm: float | None = 1.0, envelope: bool = True, num_sample_w: int = 4, per: bool = True, per_alpha: float = 0.6, initial_homotopy_lambda: float = 0.0, final_homotopy_lambda: float = 1.0, homotopy_decay_steps: int | None = None, project_name: str = 'MORL-Baselines', experiment_name: str = 'Envelope', wandb_entity: str | None = None, log: bool = True, seed: int | None = None, device: device | str = 'auto', group: str | None = None)

Envelope Q-Learning algorithm.

Envelope uses a conditioned network to embed multiple policies, taking the weight vector as an additional input. The main change compared to a scalarized conditioned-network (CN) DQN is the target update. A construction sketch follows the parameter list below. Paper: R. Yang, X. Sun, and K. Narasimhan, “A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation,” arXiv:1908.08342 [cs], Nov. 2019. Available: http://arxiv.org/abs/1908.08342.

Envelope Q-learning algorithm.

Parameters:
  • env – The environment to learn from.

  • learning_rate – The learning rate (alpha).

  • initial_epsilon – The initial epsilon value for epsilon-greedy exploration.

  • final_epsilon – The final epsilon value for epsilon-greedy exploration.

  • epsilon_decay_steps – The number of steps to decay epsilon over.

  • tau – The soft update coefficient (keep in [0, 1]).

  • target_net_update_freq – The frequency with which the target network is updated.

  • buffer_size – The size of the replay buffer.

  • net_arch – The sizes of the hidden layers of the value network.

  • batch_size – The size of the batch to sample from the replay buffer.

  • learning_starts – The number of steps before learning starts, i.e., the agent acts randomly until then.

  • gradient_updates – The number of gradient updates per step.

  • gamma – The discount factor (gamma).

  • max_grad_norm – The maximum norm for the gradient clipping. If None, no gradient clipping is applied.

  • envelope – Whether to use the envelope method.

  • num_sample_w – The number of weight vectors to sample for the envelope target.

  • per – Whether to use prioritized experience replay.

  • per_alpha – The alpha parameter for prioritized experience replay.

  • initial_homotopy_lambda – The initial value of the homotopy parameter for homotopy optimization.

  • final_homotopy_lambda – The final value of the homotopy parameter.

  • homotopy_decay_steps – The number of steps to decay the homotopy parameter over.

  • project_name – The name of the project, for wandb logging.

  • experiment_name – The name of the experiment, for wandb logging.

  • wandb_entity – The entity of the project, for wandb logging.

  • log – Whether to log to wandb.

  • seed – The seed for the random number generator.

  • device – The device to use for training.

  • group – The wandb group to use for logging.
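
A minimal construction sketch (not an official example): it assumes mo_gymnasium is installed and exposes the multi-objective environment "deep-sea-treasure-v0"; the import path for Envelope is the one documented above.

    import mo_gymnasium as mo_gym
    from morl_baselines.multi_policy.envelope.envelope import Envelope

    env = mo_gym.make("deep-sea-treasure-v0")  # any multi-objective Gymnasium env
    agent = Envelope(
        env,
        learning_rate=3e-4,
        batch_size=256,
        log=False,  # skip wandb logging for a quick local run
    )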

act(obs: Tensor, w: Tensor) int

Epsilon-greedily select an action given an observation and weight.

Parameters:
  • obs – observation

  • w – weight vector

Returns: an integer representing the action to take.
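
A hedged usage sketch: it assumes the agent exposes its torch device as agent.device and that the weight vector has one entry per objective (two here, purely for illustration).

    import torch

    obs, _ = env.reset()
    w = torch.tensor([0.7, 0.3], dtype=torch.float32, device=agent.device)
    obs_t = torch.as_tensor(obs, dtype=torch.float32, device=agent.device)
    action = agent.act(obs_t, w)  # int, epsilon-greedy w.r.t. w·Q(obs, ., w)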

ddqn_target(obs: Tensor, w: Tensor) Tensor

Computes the double DQN target for the given observation and weight.

Parameters:
  • obs – observation

  • w – weight vector.

Returns: the DQN target.
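
For reference, a sketch of the vector-valued double DQN target in standard notation (not copied from the implementation), where Q_theta is the online network, Q_theta- the target network, and all Q-values are vectors with one component per objective:

    y = \mathbf{r} + \gamma \, Q_{\theta^-}\!\left(s',\ \arg\max_{a}\ \mathbf{w}^\top Q_{\theta}(s', a, \mathbf{w}),\ \mathbf{w}\right)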

envelope_target(obs: Tensor, w: Tensor, sampled_w: Tensor) Tensor

Computes the envelope target for the given observation and weight.

Parameters:
  • obs – current observation.

  • w – current weight vector.

  • sampled_w – set of sampled weight vectors (must contain more than one vector).

Returns: the envelope target.
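
A sketch of the envelope (optimality-filtered) target from the paper, under the assumption that the maximization is taken jointly over actions and the sampled weight set W with the target network, keeping the full Q-vector that attains the maximum scalarized value:

    (a^*, \mathbf{w}'^*) = \arg\max_{a,\ \mathbf{w}' \in W}\ \mathbf{w}^\top Q_{\theta^-}(s', a, \mathbf{w}'), \qquad
    y = \mathbf{r} + \gamma \, Q_{\theta^-}(s', a^*, \mathbf{w}'^*)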

eval(obs: ndarray, w: ndarray) int

Gives the best action for the given observation and weight.

Parameters:
  • obs (np.array) – Observation

  • w (optional np.array) – weight for scalarization

Returns:

np.array or int – Action
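
A greedy-evaluation sketch (illustrative values only): eval() accepts numpy arrays directly.

    import numpy as np

    obs, _ = env.reset()
    w = np.array([0.5, 0.5], dtype=np.float32)  # one entry per objective
    action = agent.eval(obs, w)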

get_config()

Generates a dictionary of the algorithm's parameter configuration.

Returns:

dict – Config

load(path: str, load_replay_buffer: bool = True)

Load the model and the replay buffer if specified.

Parameters:
  • path – Path to the model.

  • load_replay_buffer – Whether to load the replay buffer too.

max_action(obs: Tensor, w: Tensor) int

Select the action with the highest Q-value given an observation and weight.

Parameters:
  • obs – observation

  • w – weight vector

Returns: the action with the highest Q-value.

save(save_replay_buffer: bool = True, save_dir: str = 'weights/', filename: str | None = None)

Save the model and the replay buffer if specified.

Parameters:
  • save_replay_buffer – Whether to save the replay buffer too.

  • save_dir – Directory to save the model.

  • filename – Filename to save the model under.
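
A hypothetical checkpointing round trip; the exact file name and extension written inside save_dir are assumptions here, so check the directory contents before calling load().

    agent.save(save_replay_buffer=False, save_dir="weights/", filename="envelope_dst")
    agent.load(path="weights/envelope_dst.tar", load_replay_buffer=False)  # extension is an assumption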

train(total_timesteps: int, eval_env: Env | None = None, ref_point: ndarray | None = None, known_pareto_front: List[ndarray] | None = None, weight: ndarray | None = None, total_episodes: int | None = None, reset_num_timesteps: bool = True, eval_freq: int = 10000, num_eval_weights_for_front: int = 100, num_eval_episodes_for_front: int = 5, num_eval_weights_for_eval: int = 50, reset_learning_starts: bool = False, verbose: bool = False)

Train the agent. A minimal training sketch follows the parameter list below.

Parameters:
  • total_timesteps – total number of timesteps to train for.

  • eval_env – environment to use for evaluation. If None, it is ignored.

  • ref_point – reference point for the hypervolume computation.

  • known_pareto_front – known Pareto front for the hypervolume computation.

  • weight – weight vector. If None, it is randomly sampled every episode (as done in the paper).

  • total_episodes – total number of episodes to train for. If None, it is ignored.

  • reset_num_timesteps – whether to reset the number of timesteps. Useful when training multiple times.

  • eval_freq – policy evaluation frequency (in number of steps).

  • num_eval_weights_for_front – number of weights to sample for creating the Pareto front when evaluating.

  • num_eval_episodes_for_front – number of episodes to run when evaluating the policy.

  • num_eval_weights_for_eval (int) – Number of weights used when evaluating the Pareto front, e.g., for computing expected utility.

  • reset_learning_starts – whether to reset the learning starts. Useful when training multiple times.

  • verbose – whether to print the episode info.
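
A minimal training sketch, reusing the deep-sea-treasure-v0 setup from above; the reference point below is illustrative, not a tuned value.

    import numpy as np

    eval_env = mo_gym.make("deep-sea-treasure-v0")
    agent.train(
        total_timesteps=100_000,
        eval_env=eval_env,
        ref_point=np.array([0.0, -50.0]),
        eval_freq=10_000,
    )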

update()

Update the algorithm's parameters (e.g., using experiences from the buffer).