EUPG¶
- class morl_baselines.single_policy.esr.eupg.EUPG(env: Env, scalarization: Callable[[ndarray, ndarray], float], weights: ndarray = array([1., 1.]), id: int | None = None, buffer_size: int = 100000, net_arch: List = [50], gamma: float = 0.99, learning_rate: float = 0.001, project_name: str = 'MORL-Baselines', experiment_name: str = 'EUPG', wandb_entity: str | None = None, log: bool = True, log_every: int = 1000, device: device | str = 'auto', seed: int | None = None, parent_rng: Generator | None = None)¶
Expected Utility Policy Gradient Algorithm.
The idea is to condition the network on the accrued reward and to scalarize the rewards based on the episodic return (accrued + future rewards). Paper: D. Roijers, D. Steckelmacher, and A. Nowé, Multi-objective Reinforcement Learning for the Expected Utility of the Return, 2018.
Initialize the EUPG algorithm.
- Parameters:
env – Environment
scalarization – Scalarization function to use (can be non-linear)
weights – Weights to use for the scalarization function
id – Id of the agent (for logging)
buffer_size – Size of the replay buffer
net_arch – Number of units per layer
gamma – Discount factor
learning_rate – Learning rate (alpha)
project_name – Name of the project (for logging)
experiment_name – Name of the experiment (for logging)
wandb_entity – Entity to use for wandb
log – Whether to log or not
log_every – Log every n episodes
device – Device to use for NN. Can be “cpu”, “cuda” or “auto”.
seed – Seed for the random number generator
parent_rng – Parent random number generator (for reproducibility)
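A minimal usage sketch based on the documented constructor signature. The MO-Gymnasium import, the `deep-sea-treasure-v0` environment id, and the concave utility are illustrative assumptions, not part of this API:

```python
import numpy as np
import mo_gymnasium as mo_gym  # assumed installed alongside morl_baselines

from morl_baselines.single_policy.esr.eupg import EUPG

# Illustrative two-objective environment; any MO-Gymnasium env should work.
env = mo_gym.make("deep-sea-treasure-v0")

# Non-linear (concave) utility applied to the episodic return vector.
# EUPG passes the return vector and the weight vector to this callable.
def scalarization(returns: np.ndarray, weights: np.ndarray) -> float:
    return float(np.dot(weights, np.sqrt(np.maximum(returns, 0.0))))

agent = EUPG(
    env=env,
    scalarization=scalarization,
    weights=np.array([0.5, 0.5]),
    gamma=0.99,
    learning_rate=1e-3,
    log=False,  # disable wandb logging for this sketch
)
```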
- eval(obs: ndarray, accrued_reward: ndarray | None) int | ndarray ¶
Gives the best action for the given observation and accrued reward.
- Parameters:
obs (np.array) – Observation
accrued_reward (np.ndarray, optional) – Accrued reward so far, used to condition the policy
- Returns:
np.array or int – Action
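A hedged example of querying the greedy policy, reusing the agent above and assuming a two-objective environment (so the accrued reward vector has two components):

```python
obs, _ = env.reset()
# Accrued reward starts at zero at the beginning of an episode;
# two components assumed for a two-objective environment.
accrued_reward = np.zeros(2, dtype=np.float32)
action = agent.eval(obs, accrued_reward)
```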
- get_buffer()¶
Returns a reference to the replay buffer.
- get_config() dict ¶
Generates dictionary of the algorithm parameters configuration.
- Returns:
dict – Config
- get_policy_net() Module ¶
Returns the policy network (torch.nn.Module).
- set_buffer(buffer)¶
Sets the buffer to the passed buffer.
- Parameters:
buffer – new buffer (potentially shared)
- set_weights(weights: ndarray)¶
Sets new weights.
- Parameters:
weights – the new weights to use in scalarization.
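For example, to steer the scalarization toward the first objective (a sketch; the weight array shape is assumed to match the number of objectives):

```python
agent.set_weights(np.array([0.8, 0.2]))
```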
- train(total_timesteps: int, eval_env: Env | None = None, eval_freq: int = 1000, start_time=None)¶
Train the agent.
- Parameters:
total_timesteps – Number of timesteps to train for
eval_env – Environment to run policy evaluation on
eval_freq – Frequency of policy evaluation
start_time – Start time of the training (used to compute steps per second, SPS)
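A sketch of a training call, reusing the agent above; the evaluation environment id is illustrative:

```python
eval_env = mo_gym.make("deep-sea-treasure-v0")
agent.train(total_timesteps=100_000, eval_env=eval_env, eval_freq=1000)
```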
- update()¶
Update algorithm’s parameters (e.g. using experiences from the buffer).