EUPG

class morl_baselines.single_policy.esr.eupg.EUPG(env: Env, scalarization: Callable[[ndarray, ndarray], float], weights: ndarray = array([1., 1.]), id: int | None = None, buffer_size: int = 100000, net_arch: List = [50], gamma: float = 0.99, learning_rate: float = 0.001, project_name: str = 'MORL-Baselines', experiment_name: str = 'EUPG', wandb_entity: str | None = None, log: bool = True, log_every: int = 1000, device: device | str = 'auto', seed: int | None = None, parent_rng: Generator | None = None)

Expected Utility Policy Gradient Algorithm.

The idea is to condition the network on the accrued reward and to scalarize the rewards based on the episodic return (accrued + future rewards).

Paper: D. Roijers, D. Steckelmacher, and A. Nowé, “Multi-objective Reinforcement Learning for the Expected Utility of the Return”, 2018.

Initialize the EUPG algorithm.

Parameters:
  • env – Environment

  • scalarization – Scalarization function to use (can be non-linear)

  • weights – Weights to use for the scalarization function

  • id – Id of the agent (for logging)

  • buffer_size – Size of the replay buffer

  • net_arch – Number of units per layer

  • gamma – Discount factor

  • learning_rate – Learning rate (alpha)

  • project_name – Name of the project (for logging)

  • experiment_name – Name of the experiment (for logging)

  • wandb_entity – Entity to use for wandb

  • log – Whether to log or not

  • log_every – Log every n episodes

  • device – Device to use for NN. Can be “cpu”, “cuda” or “auto”.

  • seed – Seed for the random number generator

  • parent_rng – Parent random number generator (for reproducibility)
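
A minimal construction sketch follows. It assumes MO-Gymnasium is installed and that the “fishwood-v0” environment (the toy FishWood domain from the EUPG paper) is available; the non-linear utility shown is purely illustrative, not the one prescribed by the paper.

    import numpy as np
    import mo_gymnasium as mo_gym

    from morl_baselines.single_policy.esr.eupg import EUPG

    # Assumption: "fishwood-v0" exists in the installed mo-gymnasium version.
    env = mo_gym.make("fishwood-v0")

    # Illustrative non-linear (ESR) utility over the episodic return vector.
    # Its signature matches the documented one (reward vector, weights);
    # this particular function ignores the weights.
    def scalarization(reward: np.ndarray, weights: np.ndarray) -> float:
        return float(min(reward[0], reward[1]))

    agent = EUPG(
        env,
        scalarization=scalarization,
        weights=np.ones(2),
        gamma=0.99,
        learning_rate=1e-3,
        log=False,
    )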

eval(obs: ndarray, accrued_reward: ndarray | None) → int | ndarray

Gives the best action for the given observation.

Parameters:
  • obs (np.array) – Observation

  • accrued_reward (np.array, optional) – Reward accrued so far in the episode, used to condition the policy

Returns:

np.array or int – Action
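
For illustration, a hedged rollout sketch that conditions eval() on the reward accrued so far, as described in the algorithm summary above. The two-objective reward size and the use of discounting when accumulating the reward are assumptions made for this example.

    import numpy as np

    obs, _ = env.reset()
    accrued_reward = np.zeros(2, dtype=np.float32)  # assumption: 2 objectives
    gamma, t, done = 0.99, 0, False
    while not done:
        action = agent.eval(obs, accrued_reward)
        obs, reward, terminated, truncated, _ = env.step(action)
        # Assumption: accumulate the discounted reward vector to condition the policy.
        accrued_reward += (gamma ** t) * np.asarray(reward, dtype=np.float32)
        done = terminated or truncated
        t += 1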

get_buffer()

Returns a reference to the replay buffer.

get_config() → dict

Generates a dictionary with the algorithm's parameter configuration.

Returns:

dict – Config

get_policy_net() → Module

Returns the policy network module.

set_buffer(buffer)

Sets the buffer to the passed buffer.

Parameters:

buffer – New buffer (potentially shared)

set_weights(weights: ndarray)

Sets new weights.

Parameters:

weights – The new weights to use in the scalarization function
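
A one-line usage sketch (the weight values here are arbitrary):

    import numpy as np

    agent.set_weights(np.array([0.5, 0.5]))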

train(total_timesteps: int, eval_env: Env | None = None, eval_freq: int = 1000, start_time=None)

Train the agent.

Parameters:
  • total_timesteps – Number of timesteps to train for

  • eval_env – Environment to run policy evaluation on

  • eval_freq – Frequency of policy evaluation

  • start_time – Start time of the training, used to compute steps per second (SPS)
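
A hedged training sketch, reusing the illustrative environment id from the construction example above and evaluating on a separate copy of the environment:

    eval_env = mo_gym.make("fishwood-v0")
    agent.train(total_timesteps=100_000, eval_env=eval_env, eval_freq=1000)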

update()

Update the algorithm’s parameters (e.g., using experiences from the buffer).