EUPG

class morl_baselines.single_policy.esr.eupg.EUPG(env: Env, scalarization: Callable[[ndarray, ndarray], float], weights: ndarray = array([1., 1.]), id: int | None = None, buffer_size: int = 100000, net_arch: List = [50], gamma: float = 0.99, learning_rate: float = 0.001, project_name: str = 'MORL-Baselines', experiment_name: str = 'EUPG', wandb_entity: str | None = None, log: bool = True, log_every: int = 1000, device: device | str = 'auto', seed: int | None = None, parent_rng: Generator | None = None)

Expected Utility Policy Gradient Algorithm.

The idea is to condition the network on the accrued reward and to scalarize the rewards based on the episodic return (accrued + future rewards).

Paper: D. Roijers, D. Steckelmacher, and A. Nowé, “Multi-objective Reinforcement Learning for the Expected Utility of the Return”, 2018.

Initialize the EUPG algorithm.

Parameters:
  • env – Environment

  • scalarization – Scalarization function to use (can be non-linear)

  • weights – Weights to use for the scalarization function

  • id – Id of the agent (for logging)

  • buffer_size – Size of the replay buffer

  • net_arch – Number of units per layer

  • gamma – Discount factor

  • learning_rate – Learning rate (alpha)

  • project_name – Name of the project (for logging)

  • experiment_name – Name of the experiment (for logging)

  • wandb_entity – Entity to use for wandb

  • log – Whether to log or not

  • log_every – Log every n episodes

  • device – Device to use for NN. Can be “cpu”, “cuda” or “auto”.

  • seed – Seed for the random number generator

  • parent_rng – Parent random number generator (for reproducibility)
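
A minimal construction sketch follows. It assumes MO-Gymnasium is installed and that the “fishwood-v0” environment (the toy FishWood domain from the EUPG paper) is available; the non-linear utility shown is purely illustrative, not the one prescribed by the paper.

    import numpy as np
    import mo_gymnasium as mo_gym

    from morl_baselines.single_policy.esr.eupg import EUPG

    # Assumption: "fishwood-v0" exists in the installed mo-gymnasium version.
    env = mo_gym.make("fishwood-v0")

    # Illustrative non-linear (ESR) utility over the episodic return vector.
    # Its signature matches the documented one (reward vector, weights);
    # this particular function ignores the weights.
    def scalarization(reward: np.ndarray, weights: np.ndarray) -> float:
        return float(min(reward[0], reward[1]))

    agent = EUPG(
        env,
        scalarization=scalarization,
        weights=np.ones(2),
        gamma=0.99,
        learning_rate=1e-3,
        log=False,
    )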

eval(obs: ndarray, accrued_reward: ndarray | None) → int | ndarray

Gives the best action for the given observation.

Parameters:
  • obs (np.array) – Observation

  • accrued_reward (np.array, optional) – Reward accrued so far in the episode, used to condition the policy

Returns:

np.array or int – Action
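
For illustration, a hedged rollout sketch that conditions eval() on the reward accrued so far, as described in the algorithm summary above. The two-objective reward size and the use of discounting when accumulating the reward are assumptions made for this example.

    import numpy as np

    obs, _ = env.reset()
    accrued_reward = np.zeros(2, dtype=np.float32)  # assumption: 2 objectives
    gamma, t, done = 0.99, 0, False
    while not done:
        action = agent.eval(obs, accrued_reward)
        obs, reward, terminated, truncated, _ = env.step(action)
        # Assumption: accumulate the discounted reward vector to condition the policy.
        accrued_reward += (gamma ** t) * np.asarray(reward, dtype=np.float32)
        done = terminated or truncated
        t += 1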

get_buffer()

Returns a reference to the replay buffer.

get_config() → dict

Generates a dictionary with the algorithm's parameter configuration.

Returns:

dict – Config

get_policy_net() → Module

Returns the policy network module.

set_buffer(buffer)

Sets the buffer to the passed buffer.

Parameters:

buffer – New buffer (potentially shared)

set_weights(weights: ndarray)

Sets new weights.

Parameters:

weights – The new weights to use in the scalarization function
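
A one-line usage sketch (the weight values here are arbitrary):

    import numpy as np

    agent.set_weights(np.array([0.5, 0.5]))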

train(total_timesteps: int, eval_env: Env | None = None, eval_freq: int = 1000, start_time=None)

Train the agent.

Parameters:
  • total_timesteps – Number of timesteps to train for

  • eval_env – Environment to run policy evaluation on

  • eval_freq – Frequency of policy evaluation

  • start_time – Start time of the training, used to compute steps per second (SPS)
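
A hedged training sketch, reusing the illustrative environment id from the construction example above and evaluating on a separate copy of the environment:

    eval_env = mo_gym.make("fishwood-v0")
    agent.train(total_timesteps=100_000, eval_env=eval_env, eval_freq=1000)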

update()

Update the algorithm’s parameters (e.g., using experiences from the buffer).