MO Q-Learning

class morl_baselines.single_policy.ser.mo_q_learning.MOQLearning(env, id: int | None = None, weights: numpy.ndarray = array([0.5, 0.5]), scalarization=weighted_sum, learning_rate: float = 0.1, gamma: float = 0.9, initial_epsilon: float = 0.1, final_epsilon: float = 0.1, epsilon_decay_steps: int | None = None, learning_starts: int = 0, use_gpi_policy: bool = False, dyna: bool = False, dyna_updates: int = 5, model: morl_baselines.common.model_based.tabular_model.TabularModel | None = None, gpi_pd: bool = False, min_priority: float = 0.0001, alpha: float = 0.6, parent=None, project_name: str = 'MORL-baselines', experiment_name: str = 'MO Q-Learning', wandb_entity: str | None = None, log: bool = True, seed: int | None = None, parent_rng: numpy.random.Generator | None = None)

Scalarized Q-learning for single-policy multi-objective reinforcement learning.

Maintains one Q-table per objective and relies on a scalarization function to select actions. Paper: K. Van Moffaert, M. Drugan, and A. Nowé, "Scalarized Multi-Objective Reinforcement Learning: Novel Design Techniques," 2013. doi: 10.1109/ADPRL.2013.6615007.

Initializes the MO Q-learning algorithm; a usage sketch follows the parameter list below.

Parameters:
  • env – The environment to train on.

  • id – The id of the policy.

  • weights – The weights to use for the scalarization function.

  • scalarization – The scalarization function to use.

  • learning_rate – The learning rate.

  • gamma – The discount factor.

  • initial_epsilon – The initial epsilon value.

  • final_epsilon – The final epsilon value.

  • epsilon_decay_steps – The number of steps to decay epsilon over.

  • learning_starts – The number of steps to wait before starting to learn.

  • use_gpi_policy – Whether to use Generalized Policy Improvement (GPI) or not.

  • dyna – Whether to use Dyna-Q or not.

  • dyna_updates – The number of Dyna-Q updates to perform each step.

  • model – The model to use for Dyna. If None and dyna==True, a new one is created.

  • gpi_pd – Whether to use the GPI-PD method to prioritize Dyna updates.

  • min_priority – The minimum priority to use for GPI-PD.

  • alpha – The alpha value to use to smooth GPI-PD priorities.

  • parent – The parent MPMOQLearning class in the case of multi-policy training.

  • project_name – The name of the project used for logging.

  • experiment_name – The name of the experiment used for logging.

  • wandb_entity – The entity to use for logging.

  • log – Whether to log or not.

  • seed – The seed to use for the experiment.

  • parent_rng – The random number generator to use. If None, a new one is created.
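
A minimal usage sketch, assuming MO-Gymnasium's deep-sea-treasure-v0 environment (two objectives) is installed; the environment id and the hyperparameter values shown are illustrative choices, not defaults prescribed by the class.

    import numpy as np
    import mo_gymnasium as mo_gym
    from morl_baselines.single_policy.ser.mo_q_learning import MOQLearning

    env = mo_gym.make("deep-sea-treasure-v0")      # tabular environment with two objectives
    agent = MOQLearning(
        env,
        weights=np.array([0.7, 0.3]),              # fixed trade-off between the two objectives
        learning_rate=0.1,
        gamma=0.9,
        initial_epsilon=1.0,
        final_epsilon=0.05,
        epsilon_decay_steps=50_000,
        log=False,
    )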

eval(obs: array, w: ndarray | None = None) → int

Returns the greedy (best) action for the given observation; see the rollout sketch after the return description below.

Parameters:
  • obs (np.array) – The observation.

  • w (optional np.array) – Weight vector for the scalarization function.

Returns:

np.array or int – Action
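
A rollout sketch that uses eval as a greedy policy, continuing from the construction example above; the reset/step loop assumes the standard Gymnasium API of the wrapped environment.

    import numpy as np

    w = np.array([0.7, 0.3])                        # weights used to scalarize the Q-values
    obs, info = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action = agent.eval(obs, w=w)               # greedy action under the given weights
        obs, reward, terminated, truncated, info = env.step(action)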

get_config() → dict

Generates a dictionary of the algorithm's parameter configuration.

Returns:

dict – Config

scalarized_q_values(obs, w: ndarray) → ndarray

Returns the scalarized Q-values for each action, given an observation and a weight vector.
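
Under the default weighted-sum scalarization, this amounts to a weighted combination of the per-objective Q-values of each action; the array layout below is an illustrative assumption, not the class's internal representation.

    import numpy as np

    # Hypothetical per-objective Q-values for one state: one row per action,
    # one column per objective.
    q_sa = np.array([[0.0, 1.0],
                     [5.0, -2.0],
                     [2.0, 2.0]])
    w = np.array([0.5, 0.5])

    scalarized = q_sa @ w                 # weighted sum per action, shape (n_actions,)
    greedy_action = int(scalarized.argmax())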

train(start_time, total_timesteps: int = 500000, reset_num_timesteps: bool = True, eval_env: Env | None = None, eval_freq: int = 1000)

Runs the learning loop for the agent; a call sketch follows the parameter list below.

Parameters:
  • start_time – Time at which training started.

  • total_timesteps – Maximum number of timesteps to train for.

  • reset_num_timesteps – Whether to reset the timestep counter when train is called again.

  • eval_env – Separate environment used for periodic greedy evaluations.

  • eval_freq – Number of timesteps between policy evaluations.
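
A call sketch, continuing from the construction example above; passing time.time() as start_time and a fresh copy of the environment as eval_env are assumptions consistent with the parameter descriptions, not requirements stated by the API.

    import time
    import mo_gymnasium as mo_gym

    agent.train(
        start_time=time.time(),
        total_timesteps=100_000,
        eval_env=mo_gym.make("deep-sea-treasure-v0"),   # separate env for periodic greedy evaluation
        eval_freq=1_000,
    )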

update()

Updates the Q-table.
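
A minimal sketch of the scalarized single-policy Q-learning update described by Van Moffaert et al. (2013): the greedy next action is chosen on the scalarized Q-values, and every objective's Q-value is then updated with the standard TD rule using that shared action. The function name, the q_tables layout, and the weighted-sum scalarization are illustrative assumptions, not the class's internal code.

    import numpy as np

    def scalarized_q_update(q_tables, s, a, r_vec, s_next, terminated, w,
                            gamma=0.9, learning_rate=0.1):
        """One tabular TD update after the transition (s, a, r_vec, s_next).

        q_tables maps each state to an array of shape (n_actions, n_objectives).
        """
        if terminated:
            td_target = r_vec                                   # no bootstrap at terminal states
        else:
            next_a = int((q_tables[s_next] @ w).argmax())       # greedy action on scalarized values
            td_target = r_vec + gamma * q_tables[s_next][next_a]
        q_tables[s][a] += learning_rate * (td_target - q_tables[s][a])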