rl_equation_solver.environment.algebraic.Env

class Env(order=2, init_state=None, config=None)[source]

Bases: Env, RewardMixin, HistoryMixin

Environment for solving algebraic equations using RL.

Example

\(a x + b = 0\)

The agent starts at state = 1 and chooses an action by combining operations and terms:

operations: (add, subtract, multiply, divide, pow)
terms: (a, b, 0, 1)

action[i][j] = (operations[i], terms[j])

So taking action[0][0] = (add, a) in state 1 would result in

new_state = \(a + 1\)

Following with the action (div, b) would then result in

new_state = \((a + 1) / b\)
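
The same sequence can be reproduced with plain sympy (a standalone sketch; the symbols x, a, b are used purely for illustration and are not tied to the environment object):

    import sympy

    # Symbols from the equation a*x + b = 0
    x, a, b = sympy.symbols('x a b')

    # The agent starts from the trivial state 1
    state = sympy.Integer(1)

    # action[0][0] = (add, a)
    state = state + a          # a + 1

    # followed by (div, b)
    state = state / b          # (a + 1)/b

    print(state)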

The states are represented using sympy and can be mapped onto a directed acyclic graph (DAG). This state representation is what we feed to the RL agent.

The agent is rewarded if it reduces the “loss” of the equation, defined as the size of the state graph (intuitively, the complexity of the state):

loss = num_nodes + num_leaves of state graph
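
One way to compute such a loss directly from a sympy expression tree is sketched below (an illustration of the idea, not necessarily the library's exact implementation):

    import sympy

    def state_loss(expr):
        """Count nodes plus leaves of the sympy expression tree (illustrative)."""
        nodes = list(sympy.preorder_traversal(expr))
        leaves = [node for node in nodes if not node.args]
        return len(nodes) + len(leaves)

    x, a, b = sympy.symbols('x a b')
    print(state_loss(x))             # a lone symbol is as simple as it gets
    print(state_loss((a + 1) / b))   # larger value: more complex state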

If the agent finds the solution, the equation terminates.

Parameters:
  • order (int) – Order of the algebraic equation. E.g. if order = 2, the equation to solve will be a0 * x + a1 = 0.

  • init_state (sympy.Equation | None) – Optional initial guess for the equation solution, e.g. -b/a, using symbols from sympy.symbols('x a b'). If None, the initial guess will be (-1) * constant_term.

  • config (dict | None) – Model configuration. If None then the default model configuration in rl_equation_solver.config will be used.
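
A minimal construction sketch based only on the parameters documented above (no configuration keys are assumed here):

    import sympy
    from rl_equation_solver.environment.algebraic import Env

    # Default: solve a*x + b = 0 with the package's default configuration
    env = Env(order=2)

    # Optionally supply an initial guess built from the same symbols
    x, a, b = sympy.symbols('x a b')
    env = Env(order=2, init_state=-b / a, config=None)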

Methods

append_history(entry)

Append latest step for training history of policy_network

close()

Override close in your subclass to perform any necessary cleanup.

diff_loss_reward(state_old, state_new)

Reward is decrease in complexity

exp_loss_reward(state_old, state_new)

Reward is decrease in complexity

expression_complexity(state)

Compute graph / expression complexity for the given state

find_reward(state_old, state_new)

Compute the reward as the difference in loss between the old and new states

init_config()

Initialize model configuration

inv_loss_reward(state_old, state_new)

Reward is decrease in complexity

log_info()

Write info to logger

render([mode])

Print the state string representation

reset([seed, options])

Reset environment state

reset_history()

Clear history

step(action)

Take step corresponding to the given action

sub_loss_reward(state_old, state_new)

Reward is decrease in complexity

too_long(state)

Check if state dimension is too large

update_config(config)

Update configuration

update_history(key, value)

Update latest step for training history of policy_network

Attributes

actions

Get list of fundamental actions

avg_history

Get history averaged over each episode

equation

Get equation from symbols

feature_dict

Get the feature dictionary

history

Get training history of policy_network

metadata

node_labels

Get node labels for current state graph

np_random

Returns the environment's internal _np_random, initialising it with a random seed if it is not already set.

operations

Get list of valid operations

render_mode

reward_range

spec

state_graph

Get current state graph

state_string

Get string representation of the solution state

state_vec

Get current state vector

terms

Get list of fundamental terms

unwrapped

Returns the base non-wrapped environment.

init_config()[source]

Initialize model configuration

update_config(config)[source]

Update configuration

property state_string

Get string representation of the solution state

property operations

Get list of valid operations

property feature_dict

Get the feature dictionary

property terms

Get list of fundamental terms

property actions

Get list of fundamental actions

property equation

Get equation from symbols

property state_vec

Get current state vector

property state_graph

Get current state graph

property node_labels

Get node labels for current state graph

reset(seed=None, options=None)[source]

Reset environment state

Returns:

  • state_vec (np.ndarray) – State vector representing environment state

  • info (dict) – Dictionary with training info
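
For example (a sketch assuming the two-value return documented above):

    state_vec, info = env.reset(seed=42)
    print(state_vec.shape, info)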

find_reward(state_old, state_new)[source]
Parameters:
  • state_old (str) – String representation of last state

  • state_new (str) – String representation of new state

Returns:

reward (int) – Difference between loss for state_new and state_old
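
Conceptually this is a difference of per-state losses; a hedged sketch using the documented expression_complexity method (the sign convention is an assumption chosen so that simplifying the state yields a positive reward):

    def find_reward_sketch(env, state_old, state_new):
        # Assumed convention: positive reward when the new state is simpler.
        return env.expression_complexity(state_old) - env.expression_complexity(state_new)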

too_long(state)[source]

Check if state dimension is too large

Parameters:

state (str) – State string representation

Returns:

bool

append_history(entry)

Append latest step for training history of policy_network

property avg_history

Get history averaged over each episode

close()[source]

Override close in your subclass to perform any necessary cleanup.

Environments will automatically close() themselves when garbage collected or when the program exits.

diff_loss_reward(state_old, state_new)

Reward is decrease in complexity

Parameters:
  • state_old (str) – String representation of last state

  • state_new (str) – String representation of new state

Returns:

reward (int) – Difference between loss for state_new and state_old

exp_loss_reward(state_old, state_new)

Reward is decrease in complexity

Parameters:
  • state_old (str) – String representation of last state

  • state_new (str) – String representation of new state

Returns:

reward (int) – Difference between loss for state_new and state_old

expression_complexity(state)[source]

Compute graph / expression complexity for the given state

Parameters:

state (str) – String representation of the current state

Returns:

complexity (int) – Number of edges plus number of nodes in the graph representation / expression tree of the current solution approximation

property history

Get training history of policy_network

inv_loss_reward(state_old, state_new)

Reward is decrease in complexity

Parameters:
  • state_old (str) – String representation of last state

  • state_new (str) – String representation of new state

Returns:

reward (int) – Difference between loss for state_new and state_old

log_info()

Write info to logger

property np_random: Generator

Returns the environment's internal _np_random, initialising it with a random seed if it is not already set.

reset_history()

Clear history

sub_loss_reward(state_old, state_new)

Reward is decrease in complexity

Parameters:
  • state_old (str) – String representation of last state

  • state_new (str) – String representation of new state

Returns:

reward (int) – Difference between loss for state_new and state_old

property unwrapped: Env

Returns the base non-wrapped environment.

Returns:

Env: The base non-wrapped gym.Env instance

update_history(key, value)

Update latest step for training history of policy_network

step(action: int)[source]

Take step corresponding to the given action

Parameters:
  • action (int) – Action index corresponding to the entry in the action list constructed in _make_physical_actions

Returns:

  • new_state (Tensor | GraphEmbedding) – New state after action. Represented as a pytorch Tensor or GraphEmbedding

  • reward (float) – Reward from taking this step

  • done (bool) – Whether problem is solved or if maximum state dimension is reached

  • info (dict) – Additional information
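
A hedged rollout sketch using only methods documented on this page (the random policy and the use of len(env.actions) as the action-space size are assumptions for illustration):

    import random

    state_vec, info = env.reset(seed=0)
    done = False
    while not done:
        action = random.randrange(len(env.actions))  # random policy, illustrative only
        new_state, reward, done, info = env.step(action)
    env.render()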

render(mode='human')[source]

Print the state string representation