rl_equation_solver.environment.algebraic.Env
- class Env(order=2, init_state=None, config=None)[source]
Bases: Env, RewardMixin, HistoryMixin
Environment for solving algebraic equations using RL.
Example
\(a x + b = 0\)
The agent starts at state = 1 and chooses an action by combining operations and terms:
operations: (add, subtract, multiply, divide, pow)
terms: (a, b, 0, 1)
action[i][j] = (operation[i], terms[j])
So taking action[0][0] = (add, a) in state 1 would result in
new_state = \(a + 1\)
Following this with the action (divide, b) would result in
new_state = \((a + 1) / b\)
States are represented using sympy expressions and can be mapped onto a directed acyclic graph (DAG). This graph representation is what is fed to the RL agent.
The agent is rewarded if it reduces the “loss” of the equation, defined as the size of the state graph – intuitively, the complexity of the state:
loss = num_nodes + num_leaves of state graph
If the agent finds the solution, the episode terminates.
- Parameters:
order (int) – Order of the algebraic equation, e.g. if order = 2 then the equation to solve will be a0 * x + a1 = 0
init_state (sympy.Equation | None) – Optional initial guess for the equation solution, e.g. -b/a, using symbols from sympy.symbols('x a b'). If None then the initial guess will be (-1) * constant_term.
config (dict | None) – Model configuration. If None then the default model configuration in rl_equation_solver.config will be used.
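A minimal usage sketch, assuming only the constructor, reset() and step() signatures documented on this page; the random action choice is illustrative and stands in for a real policy:

    import numpy as np
    from rl_equation_solver.environment.algebraic import Env

    # Environment for a0 * x + a1 = 0 with the default configuration
    env = Env(order=2)

    # reset() returns the state vector and an info dictionary
    state_vec, info = env.reset(seed=42)

    # Pick an arbitrary action index from the fundamental action list
    # (a trained policy network would normally make this choice)
    action = int(np.random.default_rng(0).integers(len(env.actions)))
    new_state, reward, done, info = env.step(action)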
Methods
- append_history(entry) – Append latest step for training history of policy_network
- close() – Override close in your subclass to perform any necessary cleanup.
- diff_loss_reward(state_old, state_new) – Reward is decrease in complexity
- exp_loss_reward(state_old, state_new) – Reward is decrease in complexity
- expression_complexity(state) – Compute graph / expression complexity for the given state
- find_reward(state_old, state_new) – Compute reward as the difference between the loss for state_new and state_old
- Initialize model configuration
- inv_loss_reward(state_old, state_new) – Reward is decrease in complexity
- log_info() – Write info to logger
- render([mode]) – Print the state string representation
- reset([seed, options]) – Reset environment state
- reset_history() – Clear history
- step(action) – Take step corresponding to the given action
- sub_loss_reward(state_old, state_new) – Reward is decrease in complexity
- too_long(state) – Check if state dimension is too large
- update_config(config) – Update configuration
- update_history(key, value) – Update latest step for training history of policy_network
Attributes
- actions – Get list of fundamental actions
- avg_history – Get history averaged over each episode
- equation – Get equation from symbols
- feature_dict – Get the feature dictionary
- history – Get training history of policy_network
- metadata
- node_labels – Get node labels for current state graph
- np_random – Returns the environment's internal _np_random that if not set will initialise with a random seed.
- operations – Get list of valid operations
- render_mode
- reward_range
- spec
- state_graph – Get current state graph
- state_string – Get string representation of the solution state
- state_vec – Get current state vector
- terms – Get list of fundamental terms
- unwrapped – Returns the base non-wrapped environment.
- property state_string
Get string representation of the solution state
- property operations
Get list of valid operations
- property feature_dict
Get the feature dictionary
- property terms
Get list of fundamental terms
- property actions
Get list of fundamental actions
- property equation
Get equation from symbols
- property state_vec
Get current state vector
- property state_graph
Get current state graph
- property node_labels
Get node labels for current state graph
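Continuing the earlier usage sketch, a short hedged look at these properties; the only assumption (taken from the class docstring) is that actions enumerates every (operation, term) pair:

    env = Env(order=2)

    # Per the class docstring, action[i][j] = (operation[i], terms[j]),
    # so the flattened action list should have len(operations) * len(terms) entries.
    print(env.operations)        # valid operations
    print(env.terms)             # fundamental terms
    print(len(env.actions))      # number of (operation, term) actions

    # Current solution guess and the vector the agent actually sees
    print(env.state_string)
    print(env.state_vec.shape)   # assumes a numpy array, as reset() returns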
- reset(seed=None, options=None)[source]
Reset environment state
- Returns:
state_vec (np.ndarray) – State vector representing environment state
info (dict) – Dictionary with training info
- find_reward(state_old, state_new)[source]
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
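The reward methods on this page all compare old and new state complexity. A minimal sketch of that idea, assuming the sign convention that simplifying the state yields a positive reward (the helper name is hypothetical, not part of the class):

    def diff_reward_sketch(env, state_old, state_new):
        """Hypothetical helper mirroring the difference-based reward:
        positive when the new state is simpler than the old one."""
        old_loss = env.expression_complexity(state_old)
        new_loss = env.expression_complexity(state_new)
        return old_loss - new_loss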
- too_long(state)[source]
Check if state dimension is too large
- Parameters:
state (str) – State string representation
- Returns:
bool
- append_history(entry)
Append latest step for training history of policy_network
- property avg_history
Get history averaged over each episode
- close()[source]
Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
- diff_loss_reward(state_old, state_new)
Reward is decrease in complexity
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
- exp_loss_reward(state_old, state_new)
Reward is decrease in complexity
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
- expression_complexity(state)[source]
Compute graph / expression complexity for the given state
- Parameters:
state (str) – String representation of the current state
- Returns:
complexity (int) – Number of edges plus number of nodes in the graph representation / expression tree of the current solution approximation
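For intuition, a standalone sketch of this node-plus-edge count using sympy's expression tree; it re-implements the idea described above, not the class's exact internals:

    import sympy

    def complexity_sketch(state):
        # Parse the state string and walk its expression tree.
        expr = sympy.sympify(state)
        n_nodes = sum(1 for _ in sympy.preorder_traversal(expr))
        # A tree with n nodes has n - 1 edges.
        n_edges = max(n_nodes - 1, 0)
        return n_nodes + n_edges

    print(complexity_sketch("(a + 1) / b"))  # small expression, small complexity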
- property history
Get training history of policy_network
- inv_loss_reward(state_old, state_new)
Reward is decrease in complexity
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
- log_info()
Write info to logger
- property np_random: Generator
Returns the environment's internal _np_random that if not set will initialise with a random seed.
- reset_history()
Clear history
- sub_loss_reward(state_old, state_new)
Reward is decrease in complexity
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
- property unwrapped: Env
Returns the base non-wrapped environment.
- Returns:
Env: The base non-wrapped gym.Env instance
- update_history(key, value)
Update latest step for training history of policy_network
- step(action: int)[source]
Take step corresponding to the given action
- Parameters:
action (int) – Action index corresponding to the entry in the action list constructed in _make_physical_actions
- Returns:
new_state (Tensor | GraphEmbedding) – New state after the action, represented as a PyTorch Tensor or GraphEmbedding
reward (float) – Reward from taking this step
done (bool) – Whether problem is solved or if maximum state dimension is reached
info (dict) – Additional information
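A hedged episode-loop sketch built on this step() signature, continuing the imports from the earlier usage sketch; the random action stands in for whatever agent is being trained and is not part of this class:

    state_vec, info = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        # Placeholder policy: a trained policy network would choose the action here
        action = int(np.random.default_rng().integers(len(env.actions)))
        new_state, reward, done, info = env.step(action)
        total_reward += reward
    print(env.state_string, total_reward)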