rl_equation_solver.environment.algebraic.Env
- class Env(order=2, init_state=None, config=None)[source]
Bases: Env, RewardMixin, HistoryMixin
Environment for solving algebraic equations using RL.
Example
\(a x + b = 0\)
The agent starts at state = 1 and chooses an action by combining operations and terms:
operations: (add, subtract, multiply, divide, pow)
terms: (a, b, 0, 1)
action[i][j] = (operation[i], terms[j])
So taking action[0][0] = (add, a) in state 1 would result in
new_state = \(a + 1\)
Following this with the action (divide, b) would result in
new_state = \((a + 1) / b\)
States are represented using sympy expressions and can be mapped onto a directed acyclic graph (DAG). This graph representation is what is fed to the RL agent.
The agent is rewarded if it reduces the “loss” of the equation, defined as the size of the state graph – intuitively, the complexity of the state:
loss = num_nodes + num_leaves of state graph
If the agent finds the solution, the episode terminates.
- Parameters:
order (int) – Order of the algebraic equation, e.g. if order = 2 then the equation to solve will be a0 * x + a1 = 0
init_state (sympy.Equation | None) – Optional initial guess for the equation solution, e.g. -b/a, using symbols from sympy.symbols('x a b'). If None then the initial guess will be (-1) * constant_term.
config (dict | None) – Model configuration. If None then the default model configuration in rl_equation_solver.config will be used.
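A minimal usage sketch, assuming only the constructor, reset() and step() signatures documented on this page; the random action choice is illustrative and stands in for a real policy:

    import numpy as np
    from rl_equation_solver.environment.algebraic import Env

    # Environment for a0 * x + a1 = 0 with the default configuration
    env = Env(order=2)

    # reset() returns the state vector and an info dictionary
    state_vec, info = env.reset(seed=42)

    # Pick an arbitrary action index from the fundamental action list
    # (a trained policy network would normally make this choice)
    action = int(np.random.default_rng(0).integers(len(env.actions)))
    new_state, reward, done, info = env.step(action)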
Methods
- append_history(entry) – Append latest step for training history of policy_network
- close() – Override close in your subclass to perform any necessary cleanup.
- diff_loss_reward(state_old, state_new) – Reward is decrease in complexity
- exp_loss_reward(state_old, state_new) – Reward is decrease in complexity
- expression_complexity(state) – Compute graph / expression complexity for the given state
- find_reward(state_old, state_new) – Compute reward as the difference between the loss for state_new and state_old
- Initialize model configuration
- inv_loss_reward(state_old, state_new) – Reward is decrease in complexity
- log_info() – Write info to logger
- render([mode]) – Print the state string representation
- reset([seed, options]) – Reset environment state
- reset_history() – Clear history
- step(action) – Take step corresponding to the given action
- sub_loss_reward(state_old, state_new) – Reward is decrease in complexity
- too_long(state) – Check if state dimension is too large
- update_config(config) – Update configuration
- update_history(key, value) – Update latest step for training history of policy_network
Attributes
- actions – Get list of fundamental actions
- avg_history – Get history averaged over each episode
- equation – Get equation from symbols
- feature_dict – Get the feature dictionary
- history – Get training history of policy_network
- metadata
- node_labels – Get node labels for current state graph
- np_random – Returns the environment's internal _np_random that if not set will initialise with a random seed.
- operations – Get list of valid operations
- render_mode
- reward_range
- spec
- state_graph – Get current state graph
- state_string – Get string representation of the solution state
- state_vec – Get current state vector
- terms – Get list of fundamental terms
- unwrapped – Returns the base non-wrapped environment.
- property state_string
Get string representation of the solution state
- property operations
Get list of valid operations
- property feature_dict
Get the feature dictionary
- property terms
Get list of fundamental terms
- property actions
Get list of fundamental actions
- property equation
Get equation from symbols
- property state_vec
Get current state vector
- property state_graph
Get current state graph
- property node_labels
Get node labels for current state graph
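Continuing the earlier usage sketch, a short hedged look at these properties; the only assumption (taken from the class docstring) is that actions enumerates every (operation, term) pair:

    env = Env(order=2)

    # Per the class docstring, action[i][j] = (operation[i], terms[j]),
    # so the flattened action list should have len(operations) * len(terms) entries.
    print(env.operations)        # valid operations
    print(env.terms)             # fundamental terms
    print(len(env.actions))      # number of (operation, term) actions

    # Current solution guess and the vector the agent actually sees
    print(env.state_string)
    print(env.state_vec.shape)   # assumes a numpy array, as reset() returns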
- reset(seed=None, options=None)[source]
Reset environment state
- Returns:
state_vec (np.ndarray) – State vector representing environment state
info (dict) – Dictionary with training info
- find_reward(state_old, state_new)[source]
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
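The reward methods on this page all compare old and new state complexity. A minimal sketch of that idea, assuming the sign convention that simplifying the state yields a positive reward (the helper name is hypothetical, not part of the class):

    def diff_reward_sketch(env, state_old, state_new):
        """Hypothetical helper mirroring the difference-based reward:
        positive when the new state is simpler than the old one."""
        old_loss = env.expression_complexity(state_old)
        new_loss = env.expression_complexity(state_new)
        return old_loss - new_loss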
- too_long(state)[source]
Check if state dimension is too large
- Parameters:
state (str) – State string representation
- Returns:
bool
- append_history(entry)
Append latest step for training history of policy_network
- property avg_history
Get history averaged over each episode
- close()[source]
Override close in your subclass to perform any necessary cleanup.
Environments will automatically close() themselves when garbage collected or when the program exits.
- diff_loss_reward(state_old, state_new)
Reward is decrease in complexity
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
- exp_loss_reward(state_old, state_new)
Reward is decrease in complexity
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
- expression_complexity(state)[source]
Compute graph / expression complexity for the given state
- Parameters:
state (str) – String representation of the current state
- Returns:
complexity (int) – Number of edges plus number of nodes in the graph representation / expression tree of the current solution approximation
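For intuition, a standalone sketch of this node-plus-edge count using sympy's expression tree; it re-implements the idea described above, not the class's exact internals:

    import sympy

    def complexity_sketch(state):
        # Parse the state string and walk its expression tree.
        expr = sympy.sympify(state)
        n_nodes = sum(1 for _ in sympy.preorder_traversal(expr))
        # A tree with n nodes has n - 1 edges.
        n_edges = max(n_nodes - 1, 0)
        return n_nodes + n_edges

    print(complexity_sketch("(a + 1) / b"))  # small expression, small complexity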
- property history
Get training history of policy_network
- inv_loss_reward(state_old, state_new)
Reward is decrease in complexity
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
- log_info()
Write info to logger
- property np_random: Generator
Returns the environment's internal _np_random that if not set will initialise with a random seed.
- reset_history()
Clear history
- sub_loss_reward(state_old, state_new)
Reward is decrease in complexity
- Parameters:
state_old (str) – String representation of last state
state_new (str) – String representation of new state
- Returns:
reward (int) – Difference between loss for state_new and state_old
- property unwrapped: Env
Returns the base non-wrapped environment.
- Returns:
Env: The base non-wrapped gym.Env instance
- update_history(key, value)
Update latest step for training history of policy_network
- step(action: int)[source]
Take step corresponding to the given action
- Parameters:
action (int) – Action index corresponding to the entry in the action list constructed in _make_physical_actions
- Returns:
new_state (Tensor | GraphEmbedding) – New state after the action, represented as a PyTorch Tensor or GraphEmbedding
reward (float) – Reward from taking this step
done (bool) – Whether problem is solved or if maximum state dimension is reached
info (dict) – Additional information
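A hedged episode-loop sketch built on this step() signature, continuing the imports from the earlier usage sketch; the random action stands in for whatever agent is being trained and is not part of this class:

    state_vec, info = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        # Placeholder policy: a trained policy network would choose the action here
        action = int(np.random.default_rng().integers(len(env.actions)))
        new_state, reward, done, info = env.step(action)
        total_reward += reward
    print(env.state_string, total_reward)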