In this post, we show how InsideOpt Seeker can be used to optimize policies according to the universal framework introduced by Prof. Powell.
Universal Framework for Sequential Decision Problems
Dr. Warren Powell's universal framework for sequential decision problems identifies four classes of policies for making decisions under uncertainty:
Policy Search Class
1. Policy Function Approximations (PFAs): These are functions with tunable parameters that directly map the current state (information) to an action (decision).
2. Cost Function Approximations (CFAs): These approximate the optimal decision by solving a simplified optimization problem (often a deterministic model) that has been parameterized to work better under uncertainty over time.
Lookahead Class
3. Value Function Approximations (VFAs): Policies based on approximating the long-term value of being in a future state after a decision is made now. They optimize the immediate reward plus an estimate of the future value. This class is central to Approximate Dynamic Programming (ADP) and Reinforcement Learning (RL).
4. Direct Lookahead Approximations (DLAs): These policies explicitly plan over a future horizon using an approximate lookahead model to choose the best current decision. They model both future information and future actions.
InsideOpt Seeker
Seeker is a simulation-based optimization solver. As such, in terms of semantic modeling expressiveness, it can solve any problem of the form:

min_x f(x)

for any computable function f.

Side note: Obviously, the fact that we can express these problems in the Seeker API does not imply that Seeker can solve all such problems efficiently. This is particularly true if f is given as a black box, as the API allows. However, since the modeler is presumably interested in solving their problem, Seeker does not expect adversarial models but formulations that lend themselves to being solved well.
When considering sequential decision problems, we automatically have to consider stochasticity. The idea of sequential problems is that not all decisions have to be taken upfront, but only some. Then, new information may become available before more decisions have to be taken, and so on. It is therefore important to point out that the function f may well reflect a (computationally approximated) statistic over a stochastic distribution.
In most textbook examples, the statistic used is the expected value of a random variable. In the Seeker API, the modeler is not limited to means, though. Seeker also supports variance, standard deviation, quantiles, the conditional value at risk, probabilities like P(Y<=T) for a random variable Y and a given threshold T, or even user-defined statistics (computed from a finite set of samples). Let us denote with S a general statistic. Then, of particular importance to our discussion below is that Seeker allows us to optimize functions of the type:

min_x S(Y|x)

where Y denotes the stochastic outcome that results from applying the decisions x.
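To make the role of the statistic S concrete, here is a minimal plain-Python sketch (deliberately independent of the Seeker API) of how such statistics can be computed from a finite set of sampled outcomes; the sample values are made up for illustration.

```python
import numpy as np

# A finite set of sampled outcomes Y (e.g., the total cost of one policy
# evaluated under different demand and lead-time scenarios). Made-up values.
samples = np.array([102.0, 95.5, 130.2, 88.7, 140.9, 101.3, 97.8, 152.4])

mean = samples.mean()                      # expected value
std = samples.std(ddof=1)                  # standard deviation
q90 = np.quantile(samples, 0.90)           # 90% quantile
p_below = (samples <= 120.0).mean()        # P(Y <= T) for threshold T = 120
tail = samples[samples >= q90]             # the worst outcomes beyond the 90% quantile
cvar90 = tail.mean()                       # conditional value at risk at the 90% level

print(mean, std, q90, p_below, cvar90)
```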
The Use of Seeker for Various Sequential Decision Problems
When we aim to determine how to set the controls at our disposal in every stage of a sequential decision problem, after defining what these controls actually are (see What is a Decision), we need to define how we measure the quality of our decisions.
That is, we need to define the function f. This function has to be computable, which means that the optimization model either contains a finite number of stages or provides a closed-form or numerical approximation for an infinite application of the chosen rule for setting our controls at every stage.
In any case, f now measures the associated cost value of the decisions that we optimize in Seeker.
Here, it is important to point out that the variables x to be optimized in Seeker may not coincide with the real-world controls available to us at every stage. Instead, the decisions to be optimized in Seeker may be the parameters of an algorithm that translates the metrics describing the state at the beginning of each stage into a choice of how to set the available control levers.
With this in mind, let us now consider the four classes of policies and how Seeker may be of use to optimize outcomes.
1. Policy Function Approximations
This is a direct application of the last note above. Rather than deciding the controls for each stage directly and then assessing (under approximate stochastic evaluation S) the value of f, we determine the parameters that translate the current state at a given stage into the decisions that we take at this stage.
Example: Here and throughout this article we will consider an inventory replenishment policy where we want to decide, for a given number of periods, whether to place an order and, if so, the quantity ordered. These are our actual real-world controls. To create our policy, we rely on forecasts regarding demands and lead times when orders are fulfilled by suppliers. These forecasts come with uncertainty, i.e., we do not know exactly how much will be bought in each period or when exactly a replenishment order that we place will arrive.
We could decide statically when to order how much. These decisions would then always be the same, no matter how our system actually evolves. Since this lacks the ability to react to changing conditions, we will parameterize the decisions in the Seeker optimizer instead.
A canonical parameterization in replenishment is to set min/max values for each period. The Seeker variables x are therefore a set of 2n non-negative integer variables, where n is the number of periods in our planning horizon. The actual order decisions in each period are then determined by considering the current inventory level. The real-world decision is to refill to the max level whenever our inventory drops below the min value for the respective period. Note that this means that, while the min/max values stay the same for all scenarios, the actual real-world order decisions depend on the stochastic evolution of our inventory.
Now, since our inventory costs and fulfillment levels will vary from stochastic scenario to scenario, we need to choose a statistic S over the distribution of both before we can constrain or optimize either one, as discussed above. Customers often start with the expected values and then quickly add the conditional value at risk as an additional consideration.
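As an illustration, the following plain-Python sketch (not the Seeker API) simulates the min/max policy under randomly drawn demands and lead times and computes two of the statistics mentioned above over the resulting cost distribution. All cost rates, distributions, and the starting inventory are made-up assumptions; in a real model, the arrays mins and maxs would be the 2n Seeker variables.

```python
import numpy as np

def simulate_min_max_policy(mins, maxs, n_scenarios=1000, seed=0):
    """Simulate the min/max replenishment policy under random demands and
    lead times and return the total cost per scenario. All cost rates,
    distributions, and the starting inventory are made-up assumptions."""
    rng = np.random.default_rng(seed)
    n = len(mins)
    costs = np.zeros(n_scenarios)
    for s in range(n_scenarios):
        inventory, total, pipeline = 50.0, 0.0, []   # pipeline: (arrival period, qty)
        for t in range(n):
            # receive orders whose (random) lead time has elapsed
            inventory += sum(q for arr, q in pipeline if arr == t)
            pipeline = [(arr, q) for arr, q in pipeline if arr != t]
            # policy: refill to max whenever on-hand stock drops below min
            # (for simplicity, stock already on order is ignored in the trigger)
            if inventory < mins[t]:
                order = maxs[t] - inventory
                pipeline.append((t + rng.integers(1, 4), order))  # lead time of 1-3 periods
                total += 2.0 + 0.5 * order                        # fixed + variable order cost
            demand = rng.poisson(20)
            sold = min(inventory, demand)
            inventory -= sold
            total += 1.0 * inventory + 10.0 * (demand - sold)     # holding + shortage cost
        costs[s] = total
    return costs

# The 2n min/max values are the parameters that Seeker would optimize.
mins, maxs = [25] * 12, [60] * 12
costs = simulate_min_max_policy(mins, maxs)
print("expected cost:     ", costs.mean())
print("CVaR at 90% level: ", costs[costs >= np.quantile(costs, 0.9)].mean())
```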
This illustrates that, for policy function approximations, we can use Seeker to optimize the parameters of our policies.
2. Cost Function Approximations
For this class of policies, the algorithm for translating the current state into actual controls is a bit more involved, as it entails the nested solution of an optimization problem. The principle is, however, exactly the same as for policy function approximation: We need to set the parameters in such a way that the execution of the policy will result in favorable overall measures of target function f.
Example: Consider the inventory replenishment again, but this time we do not use simple min/max values. Instead, we solve, in each period, an optimization problem that sets the order decisions for the next k periods, whereby we assume that demands and lead times take exactly their expected values for these k periods. Then, we use the ordering decisions of this solution only for the current period.
We can use Seeker to determine a favorable value for the look-ahead window size k, as well as other parameters meant to make the optimization results more amenable to the sequential application of this policy. Note that the objective function used in the nested optimization (for which we can use Seeker's nested linear programming operator, if applicable, or any user-defined function) may very well differ from our function f. For example, we may want to place some form of penalty for leaving the inventory in shambles at the end of the look-ahead window k. A parameter that we can let Seeker optimize could then be the relative importance of the primary objective and that penalty for leaving the system in a poor end-state.
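To make this rolling-horizon idea tangible, here is a simplified plain-Python sketch (again not the Seeker API). The nested deterministic "optimization" is replaced by a toy order-up-to heuristic so that the example stays self-contained; the lookahead window k and the end-of-window weight are the parameters Seeker would tune, and all costs and distributions are made-up assumptions.

```python
import numpy as np

def deterministic_lookahead(inventory, expected_demand, k, end_weight):
    """Toy stand-in for the nested deterministic optimization: plan orders for
    the next k periods assuming demands equal their forecasts, keeping a
    safety buffer at the end of the window controlled by end_weight (a crude
    substitute for an explicit end-of-window penalty term)."""
    buffer = end_weight * float(np.mean(expected_demand[:k]))
    orders, level = np.zeros(k), inventory
    for t in range(k):
        need = expected_demand[t] + (buffer if t == k - 1 else 0.0)
        orders[t] = max(0.0, need - level)
        level += orders[t] - expected_demand[t]
    return orders

def run_cfa_policy(k, end_weight, n=12, n_scenarios=500, seed=0):
    """Roll the policy forward: re-solve the lookahead in every period and
    enact only the current period's order. Costs and distributions are
    made-up assumptions; orders arrive immediately for brevity."""
    rng = np.random.default_rng(seed)
    costs = np.zeros(n_scenarios)
    forecast = np.full(n + 12, 20.0)       # expected demand per period
    for s in range(n_scenarios):
        inventory, total = 40.0, 0.0
        for t in range(n):
            order = deterministic_lookahead(inventory, forecast[t:], k, end_weight)[0]
            inventory += order
            total += 0.5 * order
            demand = rng.poisson(20)
            sold = min(inventory, demand)
            inventory -= sold
            total += 1.0 * inventory + 10.0 * (demand - sold)
        costs[s] = total
    return costs

# k and end_weight are the policy parameters that Seeker would optimize.
print("expected cost:", run_cfa_policy(k=4, end_weight=0.5).mean())
```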
Mathematically, all that has changed by introducing a nested optimization are the outcomes Y given parameters x. This makes the search for x more costly, because the specific optimization problems to be solved in each scenario differ, which means that a single evaluation of f at a given x requires solving one nested optimization problem per scenario and stage. Seeker provides built-in strategies to make this more efficient, such as the ability to fully distribute the optimization across hundreds or even thousands of cores, warm starts for the nested optimizations, and, of course, efficient ways to optimize over hundreds or even thousands of scenarios.
Seeker can therefore simulate how the system progresses when this parameterized optimization model is used to determine the next action under stochasticity, and aggregate the results by forming a user-defined statistic over the outcomes under different demand and lead-time scenarios.
In other words: Seeker can be used to determine the parameters of cost function approximation policies.
3. Value Function Approximations
In this class, we choose our current actions based on our assessment of the immediate and the expected future value of our action. We aim to optimize this future value assessment so that we achieve good performance overall.
Side note: The value assessment of a state may differ from its true value, depending on the choice of the outer aggregation statistic S we optimize for. This would be the case, for example, when S does not compute the expected costs but, say, the conditional value at risk at a given threshold. Obviously, this is not the textbook setting in reinforcement learning, where expected values rule the day.
Example: Back to inventory replenishment, using a value function approximation, we aim to learn the parameters of a function v that translates a given state into a numerical value. Note that the description of this state can be complex. It may contain anything from the current inventory level to the already-realized costs and service levels to the orders already in the system to the number of remaining periods to the current demand and lead-time outlooks.
Given such a description D(a) of the state we would reach when taking action a (how much to order in this period), v(D(a),x) returns the state valuation given the parameters x that we aim to optimize. We would then choose, in each period, the action a that minimizes c(a) + v(D(a),x), whereby c(a) returns the immediate costs of taking action a and v returns the long-term value of reaching a state with description D(a).
Executing this policy for all planning periods then gives us our S(Y|x) which Seeker will optimize.
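The following plain-Python sketch (not the Seeker API) shows one possible shape of such a policy: a hypothetical feature map D, a value function v that is simply linear in those features, and the per-period action choice minimizing c(a) + v(D(a),x). The feature choice, cost rates, and candidate order quantities are all made-up assumptions.

```python
import numpy as np

def D(inventory, periods_left, on_order):
    """Hypothetical description D(a) of the state reached after taking action a."""
    return np.array([1.0, inventory, inventory ** 2, periods_left, on_order])

def v(state, x):
    """Value function approximation: here simply linear in the state features,
    with weights x that Seeker would optimize."""
    return float(state @ x)

def choose_order(inventory, periods_left, on_order, x,
                 candidates=range(0, 101, 10)):
    """Pick the order quantity a minimizing c(a) + v(D(a), x).
    The immediate cost c(a) and the candidate grid are made-up assumptions."""
    best_a, best_score = 0, float("inf")
    for a in candidates:
        c_a = (2.0 if a > 0 else 0.0) + 0.5 * a            # immediate cost c(a)
        score = c_a + v(D(inventory, periods_left - 1, on_order + a), x)
        if score < best_score:
            best_a, best_score = a, score
    return best_a

# Example: one decision under illustrative weights x (Seeker would tune these).
x = np.array([0.0, -1.0, 0.02, 5.0, -0.3])
print(choose_order(inventory=15, periods_left=8, on_order=0, x=x))
```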
Two side notes: 1. If the action taken does not translate into a new state deterministically, we can obviously use some statistic over the distribution of valuation outcomes to determine the next action. Note that this internal statistical aggregation in each demand and lead time scenario and in each period may, and generally will, differ from the outer aggregation S.
2. Given the large action space of our example, we run into the same computational issues here as traditional RL would. They can be overcome in much the same ways, for example by moving from the Q-learning-related setting described above (pricing each possible action) to an actor-critic-related setting (e.g., learning the parameters of a distribution from which the order volume is drawn, as sketched below). Note that this remedy would mean that we leave the class of value function approximation policies.
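A minimal sketch of this actor-style alternative, with made-up state features and a normal distribution as the (hypothetical) choice of policy distribution:

```python
import numpy as np

def sample_order(state_features, x, rng):
    """Actor-style alternative: draw the order volume from a distribution whose
    parameters depend on the state. Here, a normal distribution (truncated at
    zero) whose mean and spread are linear functions of made-up state features,
    with the weights x being what Seeker would optimize."""
    d = len(state_features)
    mean = float(state_features @ x[:d])
    std = float(np.exp(state_features @ x[d:]))
    return max(0.0, rng.normal(mean, std))

rng = np.random.default_rng(0)
phi = np.array([1.0, 15.0, 8.0])                 # bias, inventory level, periods left
x = np.array([5.0, 0.5, 1.0, 0.0, 0.01, 0.02])   # mean weights, then spread weights
print(sample_order(phi, x, rng))
```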
As with the other classes, for the optimization of value function approximations we once more find ourselves in the situation where we need to optimize some parameters x so that a stochastic simulation returns a favorable statistic S(Y|x). And that means we can use Seeker to optimize this type of policy as well.
4. Direct Lookahead Approximations
In this last class, we compute a course of action for some future time window and then enact only the decisions that must be taken now. Typical examples of this are two- or multistage stochastic programming, model predictive control, and roll-outs in RL.
Example: When discussing cost function approximations, we already used an optimization over a full time window rather than looking ahead only one period. That example therefore also illustrates the use of Seeker for optimizing direct lookahead approximation policies.
Online Optimization
In the discussion above, we make one crucial assumption: namely, that we can compute f, which requires that we have an algorithmic description of our system as well as a computational model of its stochasticity.
What if we lack such an understanding of the system dynamics? What if we must learn to navigate the system while we are already "on policy"?
This scenario is the subject of Bayesian optimization, the related bandit theory, and on-policy reinforcement learning. And the core idea for handling this situation is always the same: We use our understanding as we have it so far, including our own uncertainty, and determine what to try next based on this "surrogate."
While the strategies differ, the two forces at play here are the direct exploitation of what we believe is most likely and the desire to explore in order to retire uncertainty where we have not garnered much experience yet. In RL, we may flip a coin to decide whether to try a random action next or go with the one we consider best given our current world view. In BO, we may subtract our uncertainty from our expected costs (and thereby make cheaper and more uncertain actions more attractive) in some aggregate, the so-called "acquisition function." And bandit theory can give us beautiful strategies that are optimal under certain assumptions.
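As an illustration of the BO side, here is a minimal sketch of one canonical acquisition function, the lower confidence bound for minimization, built on a Gaussian-process surrogate from scikit-learn. This is generic Bayesian optimization rather than Seeker's API, and the observed points and the value of kappa are made-up assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Parameter settings tried so far and their observed costs (made-up values).
X_seen = np.array([[0.1], [0.4], [0.7], [0.9]])
y_seen = np.array([5.2, 3.1, 4.0, 6.3])

# Gaussian-process surrogate of the unknown cost function.
gp = GaussianProcessRegressor().fit(X_seen, y_seen)

# Lower confidence bound: expected cost minus kappa times the predictive
# uncertainty, making cheap-looking and uncertain regions attractive alike.
candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
mean, std = gp.predict(candidates, return_std=True)
kappa = 2.0
lcb = mean - kappa * std

# Evaluate the real system next at the most promising candidate.
print("next candidate to try:", candidates[np.argmin(lcb)])
```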
Seeker can help again.
- In BO, we need to optimize over some surrogate that reflects our current expectations and heteroscedastic uncertainty. We can use Seeker's multi-objective optimization capability to balance exploration and exploitation instead of simply aggregating uncertainty and opportunity as is done in canonical BO.
- In real-world applications, we often have to balance multiple KPIs. We can also use Seeker's MO capabilities when we seek controls that optimize and balance multiple objectives.
- Expensive or critical assets rarely allow for arbitrary exploration of the control space. We can use Seeker's constrained optimization capabilities to ensure safe exploration.
The Bottom Line
In summary, Seeker can be a useful tool in your arsenal when tackling sequential decision problems. And, as Warren would say: What real-world problem is not a sequential decision problem?
