Adversarial Signaling Games: Stochastic Approach


In the previous post we described an adversarial signaling game as a two-player game in which each player has the capacity to manipulate a shared environment. Additionally, each player can observe the other's actions along with the effect of those actions on the shared environment; hidden intentions, however, may remain obfuscated. We first offered a quantum games approach to modeling an adversarial signaling game. The second potential approach is to model the two-player adversarial signaling game using classical stochastic control. Specifically, we make use of the Hamilton-Jacobi-Isaacs (HJI) equation, which naturally arises in the context of zero-sum differential games where each player is allowed to employ a continuous-time dynamic strategy. We first spend some time discussing the general HJI modeling framework.

General HJI Framework

Generally the dynamic game is simple to set up and has the following structure. First we have an environmental state \(x(t)\), along with the decision strategies of the players, often referred to as policies or controls and denoted \(u_1(t)\) and \(u_2(t)\). The environment evolves as a function of its current state and the decisions of each player.

\[dx(t) = f(x(t),u_1(t),u_2(t),t)\,dt\]

This general differential equation is natural within the context of adversarial signaling games, since each player's actions change the environment and serve either to signal patterns to their opponent or to perturb the environment toward a state that is favorable to their own objectives. To that end we must also define an objective, typically expressed as a functional \(J\). The functional is composed of a running cost \(L\) evaluated over the course of the game on \([0,T]\), along with a terminal cost/reward obtained at the end of the game based on the state of the environment at time \(T\).

\[J(x,u_1,u_2) = g(x(T))+\int_0^T L(x(t),u_1(t),u_2(t))dt\]
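
To make these objects concrete, below is a minimal sketch that integrates the dynamics with a simple Euler scheme and evaluates \(J\) for fixed open-loop control paths. The particular \(f\), \(L\), and \(g\) used here are placeholder choices for illustration only, not part of the model above.

```python
import numpy as np

# Placeholder dynamics, running cost, and terminal cost (illustrative only).
def f(x, u1, u2, t):
    return u1 - u2                                # each player pushes the state

def L(x, u1, u2):
    return x**2 + 0.5 * u1**2 - 0.5 * u2**2       # running cost

def g(x):
    return x**2                                   # terminal cost

def evaluate_J(x0, u1_path, u2_path, T):
    """Euler-integrate dx = f dt and accumulate J = g(x(T)) + int_0^T L dt."""
    n = len(u1_path)
    dt = T / n
    x, J = x0, 0.0
    for k in range(n):
        J += L(x, u1_path[k], u2_path[k]) * dt    # running cost quadrature
        x += f(x, u1_path[k], u2_path[k], k * dt) * dt
    return J + g(x)

print(evaluate_J(0.0, np.full(100, 0.1), np.full(100, -0.2), T=1.0))
```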

The cost functional is again an intuitive component of modeling an adversarial signaling game, since the game is objective oriented and zero sum. In other words, there is some measurable quantity that the players are competing for. To that end it is one player's objective to minimize \(J\) and the other player's objective to maximize \(J\). For example, if Trader A and Trader B engage in a two-person market with respective wealth \(W_A,W_B\), then the functional \(J\) will be roughly \(W_A - W_B\), where Trader A is trying to maximize \(J\) and Trader B is trying to minimize \(J\). We now introduce the concept of the value function \(V(t,x)\), which represents the best achievable objective when both players behave optimally. We can express \(V(t,x)\) as a basic min-max problem, namely

\[V(t,x)= \min_{u_1}\max_{u_2}J(t,x,u_1,u_2)\]
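
Before deriving the PDE, a discrete one-shot analogue may help fix the min-max structure; the payoff matrix below is made up purely for illustration.

```python
import numpy as np

# Rows: minimizer's actions (u1). Columns: maximizer's actions (u2).
J = np.array([[ 1.0, -2.0,  3.0],
              [ 0.5,  0.0, -1.0],
              [ 2.0,  1.0,  0.5]])

value = J.max(axis=1).min()       # min over u1 of max over u2
best_u1 = J.max(axis=1).argmin()  # minimizer's security strategy
print(value, best_u1)             # 0.5, row 1
```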

From the value function \(V\) we can derive the PDE that governs its dynamics.

Theorem

Given a cost functional \(J(x,u_1,u_2) = g(x(T))+\int_0^T L(x(t),u_1(t),u_2(t))dt\) which is controlled by two players with respective controls \(u_1,u_2\), the corresponding value function \(V(t,x)= \min_{u_1}\max_{u_2}J(t,x,u_1,u_2)\) satisfies the following PDE.

\[\frac{\partial V}{\partial t}(t,x) + \inf_u \sup_v \left\{ L(x,u,v) + \nabla_x V(t,x) \cdot f(x,u,v) \right\} = 0, \quad V(T,x(T))=g(x(T))\]
Proof

Let \(\Delta t\) be a small time increment. By the principle of optimality of dynamic programming, the value function satisfies the dynamic programming equation:

\[V(t,x) = \inf_u \sup_v \left[ \int_t^{t+\Delta t} L(x(s), u, v) \, ds + V(t+\Delta t, x(t+\Delta t)) \right]\]

Taking first-order Taylor expansions of the running cost and of the value function, we find that

\[\int_t^{t+\Delta t} L(x(s), u, v) \, ds \approx L(x, u, v)\, \Delta t\] \[V(t+\Delta t, x(t+\Delta t)) \approx V(t,x) + \frac{\partial V}{\partial t}(t,x)\, \Delta t + \nabla_x V(t,x) \cdot \dot{x}(t)\, \Delta t = V(t,x) + \left( \frac{\partial V}{\partial t}(t,x) + \nabla_x V(t,x) \cdot f(x, u, v) \right) \Delta t\]

Substitute into the dynamic programming equation:

\[V(t,x) = \inf_u \sup_v \left[ L(x,u,v) \Delta t + V(t,x) + \left( \frac{\partial V}{\partial t}(t,x) + \nabla_x V(t,x) \cdot f(x, u, v) \right) \Delta t \right]\] \[\quad = V(t,x) + \inf_u \sup_v \left[ \left( L(x,u,v) + \nabla_x V(t,x) \cdot f(x,u,v) + \frac{\partial V}{\partial t}(t,x) \right) \Delta t \right]\]

Subtracting \(V(t,x)\) from both sides, dividing by \(\Delta t\), and letting \(\Delta t \to 0\), we arrive at the Hamilton–Jacobi–Isaacs equation:

\[\frac{\partial V}{\partial t}(t,x) + \inf_u \sup_v \left\{ L(x,u,v) + \nabla_x V(t,x) \cdot f(x,u,v) \right\} = 0, \quad V(T,x(T))=g(x(T))\]

Note that a similar HJI equation can be derived for a stochastic environment, although, because the problem is a closed-loop control problem, the analytical solution is difficult to derive if the stochastic component is also controlled by \(u_1\) and \(u_2\). We can solve this system numerically and observe whether or not deception plays a part in either player's strategy. We can, however, build deception into the model directly, which we discuss in the next section.
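
As a rough sketch of what such a numerical treatment might look like, consider a one-dimensional toy game solved on a grid by marching the HJI equation backward in time. The dynamics, costs, and the naive central-difference scheme below are all illustrative assumptions; a serious solver would use an upwind or level-set discretization.

```python
import numpy as np

# Toy zero-sum game: dx = (u - v) dt, L = x^2 + u^2 - v^2, g(x) = x^2.
# Player u minimizes, player v maximizes. Everything here is illustrative.
T, nt = 1.0, 200
xs = np.linspace(-2.0, 2.0, 101)
dt, h = T / nt, xs[1] - xs[0]
us = np.linspace(-1.0, 1.0, 21)       # discretized control sets
vs = np.linspace(-1.0, 1.0, 21)

V = xs**2                             # terminal condition V(T, x) = g(x)
for _ in range(nt):                   # march backward from t = T to t = 0
    Vx = np.gradient(V, h)            # central-difference estimate of dV/dx
    H = np.full_like(V, np.inf)
    for u in us:                      # outer inf over u
        inner = np.full_like(V, -np.inf)
        for v in vs:                  # inner sup over v
            inner = np.maximum(inner, xs**2 + u**2 - v**2 + Vx * (u - v))
        H = np.minimum(H, inner)
    V = V + dt * H                    # dV/dt = -H  =>  V(t - dt) ≈ V(t) + dt H

print(V[len(xs) // 2])                # approximate value at x = 0, t = 0
```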

HJI for Two-Player Adversarial Signaling Game

The HJI framework as discussed above lends itself naturally toward signaling games; however, we can make the connection more explicit by redefining the full state as \(z(t)\), where \(x(t)\) is the external system state, \(b_i\) is each player's belief about the likelihood that the other player is currently carrying out a deceptive action strategy, and \(d_i\) is each player's current level of deception or commitment to a pattern. For example, if \(d_1\) is high then player 1 will likely switch their action strategy soon to catch player 2 off guard. As a result player 2 will likely increase \(b_2\) to reflect player 1's deception.

\[z(t) = \begin{bmatrix}x(t)&b_1(t)&b_2(t)&d_1(t)&d_2(t)\end{bmatrix}\]

Each player needs some way of choosing their actions, which we again denote as the control \(u_{i,x}\), along with some control over their beliefs and levels of deception, which we denote as \(u_{i,b}\) and \(u_{i,d}\) respectively. So the updated dynamics are as follows.

\[dx(t) = f_x(x,u_{1,x},u_{2,x}, d_1,d_2)\,dt\] \[d(d_i) = f_d(x,d_i,u_{i,d})\,dt\] \[d(b_i) = f_b(x,b_i,u_{i,b})\,dt\]
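
A minimal sketch of the enlarged state \(z\) and a drift function built from \(f_x\), \(f_b\), and \(f_d\). The functional forms below are placeholder assumptions, since the model deliberately leaves them to the modeler.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Controls:
    u_x: float   # action on the shared environment
    u_b: float   # belief-update rate
    u_d: float   # deception / pattern-commitment control

def drift(z, p1: Controls, p2: Controls):
    """Placeholder drift for z = (x, b1, b2, d1, d2); all forms are illustrative."""
    x, b1, b2, d1, d2 = z
    dx  = p1.u_x - p2.u_x + 0.5 * (d2 - d1)   # f_x: actions and deception move x
    db1 = p1.u_b * (d2 - b1)                  # f_b: b1 tracks player 2's deception
    db2 = p2.u_b * (d1 - b2)                  # f_b: b2 tracks player 1's deception
    dd1 = p1.u_d - 0.1 * d1                   # f_d: deception builds and decays
    dd2 = p2.u_d - 0.1 * d2
    return np.array([dx, db1, db2, dd1, dd2])

# One Euler step of the enlarged dynamics.
z = np.zeros(5)
z = z + 0.01 * drift(z, Controls(0.2, 0.5, 0.1), Controls(-0.1, 0.5, 0.3))
```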

As a result we can rewrite the HJI equation to account for the enlarged state space, which now includes the belief and hidden-intention states for each player. Below the players are labeled A and B, with player A's controls collected into \(u = (u_x, u_b, u_d)\) and player B's into \(v = (v_x, v_b, v_d)\); the running cost now also includes a deception cost \(C_d\) and a belief-accuracy reward \(R_b\).

\[\frac{\partial V}{\partial t}(t, z) + \inf_u \sup_v \left\{ L(z, u, v) + C_d(d_A, b_B) - R_b(b_A, b_B) + \nabla_z V(t, z) \cdot f(z, u, v) \right\} = 0,\quad V(T, z(T)) = g(x(T))\] \[f(z, u, v) = \begin{bmatrix} f_x(x, u_x, v_x, d_A, d_B) \\ f_b^A(b_A, d_B, v_b) \\ f_b^B(b_B, d_A, u_b) \\ f_d^A(d_A, u_d) \\ f_d^B(d_B, v_d) \end{bmatrix}\]

Note that the dynamics themselves, including the design of the objective function, are left up to the modeler. It is well known, however, that HJI equations are difficult to solve analytically and are instead often evaluated through numerical techniques like reinforcement learning (RL).