Most successes in AI come from developing specific responses to specific problems. We can create an AI that outperforms humans at chess, for instance. Or, as we demonstrated with our Pluribus bot in 2019, one that defeats World Series of Poker champions in Texas Hold’em.

What we really want, however, is an AI system that can simply excel at whatever game or task it’s presented with. For AI to be more useful, it needs to be able to generalize, to learn and understand new situations as they occur without additional help. Unfortunately, while we as humans recognize both chess and poker as games in the broadest sense, even teaching a single AI to play both is difficult.

Chess belongs to the class known as perfect-information games, or ones where nothing is hidden to either player. You can see the entire board and all possible moves at all times. AI has proven fairly successful at mastering perfect-information games like chess, Go, and shogi, with bots like AlphaZero even combining reinforcement learning with search (which we’ll call RL+Search) to teach themselves to master these games from scratch.

Prior RL+Search algorithms break down in imperfect-information games like poker though, where players keep their cards secret. These algorithms tend to assume each action has a fixed value regardless of the probability that the action is chosen. This is true in a game like chess, where a good action is good regardless of whether it is chosen every time or only rarely. However, in a game like poker, the value of bluffing goes down the more a player does it, because a savvy opponent can adjust their strategy to call more of those bluffs. Thus our Pluribus poker bot was built on a different approach that used search during actual gameplay but not before, when the bot was learning how to play.

But today we are unveiling Recursive Belief-based Learning (ReBeL), a general RL+Search algorithm that can work in all two-player zero-sum games, including imperfect-information games. ReBeL builds on the RL+Search algorithms like AlphaZero that have proved successful in perfect-information games. Unlike those previous AIs, however, ReBeL makes decisions by factoring in the probability distribution of different beliefs each player might have about the current state of the game, which we call a public belief state (PBS). In other words, ReBeL can assess the chances that its poker opponent thinks it has, for example, a pair of aces.

By accounting for the beliefs of each player, ReBeL is able to treat imperfect-information games akin to perfect-information games. ReBeL can then leverage a modified RL+Search algorithm that we developed to work with the more complex (higher-dimensional) state and action space of imperfect-information games.

Experimental results show that ReBeL is effective in large-scale two-player zero-sum imperfect-information games like Liar’s Dice and poker. ReBeL even managed to defeat a top human professional in the benchmark game of heads-up no-limit Texas Hold’em. While this feat has been accomplished before, ReBeL achieves it using far less expert domain knowledge than any previous poker AI, bringing us closer to building a generalized AI that can tackle a wide variety of complex real-world tasks that involve hidden information, such as negotiations, fraud detection, and cybersecurity. To facilitate further research into this problem, we have open-sourced our implementation of ReBeL for Liar’s Dice.

# From states to public belief states

An example of why prior RL+Search algorithms fail in imperfect-information games is illustrated in Figure 1, which shows a modified form of Rock-Paper-Scissors in which the winner receives two points (and the loser loses two points) when either player chooses Scissors. The figure shows the game in a sequential form in which Player 2 acts after Player 1 but does not observe Player 1’s action.

The optimal policy for both players (i.e., the strategy known as the Nash equilibrium) in this modified version of the game is to choose Rock and Paper with 40 percent probability, and Scissors with 20 percent. In the Nash equilibrium, each action results in an expected value of zero. However, as shown in Figure 2, if Player 1 were to conduct one-ply lookahead search as is done in perfect-information games (in which the value of a leaf node is set to its Nash equilibrium value), then there would not be enough information for Player 1 to arrive at their optimal strategy.

The problem is that the search algorithm in Figure 2 assumes that Player 2 would play the Nash equilibrium policy regardless of what Player 1 does. If it were indeed true that Player 2’s policy would remain fixed at 40 percent Rock, 40 percent Paper, and 20 percent Scissors, then Player 1 could choose Rock every time and achieve an expected value of 0. But in reality, if Player 1’s policy were to always choose Rock, then Player 2 would adjust to always choosing Paper, and the value to Player 1 of choosing Rock would drop from 0 to -1.

This illustrates a critical challenge of imperfect-information games: Unlike perfect-information games and single-agent settings, the value of an action may depend on the probability that it’s chosen.

This is why public belief states, or PBSs, are crucial. We need to change the definition of a state so that it is not defined simply by a sequence of actions but also includes the probability that different sequences of actions occurred.

Consider an imperfect-information card game using a standard 52-card deck wherein a single card is privately dealt to each player. Players take turns choosing between three actions: fold, call, and raise.

Now consider a modification of this game (shown on the right side of Figure 3) in which players cannot see their cards. Instead, their cards are seen only by an impartial “referee.” In this hypothetical game, instead of acting directly, players instead announce how likely they are to take a specific action given each possible private card. The referee then chooses an action on the player’s behalf according to the announced probability distribution for the player’s true private card.

At the beginning of this modified game, players have a uniform random belief distribution over all possibilities for their private card. The player can glean information from the referee’s actions each round though, drawing conclusions about what cards are potentially in play based on what happens. Using Bayes’s theorem, players can update their belief distribution for both their own private card and their opponent’s card. Thus the probability that each player is holding a specific card is common knowledge among all players at all times in this game.

The critical insight is that these two games — the game where players see their own cards and the game where all cards are viewed only by a referee — are strategically identical, but the latter game contains no private information and is instead a continuous-state perfect-information game. While players do not announce their action probabilities for each possible card in the first game, we nevertheless assume that all players’ strategies are common knowledge, and therefore the probability that a player would choose each action for each possible card is indeed known by all players. (Of course, in reality, human opponents do not announce their entire policy when playing. We address this problem in the paper.)

Interpreting imperfect-information games as continuous-state perfect-information games is not a new idea. Originating in work on cooperative imperfect-information games, this alternative view has been around for a long time. Until now, though, nobody knew how to combine it with self-play reinforcement learning in an adversarial setting.

With our imperfect-information game converted into a perfect-information game, one could in principle run a perfect-information RL+Search algorithm like AlphaZero directly on the converted game. Doing so would be incredibly inefficient, though. In the card game we just described, the action space has 156 dimensions (the probability of folding, calling, and betting for each of the 52 cards), and traditional RL+Search algorithms prove unwieldy in such scenarios.

Fortunately, in two-player zero-sum games these high-dimensional state and action spaces are convex optimization problems. ReBeL therefore uses a gradient-descent-like algorithm called counterfactual regret minimization (CFR) to search efficiently.

ReBeL is proven to converge to an optimal policy, a Nash equilibrium, in any two-player zero-sum game. We evaluated it on two games: heads-up no-limit Texas Hold’em and Liar’s Dice. In heads-up no-limit Texas Hold’em, we found that it beat two prior benchmark bots and also beat a top human professional with statistical significance.

In order to show that ReBeL really is a general framework, we also implemented the algorithm for Liar’s Dice, another popular imperfect-information game. Our experiments confirmed that ReBeL does indeed converge to an approximate Nash equilibrium in this game as well, and we have open-sourced our implementation to allow the wider AI research community to build upon these results.

### Conclusions

ReBeL is the first AI to enable sound RL+Search in imperfect-information games, but there are limitations to our current implementation. Most prominently, the amount of computation ReBeL requires grows unwieldy in certain games, such as Recon Chess, that have strategic depth but very little common knowledge.

ReBeL also relies on knowing the exact rules of the game. This is reasonable in recreational games like Go and poker where the rules and rewards are well defined, but problematic in real-world interactions. Recently, MuZero extended AlphaZero to games with unknown rules, and a similar extension of ReBeL would be a powerful development. Finally, ReBeL’s theoretical guarantees are limited to two-player zero-sum games, which are relatively rare in real-world interactions.

Nevertheless, ReBeL achieves low exploitability in benchmark games and superhuman performance in heads-up no-limit Texas Hold’em while leveraging far less expert knowledge than any prior bot. As such, we view this as a major step toward developing universal techniques for multiagent interactions, and thus as a step toward complex real-world applications like fraud detection and cybersecurity.

*By Noam Brown, Research Scientist | Anton Bakhtin, Research Engineer*

*Source: Facebook AI Blog*