The machines have proven their superiority in one-on-one games like chess and go, and even poker, but in complex multiplayer versions of the card game humans have retained their edge… until now. An evolution of the last AI agent to flummox poker pros individually is now decisively beating them in a championship-style 6-person game.
As documented in a paper published in the journal Science today, Pluribus, an agent built by a CMU/Facebook collaboration, reliably beats five professional poker players in the same game, or one pro pitted against five independent copies of itself. It’s a major leap forward in capability for the machines, and amazingly it is also far more efficient than previous agents.
One-on-one poker is a weird game, and not a simple one, but the zero-sum nature of it (whatever you lose, the other player gets) makes it susceptible to certain strategies in which a computer able to calculate out far enough can put itself at an advantage. But add four more players into the mix and things get real complex, real fast.
With six players, the possibilities for hands, bets, and possible outcomes are so numerous that it is effectively impossible to account for all of them, especially in a minute or less. It’d be like trying to exhaustively document every grain of sand on a beach between waves.
Yet over 10,000 hands played with champions, Pluribus managed to win money at a steady rate, exposing no weaknesses or habits that its opponents could take advantage of. What’s the secret? Consistent randomness.
Even computers have regrets
Pluribus was trained, like many game-playing AI agents these days, not by studying how humans play but by playing against itself. At the beginning this is probably like watching kids, or for that matter me, play poker — constant mistakes, but at least the AI and the kids learn from them.
The training program used something called Monte Carlo counterfactual regret minimization. Sounds like when you have whiskey for breakfast after losing your shirt at the casino, and in a way it is — machine learning style.
Regret minimization just means that when the system would finish a hand (against itself, remember), it would then play that hand out again in different ways, exploring what might have happened had it checked here instead of raised, folded instead of called, and so on. (Since it didn’t really happen, it’s counterfactual.)
A Monte Carlo tree is a way of organizing and evaluating lots of possibilities, akin to climbing a tree of them branch by branch and noting the quality of each leaf you find, then picking the best one once you think you’ve climbed enough.
If you do it ahead of time (this is done in chess, for instance) you’re looking for the best move to choose. But if you combine it with the regret function, you’re looking through a catalog of possible ways the game could have gone and observing which would have had the best outcome.
So Monte Carlo counterfactual regret minimization is just a way of systematically investigating what might have happened if the computer had acted differently, and adjusting its model of how to play accordingly.
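To make the idea concrete, here is a minimal Python sketch of regret matching, the update rule at the core of this family of algorithms, applied to a single decision rather than a full poker game. It is not Pluribus's training code. The three actions, the `toy_hand` payoffs, and all function names are assumptions invented for illustration. After each sampled hand, the learner asks what every other action would have earned, adds the difference to a running regret total, and then plays each action in proportion to its positive regret.

```python
import random

ACTIONS = ["fold", "call", "raise"]  # hypothetical action set for one decision


def strategy_from_regrets(regrets):
    """Regret matching: weight each action by its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total == 0:
        # Nothing to regret yet, so play uniformly at random.
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def update_regrets(regrets, chosen, payoffs):
    """Counterfactual step: how much better would each action have done?"""
    for action in range(len(regrets)):
        regrets[action] += payoffs[action] - payoffs[chosen]


def toy_hand(rng):
    """Invented payoffs: folding earns 0; calling or raising wins 60% of hands."""
    win = rng.random() < 0.6
    return [0.0, 1.0 if win else -1.0, 2.0 if win else -2.0]


def train(payoff_fn, iterations, seed=0):
    rng = random.Random(seed)
    regrets = [0.0] * len(ACTIONS)
    for _ in range(iterations):
        strategy = strategy_from_regrets(regrets)
        chosen = rng.choices(range(len(ACTIONS)), weights=strategy)[0]
        payoffs = payoff_fn(rng)  # Monte Carlo part: one sampled hand
        update_regrets(regrets, chosen, payoffs)
    return strategy_from_regrets(regrets)


if __name__ == "__main__":
    final = train(toy_hand, 20000)
    print(dict(zip(ACTIONS, final)))
```

In this toy setting raising has the highest average payoff, so the learned strategy should drift toward raising. In real poker the payoffs depend on what the opponents do, and they are learning too, which is why the procedure runs over an entire game tree in self-play rather than over one isolated choice.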
Of course the number of games is nigh-infinite if you want to consider what would happen if you had bet $101 rather than $100, or whether you would have won that big hand if you’d had an eight kicker instead of a seven. Therein also lies nigh-infinite regret, the kind that keeps you in bed in your hotel room until past lunch.
The truth is these minor changes matter so seldom that the possibility can basically be ignored entirely. It will never really matter that you bet an extra buck, so any bet between, say, 70 and 130 can be considered exactly the same by the computer. Same with cards: whether the jack is a heart or a spade doesn’t matter except in very specific (and usually obvious) situations, so 99.999 percent of the time the hands can be considered equivalent.
This “abstraction” of gameplay sequences and “bucketing” of possibilities greatly reduces the possibilities Pluribus has to consider. It also helps keep the calculation load low; Pluribus was trained on a relatively ordinary 64-core server rack over about a week, while other models might take processor-years in high-power clusters. It even runs on an (admittedly beefy) rig with two CPUs and 128 gigs of RAM.
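A rough Python sketch of what bucketing can look like follows. The representative bet sizes in `BET_BUCKETS` and both function names are assumptions made up for this example, and the card grouping shown is the familiar suit-blind shorthand for two starting cards, not the abstraction the Pluribus team actually used.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
BET_BUCKETS = (25, 100, 200, 400)  # hypothetical representative bet sizes


def bucket_bet(amount):
    """Treat a bet as the nearest representative size."""
    return min(BET_BUCKETS, key=lambda size: abs(size - amount))


def canonical_hole_cards(card1, card2):
    """Collapse two starting cards to ranks plus suited/offsuit, e.g. 'JTo'."""
    (rank1, suit1), (rank2, suit2) = card1, card2  # "Jh" -> ("J", "h")
    high, low = sorted((rank1, rank2), key=RANKS.index, reverse=True)
    if high == low:
        return high + low  # pairs can never be suited
    return high + low + ("s" if suit1 == suit2 else "o")


def count_hole_card_classes():
    """Count distinct starting hands once specific suits are ignored."""
    deck = [rank + suit for rank in RANKS for suit in SUITS]
    return len({canonical_hole_cards(a, b) for a, b in combinations(deck, 2)})


if __name__ == "__main__":
    print(bucket_bet(100), bucket_bet(101))    # both land in the same bucket
    print(canonical_hole_cards("Jh", "Ts"))    # same class as Js Td
    print(count_hole_card_classes())
```

With these assumed buckets, bets of 70, 101 and 130 all collapse to 100, and the 1,326 possible two-card starting hands collapse to 169 classes. Savings of that kind, repeated at every decision, are what keep the game small enough to train on modest hardware.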
Random like a fox
The training produces what the team calls a “blueprint” for how to play that’s fundamentally strong and would probably beat plenty of players. But a weakness of AI models is that they develop tendencies that can be detected and exploited.
In Facebook’s writeup of Pluribus, it provides the example of two computers playing rock-paper-scissors. One picks randomly while the other always picks rock. Theoretically they’d both win the same number of games. But if the computer tried the all-rock strategy on a human, it would start losing with a quickness and never stop.
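The rock-paper-scissors point can be checked with a few lines of Python. This is an illustrative sketch with made-up function names, not anything from the Pluribus code: it scores a win as +1 and a loss as -1, then compares how the two strategies fare against each other with how much an opponent who has figured out the strategy can win per round.

```python
ACTIONS = ("rock", "paper", "scissors")

# PAYOFF[mine][theirs]: +1 if my pick beats theirs, -1 if it loses, 0 on a tie.
PAYOFF = {
    "rock": {"rock": 0, "paper": -1, "scissors": 1},
    "paper": {"rock": 1, "paper": 0, "scissors": -1},
    "scissors": {"rock": -1, "paper": 1, "scissors": 0},
}

ALWAYS_ROCK = {"rock": 1.0, "paper": 0.0, "scissors": 0.0}
UNIFORM = {action: 1.0 / 3 for action in ACTIONS}


def expected_payoff(strategy, opponent):
    """Average payoff per round for `strategy` playing against `opponent`."""
    return sum(
        strategy[mine] * opponent[theirs] * PAYOFF[mine][theirs]
        for mine in ACTIONS
        for theirs in ACTIONS
    )


def exploitability(strategy):
    """Best average payoff available to an opponent who knows `strategy`."""
    return max(
        sum(strategy[mine] * PAYOFF[reply][mine] for mine in ACTIONS)
        for reply in ACTIONS
    )


if __name__ == "__main__":
    print(expected_payoff(ALWAYS_ROCK, UNIFORM))  # break-even against random play
    print(exploitability(ALWAYS_ROCK))            # an observant opponent plays paper
    print(exploitability(UNIFORM))                # nothing to exploit
```

Against the random player, all-rock breaks even. Against someone paying attention it loses every round, while the uniform strategy gives an opponent nothing to gain no matter what they do.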
As a simple example in poker, maybe a particular series of bets always makes the computer go all in regardless of its hand. If a player can spot that series, they can take the computer to town any time they like. Finding and preventing ruts like these is important to creating a game-playing agent that can beat resourceful and observant humans.
To do this Pluribus does a couple of things. First, it has modified versions of its blueprint to put into play should the game lean towards folding, calling, or raising. Different strategies for different games mean it’s less predictable, and it can switch in a minute should the bet patterns change and the hand go from a calling to a bluffing one.
It also engages in a short but comprehensive introspective search looking at how it would play if it had every other hand, from a big nothing up to a straight flush, and how it would bet. It then picks its bet in the context of all those, careful to do so in such a way that it doesn’t point to any one in particular. Given the same hand and same play again, Pluribus wouldn’t choose the same bet, but rather vary it to remain unpredictable.
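A toy Python illustration of why that matters is below. It is not Pluribus's search procedure. The three hand types, their probabilities, the two bet sizes, and the function names are all invented for the example. It applies Bayes' rule from the point of view of an opponent who knows the bot's strategy, and asks how much a single bet reveals about the hand behind it.

```python
import random

# Hypothetical chances of holding each kind of hand.
PRIOR = {"nothing": 0.7, "pair": 0.25, "straight flush": 0.05}

# A readable strategy: bets big only with the monster hand.
TRANSPARENT = {
    "nothing": {"small": 1.0},
    "pair": {"small": 1.0},
    "straight flush": {"big": 1.0},
}

# A balanced strategy: every hand uses the same random mix of bet sizes.
BALANCED = {hand: {"small": 0.6, "big": 0.4} for hand in PRIOR}


def sample_bet(bet_distribution, rng=random):
    """Draw one bet at random, so the same spot does not always give the same bet."""
    bets = list(bet_distribution)
    weights = [bet_distribution[bet] for bet in bets]
    return rng.choices(bets, weights=weights)[0]


def posterior_given_bet(prior, strategy, bet):
    """Bayes' rule: what an observer who knows the strategy infers from a bet."""
    joint = {hand: prior[hand] * strategy[hand].get(bet, 0.0) for hand in prior}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("this strategy never makes that bet")
    return {hand: weight / total for hand, weight in joint.items()}


if __name__ == "__main__":
    print(posterior_given_bet(PRIOR, TRANSPARENT, "big"))
    print(posterior_given_bet(PRIOR, BALANCED, "big"))
    print(sample_bet(BALANCED["pair"]))
```

Under the transparent strategy a big bet gives the hand away completely. Under the balanced one the opponent learns nothing beyond the odds they started with. A real strategy sits somewhere in between, since strong and weak hands do deserve different mixes, but the aim is the same: no single bet should point to one hand.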
These strategies contribute to the “consistent randomness” I alluded to earlier, which was part of the model’s ability to slowly but reliably beat some of the best players in the world.
The human’s lament
There are too many hands to point to a particular one or ten that indicate the power Pluribus was bringing to bear on the game. Poker is a game of skill, luck, and determination, and one where winners emerge only after dozens or hundreds of hands.
And here it must be said that the experimental setup is not entirely reflective of an ordinary 6-person poker game. Unlike a real game, chip counts are not maintained as an ongoing total — for every hand, each player was given 10,000 chips to use as they pleased, and win or lose they were given 10,000 in the next hand as well.
Obviously this rather limits the long-term strategies possible, and indeed “the bot was not looking for weaknesses in its opponents that it could exploit,” said Facebook AI research scientist Noam Brown. Truly Pluribus was living in the moment the way few humans can.
But simply because it was not basing its play on long-term observations of opponents’ individual habits or styles does not mean that its strategy was shallow. On the contrary, it is arguably more impressive, and casts the game in a different light, that a winning strategy exists that does not rely on behavioral cues or exploitation of individual weaknesses.
The pros who had their lunch money taken by the implacable Pluribus were good sports, however. They praised the system’s high level play, its validation of existing techniques, and inventive use of new ones. Here’s a selection of laments from the fallen humans:
I was one of the earliest players to test the bot so I got to see its earlier versions. The bot went from being a beatable mediocre player to competing with the best players in the world in a few weeks. Its major strength is its ability to use mixed strategies. That’s the same thing that humans try to do. It’s a matter of execution for humans — to do this in a perfectly random way and to do so consistently. It was also satisfying to see that a lot of the strategies the bot employs are things that we do already in poker at the highest level. To have your strategies more or less confirmed as correct by a supercomputer is a good feeling. -Darren Elias
It was incredibly fascinating getting to play against the poker bot and seeing some of the strategies it chose. There were several plays that humans simply are not making at all, especially relating to its bet sizing. -Michael ‘Gags’ Gagliano
Whenever playing the bot, I feel like I pick up something new to incorporate into my game. As humans I think we tend to oversimplify the game for ourselves, making strategies easier to adopt and remember. The bot doesn’t take any of these short cuts and has an immensely complicated/balanced game tree for every decision. -Jimmy Chou
In a game that will, more often than not, reward you when you exhibit mental discipline, focus, and consistency, and certainly punish you when you lack any of the three, competing for hours on end against an AI bot that obviously doesn’t have to worry about these shortcomings is a grueling task. The technicalities and deep intricacies of the AI bot’s poker ability was remarkable, but what I underestimated was its most transparent strength – its relentless consistency. -Sean Ruane
Beating humans at poker is just the start. As good a player as it is, Pluribus is more importantly a demonstration that an AI agent can achieve superhuman performance at something as complicated as 6-player poker.
“Many real-world interactions, such as financial markets, auctions, and traffic navigation, can similarly be modeled as multi-agent interactions with limited communication and collusion among participants,” writes Facebook in its blog.
Yes, and war.