How many games?
How many games are needed to prove a change? That is the question. Below are the results of a simulation that plays random matches to find the equilibrium, defined here as the point at which 30 matches in a row end with a score between 49.9% and 50.1%. That number seems to be about 100,000 games. See the examples [ one ] [ two ] and [ three ]
As example one shows, playing 5000 games between 2 identical engines may sometimes (emphasis added) produce totally unreliable results; compare the 51.1% result with the 48.9% result. In the worst case you might accept a change that is in reality a regression, because of the convincing 51.1% score (indicating +7 Elo) with a LOS (likelihood of superiority) of 96.6%, and never look back, so the regression remains forever.
Not funny. Download the emulate tool to do your own experiments. Source code included.
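For readers who want a quick look before downloading the tool, the experiment can be approximated with a short Python sketch. It is not the emulate tool's source code. It assumes each game is an independent draw from win, draw and loss at 1/3 each, uses the standard logistic formula to turn a score into an Elo difference, and uses the common normal-approximation formula for LOS based on decisive games only. The function names and the seed are made up for illustration. With these formulas a 51.1% score works out to between +7 and +8 Elo, in line with the figure quoted above.

```python
import math
import random


def elo_diff(score):
    """Elo difference implied by a score fraction (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def los(wins, losses):
    """Likelihood of superiority, normal approximation on decisive games."""
    if wins + losses == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(2.0 * (wins + losses))
    return 0.5 * (1.0 + math.erf(z))


def play_match(games, rng):
    """One match between identical engines: each outcome has probability 1/3."""
    wins = draws = losses = 0
    for _ in range(games):
        outcome = rng.randrange(3)
        if outcome == 0:
            wins += 1
        elif outcome == 1:
            draws += 1
        else:
            losses += 1
    return wins, draws, losses


def run_matches(matches=30, games=5000, seed=1):
    """Play several matches and return (score, elo, los) for each one."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so runs are repeatable
    results = []
    for _ in range(matches):
        wins, draws, losses = play_match(games, rng)
        score = (wins + 0.5 * draws) / games
        results.append((score, elo_diff(score), los(wins, losses)))
    return results


if __name__ == "__main__":
    for score, elo, superiority in run_matches():
        print(f"{score:6.1%}  {elo:+6.1f} Elo  LOS {superiority:6.1%}")
```

Changing games from 5000 to 100000 shows how much the spread between matches narrows.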
On the other hand, the simulation is based on random outcomes of 1-0, ½-½ and 0-1, each valued equally at 1/3 or 33.33%. While that is true for engines playing random moves, it isn't true in the real world. The stronger the engines, the higher the draw rate: not 33.33% but a lot higher, while the 1-0 and 0-1 outcomes keep decreasing. So the numbers have to be taken as an indication only. For a deeper investigation see the chapter What time control.
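The effect of the draw rate on the noise can be estimated without running any games. The sketch below assumes two equal engines, independent games, and wins and losses that are equally likely; under those assumptions the variance of a single game's score is (1 - draw rate) / 4. The function name and the draw rates other than 1/3 are chosen only for illustration and are not measured values.

```python
import math


def score_std_error(games, draw_rate):
    """Standard error of the match score for two equal engines.

    Assumes independent games with P(win) = P(loss) = (1 - draw_rate) / 2,
    so the per-game score variance is (1 - draw_rate) / 4.
    """
    return math.sqrt((1.0 - draw_rate) / (4.0 * games))


if __name__ == "__main__":
    # Illustrative draw rates; only 1/3 comes from the simulation above.
    for draw_rate in (1 / 3, 0.5, 0.7, 0.9):
        for games in (5000, 100000):
            error = score_std_error(games, draw_rate)
            print(f"draws {draw_rate:5.1%}  games {games:6d}  "
                  f"std error {error:.2%}")
```

At a 1/3 draw rate, 5000 games give a standard error of about 0.58 percentage points, so a 51.1% score between identical engines is a little under two standard errors from 50%, which is unusual but far from impossible.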
Experimenting with MEA (Multiple Move EPD Analyzer), an alternative testing method.
MEA analyzes EPD files that have multiple solution moves with points, such as the Strategic Test Suite [STS] containing 1500 theme-based positions. First we test whether quickly analyzing 1500 positions at 1 second per move produces reasonably reliable results. For this purpose we use 4 engines (Stockfish, Komodo, Arasan and Andscacs) that were updated frequently over the years and check whether their progress is in line with their progress on the CCRL 40/4 rating list.
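The scoring idea can be illustrated with a small Python sketch. This is not MEA's actual code. It assumes that each EPD line stores its move points in a c0 comment of the form "move=points, move=points", and that the engine's chosen move for each position is already known as a string in the same notation. The function names and the sample line are hypothetical.

```python
import re


def parse_move_points(epd_line):
    """Return {move: points} from an assumed c0 "move=points, ..." field."""
    match = re.search(r'c0\s+"([^"]*)"', epd_line)
    if not match:
        return {}
    points = {}
    for item in match.group(1).split(","):
        # rpartition keeps promotions such as "e8=Q" intact as the move part
        move, _, value = item.strip().rpartition("=")
        if move and value.isdigit():
            points[move] = int(value)
    return points


def score_suite(epd_lines, engine_moves):
    """Sum the points earned by the engine's moves; return (earned, maximum)."""
    earned = maximum = 0
    for line, move in zip(epd_lines, engine_moves):
        points = parse_move_points(line)
        if not points:
            continue  # position without usable point data is skipped
        earned += points.get(move, 0)  # a move that is not listed earns nothing
        maximum += max(points.values())
    return earned, maximum


if __name__ == "__main__":
    # Hypothetical sample line; the position part is elided on purpose.
    sample = ['... bm f5; id "sample.001"; c0 "f5=10, Be5+=2, Bf2=3";']
    print(score_suite(sample, ["Bf2"]))
```

Dividing the earned points by the maximum gives a percentage that can be compared between engine versions.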