Experiments

_____________________________________________________________________

Diminishing returns for Stockfish 9

It is well known that doubling hardware speed yields a significant ELO gain. It is also known that this gain shrinks as the time control increases, the so-called principle (effect) of diminishing returns.

The table below shows the results of SF9 self-play matches in which the second SF9 engine gets twice as much thinking time and, at fast time controls, even 4 or 8 times as much. This may give us a glimpse of what the future holds as hardware keeps improving.

Stockfish noteworthiness

The ERL page (an attempt to test the quality of the evaluation) shows that ProDeo does not do badly at all, even against SF8, and it awakened my curiosity about how it would perform when the depth is increased to 2 plies, 3 plies, etc. The result was amazing, see the table.

Match | ProDeo % |
D=1 | 44.8% |
D=2 | 67.7% |
D=3 | 90.1% |
D=4 | 92.2% |
D=5 | 91.1% |
D=6 | 87.1% |
D=7 | 83.0% |
D=8 | 73.0% |
D=9 | 58.2% |
D=10 | 44.2% |
D=11 | 35.6% |
D=12 | 30.0% |

Stockfish 8's drop in score is dramatic once it enters the main search, starting at D=2. It shows how extremely selective the search is, but then, as if by magic, the tables are turned starting at D=10 and the rest is known history. Perhaps (emphasis added) it makes sense for SF to be a bit more conservative in the early plies for the sake of better move ordering?

Remark - Repeating the test with Stockfish 9 shows even more selectivity: the turning point is not at depth=9 but at depth=10, and even at the root search [D=1] ProDeo's score goes up to 48.5%, which indicates that SF9 is perhaps selective even in the root move procedure.

Topics

Stockfish noteworthiness

Diminishing returns for Stockfish 9

Stockfish 9 Depth vs ELO

Scaling results chess engines

How many games are needed to prove a change?

ELO progression measured by year

Core Scheduling. A closer look.

Experimenting with MEA an alternative testing method.

What time control ?

The white advantage of the first move.

Similarity between the top engines back then and now.

Factor 2

Match | Games | Perc | ELO | Draws |
40/15 | 1000 | 65.1% | 109 | 55% |
40/30 | 1000 | 63.0% | 94 | 59% |
40/1m | 500 | 61.7% | 83 | 61% |
40/2m | 500 | 59.4% | 66 | 63% |
40/5m | 200 | 56.9% | 46 | 65% |
40/10m | 100 | 57.5% | 52 | 61% |
40/20m | 100 | 56.5% | 45 | 67% |
40/1h | 100 | TODO |

Factor 4

Match | Games | Perc | ELO | Draws |
40/15 | 1000 | 74.8% | 189 | 44% |
40/30 | 1000 | 72.8% | 170 | 48% |
40/1m | 500 | 69.6% | 144 | 53% |
40/2m | 250 | 65.4% | 111 | 62% |
40/5m | 100 | 62.5% | 90 | 59% |
40/10m | 100 | 64.0% | 101 | 58% |

Factor 8

Match | Games | Perc | ELO | Draws |
40/15 | 1000 | 81.8% | 253 | 34% |
40/30 | 500 | 77.7% | 215 | 41% |
40/1m | 250 | 73.0% | 173 | 51% |
40/2m | 100 | 70.5% | 152 | 53% |

The good news is that doubling the speed still gives a surprisingly large gain at long time controls, even for top engines. What happened to the dark age of the late 90's and the early years of the millennium, when it was believed that engines at [D=14] would mainly produce draws? It seems the end is nowhere in sight.
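For readers who want to turn a match percentage into an ELO difference themselves, here is a minimal sketch in Python. The page does not state which formula produced the ELO column, so both function names and both models are assumptions: elo_logistic uses the common logistic formula, elo_normal uses the older normal-distribution model with a standard deviation of 200 * sqrt(2). The second function needs statistics.NormalDist, available from Python 3.8.

```python
import math
from statistics import NormalDist


def elo_logistic(score: float) -> float:
    """ELO difference implied by a score fraction under the logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


def elo_normal(score: float) -> float:
    """ELO difference under the normal model (assumed sigma = 200 * sqrt(2))."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 200.0 * math.sqrt(2.0) * NormalDist().inv_cdf(score)


if __name__ == "__main__":
    # A few percentages taken from the 40/15 rows of the tables above.
    for pct in (65.1, 74.8, 81.8):
        s = pct / 100.0
        print(f"{pct}% -> logistic {elo_logistic(s):.0f}, normal {elo_normal(s):.0f}")
```

Near 50% the two models give almost the same value; at lopsided scores they drift apart, so a few points of difference with the published column are to be expected.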

___________________________________________________________________________________________________

Diminishing returns - Depth vs ELO

Another way to measure diminishing returns is by iteration depth instead of time. We do this for Stockfish 9 and for an old-fashioned engine like ProDeo, in self-play matches of Depth=x vs Depth=x+1.

Stockfish 9 | Depth+1 | ELO |
Depth = 1 | 53.3% | +23 |
Depth = 2 | 52.9% | +20 |
Depth = 3 | 64.5% | +105 |
Depth = 4 | 70.4% | +150 |
Depth = 5 | 75.1% | +191 |
Depth = 6 | 76.8% | +207 |
Depth = 7 | 76.0% | +200 |
Depth = 8 | 81.8% | +254 |
Depth = 9 | 74.9% | +190 |
Depth = 10 | 72.7% | +170 |
Depth = 11 | 68.6% | +136 |
Depth = 12 | 67.0% | +124 |
Depth = 13 | 62.9% | +93 |
Depth = 14 | 59.9% | +70 |
Depth = 15 | 61.7% | +83 |
Depth = 16 | 59.4% | +66 |
Depth = 17 | 61.8% | +84 |
Depth = 18 | 56.2% | +43 |
Depth = 19 | 57.0% | +49 |
Depth = 20 | 54.2% | +29 |
Depth = 21 | 55.6% | +39 |
Depth = 22 | 58.0% | +56 |
Depth = 23 | 53.3% | +21 |
Depth = 24 | 54.0% | +28 |
Depth = 25 | 54.5% | +31 |
Depth = 26 | 54.3% | +29 |
Depth = 27 | 52.0% | +13 |
Depth = 28 | 54.5% | +31 |
Depth = 29 | 55.0% | +35 |
Depth = 30 | 50.0% | 0 |

ProDeo 2.2 | Depth+1 | ELO |
Depth = 1 | 93.6% | +429 |
Depth = 2 | 88.8% | +340 |
Depth = 3 | 70.8% | +154 |
Depth = 4 | 80.7% | +244 |
Depth = 5 | 77.2% | +209 |
Depth = 6 | 74.4% | +187 |
Depth = 7 | 70.4% | +151 |
Depth = 8 | 69.4% | +142 |
Depth = 9 | 68.6% | +136 |
Depth = 10 | 68.4% | +134 |
Depth = 11 | 62.8% | +91 |
Depth = 12 | 64.3% | +102 |
Depth = 13 | 62.2% | +87 |
Depth = 14 | 60.8% | +76 |
Depth = 15 | 63.8% | +98 |

Remarks - The numbers of the last 3 Stockfish matches (28 vs 29, 29 vs 30 and 30 vs 31) are statistically not very reliable: because of the time such matches take, only 100 games were played. The same goes for the last ProDeo matches, with only 200 games. Nevertheless the table gives some insight into what the future might hold.

_________________________________________________________________________________________________

Scaling results chess engines

How well do chess engines scale? When they are given (for instance) double or triple the time, do they gain or lose ELO in comparison with other engines? Below are the results from the CEGT rating list, comparing the 40/4 and the 40/20 time controls and the resulting gain or loss in ELO. The minimum number of games played is 300.

Engine | ELO 40/4 | ELO 40/20 | Scaling |

Stockfish 9 | 3475 | 3480 | +5 |

Houdini 6 | 3457 | 3440 | -17 |

Komodo 11 | 3404 | 3398 | -6 |

Shredder 13 | 3249 | 3218 | -31 |

Engine | ELO 40/4 | ELO 40/20 | Scaling |

Alfil 15.4 | 2599 | 2721 | +122 |

Zappa Mexico | 2507 | 2617 | +110 |

Vajolet2 1.28 | 2655 | 2712 | +62 |

Arasan 15.3 | 2517 | 2574 | +57 |

_________________________________________________________________________________________________

How many games ?

How many games are needed to prove a change? That's the question. Below are the results of a simulation that runs random matches to find the equilibrium: the number of games at which 30 matches in a row all end between 49.9% and 50.1%. That number seems to be 100,000 games. See the examples [ one ] [ two ] and [ three ].

As one can see from example one, playing 5000 games between 2 identical engines may sometimes (emphasis added) produce totally unreliable results, see the 51.1% result vs the 48.9% result. In the worst-case scenario you might accept a change that in reality is a regression, because of the convincing 51.1% score (indicating +7 ELO) with a LOS of 96.6%, and never look back; hence the regression remains forever.

Not funny.

On the other hand, the simulation is based on the random outcomes 1-0 | ½-½ | 0-1, each with an equal probability of 1/3 or 33.33%. While that is true for engines playing random moves, it isn't true in the real world. The stronger the engines, the higher the draw rate: not 33.33% but a lot higher, while 1-0 and 0-1 outcomes become ever rarer. So the numbers have to be taken as an indication only. For a deeper investigation see the chapter What time control.
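The kind of simulation described above can be sketched in a few lines of Python. This is not the program used for the examples, only an illustration under the same assumption of three equally likely outcomes; the function names and the seed are made up for this sketch.

```python
import math
import random


def play_match(games: int, rng: random.Random) -> float:
    """Score fraction of one random match: win, draw and loss are equally likely."""
    points = sum(rng.choice((0.0, 0.5, 1.0)) for _ in range(games))
    return points / games


def score_std_dev(games: int) -> float:
    """Theoretical standard deviation of the score fraction under this model."""
    # Per-game variance: E[x^2] - E[x]^2 = (0 + 0.25 + 1) / 3 - 0.25 = 1/6.
    return math.sqrt(1.0 / 6.0 / games)


if __name__ == "__main__":
    rng = random.Random(2018)  # arbitrary fixed seed so a run can be repeated
    results = [play_match(5000, rng) for _ in range(30)]
    print(f"lowest  {min(results):.1%}")
    print(f"highest {max(results):.1%}")
    print(f"one standard deviation at 5000 games: {score_std_dev(5000):.2%}")
```

Under this model one standard deviation at 5000 games is about 0.58%, so a 51.1% result between identical engines sits roughly 1.9 standard deviations from 50%. With a higher draw rate the per-game variance, and with it the spread, becomes smaller.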

_________________________________________________________________________________________________

Experimenting with MEA (Multiple Move EPD Analyzer) an alternative testing method.

MEA analyzes EPD files that have multiple solution moves with points, such as the Strategic Test Suite [STS], which contains 1500 theme-based positions. First we test whether quickly analyzing 1500 positions at 1 second per move produces reasonable results. For this purpose we take 4 engines (Stockfish, Komodo, Arasan and Andscacs) that have been updated frequently through the years and check whether their progress is in line with their progress on the CCRL 40/4 rating list.

Stockfish | ELO | Score | Komodo | ELO | Score | Arasan | ELO | Score | Andscacs | ELO | Score |
4 | 3180 | 12.678 | 4 | 3127 | 12.167 | 13.1 | 2575 | 9.927 | 0.70 | 2827 | 10.719 |
5 | 3243 | 12.883 | 5 | 3164 | 12.297 | 14.1 | 2591 | 9.949 | 0.82 | 3030 | 11.338 |
6 | 3318 | 12.853 | 6 | 3185 | 12.260 | 15.1 | 2693 | 10.133 | 0.87 | 3131 | 11.913 |
7 | 3354 | 12.831 | 7 | 3205 | 12.613 | 17.3 | 2792 | 10.380 | 0.90 | 3176 | 12.210 |
8 | 3424 | 13.030 | 8 | 3236 | 12.699 | 19.2 | 2979 | 11.418 | 0.93 | 3208 | 12.317 |
9 | 3485 | 13.067 | 9 | 3338 | 13.027 | 20.4.1 | 3046 | 11.569 |

Except for the 3 exceptions marked in red, the results are reasonably in sync with the CCRL 40/4 rating list, and perhaps increasing the time control from 1 second per move to (say) 5 seconds per move would clear the sky completely. For more detailed research click here.
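The text does not spell out which entries count as exceptions. One plausible reading (an assumption, not something the page states) is a version whose MEA score went down although its CCRL ELO went up compared with the previous version. The sketch below checks the table against that reading; RESULTS and find_exceptions are names made up for this illustration.

```python
# (version, CCRL ELO, MEA score) per engine, copied from the table above.
RESULTS = {
    "Stockfish": [("4", 3180, 12.678), ("5", 3243, 12.883), ("6", 3318, 12.853),
                  ("7", 3354, 12.831), ("8", 3424, 13.030), ("9", 3485, 13.067)],
    "Komodo": [("4", 3127, 12.167), ("5", 3164, 12.297), ("6", 3185, 12.260),
               ("7", 3205, 12.613), ("8", 3236, 12.699), ("9", 3338, 13.027)],
    "Arasan": [("13.1", 2575, 9.927), ("14.1", 2591, 9.949), ("15.1", 2693, 10.133),
               ("17.3", 2792, 10.380), ("19.2", 2979, 11.418), ("20.4.1", 3046, 11.569)],
    "Andscacs": [("0.70", 2827, 10.719), ("0.82", 3030, 11.338), ("0.87", 3131, 11.913),
                 ("0.90", 3176, 12.210), ("0.93", 3208, 12.317)],
}


def find_exceptions(results):
    """Return (engine, version) pairs whose score fell although the ELO rose."""
    found = []
    for engine, rows in results.items():
        # Compare every version with its predecessor.
        for (_, prev_elo, prev_score), (version, elo, score) in zip(rows, rows[1:]):
            if elo > prev_elo and score < prev_score:
                found.append((engine, version))
    return found


if __name__ == "__main__":
    for engine, version in find_exceptions(RESULTS):
        print(f"{engine} {version}: score dropped while ELO rose")
```

Under this reading the check flags three entries: Stockfish 6, Stockfish 7 and Komodo 6.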

_____________________________________________________________________________________________

What time control ?

Which time control to use to prove whether an engine change is an improvement is closely related to the subject above, how many games, which makes the question hard to answer.

Around 2011-2012 programmers became increasingly aware of the "how many games" problem and started to play 10,000-15,000-20,000+ bullet games (10-15 seconds per game, or something similar) to establish the LOS (likelihood of superiority), the certainty that a change is good. You can download a simple LOS calculator for a better understanding. Programmers in general accept an engine change if the LOS is 95% or preferably higher. With a LOS of 95% there is still a 5% chance that the engine change was not an improvement but a regression. On the other hand, in that rare 5% case the regression can't be large, which is a comforting thought.
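For those who prefer code to a download, a LOS calculation can be sketched as follows. This uses the commonly quoted normal approximation in which only wins and losses matter and draws cancel out; it is not necessarily the formula inside the calculator mentioned above, and the function name is made up for this sketch.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only (normal approximation)."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


if __name__ == "__main__":
    # Hypothetical match result, for illustration only.
    print(f"LOS: {los(1100, 1000):.1%}")
```

Note that the number of draws does not appear in the formula: two matches with the same wins and losses give the same LOS, however many draws were played.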

This major shift in engine testing has worked out quite well, see the ELO progression measured by year statistic.

After this long introduction let's move to the point of this subject and have a look at the CEGT and CCRL top-5 engines in July 2018.

CEGT 40/4 | ELO | Games | CEGT 40/20 | ELO | Games | CEGT 5+3 PB=ON | ELO | Games |
Stockfish 9 | 3387 | 2950 | Stockfish 9 | 3384 | 2000 | Stockfish 9 | 3340 | 1400 |
Houdini 6 | 3368 | 4500 | Houdini 6 | 3354 | 1999 | Houdini 6 | 3321 | 1400 |
Komodo 11 | 3340 | 3186 | Komodo 11 | 3337 | 1410 | Komodo 11 | 3297 | 1400 |
Fire 7 | 3237 | 1800 | Fire 7 | 3233 | 1200 | Fire 7 | 3179 | 1400 |
Shredder 13 | 3151 | 6950 | Shredder 13 | 3159 | 4676 | Shredder 13 | 3104 | 1400 |

CCRL 40/4 | ELO | Games | CCRL 40/40 | ELO | Games |
Stockfish 9 | 3491 | 5084 | Stockfish 9 | 3381 | 1682 |
Houdini 6 | 3456 | 5584 | Houdini 6 | 3344 | 1760 |
Komodo 11 | 3429 | 510 | Komodo 11 | 3335 | 856 |
Fire 7 | 3346 | 1387 | Fire 7 | 3261 | 950 |
Shredder 13 | 3271 | 4616 | Shredder 13 | 3201 | 2933 |

Aside from (or maybe better: because of) the extreme consistency, with the order of the 5 best programs matching on all 5 lists and even their elo differences reasonably in sync, what is the interesting observation to make?

[A] - That, looking especially at the CEGT 5+3 PB=ON list, you don't need to play 15,000-20,000 games. Testing at longer time controls is likely (emphasis added, hence the wording observation) a good alternative, with the added advantage of much better insight into how an engine scales, which is always a factor of uncertainty when playing bullet games.

_____________________________________________________________________________________________

The white advantage of the first move

We all know that playing the white pieces gives a small advantage, but how big is it in terms of elo? Is there a difference between the average rated player and the top grandmasters? Does the draw rate increase? And... is there a difference between humans and computers? Questions, questions... questions.

Yes, there are remarkable differences. From large databases we extracted the below statistics.

Database | Type | Games | White advantage | Draw rate |

Megabase 2003 | Human | 2.211.673 | 53.7% | 32.6% |

MillionBase 2.9 | Human | 2.813.817 | 54.1% | 32.2% |

ELO-2500.pgn | Human | 204.332 | 55.3% | 52.4% |

ELO-2600.pgn | Human | 65.199 | 55.3% | 52.8% |

ELO-2700.pgn | Human | 15.301 | 54.7% | 52.5% |

ELO-2800.pgn | Human | 264 * | 49.4% | 49.6% |

CCRL 40/4 40m in 4m | Computer | 1.918.042 | 53.7% | 29.4% |

CCRL 40/4 [ elo 2800 ] | Computer | 1.359.458 | 55.5% | 43.6% |

CCRL 40/4 [ elo 2900 ] | Computer | 906.987 | 55.6% | 45.8% |

CCRL 40/4 [ elo 3000 ] | Computer | 539.539 | 55.7% | 48.2% |

CCRL 40/4 [ elo 3100 ] | Computer | 259.397 | 55.7% | 49.9% |

CCRL 40/4 [ elo 3200 ] | Computer | 85.001 | 55.9% | 53.9% |

CCRL 40/4 [ elo 3300 ] | Computer | 22.497 | 56.5% | 64.6% |

CCRL 40/4 [ elo 3400 ] | Computer | 5.668 | 57.2% | 66.7% |

CCRL 40/40 40m in 40m ** |

CCRL 40/40 [ elo 3200 ] | Computer | 28.478 | 56.0% | 62.6% |

CCRL 40/40 [ elo 3300 ] | Computer | 7.995 | 55.7% | 74.9% |

CCRL 40/40 [ elo 3400 ] | Computer | 203 * | 54.2% | 84.7% |

TCEC super finals S10-S14 *** | Computer | 402 * | 52.2% | 89.0% |

* With only so few games the numbers are not representative.

** The time control in CCRL 40/40 is 10 times longer than in the 40/4 games, hence better play; compare with the 40/4 rows above.

*** The TCEC super finals are played on massive hardware and long time control.

Observations

1. Despite the increasing draw rate (especially in the computer part), the advantage of the first move remains significant for both top humans and top computers.

2. In general the computers utilize the advantage of the first move somewhat better than humans.

The statistics were made with ProTools and ProDeo 2.9; an overview can be seen here.

____________________________________________________________________________________________

Similarity between the top engines during 2006-2014 and now, anno 2019

With the release of the open source code of Fruit 2.1 many clones / derivatives appeared, later followed by many clones / derivatives of the hacked Rybka 3. We will give a short overview of both periods, the Fruit-family and the Rybka-family and then look into the situation of the top engines today.

Procedure - Somewhere in 2010/11 programmer Don Dailey of Komodo released a tool (called Similarity Tester) that measures the similarity between engines.

Legend - Before viewing the results of the Fruit-family and the Rybka-family, consider the meaning of the percentages first.

1. A similarity percentage of over 70% means that the engine is a clone with only a few changes.

2. A similarity percentage of 65-69% is considered a clone with considerable changes.

3. A similarity percentage of 60-64% is called a derivative work.

4. A similarity percentage of 55-59% is the grey area: likely a derivative work, but impossible to conclude for sure.

5. A similarity percentage of 50-54% is still a bit high, but it is impossible to draw a conclusion.
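As a small illustration, the legend can be written down as a lookup. This is only a sketch: classify_similarity is a made-up name, it is not part of Similarity Tester, and the handling of the edges is an assumption (exactly 70% is treated as the top band, and a fractional value such as 69.5% falls into the band below it).

```python
def classify_similarity(percent: float) -> str:
    """Map a similarity percentage to the category named in the legend."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    # (inclusive lower bound, category), checked from the highest band down.
    bands = [
        (70.0, "clone with only a few changes"),
        (65.0, "clone with considerable changes"),
        (60.0, "derivative work"),
        (55.0, "grey area, likely a derivative work"),
        (50.0, "still a bit high, no conclusion possible"),
    ]
    for lower_bound, category in bands:
        if percent >= lower_bound:
            return category
    return "below the ranges in the legend"


if __name__ == "__main__":
    for value in (72.0, 67.0, 58.0, 40.0):
        print(f"{value}% -> {classify_similarity(value)}")
```

A result below 50% only means the legend has nothing to say about it; it is not proof of originality.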

As I recall, about 80-90% of the chess programmers agreed with these percentages; a few wanted to shift them one place up or one place down.

Limitations - Similarity Tester can only prove that an engine is a clone or a derivative work; it cannot prove the opposite. An engine might score a similarity percentage of 40% (an indication of a very original engine) and still be a derivative work. There are some tricks to fool Similarity Tester without much elo loss.

And now for anno 2019: notice the current healthy situation, computer chess has healed itself.

And out of curiosity I also put Rebelfish to the test and measured the similarity between Stockfish and ProDeo with its flexible score margins.