Stockfish noteworthiness
The ERL page (an attempt to test the quality of the evaluation) shows that ProDeo does not do badly at all, even against SF8, and it made me curious how it would perform when the depth is increased to 2 plies, 3 plies, etc. The result was amazing; see the table.
Match  ProDeo %    Match  ProDeo %
D=1    44.8%       D=7    83.0%
D=2    67.7%       D=8    73.0%
D=3    90.1%       D=9    58.2%
D=4    92.2%       D=10   44.2%
D=5    91.1%       D=11   35.6%
D=6    87.1%       D=12   30.0%
Stockfish 8's drop in score is dramatic once it enters the main search, starting at D=2. It shows how extremely selective the search is, but then, as if by magic, the tables are turned starting at D=10, and the rest is known history. Perhaps (emphasis added) it makes sense for SF to be a bit more conservative in the early plies for the sake of better move ordering?
Remark: Repeating the test with Stockfish 9 shows even more selectivity: the turning point is not at depth 9 but at depth 10, and even the root search [D=1] score goes up for ProDeo to 48.5%, which indicates that SF9 is perhaps selective even in the root move procedure.
Topics
Stockfish noteworthiness
Diminishing returns for Stockfish 9
Stockfish 9 Depth vs ELO
Scaling results chess engines
How many games are needed to prove a change?
ELO progression measured by year
Core Scheduling. A closer look.
Experimenting with MEA an alternative testing method.
What time control ?
_______________________________________________________________________________________
Diminishing returns for Stockfish 9
It is known that doubling hardware speed gives a significant ELO gain. It is also known that this gain shrinks as the time control increases, the so-called diminishing returns principle (effect).
The table below shows the results of SF9 self-play matches in which the second SF9 engine gets twice as much time and, at fast time controls, even 4 or 8 times as much thinking time. This may give us a glimpse of what the future holds as hardware keeps improving.
         Factor 2                      Factor 4                      Factor 8
Match    Games  Perc   ELO  Draws      Games  Perc   ELO  Draws      Games  Perc   ELO  Draws
40/15    1000   65.1%  109  55%        1000   74.8%  189  44%        1000   81.8%  253  34%
40/30    1000   63.0%  94   59%        1000   72.8%  170  48%        500    77.7%  215  41%
40/1m    500    61.7%  83   61%        500    69.6%  144  53%        250    73.0%  173  51%
40/2m    500    59.4%  66   63%        250    65.4%  111  62%        100    70.5%  152  53%
40/5m    200    56.9%  46   65%        100    62.5%  90   59%        -      -      -    -
40/10m   100    57.5%  52   61%        100    64.0%  101  58%        -      -      -    -
40/20m   100    56.5%  45   67%        -      -      -    -          -      -      -    -
40/1h    100    TODO   -    -          -      -      -    -          -      -      -    -
The good news is that doubling the speed still gives an amazing amount at long time controls, even for top engines. What happened to the dark age of the late 90's and the early years of the millennium, when it was thought that engines at [D=14] would mainly produce draws? It seems the end is nowhere in sight.
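The ELO column in the table above can be related to the Perc column with the usual logistic rating formula. The snippet below is a minimal sketch, assuming that this standard formula is what was used; the function name elo_from_score is made up for this illustration. Results can differ from the table by a point or so, since the table was presumably computed from unrounded scores.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1), logistic model.

    Assumption: the standard formula 400 * log10(p / (1 - p)); the source
    does not state which formula produced its ELO column.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


# A few "Factor 2" rows from the table (time control, score percentage).
for label, pct in [("40/15", 65.1), ("40/2m", 59.4), ("40/20m", 56.5)]:
    print(f"{label:>7}: {pct:.1f}% -> {elo_from_score(pct / 100):+.0f} Elo")
```

A score of exactly 50% maps to 0 ELO, and the curve steepens toward the extremes, which is why a few percent more at 80% is worth far more ELO than the same few percent near 50%.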
_______________________________________________________________________________________________________
Diminishing returns - Depth vs ELO
Another way to measure diminishing returns, instead of using time, is by iteration depth. We do this for Stockfish 9 and an old-fashioned engine like ProDeo in self-play matches of Depth=x vs Depth=x+1.
Stockfish 9  Depth+1  ELO      Stockfish 9  Depth+1  ELO      ProDeo 2.2   Depth+1  ELO
Depth = 1    53.3%    +23      Depth = 16   59.4%    +66      Depth = 1    93.6%    +429
Depth = 2    52.9%    +20      Depth = 17   61.8%    +84      Depth = 2    88.8%    +340
Depth = 3    64.5%    +105     Depth = 18   56.2%    +43      Depth = 3    70.8%    +154
Depth = 4    70.4%    +150     Depth = 19   57.0%    +49      Depth = 4    80.7%    +244
Depth = 5    75.1%    +191     Depth = 20   54.2%    +29      Depth = 5    77.2%    +209
Depth = 6    76.8%    +207     Depth = 21   55.6%    +39      Depth = 6    74.4%    +187
Depth = 7    76.0%    +200     Depth = 22   58.0%    +56      Depth = 7    70.4%    +151
Depth = 8    81.8%    +254     Depth = 23   53.3%    +21      Depth = 8    69.4%    +142
Depth = 9    74.9%    +190     Depth = 24   54.0%    +28      Depth = 9    68.6%    +136
Depth = 10   72.7%    +170     Depth = 25   54.5%    +31      Depth = 10   68.4%    +134
Depth = 11   68.6%    +136     Depth = 26   54.3%    +29      Depth = 11   62.8%    +91
Depth = 12   67.0%    +124     Depth = 27   52.0%    +13      Depth = 12   64.3%    +102
Depth = 13   62.9%    +93      Depth = 28   54.5%    +31      Depth = 13   62.2%    +87
Depth = 14   59.9%    +70      Depth = 29   55.0%    +35      Depth = 14   60.8%    +76
Depth = 15   61.7%    +83      Depth = 30   50.0%    0        Depth = 15   63.8%    +98
Remarks: The numbers of the last 3 Stockfish matches (28 vs 29, 29 vs 30 and 30 vs 31) are statistically not very reliable, since only 100 games were played because of the time such long matches take. The same goes for the last ProDeo matches, with only 200 games. Nevertheless, the table gives some insight into what the future might hold.
______________________________________________________________________________________________________
Scaling results chess engines
How well do chess engines scale? When they are given (for instance) double or triple the time, do they lose or gain ELO in comparison with other engines? Below are the results from the CEGT rating list, comparing the 40/4 and 40/20 time controls and the gain or loss in ELO. The minimum number of games played is 300.
Engine         ELO 40/4  ELO 40/20  Scaling
Stockfish 9    3475      3480       +5
Houdini 6      3457      3440       -17
Komodo 11      3404      3398       -6
Shredder 13    3249      3218       -31

Engine         ELO 40/4  ELO 40/20  Scaling
Alfil 15.4     2599      2721       +122
Zappa Mexico   2507      2617       +110
Vajolet2 1.28  2655      2712       +62
Arasan 15.3    2517      2574       +57
______________________________________________________________________________________________________
How many games ?
How many games are needed to prove a change? That's the question. Below are the results of a simulation that runs random matches to find the equilibrium: the number of games at which 30 matches in a row end with a percentage between 49.9% and 50.1%. That number seems to be 100,000 games. See the examples [ one ] [ two ] and [ three ]
As one can see from example one, playing 5000 games between 2 identical engines may sometimes (emphasis added) produce totally unreliable results; see the 51.1% result vs the 48.9% result. In the worst case you might accept a change that in reality is a regression, because of the convincing 51.1% score (indicating +7 ELO) with a LOS of 96.6%, and never look back, so the regression remains forever.
Not funny.
On the other hand, the simulation is based on a random outcome of 1-0, ½-½ or 0-1, each with an equal probability of 1/3 or 33.33%. While that is true for engines playing random moves, it isn't true in the real world. The stronger the engines, the higher the draw rate: not 33.33% but a lot higher, and 1-0 and 0-1 outcomes become rarer and rarer. So the numbers have to be taken as an indication. For a deeper investigation see the chapter What time control.
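The kind of simulation described in this chapter can be sketched in a few lines. This is not the original simulation program, only an illustration under the same stated assumption: every game between the two identical engines is a win, draw or loss with probability 1/3 each. The function names and the fixed seed are choices made for this example.

```python
import random


def simulate_match(games: int, rng: random.Random) -> float:
    """Score fraction of engine A over `games` games.

    Assumption taken from the text: 1-0, draw and 0-1 are equally likely.
    """
    points = sum(rng.choices((1.0, 0.5, 0.0), k=games))
    return points / games


def score_spread(games: int, matches: int = 30, seed: int = 1):
    """Lowest and highest score fraction seen over `matches` random matches."""
    rng = random.Random(seed)
    scores = [simulate_match(games, rng) for _ in range(matches)]
    return min(scores), max(scores)


# Spread of 30 matches between identical engines at different match lengths.
for games in (1000, 5000, 100_000):
    low, high = score_spread(games)
    print(f"{games:>7} games: {low:.1%} .. {high:.1%}")
```

To model stronger engines, the equal weights could be replaced by a higher draw probability via the weights argument of rng.choices; a higher draw rate narrows the spread for the same number of games.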
_____________________________________________________________________________________________________
Experimenting with MEA (Multiple Move EPD Analyzer), an alternative testing method.
MEA analyzes EPD files that have multiple solution moves with points, such as the Strategic Test Suite [STS], which contains 1500 theme-based positions. First we test reliability: does quickly analyzing 1500 positions at 1 second per move produce reasonable results? For this purpose we use 4 engines (Stockfish, Komodo, Arasan and Andscacs) that have been updated frequently over the years, and check whether their progress is in line with their progress on the CCRL 40/4 rating list.
Stockfish  ELO   Score     Komodo  ELO   Score     Arasan  ELO   Score     Andscacs  ELO   Score
4          3180  12.678    4       3127  12.167    13.1    2575  9.927     0.70      2827  10.719
5          3243  12.883    5       3164  12.297    14.1    2591  9.949     0.82      3030  11.338
6          3318  12.853    6       3185  12.260    15.1    2693  10.133    0.87      3131  11.913
7          3354  12.831    7       3205  12.613    17.3    2792  10.380    0.90      3176  12.210
8          3424  13.030    8       3236  12.699    19.2    2979  11.418    0.93      3208  12.317
9          3485  13.067    9       3338  13.027    20.4.1  3046  11.569    -         -     -
Except for the 3 red exceptions, the results are reasonably in sync with the CCRL 40/4 rating list, and perhaps increasing the time control from 1 second per move to (say) 5 seconds per move would clear the sky completely. For more detailed research click here.
_________________________________________________________________________________________________
What time control ?
Which time control to use to prove whether an engine change is an improvement is closely related to the subject above, how many games, which makes the question hard to answer.
Around 2011-2012 programmers became increasingly aware of the "how many games" problem and started to play 10,000, 15,000 or 20,000+ bullet games (10-15 seconds per game, or something similar) to prove the LOS, the certainty that a change is good. You can download a simple LOS calculator for a better understanding. Programmers in general accept an engine change if the LOS is 95% or preferably higher. With a LOS of 95% there is still a 5% chance that the engine change was not an improvement but a regression. On the other hand, in that rare 5% case the regression can't be large, which is a comfortable thought.
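For readers who want to see what a LOS calculation involves, here is a minimal sketch. It assumes the widely used normal approximation based only on wins and losses (draws drop out); whether the downloadable calculator uses exactly this formula is not stated here. The function name los and the example counts are hypothetical.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from win and loss counts.

    Assumption: the common normal approximation
    LOS = 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))).
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


# Hypothetical results: the same win/loss ratio, with more games each time.
for wins, losses in [(60, 40), (600, 400), (520, 480)]:
    print(f"+{wins} -{losses}: LOS = {los(wins, losses):.1%}")
```

Because only the decisive games enter the formula, a high draw rate means that many more games are needed before the win/loss difference becomes large enough to reach 95%.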
This major shift in engine testing has worked out quite well, see the ELO progression measured by year statistic.
After this long introduction, let's get to the point of this subject and have a look at the CEGT and CCRL top 5 engines in July 2018.
             CEGT 40/4      CEGT 40/20     CEGT 5+3 PB=ON
Engine       ELO   Games    ELO   Games    ELO   Games
Stockfish 9  3387  2950     3384  2000     3340  1400
Houdini 6    3368  4500     3354  1999     3321  1400
Komodo 11    3340  3186     3337  1410     3297  1400
Fire 7       3237  1800     3233  1200     3179  1400
Shredder 13  3151  6950     3159  4676     3104  1400

             CCRL 40/4      CCRL 40/40
Engine       ELO   Games    ELO   Games
Stockfish 9  3491  5084     3381  1682
Houdini 6    3456  5584     3344  1760
Komodo 11    3429  510      3335  856
Fire 7       3346  1387     3261  950
Shredder 13  3271  4616     3201  2933
Apart from (or maybe better: because of) the extreme consistency, in that the order of the 5 best programs matches on all 5 lists and even their ELO differences are reasonably in sync, what is the interesting observation to make?
[A] That, especially looking at the CEGT 5+3 PB=ON list, you don't need to play 15,000-20,000 games. Testing at longer time controls is likely (emphasis added, hence the wording observation) a good alternative, with the added advantage of much better control over how an engine scales, which is always a factor of uncertainty when playing bullet games.