Experiments | Home of the Dutch Rebel

Short articles about various aspects of computer chess

Stockfish noteworthiness

The ERL page (an attempt to test the quality of the evaluation) shows that ProDeo does not do badly at all, even against SF8, and it awakened my curiosity about how it would perform when the depth is increased to 2 plies, 3 plies, etc. The result was amazing, see the table.

Match	ProDeo %	Match	ProDeo %
D=1	44.8%	D=7	83.0%
D=2	67.7%	D=8	73.0%
D=3	90.1%	D=9	58.2%
D=4	92.2%	D=10	44.2%
D=5	91.1%	D=11	35.6%
D=6	87.1%	D=12	30.0%

Stockfish 8's drop in score is dramatic once it enters the main search starting at D=2. It shows how extremely selective the search is, but then, as if by magic, the tables are turned starting at D=10 and the rest is known history. Perhaps it makes sense for SF to be a bit more conservative in the early plies for reasons of better move ordering?

Remark - Repeating the test with Stockfish 9 shows even more selectivity: the turning point is not at depth=9 but at 10, and even the root search [D=1] goes up for ProDeo to 48.5%, which indicates that SF9 is perhaps even selective in the root move procedure.

Topics

Stockfish noteworthiness

Diminishing returns for Stockfish 9

Stockfish 9 Depth vs ELO

Scaling results chess engines

How many games are needed to prove a change?

ELO progression measured by year

Core Scheduling. A closer look.

Experimenting with MEA an alternative testing method.

What time control ?

The white advantage of the first move.

Educational fun with knight-odds matches.

_____________________________________________________________________

Diminishing returns for Stockfish 9

It's known that a doubling of hardware speed gives significant ELO. It's also known that the ELO gain gets lower and lower as the time control increases, the so-called diminishing returns principle (effect).

In the table below one can see the results of SF9 self-play matches in which the second SF9 engine gets twice as much time and, on fast time controls, even a factor of 4 or 8 more thinking time. By doing so we may get a glimpse of what the future may hold when hardware keeps improving.

Factor 2

Match	Games	Perc	ELO	Draws
40/15	1000	65.1%	109	55%
40/30	1000	63.0%	94	59%
40/1m	500	61.7%	83	61%
40/2m	500	59.4%	66	63%
40/5m	200	56.9%	46	65%
40/10m	100	57.5%	52	61%
40/20m	100	56.5%	45	67%
40/1h	100	TODO

Factor 4

Match	Games	Perc	ELO	Draws
40/15	1000	74.8%	189	44%
40/30	1000	72.8%	170	48%
40/1m	500	69.6%	144	53%
40/2m	250	65.4%	111	62%
40/5m	100	62.5%	90	59%
40/10m	100	64.0%	101	58%

Factor 8

Match	Games	Perc	ELO	Draws
40/15	1000	81.8%	253	34%
40/30	500	77.7%	215	41%
40/1m	250	73.0%	173	51%
40/2m	100	70.5%	152	53%

The good news is that doubling speed still gives an amazing amount at long time controls, even for top engines. What happened to the dark age of the late 90s and the early years of the millennium, when it was thought that engines at [D=14] would mainly produce draws? It seems the end is nowhere in sight.
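The ELO columns in the tables above are derived from the match percentage. As a rough illustration, the sketch below uses the common logistic conversion between score and rating difference. That the tables were produced with exactly this formula is an assumption; some published values differ by a few points, which suggests a slightly different conversion may have been used. The function names are made up for this example.

```python
import math


def elo_from_score(score: float) -> float:
    """Rating difference implied by a score fraction (logistic model, an assumption)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


def score_from_elo(diff: float) -> float:
    """Expected score fraction for a given rating difference (inverse of the above)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


# Percentages taken from the 40/15 rows of the three tables above.
for pct in (65.1, 74.8, 81.8):
    print(f"{pct:.1f}% -> {elo_from_score(pct / 100.0):+.0f} ELO")
```

Feeding in the percentages of one column from top to bottom makes the diminishing returns visible as a shrinking rating difference.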

___________________________________________________________________________________________________

Diminishing returns - Depth vs ELO

Another way to measure the diminishing returns, instead of using time, is to do it by iteration depth. We do it for Stockfish 9 and an old-fashioned engine like ProDeo in self-play matches of Depth=x vs Depth=x+1.

Stockfish 9	Depth+1	ELO	Stockfish 9	Depth+1	ELO	ProDeo 2.2	Depth+1	ELO
Depth = 1	53.3%	+23	Depth = 16	59.4%	+66	Depth = 1	93.6%	+429
Depth = 2	52.9%	+20	Depth = 17	61.8%	+84	Depth = 2	88.8%	+340
Depth = 3	64.5%	+105	Depth = 18	56.2%	+43	Depth = 3	70.8%	+154
Depth = 4	70.4%	+150	Depth = 19	57.0%	+49	Depth = 4	80.7%	+244
Depth = 5	75.1%	+191	Depth = 20	54.2%	+29	Depth = 5	77.2%	+209
Depth = 6	76.8%	+207	Depth = 21	55.6%	+39	Depth = 6	74.4%	+187
Depth = 7	76.0%	+200	Depth = 22	58.0%	+56	Depth = 7	70.4%	+151
Depth = 8	81.8%	+254	Depth = 23	53.3%	+21	Depth = 8	69.4%	+142
Depth = 9	74.9%	+190	Depth = 24	54.0%	+28	Depth = 9	68.6%	+136
Depth = 10	72.7%	+170	Depth = 25	54.5%	+31	Depth = 10	68.4%	+134
Depth = 11	68.6%	+136	Depth = 26	54.3%	+29	Depth = 11	62.8%	+91
Depth = 12	67.0%	+124	Depth = 27	52.0%	+13	Depth = 12	64.3%	+102
Depth = 13	62.9%	+93	Depth = 28	54.5%	+31	Depth = 13	62.2%	+87
Depth = 14	59.9%	+70	Depth = 29	55.0%	+35	Depth = 14	60.8%	+76
Depth = 15	61.7%	+83	Depth = 30	50.0%	0	Depth = 15	63.8%	+98

Remarks - The numbers of the last 3 Stockfish matches (28 vs 29, 29 vs 30 and 30 vs 31) are statistically not very reliable since only 100 games were played, because of the time such matches take. The same holds for ProDeo, whose last matches had only 200 games. Nevertheless the table gives some insight into what the future might hold.

_________________________________________________________________________________________________

Scaling results chess engines

How well do chess engines scale? When they are given (for instance) double or triple the time, do they lose or win ELO in comparison with other engines? Below are the results of the CEGT rating list comparing the 40/4 time control with the 40/20 time control, and the gain or loss in ELO. The minimum number of games played is 300.

Engine	ELO 40/4	ELO 40/20	Scaling
Stockfish 9	3475	3480	+5
Houdini 6	3457	3440	-17
Komodo 11	3404	3398	-6
Shredder 13	3249	3218	-31

Sorted list as on the CEGT rating list April 2018.

Full list

Engine	ELO 40/4	ELO 40/20	Scaling
Alfil 15.4	2599	2721	+122
Zappa Mexico	2507	2617	+110
Vajolet2 1.28	2655	2712	+62
Arasan 15.3	2517	2574	+57

Sorted list of engines based on scaling.

Full list

_________________________________________________________________________________________________

How many games ?

How many games are needed to prove a change? That's the question. Below are the results of a simulation that runs random matches to find the equilibrium, defined here as the point where 30 matches in a row end with a score between 49.9% and 50.1%. That number seems to be 100,000 games. See the examples [ one ] [ two ] and [ three ].

As one can see from example one, playing 5000 games between 2 identical engines may sometimes produce totally unreliable results, see the 51.1% result vs the 48.9% result. In the worst-case scenario you might accept a change that in reality is a regression because of the convincing 51.1% score (indicating +7 ELO) with a LOS of 96.6%, and never look back, hence the regression remains forever.

Not funny. Download the emulate tool to do your own experiments. Source code included.

On the other hand, the simulation is based on the random outcomes 1-0 | ½-½ | 0-1, with each outcome valued equally at 1/3 or 33.33%. While that is true for engines playing random moves, it isn't true in the real world. The stronger the engine, the higher the draw rate: not 33.33% but a lot higher, and the 1-0 | 0-1 outcomes will keep decreasing. And so the numbers have to be taken as an indication. For a deeper investigation see the chapter What time control.
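The kind of simulation described above can be sketched in a few lines. This is not the emulate tool itself, only a minimal illustration under the same assumption of two identical engines; the `draw_rate` parameter, the function names and the fixed seed are additions for this example, so that the effect of a more realistic draw rate can be tried as well.

```python
import random
from typing import Optional, Tuple


def simulate_match(games: int, draw_rate: float = 1 / 3,
                   rng: Optional[random.Random] = None) -> float:
    """Score fraction of engine A in a match between two equally strong engines."""
    rng = rng or random.Random()
    win_rate = (1.0 - draw_rate) / 2.0  # wins and losses share the remainder equally
    points = 0.0
    for _ in range(games):
        r = rng.random()
        if r < win_rate:
            points += 1.0   # 1-0
        elif r < win_rate + draw_rate:
            points += 0.5   # draw
        # otherwise 0-1, no points for engine A
    return points / games


def score_spread(matches: int, games: int, draw_rate: float = 1 / 3,
                 seed: int = 0) -> Tuple[float, float]:
    """Lowest and highest score fraction seen over a series of simulated matches."""
    rng = random.Random(seed)
    scores = [simulate_match(games, draw_rate, rng) for _ in range(matches)]
    return min(scores), max(scores)


for rate in (1 / 3, 0.6):
    low, high = score_spread(matches=30, games=5000, draw_rate=rate)
    print(f"draw rate {rate:.0%}: scores from {low:.1%} to {high:.1%}")
```

Raising `games` narrows the spread, and a higher draw rate narrows it further, which is why the 1/3 model should be read as an indication only.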

_________________________________________________________________________________________________

Experimenting with MEA (Multiple Move EPD Analyzer) an alternative testing method.

MEA analyzes EPD files that have multiple solution moves with points, such as the Strategic Test Suite [STS] containing 1500 theme-based positions. First we test whether quickly analyzing 1500 positions at 1 second per move produces reasonable results. For this purpose we use 4 engines (Stockfish, Komodo, Arasan and Andscacs) that were frequently updated through the years, and check whether their progress is in balance with their progress on the CCRL 40/4 rating list.

Stockfish	ELO	Score	Komodo	ELO	Score	Arasan	ELO	Score	Andscacs	ELO	Score
4	3180	12.678	4	3127	12.167	13.1	2575	9.927	0.70	2827	10.719
5	3243	12.883	5	3164	12.297	14.1	2591	9.949	0.82	3030	11.338
6	3318	12.853	6	3185	12.260	15.1	2693	10.133	0.87	3131	11.913
7	3354	12.831	7	3205	12.613	17.3	2792	10.380	0.90	3176	12.210
8	3424	13.030	8	3236	12.699	19.2	2979	11.418	0.93	3208	12.317
9	3485	13.067	9	3338	13.027	20.4.1	3046	11.569

Except for the 3 exceptions (marked in red in the original table), the results are reasonably in sync with the CCRL 40/4 rating list, and perhaps an increase of the time control from 1 second per move to (say) 5 seconds per move would clear the sky completely. For a more detailed research click here.
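The idea of awarding points per solution move can be illustrated with a small sketch. It is not MEA's implementation: the `move=points` comma-separated layout is assumed here (in the style of an EPD `c0` comment), and the two point lists and engine moves below are hypothetical, not taken from STS.

```python
from typing import Dict, List


def parse_points(comment: str) -> Dict[str, int]:
    """Parse an assumed 'move=points, move=points' string into a lookup table."""
    points: Dict[str, int] = {}
    for item in comment.split(","):
        # rpartition keeps promotion moves such as 'e8=Q' intact
        move, sep, value = item.strip().rpartition("=")
        if sep and move:
            points[move] = int(value)
    return points


def total_score(engine_moves: List[str], comments: List[str]) -> int:
    """Sum the points an engine earns; a move that is not listed scores 0."""
    return sum(parse_points(comment).get(move, 0)
               for move, comment in zip(engine_moves, comments))


# Hypothetical point lists for two positions and the moves an engine chose.
suite = ["f5=10, Be5+=2, Bf2=3, Bg4=2", "Nd5=10, Rc1=6"]
chosen = ["Bf2", "Nd5"]
print(total_score(chosen, suite))
```

Because second-best moves still earn partial credit, the total separates engines more finely than a plain solved/unsolved count would.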

_____________________________________________________________________________________________

What time control ?

What time control to use to prove whether an engine change is better or not is closely related to the above subject of how many games, which makes the question hard to answer.

In about 2011-2012 programmers increasingly became aware of the "how many games" problem and started to play 10,000-15,000-20,000+ bullet games (10-15 seconds for one game, or something similar) to establish the LOS (likelihood of superiority), the certainty that a change is good.

You can download a simple LOS calculator for a better understanding. Programmers in general accept an engine change if the LOS is 95% or preferably higher. With a LOS of 95% there is still a 5% chance that the engine change was not an improvement but a regression. On the other hand, in that rare 5% case the regression can't be much, which is a comfortable thought.

This major shift in engine testing has worked out quite well, see the ELO progression measured by year statistic.

After this long introduction let's move to the point of this subject and have a look at the CEGT and CCRL top-5 engines in July 2018.

CEGT 40/4	ELO	Games	CEGT 40/20	ELO	Games	CEGT 5+3 PB=ON	ELO	Games
Stockfish 9	3387	2950	Stockfish 9	3384	2000	Stockfish 9	3340	1400
Houdini 6	3368	4500	Houdini 6	3354	1999	Houdini 6	3321	1400
Komodo 11	3340	3186	Komodo 11	3337	1410	Komodo 11	3297	1400
Fire 7	3237	1800	Fire 7	3233	1200	Fire 7	3179	1400
Shredder 13	3151	6950	Shredder 13	3159	4676	Shredder 13	3104	1400

CCRL 40/4	ELO	Games	CCRL 40/40	ELO	Games
Stockfish 9	3491	5084	Stockfish 9	3381	1682
Houdini 6	3456	5584	Houdini 6	3344	1760
Komodo 11	3429	510	Komodo 11	3335	856
Fire 7	3346	1387	Fire 7	3261	950
Shredder 13	3271	4616	Shredder 13	3201	2933

Aside from (or maybe better, because of) the extreme consistency, in that on all 5 lists the order of the 5 best programs matches and even their elo differences are in reasonable sync, what is the interesting observation to make?

[A] - That, especially looking at the CEGT 5+3 PB=ON list, you don't need to play 15,000-20,000 games. Testing on longer time controls is likely (hence the wording "observation") a good alternative, with the added advantage of much better control over how an engine scales, which is always a factor of uncertainty when playing bullet games.

_____________________________________________________________________________________________

The white advantage of the first move

We all know that playing the white pieces gives us a small advantage, but how big is it in terms of elo? And is there a difference between the average rated player and the top grandmasters? And does the draw rate increase? And... is there a difference between humans and computers? Questions, questions... questions.

Yes, there are remarkable differences. From large databases we extracted the below statistics.

Database	Type	Games	White advantage	Draw rate
Megabase 2003	Human	2.211.673	53.7%	32.6%
MillionBase 2.9	Human	2.813.817	54.1%	32.2%
ELO-2500.pgn	Human	204.332	55.3%	52.4%
ELO-2600.pgn	Human	65.199	55.3%	52.8%
ELO-2700.pgn	Human	15.301	54.7%	52.5%
ELO-2800.pgn	Human	264 *	49.4%	49.6%

CCRL 40/4 40m in 4m	Computer	1.918.042	53.7%	29.4%
CCRL 40/4 [ elo 2800 ]	Computer	1.359.458	55.5%	43.6%
CCRL 40/4 [ elo 2900 ]	Computer	906.987	55.6%	45.8%
CCRL 40/4 [ elo 3000 ]	Computer	539.539	55.7%	48.2%
CCRL 40/4 [ elo 3100 ]	Computer	259.397	55.7%	49.9%
CCRL 40/4 [ elo 3200 ]	Computer	85.001	55.9%	53.9%
CCRL 40/4 [ elo 3300 ]	Computer	22.497	56.5%	64.6%
CCRL 40/4 [ elo 3400 ]	Computer	5.668	57.2%	66.7%

CCRL 40/40 40m in 40m **
CCRL 40/40 [ elo 3200 ]	Computer	28.478	56.0%	62.6%
CCRL 40/40 [ elo 3300 ]	Computer	7.995	55.7%	74.9%
CCRL 40/40 [ elo 3400 ]	Computer	203 *	54.2%	84.7%

TCEC super finals S10-S14 ***	Computer	402 *	52.2%	89.0%

* With only so few games the numbers are not representative.

** The time control in CCRL 40/40 is 10 times longer than in 40/4 games, hence better play when we compare with 40/4.

*** The TCEC super finals are played on massive hardware and long time control.
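The two percentage columns can be reproduced from a list of game results. The sketch below assumes that "White advantage" means White's score percentage (wins plus half of the draws, divided by the number of games) and that results use the PGN result strings; how ProTools actually computes these figures is not stated here, so treat this as one plausible reading.

```python
from collections import Counter
from typing import Iterable, Tuple


def white_stats(results: Iterable[str]) -> Tuple[float, float]:
    """Return (white score fraction, draw fraction) for PGN result strings."""
    counts = Counter(results)
    white, black, draws = counts["1-0"], counts["0-1"], counts["1/2-1/2"]
    games = white + black + draws  # unfinished games ('*') are ignored
    if games == 0:
        raise ValueError("no finished games")
    return (white + 0.5 * draws) / games, draws / games


# Hypothetical mini database of four games.
score, draw_rate = white_stats(["1-0", "1-0", "1/2-1/2", "0-1"])
print(f"White advantage {score:.1%}, draw rate {draw_rate:.1%}")
```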

Observations

1. Despite the increasing draw rate (especially in the computer part) the advantage of the first move remains significant for both top humans and top computers.

2. In general the computers utilize the advantage of the first move somewhat better than humans.

The statistics were made with ProTools and ProDeo 2.9; an overview can be seen here.

____________________________________________________________________________________________

Educational fun with knight-odds matches

This is a spin-off experiment of the Stockfish Handicap Matches that shows the strength of Stockfish 11 in quite a different way. We play 200-game matches with Stockfish 11 playing the white pieces without a knight on b1 or g1 against a bunch of other engines, in order to see how decisive the other engines are: do they beat Stockfish, or lose anyway?

 # ENGINE            : RATING  POINTS PLAYED  (%)
 1 Stockfish 11    > : 3764.7   200.0   200  100.0%
 2 Komodo 14         : 3569.0   198.5   200   99.3%
 3 Houdini 6.03      : 3337.7   194.5   200   97.3%
 4 Ethereal 12.25    : 3307.5   193.5   200   96.8%
 5 rofChade 2.3      : 3238.3   190.5   200   95.3%
 6 Fire 7.1          : 3228.8   190.0   200   95.0%
 7 Xiphos 0.6        : 3219.8   189.5   200   94.8%
 8 Andscacs 0.95     : 3153.1   185.0   200   92.5%
 9 Booot 6.4         : 3135.0   183.5   200   91.8%
10 RubiChess 1.7.3   : 3129.3   183.0   200   91.5%
11 Laser 1.7         : 3118.3   182.0   200   91.0%
12 Schooner 2.2      : 3093.1   179.5   200   89.8%
13 Demolito          : 3057.9   175.5   200   87.8%
14 Wasp 4.00         : 3049.9   174.5   200   87.3%
15 Senpai 2          : 3013.4   169.5   200   84.8%
16 Defenchess 2.2    : 3010.0   169.0   200   84.5%
17 ice 4.0           : 3006.7   168.5   200   84.3%
18 Texel 1.7         : 2997.0   167.0   200   83.5%
19 Arasan 22         : 2987.6   165.5   200   82.8%
20 Vajolet 2.8       : 2958.5   160.5   200   80.3%
21 Shredder 13       : 2947.7   158.5   200   79.3%
22 cheng4 4.39       : 2861.3   140.0   200   70.0%
23 Weiss 1.0         : 2851.0   137.5   200   68.8%
24 Bobcat 8          : 2840.9   135.0   200   67.5%
25 Crafty 25.6       : 2782.1   119.5   200   59.8%
26 Benjamin          : 2739.2   107.5   200   53.8%
27 ProDeo            : 2730.4   105.0   200   52.5%
28 SF11              : 2712.8  1264.0  6000   21.1%
29 Fruit 2.3         : 2658.1    84.5   200   42.3%
30 Fruit 2.1         : 2588.7    66.0   200   33.0%
31 Ruffian 2         : 2576.7    63.0   200   31.5%

SF11 (Stockfish 11) is the engine that plays a knight down and, as the ranking list shows, the turning point lies around 2700 elo.

The data suggests that most GMs would lose against Stockfish even when it plays a knight down.

The 6000 games

Technical

Material odds openings can be created with a tool as mentioned elsewhere.
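For a quick experiment without that tool, a knight-odds starting position can also be produced by editing the standard starting FEN. The sketch below is an independent illustration, not the tool referred to above; the function name is made up, and it only handles pieces whose removal does not affect castling rights (a knight, for example).

```python
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


def remove_piece(fen: str, square: str) -> str:
    """Return the FEN with the given square (e.g. 'b1') emptied.

    Castling rights are left untouched, so do not use this for rooks or kings.
    """
    fields = fen.split()
    ranks = fields[0].split("/")
    file_idx = "abcdefgh".index(square[0])
    rank_idx = 8 - int(square[1])  # FEN lists rank 8 first

    # Expand digit runs to one character per square, then clear the square.
    cells = list("".join("." * int(c) if c.isdigit() else c
                         for c in ranks[rank_idx]))
    cells[file_idx] = "."

    # Compress runs of empty squares back into digits.
    packed, run = [], 0
    for cell in cells:
        if cell == ".":
            run += 1
            continue
        if run:
            packed.append(str(run))
            run = 0
        packed.append(cell)
    if run:
        packed.append(str(run))

    ranks[rank_idx] = "".join(packed)
    fields[0] = "/".join(ranks)
    return " ".join(fields)


for knight_square in ("b1", "g1"):
    print(remove_piece(START_FEN, knight_square))
```

The two resulting FEN strings can be saved as start positions for a match in which White plays a knight down.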