Experiments

                         Short articles about various aspects of computer chess

Stockfish noteworthiness 


The ERL page (an attempt to test the quality of the evaluation) shows that ProDeo does not do badly at all, even against SF8, and it awakened my curiosity how it would perform when increasing the depth to 2 plies, 3 plies, etc. The result was amazing, see the table.

Match   ProDeo %      Match   ProDeo %
D=1     44.8%         D=7     83.0%
D=2     67.7%         D=8     73.0%
D=3     90.1%         D=9     58.2%
D=4     92.2%         D=10    44.2%
D=5     91.1%         D=11    35.6%
D=6     87.1%         D=12    30.0%

Stockfish 8's drop in score is dramatic once it enters the main search starting at D=2. It shows how extremely selective the search is, but then, as if by magic, the tables are turned starting at D=10 and the rest is known history. Perhaps (emphasis added) it makes sense for SF to be a bit more conservative in the early plies for the sake of better move ordering?

Remark - Repeating the test with Stockfish 9 shows even more selectivity: the turning point is not at depth 9 but at depth 10, and even against the root search [D=1] ProDeo's score goes up to 48.5%, which indicates that SF9 is perhaps selective even in the root move procedure.

Topics


Stockfish noteworthiness


Diminishing returns for Stockfish 9


Stockfish 9 Depth vs ELO


Scaling results chess engines


How many games are needed to prove a change?


ELO progression measured by year


Core Scheduling. A closer look.


Experimenting with MEA, an alternative testing method.


What time control?


The white advantage of the first move.


Educational fun with knight-odds matches.

_____________________________________________________________________


Diminishing returns for Stockfish 9


It's known that a doubling in hardware speed gives a significant ELO gain; it's also known that this gain gets lower and lower as the time control increases, the so-called diminishing returns principle (effect).


In the table below one can see the effects of SF9 self-play matches when we give the second SF9 engine twice as much thinking time, and on fast time controls even a factor of 4 or 8 more. By doing so we may get a glimpse of what the future may hold as hardware keeps improving.

Factor 2
Match    Games  Perc   ELO  Draws
40/15    1000   65.1%  109  55%
40/30    1000   63.0%   94  59%
40/1m     500   61.7%   83  61%
40/2m     500   59.4%   66  63%
40/5m     200   56.9%   46  65%
40/10m    100   57.5%   52  61%
40/20m    100   56.5%   45  67%
40/1h     100   TODO

Factor 4
Match    Games  Perc   ELO  Draws
40/15    1000   74.8%  189  44%
40/30    1000   72.8%  170  48%
40/1m     500   69.6%  144  53%
40/2m     250   65.4%  111  62%
40/5m     100   62.5%   90  59%
40/10m    100   64.0%  101  58%

Factor 8
Match    Games  Perc   ELO  Draws
40/15    1000   81.8%  253  34%
40/30     500   77.7%  215  41%
40/1m     250   73.0%  173  51%
40/2m     100   70.5%  152  53%

The good news is that doubling the speed still gains amazingly much at long time controls, even for top engines. What happened to the dark age of the late 90s and the early years of the millennium, when it was considered that engines at [D=14] would mainly produce draws? It seems the end is nowhere in sight.
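The Perc and ELO columns are related through the usual logistic rating formula. A minimal sketch (Python) of that conversion; it agrees with the table within a point or two (small differences presumably come from rounding or a slightly different conversion table):

import math

def elo_from_score(score: float) -> float:
    """Convert a match score fraction (0 < score < 1) into an ELO
    difference using the standard logistic rating model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# The 40/1m factor-2 match scored 61.7%:
print(round(elo_from_score(0.617)))  # -> 83, matching the table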


___________________________________________________________________________________________________


Diminishing returns - Depth vs ELO


Another way to measure diminishing returns is by iteration depth instead of time. We do this for Stockfish 9 and an old-fashioned engine like ProDeo, in self-play matches of Depth=x vs Depth=x+1.

Stockfish 9
Match       Depth+1 score  ELO
Depth = 1   53.3%          +23
Depth = 2   52.9%          +20
Depth = 3   64.5%          +105
Depth = 4   70.4%          +150
Depth = 5   75.1%          +191
Depth = 6   76.8%          +207
Depth = 7   76.0%          +200
Depth = 8   81.8%          +254
Depth = 9   74.9%          +190
Depth = 10  72.7%          +170
Depth = 11  68.6%          +136
Depth = 12  67.0%          +124
Depth = 13  62.9%          +93
Depth = 14  59.9%          +70
Depth = 15  61.7%          +83
Depth = 16  59.4%          +66
Depth = 17  61.8%          +84
Depth = 18  56.2%          +43
Depth = 19  57.0%          +49
Depth = 20  54.2%          +29
Depth = 21  55.6%          +39
Depth = 22  58.0%          +56
Depth = 23  53.3%          +21
Depth = 24  54.0%          +28
Depth = 25  54.5%          +31
Depth = 26  54.3%          +29
Depth = 27  52.0%          +13
Depth = 28  54.5%          +31
Depth = 29  55.0%          +35
Depth = 30  50.0%          0

ProDeo 2.2
Match       Depth+1 score  ELO
Depth = 1   93.6%          +429
Depth = 2   88.8%          +340
Depth = 3   70.8%          +154
Depth = 4   80.7%          +244
Depth = 5   77.2%          +209
Depth = 6   74.4%          +187
Depth = 7   70.4%          +151
Depth = 8   69.4%          +142
Depth = 9   68.6%          +136
Depth = 10  68.4%          +134
Depth = 11  62.8%          +91
Depth = 12  64.3%          +102
Depth = 13  62.2%          +87
Depth = 14  60.8%          +76
Depth = 15  63.8%          +98

Remarks - The numbers of the last 3 Stockfish matches (28 vs 29, 29 vs 30 and 30 vs 31) are statistically not very reliable since only 100 games were played, due to the time such matches take; the same goes for ProDeo, whose last matches had only 200 games. Nevertheless the table gives some insight into what the future might hold.
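For reference, fixed-depth matches like these can be driven over the UCI protocol with the "go depth" command. A minimal sketch (Python; the engine path is a placeholder, and this is a simplified one-shot dialogue; a real match driver keeps the engine running and waits for "bestmove" before sending the next command):

import subprocess

def bestmove_at_depth(engine_path: str, moves: str, depth: int) -> str:
    """Start a UCI engine, set up a position and search to a fixed depth."""
    proc = subprocess.Popen([engine_path], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    script = (
        "uci\n"
        "isready\n"
        f"position startpos moves {moves}\n"
        f"go depth {depth}\n"
    )
    out, _ = proc.communicate(script)  # closing stdin ends the session
    for line in out.splitlines():
        if line.startswith("bestmove"):
            return line.split()[1]
    return ""

# e.g. bestmove_at_depth("./stockfish", "e2e4 e7e5", 10)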

_________________________________________________________________________________________________


Scaling results chess engines


How well do chess engines scale? When they are given (for instance) double or triple the time, do they lose or gain ELO in comparison with other engines? Below are the results from the CEGT rating list comparing the 40/4 versus the 40/20 time control and the gain or loss in ELO. The minimum number of games played is 300.

Engine        ELO 40/4  ELO 40/20  Scaling
Stockfish 9   3475      3480       +5
Houdini 6     3457      3440       -17
Komodo 11     3404      3398       -6
Shredder 13   3249      3218       -31

Sorted list as on the CEGT rating list April 2018.


Full list

Engine          ELO 40/4  ELO 40/20  Scaling
Alfil 15.4      2599      2721       +122
Zappa Mexico    2507      2617       +110
Vajolet2 1.28   2655      2712       +62
Arasan 15.3     2517      2574       +57

Sorted list of engines based on scaling.


Full list

_________________________________________________________________________________________________


How many games?


How many games are needed to prove a change? That's the question. Below are the results of a simulation running random matches to find the equilibrium, here defined as 30 matches in a row ending between 49.9% and 50.1%, and the number seems to be 100,000 games. See the examples [ one ] [ two ] and [ three ].


As one can see from example one, playing 5000 games between 2 identical engines sometimes (emphasis added) may result in totally unreliable results, see the 51.1% result vs the 48.9% result. In the worst-case scenario you might accept a change that in reality is a regression because of the convincing 51.1% score (indicating +7 ELO) with a LOS of 96.6% and never look back, hence the regression remains forever.


Not funny. Download the emulate tool to do your own experiments. Source code included.


On the other hand, the simulation is based on the random outcomes 1-0 | ½-½ | 0-1, each valued equally at 1/3 or 33.33%. While that is true for engines playing random moves, it isn't true in the real world: the stronger the engine, the higher the draw rate, not 33.33% but a lot higher, while the 1-0 | 0-1 outcomes keep decreasing. So the numbers have to be taken as an indication. For a deeper investigation see the chapter What time control.
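A minimal sketch of such a simulation (Python): two identical engines modeled as random outcomes, with the draw rate as a parameter so the real-world effect described above can be explored too:

import random

def simulate_match(games: int, draw_rate: float = 1/3) -> float:
    """Match between two identical engines: every game is a random
    draw/win/loss, wins and losses equally likely. Returns the
    score percentage of the first engine."""
    points = 0.0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            points += 0.5                          # draw
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            points += 1.0                          # win
    return 100.0 * points / games

# Ten matches of 5000 games each; outliers like 51.1% do occur.
for i in range(10):
    print(f"match {i + 1}: {simulate_match(5000):.1f}%")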

_________________________________________________________________________________________________


Experimenting with MEA (Multiple Move EPD Analyzer), an alternative testing method.


MEA analyzes EPD files that have multiple solution moves with points, such as the Strategic Test Suite [STS] containing 1500 theme-based positions. First we test the reliability: will quickly analyzing 1500 positions at 1 second per move produce reasonable results? For this purpose we use 4 engines (Stockfish, Komodo, Arasan and Andscacs) that were frequently updated through the years, and check whether their progress is in balance with their progress on the CCRL 40/4 rating list.

Stockfish   ELO    Score
4           3180   12,678
5           3243   12,883
6           3318   12,853
7           3354   12,831
8           3424   13,030
9           3485   13,067

Komodo      ELO    Score
4           3127   12,167
5           3164   12,297
6           3185   12,260
7           3205   12,613
8           3236   12,699
9           3338   13,027

Arasan      ELO    Score
13.1        2575    9,927
14.1        2591    9,949
15.1        2693   10,133
17.3        2792   10,380
19.2        2979   11,418
20.4.1      3046   11,569

Andscacs    ELO    Score
0.70        2827   10,719
0.82        3030   11,338
0.87        3131   11,913
0.90        3176   12,210
0.93        3208   12,317

Except for 3 exceptions (entries where the score drops while the rating rises) the results are reasonably in sync with the CCRL 40/4 rating list, and perhaps an increase of the time control from 1 second per move to (say) 5 seconds per move would clear the sky completely. For a more detailed research click here.
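As an illustration of how multi-move EPD scoring works: STS-style records carry a point schedule for several moves in a comment operation. A minimal sketch (Python), assuming a c0 field of the form c0 "move=points, ..."; the record below is shortened and its FEN elided:

def sts_points(epd_record: str, engine_move: str) -> int:
    """Return the points an engine move earns against an EPD record
    whose c0 field lists 'move=points' pairs."""
    if 'c0 "' not in epd_record:
        return 0
    payload = epd_record.split('c0 "', 1)[1].split('"', 1)[0]
    for pair in payload.split(","):
        move, _, pts = pair.strip().partition("=")
        if move == engine_move:
            return int(pts)
    return 0  # move not in the schedule earns nothing

record = '<FEN elided> bm Nf5; c0 "Nf5=10, Nd5=7, h4=4"; id "STS example";'
print(sts_points(record, "Nd5"))  # -> 7, a sub-optimal but rewarded move

With 1500 positions at a maximum of 10 points each, a perfect run would score 15,000 points, which is why the totals in the table above top out around 13,000.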

_____________________________________________________________________________________________


What time control?


What time control to use to prove whether an engine change is better is closely related to the above subject (how many games), which makes the question hard to answer.


Around 2011-2012 programmers increasingly became aware of the "how many games" problem and started to play 10,000-15,000-20,000+ bullet games (10-15 seconds for one game, or something similar) to establish the LOS, the certainty that a change is good.


You can download a simple LOS calculator for a better understanding; programmers in general accept an engine change if the LOS is 95%, and preferably higher. With a LOS of 95% there is still a 5% chance the engine change was not an improvement but a regression. On the other hand, in that rare 5% case the regression can't be large, which is a comforting thought.
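A minimal sketch of the usual LOS approximation (Python; draws drop out, only the win/loss counts matter, and the example numbers are made up):

import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority: the probability that the observed
    win/loss surplus is not just luck (normal approximation)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) /
                                 math.sqrt(2.0 * (wins + losses))))

# Hypothetical result: 1230 wins, 1140 losses, draws ignored
print(f"LOS = {los(1230, 1140):.1%}")  # -> about 96.8%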


This major shift in engine testing has worked out quite well, see the ELO progression measured by year statistic.


After this long introduction let's move to the point of this subject and have a look at the CEGT and CCRL top-5 engines in July 2018.

CEGT 40/4
Engine        ELO    Games
Stockfish 9   3387   2950
Houdini 6     3368   4500
Komodo 11     3340   3186
Fire 7        3237   1800
Shredder 13   3151   6950

CEGT 40/20
Engine        ELO    Games
Stockfish 9   3384   2000
Houdini 6     3354   1999
Komodo 11     3337   1410
Fire 7        3233   1200
Shredder 13   3159   4676

CEGT 5+3 PB=ON
Engine        ELO    Games
Stockfish 9   3340   1400
Houdini 6     3321   1400
Komodo 11     3297   1400
Fire 7        3179   1400
Shredder 13   3104   1400

CCRL 40/4
Engine        ELO    Games
Stockfish 9   3491   5084
Houdini 6     3456   5584
Komodo 11     3429   510
Fire 7        3346   1387
Shredder 13   3271   4616

CCRL 40/40
Engine        ELO    Games
Stockfish 9   3381   1682
Houdini 6     3344   1760
Komodo 11     3335   856
Fire 7        3261   950
Shredder 13   3201   2933

Aside from (or maybe better, because of) the extreme consistency, in that on all 5 lists the order of the 5 best programs matches and even their ELO differences are in reasonable sync, what is the interesting observation to make?


[A] - That, especially looking at the CEGT 5+3 PB=ON list, you don't need to play 15,000-20,000 games. Testing at longer time controls is likely (emphasis added, hence the wording observation) a good alternative, with the added advantage of much better control over how an engine scales, which is always a factor of uncertainty when playing bullet games.

_____________________________________________________________________________________________

The white advantage of the first move


We all know that playing the white pieces gives us a small advantage, but how big is it in terms of ELO? Is there a difference between the average rated player and the top grandmasters? Does the draw rate increase? And... is there a difference between humans and computers? Questions, questions... questions.


Yes, there are remarkable differences. From large databases we extracted the below statistics.

Database                   Type      Games      White advantage  Draw rate
-                          Human     2,211,673  53.7%            32.6%
-                          Human     2,813,817  54.1%            32.2%
ELO-2500.pgn               Human       204,332  55.3%            52.4%
ELO-2600.pgn               Human        65,199  55.3%            52.8%
ELO-2700.pgn               Human        15,301  54.7%            52.5%
ELO-2800.pgn               Human         264 *  49.4%            49.6%

CCRL 40/4 40m in 4m        Computer  1,918,042  53.7%            29.4%
CCRL 40/4 [ elo 2800 ]     Computer  1,359,458  55.5%            43.6%
CCRL 40/4 [ elo 2900 ]     Computer    906,987  55.6%            45.8%
CCRL 40/4 [ elo 3000 ]     Computer    539,539  55.7%            48.2%
CCRL 40/4 [ elo 3100 ]     Computer    259,397  55.7%            49.9%
CCRL 40/4 [ elo 3200 ]     Computer     85,001  55.9%            53.9%
CCRL 40/4 [ elo 3300 ]     Computer     22,497  56.5%            64.6%
CCRL 40/4 [ elo 3400 ]     Computer      5,668  57.2%            66.7%

CCRL 40/40 40m in 40m **
CCRL 40/40 [ elo 3200 ]    Computer     28,478  56.0%            62.6%
CCRL 40/40 [ elo 3300 ]    Computer      7,995  55.7%            74.9%
CCRL 40/40 [ elo 3400 ]    Computer       203 * 54.2%            84.7%

TCEC super finals S10-S14 ***  Computer  402 *  52.2%            89.0%

* With only so few games the numbers are not representative.


** The time control in CCRL 40/40 is 10 times longer than in 40/4 games, hence better play; we compare with 40/4.


*** The TCEC super finals are played on massive hardware and long time control.
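To answer the "how big in terms of ELO" question: the same logistic conversion used earlier translates these percentages into an ELO advantage for White. A minimal sketch (Python, ignoring draw-rate effects on the conversion):

import math

def white_advantage_elo(score: float) -> float:
    """Translate White's score fraction into an ELO advantage."""
    return -400.0 * math.log10(1.0 / score - 1.0)

print(round(white_advantage_elo(0.537)))  # human databases, 53.7% -> ~26 ELO
print(round(white_advantage_elo(0.572)))  # CCRL 40/4 elo 3400, 57.2% -> ~50 ELO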


Observations


1. Despite the increasing draw rate (especially on the computer side) the advantage of the first move remains significant for both top humans and top computers.


2. In general, computers utilize the advantage of the first move somewhat better than humans.



The statistics were made with ProTools and ProDeo 2.9, an overview can be seen here.


____________________________________________________________________________________________

Educational fun with knight-odds matches


This is a spin-off experiment of the Stockfish Handicap Matches that shows the strength of Stockfish 11 from quite a different angle. We play 200-game matches, Stockfish 11 playing the white pieces without a knight on b1 or g1, against a bunch of other engines, in order to see how decisively the other engines beat Stockfish, or lose anyway!

 # ENGINE            : RATING  POINTS PLAYED  (%)
 1 Stockfish 11    > : 3764.7   200.0   200  100.0%
 2 Komodo 14         : 3569.0   198.5   200   99.3%
 3 Houdini 6.03      : 3337.7   194.5   200   97.3%
 4 Ethereal 12.25    : 3307.5   193.5   200   96.8%
 5 rofChade 2.3      : 3238.3   190.5   200   95.3%
 6 Fire 7.1          : 3228.8   190.0   200   95.0%
 7 Xiphos 0.6        : 3219.8   189.5   200   94.8%
 8 Andscacs 0.95     : 3153.1   185.0   200   92.5%
 9 Booot 6.4         : 3135.0   183.5   200   91.8%
10 RubiChess 1.7.3   : 3129.3   183.0   200   91.5%
11 Laser 1.7         : 3118.3   182.0   200   91.0%
12 Schooner 2.2      : 3093.1   179.5   200   89.8%
13 Demolito          : 3057.9   175.5   200   87.8%
14 Wasp 4.00         : 3049.9   174.5   200   87.3%
15 Senpai 2          : 3013.4   169.5   200   84.8%
16 Defenchess 2.2    : 3010.0   169.0   200   84.5%
17 ice 4.0           : 3006.7   168.5   200   84.3%
18 Texel 1.7         : 2997.0   167.0   200   83.5%
19 Arasan 22         : 2987.6   165.5   200   82.8%
20 Vajolet 2.8       : 2958.5   160.5   200   80.3%
21 Shredder 13       : 2947.7   158.5   200   79.3%
22 cheng4 4.39       : 2861.3   140.0   200   70.0%
23 Weiss 1.0         : 2851.0   137.5   200   68.8%
24 Bobcat 8          : 2840.9   135.0   200   67.5%
25 Crafty 25.6       : 2782.1   119.5   200   59.8%
26 Benjamin          : 2739.2   107.5   200   53.8%
27 ProDeo            : 2730.4   105.0   200   52.5%
28 SF11              : 2712.8  1264.0  6000   21.1%
29 Fruit 2.3         : 2658.1    84.5   200   42.3%
30 Fruit 2.1         : 2588.7    66.0   200   33.0%
31 Ruffian 2         : 2576.7    63.0   200   31.5%

SF11 (Stockfish 11) is the engine that plays a knight down, and as the ranking list shows, the turning point lies around ~2700 ELO.


The data suggests most GMs would lose against Stockfish even with Stockfish playing a knight down.

The 6000 games

Technical


Material odds openings can be created with a tool, as mentioned elsewhere.
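For illustration, a knight-odds starting position is just the standard FEN with an empty square where the knight stood. A minimal sketch (Python; the helper is made up for this example, it is not the tool referred to above):

START_RANK_1 = "RNBQKBNR"  # white back rank, files a-h

def knight_odds_fen(file_letter: str = "b") -> str:
    """Return the starting FEN with the white knight removed
    from b1 or g1 (the vacated square becomes a '1')."""
    idx = "abcdefgh".index(file_letter)
    assert START_RANK_1[idx] == "N", "no knight on that file"
    rank1 = START_RANK_1[:idx] + "1" + START_RANK_1[idx + 1:]
    return f"rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/{rank1} w KQkq - 0 1"

print(knight_odds_fen("b"))  # rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/R1BQKBNR w KQkq - 0 1
print(knight_odds_fen("g"))  # rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKB1R w KQkq - 0 1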