Experiments

Short articles about various aspects of computer chess

Stockfish noteworthiness

 

The ERL page (an attempt to test the quality of the evaluation) shows that ProDeo does not do badly at all even against SF8, and it awakened my curiosity how it would perform when increasing the depth to 2 plies, 3 plies, etc. The result was amazing, see the table.

Match   ProDeo %        Match   ProDeo %
D=1     44.8%           D=7     83.0%
D=2     67.7%           D=8     73.0%
D=3     90.1%           D=9     58.2%
D=4     92.2%           D=10    44.2%
D=5     91.1%           D=11    35.6%
D=6     87.1%           D=12    30.0%

Stockfish 8's drop in score is dramatic once it enters the main search starting at D=2. It shows how extremely selective the search is, but then, as if by magic, the tables are turned starting at D=10 and the rest is known history. Perhaps (emphasis added) it makes sense for SF to be a bit more conservative in the early plies for the sake of better move ordering?

Remark - Repeating the test with Stockfish 9 shows even more selectivity: the turning point is not at depth=9 but at depth=10, and even in the root search [D=1] ProDeo's score goes up to 48.5%, which indicates that SF9 perhaps is selective even in the root move procedure.
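
For those who want to reproduce this kind of fixed-depth match, a minimal sketch is given below. It assumes the python-chess library and placeholder engine paths; the article does not say which tool actually ran the matches, and ProDeo normally speaks WinBoard, so it would need popen_xboard or a UCI adapter.

```python
# Minimal sketch of one fixed-depth game between two engines, assuming the
# python-chess library; the article does not say which tool ran the matches.
# Binary names are placeholders (ProDeo actually speaks WinBoard, so it would
# need chess.engine.SimpleEngine.popen_xboard or a UCI adapter).
import chess
import chess.engine

def play_fixed_depth_game(white_path, black_path, depth):
    """Play one game in which both engines search to the same fixed depth."""
    board = chess.Board()
    white = chess.engine.SimpleEngine.popen_uci(white_path)
    black = chess.engine.SimpleEngine.popen_uci(black_path)
    try:
        while not board.is_game_over(claim_draw=True):
            engine = white if board.turn == chess.WHITE else black
            result = engine.play(board, chess.engine.Limit(depth=depth))
            board.push(result.move)
        return board.result(claim_draw=True)
    finally:
        white.quit()
        black.quit()

if __name__ == "__main__":
    print(play_fixed_depth_game("./stockfish8", "./prodeo", depth=4))
```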

Topics

- Stockfish noteworthiness
- Diminishing returns for Stockfish 9
- Stockfish 9 Depth vs ELO
- Scaling results chess engines
- How many games are needed to prove a change?
- ELO progression measured by year
- Core Scheduling. A closer look.
- Experimenting with MEA, an alternative testing method.
- What time control?
- Book experiment.

_______________________________________________________________________________________

 

Diminishing returns for Stockfish 9

 

It's known that a doubling of hardware speed gives a significant ELO gain; it's also known that the gain becomes lower and lower as the time control increases, the so-called diminishing returns principle (effect).

 

In the table below one can see the effect in SF9 self-play matches when the second SF9 engine is given twice as much thinking time, and on fast time controls even a factor of 4 or 8 more. By doing so we may get some glimpse of what the future may hold as hardware keeps improving.

Factor 2                                Factor 4                                Factor 8

Match    Games   Perc    ELO   Draws    Match    Games   Perc    ELO   Draws    Match    Games   Perc    ELO   Draws
40/15     1000   65.1%   109   55%      40/15     1000   74.8%   189   44%      40/15     1000   81.8%   253   34%
40/30     1000   63.0%    94   59%      40/30     1000   72.8%   170   48%      40/30      500   77.7%   215   41%
40/1m      500   61.7%    83   61%      40/1m      500   69.6%   144   53%      40/1m      250   73.0%   173   51%
40/2m      500   59.4%    66   63%      40/2m      250   65.4%   111   62%      40/2m      100   70.5%   152   53%
40/5m      200   56.9%    46   65%      40/5m      100   62.5%    90   59%
40/10m     100   57.5%    52   61%      40/10m     100   64.0%   101   58%
40/20m     100   56.5%    45   67%
40/1h      100   TODO

The good news is that doubling the speed still gains amazingly much at long time controls, even for top engines. What happened to the dark age of the late 90s and the early years of the millennium, when it was believed that engines at [D=14] would mainly produce draws? It seems the end is nowhere in sight.
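
For reference, the relation between the Perc and ELO columns can be approximated with the usual logistic formula. The sketch below is my own reconstruction; the article's ELO figures come from its own calculator and deviate a little at very one-sided scores.

```python
# Approximate relation between a match score and the ELO difference (my own
# reconstruction of the Perc/ELO columns; the article used its own calculator
# and its numbers deviate slightly at very one-sided scores).
import math

def elo_from_score(score):
    """ELO difference implied by a score fraction (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_from_elo(elo):
    """Expected score fraction for a given ELO difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

if __name__ == "__main__":
    print(elo_from_score(0.617))  # about +83, the 40/1m Factor 2 entry
    print(elo_from_score(0.651))  # about +108, close to the 109 in the table
```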

 

_______________________________________________________________________________________________________

 

Diminishing returns - Depth vs ELO

 

Another way to measure diminishing returns, instead of using time, is to do it by iteration depth. We do it for Stockfish 9 and an old-fashioned engine like ProDeo in self-play matches of Depth=x vs Depth=x+1.

Stockfish 9   Depth+1   ELO     Stockfish 9   Depth+1   ELO     ProDeo 2.2    Depth+1   ELO
Depth = 1     53.3%     +23     Depth = 16    59.4%     +66     Depth = 1     93.6%     +429
Depth = 2     52.9%     +20     Depth = 17    61.8%     +84     Depth = 2     88.8%     +340
Depth = 3     64.5%     +105    Depth = 18    56.2%     +43     Depth = 3     70.8%     +154
Depth = 4     70.4%     +150    Depth = 19    57.0%     +49     Depth = 4     80.7%     +244
Depth = 5     75.1%     +191    Depth = 20    54.2%     +29     Depth = 5     77.2%     +209
Depth = 6     76.8%     +207    Depth = 21    55.6%     +39     Depth = 6     74.4%     +187
Depth = 7     76.0%     +200    Depth = 22    58.0%     +56     Depth = 7     70.4%     +151
Depth = 8     81.8%     +254    Depth = 23    53.3%     +21     Depth = 8     69.4%     +142
Depth = 9     74.9%     +190    Depth = 24    54.0%     +28     Depth = 9     68.6%     +136
Depth = 10    72.7%     +170    Depth = 25    54.5%     +31     Depth = 10    68.4%     +134
Depth = 11    68.6%     +136    Depth = 26    54.3%     +29     Depth = 11    62.8%     +91
Depth = 12    67.0%     +124    Depth = 27    52.0%     +13     Depth = 12    64.3%     +102
Depth = 13    62.9%     +93     Depth = 28    54.5%     +31     Depth = 13    62.2%     +87
Depth = 14    59.9%     +70     Depth = 29    55.0%     +35     Depth = 14    60.8%     +76
Depth = 15    61.7%     +83     Depth = 30    50.0%     0       Depth = 15    63.8%     +98

Remarks - The numbers of the last 3 Stockfish matches (28 vs 29, 29 vs 30 and 30 vs 31) are statistically not very reliable since only 100 games were played, due to the time such deep searches take; the same goes for ProDeo, where the last matches were only 200 games. Nevertheless the table gives some insight into what the future might hold.
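
To see why 100 or 200 games are statistically shaky, the sketch below computes a rough 95% confidence interval on a match score and the implied ELO from win/draw/loss counts. It is my own back-of-the-envelope addition with a made-up 100-game result, not a calculation from the article.

```python
# Rough 95% confidence interval for a match score and the implied ELO
# difference; a back-of-the-envelope addition of mine, not from the article.
import math

def elo(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_interval(wins, draws, losses, z=1.96):
    """Return (score, low, high) of an approximate 95% interval."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points per game).
    var = (wins * (1.0 - score) ** 2 +
           draws * (0.5 - score) ** 2 +
           losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(var / n)
    return score, score - margin, score + margin

if __name__ == "__main__":
    # Hypothetical 100-game result around 55%: the implied ELO is very fuzzy.
    s, lo, hi = score_interval(wins=30, draws=50, losses=20)
    print(f"score {s:.1%}, ELO {elo(s):+.0f} "
          f"(95% interval {elo(lo):+.0f} .. {elo(hi):+.0f})")
```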

______________________________________________________________________________________________________

 

Scaling results chess engines

 

How well do chess engines scale? When they are (for instance) given double or triple the time, do they lose or gain ELO in comparison with other engines? Below are the results from the CEGT rating list comparing the 40/4 versus the 40/20 time control and the gain or loss in ELO. The minimum number of games played is 300.

Engine        ELO 40/4   ELO 40/20   Scaling
Stockfish 9   3475       3480        +5
Houdini 6     3457       3440        -17
Komodo 11     3404       3398        -6
Shredder 13   3249       3218        -31

Sorted as on the CEGT rating list of April 2018.

 

Full list

Engine          ELO 40/4   ELO 40/20   Scaling
Alfil 15.4      2599       2721        +122
Zappa Mexico    2507       2617        +110
Vajolet2 1.28   2655       2712        +62
Arasan 15.3     2517       2574        +57

Sorted list of engines based on scaling.

 

Full list

______________________________________________________________________________________________________

 

How many games?

 

How many games are needed to prove a change? That's the question. Below are the results of a simulation running random matches to find out when an equilibrium is reached, defined here as 30 matches in a row ending between 49.9% and 50.1%, and the number seems to be 100,000 games. See the examples [ one ] [ two ] and [ three ].

 

As one can see from example one, playing 5000 games between 2 identical engines sometimes (emphasis added) may produce totally unreliable results; see the 51.1% result vs the 48.9% result. In the worst-case scenario you might accept a change that in reality is a regression because of the convincing 51.1% score (indicating +7 ELO) with a LOS of 96.6% and never look back, hence the regression remains forever.

 

Not funny.

 

On the other hand, the simulation is based on random outcomes of 1-0 | ½-½ | 0-1, each valued equally at 1/3 (33.33%). While that is true for engines playing random moves, it isn't true in the real world: the stronger the engines, the higher the draw rate, not 33.33% but a lot higher, while the number of 1-0 | 0-1 outcomes keeps decreasing. So the numbers have to be taken as an indication. For a deeper investigation see the chapter What time control.
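
A minimal version of the kind of simulation described above could look as follows. The 1/3 : 1/3 : 1/3 outcome model and the 5000-game sample size come from the text; the number of repeated matches and all names are my own choices.

```python
# Simulation of matches between two identical 'engines' where each game is a
# random 1-0, draw or 0-1 with probability 1/3 each, as described in the text;
# the number of repeated matches is my own choice.
import random

def random_match(games):
    """Score fraction of one match consisting of purely random results."""
    points = sum(random.choice((1.0, 0.5, 0.0)) for _ in range(games))
    return points / games

if __name__ == "__main__":
    results = [random_match(5000) for _ in range(50)]
    print(f"min {min(results):.1%}  max {max(results):.1%}")
    # Even at 5000 games per match the spread typically reaches 51% / 49%,
    # i.e. the roughly +/- 7 ELO trap described above.
```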

_____________________________________________________________________________________________________

 

Experimenting with MEA (Multiple Move EPD Analyzer), an alternative testing method.

 

MEA analyzes EPD files that have multiple solution moves with points awarded per move, such as the Strategic Test Suite [STS] containing 1500 theme-based positions. First we test the reliability: will quickly analyzing 1500 positions at 1 second per move produce reasonable results? For this purpose we use 4 engines (Stockfish, Komodo, Arasan and Andscacs) that were frequently updated through the years and check whether their progress is in balance with the progress on the CCRL 40/4 rating list.

Stockfish   ELO    Score     Komodo   ELO    Score     Arasan   ELO    Score     Andscacs   ELO    Score
4           3180   12.678    4        3127   12.167    13.1     2575    9.927    0.70       2827   10.719
5           3243   12.883    5        3164   12.297    14.1     2591    9.949    0.82       3030   11.338
6           3318   12.853    6        3185   12.260    15.1     2693   10.133    0.87       3131   11.913
7           3354   12.831    7        3205   12.613    17.3     2792   10.380    0.90       3176   12.210
8           3424   13.030    8        3236   12.699    19.2     2979   11.418    0.93       3208   12.317
9           3485   13.067    9        3338   13.027    20.4.1   3046   11.569

Except for the 3 exceptions (the scores that drop while the rating-list ELO rises) the results are reasonably in sync with the CCRL 40/4 rating list, and perhaps an increase of the time control from 1 second per move to (say) 5 seconds per move would clear the sky completely. For more detailed research click here.
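
To give an idea of how such a suite run works, below is a rough sketch that scores a single engine on an STS-like EPD file at 1 second per move. The exact EPD layout (alternative moves with their points in a c0 comment) and the paths are assumptions on my part; MEA itself is a separate tool with its own input handling.

```python
# Rough sketch of an MEA/STS-style run: each EPD position lists alternative
# moves with points and the engine earns the points of the move it plays at
# 1 second per move.  The c0 "move=points" layout and the paths are
# assumptions of mine; MEA itself handles its input format internally.
import chess
import chess.engine

def parse_points(c0_field):
    """Turn a c0 string like 'f5=10, Be5+=2' into {'f5': 10, 'Be5+': 2}."""
    points = {}
    for part in c0_field.split(","):
        san, _, value = part.strip().rpartition("=")
        if san:
            points[san] = int(value)
    return points

def score_suite(epd_path, engine_path, seconds=1.0):
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    total = 0
    try:
        with open(epd_path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                board, ops = chess.Board.from_epd(line)
                played = engine.play(board, chess.engine.Limit(time=seconds))
                san = board.san(played.move)
                total += parse_points(ops.get("c0", "")).get(san, 0)
    finally:
        engine.quit()
    return total

if __name__ == "__main__":
    print(score_suite("STS.epd", "./stockfish"))  # placeholder paths
```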

_________________________________________________________________________________________________

 

What time control?

 

What time control to use to prove whether an engine change is better or not is closely related to the above subject (how many games), which makes the question hard to answer.

 

In about 2011-2012 programmers increasingly became aware of the "how many games" problem and started to play 10,000-15,000-20,000+ bullet games (10-15 seconds for one game, or something similar) to establish the LOS (likelihood of superiority), the certainty that a change is good. You can download a simple LOS calculator for a better understanding; programmers in general accept an engine change if the LOS is 95% or, preferably, higher. With a LOS of 95% there is still a 5% chance the engine change was not an improvement but a regression. On the other hand, in such a rare 5% case the regression can't be large, which is a comforting thought.
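
The LOS figure itself is easy to compute from the win and loss counts (draws cancel out). The sketch below uses the formula commonly seen in engine testing; the example numbers are made up.

```python
# Likelihood of superiority (LOS) from a match result, using the formula
# commonly used in engine testing; draws cancel out.  Example numbers are
# made up for illustration.
import math

def los(wins, losses):
    """Probability that the first engine is genuinely the stronger one."""
    return 0.5 * (1.0 + math.erf((wins - losses) /
                                 math.sqrt(2.0 * (wins + losses))))

if __name__ == "__main__":
    print(f"LOS = {los(1100, 1000):.1%}")  # roughly 98.5% for this example
```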

 

This major shift in engine testing has worked out quite well; see the ELO progression measured by year statistic.

 

After this long introduction let's move to the point of this subject and have a look at the CEGT and CCRL top-5 engines in July 2018.

CEGT 40/4     ELO    Games     CEGT 40/20    ELO    Games     CEGT 5+3 PB=ON   ELO    Games
Stockfish 9   3387   2950      Stockfish 9   3384   2000      Stockfish 9      3340   1400
Houdini 6     3368   4500      Houdini 6     3354   1999      Houdini 6        3321   1400
Komodo 11     3340   3186      Komodo 11     3337   1410      Komodo 11        3297   1400
Fire 7        3237   1800      Fire 7        3233   1200      Fire 7           3179   1400
Shredder 13   3151   6950      Shredder 13   3159   4676      Shredder 13      3104   1400

CCRL 40/4     ELO    Games     CCRL 40/40    ELO    Games
Stockfish 9   3491   5084      Stockfish 9   3381   1682
Houdini 6     3456   5584      Houdini 6     3344   1760
Komodo 11     3429    510      Komodo 11     3335    856
Fire 7        3346   1387      Fire 7        3261    950
Shredder 13   3271   4616      Shredder 13   3201   2933

Aside from (or maybe better, because of) the extreme consistency that on all 5 lists the order of the 5 best programs matches, and even their ELO differences are in reasonable sync, what is the interesting observation to make?

 

[A] - That especially looking at the CEGT 5+3 PB=ON list you don't need to play 15,000-20,000 games. Testing at longer time controls likely (emphasis added, hence the wording observation) is a good alternative, with the added advantage of much better control over how an engine scales, which is always a factor of uncertainty when playing bullet games.

_____________________________________________________________________________________________

Book experiment

 

Since 2016 it's a well-established fact that the Polyglot Cerebellum opening book by Thomas Zipproth is the strongest available. With Brainfish (see link) 8 matches of 1000 games each were played against 8 other Polyglot books, and Cerebellum won all 8 matches convincingly.

 

From those 8000 games a new Polyglot book was created and the 1000-game match was repeated, in order to see what would happen now that the Cerebellum book more or less had to play against itself, but (likely) with better tuned weights. That indeed turned out to be the case: the new tiny book won with 51.3% despite the fact that it contains many holes.

 

To fix the latter, the ProDeo book was converted to Polyglot format and merged with the new tiny book, and the match was repeated, resulting in a convincing 536-464 victory (53.6%), approximately 25 ELO stronger.
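
For the curious, merging two Polyglot books can in principle be done at the file level, because a .bin book is just a sorted list of fixed-size 16-byte entries. The sketch below is my own illustration of that idea, with placeholder file names; it is not the procedure that was actually used to build Bookfish.

```python
# Rough sketch of merging two Polyglot opening books at the file level.
# A .bin book is a list of 16-byte big-endian entries (key, move, weight,
# learn) sorted by key.  This is my own illustration of the idea with
# placeholder file names, not the procedure actually used to build Bookfish;
# duplicate (key, move) entries are kept as-is instead of being combined.
import struct

ENTRY = struct.Struct(">QHHI")  # key, move, weight, learn

def read_book(path):
    with open(path, "rb") as f:
        data = f.read()
    return [ENTRY.unpack_from(data, i) for i in range(0, len(data), ENTRY.size)]

def merge_books(path_a, path_b, out_path):
    entries = sorted(read_book(path_a) + read_book(path_b), key=lambda e: e[0])
    with open(out_path, "wb") as f:
        for entry in entries:
            f.write(ENTRY.pack(*entry))

if __name__ == "__main__":
    merge_books("tinybook.bin", "prodeo.bin", "bookfish.bin")
```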

 

The book was given the name Bookfish since all credit goes to Brainfish with Cerebellum.

 

Bookfish can be downloaded from the Book page.

 

Remarks

1. It's not unlikely that this approach is a working principle for improving other existing books; Polyglot will be your friend.

2. It's unclear how Bookfish will perform against other books.

3. Obviously there is room for refinement; the above approach is just a broad-brush experiment.