Experiments

Short articles about various aspects of computer chess

Stockfish noteworthiness

 

The ERL page (an attempt to test the quality of the evaluation) shows that ProDeo does not do badly at all, even against SF8, and it made me curious how it would perform when the depth is increased to 2 plies, 3 plies, etc. The result was amazing, see the table.

Match   ProDeo %        Match   ProDeo %
D=1     44.8%           D=7     83.0%
D=2     67.7%           D=8     73.0%
D=3     90.1%           D=9     58.2%
D=4     92.2%           D=10    44.2%
D=5     91.1%           D=11    35.6%
D=6     87.1%           D=12    30.0%

Stockfish 8's drop in score is dramatic once it enters the main search starting at D=2. It shows how extremely selective the search is, but then, as if by magic, the tables are turned starting at D=10 and the rest is known history. Perhaps (emphasis added) it makes sense for SF to be a bit more conservative in the early plies for the sake of better move ordering?

Remark - Repeating the test with Stockfish 9 shows even more selectivity: the turning point is not at depth 9 but at 10, and even at the root search [D=1] ProDeo's score goes up to 48.5%, which indicates that SF9 is perhaps selective even in the root move procedure.
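For readers who want to repeat such a fixed-depth experiment, below is a minimal sketch using the python-chess library; the engine paths, depth and number of games are placeholders, and there is no opening book or adjudication as a real test setup would have.

```python
# Minimal fixed-depth match sketch using the python-chess library.
# The engine paths, DEPTH and GAMES are placeholders, not the actual
# test configuration used for the table above.
import chess
import chess.engine

DEPTH = 2    # fixed search depth for both engines
GAMES = 10   # number of games to play

def play_game(white, black, depth):
    """Play one game with both engines searching to a fixed depth."""
    board = chess.Board()
    while not board.is_game_over():
        engine = white if board.turn == chess.WHITE else black
        result = engine.play(board, chess.engine.Limit(depth=depth))
        board.push(result.move)
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

sf = chess.engine.SimpleEngine.popen_uci("./stockfish")  # placeholder path
pd = chess.engine.SimpleEngine.popen_uci("./prodeo")     # placeholder path

score = 0.0
for game in range(GAMES):
    white, black = (sf, pd) if game % 2 == 0 else (pd, sf)  # alternate colors
    result = play_game(white, black, DEPTH)
    if result == "1/2-1/2":
        score += 0.5
    elif (result == "1-0") == (white is pd):
        score += 1.0  # ProDeo won this game

print(f"ProDeo score at depth {DEPTH}: {100.0 * score / GAMES:.1f}%")
sf.quit()
pd.quit()
```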

Topics

Stockfish noteworthiness
Diminishing returns for Stockfish 9
Stockfish 9 Depth vs ELO
Scaling results chess engines
How many games are needed to prove a change?
ELO progression measured by year
Core Scheduling. A closer look.

_______________________________________________________________________________________

 

Diminishing returns for Stockfish 9

 

It's known that a doubling in hardware speed gives a significant ELO gain; it's also known that the gain gets lower and lower as the time control increases, the so-called diminishing returns principle.

 

In the table below one can see the effects of SF9 self-play matches when we give the second SF9 engine twice as much thinking time, and on fast time controls even a factor of 4 or 8 more. By doing so we may get a glimpse of what the future holds as hardware keeps improving.

Factor 2

Match    Games   Perc    ELO    Draws
40/15    1000    65.1%   109    55%
40/30    1000    63.0%    94    59%
40/1m     500    61.7%    83    61%
40/2m     500    59.4%    66    63%
40/5m     200    56.9%    46    65%
40/10m    100    57.5%    52    61%
40/20m    100    56.5%    45    67%
40/1h     100    TODO

Factor 4

Match    Games   Perc    ELO    Draws
40/15    1000    74.8%   189    44%
40/30    1000    72.8%   170    48%
40/1m     500    69.6%   144    53%
40/2m     250    65.4%   111    62%
40/5m     100    62.5%    90    59%
40/10m    100    64.0%   101    58%

Factor 8

Match    Games   Perc    ELO    Draws
40/15    1000    81.8%   253    34%
40/30     500    77.7%   215    41%
40/1m     250    73.0%   173    51%
40/2m     100    70.5%   152    53%

The good news is that doubling the speed still gives amazingly much at long time controls, even for top engines. What happened to the dark age of the late 90s and the early years of the millennium, when it was believed that engines at [D=14] would mainly produce draws? It seems the end is nowhere in sight.
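For reference, the ELO column follows from the match percentage through the standard logistic rating model; a minimal sketch of the conversion (assuming the usual formula, not necessarily the exact tool used for the table):

```python
# Convert a match score into the ELO difference it implies,
# using the standard logistic rating model.
import math

def elo_diff(score):
    """ELO difference implied by a score in the open interval (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Example: the Factor 2 match at 40/15 scored 65.1%.
print(round(elo_diff(0.651)))  # -> 108, matching the table's 109 up to rounding
```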

 

_______________________________________________________________________________________________________

 

Diminishing returns - Depth vs ELO

 

Another way to measure diminishing returns, instead of using time, is to do it by iteration depth. We do this for Stockfish 9 and an old-fashioned engine like ProDeo in self-play matches of Depth=x vs Depth=x+1.

Stockfish 9    Depth+1   ELO     Stockfish 9    Depth+1   ELO     ProDeo 2.2     Depth+1   ELO
Depth = 1      53.3%     +23     Depth = 16     59.4%     +66     Depth = 1      93.6%     +429
Depth = 2      52.9%     +20     Depth = 17     61.8%     +84     Depth = 2      88.8%     +340
Depth = 3      64.5%     +105    Depth = 18     56.2%     +43     Depth = 3      70.8%     +154
Depth = 4      70.4%     +150    Depth = 19     57.0%     +49     Depth = 4      80.7%     +244
Depth = 5      75.1%     +191    Depth = 20     54.2%     +29     Depth = 5      77.2%     +209
Depth = 6      76.8%     +207    Depth = 21     55.6%     +39     Depth = 6      74.4%     +187
Depth = 7      76.0%     +200    Depth = 22     58.0%     +56     Depth = 7      70.4%     +151
Depth = 8      81.8%     +254    Depth = 23     53.3%     +21     Depth = 8      69.4%     +142
Depth = 9      74.9%     +190    Depth = 24     54.0%     +28     Depth = 9      68.6%     +136
Depth = 10     72.7%     +170    Depth = 25     54.5%     +31     Depth = 10     68.4%     +134
Depth = 11     68.6%     +136    Depth = 26     54.3%     +29     Depth = 11     62.8%     +91
Depth = 12     67.0%     +124    Depth = 27     52.0%     +13     Depth = 12     64.3%     +102
Depth = 13     62.9%     +93     Depth = 28     54.5%     +31     Depth = 13     62.2%     +87
Depth = 14     59.9%     +70     Depth = 29     55.0%     +35     Depth = 14     60.8%     +76
Depth = 15     61.7%     +83     Depth = 30     50.0%     0       Depth = 15     63.8%     +98

Remarks - The numbers of the last 3 Stockfish matches (28 vs 29, 29 vs 30 and 30 vs 31) are statistically not very reliable, since only 100 games were played due to the time such deep searches take; the same goes for ProDeo, whose last matches had only 200 games. Nevertheless the table gives some insight into what the future might hold.
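To see why 100 games are statistically shaky, one can attach a rough 95% confidence interval to such a result; a minimal sketch below, treating each game as an independent trial and ignoring the variance reduction from draws (so the real interval is somewhat tighter):

```python
# Rough 95% confidence interval in ELO for a small match.
# Draws reduce the variance, so this overstates the error somewhat.
import math

def elo_diff(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(score, games):
    """95% confidence interval (in ELO) around a match score."""
    se = math.sqrt(score * (1.0 - score) / games)  # standard error of the score
    return elo_diff(score - 1.96 * se), elo_diff(score + 1.96 * se)

# A 55.0% score from 100 games, typical for the deepest matches above:
lo, hi = elo_interval(0.55, 100)
print(f"+{elo_diff(0.55):.0f} ELO, 95% interval {lo:+.0f} .. {hi:+.0f}")  # roughly -33 .. +106
```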

______________________________________________________________________________________________________

 

Scaling results chess engines

 

How well do chess engines scale? When they are given (for instance) double or triple the time, do they gain or lose ELO in comparison with other engines? Below are the results from the CEGT rating list, comparing the 40/4 and 40/20 time controls and the gain or loss in ELO. The minimum number of games played is 300.

Engine          ELO 40/4   ELO 40/20   Scaling
Alfil 15.4      2599       2721        +122
Zappa Mexico    2507       2617        +110
Vajolet2 1.28   2655       2712        +62
Arasan 15.3     2517       2574        +57

Sorted list of engines based on scaling.
Full list

Sorted list as on the CEGT rating list April 2018.
Full list
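A minimal sketch of how such a scaling column can be computed and sorted, assuming the scaling value is simply the 40/20 rating minus the 40/4 rating (which matches most rows above):

```python
# Sort engines by scaling, taken here as ELO(40/20) - ELO(40/4).
# The ratings are the sample rows from the table above.
ratings = {
    "Alfil 15.4":    (2599, 2721),
    "Zappa Mexico":  (2507, 2617),
    "Vajolet2 1.28": (2655, 2712),
    "Arasan 15.3":   (2517, 2574),
}

scaling = {name: slow - fast for name, (fast, slow) in ratings.items()}
for name, gain in sorted(scaling.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15} {gain:+d}")
```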

______________________________________________________________________________________________________

 

How many games?

 

How many games are needed to prove a change? That's the question. Below are the results of a simulation running random matches to find the equilibrium point, defined here as 30 matches in a row ending between 49.9% and 50.1%; that number seems to be 100,000 games. See the examples [ one ] [ two ] and [ three ].
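A minimal sketch of such a random-match simulation, assuming for simplicity that every game between the two identical engines is a pure 50/50 coin flip (the author's exact setup is not specified):

```python
# Simulate matches between two *identical* engines: each game is a
# 50/50 coin flip, so the true score is exactly 50% and any deviation
# from it is pure chance.
import random

def run_match(games):
    """Return the score of one simulated match as a fraction."""
    wins = sum(random.random() < 0.5 for _ in range(games))
    return wins / games

for i in range(10):
    print(f"match {i + 1}: {100.0 * run_match(5000):.1f}%")  # scatters around 50%
```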

 

As one can see from example one, playing 5000 games between 2 identical engines sometimes (emphasis added) produces totally unreliable results; see the 51.1% result vs the 48.9% result. In the worst-case scenario you might accept a change that in reality is a regression, because of the convincing 51.1% score (indicating +7 ELO) with a LOS of 96.6%, and never look back; hence the regression remains forever.
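The LOS (likelihood of superiority) number can be computed from the win and loss counts with the usual normal approximation, in which draws cancel out. A minimal sketch; the win/draw/loss split below is an assumption chosen to be consistent with the quoted 51.1% score, since only the percentage is given:

```python
# Likelihood of superiority (LOS) via the normal approximation:
# the probability that the observed win/loss surplus is not luck.
import math

def los(wins, losses):
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Assumed split for a 51.1% score out of 5000 games:
# 1870 + 1370/2 = 2555 points = 51.1%.
wins, draws, losses = 1870, 1370, 1760
print(f"score {100.0 * (wins + draws / 2) / 5000:.1f}%  LOS {100.0 * los(wins, losses):.1f}%")
# -> score 51.1%  LOS ~96.6%
```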

 

Not funny.

Engine        ELO 40/4   ELO 40/20   Scaling
Stockfish 9   3475       3480        +5
Houdini 6     3457       3440        -17
Komodo 11     3404       3398        -6
Shredder 13   3249       3218        -31