Blog

September 2015


And so, after an absence of about 2 years, the computer chess virus hit me again. I thought I was cured after the ProDeo 1.88 debacle: nice new features, I love that kind of stuff, but something went wrong with the playing strength or with the release of that version. Either way, to play it safe I restarted my work from version 1.86, and on this page you can follow the playing strength progress that eventually (one way or another) should lead to a new version, ProDeo 2.0 or REBEL 13, not sure about the latter.


So with this page you will get a glimpse into the mind of a chess programmer, and by reading this diary you will probably conclude that it's a scary place to be.


Kidding aside, the first step is nearly finished: the tuning of the main ingredients of the evaluation function. Not very surprisingly there is not much improvement to report, because it has been well tuned since its early existence. However, now that I have access to new hardware (a nice 16 thread workstation) tuning can be done with much more confidence. The improvements worth mentioning:

Diary


Results

Parameter 1.86                  | Change                                | Games  | Time Control | Result
[Right to Move = 0]             | [Right to Move = 100]           [ 1 ] | 12,000 | 40/15        | 51.7%
[Double Isolated Pawns = 100]   | [Double Isolated Pawns = 75]          | 12,000 | 40/15        | 51.0%
[King safety = 75]              | [King safety = 105]             [ 2 ] | 12,000 | 40/15        | 51.1%
[Bad Bishop = 100]              | [Bad Bishop = 75]                     | 12,000 | 40/15        | 50.3%
[Minimum Knight Mobility = 100] | [Minimum Knight Mobility = 50]        | 12,000 | 40/15        | 50.3%

  1. Note that the [Right to Move] parameter is a TEMPO penalty for the color to move, a pure evaluation ingredient. In practice, however, it has a great influence on the search: the higher the penalty you apply, the faster the search will run. That looks like great news, but unfortunately it isn't; too high a penalty will also produce unreliable evaluation scores, and eventually you pay for that as a regression in playing strength. In other words, the right value has to be chosen with great care. To establish that 100 is the right value, 60,000 (40/15) bullet games had to be played. And that value of 100 is only valid for this 40/15 time control; it's yet unclear how it will perform at longer time controls.

   

  2. The [King Safety] parameter increase looks odd at first glance, but the change is due to new code which has influenced the nature of the evaluation. It took 7 x 12,000 = 84,000 bullet games to arrive at the 105 value.


Note that in general a 1% increase in the result stands for an ELO improvement of about 7 points. Meaning, if the above 5 changes all work together in harmony (putting them together) it would give a 4.4% x 7 = 30 ELO increase. But then again, I know from 35 years of experience that this is unlikely due to interaction and overlap. When I am ready to test them I will be happy with 20 ELO.
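For the curious, the ~7 ELO per percent rule of thumb near a 50% score follows from the standard logistic rating model; a minimal sketch in C (names are mine, not from the engine):

```c
#include <math.h>
#include <stdio.h>

/* Convert a match score (0.517 = 51.7%) into an ELO difference using
   the standard logistic model. Near 50% the slope is about 7 ELO per
   percent, which is the rule of thumb used in this diary. */
double elo_from_score(double score)
{
    return -400.0 * log10(1.0 / score - 1.0);
}

int main(void)
{
    printf("%.1f\n", elo_from_score(0.510)); /* ~7.0  : one percent        */
    printf("%.1f\n", elo_from_score(0.544)); /* ~30.7 : the 5 changes combined */
    return 0;
}
```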


Meaning, the main improvement will have to come from search changes. Not exactly my main interest in computer chess, I never had a passion for it, but facts are facts: search (and speed) have been the dominant factors for progress in computer chess since the early days. I sometimes tend to call it a necessary evil. Okay, that came out too strongly.


Search

In the meantime I developed quite a number of new and promising search ideas.

Name                 | Meaning                                                                   | Provisional Results
Late Move Pruning    | Prune depth 1 & 2 moves based on move count. Gives about a 10% speed-up. | 12,000 games [40/15] 50.8%
Recapture extensions | Limited from max 2 to 1.                                                  | 4,000 games [40/60] 50.8%
Late Move Reductions | An almost complete rewrite. It's explained on a separate page.            | 1,000 games [40/60] 51.9%

However, too few games have been played to draw any conclusions. Furthermore, my (current) test philosophy regarding search changes demands that scaling should be part of the testing too, resulting in the following test procedure:


  1. 12,000 [40/15] games
  2.  6,000 [40/30] games
  3.  4,000 [40/60] games


Meaning that in total 22,000 games need to be played before a search change is promoted as approved. Moreover, there shouldn't be too much fluctuation between the 3 test runs.

__________________________________________________________________________________________________


September 19  -  Results LMR testing (16 threads)

Games  | Time control | Score | Depths
12,000 | 40/15        | 50.1% |  9.90 -  9.77 = +0.13
 6,000 | 40/30        | 51.7% | 10.98 - 10.79 = +0.19
 4,000 | 40/60        | 50.9% | 11.94 - 11.71 = +0.23

A lot of fluctuation between the 3 test runs; reminds me once again why I gave the thing the name REBEL. It has all the signs of being somewhat better, in the range of 5-10 elo points.


The Depths column is a statistic that measures the average middle game search depth of the 2 tested versions. The longer the time control, the deeper the new LMR searches, which is a good sign.


Next we are going to introduce triple LMR, reducing not only by 2 plies but even by 3, tricky business. You can see the changes for that on the LMR page; the changes are colored red.


Then we start the whole (boring) circus of playing 22,000 games again, which will take 2 days, and see what happens. In the meantime I have some Netflix series to catch up on.

__________________________________________________________________________________________________


September 19 - More on the [Right to Move] TEMPO penalty.


Contrary (I think) to most programs, the TEMPO penalty is not given in EVAL but in the move_do() part where the incremental update stuff (material, PST, double pawns) also resides. By doing so we are more flexible and can solve some chess knowledge in a cheap way. We apply a penalty by piece type. The initial TEMPO penalty table (for the middle game!) looks like this:

 WP  |  WN  |  WB  |  WR  |  WQ  |  WK  |  BP  |  BN  |  BB  |  BR  |  BQ  |  BK
0.08 | 0.08 | 0.08 | 0.08 | 0.10 | 0.10 | 0.08 | 0.08 | 0.08 | 0.08 | 0.10 | 0.10

As one can see, the penalties for Queen and King are somewhat higher, and because the penalty sits in move_do() it is applied again every time the Queen or King moves. It's a small discouragement against shuffling pieces with moves like Kh1 | Kh2 and back to g1 again.
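To make the mechanism concrete, here is a minimal sketch of the idea (the type names, the move encoding and move_do's shape are my assumptions, not ProDeo's actual code); penalties are stored in centipawns:

```c
/* Per-piece-type TEMPO penalty charged incrementally in move_do(),
   not in eval(). Types and names are illustrative assumptions. */
enum { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK, NPIECE };

static int tempo_mg[NPIECE] =       /* middle game table, centipawns */
    { 8, 8, 8, 8, 10, 10,           /* WP WN WB WR WQ WK */
      8, 8, 8, 8, 10, 10 };         /* BP BN BB BR BQ BK */

typedef int Move;
typedef struct {
    int board[64];                  /* piece type per square            */
    int incremental_score;          /* running material/PST/tempo total */
} Position;

static int move_from(Move m) { return m & 63; }

void move_do(Position *pos, Move m)
{
    int piece = pos->board[move_from(m)];

    /* ... incremental material / PST / double pawn updates ... */

    /* Because the charge lives here, a king shuffling Kg1-h1-g1
       pays the penalty on every single move. */
    pos->incremental_score -= tempo_mg[piece];
}
```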


Furthermore one can solve some other basic stuff in a cheap way in the pre-processor.


Loss of castling rights.

In the opening phase, when the white king has not castled, we increase the penalty for the white king from 0.10 to 0.25, and vice versa for black.


Avoid early Queen play.

In the opening, based on the move number, we increase the penalty for the WQ and BQ. We also turn off the normal pin bonus her majesty gets for pinning an opponent's piece. That will teach her to behave and wait for her moment to come.


In the endgame the penalty table is set to 0.05 for all piece types; his majesty has a free role now.
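Continuing the sketch above (same tempo_mg table and piece enum), the pre-processor tweaks might look like this; the 0.25 king and flat 0.05 endgame values are from the text, the move-number cut-off and the queen surcharge are my guesses:

```c
/* Cheap "pre-processor" adjustments to the TEMPO table, following the
   rules described above. Values other than 25 and 5 are assumptions. */
void preprocess_tempo(int move_number, int is_endgame,
                      int white_castled, int black_castled)
{
    if (is_endgame) {                /* flat 0.05: free role for the king */
        for (int i = 0; i < NPIECE; i++) tempo_mg[i] = 5;
        return;
    }
    if (!white_castled) tempo_mg[WK] = 25;   /* 0.10 -> 0.25 */
    if (!black_castled) tempo_mg[BK] = 25;
    if (move_number <= 10) {         /* assumed cut-off: early queen play */
        tempo_mg[WQ] += 10;
        tempo_mg[BQ] += 10;
    }
}
```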

__________________________________________________________________________________________________


September 20


While the current 22,000 games LMR test run is still in progress (looks good so far) I am trying something else. Instead of relying on move ordering I am trying to do LMR with the values of the history table only. And it's doing surprisingly well on my development PC: at 40/60 it scores 53% after 1200 games. So that version will be the next test run. Is LMR really that simple? For the pseudo code look at the LMR page.
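The real pseudo code is on the LMR page; purely as an illustration of the idea, a history-only reduction decision could look like this (the thresholds and names are invented for the sketch):

```c
/* LMR driven purely by history-table values instead of move ordering.
   HISTORY_GOOD / HISTORY_OK are invented cut-offs for illustration. */
#define HISTORY_GOOD  8000
#define HISTORY_OK    2000

int lmr_reduction(int history_value, int depth, int moves_searched)
{
    if (depth < 3 || moves_searched < 4) return 0; /* too early to reduce */
    if (history_value >= HISTORY_GOOD)   return 0; /* move often succeeds */
    if (history_value >= HISTORY_OK)     return 1; /* mildly suspect      */
    return 2;                                      /* quiet nobody: -2    */
}
```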


BTW, I opened a new page, stories about the past.

__________________________________________________________________________________________________

September 21


The LMR (edition 2) test run has finished and the result is somewhat disappointing, as the last 40/60 run had a very bad start (happens sometimes, the opposite too) and could only marginally recover within 4000 games. Too bad, I can't play more. The results for LMR (edition 2):

Games  | Time control | Score | Depths
12,000 | 40/15        | 50.3% | 10.02 -  9.80 = +0.22
 6,000 | 40/30        | 52.0% | 11.17 - 10.81 = +0.36
 4,000 | 40/60        | 51.1% | 12.15 - 11.70 = +0.45

The good news is that the newest LMR version (edition 3) has done extremely well on my development PC running on 4 full cores (no hyper-threads, so faster), scoring 53.3% (2021 games), [depths] 12.58 - 12.22 = +0.36, and is now put on the 22,000 games rack. It's good to have a 4th scaling reference point even if it's only (sic) 2000 games, which still takes a full day to complete. This is maddening, I can't even imagine some people do this (and have done this) for a living.


I made a list of the search changes I still want to test, if things go as planned (they never do) testing should be finished within 2 weeks. So here goes and results will be posted when available.

Round-1 : LMR (edition 3)

Games  | Time Control | Threads | Score | Depths
12,000 | 40/15        | 16      | 50.8% |  9.78 -  9.81 = -0.03
 6,000 | 40/30        | 16      | 51.9% | 10.99 - 10.81 = +0.18
 4,000 | 40/60        | 16      | 51.8% | 11.99 - 11.69 = +0.30
 2,021 | 40/60        | 4 cores | 53.3% | 12.58 - 12.22 = +0.36


Round-2 : [Right to Move = 50]

Games  | Time Control | Threads | Score | Depths
12,000 | 40/15        | 16      | 51.4% | 10.00 -  9.82 = +0.18
 6,000 | 40/30        | 16      | 50.3% | 11.02 - 10.80 = +0.22
 4,000 | 40/60        | 16      | 52.5% | 11.92 - 11.68 = +0.24
 2,000 | 40/60        | 4 cores | 52.0% | 12.45 - 12.19 = +0.26


Round-3 : LMR (edition 3) + [Right to Move = 50]

Games  | Time Control | Threads | Score | Depths
12,000 | 40/15        | 16      | 51.8% |  9.98 -  9.82 = +0.16
 6,000 | 40/30        | 16      | 53.3% | 11.20 - 10.81 = +0.39
 4,000 | 40/60        | 16      | 52.9% | 12.26 - 11.71 = +0.55
 2,000 | 40/60        | 4 cores | 51.5% | 12.83 - 12.19 = +0.64


Round-4 (BETA-1) : [Right to Move = 50] + [King safety = 105] + [Double Isolated Pawns = 75]

Games  | Time Control | Threads | Score | Depths
12,000 | 40/15        | 16      | 53.6% |  9.99 -  9.82 = +0.17
 6,000 | 40/30        | 16      | 54.6% | 11.20 - 10.80 = +0.40
 4,000 | 40/60        | 16      | 54.3% | 12.27 - 11.70 = +0.57
 1,918 | 40/60        | 4 cores | 55.0% | lost due to power failure


Recapture extensions : max. recapture extension limited from 2 to 1. Likely fits better
with the latest fashion to focus on reductions rather than on extensions in order to
lower the branching factor.

Games  | Time Control | Threads | Score | Depths
12,000 | 40/15        | 16      | 49.7% |  9.86 -  9.82 = +0.04
 6,000 | 40/30        | 16      | cancelled
 4,000 | 40/60        | 16      | 50.8% | 11.74 - 11.70 = +0.04
 2,000 | 40/60        | 4 cores | 49.7% | 12.27 - 12.21 = +0.06


Round-6 : Pawn push 7th row extension. No longer extend unconditionally but check if the
move is sound (SEE=OK) and if the pawn can move to the promotion square.

Games  | Time Control | Threads | Score | Depths
12,000 | 40/15        | 16      | 50.3% | not available
 6,000 | 40/30        | 16      | 50.6% | 10.83 - 10.79 = +0.04
 4,000 | 40/60        | 16      | 51.0% | 11.73 - 11.71 = +0.02
 2,000 | 40/60        | 4 cores | 50.1% | 12.18 - 12.14 = +0.04


Round-5 : Late Move Pruning a.k.a. LMP. LMP prunes in the last 2 plies before the horizon
by subtracting a fixed value * movecount from the futility margin.

Games  | Time Control | Threads | Score | Depths
12,000 | 40/15        | 16      | 50.8% | not available
 6,000 | 40/30        | 16      | 50.6% | 10.89 - 10.79 = +0.10
 4,000 | 40/60        | 16      | 50.2% | 11.78 - 11.69 = +0.09
 2,000 | 40/60        | 4 cores | 49.5% | 12.31 - 12.21 = +0.10


Time Control : use 25% less time if the move is constant. The gained time can then be
used when it's really needed. Risky but worthwhile to try.

Games  | Time Control | Threads | Score | Depths
12,000 | 40/15        | 16      |       |
 6,000 | 40/30        | 16      |       |
 4,000 | 40/60        | 16      |       |
 2,000 | 40/60        | 4 cores |       |
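Going by the round-5 description above, an LMP implementation along those lines might look like this sketch (the step value and the margin handling are my assumptions, not ProDeo's numbers):

```c
/* Late Move Pruning as described for round-5: in the last 2 plies
   before the horizon, shrink the futility margin by a fixed step per
   move already searched, so late quiet moves fall out of the margin
   and get pruned. LMP_STEP and base_margin values are assumed. */
#define LMP_STEP 15                       /* centipawns per move searched */

int lmp_prune(int depth, int move_count, int static_eval,
              int alpha, int base_margin)
{
    if (depth > 2) return 0;              /* only near the horizon        */
    int margin = base_margin - LMP_STEP * move_count;
    if (margin < 0) margin = 0;           /* degrades to plain futility   */
    return static_eval + margin <= alpha; /* can't reach alpha -> prune   */
}
```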


September 22


My suspicion about [Right to Move = 100] came true, see how badly it scales from [40/15] to [40/60]. A 50% increase in search depth (plies) should bring at least 10-15 elo even with only 2000 games, and instead it gave a regression. So yesterday I halted the parameter testing and will try lower values later, which BTW also gave good results at [40/15]: 25=50.8% | 50=51.4% | 75=51.4% respectively, so there is still good hope. But then (as predicted), there goes my 2 weeks of planning.


What to do (test) next? My curiosity about the simplicity of LMR (edition 3) got the upper hand and I decided to try LMR (edition 4), doing 4-ply reductions, but then running only on pure cores and at [40/60] immediately; lower time controls at those low depths hardly make sense.


And then I made a mistake which I only noticed later: I had forgotten to restore the Right to Move parameter back to zero, and so the 2 test runs were running with the bad parameter value of 100 instead of 0. But surprisingly the results so far are still good (54%), it's searching more than a full ply deeper, and I decided to give it a chance and let it run for the moment, although it doesn't feel good.


____________________________________________________________________________________________________


September 23


I stopped both LMR (edition 4) matches. Reducing 4 plies is a bridge too far, for now; 3 is already a big step forward. Instead I will now focus on finding the optimal value of the [Right to Move] parameter, there is something to gain there. See the changed test schedule above.


____________________________________________________________________________________________________


September 24


[Right to Move = 50] testing finished. A remarkably positive result. It shows me again that you cannot be careful enough messing around with unclear and hard to define evaluation ingredients such as the value of a tempo. Obviously in the past I used the wrong values: from 100 back to 0, even 125, then in version 1.86 back to 0, to arrive at 100 again in 1.87. Now that I have reasonable hardware I don't have to guess wildly any longer, so it seems, for the moment.


Next, combining LMR (edition 3) with [Right to Move = 50] and seeing what happens. In progress now.


____________________________________________________________________________________________________


September 26


The results of test round-3 (see the test schedule above) look fine, except for 40/60 with the highest scaling, which is worrying. OTOH it's only 2000 games with an error bar of 13 elo points, so there is still hope. For that reason we now include the 2 (more or less proven) positional improvements (the king safety and the double isolated pawn changes) and start round-4. And the 40/60 [2000 games] run should really give a good jump, else this effort for a new version is on the brink of failing.


____________________________________________________________________________________________________


September 27


There was a short power failure which caused one of my PCs to reboot. So unfortunately only 1918 of the 2000 games were played and I will leave it that way; restarting a match with cutechess-cli is problematic in the way I use the program (without the concurrency option). But..... I am very happy with the result, 55.0% (35 elo), see the test schedule above. The 3 other runs are looking good as well. I will label this version as BETA-1, even though one match in this round is still running.


It's another reminder that sometimes the result of 2000 games can be very misleading, see round-3. It's what I noticed 2 years ago when I faced (and underwent) another attack on my programmer genes to improve that old beast and museum piece of the 80's and 90's. It goes like this: you play 2000 40/60 games using 4 cores, which takes 30 hours to complete, and you get a (say) 51.5% score (+10 elo); you play the same match again and you can get 49.5%, thus a regression! It happened to me several times and is predicted by the error bar (margin) that comes with 2000 games. So, every now and then these things happen, unbalanced randomness finding the edges (+ or -) of the error bar (margin).
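For what it's worth, the quoted error bars can be reproduced with a back-of-the-envelope calculation (the draw rate is my assumption; the factor 700 is the ~7 elo per percent slope from earlier):

```c
#include <math.h>
#include <stdio.h>

/* Approximate 95% error margin (in elo) of a match score near 50%,
   modelling each game as win/draw/loss. The draw rate is an assumed
   input; 700 converts score units to elo near 50% (~7 per percent). */
double elo_margin_95(int games, double draw_rate)
{
    double sd = 0.5 * sqrt((1.0 - draw_rate) / games); /* sd of mean score */
    return 1.96 * sd * 700.0;
}

int main(void)
{
    printf("%.0f\n", elo_margin_95(2000,  0.3)); /* ~13: the bar quoted above      */
    printf("%.0f\n", elo_margin_95(12000, 0.3)); /* ~5 : the -5/+5 bar for 12,000  */
    return 0;
}
```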


Anyway, a 35 elo improvement in just 5-6 weeks is not bad at all; in the 80's and 90's that sometimes took a full year. I want 50 elo for a version worthy of the name REBEL 13, so 15 elo to go. The next round is testing LMP, usually good for a 10% speed-up, already tested at [40/15] with 12,000 games scoring 50.8%. We will see how it scales.


____________________________________________________________________________________________________


September 29


LMP testing finished and it scales badly, all the way from 50.8% -> 50.6% -> 50.2% to even 49.5%. Even though the overall score is somewhat positive, it would be risky to count it as an improvement. I have made scaling a dominant point for this version because I noticed from statistics made of rating lists that ProDeo doesn't scale well, meaning that its performance drops the longer the time control. Whatever the reason for that (and I don't think any programmer can fully grasp the reasons for this phenomenon), I think it makes sense to try to improve by only accepting changes that scale well. It's an experiment. And a time consuming one.


It's best for now to put LMP in the freezer and have a look at the code later; it should be good for some 5-10 elo.


There is not much left on the menu to test that could possibly bring the desired 15 elo for a version release, so I must go back to the drawing board hunting for new candidate improvements. In the meantime I am now testing the recapture extension, limiting its maximum from 2 to 1; heck, I might even try to do without it entirely.


____________________________________________________________________________________________________


September 30


Fewer recapture extensions isn't an improvement either, and doing no recapture extensions at all is a big regression, so I am stuck for the moment. I will take a moment of reflection and either find some new changes or release the thing as ProDeo 1.9 and enjoy life again.


____________________________________________________________________________________________________


October 1


I consulted my notes from the past with suggested (small) improvements (ideas) and picked a number of them to try. Most of them were hardly measurable with the hardware of the past and were stamped as unclear, thus not used. The below list of changes will only be tested at [40/15] with 12,000 bullet games. If there is a sign of improvement a change will be included in the scaling testing later.

#  | Games  | Score | LOS   | Description
1  | 12,000 | 50.3% | ----- | OLD - [Bad Bishop = 100]  ->  NEW - [Bad Bishop = 75]
2  | 12,000 | 50.3% | ----- | OLD - [Minimum Knight Mobility = 100]  ->  NEW - [Minimum Knight Mobility = 50]
3  | 12,000 | 50.3% | 78.9% | Increasing passed pawn scoring for the middlegame [Passed Pawns MIDG = 150]
3a | 12,000 | 50.6% | 98.4% | Increasing passed pawn scoring for the middlegame [Passed Pawns MIDG = 175]
3b | 12,000 | 49.4% |  4.3% | [Passed Pawns MIDG = 200]
   | 12,000 | 50.5% | 88.8% | [Passed Pawns MIDG = 163]
4  | 12,000 |       |       | Futility pruning - don't prune pawn moves to the 6th/7th row. Played with 16 threads.
4b |  8,000 | 50.2% |       | Futility pruning - don't prune pawn moves to the 6th/7th row. Played with 4 cores.
5  | 12,000 |       |       | Knight outpost currently set to 125, try the 75 and 100 settings.
6  | 12,000 | 50.1% | 59.6% | [Search Safety = 400] is an old parameter that nowadays is only used in the late endgame when nullmove is disabled. Its current value is 200 and I have good reasons to believe that part of the search should be less selective, hence we double its safety. [late endgame depths] 14.61 - 14.64 = -0.03
6a | 12,000 | 50.2% | 69.7% | Related to [6], complete rewrite of the search part that handles the late endgame. [late endgame depths] 14.41 - 14.76 = -0.35
6b |  6,000 | 50.1% | 57.1% | [40/30] [16 threads] [depths] 15.81 - 16.20 = -0.39
6c |  4,000 | 49.6% | 23.8% | [40/60] [16 threads] [depths] 17.11 - 17.40 = -0.29
6d |  2,000 | 49.6% | 32.6% | [40/60] [ 4 cores ] [depths] 17.64 - 18.10 = -0.46
7  | 12,000 |       |       | Pawn evaluation scaling by material on the board.
8  | 12,000 |       |       | In the endgame attack the opponent's pawns from behind.
9  | 12,000 |       |       | [Passed Pawn Tropism (1) = 100] evaluates a bonus for the king supporting its own passed pawn(s). Values to test that make sense are 75 and 125. Endgame stuff.
10 | 10,000 | 49.5% | 12.2% | [Passed Pawn Tropism (2) = 200] evaluates the distance of the king to enemy passed pawn(s). Values to test that make sense are 250 and 300. Endgame stuff.

Remarks


1. A 12,000 games run takes about 11 hours to complete.


2. A 50.3% result (indicating +2 elo) with only 12,000 games is pretty meaningless in terms of the error bar (margin), which is -5/+5. A 50.3% result more or less guarantees the change is not a regression, and if it is, it's most likely a very minor one. Looking at it from the bright side, it statistically can also be a 5 elo improvement.


3. The hope is that a cocktail of the above changes can bring me the 15 elo I want.


This is a boring (the waiting) and fascinating process at the same time.


___________________________________________________________________________________________________


October 2 (some musings)


[1] Bad Bishop evaluation is about a bishop looking at its own pawns in the 2 forward directions; a penalty is given depending on the square before that pawn, via a simple piece table multiplied by a factor of 0, 1 or 2 from a square table. For a more detailed description see here.
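A hypothetical reconstruction of that scheme, with placeholder tables (the real values and board representation are on the linked page, not here):

```c
/* Bad Bishop: scan the bishop's 2 forward diagonals for own pawns; for
   each blocking pawn, charge penalty_table[square in front of the pawn]
   multiplied by factor_table[] (0, 1 or 2). Squares are 0..63, a1 = 0.
   Table contents below are placeholders, not ProDeo's values. */
static const int penalty_table[64] = { 0 };  /* base penalty per square  */
static const int factor_table[64]  = { 0 };  /* multiplier 0, 1 or 2     */

int bad_bishop_penalty(const int own_pawn[64], int bishop_sq, int white)
{
    int fwd = white ? 1 : -1;                /* forward rank direction   */
    int total = 0;

    for (int df = -1; df <= 1; df += 2) {    /* the 2 forward diagonals  */
        int f = bishop_sq % 8 + df;
        int r = bishop_sq / 8 + fwd;
        for (; f >= 0 && f < 8 && r >= 0 && r < 8; f += df, r += fwd) {
            int s = r * 8 + f;
            if (own_pawn[s]) {               /* bishop looks at own pawn */
                int front = s + 8 * fwd;     /* square before that pawn  */
                if (front >= 0 && front < 64)
                    total += penalty_table[front] * factor_table[front];
                break;                       /* diagonal is blocked      */
            }
        }
    }
    return total;
}
```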


[2] Minimum Knight Mobility is described here.


[3] The Passed Pawn scoring in REBEL since day one is doubled when the board position is an endgame position. It's based on the observation that with a board full of pieces a passed pawn, and the danger it may represent, can be neutralized more easily than with a board of few pieces, the endgame. Now that I have reasonable hardware to finally properly test this 35 year old hypothesis, I am pretty surprised to see that the assumption most likely was never true; the [3a] 50.6% score with a LOS confidence score of 98.4% is a convincing number. Hence we extend the testing and set the parameter to 200, meaning passed pawn scoring is then fully equal between the middle game and endgame. I cannot believe this is true, it goes against all my chess instincts, but since numbers don't lie I will have to accept it; this is computer chess after all, not normal chess. Keeping my fingers crossed.
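My reading of the parameter's semantics, as a tiny illustrative sketch (the phase split and percentage scaling are inferred from the description, not taken from the engine):

```c
/* [Passed Pawns MIDG = xxx]: the endgame score stays fixed at double
   the base value (day-one REBEL behaviour) while xxx scales the
   middlegame score in percent; at xxx = 200 both phases score passed
   pawns equally. Illustrative only. */
int passed_pawn_score(int base, int midg_percent, int is_endgame)
{
    if (is_endgame)
        return base * 2;                  /* doubled since day one      */
    return base * midg_percent / 100;     /* 100 = default, 200 = equal */
}
```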


[3b] The [Passed Pawns MIDG = 200] match finished (uncrossing my fingers) and, no real surprise, it's even far worse (49.4%) than the [100] default setting. So the optimal value seems to be in the [150] - [200] area. I will test that later using the "binary search" approach, starting with [163], knowing that [175] gives 50.6%. For now I want to test the fundamental [6a] search change first.


[4b] Futility pruning - don't prune pawn moves to the 6th/7th row. I anticipated a bit on my development PC, playing 8000 [40/15] games, score 50.2%. Something like this was expected.


[6] 50.1% is a meaningless result and thus a waste of time. I had better have another look at the new [6a] search code. Put that in my notes.


[6a] While just a 50.2% score is disappointing (I expected more), I should not complain too much looking at the huge drop in depth (-0.35) the rewrite caused, which is 1/3 of a full ply. Nevertheless it should be tested at longer time controls, which I will do later. Also, the new code leaves room for improvement.


[10] [Passed Pawn Tropism (2) = 250] is a regression (49.5%), no need to test the [300] value.


____________________________________________________________________________________________________


October 4


We take a break from the candidate peanut improvements, as I don't expect much from the remaining [5], [7], [8] and [9], and return to round-6 above: only extend 7th row pawn pushes conditionally and no longer unconditionally. Round-6 is now finished, and with the match scores of 50.3 | 50.6 | 51.0 | 50.1 we will include this change in the upcoming BETA-2 testing.
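As a sketch of what the conditional version amounts to (the helper names below are stand-ins for whatever the engine provides, not ProDeo's real API):

```c
/* Round-6: extend a pawn push to the 7th row only when the push is
   sound (SEE=OK) and the pawn can actually move on to the promotion
   square. All helpers are assumed, illustrative prototypes. */
typedef struct Position Position;   /* engine position, details omitted */
typedef int Move;

extern int is_pawn_push_to_7th(const Position *, Move);
extern int see(const Position *, Move);   /* static exchange evaluation */
extern int can_reach_promotion(const Position *, Move);

int extend_pawn_push_7th(const Position *pos, Move m)
{
    if (!is_pawn_push_to_7th(pos, m)) return 0;
    if (see(pos, m) < 0)              return 0; /* unsound push          */
    if (!can_reach_promotion(pos, m)) return 0; /* promotion square shut */
    return 1;                                   /* extend by one ply     */
}
```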


In the meantime we continue our work on finding the optimal value for the [Passed Pawns MIDG = xxx] parameter. [150] gave 50.3%, [175] gave 50.6%, so we try [163] as the first one.


____________________________________________________________________________________________________


October 6


Adding up the results of the October 1 list (1.6%) doesn't justify a new beta round yet; it's too little for my taste. Experience has taught me that lumping together 4-5-6 individual small improvements doesn't bring the sum of the individual scores, as these changes are going to interact with each other and even more with the rest of the engine. I would be a happy man with a 1% (7 elo) gain.


On the other hand I will accept [4], futility pruning - don't prune pawn moves to the 6th/7th row, as an improvement without further testing. The lack of it is a system flaw from the beginning: pawns that move to the 6th row are always dangerous, let alone when they move to the 7th row. Pruning them in the last 6 plies (the current futility depth) before the horizon is a recipe for trouble.
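In code the fix is a simple exemption inside the futility condition; a minimal sketch from white's point of view (square numbering is mine, mirror the rank for black):

```c
/* Change [4]: inside futility pruning, never prune a pawn move to the
   6th or 7th row. Squares are 0..63 with a1 = 0. */
int futility_may_prune(int is_pawn_move, int to_square)
{
    int rank = to_square / 8 + 1;            /* 1..8                    */
    if (is_pawn_move && rank >= 6) return 0; /* dangerous pawn: keep it */
    return 1;                                /* normal futility rules   */
}
```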


[6a] is interesting enough to test further, so I put it on the scale-rack right now to satisfy my curiosity, despite my growing impatience with this kind of testing. And so we are going to lean back (again) and watch for another 1½-2 days before the results are in. I ordered and downloaded an interesting book yesterday to keep me busy in the meantime.


____________________________________________________________________________________________________


October 7


Bad results for the [6a] version. I am taking another moment of reflection.


Time for a new beta round, BETA-2. Hopefully the final version.


We lump every positive change together, test it only at decent time controls, let it run for a couple of days, and I will keep you up-to-date about its progress 3-4 times a day. Currently...


Snapshot | Games | Time control | Cores | Score | ELO | Depths
1        |   242 | 40/60        | 8     | 53.9% | +27 | 12.97 - 12.17 = +0.80
2        |  1373 | 40/60        | 8     | 53.1% | +22 | 12.85 - 12.14 = +0.74
3        |  2387 | 40/60        | 8     | 53.4% | +24 | 12.83 - 12.13 = +0.70
4        |  3382 | 40/60        | 8     | 53.6% | +25 |
5        |  4000 | 40/60        | 8     | 53.8% | +26 |


Snapshot | Games | Time control | Cores | Score | ELO | Depths
1        |    31 | 40/240       | 4     | 61.3% | +80 | 14.88 - 14.02 = +0.86
2        |   166 | 40/240       | 4     | 58.1% | +56 | 15.04 - 14.17 = +0.87
3        |   277 | 40/240       | 4     | 55.2% | +37 | 15.04 - 14.16 = +0.88
4        |   396 | 40/240       | 4     | 54.8% | +35 |
5        |   508 | 40/240       | 4     | 54.1% | +29 |


Notes

1. The second run is at the unusual 40/240 (CEGT | CCRL) level because the emphasis of this release is on scaling.


2. The Depths column measures the progress made in search depth for the middle game. Not very interesting for the user I suppose, but for me as a statistics addict it is.


Snapshot notes


[2] The results so far for the first run [40/60] are far below expectations, likely (although it's too early to tell at the moment) due to the interaction of the 2 parameters that gave a not so convincing score (50.3%) (see [1] and [2] above) with the rest of the changes. As it looks now (and this might be premature) a third beta run will be needed. C'est la vie.


[3] From this point on not much will change in the average search depth for the middlegame, and it's good to see it increasing the longer the time control. Not that I am happy with the results so far, I am definitely not.


__________________________________________________________________________________________________


October 9


We remove the 2 changes from BETA-2 that were not so clear (they didn't have a LOS of 95+%, but somewhere in the 70s), call it BETA-3, and repeat the above test procedure: 4000 [40/60] games and 500 at the CCRL | CEGT level.

Snapshot | Games | Time control | Cores | Score | ELO | Depths
1        |   107 | 40/60        | 8     | 55.1% | +35 |
2        |   982 | 40/60        | 8     | 55.2% | +36 | 12.95 - 12.13 = +0.82
3        |  2014 | 40/60        | 8     | 55.3% | +37 | 12.83 - 12.12 = +0.71
4        |  3284 | 40/60        | 8     | 55.1% | +35 | 12.82 - 12.12 = +0.70
5        |  4000 | 40/60        | 8     | 55.4% | +37 | 12.82 - 12.12 = +0.70


Snapshot | Games | Time control | Cores | Score | ELO | Depths
1        |    13 | 40/240       | 4     | 50.0% |   0 |
2        |   110 | 40/240       | 4     | 58.6% | +60 | 15.07 - 14.11 = +0.96
3        |   238 | 40/240       | 4     | 56.5% | +45 | 15.09 - 14.16 = +0.91
4        |   390 | 40/240       | 4     | 56.4% | +44 | 15.06 - 14.14 = +0.92
5        |   500 | 40/240       | 4     | 56.0% | +42 | 15.05 - 14.14 = +0.91

Okay, I am reasonably satisfied; reasonably, because a 35-45 elo gain in just 2 months was a dream scenario in the old days. Nevertheless, incest (self-play) testing isn't everything and you never know for sure how well the changes will work out against other engines. What can be said with certainty: it's better.


And so this ends this blog, and I will start making preparations for the release.


And speaking of preparations, be prepared for a surprise.


Follow us on Facebook for the latest developments.