Quantifying the Influence of the Opening on Pawn Structure

The choice of opening is supposed to have a lingering effect throughout a game of chess. So much so, in fact, that a game is routinely referred to as “a Sicilian game”, a “Queen’s pawn game”, and so on. Most of the influence of the opening comes from the pawn structure: pawns are relatively immobile, so whatever structure is created early on tends to persist for most of the rest of the game.

This is a nice story, but as it is, it’s just a story. How often is the pawn structure destroyed, erasing evidence of the opening? What proportion of games actually have a recognizable opening? How long does the opening linger — is it still visible after 30 moves? In other words: can we quantify this?

Well, yes. The plot above is a crude measure of how strong the influence of the opening is after some number of moves. (The axis is labelled in “ply” — half-moves — so move 30 is the last shown in the plot.) Obviously there’s no influence before any moves have been played; influence peaks after just a few moves (defining the opening); influence then decays, reaching half its peak strength around move 15.

How is this measure obtained? If you believe that the choice of opening has a large influence on a game even 20 moves later, then you ought to be able to look at a board after 20 moves — without having seen the first part of the game! — and accurately guess the opening. That plot is obtained by teaching a computer to perform that task, and then measuring how well it does. The better it does, the larger the influence of the opening.

A support vector machine is a machine learning algorithm often used for classification. The usual approach is supervised learning. I have a bunch of data points (in this case, chess positions), and I’ve classified them in some way (in this case, according to ECO code). The SVM is trained on this prepared data, and at the end, it can be used to perform the classification task: given a chess position, guess what the opening was.
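The training loop described above can be sketched in a few lines. This is not the code I ran (I used libsvm directly, and the data here is random stand-in data, not real positions), but scikit-learn’s `SVC` wraps libsvm, so the shape of the experiment is the same:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: each row is a 64-component pawn vector (described below),
# each label is the first letter of the game's ECO code (A through E).
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(500, 64))  # placeholder for real positions
y = rng.choice(list("ABCDE"), size=500)     # placeholder for real ECO labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf")  # scikit-learn's SVC is a wrapper around libsvm
clf.fit(X_train, y_train)

# Fraction of held-out positions whose opening class is guessed correctly.
accuracy = clf.score(X_test, y_test)
```

With real positions in `X`, that `accuracy` number is exactly the quantity plotted: how well the opening can be recovered from the position.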

Now, “guessing what the opening was” isn’t really useful. But whether the SVM works or not… that is interesting information! It’s a crude measure of how easy it is to tell what the opening was, given a position. In other words, it’s a measure of how much influence the opening moves (as opposed to all the other moves) had on the position.

Pawns are supposed to mediate the influence of the opening, so I fed the SVM only information about pawns. (This also makes the training faster, and I’m both lazy and impatient.) Guessing just the first letter of the ECO code is easier than guessing the full code, and it splits the openings into five broad classes (A through E), so that’s the categorization the machine was trained on. The plot above shows the percentage of openings correctly classified, as a function of the number of moves played to reach the position. It measures, therefore, how easy it is to guess the opening from a position by looking only at the pawn structure.
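Producing the curve amounts to repeating the classification experiment at each ply depth and recording the accuracy. A minimal sketch of that loop, again with random stand-in data in place of real positions (the helper name and the dictionary layout are my own, not from the original experiment):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def accuracy_at_ply(vectors_by_ply, labels):
    """For each ply depth, estimate how accurately the ECO letter can be
    guessed from the pawn vectors sampled at that depth."""
    scores = {}
    for ply, X in vectors_by_ply.items():
        clf = SVC(kernel="rbf")
        # Cross-validated accuracy: one point on the influence curve.
        scores[ply] = cross_val_score(clf, X, labels, cv=3).mean()
    return scores

# Stand-in data: 300 games, pawn vectors sampled at three ply depths.
rng = np.random.default_rng(1)
labels = rng.choice(list("ABCDE"), size=300)
vectors_by_ply = {p: rng.choice([-1, 0, 1], size=(300, 64)) for p in (10, 30, 60)}

curve = accuracy_at_ply(vectors_by_ply, labels)  # ply -> accuracy
```

Plotting `curve` (ply on the x-axis, accuracy on the y-axis) gives the kind of influence-versus-depth plot shown above.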

My understanding is that the format of the input can have a large impact on the performance of the SVM, so to avoid ambiguity: each position was described as a 64-component vector. Each component corresponds to a single square. A component is $0$ if there is no pawn on that square, $1$ if the pawn on that square is white, and $-1$ if it’s black.
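That encoding is easy to compute straight from a FEN string, since FEN lists the board square by square. A small encoder (the a8-first square ordering simply follows FEN’s reading order; the function name is my own):

```python
def fen_to_pawn_vector(fen):
    """Encode only the pawns of a FEN position as a 64-component vector.

    Component i corresponds to a square, reading the FEN board field left
    to right (a8 first, h1 last): 0 if the square has no pawn, 1 for a
    white pawn, -1 for a black pawn.
    """
    board_field = fen.split()[0]  # the piece-placement part of the FEN
    vec = []
    for ch in board_field:
        if ch == "/":             # rank separator, no square
            continue
        if ch.isdigit():          # run of empty squares
            vec.extend([0] * int(ch))
        elif ch == "P":           # white pawn
            vec.append(1)
        elif ch == "p":           # black pawn
            vec.append(-1)
        else:                     # any other piece: square occupied, but not by a pawn
            vec.append(0)
    return vec
```

For the starting position, this yields eight $-1$s on black’s second rank, eight $1$s on white’s second rank, and zeros everywhere else.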

I used libsvm for this experiment, so there’s not really any code to post. I took the data from nikonoel’s database of high-rated Lichess games in June 2020.
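For anyone wanting to reproduce this with libsvm: it reads a sparse text format, one example per line, with a numeric label followed by 1-based `index:value` pairs (zero components omitted). A small formatter, with my own mapping of ECO letters to numeric labels:

```python
def to_libsvm_line(label, vector):
    """Format one example in libsvm's sparse text format:
    '<label> <index>:<value> ...', 1-based indices, zeros omitted.

    libsvm expects numeric labels, so the ECO letters A-E are mapped
    to 0-4 here (an arbitrary but convenient choice).
    """
    label_id = "ABCDE".index(label)
    feats = " ".join(f"{i + 1}:{v}" for i, v in enumerate(vector) if v != 0)
    return f"{label_id} {feats}"
```

For example, `to_libsvm_line("B", [0, 1, -1, 0])` produces the line `1 2:1 3:-1`.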