The contribution of this study will be to identify trading strategies that outperform or underperform what would be expected if there were no difference in effectiveness between strategies. This will be achieved first by constructing the NHL as a transaction network and then by establishing a series of metrics and characteristics that can be combined and evaluated as a strategy.
A successful analysis will yield results that will be informative to the management of transacting sports teams. The analysis can be informative in either the positive or negative. If a particular strategy obviously outperforms, then this strategy should be considered. Alternatively, if a strategy dramatically underperforms, this strategy should be avoided. The results will also be useful in sports betting and prediction markets on either side of the action. Results that can be used in predictive machine learning models will be useful to sportsbooks who set lines. Models of this type would also be useful to bettors who can use these features to exploit inefficiencies in the markets made by bookmakers.
Datasets and Methods
The raw data is sourced from two websites:
Hockeyreference.com was used for the team performance data and prosportstransactions.com was used for the transaction data. The data from hockeyreference.com was generally quite clean and required few transformations. The performance data consists of 32 columns, two of which were engineered for this project (“Conference Champion” and “League Champion”). The details of this dataset are in Figure 6. Hockeyreference.com provided options for downloading the relevant tables of statistics. The transaction data, however, was scraped using Beautiful Soup.
Since the transaction data is original and was procured specifically for this project, it required significantly more cleaning and transforming. The original version contained more than 85,000 rows with transactions dating back to 1908. In its original format, it was structured as below in Figure 7.
In order to turn this into useful information for network analysis, the first step was to extract ‘Team A’ and ‘Team B’ from each transaction. These correspond to two nodes, and the transaction they conduct is the edge that connects them. It was also important to account for the number of transactions conducted between two specific teams, which corresponds to an edge weight. The final version of the dataset also has columns for the number of players sent and received, from which the inference about a preference for buying or selling was made. The date columns are also important: an NHL season spans January 1, so one cannot simply use the calendar year to determine which season a transaction took place in. Free agency begins every year on July 1, and this was the cutoff date used to assign each transaction to a season. This makes the most sense because there is a trade deadline every year towards the end of the regular season, which means that conference champions necessarily win their championships in June with the team that they constructed between July 1 and early April. This is similar to the transfer windows in European soccer, where each window has more or less of an impact on either the current season or the subsequent one. The final version of the dataset is summarized in Figure 8.
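As a rough illustration of the July 1 cutoff and the edge weights, here's a minimal sketch in plain Python. The function name `season_label`, the team abbreviations, and the sample transactions are all made up for the example; the real dataset has more columns and was cleaned separately.

```python
from collections import Counter
from datetime import date


def season_label(d: date) -> str:
    """Map a transaction date to a season label, using July 1 as the cutoff."""
    # Dates on or after July 1 belong to the season starting that year;
    # earlier dates belong to the season that started the previous year.
    start = d.year if (d.month, d.day) >= (7, 1) else d.year - 1
    return f"{start}-{(start + 1) % 100:02d}"


# Hypothetical cleaned transactions: (date, Team A, Team B)
transactions = [
    (date(2005, 7, 1), "DET", "CHI"),
    (date(2006, 3, 9), "CHI", "DET"),
    (date(2006, 6, 30), "DET", "TOR"),
    (date(2006, 7, 1), "DET", "TOR"),
]

# Edge weight = number of transactions between two teams within a season.
# The pair is sorted so that (A, B) and (B, A) count as the same undirected edge.
edge_weights = Counter()
for d, team_a, team_b in transactions:
    edge_weights[(season_label(d), *sorted((team_a, team_b)))] += 1
```

With this sample, the two Detroit-Chicago transactions land in the same 2005-06 edge, while the two Detroit-Toronto transactions fall on opposite sides of the cutoff and end up in different seasons.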
The time period of this project is limited to the seasons 2000-01 to 2009-10 inclusive. This decision was made simply because of the time constraint of the project. This 10-year period is the most recent period in which there were no major structural changes to the league such as divisional reorganization, franchise relocation, or league expansion. It’s certainly possible to conduct the same analysis on the complete dataset, but it was decided that the time required to account for these factors could have prevented any conclusion from being reached. Particularly challenging to manage are the divisional reorganizations, since a team’s division is a critical component in determining its strategy. The major drawback to this period is that the 2004-05 season was cancelled due to a collective bargaining disagreement. As such, there were few transactions and no games played.
There are several methods, formulas, and definitions that should be established before proceeding further with the analysis. Most of them can be found as node attributes in the network. A summary of the attributes can be found in Table 1.
Node Authority Value
An authoritative node is one that is linked with many hubs. A hub is a node with a high degree, meaning it has many connections. The authority value is calculated using the HITS algorithm (Hyperlink-Induced Topic Search). The node’s raw authority score is the sum of the hub values of the nodes that it’s linked with.
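To make the definition a bit more concrete, here's a minimal sketch of HITS as a power iteration in plain Python. It is not the code used for this project; the function name `hits`, the toy graph, the fixed iteration count, and the choice to normalize scores so they sum to 1 are all assumptions for illustration.

```python
def hits(adj, iterations=100):
    """Power-iteration HITS on a weighted adjacency dict {node: {neighbor: weight}}.

    Returns (hubs, authorities), each normalized to sum to 1.
    """
    hubs = {n: 1.0 for n in adj}
    auths = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # Authority score: weighted sum of the hub scores of nodes linking in.
        new_auths = {n: 0.0 for n in adj}
        for u, nbrs in adj.items():
            for v, w in nbrs.items():
                new_auths[v] += w * hubs[u]
        # Hub score: weighted sum of the authority scores of nodes linked to.
        new_hubs = {
            u: sum(w * new_auths[v] for v, w in nbrs.items())
            for u, nbrs in adj.items()
        }
        a_total = sum(new_auths.values()) or 1.0
        h_total = sum(new_hubs.values()) or 1.0
        auths = {n: s / a_total for n, s in new_auths.items()}
        hubs = {n: s / h_total for n, s in new_hubs.items()}
    return hubs, auths


# Hypothetical undirected toy graph: a triangle A-B-C plus a pendant node D on A.
graph = {
    "A": {"B": 1, "C": 1, "D": 1},
    "B": {"A": 1, "C": 1},
    "C": {"A": 1, "B": 1},
    "D": {"A": 1},
}
hubs, auths = hits(graph)
```

One thing worth noticing: because the toy graph is undirected (every edge appears in both directions), the hub and authority scores end up essentially identical here. That tends to be the case for connected undirected graphs that are not bipartite.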
Node Betweenness Centrality
The betweenness centrality of a node is a number that corresponds to how frequently the node lies on the shortest path between two other nodes. It’s often used as a measure of influence, since more information flows through nodes with high betweenness centrality.
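For readers who want to see the mechanics, here's a compact sketch of Brandes' algorithm for betweenness centrality. It assumes an unweighted, undirected graph and returns raw (unnormalized) counts, which may differ from the weighting and normalization used in the project; the function name and the toy graphs are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj is {node: set_of_neighbors}. Returns raw (unnormalized) scores.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording each node's predecessors on those paths.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical toy graphs: a three-node path and a star with three leaves.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
star = {"X": {"P", "Q", "R"}, "P": {"X"}, "Q": {"X"}, "R": {"X"}}
path_scores = betweenness(path)
star_scores = betweenness(star)
```

In the path, only the middle node sits between another pair, so it gets a score of 1. In the star, the center sits between all three pairs of leaves, so it gets a score of 3.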
This number simply corresponds to the division of the team, which had to be encoded as a number in order to properly display the interactive network using D3.js.
The NHL has two conferences, East and West. The league has since expanded, but during the time period in focus, each conference had 15 teams, eight of which went to the playoffs. Each conference would produce one champion who would play the opposing conference champion in the Stanley Cup Final. For the purpose of this study, conference champions are examined rather than league champions. A conference championship is still quite an accomplishment and it doubles the sample size of “winners” from 9 to 18.
Transaction with a Different Conference
The sum of transactions with teams in the opposite conference. This is not weighted by the number of players sent/received but corresponds to a transaction of any size. An overweight number of transactions in this category will correspond to a “National” strategy.
The degree of a node is the number of other nodes that it’s linked to. A node with a higher degree has more ‘neighbors.’
The NHL during the period in focus for this study had six divisions, three in each conference. They’re organized regionally and each contained five teams. The Eastern Conference had the Atlantic, Northeast, and Southeast divisions. The Western Conference had the Pacific, Northwest, and Central divisions.
Favorite Partner and Favorite Partner Weight
Favorite partner is the preferred trading partner as evaluated by the number of transactions. Favorite Partner Weight is the number of transactions conducted with that preferred partner.
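As a small sketch of how these two attributes could be derived from edge weights: the function name `favorite_partner`, the weights, and the team abbreviations below are hypothetical, and breaking ties by taking the first partner encountered is my assumption rather than a rule from the dataset.

```python
def favorite_partner(weights, team):
    """Return (partner, weight) for the partner with the most transactions.

    weights maps an undirected pair (team_a, team_b) to a transaction count.
    Returns (None, 0) if the team has no transactions.
    """
    totals = {}
    for (a, b), w in weights.items():
        if team == a:
            totals[b] = totals.get(b, 0) + w
        elif team == b:
            totals[a] = totals.get(a, 0) + w
    if not totals:
        return None, 0
    # Assumption: ties go to the first partner encountered.
    partner = max(totals, key=totals.get)
    return partner, totals[partner]


# Hypothetical season edge weights
weights = {("CHI", "DET"): 2, ("DET", "TOR"): 4, ("CHI", "TOR"): 1}
det_partner, det_weight = favorite_partner(weights, "DET")
```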
Node Hub Value
A hub in a network is a node with a high degree. Its hub value specifically is calculated using the HITS algorithm (Hyperlink-Induced Topic Search). The node’s raw hub score is the sum of the authority values of the nodes that it’s linked with.
Node Local Reaching Centrality
Local Reaching Centrality is defined as “the proportion of all nodes in the graph that can be reached from node i via outgoing edges.” The method is primarily used for directed graphs; however, it is generalizable to undirected graphs, which is how the graphs in the study were constructed.
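Here's a minimal sketch of that definition for an unweighted, undirected graph. Counting the other reachable nodes and dividing by n - 1 (so the node itself is excluded) is an assumption on my part, and the function name and toy graph are hypothetical.

```python
from collections import deque


def local_reaching_centrality(adj, node):
    """Proportion of the other nodes reachable from `node`.

    adj is {node: set_of_neighbors}. Assumes the node itself is excluded
    from both the count and the denominator.
    """
    n = len(adj)
    if n <= 1:
        return 0.0
    # Breadth-first search to collect everything reachable from `node`.
    seen = {node}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return (len(seen) - 1) / (n - 1)


# Hypothetical graph with two disconnected groups: A-B-C and D-E.
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"E"},
    "E": {"D"},
}
reach_a = local_reaching_centrality(graph, "A")
reach_d = local_reaching_centrality(graph, "D")
```

In a single season's network, a team that made no transactions at all would score 0, while a team in a fully connected network would score 1.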
Average Degree of Neighbors
The average degree of neighbors of a node is the sum of the neighbors’ degrees divided by the total number of neighbors.
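Degree and average degree of neighbors are simple enough to sketch directly. This version ignores edge weights, which is an assumption; the function names and the toy graph are made up for the example.

```python
def degree(adj, node):
    """Number of distinct neighbors of `node`; adj is {node: set_of_neighbors}."""
    return len(adj[node])


def average_neighbor_degree(adj, node):
    """Sum of the neighbors' degrees divided by the number of neighbors."""
    neighbors = adj[node]
    if not neighbors:
        return 0.0
    return sum(degree(adj, n) for n in neighbors) / len(neighbors)


# Hypothetical toy graph: a triangle A-B-C plus a pendant node D on A.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
avg_a = average_neighbor_degree(graph, "A")  # neighbors have degrees 2, 2, 1
avg_d = average_neighbor_degree(graph, "D")  # only neighbor is A, with degree 3
```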
Net Players In/Out
This number is the result of Players Sent – Players Received. It’s a season total and is a determining factor in whether a team is a buyer or seller. A team whose net_players value is less than -1 is classified as a seller. A team whose net_players value is more than 1 is classified as a buyer. Values between -1 and 1 are classified as balanced or having no preference.
PageRank ranks each node with respect to its importance as determined by the number and quality of its incoming links. It’s intended for use with directed graphs; however, it is generalizable to undirected graphs, which is how the graphs were constructed for this project. The method and algorithm have been popularized by the founders of Google; however, I think it’s important to note that in their original papers and filings they cite the work of network science researchers Jon Kleinberg and Massimo Marchiori as well as the founder of Baidu, Robin Li. PageRank and Betweenness Centrality are closely related, as shown in Figure 9 above.
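To show how the ranking works in practice, here's a minimal power-iteration sketch of PageRank on a weighted, undirected graph. The damping factor of 0.85, the fixed iteration count, the function name, and the toy graph are assumptions for illustration, not the settings used in this project.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted adjacency dict {node: {neighbor: weight}}.

    An undirected graph is represented by listing each edge in both directions.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        # Rank held by nodes with no links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in adj if not adj[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in adj}
        for u, nbrs in adj.items():
            total_weight = sum(nbrs.values())
            for v, w in nbrs.items():
                # Each node passes its rank to neighbors in proportion to edge weight.
                new_rank[v] += damping * rank[u] * w / total_weight
        rank = new_rank
    return rank


# Hypothetical toy graph: a triangle A-B-C plus a pendant node D on A.
graph = {
    "A": {"B": 1, "C": 1, "D": 1},
    "B": {"A": 1, "C": 1},
    "C": {"A": 1, "B": 1},
    "D": {"A": 1},
}
ranks = pagerank(graph)
```

In this toy graph the best-connected node, A, ends up with the highest rank and the pendant node D with the lowest, which matches the intuition that well-connected trading partners matter most.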
Transaction in Same Conference-Different Division
The sum of transactions conducted with a team in the same conference, but a different division. An overweight number of these transactions corresponds to a “Regional” strategy.
Transaction in Same Conference-Same Division
The sum of transactions conducted with a team in the same conference and the same division. An overweight number of these transactions corresponds to a “Local” strategy.
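The three transaction categories (different conference, same conference but different division, same division) can be sketched as a simple lookup. The `TEAM_INFO` table below covers only a handful of teams for illustration, and the function name and the partner list are hypothetical.

```python
from collections import Counter

# Illustrative subset of teams: team -> (conference, division)
TEAM_INFO = {
    "DET": ("West", "Central"),
    "CHI": ("West", "Central"),
    "VAN": ("West", "Northwest"),
    "TOR": ("East", "Northeast"),
}


def geography(team, partner, info=TEAM_INFO):
    """Classify a transaction as "Local", "Regional", or "National"."""
    conf_a, div_a = info[team]
    conf_b, div_b = info[partner]
    if conf_a != conf_b:
        return "National"   # different conference
    if div_a != div_b:
        return "Regional"   # same conference, different division
    return "Local"          # same conference, same division


# Hypothetical list of Detroit's trading partners in one season,
# one entry per transaction regardless of how many players moved.
partners = ["CHI", "VAN", "TOR", "TOR"]
counts = Counter(geography("DET", p) for p in partners)
```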
One of twelve possible strategies, each of which is a combination of a preference for buying or selling and a preference for transacting nationally, regionally, or locally. No distinct preference results in a classification of “Unbiased” or “No Preference”. A preference for buying or selling was assigned to teams whose total number of players in or out fell outside the inclusive range of [-1, 1]. Geographic preference was assigned by first calculating the standard deviation of the three transaction counts for the possible geographic locations (local, regional, national). If the standard deviation was less than 1.5, the geographic preference was determined to be “Unbiased”. If the standard deviation was greater than 1.5, the geographic preference was determined to be biased toward the location with the most trades.
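Putting the two rules together, a strategy label could be sketched roughly like this. A few things here are assumptions rather than details from the project: the use of the population standard deviation (`statistics.pstdev`), treating a value of exactly 1.5 as unbiased, breaking ties toward the first location listed, and all of the function names.

```python
from statistics import pstdev


def trade_preference(net_players):
    """Buyer/seller rule: values inside the inclusive range [-1, 1] are balanced."""
    if net_players < -1:
        return "Seller"
    if net_players > 1:
        return "Buyer"
    return "Balanced"


def geographic_preference(local, regional, national, threshold=1.5):
    """Biased toward the busiest location if the counts are spread out enough."""
    counts = {"Local": local, "Regional": regional, "National": national}
    # Assumption: population standard deviation; exactly 1.5 counts as unbiased.
    if pstdev(list(counts.values())) <= threshold:
        return "Unbiased"
    # Assumption: ties go to the first location in the dict.
    return max(counts, key=counts.get)


def strategy(net_players, local, regional, national):
    """Combine both preferences into one of the twelve possible strategies."""
    return (
        trade_preference(net_players),
        geographic_preference(local, regional, national),
    )


# Hypothetical team-season: net_players of -3, with 1 local,
# 3 regional, and 8 national transactions.
example = strategy(-3, local=1, regional=3, national=8)
```

Three buy/sell outcomes times four geographic outcomes gives the twelve combinations. If a sample standard deviation were used instead, some borderline team-seasons could flip between unbiased and biased, so that choice matters more than it might seem.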
This study did not include any teams that had a locally biased strategy. A national bias was the most common, while there was a slight preference for selling, as shown in Figure 10 above. Radar charts for four of the possible strategies are below in Figure 11.
In Part 3, I’ll go over the initial results as well as outline some additional steps that can be taken to develop the idea.
-David Van Anda