PageRank for Ranking College Basketball
Something I have been working on lately, after obtaining a large database of NCAA college basketball games, has been ranking teams algorithmically. One such ranking method that I recently implemented is PageRank, Google's method for ranking webpages.
PageRank
The PageRank algorithm itself is covered in Google's White Paper on the subject. I, however, learned much more from this article by Michael Nielsen. It covers the math in a way that even I can understand, and provides a concise, ~6 line example in python using matrices. This will serve as the basis for my Ruby implementation of ranking College Basketball teams.
The Data
For this application, I am using 141,523 College Basketball Games from 1981 forward, stored in a MySQL database locally. These games have the following data:
| id | int(11) |
| date | datetime |
| team1 | int(11) |
| team1score | int(11) |
| team2 | int(11) |
| team2score | int(11) |
| team1home | int(1) |
| team1inconference | int(1) |
| year | varchar(255) |
Note:
team1 and team2 are Foreign Keyed to another table of teams
I have a separate 'year' column in order to make year-by-year querying easier
I realize team1 is unnecessary in front of 'in conference', but I don't really feel like changing it
Gems
Since I'm doing this in Ruby, there are some (2) gems necessary to get up and running. The first is the 'mysql gem', a simple way to get my data out of the database and into code. This can be installed simply with:
gem install mysql
Next is the wonderful Narray Library, which handles arrays, matrices, and vectors with a C backend, allowing for not native Ruby speeds (read: fast). For this to work on my MacBook, I had to install version 0.5.9p7:
wget http://rubyforge.org/frs/download.php/61995/narray-0.5.9p7.tar.gz tar -xvzf narray-0.5.9p7.tar.gz cd narray-0.5.9p7 ruby extconf.rb make sudo make install
Now let us write some code...
Implementation
Require
First, we simply pull in our libraries so that we can actually do stuff
require 'rubygems' require 'mysql' require 'narray'
Initialization
Next up, we initialize variables needed for the script. First up in this process is the database class, used to query MySQL. If you want to actually use this, you will have to put in your own host, username, password, and database. Alternatively, you can just build up an array of hashes and calculate PageRank on this. Next, s & t are used to decide whether to follow a link or to pick a random page. This is described in more detail in Michael's post. After this, we use MySQL to get the number of teams in order to create our matrices. Using this value, we create a (square) identity matrix the with a dimension of the number of teams. Finally, we create the probability distribution for the random jump, setting each to an equal value, 1/(the number of teams).
dbh = Mysql.real_connect("host", "username", "password", "database_name") s = 0.85 t = 1.0-s query = "SELECT COUNT(*) as count FROM Teams" res = dbh.query(query) numteams = res.fetch_hash["count"].to_i ident = NMatrix.float(numteams, numteams) ident.diagonal!(1.0) p = NMatrix.float(1, numteams).fill!(1.0) p = (1.0/numteams.to_f) * p
Getting Games into a Matrix
The next step is to pull the game data out of a database and into a matrix. For this example, I am only doing games in the 2009-2010 year. I just loop through each of these games and increment the count in a matrix depending on which team won.
rawgames = NMatrix.int(numteams, numteams) query = "SELECT * FROM Games WHERE year='2009-2010' ORDER BY date" res = dbh.query(query) res.each_hash do |row| team1 = row["team1"].to_i team2 = row["team2"].to_i if (team2 != 0 && team1 != 0) team1 -= 1 team2 -= 1 if (row["team1score"].to_i > row["team2score"].to_i) rawgames[team1, team2] += 1 else rawgames[team2, team1] += 1 end end end
Making a Probability Matrix
Now that I know of all of the wins/losses for a team, I can create the probability matrix for the chance of one team "linking" to another. In this case, a loss counts as a link, so the highest ranked teams are those with the most links from other high ranked teams. To do this, I just have to get the total of all of the losses of a team, and then divide the losses to each other team by this number.
prob = NMatrix.float(numteams, numteams) 0.upto(numteams - 1) do |i| sum = 0.0 0.upto(numteams - 1) do |j| sum += rawgames[j, i].to_f end 0.upto(numteams - 1) do |j| val = rawgames[j,i].to_f / sum prob[i, j] = val.nan? ? 0 : val end end
Calculating
With all of the information collected so far, we can calculate the PageRank for each team, allowing us to rank each team. You can see this calculation below, in the first line. The code that follows simply inserts the rank data into the database, allowing me to access and query it at my will.
fin = (t*((ident - (s*prob)).inverse)*p) 0.upto(numteams - 1) do |i| val = i + 1 year = '2009-2010' rank = fin[0,i] dbh.query("INSERT INTO page (teamid, year, rank) VALUES (#{val},'#{year}', #{rank})") end
All Together Now
Here's the final code, all put together and available for your usage. (MIT License)
Problems
I have 341 teams... with too many, memory could become an issue. For this case, I would look into a MapReduce method
If you're having trouble understanding this code, once again I highly reccommend this blog post
I'm not going to give you my game data, that you'll have to find for yourself
Results
Top 25 according to ESPN:
- Kansas
- Kentucky
- Duke
- Syracuse
- Ohio State
- West Virginia
- Kansas State
- New Mexico
- Villanova
- Purdue
- Butler
- Temple
- Michigan State
- Georgetown
- Tennessee
- Wisconsin
- Brigham Young
- Pittsburgh
- Baylor
- Maryland
- Vanderbilt
- Gonzaga
- Texas A&M
- Richmond
- Xavier
My Top 25:
- Duke
- West Virginia
- Georgetown
- Tennessee
- Syracuse
- Kentucky
- Kansas
- Michigan State
- Purdue
- Butler
- Pittsburgh
- Wisconsin
- Georgia Tech
- Villanova
- Kansas State
- Notre Dame
- Northern Iowa
- Maryland
- Louisville
- Ohio State
- Oklahoma State
- Minnesota
- Baylor
- Vanderbilt
- NC State
Note: My data includes March Madness games, where ESPN rankings are before the tournament.
blog comments powered by Disqus