PageRank for Ranking College Basketball

Something I have been working on lately, after obtaining a large database of NCAA college basketball games, has been ranking teams algorithmically. One such ranking method that I recently implemented is PageRank, Google's method for ranking webpages.

PageRank

The PageRank algorithm itself is covered in Google's White Paper on the subject. I, however, learned much more from this article by Michael Nielsen. It covers the math in a way that even I can understand, and provides a concise, ~6 line example in python using matrices. This will serve as the basis for my Ruby implementation of ranking College Basketball teams.

The Data

For this application, I am using 141,523 College Basketball Games from 1981 forward, stored in a MySQL database locally. These games have the following data:

| id                | int(11)      |

| date              | datetime     |

| team1             | int(11)      |

| team1score        | int(11)      |

| team2             | int(11)      |

| team2score        | int(11)      |

| team1home         | int(1)       |

| team1inconference | int(1)       |

| year              | varchar(255) |

Note:

  • team1 and team2 are Foreign Keyed to another table of teams

  • I have a separate 'year' column in order to make year-by-year querying easier

  • I realize team1 is unnecessary in front of 'in conference', but I don't really feel like changing it

Gems

Since I'm doing this in Ruby, there are some (2) gems necessary to get up and running. The first is the 'mysql gem', a simple way to get my data out of the database and into code. This can be installed simply with:

gem install mysql

Next is the wonderful Narray Library, which handles arrays, matrices, and vectors with a C backend, allowing for not native Ruby speeds (read: fast). For this to work on my MacBook, I had to install version 0.5.9p7:

wget http://rubyforge.org/frs/download.php/61995/narray-0.5.9p7.tar.gz

tar -xvzf narray-0.5.9p7.tar.gz

cd narray-0.5.9p7

ruby extconf.rb

make

sudo make install

Now let us write some code...

Implementation

Require

First, we simply pull in our libraries so that we can actually do stuff

require 'rubygems'

require 'mysql'

require 'narray'

Initialization

Next up, we initialize variables needed for the script. First up in this process is the database class, used to query MySQL. If you want to actually use this, you will have to put in your own host, username, password, and database. Alternatively, you can just build up an array of hashes and calculate PageRank on this. Next, s & t are used to decide whether to follow a link or to pick a random page. This is described in more detail in Michael's post. After this, we use MySQL to get the number of teams in order to create our matrices. Using this value, we create a (square) identity matrix the with a dimension of the number of teams. Finally, we create the probability distribution for the random jump, setting each to an equal value, 1/(the number of teams).

dbh = Mysql.real_connect("host", "username", "password", "database_name")

s = 0.85
t = 1.0-s

query = "SELECT COUNT(*) as count FROM Teams"
res = dbh.query(query)
numteams = res.fetch_hash["count"].to_i

ident = NMatrix.float(numteams, numteams)
ident.diagonal!(1.0)

p = NMatrix.float(1, numteams).fill!(1.0)
p = (1.0/numteams.to_f) * p

Getting Games into a Matrix

The next step is to pull the game data out of a database and into a matrix. For this example, I am only doing games in the 2009-2010 year. I just loop through each of these games and increment the count in a matrix depending on which team won.

rawgames = NMatrix.int(numteams, numteams)

query = "SELECT * FROM Games WHERE year='2009-2010' ORDER BY date"
res = dbh.query(query)
res.each_hash do |row|
  team1 = row["team1"].to_i
  team2 = row["team2"].to_i
  if (team2 != 0 && team1 != 0)
    team1 -= 1
    team2 -= 1
    if (row["team1score"].to_i > row["team2score"].to_i)
      rawgames[team1, team2] += 1
    else
      rawgames[team2, team1] += 1
    end
  end
end

Making a Probability Matrix

Now that I know of all of the wins/losses for a team, I can create the probability matrix for the chance of one team "linking" to another. In this case, a loss counts as a link, so the highest ranked teams are those with the most links from other high ranked teams. To do this, I just have to get the total of all of the losses of a team, and then divide the losses to each other team by this number.

prob = NMatrix.float(numteams, numteams)

0.upto(numteams - 1) do |i|
  sum = 0.0
  0.upto(numteams - 1) do |j|
    sum += rawgames[j, i].to_f
  end

  0.upto(numteams - 1) do |j|
    val = rawgames[j,i].to_f / sum
    prob[i, j] = val.nan? ? 0 : val
  end
end

Calculating

With all of the information collected so far, we can calculate the PageRank for each team, allowing us to rank each team. You can see this calculation below, in the first line. The code that follows simply inserts the rank data into the database, allowing me to access and query it at my will.

fin = (t*((ident - (s*prob)).inverse)*p)

0.upto(numteams - 1) do |i|
  val = i + 1
  year = '2009-2010'
  rank = fin[0,i]
  dbh.query("INSERT INTO page (teamid, year, rank) VALUES (#{val},'#{year}', #{rank})")
end

All Together Now

Here's the final code, all put together and available for your usage. (MIT License)

Problems

  • I have 341 teams... with too many, memory could become an issue. For this case, I would look into a MapReduce method

  • If you're having trouble understanding this code, once again I highly reccommend this blog post

  • I'm not going to give you my game data, that you'll have to find for yourself

Results

Top 25 according to ESPN:

  1. Kansas
  2. Kentucky
  3. Duke
  4. Syracuse
  5. Ohio State
  6. West Virginia
  7. Kansas State
  8. New Mexico
  9. Villanova
  10. Purdue
  11. Butler
  12. Temple
  13. Michigan State
  14. Georgetown
  15. Tennessee
  16. Wisconsin
  17. Brigham Young
  18. Pittsburgh
  19. Baylor
  20. Maryland
  21. Vanderbilt
  22. Gonzaga
  23. Texas A&M
  24. Richmond
  25. Xavier

My Top 25:

  1. Duke
  2. West Virginia
  3. Georgetown
  4. Tennessee
  5. Syracuse
  6. Kentucky
  7. Kansas
  8. Michigan State
  9. Purdue
  10. Butler
  11. Pittsburgh
  12. Wisconsin
  13. Georgia Tech
  14. Villanova
  15. Kansas State
  16. Notre Dame
  17. Northern Iowa
  18. Maryland
  19. Louisville
  20. Ohio State
  21. Oklahoma State
  22. Minnesota
  23. Baylor
  24. Vanderbilt
  25. NC State

Note: My data includes March Madness games, where ESPN rankings are before the tournament.

blog comments powered by Disqus