Cracking the BCS Egg

Summary: The author has proven that one of the BCS's computer systems would have given bizarre results for 1997, the last season before the BCS system was implemented. No season since then has produced extremely controversial computer ratings, but it is likely to happen again. The author thus challenges the NCAA and its BCS arm to clarify its knowledge of computer ratings. Changes to the system are proposed.

Article:

College football's BCS system draws loads of criticism every year, as pundits and millions of fans speculate incessantly about possible disaster scenarios. What has amazed me is how insulated the BCS's decision-makers have remained through it all. The NCAA and university officials behind the BCS seem to have adopted a position of silence, preferring to let results do the talking. By the end of the season, the BCS's rankings usually settle into a form that is reasonably acceptable to most, and so no major revolt has yet been incited.

My goal with this article is to stir the pot by demonstrating that the computer ranking systems of the BCS can produce farcical rankings, the kind that would have the college football world up in arms if they were to arise in a future season.

I will cover only the computer rankings aspect of the BCS system. This is a field in which I have a modicum of expertise. I have experimented with such things since 1993. Approximately one hundred mathematical ranking systems have appeared on the Internet, so there are plenty of dabblers in the field. As one of those dabblers, my specialty has been learning to devise numerous varieties of systems in order to understand their differences, flaws, and limitations. I have programmed approximately a dozen different models, and with several of them I have spent long hours tweaking parameters to see what happens to rankings for many past seasons.

One of my recent projects was a program to emulate the Colley Matrix rankings, one of the BCS selectors. I found something rather interesting about the Colley rankings, but before getting to that, I want to give an overview of the BCS's problems, as I see them.

The biggest problem with the BCS is that little explication of its actions has ever been provided. Since we're talking about computer ratings, the Internet would seem a likely source of good information on the matter. However, the BCS overseers have never had a web site that would inspire confidence in it as a group devoted to a serious understanding of college football ratings. Their current web site (http://www.bcsfootball.org/bcsfootball/) leads with a logo that includes the FOX network's logo. The banner reads "Bowl Championship Series in association with foxsports.com on msn." A prominent feature is a section of ads for DVDs (only $24.95!). Other than that, there are some football news headlines, and conspicuous displays of the corporatized logos of the BCS bowl games. A small blurb states, "The BCS was implemented beginning with the 1998 season to determine the national champion for college football while maintaining and enhancing the bowl system," and not much more of substance.

If one searches (as of this writing) the term "BCS" on the NCAA's web site (www.ncaa.org), the articles fetched are almost all dated from 2003 and earlier. Not a one takes us to anything talking about the mechanics of the system, who is in charge, or why we should trust them.

This dearth of information has been the rule over the lifetime of the BCS. This should not be surprising, though. The NCAA men's basketball tournament committee has long operated on a closed-door basis. We know theoretically who is in charge, but why we should trust their actions will always remain a mystery. Unfortunately, there is one big difference between football and basketball. The basketball committee is selecting 65 schools for a tournament. All the best teams are going to be selected, and any that are left out unfairly are really only marginally qualified. Few are going to be very outraged by oversights on marginal teams. Meanwhile, the BCS system selects only ten teams for the top bowl games, and only two teams for the ostensible "national championship" game. The stakes are much higher for football's BCS.

Thankfully, the mathematics of the BCS system are at least partially disclosed. We know in advance which ranking systems will be involved and how the rankings will be weighted (along with the opinion polls) to establish the overall BCS rankings. What we do not know is exactly how each computer system works. To my knowledge, only one of the systems, that of Dr. Wesley Colley, may be reproduced based on an outline of the theory provided by its designer. The web sites of the other five systems give either no descriptions of their methods, or not enough to reproduce them with confidence. All we know for sure is that the NCAA forbids the use of margins of victory in calculating computer ratings for use by the BCS. Whether a team won or lost is the only data to be considered by any computer selector.

Now, this rule against using margins to rank teams is quite a bone of contention. The actions of pollsters on just this point were questioned long before the inception of the BCS. If it appeared that a team lost ground in a poll because of a lackluster win, there was usually an outcry of fans and pundits claiming that "a win is a win." A classic case came in 1994, when Penn State beat Indiana by only six points and lost ground to Nebraska in the polls (Penn State was favored by 26 in that game).

There is certainly some merit to the "a win is a win" philosophy. Personally, I did not think any less of Penn State after that win (and my power ratings list Penn State as #1 for 1994). On the other hand, it is logical to argue that if Team A beats Team C by 50, and Team B beats Team C by 1, then Team A is likely the better team and should be favored if they meet Team B.

Certainly two results are not sufficient for an accurate comparison of teams. However, many would argue that not even twelve results are enough to accurately rate teams. If the information contained in scores is disallowed, then so little comparative data remains that it's hard to trust power ratings based solely on who beat whom.

Power ratings are highly problematic when they are built from a small sample of games with margins of victory eliminated. I will support this assertion by examining the Colley system. I have coded it based on the outline provided on Colley's web site. Colley publishes ratings back to the 1998 season, and I have verified that my program exactly duplicates his ratings for 1998 through 2007. The Colley ratings look generally acceptable for all those seasons, at least for the top few teams. I can find plenty of odd rankings for middle-of-the-pack teams. However, people generally care only about the rankings of the top few teams, so we will examine only top tens.
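For readers who want to experiment, here is a minimal Python sketch of the Colley method as it is generally described: each team's rating r satisfies (2 + games played) * r - (sum of its opponents' ratings) = 1 + (wins - losses) / 2, and all of the equations are solved together. This is not the program used for this article. The function name, the input format, and the simple repeated-update solver are assumptions made for illustration.

```python
def colley_ratings(games, iterations=500):
    """Colley-style ratings from (winner, loser) pairs; scores are ignored."""
    wins, losses, opponents = {}, {}, {}
    for winner, loser in games:
        for team in (winner, loser):
            wins.setdefault(team, 0)
            losses.setdefault(team, 0)
            opponents.setdefault(team, [])
        wins[winner] += 1
        losses[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    # Every team starts at 0.5; each pass re-solves one team's equation
    # using the opponents' ratings from the previous pass.
    ratings = {team: 0.5 for team in wins}
    for _ in range(iterations):
        updated = {}
        for team in ratings:
            n_games = wins[team] + losses[team]
            opp_sum = sum(ratings[opp] for opp in opponents[team])
            record_term = 1 + (wins[team] - losses[team]) / 2
            updated[team] = (record_term + opp_sum) / (2 + n_games)
        ratings = updated
    return ratings


# Hypothetical three-team round robin: A beats B and C, B beats C.
demo = colley_ratings([("A", "B"), ("B", "C"), ("A", "C")])
print({team: round(value, 3) for team, value in demo.items()})
# {'A': 0.7, 'B': 0.5, 'C': 0.3}
```

Note that only wins, losses, and the schedule enter the calculation. A 42-17 rout and a one-point escape are indistinguishable to this kind of system.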

I ran Colley's system on some seasons prior to 1998. It did not take long to find an objectionable ranking, as Colley's #1 team for 1997 was Tennessee. As most fans well remember, Tennessee lost two games that year, including a 42-17 whipping at the hands of the team most systems (that considered margin of victory) rated at #1: Nebraska.

Colley's top ten teams before and after the 1997 bowl season (as calculated by my Colley Matrix emulation) are as follows:

    End of the regular season Colley ratings:
    
    team                     w  l       power
    -----------------------------------------
  1 Tennessee               11  1    1.008562
  2 Michigan                11  0    0.970567
  3 Nebraska                12  0    0.937788
  4 Florida                  9  2    0.916233
  5 Florida State           10  1    0.915298
  6 Auburn                   9  3    0.892810
  7 Georgia                  9  2    0.870004
  8 Washington State        10  1    0.865106
  9 Kansas State            10  1    0.847633
 10 North Carolina          10  1    0.846583

    Post-bowl Colley ratings:

    team                     w  l       power
    -----------------------------------------
  1 Tennessee               11  2    0.989129
  2 Michigan                12  0    0.982527
  3 Nebraska                13  0    0.974416
  4 Florida                 10  2    0.952213
  5 Florida State           11  1    0.949675
  6 Auburn                  10  3    0.921386
  7 Georgia                 10  2    0.899771
  8 UCLA                    10  2    0.878634
  9 Kansas State            11  1    0.864743
 10 North Carolina          11  1    0.864365

Considering only who beat whom, it was almost inevitable that a Southeastern Conference team would top the 1997 Colley rankings. The SEC had an incredible 32-4 record against non-SEC teams during the regular season. The next best non-conference record was owned by the Pac-10, going 23-7. It is then no surprise that ratings based on wins alone would put the team with the most wins within the SEC on top. However, ranking Tennessee #1 for 1997 makes no sense. Nebraska and Michigan were undefeated, and Florida and Nebraska beat Tennessee by 13 and 25 points, respectively. Most would consider a #1 rating for Tennessee laughable.

The web site of the Anderson-Hester system (http://www.andersonsports.com/football/ACF_sugr.html), another BCS selector, specifically mentions 1997 in touting the success of the BCS. Anderson and Hester state that since the inception of the BCS, it has never failed to set up "true national championship games" whenever possible. By "true" championship games they are referring to games pitting undefeated teams against each other. In other words, when two top teams have gone undefeated prior to the bowls, the BCS has always matched them, unlike the scenarios before the BCS era, which often saw the only two undefeated teams go to different bowl games. The last time that happened was 1997, when Michigan went to the Rose Bowl and Nebraska went to the Orange Bowl.

Anderson and Hester are right that whenever it was possible to match undefeated teams, the BCS has done so. However, some undefeated teams have been left out of the championship game (Tulane in 1998, Auburn and Utah in 2004, Boise State in 2006, and Hawaii in 2007).

More importantly, here we have an example of BCS selectors talking up the BCS's strengths by pointing to the 1997 season. Yet we do not know which schools the BCS would have selected for the championship game in 1997. According to my version of Colley's program, Colley would have selected Tennessee and Michigan, not Nebraska and Michigan.

I suspect the BCS could have been in for a huge embarrassment if it had been in place in 1997. There is a good chance that Tennessee, Florida, and/or Florida State would have been backed more strongly by the computer ratings than Nebraska and/or Michigan. Imagine if those teams had blocked both Nebraska and Michigan from the championship game. Is the BCS prepared for something like this to happen in the future?


August 14, 2010 revision: What follows will soon be separated into its own essay. What you have read up to this point is all that needed to be said here. The BCS organizers clearly did not do much research on power ratings when they started doing their thing. It has been proven. If you are keen on going even deeper, continue reading!


So, why not allow margins of victory to go into ratings, to ensure they are as "accurate" as possible? The primary argument is that if larger margins of victory helped teams in the ratings, then some teams might be tempted to pile on meaningless points in lopsided wins in an attempt to inflate their ratings, thus improving their odds of landing in a BCS bowl. Thus, ignoring scores is an effort to avoid encouraging unsportsmanlike conduct.

However, there are real-world controls that inhibit running up the score. Running up the score generally involves leaving first-string players in the game long after it is out of reach. Any coach who does that to excess is not wise, as the risk of injury to their best players should outweigh any desire to run up a score. Even one needless injury could jeopardize a future victory against a competitive rival.

Another natural incentive to not run up scores is that any coach that does so is likely to be paid back in kind in a future season when their team is not so good. Certainly scores have been run up intentionally, but I would argue that it is a rare event, and few coaches really want to do it when the opportunity arises.

On the theoretical side, people who have designed power ratings that use scores have long known it is not wise to use a raw score when there is an extreme blow-out. Taking an example from history, Florida beat Central Michigan 82-6 in 1997. A margin of victory greater than 70 points is exceedingly rare in games between Division-1A teams. Because of this rarity, no serious prognosticator ever predicts a 76-point margin, even when the best team is playing the worst team. The best teams probably could beat the worst teams by 76 points nearly every time, but since coaches have more important priorities than huge margins, it simply does not happen often. When teams get ahead by 35 or 42 points, their senior players will be rested, and thereafter no one can predict whether the winning team's reserves will be energetic or unmotivated.

Since most designers of ranking systems want their ratings to be accurate, a score like 82-6 simply will not be treated as 82-6. There are any number of ways to adjust blow-out scores to something more realistic. My goal is not to cover statistical mathematics, but just as an example, a system designer might look at the fact that Las Vegas rarely sets point spreads greater than 50 points, and thus might adjust an 82-6 score to 56-6 in order to produce more accurate ratings. If such a step is not taken, then for subsequent games the power ratings will produce unrealistic predictions for the two teams involved in the blow-out.
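The adjustment in that example can be sketched in a few lines of Python. This is only one of the many possible ways to dampen a blow-out, and the function name and the hard 50-point cap are assumptions taken from the example above, not a description of any published system.

```python
def cap_margin(winner_points, loser_points, max_margin=50):
    """Trim the winner's points so the margin never exceeds max_margin."""
    margin = winner_points - loser_points
    if margin > max_margin:
        winner_points = loser_points + max_margin
    return winner_points, loser_points


print(cap_margin(82, 6))    # (56, 6)
print(cap_margin(28, 24))   # (28, 24)
```

A hard cap is the bluntest option. A designer could just as well shrink large margins gradually, which avoids treating a 50-point win and a 76-point win as exactly identical.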

As a matter of fact, if Florida had played Central Michigan a second time in 1997, we can be certain that the point spread would not have been 76. It would have probably been something more like 50 (the exact number would have depended on where the game was to be played). It only takes a little experience monitoring predictions and point spreads to confirm this statement. Thus, for any point-based rating system to be taken seriously, the designer must take precaution against overvaluing extreme results.

To illustrate the need for dampening blow-out results, we can look at a simple rating system that treats all results as equally important, and makes no adjustments to outcomes other than for home-field advantage. I call the method "networked transitive comparison" (NTC) because it amounts to linking all teams mathematically with the theory that if Team A beats Team B by x points, and Team B beats Team C by y points, then Team A is x+y points better than Team C. Transitive comparisons are famous and fun because extremely absurd chains of logic can often be found where some obviously bad team is shown to be superior to several of the elite teams of the season. For example, a transitive chain of 31 links can be found for the 2007 season that shows Division III Kenyon was 377 points better than Kansas (the Division-1A team with the best won-lost record of the year). Regardless of such amusements, when all such transitive comparisons are made and averaged out, the result is numerical power ratings that will place most teams roughly where they belong relative to others (so long as teams have played more than a few games each). Kenyon does not come out ranked above Kansas when a full analysis of all scores is performed.

An NTC-style rating can be calculated in several ways, and the procedure has probably been given many different names. More detail on it would be appropriate for its own article. However, the details may be ignored if we simply look at some NTC ratings from the past and note that of the dozens of people publishing power ratings in recent years, many similar ratings could be found.
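As one illustration of those several ways, the Python sketch below repeatedly sets each team's rating to the average of "opponent's rating plus adjusted margin" over all of its games until the numbers settle. It is not the program used for this article. The function name, the input format, the damped update, and the treatment of every game as having a home team and a road team (no neutral sites) are all assumptions made for the sake of a short example. The road team is credited with three points, the home-field adjustment this article uses.

```python
def ntc_ratings(games, home_edge=3.0, iterations=2000):
    """games: (home_team, home_points, road_team, road_points) tuples."""
    # Each game gives both teams one comparison: (opponent, adjusted margin).
    comparisons = {}
    for home, home_pts, road, road_pts in games:
        margin = home_pts - (road_pts + home_edge)  # road team gets home_edge
        comparisons.setdefault(home, []).append((road, margin))
        comparisons.setdefault(road, []).append((home, -margin))

    ratings = {team: 0.0 for team in comparisons}
    for _ in range(iterations):
        updated = {}
        for team, results in comparisons.items():
            implied = sum(ratings[opp] + m for opp, m in results) / len(results)
            # Move halfway toward the implied value so the updates settle down.
            updated[team] = 0.5 * ratings[team] + 0.5 * implied
        # Re-center so the average rating stays at zero.
        mean = sum(updated.values()) / len(updated)
        ratings = {team: value - mean for team, value in updated.items()}
    return ratings


# Hypothetical results: A beats B 27-10 at home, B beats C 20-10 at home.
demo = ntc_ratings([("A", 27, "B", 10), ("B", 20, "C", 10)])
print(round(demo["A"] - demo["C"], 2))   # 21.0
```

In this tiny example A is 14 points better than B after the home-field adjustment, B is 7 points better than C, and so A comes out 14 + 7 = 21 points better than C without the two ever having met. With a full season of results the chains disagree with each other, and the ratings end up as a compromise among them.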

An interesting case is the 2002 season. Below are the power ratings as calculated by my NTC program. This program adds three points to the scores of all road teams.

    team                     W  L      power
    ----------------------------------------
  1 Kansas State            11  2      33.92
  2 Southern Cal            11  2      31.12
  3 Oklahoma                12  2      27.41
  4 Miami (Florida)         12  1      24.75
  5 Texas                   11  2      22.90
  6 Georgia                 13  1      21.16
  7 Penn State               9  4      19.98
  8 Iowa                    11  2      19.80
  9 Ohio State              14  0      19.25
 10 Alabama                 10  3      18.64

Ohio State, the only undefeated Division-1A team, was the consensus National Champion of 2002. However, Ohio State did not win by convincing margins in half of their games. On the other hand, Kansas State lost two close games, and won several games by extreme margins (68-0, 64-0, and 58-7, to name a few). Southern California and Oklahoma followed suit, but with not quite as extreme blow-outs.

It is not easy to document now, but among the rating systems that were being published on the Web that year, a large fraction did not place Ohio State at #1. Jeff Sagarin's "Predictor" method had Kansas State #1, and Ohio State #8, nearly matching the NTC rankings. Sagarin is a BCS selector, although his Predictor system is not a part of it. Sagarin's Elo chess system is used by the BCS, and it ranked Ohio State #1. (Note, if one looks up the 1997 Sagarin ratings on the Internet, the rankings found will not be those of Sagarin's Elo chess system, as he only started publishing those rankings after the BCS's prohibition on using scores for BCS rankings.)

Clearly, the inclusion or exclusion of scores in the process of calculating ratings makes a drastic difference. What is interesting is that the impact is not always what the BCS would hope to see. The 2002 Ohio State team benefits from the exclusion of scores, but the 1997 Nebraska and Michigan teams suffer from it. The best methods are actually somewhere in the middle - scores help improve rating accuracy, but they should not be relied upon too heavily or taken too literally.

What's more, while using scores to make computer ratings does generally make for more appealing rankings (in my opinion), scores-based rankings could still leave undefeated powers out in the cold in the BCS. The system I have put the most time into (and that makes the ratings I post on my web site) ranks one-loss Florida State above Michigan for 1997. Some fans would find that outrageous. To respond to those fans we would have to go into another long-standing debate about ranking methodology. That is, do undefeated teams always deserve to be ranked above teams with losses?

It depends on your definition of "deserve." With my ratings I am more interested in a basis for prediction. My ranking of Florida State above Michigan in 1997 is only a guess that Florida State would have been slightly favored if the teams had met. Being undefeated does not mean a team is guaranteed to be favored in upcoming games! Undefeated teams are quite commonly underdogs, even late in the season.

This is the heart of the matter. What is the purpose of computer rankings? The BCS has a purpose of setting up a true national championship game (or so we assume). With such an important purpose, should they not make certain they use the best ranking models? I, for one, am not convinced they made much of an effort to investigate the quality of their models.

No one would take seriously a bowl-selection system that produced output like what we have seen for 1997 or 2002. However, the NTC method is also capable of producing perfectly reasonable ratings that would not be controversial. The same could happen with any numerical ranking method. Ratings might appear reasonable for several seasons, but eventually there will probably be a year where the particular jumble of scores just does not allow for ratings that most fans would find realistic.

Once we appreciate this, an obvious conclusion is that evaluating a ranking system has to involve looking at rankings for a large number of seasons. Systems that come up with the smallest number of plainly objectionable rankings (such as Ohio State far removed from #1 in 2002) are the types of systems that should be employed by the BCS. I do not see how anyone could object to this common-sense notion, and yet the BCS organizers have never supplied historical ratings from their chosen systems, nor have they said whether or not they examined many seasons of past ratings in choosing their systems, nor have they come out with any statements whatsoever on how they decided which systems were of high enough quality to merit inclusion.

I have no desire to impugn the work of the people behind the various rating systems of the BCS. I am sure they do as well as can be done without consideration of scores. Rather, I am simply pointing out that those managing the BCS system have not demonstrated a knowledge of computer ratings, and they have not supplied evidence that they are using the best systems available. Good evidence could be provided by simply publishing the ratings their systems would have produced going back a few decades (or even all the way back to 1869) for all interested parties to compare and discuss.

Until basic steps like that are taken, the BCS should expect nothing but sarcasm and skepticism from a fan community that rightfully views their ratings as a proverbial black box.

Any number of systems can be designed that realistically handle extreme results, and in general, the math involved is so complex that there is no way any coach could anticipate what game outcomes would most enhance their power ratings. On top of that, if several such systems were used by the BCS, there would be absolutely no way to predict the benefit to be had from any particular score, whether it be a blow-out or not. Coaches would not be given any new incentive for unsportsmanlike conduct if BCS computer systems considered scores.

However, rather than advocating the adoption of systems that incorporate scores, my recipe for getting rid of the annual BCS hype and consternation would be to just do away with their ratings scheme. The formula that combines the computer ratings and opinion polls produces an impression that science has an answer. If only enough numbers are thrown around then the system must be right! That's the illusion promulgated as they tweak their formula over time.

The truth is that ratings and predictions are a very tricky business. The average prediction error for computer systems and betting lines is around 12 points. Predictions that are off by 20 or 30 points are just about as common as predictions that are right on the nose. Given that, who can really believe that there is some absolute truth as to which two teams deserve to be in a National Championship game? Just as no prediction is a "lock," no ranking method is perfect, nor is any consortium of methods perfect.
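A rough back-of-the-envelope check of those figures is sketched below in Python. It rests on assumptions that are not part of the claim itself: that "average prediction error" means mean absolute error, that errors follow a normal distribution centered on zero, and that "right on the nose" means within three points. Change any of those and the numbers move.

```python
import math


def normal_cdf(x, sd):
    """Cumulative probability of a zero-mean normal distribution."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))


def prob_abs_error_between(low, high, sd):
    """Probability that the size of the error falls between low and high."""
    return 2.0 * (normal_cdf(high, sd) - normal_cdf(low, sd))


mean_abs_error = 12.0
# For a zero-mean normal, standard deviation = mean absolute error * sqrt(pi/2).
sd = mean_abs_error * math.sqrt(math.pi / 2.0)

near_miss = prob_abs_error_between(0.0, 3.0, sd)    # "on the nose" (assumed)
big_miss = prob_abs_error_between(20.0, 30.0, sd)   # off by 20 to 30 points

print(f"standard deviation: {sd:.1f} points")
print(f"within 3 points: {near_miss:.0%}, off by 20 to 30: {big_miss:.0%}")
```

Under these assumptions both outcomes land in the neighborhood of one game in seven, which is consistent with the observation that big misses are about as common as near-perfect predictions.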

The NCAA/BCS should study the historical rankings of several systems to find out which are the most reliable over the long term. Then, rather than dogmatically believing in a formula that combines different rankings, the best systems (whether or not they utilize scores) should be used as baseline references, not as commandments set in stone.

Unfortunately, using computer ratings as only a tool to aid human judgement leaves humans to make the final decisions. Then things are opened up for charges of bias, politics, and so forth to muddy the waters. In that sense, I do admire the spirit of the BCS rankings. For the appearance of fairness, what better to resort to than pure mathematics? However, as has been clearly demonstrated in this article and elsewhere, no mathematical system is perfectly reliable, and no two systems agree.

My suggestion for maximizing happiness would be that the BCS adopt a new philosophy. They should simply state that if two and only two "major conference" Division-1A teams go undefeated in a season, those two teams should play in the championship game. If the number of undefeated teams is anything other than two, then the top two teams in the BCS rankings should be chosen. (Note, such a scheme would require more qualifications and possibly new rules on non-conference scheduling practices, but this suggestion is not the focus of the article, so I will leave it at that.) And to lend more credibility to their rankings, the BCS should do a historical review of rankings to prove to the public that they have studied the issue.
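The core of that rule is simple enough to state as a few lines of Python. This is only a sketch of the suggestion as worded above. The function name and inputs are hypothetical, it assumes every undefeated team appears somewhere in the ranking, and it leaves out the extra qualifications that a real policy would need.

```python
def championship_pairing(bcs_ranking, undefeated_major):
    """bcs_ranking: teams ordered best first.
    undefeated_major: undefeated major-conference teams."""
    if len(undefeated_major) == 2:
        # Exactly two undefeated majors: they play, whatever their ranks.
        return tuple(sorted(undefeated_major, key=bcs_ranking.index))
    # Zero, one, or three or more undefeated majors: fall back to the ranking.
    return tuple(bcs_ranking[:2])


# Stand-in ranking: the pre-bowl 1997 Colley order from earlier in this article.
ranking_1997 = ["Tennessee", "Michigan", "Nebraska", "Florida", "Florida State"]
print(championship_pairing(ranking_1997, {"Michigan", "Nebraska"}))
# ('Michigan', 'Nebraska')
```

With the ranking alone, Tennessee and Michigan would have been paired. With the undefeated rule applied first, Michigan and Nebraska meet.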

Of course, another option to change things would be a play-off. However, the same problems would still exist in that some method for choosing and seeding the play-off field would be needed, and that process would likely be under an even brighter spotlight. The NCAA and its BCS group have been warned: If they are going to pretend to be scientists, they had better back up their theories with evidence.


Posted October 22, 2008
Copyright 2008. All rights reserved.
Jon Dokter

Note: Links cited in the article are not active to avoid the need to repair broken links in the future.


www.timetravelsports.com