Retroactive Pick Record

Hey folks, Tom here again. One important aspect of designing a system like Beatpaths is looking for confirmation that the system is indeed getting better as more information is fed into it. I’ve already been looking into one aspect of this by tracking the stability of the rankings that Beatpaths produces week-on-week. As more data become available to Beatpaths, the changes to team rankings generally become less and less dramatic (as indicated by the trendline).

However, increased stability in the rankings isn’t worth much if those rankings aren’t an accurate representation of each team’s performance relative to the field over the course of the regular season. So, the idea for this post is to take the final regular season rankings and look backwards. If the final rankings are better than the rankings of previous weeks, they should have a better pick record, because with the luxury of hindsight and more comprehensive information they can (in theory) more accurately rank teams relative to one another.

Below I have arrayed each team ranked as they are in the final regular season rankings, the pick record that this ranking gives us in retroactively ‘predicting’ the results of their regular season matchups, and the games that were incorrectly called.

Team | Retroactive Pick Record | Wins/Losses Inconsistent with Ranking
Indianapolis | 12-4 | Losses to: Chicago, Jacksonville, Green Bay, Tennessee
NY Jets | 9-7 | Losses to: New England, San Diego, Oakland, Denver, San Francisco, Seattle, Miami
Tennessee | 14-2 | Win over: Indianapolis; Loss to: Houston
Pittsburgh | 14-2 | Losses to: Philadelphia, NY Giants
Baltimore | 15-1 | Loss to: NY Giants
Minnesota | 12-4 | Losses to: Green Bay, Chicago, Tampa Bay, Atlanta
NY Giants | 11-5 | Wins over: Pittsburgh, Baltimore; Losses to: Cleveland, Philadelphia, Dallas
Carolina | 14-2 | Losses to: Tampa Bay, Atlanta
New England | 13-3 | Win over: NY Jets; Losses to: Miami, San Diego
Dallas | 11-5 | Win over: NY Giants; Losses to: Washington, Arizona, St. Louis, Philadelphia
Houston | 13-3 | Win over: Tennessee; Losses to: Jacksonville, Oakland
Cincinnati | 14-1-1 | Tied with: Philadelphia; Loss to: Cleveland
Washington | 13-3 | Win over: Dallas; Losses to: St. Louis, San Francisco
Philadelphia | 11-4-1 | Wins over: Pittsburgh, NY Giants, Dallas; Tied with: Cincinnati; Loss to: Chicago
Atlanta | 11-5 | Wins over: Carolina, Minnesota; Losses to: Tampa Bay, Denver, New Orleans
Tampa Bay | 9-7 | Wins over: Atlanta, Carolina, Minnesota; Losses to: New Orleans, Denver, San Diego, Oakland
Chicago | 12-4 | Wins over: Indianapolis, Philadelphia, Minnesota; Loss to: Green Bay
Arizona | 15-1 | Win over: Dallas
Miami | 14-2 | Wins over: New England, NY Jets
San Francisco | 12-4 | Wins over: NY Jets, Washington; Losses to: New Orleans, Seattle
Denver | 8-8 | Wins over: Tampa Bay, Atlanta, NY Jets; Losses to: Kansas City, Jacksonville, Oakland, Buffalo, San Diego
Buffalo | 14-2 | Win over: Denver; Loss to: Cleveland
New Orleans | 13-3 | Wins over: Tampa Bay, San Francisco, Atlanta
Cleveland | 13-3 | Wins over: Cincinnati, NY Giants, Buffalo
San Diego | 12-4 | Wins over: NY Jets, New England, Tampa Bay, Denver
Jacksonville | 13-3 | Wins over: Indianapolis, Houston, Denver
Oakland | 11-5 | Wins over: NY Jets, Denver, Houston, Tampa Bay; Loss to: Kansas City
Green Bay | 13-3 | Wins over: Minnesota, Indianapolis, Chicago
Kansas City | 14-2 | Wins over: Denver, Oakland
Seattle | 14-2 | Wins over: San Francisco, NY Jets
Detroit | 16-0 | (none)
St. Louis | 14-2 | Wins over: Washington, Dallas

Added up, the final season rankings have a retroactive pick record of 202-53-1 (78%), whereas the week-on-week picks have a record of 150-105-1 (58%). The final rankings achieve this greater accuracy despite an inherent handicap: whenever two teams split their season series, a retroactive pick is guaranteed to get one of the two games wrong. This means that the final rankings, with the privilege of comprehensive information and hindsight, are markedly better than the week-on-week picks, which lends validity to the Beatpaths model and its ability over time to find an accurate ranking for each team relative to the rest.
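For readers who want to reproduce this kind of tally, here is a minimal sketch of how a retroactive pick record could be computed. It is not the actual Beatpaths code: the `retroactive_record` function, the `(team_a, team_b, winner)` game format, and the three-team league are all assumptions made up for illustration. It simply picks the higher-ranked team in every game and counts ties separately, as the table above does.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical game format: (team_a, team_b, winner); winner is None for a tie.
Game = Tuple[str, str, Optional[str]]

def retroactive_record(ranking: Sequence[str], games: Iterable[Game]) -> Tuple[int, int, int]:
    """Return (correct, wrong, ties) when the higher-ranked team is always picked."""
    position = {team: i for i, team in enumerate(ranking)}  # 0 = top of the ranking
    correct = wrong = ties = 0
    for team_a, team_b, winner in games:
        if winner is None:
            ties += 1  # a tie is counted on its own, neither right nor wrong
            continue
        pick = team_a if position[team_a] < position[team_b] else team_b
        if pick == winner:
            correct += 1
        else:
            wrong += 1
    return correct, wrong, ties

# Made-up three-team league: A and B split their series, both sweep C.
games = [
    ("A", "B", "A"), ("A", "B", "B"),
    ("A", "C", "A"), ("A", "C", "A"),
    ("B", "C", "B"), ("B", "C", "B"),
]
print(retroactive_record(["A", "B", "C"], games))  # (5, 1, 0)
print(retroactive_record(["B", "A", "C"], games))  # (5, 1, 0)
```

Swapping A and B changes which game of the split is missed, but not the record, which is the season-split handicap in miniature.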

This does not mean that the Beatpaths method has produced the best possible ranking of each team by the end of the regular season. It is probably possible to adjust these rankings such that the picks are overall better—certainly a sorting algorithm could be written to produce a ranking that maximizes the number of correct retroactive picks. The most interesting questions then become: 1) how can we adjust the ranking method so that it more quickly approaches the correct ranking even with incomplete information; and, 2) can we gain any potential insights into “fluke” games or teams with inconsistent play that we can recognize and account for in a Beatpaths variant?
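As a rough illustration of what "a ranking that maximizes the number of correct retroactive picks" means, the sketch below brute-forces every ordering of a tiny, made-up four-team league. The function names and the `(winner, loser)` game format are assumptions for this example only. Checking every permutation is hopeless for 32 teams (32! is roughly 2.6 × 10^35 orderings), so treat this as a definition of the target rather than a practical method.

```python
from itertools import permutations

def correct_picks(ranking, games):
    """Count games in which the winner is ranked above the loser."""
    position = {team: i for i, team in enumerate(ranking)}
    return sum(1 for winner, loser in games if position[winner] < position[loser])

def best_rankings(teams, games):
    """Try every ordering and return (best score, all orderings that reach it)."""
    best_score, best = -1, []
    for ranking in permutations(teams):
        score = correct_picks(ranking, games)
        if score > best_score:
            best_score, best = score, [ranking]
        elif score == best_score:
            best.append(ranking)
    return best_score, best

# Made-up results as (winner, loser); C's win over A creates two beatloops.
games = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("D", "C")]
score, rankings = best_rankings("ABCD", games)
print(score, rankings)  # 5 [('A', 'B', 'D', 'C')]
```

In this toy case only one ordering reaches the maximum, but nothing guarantees a unique answer in general.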

Some initial insights can be gained from the three teams that Beatpaths does the worst with, even given the comprehensive regular season information: the NY Jets, Tampa Bay Buccaneers, and Denver Broncos.

The Jets cannot be lowered in the rankings below New England (the closest-ranked wrong pick) without worsening the pick record to 8-8 (their victories over Tennessee and New England wouldn’t be accounted for). Tampa Bay had a season split with Atlanta, so moving them up a slot won’t improve the pick record. Moving them below Denver doesn’t help, as their win over Chicago becomes unaccounted for, while moving them above Carolina (another season split) makes their loss to Dallas unaccounted for. Moving Denver up above Tampa Bay makes their loss to Miami unaccounted for. Dropping them below Buffalo would actually improve the retroactive picks, but dropping them further, below San Diego, would make their wins over both Cleveland and New Orleans unaccounted for. All three of these teams seem plagued by inconsistent play, yet none presents a significantly better option for ranking them elsewhere. The question remains: is there some way to identify and account for season-long inconsistent play?

Of course, the foregoing does not take into account the postseason and the changes to the rankings that will follow from the inclusion of postseason information into the system. It may be worth another examination after the Super Bowl to see if the post-Super Bowl rankings do even better. Nevertheless, I find this thought experiment to be a strong confirmation that the Beatpaths method produces an increasingly accurate description of the relative strength of each team in the league over time.

15 Responses to Retroactive Pick Record

  1. doktarr says:

    Typo: San Fran actually beat Washington, although the pick is wrong either way.

    Going through this with Moose’s iterative rankings, here are teams that change.

    For the sake of sanity when doing this, I won’t double count changes for both teams. So, for example, I won’t mention the Titans’ loss to the Jets as an extra inaccurate pick in the Titans entry, since I’ll already have counted it in the Jets entry.

    In the interest of fairness I’ll include the MIN=>NYG game in the rankings, although I think it should be thrown out as meaningless.

    Jets: drop all the way to the #22 spot in the rankings. Their Seattle and Oakland losses remain misses, but their other 5 losses are now considered expected. On the flip side, their Miami, New England, Tennessee, and Arizona wins are now inconsistent, making the overall pick record 10-6. +1 overall.

    Pittsburgh: drop under the Giants, so that game is now considered consistent. +1.

    Baltimore: under the Giants, which makes for perfect 16-0 picks. +1.

    Minnesota: ranked under Atlanta, making that loss consistent. +1.

    Panthers: Ranked under Atlanta, but that just switches which game is considered inconsistent, for no net change in accuracy.

    Cowboys: much lower in the rankings, all the way at #17. But the flipped picks on the Philly and Washington splits cause no change. The only meaningful change is that the loss to Arizona is now consistent. +1.

    Houston: the wins over both Miami and Chicago are now considered inconsistent. These are the 13-14-15 ranked teams, and these are both rankings (i.e. no beatpath) picks, and both were home games for Houston where we would expect the , but there it is. -2.

    Cincinnati: drops all the way to #23. Cleveland picks are flipped for no change. Washington game is now inconsistent. -1.

    Tampa Bay: Another team thrashed by iterative, all the way down at #26. NFCS splits aside, the Chicago win is now inconsistent, but the three NFC West losses are now consistent. +2.

    San Francisco: Loss to New Orleans is now consistent. +1.

    Buffalo: the win over San Diego is now inconsistent, but the loss to Cleveland is consistent, for no net change.

    So…. in the unlikely event that I haven’t made a mistake… iterative picks 5 more games correctly over the course of the season, for 81% correct overall. Setting aside the season splits, it’s even stronger.

    To put it another way: I think there were 21 season splits this year. Setting them aside, and setting aside the tie game, the standard algorithm made 32 “avoidable” incorrect picks. The iterative algorithm reduces that number to 27. That’s a 16% reduction in avoidable incorrect picks, which is considerable.

    If I tick Houston’s ranking up a teeny bit without actually changing the graph, I get 2 more right. But other than that, I can’t really blame Moose’s ranking algorithm for any of the off picks. 😉

  2. Tom says:


    Thanks for catching my typo. It ought to be corrected now.

    Also, thanks for going into the performance of the iterative method. How does the iterative retroactive pick compare to the iterative week-on-week pick record? (What I’m interested in is whether one method is better at both week-on-week and retroactive, or whether each is better at a different task.)

    I think 3 questions remain when pondering future method variants:

    1) What is the best possible end-of-regular-season ranking (better than what standard, iterative, or weighted have produced)?

    2) *Why* does one method produce better end-of-regular-season picks than another (luck or correctly ignoring fuzzy data)? Does this improved performance hold year-on-year, or are these results unique to 2008?

    3) What do these “inconsistent” games tell us, and how can we reach the best final rankings earlier in the season, given the benefit of hindsight? Would a new variant based on this knowledge also perform better even when applied to past years?

  3. JT says:

    A nice look at how well things shape up after more data gets into the system. It would be nice to see if there are similar results for previous seasons.

  4. doktarr says:

    1) Well, it is a theoretically answerable question, although it’s hard.

    I looked more carefully and there were actually 24 season splits (that’s got to be an unusually high number). So, any consideration may as well remove those 48 games, plus the tie, since any strict ranked order will get exactly half of them right.

    That leaves 207 contests that can be used for fine tuning. Right now, we don’t know the best possible pick record, but I’d be surprised if it’s better than 187-20 or so. Iterative’s rankings, with Houston tweaked a bit higher in the rankings, come to 185-22, and I really don’t see any possible additional tweaks to improve the record. The iterative rankings without any tweaks are 183-24, and parallel is 178-29.

    2) The why, I would argue, is that it is weighting the data more equally. Whether this improvement holds to previous years is a testable question, but it’s too much work to do multiple years by hand. Those with access to the code could probably spit out a solution pretty fast, though.

    3) This is the hardest question. I think the iterative algorithm is one enhancement, because it means every win is, in some sense, weighted equally to all other wins.

    I think another enhancement may involve some method of considering larger and smaller loops simultaneously. “Start with the smallest loop” has always been something of a computational crutch, and the resilience of the NYJ=>TEN game this year laid that bare. I still believe that removing full loops and treating all wins equally (i.e. the iterative approach) is more even-handed than removing individual wins, but boga’s approach to finding the most inconsistent games may lead to some way to determine which loops to remove first.

  5. Tom says:


    While I appreciate your enthusiasm for the iterative method, I don’t think that manually moving individual teams up or down in the standard, parallel, or iterative rankings will lead to a final ranking that will satisfy us that it is *the* best possible final ranking. Something like this probably needs to be done with a sorting algorithm (hopefully something more elegant than brute force checking every possible ranking of 32 teams in 32 slots each with a 16 game record…).

    I agree that those with access to code can probably do better than either of our conjectures, however.

    I also agree that iterative appears to be an enhancement, and treating all wins weighted equally with the others sounds intuitively like the right thing to do. It’s not clear to me, however, if this is precisely the reason why it does better retroactively or whether it does better retroactively for some other reason. After all, maybe certain wins should not be weighted the same as others.

  6. mm says:

    I think it’s pretty obvious that iterative will produce better ‘retroactive’ ratings than standard or weighted.

    About a month ago I let my mind wander over this very stuff one afternoon while I was taking a long walk. Unfortunately, I didn’t write down anything, but I’ll go over a little of what I remember. Iterative is the one of the three that is designed to do the best when comparing past data; note it’s not designed to do the best at predicting future data (though it might), but it clearly should do the best at picking games that are already included in the results.

    Look at the loop A-B-C-A, which represents 3 games played. If you take that loop in isolation, then ranking 1)A 2)B 3)C; 1)B 2)C 3)A; 1)C 2)A 3)B will each produce a record of 2-1 when you look back ‘retroactively’ at the end of the season. However, 1)A 2)C 3)B is one of three rankings that will go 1-2 when you look ‘retroactively’. That’s not exactly ‘fair’; with only 3 games to examine, all teams are truly equal (A=B=C), but once you pick a #1, your #2 &#3 are forced on you, or else you end up with an extra loss. (note, this example calls into question the usefulness of this whole process, but we’ll ignore that!)
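The 2-1 versus 1-2 claim is easy to check by listing all six orderings. The snippet below is only an illustrative sketch (the `(winner, loser)` pairs and variable names are assumptions, not anything from the Beatpaths code); it scores each ranking of the three-game loop in which A beat B, B beat C, and C beat A.

```python
from itertools import permutations

# The loop A-B-C-A as (winner, loser) pairs: A beat B, B beat C, C beat A.
games = [("A", "B"), ("B", "C"), ("C", "A")]

records = {}
for ranking in permutations("ABC"):
    position = {team: i for i, team in enumerate(ranking)}
    correct = sum(position[w] < position[l] for w, l in games)
    records["".join(ranking)] = (correct, len(games) - correct)

for ranking, (right, wrong) in records.items():
    print(ranking, f"{right}-{wrong}")
# ABC 2-1
# ACB 1-2
# BAC 1-2
# BCA 2-1
# CAB 2-1
# CBA 1-2
```

The three orderings that follow the loop's direction go 2-1, and the three that run against it go 1-2.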

    Now, Iterative and Weighted will look at these games and attempt to pick only 1 to throw out; if successful, it will leave you with a 2-1 record when you look back. Standard throws out the whole thing, which means at the end you have a chance to arrive at both the 1-2 & the 2-1 record at the very end. Thus more losses will pop into standard. (note, iterative and weighted can each also end up throwing out all 3 games, and end up leaving the 1-2 possibility, but they make the attempt not to).

    Now, weighted attempts to use the criteria of points scored to pick which game to throw out. This sounds like it might be the best way to predict games in the future (I’m sure some of you have read the “guts and stomps” article at football outsiders), even if the data so far doesn’t seem to support it. However, this generally leads to throwing out more games than iterative, which leads to more possibilities for more losses to pop in.

    Iterative tries to remove the fewest amount of games when making its choice, thus making it (in general) the best for this exercise. When you go back into ‘retroactive’ predictions, you’re treating all games as equal, and you’re only looking at games that the algorithm has already seen. Since iterative normally has more of those games left in its process, it should do the best. It is designed to be the best of the 3 at looking backwards, not necessarily at looking forwards.

    Now, an algorithm that looks at all loops at once, or a computer that simply looks at all possible rankings and picks the best one can be even better (it might have Houston up slightly in doktarr’s example) (I’ll talk a little about this in a follow up comment, this is getting long). I’ll note here it is possible that standard and weighted might luck into these rankings and end up better than iterative, but it seems unlikely.

  7. ThunderThumbs says:

    This is really good stuff. And it underscores to me why we shouldn’t try too hard to correct for inaccuracy in the cases of Denver or the Jets – if a team is plain old inconsistent, then it’s going to partially thwart any effort to judge a team’s quality based off of their season’s performance.

    And yeah, I see mm’s point – an approach that tries to break loops specifically by removing the smallest number of games, is by definition going to do better at finding a ranking that fits the win/loss outcomes of the season so far.

  8. ThunderThumbs says:

    I think it’s fascinating that while our conventional wisdom was always that the NYG were ranked too low all season long… it actually incorrectly called wins more often than losses. Meaning, it suggests that it could have been more accurate if the NYG were ranked even lower.

    And yes, I think a restating of the retroactive question is, what is the minimum number of game outcomes to remove such that a Directed Acyclic Graph is possible?

    That’s a very interesting question to me. However, it does seem to be asking a conceptually different question than what is the most accurate graph for describing the quality of the teams. I can’t help thinking that there might be cases where removing one key, valid game outcome might create a lot of order, or where an orderly graph might rely on a spectacularly fluky game. Sometimes we *want* the beatloops.

    In other words, I guess I have a hypothesis that over 30-40 years of NFL game data, the most retroactively accurate methods might not necessarily be the most predictive. I’d love to test this.

  9. mm says:

    Now, I want to comment here that there may be several ‘best’ rankings if you ever get a computer to calculate it. I’ll want to bring up some examples here.

    Suppose 2 teams in a division split their series, and otherwise perform exactly the same over their schedule (assume their strength of schedule games leaves them in the same place). Even if all other 30 teams get perfectly ranked, you’ll be left with two equally good rankings, one which has A 1 spot over B, and one with B 1 spot over A, at whatever spot they occupy in the rankings.

    Now let’s get more complicated. Consider a freaky team; we’ll call them “Denver”. To simplify this example I’ll imagine a 10 team league with 10 games played. Suppose that when we ignore Denver’s games we end up with a perfect 9 team order, but we grimace when we consider Denver’s record versus each team:

    1 (2-0)
    3 (0-1)
    4 (0-1)
    5 (1-1)
    6 (1-0)
    7 (1-0)
    9 (0-2)

    Let’s look at our record depending on where we rank Denver (note that if we rank Denver 2nd, that means the team that is #2 above becomes 3, the team that is #3 becomes 4, etc.):

    1 (5-5)
    2 (3-7)
    3 (3-7)
    4 (4-6)
    5 (5-5)
    6 (5-5)
    7 (4-6)
    8 (3-7)
    9 (3-7)
    10 (5-5)

    If we presume that the other 9 teams were perfectly ranked, then we have 4 ‘best possible’ rankings: 1 that has Denver first, 2 that have Denver in the middle (5 or 6), and 1 that has Denver dead last!
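The slot-by-slot records in this example can be recomputed with a few lines of code. This is a sketch under the example's own assumptions (the other nine teams are perfectly ordered and only Denver moves); the `denver_vs` mapping and `record_at_slot` helper are names invented here. A loss counts as a correct pick when the opponent ends up above Denver, and a win counts when the opponent ends up below.

```python
# Denver's (wins, losses) against the team at each rank among the other nine,
# copied from the example; the teams at ranks 2 and 8 never played Denver.
denver_vs = {1: (2, 0), 3: (0, 1), 4: (0, 1), 5: (1, 1), 6: (1, 0), 7: (1, 0), 9: (0, 2)}

def record_at_slot(slot, vs, total_games=10):
    """Pick record for Denver's games if Denver is inserted at the given slot."""
    correct = 0
    for rank, (wins, losses) in vs.items():
        if rank < slot:
            correct += losses  # opponent stays above Denver, so losses are expected
        else:
            correct += wins    # opponent is pushed below Denver, so wins are expected
    return correct, total_games - correct

records = {slot: record_at_slot(slot, denver_vs) for slot in range(1, 11)}
for slot, (right, wrong) in records.items():
    print(slot, f"({right}-{wrong})")

best = max(right for right, _ in records.values())
print([slot for slot, (right, _) in records.items() if right == best])  # [1, 5, 6, 10]
```

Slot 10 is the "dead last" case: every opponent sits above Denver, so all five losses are picked correctly and all five wins are missed.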

    Now this might be an extreme example, but it brings up some issues that can surface. I could also invent examples where 3 teams can rotate spots, but only in a certain order (like the A-B-C example in the last comment), groups that can go up and down, or (and simplest of them all) 1 team can move several spots up and down while still being one of the ‘best possible’ rankings. I think in the 16 game season, we just might end up with little bits of all of these issues interacting to produce several ‘equally good’ rankings.

    We’re left with some interesting questions, particularly with the ‘Denver’ issue. Let’s consider a less extreme (and perhaps more likely) ‘Denver’, a team that could be equally good at 5 and 6 as 9 or 10; is it really meaningful to refine an algorithm so we throw out the slightly less accurate 7 or 8?

    And if you do use a computer to calculate the likelihood of each possible ranking of all 32 teams, what’s the best way to give people the data? If you end up with 4 ‘best’ results do you give them all 4? Do you give them an average of each of the 4? What about some kind of a weighted average that included every possible ranking, but with a smaller weight for larger errors? I suppose a team might be relatively high on most of the ‘best’ results, but be significantly lower on a large number of the ‘just below best’ numbers, so this method might show some notable differences.

  10. doktarr says:


    I absolutely agree that I’m not certain that the iterative rankings (with the tweak to Houston) gives the best possible retroactive pick record. I only mean to say that after looking at it, my intuition is that that ranking set is very close to as good as you could do. It’s just a guess, though.


    Your example is not horribly unrealistic. Let’s take another extreme example; we’ll call them “The Jets”. The parallel algorithm ranks them #2 and has a pick record of 9-7. The iterative algorithm drops them all the way into the 20s, and improves the retroactive pick record all the way to… 10-6. So, not exactly the same as ranking them #2, and of course the misses are the opposite games in most cases… but it does show how an inconsistent team (or, in the case of 2008, the three inconsistent teams) simply don’t have a “right” place.

    Now, the question of whether that makes it wrong to, say, push Tampa under Oakland because that squeezes out some extra pick accuracy, is something of a matter of opinion. To me, the answer is yes, quite simply because I don’t need beatpaths to affirm that Tampa is probably a better team than Oakland. It’s far more interesting for me to see Tampa under Oakland, say “wow, really?” and recognize that on the balance, Tampa’s inconsistency swung it toward just being bad.

  11. chris clark says:

    I think I’ll follow up some of the stuff I was writing before on breaking beatloops in this thread.

    First, I got two good responses back from comp.theory. One of them said that the relevant theoretical problem is the “minimum feedback arc set” problem. I’ll go more into that in a moment. The other pointed me to a wonderful paper on statistics that was solving the same problem (note this web address should all be on one line):

    What I gleaned from the stats paper (and there was much I didn’t get since I’m not a statistician) is that you can map the loop breaking problem into a branch-and-bound problem, where roughly each branch corresponds to either breaking one game (or ranking one team, I’m not sure which) and depending on your “objective function”, that is what you are trying to maximize, you can argue for different orders. No order is best for all ways of defining how one ranks teams. That makes kind of intuitive sense, if you have a 3 team loop, each team could be argued as best and thus there are 3!=6 different orders that reflect 6 different judgment criteria. (Damn Latin plurals!)

    Now, the minimum feedback arc set is one (class of) such criteria. To me it matches pretty closely to the idea of beatflukes. The idea is we want to remove the minimum number of games that leave us with a DAG (directed acyclic graph). Those are the fluke games. Note, in most cases the set of fluke games is not unique. If we have just one cycle, ABC[A], then any one of the games AB, BC, or CA could be considered the fluke. However, if we had one more team and two more games in the “loop”, BD and DA. Then, the fluke game is “clearly” AB as that breaks both loops ABC[A] and ABD[A].

    This is also true if the two beatloops do not have the same size. ABC[A] and ABDEF[A]. The fluke game is still AB, because it is the only game in both loops. However, if we add a third loop of BDEG[B], we get back to an ambiguous situation where we have several pairs of games that will break all 3 loops: AB + GB, AB + EG, AB + DE, AB + BD, BC + BD, BC + DE, and perhaps some more. There are arguments that any one of these 2 game pairs is actually the pair of fluke games.
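The two loop examples above are small enough to check by exhaustive search. The following is a hedged sketch, not a proposal for the real ranking code: `is_acyclic` and `minimum_feedback_arc_sets` are names invented here, games are written as `(winner, loser)` pairs, and the search simply tries ever larger sets of games until removing one leaves a DAG. Minimum feedback arc set is NP-hard in general, so this only suits toy graphs.

```python
from itertools import combinations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, loser in edges:
        indegree[loser] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    seen = 0
    while ready:
        node = ready.pop()
        seen += 1
        for winner, loser in edges:
            if winner == node:
                indegree[loser] -= 1
                if indegree[loser] == 0:
                    ready.append(loser)
    return seen == len(nodes)

def minimum_feedback_arc_sets(edges):
    """Return every smallest set of games whose removal leaves a DAG."""
    nodes = {team for edge in edges for team in edge}
    for size in range(len(edges) + 1):
        found = [set(removed) for removed in combinations(edges, size)
                 if is_acyclic(nodes, [e for e in edges if e not in removed])]
        if found:
            return found
    return []

# Loops ABC[A] and ABD[A]: AB is the only game in both.
two_loops = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "D"), ("D", "A")]
print(minimum_feedback_arc_sets(two_loops))  # [{('A', 'B')}]

# Loops ABC[A], ABDEF[A] and BDEG[B]: no single game breaks all three.
three_loops = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "D"), ("D", "E"),
               ("E", "F"), ("F", "A"), ("E", "G"), ("G", "B")]
print(len(minimum_feedback_arc_sets(three_loops)))  # 8
```

For the three-loop case the search finds eight equally small pairs: the six listed above plus CA + BD and CA + DE.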

    Without other weighting criteria, it is best to try to come up with an algorithm that removes the games “fairly”. Borrowing the iterative idea, perhaps we should count the number of loops each game is in (call that loops(game)). Then, find the maximum number of loops any game is in (call that max) and decrease each game’s value by subtracting loops(game)/max from it. That will eliminate all the games involved in the maximal number of loops and decrease the strength of all other games proportionally to the number of loops they are involved in. Doing so may (or may not) break all the loops. If it doesn’t, repeat the process.

    Note that it will break all loops that contain at least 1 game participating in the maximal number of loops, and it removes all the games participating in that maximal number of loops fairly. Therefore, if you have only 1 beatloop, it removes all the games in the beatloop.
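One pass of the loops(game)/max idea might look like the sketch below. Everything here is an assumption layered on the description above: each game starts at a strength of 1.0, games are `(winner, loser)` pairs, and `simple_cycles` and `loop_weight_pass` are hypothetical helpers. The cycle search is a plain depth-first walk that is only suitable for small graphs.

```python
def simple_cycles(edges):
    """Enumerate each simple cycle once, as a list of edges (tiny graphs only)."""
    graph = {}
    for winner, loser in edges:
        graph.setdefault(winner, []).append(loser)
    cycles = []

    def walk(start, node, path, visited):
        for nxt in graph.get(node, []):
            if nxt == start:
                cycles.append(path + [(node, nxt)])
            elif nxt > start and nxt not in visited:
                # Only pass through teams "greater" than the start, so every
                # cycle is reported exactly once, from its smallest team.
                walk(start, nxt, path + [(node, nxt)], visited | {nxt})

    for start in sorted(graph):
        walk(start, start, [], {start})
    return cycles

def loop_weight_pass(edges):
    """One pass: new strength = 1 - loops(game) / max, assuming full strength 1.0."""
    cycles = simple_cycles(edges)
    loops = {edge: sum(edge in cycle for cycle in cycles) for edge in edges}
    most = max(loops.values(), default=0)
    if most == 0:
        return {edge: 1.0 for edge in edges}  # no loops, nothing to reduce
    return {edge: 1.0 - loops[edge] / most for edge in edges}

# The three loops ABC[A], ABDEF[A] and BDEG[B] from the earlier example.
three_loops = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "D"), ("D", "E"),
               ("E", "F"), ("F", "A"), ("E", "G"), ("G", "B")]
weights = loop_weight_pass(three_loops)
print(weights[("A", "B")], weights[("B", "D")], weights[("D", "E")])  # 0.0 0.0 0.0
print(weights[("B", "C")], weights[("E", "G")])                        # 0.5 0.5
```

Here AB, BD and DE each sit in two loops, so they drop to zero and every loop is broken in a single pass, while the remaining games fall to half strength.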

    It is worth experimenting with keeping the reduced numbers (or restoring the games to full strength) for subsequent iterations of breaking loops. If you keep the reduced numbers (a la iterative), the formula for subsequent removals needs to be tweaked so that one still makes the games go to zero, with proportional reductions in other games. If you revert each game to full strength after each iteration (a la standard), the formula works as stated.

  12. Tom says:


    That’s really awesome stuff you’ve been researching.

    @tt & mm

    It occurs to me that it might be interesting to look at things in the exact opposite way than I did–instead of looking retroactively, we could look forward and see how each week’s rankings predict all remaining future games. We could look at Week 3, 4, 5, … 15 and see what each week’s “future pick” record is. I wonder if the correct future-pick-percentage improves week-on-week. It would also be interesting to compare standard, parallel, iterative, & weighted methods on this.

  13. Kenneth says:

    I don’t have much substantial to add to what’s been said, but I wanted to check on one point:

    Doktarr: “looked more carefully and there were actually 24 season splits (that’s got to be an unusually high number).”

    Actually, there are 6 (4 choose 2) game pairs for each division, and 8 divisions, so there should be 48 game pairs…and in that situation, wouldn’t you expect 24 season splits? I mean, in a perfect 50-50 binomial distribution, you’d get 24 splits and 24 sweeps. Obviously, NFL games aren’t 50-50, but if you’re just guessing…
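The arithmetic here can be written out directly. The sketch below assumes, purely for illustration, that the same team wins each of the two games in a series independently with probability p, so a split happens with probability 2p(1 - p); the `expected_splits` helper and the 0.7 figure are made up for this example.

```python
from math import comb

pairs_per_division = comb(4, 2)           # 6 pairs of teams in a 4-team division
division_pairs = pairs_per_division * 8   # 8 divisions

def expected_splits(p, pairs):
    """Expected number of split series if one side wins each game with probability p."""
    return pairs * 2 * p * (1 - p)

print(division_pairs)                                  # 48
print(expected_splits(0.5, division_pairs))            # 24.0
print(round(expected_splits(0.7, division_pairs), 2))  # 20.16
```

With coin-flip games the expectation is exactly 24 splits, and it falls as matchups become more lopsided.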

    Just wanted to check my math and make sure I had all that right.

  14. […] shared beatloops means that is is flukey. But we have developed a few theories (check the comments here for instance) on how to identify actual links (game outcomes) to remove, thereby de-emphasizing […]

  15. […] At the end of the 2008 regular season, I also did a retrospective look at how well the rankings reflect a team’s overall record. […]
