Analyzing the Ranking System

Hi folks, Tom here. Last season I did a series of posts looking at the rankings produced by the Beatpaths system, and trying to determine what information we could glean from them. I looked for two things:

  • Did the Beatpaths system arrive at more accurate rankings as the season progressed?
  • Does the BeatPower score tell us anything about how confident the system is in picking the winner of a given matchup?

At the end of the 2008 regular season, I also did a retrospective look at how well the rankings reflect a team’s overall record.

The point is to think about Beatpaths as a whole system, instead of just looking at any one team. This is the idea of the Beatpaths system in its essence: to look at each team relative to all other teams.

Stability in the Ranks

Below are two graphs: the final graph of ranking shifts from last season, and the first graph of this season. The purpose of these graphs is to look at how many teams shift in the rankings each week, and how dramatically they shift. The method is simple: add up the difference between each team’s rank from this week and the previous week. The higher the number, the less ‘stable’ the rankings are overall. If the Beatpaths system is working, the rankings should become more and more stable as the season progresses, because we will have more information about each team relative to all the others, and can more accurately place them in context.
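For anyone who wants to reproduce the stability number, here is a minimal sketch in Python. It assumes "difference" means the absolute change in rank (signed changes would always cancel out to zero), and the function name and the three-team rankings are made up for illustration, not real Beatpaths data.

```python
def rank_instability(previous, current):
    """Sum of absolute rank changes between two weekly rankings.

    Both arguments map team name -> rank (1 = best). Only teams that
    appear in both weeks are compared.
    """
    return sum(abs(current[team] - previous[team])
               for team in previous if team in current)


# Hypothetical three-team example, not real Beatpaths rankings.
last_week = {"A": 1, "B": 2, "C": 3}
this_week = {"A": 3, "B": 1, "C": 2}

# A drops two spots, B and C each climb one.
instability = rank_instability(last_week, this_week)
print(instability)
```

A week where nobody moves scores 0, and the number grows with both how many teams move and how far they move.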

While this year and last both started out with a high degree of instability in the rankings (typical of the early season), this year’s instability has dropped much more sharply. This may be due to the new tie-breaking criterion being used by TT this year. It also may be due to the lack of beatloops so far this season, because the creation or breaking of beatloops tends to create significant shifts throughout the rankings. We’ll see if this level of stability in the rankings will hold or not.

Making Picks with Confidence

Beyond looking at the stability of the rankings as a whole, I also looked into which matchups the BeatPower scores of each team predicted with the most confidence. You’ll notice when looking at the weekly rankings that they include a BeatPower score on the right-hand side. The method here is simple: compare the matchups for Week 5 by weighing the rival teams’ BeatPower scores against one another. The greater the disparity, the greater the confidence the system has in picking the winner. If the BeatPower scores are close, the system has less confidence in picking a clear winner. Last year I found as a general rule that the top half of the ‘Confidence’ chart (high confidence picks) would have about half as many incorrect picks as the bottom half of the chart (low confidence picks).

Here are the matchups for Week 5 and the ‘confidence’ that the system has in them:

Matchup                   Confidence     BeatPower comparison
                          (out of 100)   (predicted winner – predicted loser)
Minnesota-St. Louis       100            100-0
Indianapolis-Tennessee    100            100-0
NY Giants-Oakland         94.1           100-5.9
NY Jets-Miami             79.6           92.9-13.3
Philadelphia-Tampa Bay    66.7           66.7-0
Dallas-Kansas City        50             50-0
Jacksonville-Seattle      37.5           50-12.5
Arizona-Houston           32.5           62.5-30
San Francisco-Atlanta     31.8           88.9-57.1
Pittsburgh-Detroit        25             58.3-33.3
Cincinnati-Baltimore      23.8           93.8-70.0
Washington-Carolina       20             20-0
Denver-New England        16.7           100-83.3
Cleveland-Buffalo         -6.7           0-6.7

Note the negative confidence pick at the bottom. We ran into these a few times last year. These occur because the Beatpath Rankings and BeatPower scores occasionally diverge (i.e. a team with a lower BeatPower score will nevertheless be ranked higher than another team with a higher BeatPower score). Since these uniformly happen with low confidence games, it doesn’t seem to make a difference that the score is negative: the Beatpaths system isn’t terribly confident about picking a winner one way or the other. My advice is not to bet on those games 😉
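Here is a rough Python sketch of how the confidence number falls out of the BeatPower scores. The function name is my own invention for illustration; the scores are three rows copied from the Week 5 table, and the "predicted winner" is simply whichever team the rankings list first.

```python
def pick_confidence(winner_power, loser_power):
    """BeatPower of the predicted winner minus that of the predicted loser."""
    return round(winner_power - loser_power, 1)


# (predicted winner, predicted loser) -> their BeatPower scores,
# taken from three rows of the Week 5 table.
matchups = {
    ("Minnesota", "St. Louis"): (100.0, 0.0),
    ("Denver", "New England"): (100.0, 83.3),
    ("Cleveland", "Buffalo"): (0.0, 6.7),
}

confidence = {teams: pick_confidence(*powers)
              for teams, powers in matchups.items()}

# Most confident picks first. A negative value means the rankings and the
# BeatPower scores disagree about which team is the favorite.
for teams, value in sorted(confidence.items(),
                           key=lambda item: item[1], reverse=True):
    print(f"{teams[0]}-{teams[1]}: {value}")
```

The Cleveland-Buffalo row comes out negative because Cleveland is the ranked favorite but has the lower BeatPower score.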

Also note that the ‘confidence’ scores are simply something I’ve been playing around with out of curiosity. TT always gives the official Beatpaths picks, and this year he’s comparing Beatpaths, Isaacson-Tarbell, and a hybrid of those two methods (along with his personal picks, as always).

Looking Back

A final way of gaining insight into the Beatpaths system and how well it’s functioning is to look to the recent past. Specifically, how well does the current ranking of each team retroactively predict its win-loss record? Teams that are consistently good or consistently bad will usually be well represented by their rank relative to other teams. However, teams that are flukey, winning against good opponents but finding ways to lose to bad ones, will be difficult to rank no matter what system you’re using. Looking at each team’s retroactive pick record is a good way to identify which teams these are and assess how well the Beatpaths system is handling them.

At this point in the season, without any beatloops, the Beatpath Rankings have a 4-0 retroactive pick record for every team (or 3-0 for teams that have had their bye week). While the tie-breakers between certain teams may be disputable, the rankings at the moment do not have any team ranked higher than a team that defeated it. Only when beatloops are formed will ambiguities enter into the rankings, which Beatpaths attempts to resolve by simply removing those paths that form a loop. When ambiguous data is removed the rankings may no longer correspond with who-beat-who, and we can begin to examine the retroactive pick record for each team in detail.

That’s it from me this week. I’ll update this post on game day to fill in the pick confidence table with actual game results.

18 Responses to Analyzing the Ranking System

  1. ThunderThumbs says:

    Awesome, Tom.

    Yeah, my own meddling from early in the season might introduce some uncertainty in the early part of this graph. For one thing, I’m considering switching the tiebreaker once again for next week – luckily, it’s one that is very closely related to the current one, with hardly any change in the rankings this week from the one that is up right now. The new tiebreaker will count Edges (arrows; games) instead of Nodes (teams).

    We’ve had a weirdly accurate first four weeks of the season, which is partly related to there being no beatloops so far.

  2. Tom says:

    As I count them, the games with potentials for creating beatloops in Week 5 are:
    – Rams beating the Vikings
    – Titans beating the Colts
    – Dolphins beating Jets
    – Houston beating Arizona

    The last two on that list are the only ones that may have a chance. But we could conceivably come out of Week 5 still with no beatloops.

    I’m not sure how the change in tiebreaker methodology translates back into the real world. When we count arrows/games rather than nodes/teams, what does that mean in English to the average football fan?

  3. The MOOSE says:

    What I believe (and please correct me if I’m wrong) is this: up to now, this site has always ranked teams by taking a team and finding all other teams connected to it on the graph. You would then subtract the teams above from the teams below in order to find out where each team should be ranked. In my graphs, I decided instead to find every path that the team was in, and count the “arrows” going out to teams below and coming in from above. The difference is that if A->B->C and A->C, when counting only teams, A gets +2 for B and C, but when counting paths A gets +3 for A->B, A->B->C, and A->C. Since A->C wouldn’t be shown in the graph as it is redundant, giving A extra points shows that the team is stronger than just having a path to the teams below it.

  4. Tom says:

    Moose, thanks for that explanation. I think I understand in a technical sense how things are added up. But what I don’t understand is why one method makes more or less sense than another method. All of these lines and nodes and paths represent something real, beyond their representation on the graph. What, in reality, does counting ‘arrows’ versus counting ‘paths’ mean? Why should it make sense to the guy on the couch with nachos that a team is ranked in place X because there are a certain number of ‘arrows’ or ‘paths’ above or below it?

    I can explain the site’s method overall in English: “Beatpaths ranks teams according to strength of schedule, and deals with contradictory wins/losses by ignoring them.” But I’m having a hard time translating these tie-breaking methods back into English.

  5. ThunderThumbs says:

    Ok, let me give it a shot. There are four tiebreaker methods I’ll be explaining. All tiebreaker methods are applied only to teams that the graph says can be ranked next, so that a ranking will never contradict the graph.

    1) Strength of beatwins. This is the one we used the last couple of years. Take a team. Find their best x immediate beatwins (teams they’ve actually defeated). Find the scores of each of those beatwins. The score of a team was, I think, how many teams were in their downward beatpaths or something. Anyway, take the average strength of a team’s top x beatwins. Compare it to the other teams being compared. Top score wins. The attempt was to pick the team with the best and broadest strength under them, so one team with one (maybe flukey) good victory and a bunch of crappy victories wouldn’t beat out another team with three respectable victories.

    2) Nodes (the tiebreaker this season so far). All three of the remaining methods simply try to cohesively reflect the strength of the entire graph. Teams below minus teams above is the simplest.

    3) Edges – an edge is an arrow is a game. It means that given the graph, the team in question would have won all of the games downstream, or lost all the games upstream. Like, the CIN->PIT game is below Denver. That means that DEN would have beaten PIT too. GB->PIT is also below Denver. That means, again, that DEN would have beaten PIT. It’s counted twice, unlike Nodes.

    4) Paths – this is Moose’s method. It will count certain edges multiple times, if they’re part of a different overall path. I’m not as sure how to translate it to English. I think the idea is, each path is an expression of the team’s power over the downstream teams.

    All three graph methods are close in accuracy, but method #3 (Edges) is the most accurate in predictive value, based on a five-season backtest.

    I wouldn’t say it’s strength of schedule; it’s more based on the strength of their opponents, which is in turn based on the strength of their opponents, and so on. All the graph methods can qualify for that definition.

  6. The MOOSE says:

    To extend the last 3 into another example, consider these games:

    A->B, A->C, A->E, B->D, C->D, D->E

    In the “node” way of counting, you have A getting credit for +4, as they have paths over B, C, D and E.

    In the “edge” way of counting, A gets +6, as all of the games listed fall under paths started by A.

    In the “paths” way of counting, A gets +7, because the D->E game would count twice since it falls under B and C, both of which A has defeated.

    I have chosen to use the path method because it uses the most information. Reinforcing paths gains points in this method.

    The best way to explain the path method in English is to say “count every unique path beginning with team X”. In this example:

    A->B
    A->C
    A->E
    A->B->D
    A->C->D
    A->B->D->E
    A->C->D->E

    These are the 7 “out” paths. Then find all the paths that end with the same team (the “in” paths) and subtract from the out paths. This gives each team their raw path score.
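The three counts can be sketched in Python as follows. This is only an illustration, not the site's actual code: the function names are made up, the graph is assumed to have no beatloops, and the game list (A->B, A->C, A->E, B->D, C->D, D->E) is inferred from the totals quoted in this thread. Only the "out" side is computed; the "in" side would be the same counts on the reversed graph.

```python
def downstream_teams(wins, team):
    """Every team reachable from `team` by following wins."""
    seen = set()
    stack = [team]
    while stack:
        for loser in wins.get(stack.pop(), ()):
            if loser not in seen:
                seen.add(loser)
                stack.append(loser)
    return seen


def count_nodes(wins, team):
    """Nodes: how many distinct teams sit below `team`."""
    return len(downstream_teams(wins, team))


def count_edges(wins, team):
    """Edges: games won by `team` or by any team below it."""
    winners = downstream_teams(wins, team) | {team}
    return sum(len(wins.get(winner, ())) for winner in winners)


def out_paths(wins, team):
    """Paths: every distinct path that starts at `team` (no beatloops)."""
    paths = []
    for loser in wins.get(team, ()):
        paths.append([team, loser])
        paths.extend([team] + tail for tail in out_paths(wins, loser))
    return paths


# Assumed example games: winner -> list of teams it beat.
wins = {"A": ["B", "C", "E"], "B": ["D"], "C": ["D"], "D": ["E"]}

print(count_nodes(wins, "A"), count_edges(wins, "A"), len(out_paths(wins, "A")))
```

Listing the paths explicitly is wasteful for a big graph, but it makes it easy to see which game gets counted more than once.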

  7. Tom says:

    Thanks TT and Moose.

    So to explain the “node” tiebreaker we might say, “We give Team A 4 points because there are four other teams we think they can beat, either because they beat them, or they beat a team that beat them.”

    To explain the “edge” tiebreaker we might say, “We give Team A 6 points because there are six different previous matchups this season that we think A did win or would have won. We count beating Team D and Team E twice, because we have two pieces of evidence that Team A could beat them (either because Team A did beat them, or Team A beat a team that beat them).”

    To explain “paths” we might say, “This is essentially the same as the ‘edge’ method, but we count *all* the Teams A beat, and then *all* the teams that those beaten by A beat, and so on until we reach teams that haven’t got any wins.”

    Is this the right way to translate these into English, without having to reference the chart, arrows, edges, nodes, paths, etc?

  8. The MOOSE says:

    I usually find it easiest to explain the path method by building up from the bottom. Every team gets 1 point for each victory over a team, plus all of that team’s points.

    So in the example above, you’d start with E who has 0 points. D has 1 point for beating E. C gets 1 for defeating D and all of D’s points (1) for 2 total. B is the same as C for 2. A gets the points for B (2), C (2) and E (0) for a total of 4, plus 3 for defeating those three teams once each. That’s how we get to 7.
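The bottom-up rule translates almost directly into a short recursive function. This is a sketch under the same assumptions as the rest of the thread: the game list is inferred from the totals quoted here, the name `path_score` is made up, and the graph must have no beatloops or the recursion would never finish.

```python
from functools import lru_cache

# Assumed example games: winner -> teams it beat.
WINS = {"A": ("B", "C", "E"), "B": ("D",), "C": ("D",), "D": ("E",)}


@lru_cache(maxsize=None)
def path_score(team):
    """1 point per win, plus all of the points of each beaten team."""
    return sum(1 + path_score(loser) for loser in WINS.get(team, ()))


scores = {team: path_score(team) for team in "ABCDE"}
print(scores)
```

The cache means each team's score is worked out once and then reused, which mirrors the "start with E and work upward" description.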

  9. Tom says:

    Moose, thanks, that is a more intuitive way of stating it.

    So TT has backtested these methods and found that apparently the ‘edges’ tiebreaker is the most predictive. Does this mean that it’s most predictive of the rest of the team’s season as a whole, or most predictive of just their next week’s matchup?

    Also, I wonder why counting nodes or paths isn’t as predictive as counting edges (based on the five years of data we have)? Is there something about this way of assessing the team’s relative strength that better reflects some real world aspect of football?

  10. ThunderThumbs says:

    Quoting Tom: To explain the “edge” tiebreaker we might say, “We give Team A 6 points because there are six different previous matchups this season that we think A did win or would have won. We count beating Team D and Team E twice, because we have two pieces of evidence that Team A could beat them (either because Team A did beat them, or Team A beat a team that beat them).”

    I think it’s more that we count beating Team D twice because there are two different matchups involving Team D that we believe Team A would have won. Sort of. Since B beat D, and A beat B, A would beat D. And since C beat D, and A beat C, A would beat D. That’s why it counts twice.

  11. ThunderThumbs says:

    Quoting Tom: Also, I wonder why counting nodes or paths isn’t as predictive as counting edges (based on the five years of data we have)? Is there something about this way of assessing the team’s relative strength that better reflects some real world aspect of football?

    It’s a close (subjective) call to determine which data in the graph should and shouldn’t be paid attention to. It may be that “nodes” doesn’t pay enough attention, since it ignores the edges and pathways that the other methods use. And it may be that “paths” overweights certain path segments more than is helpful, if we define “helpful” as maximizing the predictive effect. By doing tiebreakers that respect the ordering of the graph, what we’re really trying to do is match the percentage performance of the graphs (BeatPicks) as closely as possible. Doing anything that goes against the spirit of the graph too much could start pulling the predictive performance in the wrong direction.

  12. Boga says:

    Let’s see if I got this right. Edges basically counts every arrow from the pretty graphs that are put up on the site. Paths would count every arrow from the raw graph with redundant lines.

    Does that sound right?

  13. ThunderThumbs says:

    Not quite – Edges counts every arrow, including the redundant arrows that are removed. Paths count every path that a team has to every other team it has a path to. (Which means if a team has three different paths to one other team, it will count all three of them, even if they only diverge partially – this means some arrows can be counted multiple times.)

  14. Thurhame says:

    Tom, a little advice from a statistics student: you may get a better idea of stability if you look at the sum of the squares of the deviations (the quantity that variance is built on) rather than just the sum of the deviations. For instance, if team A moves up 4 spaces, and teams B and C each move down 2 spaces, that’s a sum of squares of 24 (4 squared plus 2 squared plus 2 squared).
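To see what squaring changes, here is a small Python comparison of the two measures on that example. The function names are made up for illustration, and the second list of shifts is a hypothetical week invented for contrast.

```python
def absolute_shift_total(shifts):
    """Sum of the sizes of the rank changes (the post's stability number)."""
    return sum(abs(shift) for shift in shifts)


def squared_shift_total(shifts):
    """Sum of squared rank changes, which weights big jumps more heavily."""
    return sum(shift * shift for shift in shifts)


# The example from the comment: one team up 4, two teams down 2 each.
big_jump_week = [4, -2, -2]

# Hypothetical contrast: eight teams each moving a single spot.
small_moves_week = [1, -1, 1, -1, 1, -1, 1, -1]

for week in (big_jump_week, small_moves_week):
    print(absolute_shift_total(week), squared_shift_total(week))
```

Both weeks look identical under the plain sum, but the squared version flags the week with the four-spot jump as the less stable one.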

    PS The edges tiebreaker is rather interesting; I’ve never thought of something like that before. I use the paths tiebreaker. In my opinion, when you consider for instance
    A -> B -> D -> E
    A -> C -> D (-> E)
    versus
    A -> B -> C -> D -> E
    A ----> C (-> D -> E)
    the difference is that in the second one, B beat C instead of D. Since C is better than D, this should boost the rating of B and the teams that beat B, right? Paths does that, rating A 7-0 instead of 6-0. Edges doesn’t, giving A a rating of 5-0 both times.

    PPS ThunderThumbs, in your tiebreaker explanation, you make it sound like a 6-3 team would be rated above a 3-1 team. Is this right? I favor the 3-1 team (.750 to .666), just like I would in win-loss records.

  15. Tom says:


    Thanks for the tip. You’ve inspired me to dig up my stats textbook.

    Regarding your PPS to TT, as a stats student, wouldn’t you prefer the result confirmed by the larger-N sample?

  16. Thurhame says:

    Tom, yes, the larger-N sample has a higher confidence. I am more confident that a 6-3 team will win two-thirds of their games than I am that a 3-1 team will win three-fourths of their games. However, no matter what the confidence, three-fourths is still more than two-thirds.

  17. Thurhame says:

    Another way of thinking about it is that if a team is only barely above average, and plays many more games than the other teams without improving, it should not gravitate towards the top of the rankings. Similarly, if a team is barely below average, and plays many more games than the other teams without getting worse, it should not move towards the bottom of the rankings.

  18. […] everyone, Tom here. Last week I introduced three different ways of looking at the Beatpaths rankings as a whole […]
