Analyzing the Ranking System – Week 6

Ahoy folks, Tom here with the “Bonfire of the Vanities” edition of this feature. Lots of new stuff to mull over this week, given the changes to the tiebreaker system.

Stability

As you might expect, the rankings have undergone a significant shuffle this week. Three matchups looked like pretty significant upsets:

  • Philadelphia’s loss to Oakland;
  • Cincinnati’s loss to Houston; and,
  • NY Jets’ loss to Buffalo.

The overall effect on the system of team rankings was substantial. For the last two weeks, we’ve only had two teams move 10+ spots. This week we had four: Houston, Cincinnati, San Diego, and Green Bay. All four of those big shifts, as Kenneth explained, were based on the results of the Houston-Cincinnati game.

The two graphs below show the raw number of rank changes between Week 5 and Week 6 (“Stability”), and the statistical measure of the dispersion of rank changes (“Variance”). Both show a big spike this week.
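For readers who want to reproduce the two measures, here is a minimal sketch in Python. It assumes that "Stability" is the total number of rank positions moved across all teams and that "Variance" is the population variance of the per-team rank changes; the post does not spell out either formula, so treat both definitions, the function name, and the four-team data as illustrative only.

```python
from statistics import pvariance

def rank_changes(previous, current):
    """Per-team change in rank (positive = dropped, negative = climbed)."""
    return {team: current[team] - previous[team] for team in previous}

# Hypothetical rankings for four made-up teams (not real Beatpaths data).
week5 = {"A": 1, "B": 2, "C": 3, "D": 4}
week6 = {"A": 3, "B": 2, "C": 1, "D": 4}

changes = rank_changes(week5, week6)

# Assumed "Stability": total number of spots moved, ignoring direction.
total_movement = sum(abs(change) for change in changes.values())

# Assumed "Variance": population variance of the rank changes.
variance = pvariance(list(changes.values()))

print(total_movement, variance)
```

A week in which many teams jump ten or more spots pushes both numbers up, which is the kind of spike described above.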


Pick Confidence for Week 7

Looking at pick confidence last week didn’t go very well. Usually, the low confidence picks have about twice as many incorrect results as the high confidence picks. But last week, the outcome was reversed, with Philadelphia, Cincinnati, and the NY Jets all falling to their low-ranked opponents. Certainly some or all of these can be explained away as ‘flukes,’ but a dismissive approach to problems with one’s own model doesn’t lead to a better model.

Fortunately, over the past few weeks TT has been discussing and now implementing a new tiebreaker model. Part of the implementation has been to get rid of the old BeatPower numbers and replace them with a new set of numbers: EdgePower. TT has backtested this scheme as a tiebreaking method and found that it is more accurate when applied to the last few seasons’ worth of data. Hopefully, using the EdgePower numbers will give us slightly more sensitive comparisons between teams when trying to determine the confidence that the Beatpaths system has in a given matchup’s outcome.

So without further ado, here are this week’s confidence rankings of the Week 7 picks:

Matchup (Winner-Loser) | “Confidence” (out of 100) | EdgePower comparison (predicted winner – predicted loser) | Result
New England-Tampa Bay | 63.52 | 75.68-12.16 |
Indianapolis-St. Louis | 50.67 | 66.89-16.22 |
San Diego-Kansas City | 20.95 | 56.76-35.81 |
Minnesota-Pittsburgh | 20.27 | 72.97-52.70 |
Philadelphia-Washington | 19.27 | 44.27-25.00 |
New Orleans-Miami | 18.92 | 67.57-48.65 |
Atlanta-Dallas | 18.24 | 71.62-53.38 |
Green Bay-Cleveland | 9.46 | 52.70-43.24 |
San Francisco-Houston | 8.11 | 61.49-53.38 |
Chicago-Cincinnati | 3.38 | 54.73-51.35 |
NY Giants-Arizona | 3.38 | 61.49-58.11 |
NY Jets-Oakland | 2.03 | 50.00-47.97 |
Carolina-Buffalo | 0 | 39.19-39.19 |
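
The confidence figures in the table appear to be nothing more than the gap between the two teams' EdgePower numbers. A minimal sketch in Python, assuming that reading is right (the function name `confidence` is made up for illustration and is not part of Beatpaths):

```python
def confidence(winner_edgepower: float, loser_edgepower: float) -> float:
    """Assumed formula: predicted winner's EdgePower minus predicted loser's."""
    return round(winner_edgepower - loser_edgepower, 2)

# A few rows copied from the table above: (winner EdgePower, loser EdgePower).
picks = {
    "New England-Tampa Bay": (75.68, 12.16),
    "San Diego-Kansas City": (56.76, 35.81),
    "Carolina-Buffalo": (39.19, 39.19),
}

for matchup, (winner, loser) in picks.items():
    print(f"{matchup}: {confidence(winner, loser):.2f}")
```

Under this reading, a zero-confidence game is simply one where both teams carry the same EdgePower.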

One aspect of the new system that is immediately apparent is the much larger number of low confidence games. The contrast between team strengths, using EdgePower numbers, is much less stark. My intuition is that over the course of the season, as the graph gets more and more connected, the EdgePower of the teams at the top and bottom will begin to diverge more clearly from the middling or inconsistent teams.

Only two games are given a high confidence pick by the system: New England over Tampa, and Indianapolis over St. Louis. The two bottom games stand out to me as well. Just like last week, we have a zero-confidence game between Buffalo and Carolina. And the very close call the system is making between the NY Jets and Oakland is a bit shocking at first, but I suppose since Russell had a not-terrible day against the Eagles D, whereas Sanchez did his best Delhomme imitation last Sunday, I can see how such a close pick can be rationalized.

I like having a new and updated EdgePower system, but I am sensitive to the fact that it’s much less ‘confident’ about making picks now. Making more qualified predictions is not necessarily making better predictions. However, we’ll see how it goes over the course of the season, and get a sense for what exactly constitutes a high confidence score using the new numbers. And, after all, BeatPaths was not built to be predictive so much as it was built to be descriptive. So the confidence ranking of picks remains, as always, just a fun side-project.

I’ll be back on game-day to fill in the correct and incorrect results of the picks.

Who’s Flukey Now?

Last week I was able to start looking at the retroactive records of several teams, once we got our first beatloop. Because Beatpaths simply removes any games involved in a beatloop from the system, it introduces some ambiguity into the rankings. So the idea here is to look and see which past games don’t make sense given each team’s current ranking. Maybe this will give us some insight into inconsistent teams or fluke games.

Here are the teams involved in beatloops this week. The record beside each team is not its win-loss record, but the Beatpaths record at ‘picking’ each team’s past matchups, and the results that are inconsistent with their current rank.

  • Cincinnati (3-3): wins over Green Bay, Baltimore, and Pittsburgh.
  • NY Jets (3-3): wins over Houston, New England; loss to Miami.
  • San Diego (3-2): losses to Baltimore, and Pittsburgh.
  • Green Bay (3-2): win over Chicago, loss to Cincinnati.
  • Houston (4-2): losses to NY Jets, and Jacksonville.
  • Baltimore (4-2): win over San Diego; loss to Cincinnati.
  • Pittsburgh (4-2): win over San Diego; loss to Cincinnati.
  • Jacksonville (4-2): win over Houston, loss to Seattle.
  • Chicago (4-1): loss to Green Bay.
  • New England (5-1): loss to NY Jets.
  • Miami (5-1): win over the NY Jets.
  • Seattle (5-1): win over Jacksonville.
  • Buffalo (5-1): win over NY Jets.

For all the other teams not listed here, Beatpaths has a perfect 6-0 or 5-0 record in ‘predicting’ their past record, given their current rank.
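
The bookkeeping behind these retroactive records can be sketched in a few lines of Python. This is only an illustration under one assumption: a past result counts as "consistent" when the winner currently ranks above the loser. The team names, ranks, and games below are hypothetical, and `retroactive_records` is not a real Beatpaths function.

```python
from collections import defaultdict

def retroactive_records(ranks, games):
    """ranks: team -> current rank (1 = best); games: list of (winner, loser).

    Returns team -> [consistent, inconsistent] counts, where a game is
    consistent if the winner is currently ranked above the loser.
    """
    records = defaultdict(lambda: [0, 0])
    for winner, loser in games:
        consistent = ranks[winner] < ranks[loser]
        for team in (winner, loser):
            records[team][0 if consistent else 1] += 1
    return dict(records)

# Hypothetical three-team beatloop: Alpha > Bravo > Charlie > Alpha.
ranks = {"Alpha": 1, "Bravo": 2, "Charlie": 3}
games = [("Alpha", "Bravo"), ("Bravo", "Charlie"), ("Charlie", "Alpha")]

for team, (good, bad) in retroactive_records(ranks, games).items():
    print(f"{team} ({good}-{bad})")
```

With any ranking of a three-team loop, at least one game has to come out inconsistent, which is why beatloop teams are the ones that show up in the list.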

As you can see, the two most difficult teams to place are Cincinnati and the NY Jets. You can get an intuitive sense for their inconsistency by looking back at the Week 6 rankings, and seeing the range of possible ranks Cincinnati and the NY Jets could have and remain consistent with the graph. They could be top teams, they could be terrible teams, or they could be both and we might just never know which will show up on game day.

18 Responses to Analyzing the Ranking System – Week 6

  1. ThunderThumbs says:

    It might be interesting to look at the edgepower difference and also the rankings differential…

    NE-TB: 2-11, 27-32
    IND-STL: 1-18, 25-32
    SD-KC: 2-25, 16-29
    MIN-PIT: 1-14, 5-25
    PHI-WAS: 13-27, 22-30
    NO-MIA: 1-19, 5-30
    ATL-DAL: 3-13, 4-27
    GB-CLE: 2-28, 9-30
    SF-HOU: 5-20, 8-24
    CHI-CIN: 4-24, 2-29
    NYG-ARI: 2-24, 7-21
    NYJ-OAK: 2-31, 12-26
    CAR-BUF: 15-29, 13-31

    Not quite sure how to take those into account. One option is to montecarlo it. I just did that for NO/MIA, 5000 iterations, and I picked a random integer ranking for each one – if NO gets randomly ranked ahead of MIA, then they win that matchup. And then I did a percentrank, or, the percentage of iterations that NO is ranked ahead of MIA.

    So here’s how the numbers worked out for each of these, with the standard deviation in parentheses. People with bigger statistical brains than mine can explain exactly what that might mean. (Keep in mind that 100% should be factored down by whatever the universal Any Given Sunday quotient is)

    NE-TB: 100%
    IND-STL: 100%
    SD-KC: 86.9 (7.6)
    MIN-PIT: 84.7 (6.9)
    PHI-WAS: 88.0 (4.7)
    NO-MIA: 77.7 (8.9)
    ATL-DAL: 81.8 (7.3)
    GB-CLE: 67.6 (9.7)
    SF-HOU: 71.1 (6.3)
    CHI-CIN: 55.1 (9.7)
    NYG-ARI: 54.1 (7.5)
    NYJ-OAK: 58.4 (9.4)
    CAR-BUF: 50.0 (6.6)
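
The Monte Carlo approach described in this comment can be sketched roughly as follows. This is a guess at the procedure, not the commenter's actual code: it assumes each team's rank is drawn uniformly and independently from its possible range, and it reports ties separately because the comment does not say how they were handled. The function name and the fixed seed are illustrative choices.

```python
import random

def simulate(range_a, range_b, iterations=5000, seed=0):
    """Estimate how often team A draws a better (lower) rank than team B.

    range_a, range_b: inclusive (best, worst) possible ranks.
    Returns (share of draws with A ahead, share of draws tied).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    a_ahead = 0
    ties = 0
    for _ in range(iterations):
        a = rng.randint(*range_a)  # randint includes both endpoints
        b = rng.randint(*range_b)
        if a < b:
            a_ahead += 1
        elif a == b:
            ties += 1
    return a_ahead / iterations, ties / iterations

# NO can rank anywhere from 1 to 19, MIA anywhere from 5 to 30.
ahead, tied = simulate((1, 19), (5, 30))
print(f"NO ahead: {ahead:.1%}, tied: {tied:.1%}")
```

Different seeds shift the estimate by a fraction of a percentage point at 5000 iterations, so small differences from the figures above are expected.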

  2. Go Niners! says:

    OK, here’s the math you want to avoid using Monte Carlo. Let’s take NO-MIA as our example.
    NO has 19 possible rankings (1 to 19; max – min + 1).
    MIA has 26 possible rankings (see above).
    NO has 15 possible rankings that rank equal to or less than Miami’s possible rankings (19-5+1).
    The odds that NO ranks *below* MIA is then:
    sum(0..14)/(26*19) = 21.3%
    and the odds that they rank exactly even is:
    15/(26*19) = 3.04%

  3. Go Niners! says:

    The sum (0..N) can be expressed as (N^2+N)/2 to make the calculation easier. The situation is more complicated where one team’s rankings are contained within another’s:
    BUF has 19 possible rankings. CAR has 15.
    15 of BUF’s 19 rankings might be equal to or lower than CAR. 2 are always higher, and 2 are always lower.
    The odds of BUF ranking lower than CAR is:
    sum(0..14)/(19*15) + 2/19 = 47.4%
    The odds of them being even is: 1/19 = 5.26%
    The odds of CAR ranking lower than BUF is also 47.4%.
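
The closed-form sums in the last two comments can be cross-checked by brute force: enumerate every pair of ranks and count. The sketch below does that in Python under the same assumptions as the comments (each rank in a team's range is equally likely, and the two teams are independent); `rank_odds` is a made-up helper name.

```python
from fractions import Fraction

def rank_odds(range_a, range_b):
    """Exact odds for two teams with inclusive (best, worst) rank ranges.

    Returns (P(A ranked ahead), P(tie), P(A ranked below)) as fractions.
    """
    a_lo, a_hi = range_a
    b_lo, b_hi = range_b
    ahead = ties = below = 0
    for a in range(a_lo, a_hi + 1):
        for b in range(b_lo, b_hi + 1):
            if a < b:
                ahead += 1
            elif a == b:
                ties += 1
            else:
                below += 1
    total = ahead + ties + below
    return Fraction(ahead, total), Fraction(ties, total), Fraction(below, total)

# NO (1-19) against MIA (5-30).
no_ahead, no_tie, no_below = rank_odds((1, 19), (5, 30))
print(f"NO below MIA: {float(no_below):.2%}, tie: {float(no_tie):.2%}")

# BUF (13-31) against CAR (15-29), where one range contains the other.
buf_ahead, buf_tie, buf_below = rank_odds((13, 31), (15, 29))
print(f"BUF below CAR: {float(buf_below):.2%}, tie: {float(buf_tie):.2%}")
```

The enumeration covers the overlapping and nested cases with the same loop, at the cost of doing a few hundred comparisons instead of one formula.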

  4. Go Niners! says:

    NE 100%, TB 0%
    IND 100%, STL 0%
    SD 85.13%, KC 10.71%, tie 4.16%
    MIN 80.62%, PIT 12.24%, tie 7.14%
    PHI 85.94%, WAS 7.40%, tie 6.66%
    NO 76.32%, MIA 18.42%, tie 5.26%
    ATL 77.28%, DAL 13.63%, tie 9.09%
    GB 67.52%, CLE 28.78%, tie 3.70%
    SF 69.49%, HOU 24.26%, tie 6.25%
    CHI 60.47%, CIN 35.96%, tie 3.57%
    NYG 65.02%, ARI 30.64%, tie 4.34%
    NYJ 72.99%, OAK 23.68%, tie 3.33%
    CAR 57.76%, BUF 36.98%, tie 5.26%

  5. Go Niners! says:

    Oops, screwed up in that last one. Let me fix it and post again (no way to delete posts)!

  6. Go Niners! says:

    Here we go:

    1: NE 100%, TB 0%
    1: IND 100%, STL 0%
    3: SD 85.13%, KC 10.71%, tie 4.16%
    3: MIN 80.62%, PIT 12.24%, tie 7.14%
    3: PHI 85.94%, WAS 7.40%, tie 6.66%
    3: NO 76.32%, MIA 18.42%, tie 5.26%
    3: ATL 77.28%, DAL 13.63%, tie 9.09%
    3: GB 67.52%, CLE 28.78%, tie 3.70%
    3: SF 69.49%, HOU 24.26%, tie 6.25%
    5: CHI 53.56%, CIN 42.87%, tie 3.57%
    6: NYG 52.19%, ARI 43.47%, tie 4.34%
    6: NYJ 56.68%, OAK 39.99%, tie 3.33%
    5: CAR 47.36%, BUF 47.38%, tie 5.26%

    The code, complete with awful input parsing and no exit from the loop:

    #!/bin/bash

    while [[ 1 ]]; do
        read t1 t2 t3
        t1n=`echo $t1 | cut -d '-' -f 1`
        t2n=`echo $t1 | cut -d '-' -f 2 | cut -d ':' -f 1`
        t1l=`echo $t2 | cut -d '-' -f 1`
        t1h=`echo $t2 | cut -d '-' -f 2 | cut -d ',' -f 1`
        t2l=`echo $t3 | cut -d '-' -f 1`
        t2h=`echo $t3 | cut -d '-' -f 2`

        if [[ $t1h -lt $t2l ]]; then
            echo "1: $t1n 100%, $t2n 0%"
        elif [[ $t2h -lt $t1l ]]; then
            echo "2: $t2n 100%, $t1n 0%"
        elif [[ $t1l -le $t2l && $t1h -le $t2h ]]; then
            denom=`echo "($t1h-$t1l+1)*($t2h-$t2l+1)" | bc`
            num=`echo "$t2l-$t1h" | bc`
            t2p=`echo "scale=2;($num^2+$num)*100/(2*$denom)" | bc`
            t0p=`echo "scale=2;($t2h-$t2l+1)*100/$denom" | bc`
            t1p=`echo "scale=2;100 - $t2p - $t0p" | bc`
            echo "3: $t1n $t1p%, $t2n $t2p%, tie $t0p%"
        elif [[ $t2l -le $t1l && $t2h -le $t1h ]]; then
            denom=`echo "($t1h-$t1l+1)*($t2h-$t2l+1)" | bc`
            num=`echo "$t1l-$t2h" | bc`
            t2p=`echo "scale=2;($num^2+$num)*100/(2*$denom)" | bc`
            t0p=`echo "scale=2;($t1h-$t1l+1)*100/$denom" | bc`
            t1p=`echo "scale=2;100 - $t2p - $t0p" | bc`
            echo "4: $t1n $t1p%, $t2n $t2p%, tie $t0p%"
        elif [[ $t2l -le $t1l && $t2h -ge $t1h ]]; then
            denom=`echo "($t1h-$t1l+1)*($t2h-$t2l+1)" | bc`
            num=`echo "$t1h-$t1l" | bc`
            t1p=`echo "scale=2;($num^2+$num)*100/(2*$denom) + ($t2h-$t1h)*100/($t2h-$t2l+1)" | bc`
            t0p=`echo "scale=2;100/($t2h-$t2l+1)" | bc`
            t2p=`echo "scale=2;100 - $t1p - $t0p" | bc`
            echo "5: $t1n $t1p%, $t2n $t2p%, tie $t0p%"
        elif [[ $t1l -le $t2l && $t1h -ge $t2h ]]; then
            denom=`echo "($t1h-$t1l+1)*($t2h-$t2l+1)" | bc`
            num=`echo "$t2h-$t2l" | bc`
            t2p=`echo "scale=2;($num^2+$num)*100/(2*$denom) + ($t1h-$t2h)*100/($t1h-$t1l+1)" | bc`
            t0p=`echo "scale=2;100/($t1h-$t1l+1)" | bc`
            t1p=`echo "scale=2;100 - $t2p - $t0p" | bc`
            echo "6: $t1n $t1p%, $t2n $t2p%, tie $t0p%"
        fi
    done

  7. Thurhame says:

    The standard deviation is meaningless, since the graphs of yes/no questions (such as “is NO ranked higher than MIA”) are not normally distributed. Instead, the result is always either yes/true/1/100% or no/false/0/0%.

    One can easily calculate the exact probability you are trying to find (in the case of NO being better than MIA, 763/988 or approximately 77.2%). Here are the exact probabilities for all the matchups:

    NE-TB: 100%
    IND-STL: 100%
    SD-KC: 85.1% (143/168)
    MIN-PIT: 83.0% (122/147)
    PHI-WAS: 86.7% (234/270)
    NO-MIA: 77.2% (763/988)
    ATL-DAL: 81.1% (107/132)
    GB-CLE: 66.3% (197/297)
    SF-HOU: 68.9% (375/544)
    CHI-CIN: 55.4% (31/56)
    NYG-ARI: 54.3% (25/46)
    NYJ-OAK: 58.3% (7/12)
    CAR-BUF: 50%

    Note 1: This is NOT the probability of the first team winning. This is the probability of the first team having the higher EdgePower ranking.

    Note 2: These numbers depend on TT’s assumption that each numerical rank possible for a team is equally likely.

    Note 3: These numbers consider each team independently; i.e. as if one team’s ranking could not affect another team’s ranking.

    In my opinion, it is always best to spell out the limitations of any probabilities you post.

  8. Tom says:

    I agree with Thurhame that statistical methods are not great when dealing with an ordinal variable (ordered ranks) as opposed to interval values (raw numbers). I think this is why most statistics-heavy football analysis sites tend to use the statistics on variables like completion percentage, or yards per attempt, or turnovers per game, i.e. a variable that is a raw number, not a number indicative of relative position.

    That said, the EdgePower of a team *is* an interval variable: it’s the win-loss record of a team, extended through the league schedule via the transitive property. Running statistical methods on EdgePower may have, if not a normal distribution, at least a distribution that is tractable to statistical methods.

  9. ThunderThumbs says:

    Thanks, Thurhame – I’ve generally been opposed to methods that use probability of winning, because there just intuitively doesn’t seem to be any room for probability in efforts that aren’t random, like football games.

    I think I see how you’re getting your equations – those are the total possible number of ranking combinations each team could have as the denominator (although I think you have to subtract 1/2 of the ties, right? Or, 973 instead of 988 for NO/MIA?)

    So it’s, the percentage of possible rankings that NO is ranked ahead of MIA. Not exactly probability, since I would guess that many of these rankings are far more sane (and more “probable” in terms of relative team quality) than others.

  10. ThunderThumbs says:

    Also interesting to consider that no matter what ranking order we choose, the edgepower numbers of these teams always remain the same. Is there anything interesting in the fact that the ordering by possible rankings differs from the edgepower confidence ordering? For instance, PHI/WAS being third, while fifth in confidence?

  11. ThunderThumbs says:

    Thurhame, what’s the formula for figuring the numerator?

  12. ThunderThumbs says:

    Oh, I think I got it… man, this takes me back years. You have to split them up, but I’m still getting different numbers. Here’s an example with NO and MIA.

    NO: 1-19
    MIA: 5-30

    If they’re both in the range of 5-19, they can be in any order. They can’t be ranked the same, so this is 14*14 = 196 combinations of rankings within this range. NO is ahead of half, and MIA is ahead of half.

    If NO is ranked 1-4, they’re ahead of MIA for all of MIA’s possible rankings – this is 4*26 (5-30 inclusive) = 104.

    If MIA is ranked 20-30, NO is ahead of them for its entire range; this is 19*11 (20-30 inclusive) = 209.

    This makes 509 possible ranking slots for them, with NO ranked ahead for 411 of them, for 80.75%

    Am I missing any set of numbers?

  13. ThunderThumbs says:

    And it just occurred to me that these numbers are meaningless anyway because they don’t take into consideration the possible number of rankings outside their ranges. For instance, there’s only one way for NO to be ranked #1, but there are several ways for them to be ranked #4 – this would affect these other numbers. Really, the only way to get numbers we’re looking for here would be to calculate the total number of possible rankings given a possible beatpath graph, and then examine that set. I know there are ways to do this using graph theory.

  14. ThunderThumbs says:

    By the way, the way to ask the question is “What is the number of linear extensions of a DAG?” or “What is the number of topological sorts of a DAG?” And it turns out this is a hard problem to solve. With 32 nodes it’s probably doable, but it won’t scale well – I’m not sure I could solve it for college football or college basketball.
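
For a sense of what counting linear extensions involves, here is a small Python sketch using a standard dynamic program over subsets of already-placed teams. It is a generic textbook technique, not anything Beatpaths is known to use, and it needs up to 2^n states for n teams, which is why the approach scales badly. The function name and the four-node example graph are hypothetical.

```python
from functools import lru_cache

def count_linear_extensions(num_nodes, edges):
    """Count the topological orderings of a DAG.

    Nodes are numbered 0..num_nodes-1; each edge (winner, loser) means the
    winner must appear before the loser in any valid ranking.
    """
    # preds[v] is a bitmask of the nodes that must be placed before v.
    preds = [0] * num_nodes
    for winner, loser in edges:
        preds[loser] |= 1 << winner
    full = (1 << num_nodes) - 1

    @lru_cache(maxsize=None)
    def count(placed):
        # placed is a bitmask of nodes already given a rank.
        if placed == full:
            return 1
        total = 0
        for v in range(num_nodes):
            bit = 1 << v
            not_yet_placed = (placed & bit) == 0
            preds_all_placed = (preds[v] & placed) == preds[v]
            if not_yet_placed and preds_all_placed:
                total += count(placed | bit)
        return total

    return count(0)

# Diamond-shaped graph: 0 beats 1 and 2, and both of them beat 3.
diamond = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(count_linear_extensions(4, diamond))
```

In the diamond example only the middle two nodes can swap places, so the number of valid rankings is small; a sparse 32-team graph leaves far more freedom.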

  15. Thurhame says:

    “…the EdgePower of a team *is* an interval variable: it’s the win-loss record of a team, extended through the league schedule via the transitive property. Running statistical methods on EdgePower may have, if not a normal distribution, at least a distribution that is tractable to statistical methods.” – ThunderThumbs

    True. However, if that’s what you were using to calculate the standard deviation, it has no relation to the percentages you gave, so I have no idea what it means.

    “They can’t be ranked the same,…” – ThunderThumbs

    If you take that into account, then for NO@MIA the overlap would be 15*14=210. I didn’t think of doing that, so my numbers used 15*15=225.

    “…these numbers are meaningless anyway…” – ThunderThumbs

    Absolutely, as Note #2 in my previous post indicates.

    “…there just intuitively doesn’t seem to be any room for probability in efforts that aren’t random, like football games.” – ThunderThumbs

    While it may be true that football games are non-random from an absolute standpoint, given the information available to us they become random from our point of view. That is what probabilities of winning represent.

  16. Tom says:

    @Thurhame

    Your first quote there is from me, not TT. I’m not using EdgePower for anything statistical at the moment, although TT may be. I *am* calculating variance of an ordinal variable (for the ranking stability stuff), but I don’t see that as terribly problematic.

  17. Thurhame says:

    Sorry. I guess my eyes just saw the T and skipped over the rest of the name.

  18. ThunderThumbs says:

    Ach, sorry Go Niners! I think your comments were held for moderation since you were new. Thanks for the explanation!
