Ranking Stability – Wildcard Week

Despite having only four games instead of the usual 16, the results this week still caused significant reordering of the ranks, especially among mid-ranked teams. I think this is a good illustration of just how sensitive Beatpaths rankings are to the strength of the field. The weeks in which rankings were more stable, despite having 16 games’ worth of data to integrate (e.g. Weeks 10-12, 15 & 17), are all the more amazing in light of the significant changes the four Wildcard games brought. The “big mover” this week was Washington, plummeting 11 ranks after their wins over Philadelphia were looped away.
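The post does not spell out how "stability" is scored. One plausible reading, consistent with the follow-up note that three swapped pairs produced a total change of six, is the sum of each team's absolute rank movement from one week to the next. A minimal sketch under that assumption (the function names and team labels are hypothetical):

```python
def total_rank_change(before, after):
    """Sum of absolute rank movements between two ordered lists of teams."""
    pos_before = {team: i for i, team in enumerate(before)}
    pos_after = {team: i for i, team in enumerate(after)}
    return sum(abs(pos_before[t] - pos_after[t]) for t in before)


def biggest_mover(before, after):
    """Return (team, movement); positive movement means the team fell."""
    pos_before = {team: i for i, team in enumerate(before)}
    pos_after = {team: i for i, team in enumerate(after)}
    team = max(before, key=lambda t: abs(pos_after[t] - pos_before[t]))
    return team, pos_after[team] - pos_before[team]


# Hypothetical six-team ranking in which three adjacent pairs swap places.
last_week = ["A", "B", "C", "D", "E", "F"]
this_week = ["B", "A", "D", "C", "F", "E"]
change = total_rank_change(last_week, this_week)
```

A lower total means a more stable week; a single team dropping 11 places would contribute at least 11 to the total on its own.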

11 Responses to Ranking Stability – Wildcard Week

  1. Rick says:

    I think this is a fine illustration of my problems with the Beatpath concept as it currently stands.

    By now, there should be far greater stability in the rankings, yet there is not. I know somebody suggested that by the end of the season, some 100k “beatpaths” are considered, which implies there is plenty of data.

    In fact, there is very little data. There are 16 games a week over 16 weeks (assuming no bye weeks). This means that Beatpaths is only working with 256 discrete data points. Given that some of these data points are lost due to beatloops, there is technically even less than 256…which is really not a lot of data to work with.

    It’s no surprise that it’s so unstable at the end of the season, it’s working with limited data. The 100k supposed “beatpaths” are extrapolations based on 1 single event (or slightly more than one event – typically not more than 16, as in the case of New England last year, or Detroit this year). So the additional beatpaths above the 256 are purely hypothetical and thus not legitimate…hence your beatloops.

    Overcoming the beatloops and the anomalies that lie in (for example) 2 Philly losses to Washington, requires more data.

    Last week, it should have been relatively clear that Washington was WORSE than Philly despite their 2 wins…even if that “worse” was by 1 position or 2 (though I’d argue more, given that Washington lost to lowly St. Louis, which somehow got beatlooped away…)

  2. Tom says:

    I too was puzzled that Washington’s two wins over Philadelphia had not already been looped away by other data last week. That and that the Bengals’ win over the Skins was the only thing holding them up to such a high rank (over both the Skins and Eagles), despite their otherwise dodgy season. However, I might note that despite the seeming incongruity of ranking the Bengals so high, their Retroactive Pick Record was actually quite good in that slot. It might simply be that Beatpaths is telling us something counterintuitive about the Bengals, which is exactly what it was designed to do.

    I think you’re right about the relative paucity of data in the Beatpaths system compared to something like Accuscore, which uses thousands of data points and hundreds of variables and complex simulation engines developed for analyzing genetic mutations, tribal migration, etc. Nevertheless, I think the competitiveness of Beatpaths despite its parsimony is what makes it all the more interesting. Getting a good pick record by analyzing every single data point possible using banks of computers *ought* to do better than rival systems, but it’s not necessarily more explanatory. With Beatpaths you can give a concise explanation of why one team is favored/ranked over another, whereas with Accuscore or Football Outsiders you would need to explain why a certain aggregation of a select set of subjectively chosen statistical data points, when uniquely combined for this matchup, gave a probabilistic result leaning one way over another.

    I’m also not comfortable condemning Beatpaths as irretrievably unstable compared to other rankings. I’d like to examine the overall stability of DVOA rankings throughout the year and compare them to Beatpaths’ before rushing to any conclusions about which is more stable/accurate.

  3. ThunderThumbs says:

    I should point out my own perception that after doing this every week of the season for the last four or five years, this year by far felt the weirdest. I’m liking the extra controversy due to the weirdness, but it still stands out to me that while in other seasons we’ve had at most one or two teams that seemed completely bipolar, this season it seemed more like four or five.

    This is the inherent “flaw” or challenge with any system that determines to judge the quality of a team by looking at a season’s data. You can’t apply one personality to a team that by definition has multiple personalities. Not to mix my psychiatric metaphors even further, but the inability to nail down schizophrenia into some form of consistency is not a flaw that is limited to the beatpaths approach.

    Not to denigrate anyone with any particular psychiatric conditions 🙂 – just seems to be the best comparison to use.

  4. chris clark says:

    A “quick” response on the number of games is “too small” issue. Compare this to polling results. They predict how hundreds of millions of people will vote in a National election based on samples of just a few thousand with accuracy within a few percentage points.

    To get a relevant scaling, let’s see what the population size we are interested in is. If we are doing a strict ranking, we could do so with only knowing the preference ranking of each team versus each other team as could be established in a round-robin tournament (presuming that the better team won each time–and I’ll get to that aspect in a bit). That works out to each of the 32 teams playing 31 games (i.e. one game versus each other team). Currently, each team plays 13 of the 31 potential opponents, playing 3 of the opponents twice to make up the 16 game regular season. If one looks at that, the games played actually sample 1/3 of the potential matchups which is quite good coverage when one thinks of it.
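The coverage figure can be checked with a few lines of arithmetic. This sketch assumes only the schedule described in the comment (32 teams, 16 games each, 13 distinct opponents per team); the variable names are illustrative.

```python
from math import comb

TEAMS = 32
GAMES_PER_TEAM = 16
DISTINCT_OPPONENTS = 13  # 3 of them (the division rivals) are played twice

# Every unordered pair of teams is one potential matchup in a round robin.
possible_pairings = comb(TEAMS, 2)

# Each pairing involves two teams, so halve the per-team totals.
distinct_pairings_played = TEAMS * DISTINCT_OPPONENTS // 2
games_played = TEAMS * GAMES_PER_TEAM // 2

coverage = distinct_pairings_played / possible_pairings
```

Counted this way, 208 of the 496 possible pairings are actually played, which is about 42 percent, somewhat above the one-third quoted; if anything that strengthens the point about coverage.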

    Now, to the second half of the issue, whether the games actually demonstrate the better team. That clearly is questionable. This is partially why division rivals play each other twice. That gives two measurements of each team pairing, under different conditions, so that we have reasonable confidence that the measurements are independent. If both measurements agree and we have a sweep, we have more confidence that both are right. If the measurements differ, the result is questionable. This is why in many sports, such as baseball, basketball, and hockey, they have n-game series to determine the winner of a pairing. It gives one more independent results. If each team played each other team four times, twice home, twice away, we would have much better data. If we had six games, the data would be much better. However, we aren’t likely to see football teams play each other that many times in a season. Note, even if we had these stats, our 256 game season is not necessarily such a small sample compared to the realistic space, instead of 1/3 the games it goes down to 1/18th the games. One does not need to poll 1 person in 20 to get a good estimate of an election as long as the poll isn’t biased and our 13 games are not likely a biased sample.

    Still, you can get more results by using a finer scale to measure each game, as we have discussed before. However, there is one aspect of which one needs to be careful when getting finer data, that the data is independent. Much of statistics depends on that assumption and careless measurements of more data from the same games is likely to invalidate it. For example, teams that are behind and trying to catch up, generally throw more passes, and thus accumulate more “junk” yards, especially if the other team switches into “prevent defense” where they give up short passes to prevent long strikes. Thus, offensive yards is not completely independent of the game’s outcome. It may give us more information, but a team that gains 400 yards of offense in a game is not necessarily better than one which gained 350 in a different game.

    Now, the Football Outsiders folks have done some studies to try and better quantify those numbers, so that they can better compare yards in an apples-to-apples fashion. And to some extent that lets one look at each play as a more independent event. However, there are some built-in assumptions in their model. When using that data at a deep level one needs to understand those assumptions. Again, arbitrarily making more measurements of non-independent events does not make the data more statistically sound. The FO people have tried to validate that their data has independence and predictive value by comparing it to the season split numbers for teams from different years, i.e. checking it against data which is likely to be statistically independent. To the extent one trusts their analysis (and assumptions), one can trust their results.

    Thus, one can suggest any metric one likes as a way of getting more data into a system, but unless one validates the predictive power of using the data and shows that it produces better results over a statistically significant set of data, one isn’t making the results more reliable. The same is true of the FO measurements. One can argue all day about whether certain events (e.g. fumbles) are significant in games, clearly they are, but they don’t necessarily make good statistical predictors. Fumbles recovered in one game do not predict fumbles predicted in another. Therefore, one could add fumbles recovered as something one measures, but using it to rank teams is not going to give one a reliable ranking.

    Finally, as Tom said, ranking stability is not necessarily the best judgment criteria. Teams do improve and regress and play at varying levels at varying weeks. If these events are occurring and the measurement isn’t reflecting them, it is actually a bad measurement. On the other hand, to be useful you do want some predictive power and that requires that your estimate match the actual result and for that stability of an estimate (when the data itself is stable) is a good thing.

    More interesting from that point of view is whether one could modify the model to predict which teams are improving and which are regressing and how much variance each team has in its performance. You get some hints of the variance at least by looking at the way the different resolution methods rank the teams. If one looked at all possible resolutions of loops and relevant team rankings, one might get an estimate of team variability. (I suggest that the more variable teams would have more different potential rankings over a variety of loop resolutions.) Then one would need to validate whether that is a good estimate by looking at how it worked over several years’ data and whether it made better predictions.

    When I get a chance, I will come back to the 400 yard game versus 350 yard game and predictive power with implications for the weighted method of breaking beatloops.

  5. chris clark says:

    does not predict fumbles predicted in another

    should be

    fumbles recovered in another

  6. Tom says:

    @chris clark

    I think one of mm’s earlier comments captured well the “variability” of individual team rankings, even when building an ‘optimal’ retroactive ranking:


    One way of approaching “streakiness” or improvement/collapse over the course of a season is to break loops using the oldest links in the loops, preserving more recent data. That might do a better job of capturing a team like the Redskins who switched from a 6-2 team in the first half of the season to a 2-6 team in the second half of the season.
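The oldest-link idea can be sketched as follows. This is only an illustration of the suggestion in the comment, not how Beatpaths actually resolves loops; the function names, the data layout (one result per ordered pair of teams, keyed to the week it was played), and the team labels are all assumptions.

```python
def find_cycle(edges):
    """Return the (winner, loser) edges of one loop, or None if there is none."""
    graph = {}
    for winner, loser in edges:
        graph.setdefault(winner, []).append(loser)
    done = set()  # teams already explored with no loop reachable from them

    def visit(node, path):
        if node in path:  # we came back to a team already on the current path
            loop = path[path.index(node):] + [node]
            return list(zip(loop, loop[1:]))
        if node in done:
            return None
        for nxt in graph.get(node, []):
            found = visit(nxt, path + [node])
            if found:
                return found
        done.add(node)
        return None

    for start in list(graph):
        found = visit(start, [])
        if found:
            return found
    return None


def break_loops_oldest_first(games):
    """games maps (winner, loser) -> week played; drop the oldest link per loop."""
    kept = dict(games)
    while True:
        loop = find_cycle(kept)
        if loop is None:
            return kept
        oldest = min(loop, key=lambda edge: kept[edge])
        del kept[oldest]


# Hypothetical three-team loop: A beat B in week 1, B beat C in week 5,
# C beat A in week 9. Only the week 1 result is discarded.
resolved = break_loops_oldest_first({("A", "B"): 1, ("B", "C"): 5, ("C", "A"): 9})
```

Swapping `min` for `max` would instead discard the most recent link, which gives the opposite "early season" view of the same team.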

  7. mm says:

    When talking about the problems with rankings, there’s one that can get overlooked, but bears repeating whenever one talks about limitations.

    In mathematics, if A>B and B>C, then A>C, but this doesn’t seem to hold in team sports. A team might be particularly prone to a 4-3 defense, or a runner who frequently cuts back, or a tight end who abuses slow linebackers/safeties. Thus, a team might consistently beat good teams that can’t attack their Achilles’ heel, but fall to otherwise weaker teams. It is quite possible to have A>B, B>C, and yet C>A.
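To make the A>B, B>C, C>A case concrete, here is a small sketch that lists every three-team loop in a set of results. The function name and the team labels are hypothetical, and the results are invented for illustration.

```python
from itertools import permutations


def intransitive_triads(wins):
    """wins is a set of (winner, loser) pairs; return each three-team loop once."""
    teams = sorted({team for pair in wins for team in pair})
    triads = set()
    for a, b, c in permutations(teams, 3):
        if (a, b) in wins and (b, c) in wins and (c, a) in wins:
            # The same loop shows up three times (rotated); keep the rotation
            # that starts with the alphabetically first team.
            if a == min(a, b, c):
                triads.add((a, b, c))
    return sorted(triads)


# Hypothetical results: A beat B, B beat C, C beat A, and A also beat D.
loops = intransitive_triads({("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")})
```

Any team that appears in such a triad cannot be placed consistently against the other two by a single number, whichever ranking system produces that number.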

    This isn’t a problem limited to beatpaths, but other systems can have ways around it that beatpaths do not. If Football Outsiders wants to preview a game in depth, they’ll look at matchups vs. #1 and #2 receivers, vs. tight ends, and other matchups in an effort to see if there are particular vulnerabilities which one team seems able to exploit. Since beatpaths only looks at the big picture (that’s the whole point), it doesn’t have this ability.

    Of course, this problem also means that the winner of a single-elimination tournament isn’t really the ‘best team’, but it’s still fun.

  8. chris clark says:


    Yes, I remember the comment and that’s exactly what I think we can capture. I suspect there is even a statistical way of capturing it. If you calculate all the possible rankings and ratings, you can then find averages and standard deviations of the values for each team.
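A rough sketch of that calculation, assuming the candidate rankings have already been produced by some other means (for example, one ranking per way of resolving the loops). How those rankings are generated is left open here, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def rank_spread(rankings):
    """Map each team to (mean place, standard deviation of place)."""
    places = {}
    for ranking in rankings:
        for place, team in enumerate(ranking, start=1):
            places.setdefault(team, []).append(place)
    return {team: (mean(p), pstdev(p)) for team, p in places.items()}


# Two hypothetical resolutions that disagree only about B and C.
spread = rank_spread([["A", "B", "C"], ["A", "C", "B"]])
```

A team whose place never changes across resolutions gets a standard deviation of zero; the larger the deviation, the more the team's rank depends on which loops were broken.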

    Your second point about breaking either the oldest or the youngest links in a loop would work well for teams that are improving or getting worse. You could do the same kind of statistics on those teams, seeing if the different orders of breaking the loops make the team go up or down in the ranking and if so, determine whether the team is peaking or fading.

    I remember that last year (or perhaps it was two years ago) someone did that kind of analysis on the DVOA of different teams, complete with colored graphs and everything. I don’t know if they kept it up. I haven’t seen it this year.

  9. chris clark says:

    @ mm

    And, if you boil FO rankings down to any single number, such as DVOA, it will have the same problem. Any one number system requires transitivity: A>B and B>C imply A>C, while in sports, you are correct in observing we can have intransitive relationships. The same thing happens in other preference items too. Chocolate may be my “favorite” thing, but that doesn’t mean that I want chocolate on my “shrimp scampi” in place of Parmesan cheese.

    My favorite intransitivity in football happened a few years ago and was called rock-paper-scissors, for NE, DEN, IND. They were all great teams, but each one had a weakness that only one of the other two could exploit. I think earlier SF, DAL, GB had a similar triad.

  10. Tom says:

    @chris clark

    I wonder if there’s a way to preserve as much of the ambiguity as possible when making picks. When doing the overall rankings, Beatpaths simply discards ambiguous information, getting rid of any loops that form. Do you think it’s possible to simply examine Team A vs. Team B and preserve as many of the ambiguous relationships as possible? How many of the triangular relationships can we keep when just comparing the relative strength of two teams?

  11. […] Tom here with the Beatpaths ranking stability. Last week, the results of only four wildcard round games, instead of the normal 16, nevertheless produced significant changes in the rankings. This week, the four divisional round games produced the opposite results: only three pairs of teams switch places. The total change in ranks was a mere six. […]
