What a fabulous, fabulous game. I’ve been inhaling the sports coverage all night, watching the various key moments get nurtured into iconic parts of sports history. There are photos of Tyree’s catch that are just unreal, where you can’t fathom that a split second later, he didn’t lose possession.
I commented on this over at FO’s discussion board – how do you rationalize this loss? Either it is the biggest mismatch in Super Bowl history, which means that it was the biggest choke job in Super Bowl history… or, it wasn’t a choke/upset, which means that it wasn’t as much of a mismatch as we thought.
My original motivation for starting beatpaths was that there was more that went into wins and losses than all the possible statistical measures in a game. Whether it’s mismatches, playcalling, or just simple heart, I felt that the statistical extrapolation approach was missing something obvious – that some teams were just better at winning. So the desire to come up with a ranking system based only on wins, losses, and who beat who, was a bit of a perverse desire to make a point. I think that point has been made pretty well over the past few seasons, but even with that, you get results like tonight’s that contradict even just wins and losses.
The beatpath graph and rankings will still show the Patriots ranked ahead of the Giants – in fact, New England still has a beatpath to the Giants through Dallas. Which really just means that this result is seen as an upset. Which I think is accurate. But watching the game, it sure didn’t seem like an upset. A close game, sure, but not an upset. And I don’t particularly like the Any Given Sunday rationale, that the Giants were the better team today, but that the Patriots would still win eight out of ten times or whatever. I think that’s fairly graceless. The Patriots got beat up tonight.
So which is it? Was it flukey? A choke job? Biggest upset ever? Or was New England’s dominance a charade? Perhaps the Giants became a great team over the past few weeks while the Patriots lost their greatness?
One thing this makes me curious about is throwing out season series and counting only the most recent game between two teams (while still counting all other games), to see how that affects the prediction percentage.
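Here is a minimal sketch of the “most recent game only” filter. It assumes each game is stored as a (week, winner, loser) tuple. The tuple layout, the function name `most_recent_only`, and the choice to treat the Super Bowl as a hypothetical “week 21” are all illustrative assumptions. This is not how beatpaths actually stores games.

```python
# Each game is (week, winner, loser). These results are illustrative only;
# the Super Bowl is treated as a hypothetical "week 21".
games = [
    (17, "NE", "NYG"),
    (21, "NYG", "NE"),
    (6, "NE", "DAL"),
]

def most_recent_only(games):
    """Keep only the latest game for each pair of teams; games between
    pairs that met once pass through unchanged."""
    latest = {}
    for week, winner, loser in games:
        pair = frozenset((winner, loser))  # order-independent matchup key
        if pair not in latest or week > latest[pair][0]:
            latest[pair] = (week, winner, loser)
    return sorted(latest.values())

print(most_recent_only(games))  # -> [(6, 'NE', 'DAL'), (21, 'NYG', 'NE')]
```

The filtered list would then feed into the normal graph build, so only the season series are affected.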
I think the Patriots’ perfection had as much to do with luck as it did with skill. Look for example at the Baltimore game and even the first Giants game. A few bounces here and there, and the outcome could have been different. I think the Patriots were also worn out by the constant struggle for perfection and the weight of expectations. The Giants won because they could go all out, with absolutely nothing to lose. They’ve been playing with that “leave everything on the field” mentality for weeks. The cliché phrase “they wanted it more” seems to apply to the Giants quite well.
Maybe you could do something like FO does with their DVOA metrics and discount early-season wins as the season goes on?
Overall, great game. Still can’t believe it. Huge props for Eli.
If I were to characterize this game, though, I’d say that the thing that stuck out to me the most was Brady. He just did not look like Tom Brady. He was missing easy throws–even when he wasn’t in trouble from the Giants pass rush. I think that was the story of the night. I don’t know if the boot stuff was real, but something caused Brady to fall off from where he’s been, and that’s why they lost (in my opinion).
TT,
Throwing out the earlier game is somewhat akin to the idea of weighting more recent games more heavily, which I believe would improve predictions somewhat.
The cliff notes on the game are pretty simple, really:
- The Pats defense and Giants offense were basically as advertised. 17 points is a perfectly respectable job by that defense.
- The Giants’ D-line was incredible, playing their best game of the season by far. The Pats’ O-line, the best this year by far, had their worst game of the season by far. To say that Brady faced more pressure Sunday than any other game this season is a drastic understatement. He responded with a fairly mediocre game.
- The Giants coaching staff did a better job. The Patriots failed to make the proper offensive adjustments until it was too late.
None of these things were really predictable. A game like this will be an upset by any measure – everybody’s numbers across the web back this up.
The problem I’ve always had with discounting games is that I don’t know how much to discount them by. As soon as you get into degrading, you get into percentages – what is half a win? a third of a win? – and the decision on what percentages to use is what I define as subjective.
I could pick a slope, where the most recent game is full power, and the game x weeks ago is 0 power, but… then the question becomes what x is. And then it also kind of gets away from judging things by season.
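Here is a minimal sketch of the linear slope described above. The most recent game gets weight 1, a game x weeks back gets 0, and everything in between falls on a straight line. The function name and the example horizon of 8 weeks are hypothetical. Picking x is exactly the open question.

```python
def linear_weight(weeks_ago: int, x: int) -> float:
    """Linear decay: weight 1.0 for the most recent game (0 weeks ago),
    falling to 0.0 for a game x weeks ago and staying at 0.0 after that."""
    if x <= 0:
        raise ValueError("x must be positive")
    return max(0.0, 1.0 - weeks_ago / x)

# With a hypothetical horizon of x = 8 weeks:
print([linear_weight(w, 8) for w in range(0, 10, 2)])  # -> [1.0, 0.75, 0.5, 0.25, 0.0]
```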
So I guess I’ve always found the idea tempting, but that’s about where I get stuck. I think replacing an early-season game with a later-season game between the same two teams is vaguely defensible, but it’d at best be a variation.
I’ve never liked the idea of discounting wins, or some form of partial wins. At that point you’re basically saying a game (win or lose) early in the season doesn’t mean as much as one later in the season. That gets into some of the problems in all the various college football polls, where a team that loses early but comes back and finishes strong will be looked on more favorably than a team that suffers a late loss.
One thought would be to somehow treat the playoffs differently than the regular season, though I’m not sure exactly how. It would be akin to treating the playoffs like a somewhat new season with fewer teams.
Another thought would be to assign different weight based on what was the predicted outcome. Basically, if a team beats a team that previously had a beatpath to it, that win may have more weight to it, as they beat a team that was supposedly superior to them. That could produce some interesting dynamics, but I’m not sure if it would work or provide any better picture into things.
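Here is a minimal sketch of that idea. It assumes a plain directed graph of wins with no beatloop resolution. Before a new result is counted, the code checks whether the loser already had a beatpath to the winner. If it did, the win gets extra weight. The names `has_beatpath` and `win_weight`, the 1.5 `UPSET_BONUS`, and the tiny example graph are all hypothetical. None of them come from beatpaths itself.

```python
from collections import deque

def has_beatpath(edges, start, goal):
    """Breadth-first search over (winner, loser) edges: is there a chain
    of wins start -> ... -> goal?"""
    graph = {}
    for winner, loser in edges:
        graph.setdefault(winner, []).append(loser)
    seen, queue = {start}, deque([start])
    while queue:
        team = queue.popleft()
        for nxt in graph.get(team, []):
            if nxt == goal:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

UPSET_BONUS = 1.5  # hypothetical multiplier, not a tuned value

def win_weight(prior_edges, winner, loser):
    """Weight a new win more heavily if the loser already had a beatpath
    to the winner in the graph built from earlier games."""
    return UPSET_BONUS if has_beatpath(prior_edges, loser, winner) else 1.0

prior = [("NE", "DAL"), ("DAL", "NYG")]
print(win_weight(prior, "NYG", "NE"))  # -> 1.5 (NE -> DAL -> NYG existed)
print(win_weight(prior, "NE", "DAL"))  # -> 1.0
```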
It sounds like you are surprised that the Giants won. I would say I am surprised that the Giants beat the Patriots, but I am not surprised at the upset to the beatpaths graph. Your success rate for the regular season was 62.1%. So wouldn’t you expect that in any unbiased sample of games you would get about 3/5 correct and 2/5 wrong? Say your sample is the Super Bowls. Then every 5 years you will get 2 wrong. This game happened to be one of the 2 “upsets” in a 5-year period.
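Here is the arithmetic behind that estimate, as a small sketch. It treats each game as an independent pick that is right 62.1% of the time, which is a simplifying assumption. Over five games that gives about 1.9 expected misses and roughly a 91% chance of at least one miss.

```python
from math import comb

p_correct = 0.621  # regular-season pick rate quoted in the comment
n = 5              # five games, e.g. five Super Bowls

expected_wrong = n * (1 - p_correct)

def prob_exactly_wrong(k: int) -> float:
    """Binomial probability of exactly k wrong picks out of n independent games."""
    return comb(n, k) * (1 - p_correct) ** k * p_correct ** (n - k)

print(round(expected_wrong, 1))              # -> 1.9
print(round(1 - prob_exactly_wrong(0), 2))   # chance of at least one miss -> 0.91
```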
Not only that, but I would say that a majority of your 62.1% came from matchups like the #3 team playing the #26 team, and most of your upsets likely came when 2 closely matched teams played (like #8 versus #13). This Super Bowl had your #1 team play your #6 team. I don’t think it unreasonable that the #6 could win. It’s not like it was #1 vs. #15.
I think this explains how an “upset” is reasonable. The chance of getting any game correct is only 62.1%, barely over 50%. If you can think of any ways to improve beatpaths then that would certainly be good, but I wouldn’t do it primarily because 1 game was wrong. I would highly recommend NOT discounting wins unless you determine a slope empirically based on previous years’ data. You can’t even assume the slope is a straight line; it could be curved (for example, last game is 1/2, previous is 1/3, then 1/4, etc.). This would take forever to get a reasonable guess.
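The curved profile in that example is easy to write down. The sketch below assumes `games_ago = 0` means the most recent game. Exact fractions are used so the 1/2, 1/3, 1/4 pattern is visible. The function name is hypothetical, and this only shows one possible curve. It is not an empirically fitted one.

```python
from fractions import Fraction

def harmonic_weight(games_ago: int) -> Fraction:
    """Curved decay: most recent game 1/2, the one before 1/3, then 1/4, ..."""
    return Fraction(1, games_ago + 2)

print([str(harmonic_weight(k)) for k in range(4)])  # -> ['1/2', '1/3', '1/4', '1/5']
```

Unlike a straight-line slope, this curve never reaches zero. Even very old games keep a small, slowly shrinking weight.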
This is the biggest upset in Super Bowl history. Bar none.
From any statistical standpoint, and that is all you can measure it on, the Giants should’ve been badly beaten. In fact, what you CAN say is that they picked the wrong MVP. Eli didn’t win the game, the Defensive Line did. By preventing Brady from scoring often (and their aggressive play prevented at least 10, possibly 14, points), they allowed Eli to have a shot at the finish. Which, apparently, was all he needed.
The Patriots didn’t lose that game, they got BEAT. Flat out. As a result, it’s a huge upset.
Heart is always something that comes into play in big games – the concept of “who wants it more?” is paramount, but MOST times this has little in the way of overcoming talent disparities and conditioning. In only 9 other cases was the desire able to overcome the talent differential. Make that 10, now.
That’s not a good number if it’s what you use to determine who is going to win. The stats will always tell you more and be much more useful on average. That’s why won-loss records as a measure are less reliable than measures like points or yards gained.
But you can’t factor in things like desire, and fumble recoveries are a fluke (well, the one in this game was clearly down by contact and should’ve been the Pats’ ball…so that was a bad call, but I doubt it would’ve altered the outcome significantly based on the Giants’ D Line). So, when it comes to flukiness, a lot of things went the Giants’ way. That Great Escape by Eli was amazing…and the catch even more so.
Either way, Manning was pedestrian and his numbers boring. Even the final drive wasn’t that astounding except for the blown sack (he almost had 2 INTs on that drive).
No, the story of the game was the Giants D Line….and that is further proof that Defense WINS championships.
JT,
It’s extremely important to draw a distinction between rankings that are DESCRIPTIVE, and rankings that are PREDICTIVE.
A descriptive ranking should definitely not discount wins that happened a long time ago, because the very point of a descriptive ranking is to describe the entire body of work of a season. If the goal of beatpaths is to describe most accurately which team has had a stronger season, then it absolutely should not discount early wins. The same would be true for college polls.
On the other hand, if the rankings are intended to be predictive, then you want them to reflect which teams are playing better right now. If that is the case, then it absolutely makes sense to track which way a team is trending. There are many approaches to doing that, but it generally means applying some sort of decay profile to earlier results, such that things that happened farther and farther back become less and less significant.
(I would argue that if the NCAA div 1 sets up a 4-team playoff, then the 4 teams should be selected using a descriptive formula, but seeded using a predictive formula.)
If the goal of beatpaths is to be the best predictor of future performance, then it probably makes sense to apply some sort of weight that drops off over time. Of course, it’s fairly tricky to do this for standard beatpaths, since it doesn’t really have a place to plug in a weight. Iterative beatpaths can do this fairly easily, though.
As far as what exactly the weight should be… well, it’s the one that makes the predictions most accurate. If you don’t feel like figuring out exactly which weights are the most accurate, then start with something like Football Outsiders’ weighted DVOA decay:
Last six weeks: 100%
7-8 weeks ago: 99%
9 weeks ago: 93%
10-12 weeks ago: 60%
13-14 weeks ago: 15%
15+ weeks ago: 0%
Those sudden jumps seem pretty odd to me, but it’s not too tough to come up with a logistic curve that fits that profile (in an MMSE sense).
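Here is a minimal sketch of one way to do that fit. It transcribes the step profile above into a lookup function, then runs a brute-force grid search for the logistic 1/(1 + a·e^(b·week)) that minimizes the squared error over weeks 1 to 17. Several things here are assumptions: counting “weeks ago” from 1, the 1 to 17 span, and the grid bounds. A proper optimizer would refine the result further. The sketch does not claim to reproduce any particular constants.

```python
import math

# Step profile from the weighted-DVOA list: (last "weeks ago" covered, weight).
# Counting weeks ago from 1 is an assumption about indexing.
FO_PROFILE = [(6, 1.00), (8, 0.99), (9, 0.93), (12, 0.60), (14, 0.15)]

def fo_weight(weeks_ago: int) -> float:
    """Step-function weight for a game `weeks_ago` weeks back."""
    for last_week, weight in FO_PROFILE:
        if weeks_ago <= last_week:
            return weight
    return 0.0  # 15+ weeks ago

def logistic(week: float, a: float, b: float) -> float:
    """Logistic decay 1 / (1 + a * e^(b * week))."""
    return 1.0 / (1.0 + a * math.exp(b * week))

WEEKS = range(1, 18)  # weeks 1..17 back (assumed span)

def sse(a: float, b: float) -> float:
    """Sum of squared errors between the logistic curve and the step profile."""
    return sum((logistic(w, a, b) - fo_weight(w)) ** 2 for w in WEEKS)

# Coarse brute-force MMSE search; the grid bounds are arbitrary choices.
a_grid = [10 ** (-k / 4) for k in range(0, 41)]  # 1 down to 1e-10
b_grid = [j / 20 for j in range(1, 41)]          # 0.05 up to 2.0
best_sse, best_a, best_b = min((sse(a, b), a, b) for a in a_grid for b in b_grid)
print(f"a = {best_a:.2g}, b = {best_b:.2f}, SSE = {best_sse:.4f}")
```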
Replacing an early-season game with a later-season game seems extremely arbitrary to me – to me, this says “in this one case we will ignore early season results, but all the other early season results will still be counted 100%”.
One thing to point out is TT’s disclaimer at the beginning of each week’s pick page, “If you use BeatPaths to pick, you’re insane.” BeatPaths is not designed to be predictive, as Doktarr explained; however, studying its basic predictive results can still be interesting. Also, as Rick mentioned, no matter whose metric you use (BeatPaths, DVOA, AccuScore), this was an enormous upset. The biggest of all time? Maybe, maybe not, but a huge upset all the same.
The biggest problem with using BeatPaths as a predictive measure during the season (even in a weighted sense) is that wins and losses alone are extremely little data. After week 1, there is a 13-way tie for first and a 13-way tie for 14th. Through the first few weeks there is a lot of shuffling as the graph takes shape. It takes a while for the graph to stabilize, and even after that, certain games can cause huge shifts. I think weighting games will cause even more shifts, which is exactly what TT tried to eliminate this year by ignoring BeatFlukes. Lastly, the NFL season has so few games that any reduction in data makes it difficult to see the overall picture.
If we were to look into a predictive measure though, we would have to run tests with many different scales to see which had the best correlation of ranks to wins.
The final problem is that trying to predict who will win a game is a futile effort. I fudged the numbers as best I could with post-season predictions and the best correlation I could get was 0.35 (standard method with a 2.5 rating home-field advantage) which is still pretty weak. All our algorithms can do is determine which team is more likely to win a game, whether that be 51% or 99%. Either way, there is always a chance the other team will win, and many times the “predictions” will be wrong.
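Here is a minimal sketch of the kind of check described above. It correlates a predicted margin (rating difference plus a home-field bonus) with the actual 0/1 result, which amounts to a point-biserial correlation. Several pieces are assumptions: the rating scale, the `ratings` dict, the toy games, and treating the 2.5 home-field figure as an additive rating bonus. This will not reproduce the 0.35 figure, which came from real post-season data.

```python
import math

HOME_FIELD = 2.5  # additive home bonus; treating it this way is an assumption

def pearson(xs, ys):
    """Plain Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def prediction_correlation(games, ratings):
    """games: (home, away, home_won) tuples; ratings: team -> rating (hypothetical scale).
    Correlates each game's predicted margin with its 0/1 outcome."""
    margins = [ratings[home] + HOME_FIELD - ratings[away] for home, away, _ in games]
    results = [1.0 if home_won else 0.0 for _, _, home_won in games]
    return pearson(margins, results)

# Toy inputs, made up purely to exercise the functions.
ratings = {"A": 10.0, "B": 5.0, "C": 0.0}
games = [("A", "B", True), ("B", "C", True), ("C", "A", False), ("C", "B", True)]
print(round(prediction_correlation(games, ratings), 2))  # -> 0.78
```

Even a strong rating system can only push this number so high, because the outcome is a coin flip weighted by the margin and not a deterministic function of it.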
I think our focus should remain on finding the best way to determine a full season ranking of teams based on the data we have. At some point we should analyze the methods here now that the season is over, and decide which makes more sense from top to bottom.
You’re of course right that beatpath rankings have a hard time predicting individual results because they fail to consider so many sources of data. Nevertheless, if the goal of beatpaths is to be as accurate as possible given the ridiculously tiny data set of wins and losses, then I believe some sort of time weighting is in order.
Of course, this really amounts to a doubling of the data set. Now instead of wins and losses, you have ordered pairs of {W/L, week}. Still, personally I love the idea of applying the beatpath resolution algorithms to broader data sets, so this does not bother me. I’d love to see it work on ordered pairs of (delta VOA, week), although I’m probably the only one. I’m interested in the algorithm as an opponent-adjustment approach, more so than in the specific dataset used.
Whether this would cause more shifting around in the graph from week to week would probably depend on the choice of algorithm. The idea of weighting the wins and losses doesn’t even make sense for the standard algorithm, unless we do some sort of hack (e.g. multiply each win by 100, apply the weight, and round to the nearest win, so a win six weeks back counts as 99 wins). I don’t think it would make the iterative graph that much more volatile, or the points-weighted graph, for that matter. Ultimately, though, who cares as long as it makes it more accurate? (Assuming future predictive accuracy is your goal, of course.)
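Here is a minimal sketch of that hack for the standard algorithm. It scales each win by 100, applies the time weight, and rounds to a whole number of wins, so an unweighted graph can consume the result as repeated edges. The weight function is hypothetical: it uses 0.99 for games six or more weeks back, just to reproduce the “99 wins” example. Any decay profile could be substituted.

```python
def integer_win_copies(weight: float, scale: int = 100) -> int:
    """Scale a fractional weight up and round it to a whole number of wins.
    Python's round() sends exact halves to the nearest even integer."""
    return round(scale * weight)

def expand_edges(games, weight_of):
    """games: (weeks_ago, winner, loser) tuples. Returns a flat list of
    (winner, loser) edges in which each game appears once per integer copy."""
    edges = []
    for weeks_ago, winner, loser in games:
        edges.extend([(winner, loser)] * integer_win_copies(weight_of(weeks_ago)))
    return edges

def example_weight(weeks_ago: int) -> float:
    """Hypothetical decay: full weight for recent games, 0.99 from six weeks back."""
    return 0.99 if weeks_ago >= 6 else 1.0

games = [(1, "NE", "NYG"), (6, "NE", "DAL")]
edges = expand_edges(games, example_weight)
print(edges.count(("NE", "NYG")), edges.count(("NE", "DAL")))  # -> 100 99
```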
To wrap up my logistic curve blurb – you get a pretty close fit to the weighted DVOA dropoff with:
weight(week) = 1/(1 + .00005*e^(.9*week))
It stays above .99 for 5 weeks, drops to about .86 at week 9, and tumbles to under 3% for 15 weeks back. Anything 17 or more weeks back is under 1%.
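Here is a quick way to check those numbers. The sketch evaluates the formula as written at a few weeks. The only thing added is the function name, which is an assumption.

```python
import math

def logistic_weight(week: float) -> float:
    """weight(week) = 1 / (1 + 0.00005 * e^(0.9 * week)), as given in the comment."""
    return 1.0 / (1.0 + 0.00005 * math.exp(0.9 * week))

for w in (5, 6, 9, 15, 16, 17):
    print(w, round(logistic_weight(w), 3))
```

Week 5 comes out just above .99 and week 6 just below. Week 9 lands near .86, week 15 under .03, week 16 slightly above .01, and week 17 under .01.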
I think the point is though, that future predictive accuracy is not the goal.
I think TT has made statements to the effect of predictive accuracy being a goal. That’s sort of the point of tracking the picks record and evaluating potential changes based on whether they would improve the picks.
Now, of course he’s been clear that you would be crazy to go by his picks. That makes perfect sense because the picks are considering such a tiny data set. They don’t consider injuries, home field advantage, and so on. But I think predictive accuracy, given this data set, is a goal.
With that in mind, it’s entirely reasonable to reject these changes on the grounds that they mean we’re considering twice as much data. Part of the beauty of the current approach is how well it does with so little data. As I said, personally I’m more interested in the algorithm as a means of opponent adjustment, as opposed to what exact data set it uses.
I like descriptive better than predictive but my main enjoyment of it is how it occasionally challenges stupid conventional wisdom.
And the predictive part is definitely a fun goal that I like moving towards, although I did decide against beatflukes even though it was ever-so-slightly more predictive, because it made things a bit less consistent from week to week. Overall I like getting the best and most holistic overview / ecology possible, seeing it as a whole system.
Thank you for doing this again this season. I enjoyed reading this site each week, and I hope you do it again next year.
If anyone is interested in applying beatpaths to other sports, I think it would be interesting to see them for boxing and MMA.
Lastly, even though you haven’t posted it, I assume that the Pats ended with the #1 rank. Did they also maintain beatwins over all other teams? (Of course, it is no consolation for this devastated Pats fan.)
I’ve found that if you peruse a wide variety of predictive and descriptive methods for betting, and use those that agree on about 90% of the games or more, then you’ll clear the 60% mark you need to beat the vig in Vegas.
Descriptive models have a flaw, which is usually that they don’t account for strength of competition or consistency. However, many use adjustments to overcome this shortcoming. Not a 100% sure thing, but still not too bad.
And, like Wall Street, if you’re going to employ a system, you have to STICK WITH IT. Don’t vary at all. Because when you do, you’ll fail or you’ll fall into bad habits.
I don’t bet that often, because I don’t go to Vegas often and I don’t have a bookie. And my wife wants a new kitchen.
Justin:
Yes, they stayed #1 and retained their beatwins over all other teams. The loss to NYG only caused a season split, so nothing changed in the graph since they had a path through Dallas anyway. The game couldn’t benefit NE in any real way since they already had a direct path to NYG, but losing that direct path cost them some stability at the top, not that it matters.
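Here is a minimal sketch of why the Super Bowl loss left New England’s path intact. It assumes the rule that a split head-to-head series produces no direct edge, while a lopsided series produces one edge toward the team with more wins. The function names are hypothetical. The Dallas results encoded below are stated as an assumption for illustration: two Dallas wins over the Giants and one Giants playoff win. Beatloop removal is not modeled.

```python
from collections import Counter, deque

def net_edges(results):
    """results: list of (winner, loser). A split series cancels out; otherwise
    the team with more head-to-head wins gets a single directed edge."""
    wins = Counter(results)
    edges = set()
    for (winner, loser), n in wins.items():
        if n > wins[(loser, winner)]:  # missing keys count as 0
            edges.add((winner, loser))
    return edges

def has_path(edges, start, goal):
    """Breadth-first search for a chain of net wins start -> ... -> goal."""
    graph = {}
    for winner, loser in edges:
        graph.setdefault(winner, []).append(loser)
    seen, queue = {start}, deque([start])
    while queue:
        team = queue.popleft()
        if team == goal:
            return True
        for nxt in graph.get(team, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

results = [
    ("NE", "NYG"), ("NYG", "NE"),                   # season split
    ("NE", "DAL"),
    ("DAL", "NYG"), ("DAL", "NYG"), ("NYG", "DAL"),  # assumed Dallas series
]
edges = net_edges(results)
print(sorted(edges))                  # -> [('DAL', 'NYG'), ('NE', 'DAL')]
print(has_path(edges, "NE", "NYG"))   # -> True, through Dallas
```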
How’s that new portal coming? Should I point people to the NBA/NHL lists?