Part of wine’s appeal is that you can find a great bottle that you enjoy on any budget. While one can spend hundreds or thousands of dollars on a wine (so I’ve heard), there are also fantastic wines for under $10. Price is a reliable indicator neither of quality nor of how much you might enjoy a wine. What is price, then? How much do ratings, varietal, or volume explain price? In this post, we explore how much a simple linear regression model can tell us.
The data set: to examine these questions, I’m using reviews from Wine Spectator published in 2016. I include only wines rated 85–100, of vintage 2006 or later, whose reviews also list price, varietal, and number of cases made or imported. Volume was one of the primary reasons I chose this data source. This results in 22,000 reviews. I’m going to assume, perhaps a little idealistically, that price is a function of quality, using rating as a proxy, and not vice versa.
Exploratory data analysis: what are the key features that explain price? First, how do prices vary by continent?
The most expensive wines (~$850) come from France, Australia, and the USA, while the cheapest ($5) come from Chile and Australia. Because the range is so large, these boxplots show price on a log scale.
In terms of average (or median) price, North American wines are the most expensive, followed by Asia (in this dataset: China, Japan, India, Israel, Lebanon, and Turkey), Europe, Australia/New Zealand, and Africa, with South America the cheapest, on average.
What about that key variable, rating? The correlation between price and rating is 0.49 but between log(price) and rating it is 0.63. See below. Thus, in these models we will attempt to predict not price but log(price).
I’ve jittered the rating values so that they overlap less and illustrate the distribution better. I’ve also added some contours (red). Clearly, there is a strong trend: higher-rated wines tend to cost more. However, what this chart illustrates to me is the huge range in price for a given rating. For wine rated 90 points, you could pay $8 or $500. Which would you choose?
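The comparison of the two correlations can be sketched in a few lines of R. This is only an illustration on synthetic data: the `wines` data frame, its column names, and the numbers used to generate it are my assumptions, not the Wine Spectator reviews, so the printed values will not match 0.49 and 0.63.

```r
# Synthetic stand-in for the review data (all numbers are assumptions).
set.seed(42)
n <- 5000

# Ratings 85-100, with lower ratings sampled more often.
rating <- sample(85:100, n, replace = TRUE, prob = rev(seq_len(16))^2)

# Price grows roughly exponentially with rating, with multiplicative noise.
price <- exp(2.5 + 0.15 * (rating - 85) + rnorm(n, sd = 0.6))
wines <- data.frame(rating = rating, price = price)

# Pearson correlation on the raw scale and on the log scale.
cor_raw <- cor(wines$price, wines$rating)
cor_log <- cor(log(wines$price), wines$rating)
print(c(raw = cor_raw, log = cor_log))
```

When price depends on rating in a roughly multiplicative way, as in this toy setup, the log transform straightens the relationship and the linear correlation goes up, which is the motivation for modeling log(price).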
The model: I’m using a basic linear regression in R to predict log(price). (Model features are in italics.) The data were split into a 75% training set and a 25% test set, and the metrics quoted below are the adjusted R2 values from the test set. I compare the metrics from both the training and test sets, though, to check for overfitting.
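For readers who want to try this, here is a minimal sketch of a 75/25 split with adjusted R2 computed on both sets. The post does not say exactly how the test-set value was calculated, so the `adj_r2` helper below is a hypothetical one that applies the usual adjustment formula to out-of-sample predictions, and the data are synthetic.

```r
# Synthetic stand-in data (assumed, not the real reviews).
set.seed(1)
n <- 4000
wines <- data.frame(rating = sample(85:100, n, replace = TRUE))
wines$price <- exp(2.5 + 0.15 * (wines$rating - 85) + rnorm(n, sd = 0.6))

# Random 75% / 25% split.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- wines[train_idx, ]
test  <- wines[-train_idx, ]

fit <- lm(log(price) ~ rating, data = train)

# Hypothetical helper: adjusted R^2 of a fitted model on any data set.
adj_r2 <- function(model, data, response) {
  pred   <- predict(model, newdata = data)
  ss_res <- sum((response - pred)^2)
  ss_tot <- sum((response - mean(response))^2)
  r2     <- 1 - ss_res / ss_tot
  n_obs  <- length(response)
  p      <- length(coef(model)) - 1  # predictors, excluding the intercept
  1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
}

train_r2 <- adj_r2(fit, train, log(train$price))
test_r2  <- adj_r2(fit, test,  log(test$price))
print(c(train = train_r2, test = test_r2))
```

A test value far below the training value would be the warning sign for overfitting.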
Rating alone, i.e. lm(log(price)~rating), explains 40% of the variance of log(price).
Rating + year: adding in year as an additional linear predictor, a proxy for wine age, raises that to 47% with a negative coefficient, meaning older wines are priced higher.
Rating + year + color: including color (red, white, and rosé for major varietals and “unknown” for the remainder), while statistically significant, increases the value only negligibly, to 47.4%. The coefficients rank red, unknown, white, and rosé, which matches intuition: high-end reds tend to command higher prices because they can be stored for far longer than whites.
Rating + year + color + factor(year): the addition of year as a factor, or “year quality,” meaning that it captures particular globally good, bad, or average years, bumps R2 to 49%. 2013 and 2014 are the top two years driving higher prices (which is a little surprising), while 2007, generally a bad year for parts of Europe and the southern hemisphere, was the worst.
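To make the two roles of year concrete, here is a small sketch on synthetic data (the `vintage_effect` values and all column names are assumptions). One thing to watch for when reproducing this: in base R, a numeric `year` is a linear combination of the intercept and the `factor(year)` dummies, so when both appear in one formula `lm()` reports NA for one aliased coefficient and the fit matches the factor-only model. The post does not say whether this came up in the original analysis.

```r
# Synthetic data: a linear age trend plus per-vintage bumps (assumed values).
set.seed(3)
n <- 2000
wines <- data.frame(
  rating = sample(85:100, n, replace = TRUE),
  year   = sample(2006:2015, n, replace = TRUE)
)
vintage_effect <- setNames(rnorm(10, sd = 0.1), 2006:2015)
wines$log_price <- 3 + 0.12 * (wines$rating - 85) -
  0.04 * (wines$year - 2006) +
  unname(vintage_effect[as.character(wines$year)]) +
  rnorm(n, sd = 0.5)

fit_linear <- lm(log_price ~ rating + year, data = wines)                 # "wine age"
fit_factor <- lm(log_price ~ rating + factor(year), data = wines)         # "year quality"
fit_both   <- lm(log_price ~ rating + year + factor(year), data = wines)  # both terms

adj <- function(m) summary(m)$adj.r.squared
print(c(linear = adj(fit_linear), factor = adj(fit_factor), both = adj(fit_both)))

# Coefficients that lm() could not estimate separately show up as NA.
aliased <- names(coef(fit_both))[is.na(coef(fit_both))]
print(aliased)
```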
Rating + year + color + factor(year) + factor(continent): continent brings it up to 51%, with all but South America being significant.
Rating + year + color + factor(year) + factor(continent) + log(num_cases): importantly, the big jump comes from adding in volume, the number of cases. That reaches 59%. Interestingly, the number of cases made or imported ranges widely, from a low of 2 (2013 and 2014 Bâtard-Montrachet Burgundy) to 1.6 million (Kiwi 2016 Marlborough Sauvignon Blanc). Given the wide range, I used log(num_cases). As “X cases made” might be a better indicator of volume than “X cases imported,” I did try an interaction term log(num_cases)*factor({made,imported}), but it had a negligible effect: 59.5%.
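Here is a rough sketch of the volume term and the made/imported interaction. The data are synthetic, `case_import` is simulated with no real effect on price, and the coefficients used to generate the data are assumptions, so this only shows the mechanics of the comparison.

```r
# Synthetic data: cases span several orders of magnitude (assumed distribution).
set.seed(11)
n <- 3000
wines <- data.frame(
  rating      = sample(85:100, n, replace = TRUE),
  num_cases   = ceiling(exp(rnorm(n, mean = 7, sd = 2))),
  case_import = factor(sample(c("made", "imported"), n, replace = TRUE))
)
# Scarcer wines cost more in this toy setup: negative slope on log(num_cases).
wines$price <- exp(3 + 0.12 * (wines$rating - 85) -
                   0.15 * log(wines$num_cases) + rnorm(n, sd = 0.5))

fit_main <- lm(log(price) ~ rating + log(num_cases), data = wines)
# a * b expands to a + b + a:b, i.e. both main effects plus the interaction.
fit_int  <- lm(log(price) ~ rating + log(num_cases) * case_import, data = wines)

adj_main <- summary(fit_main)$adj.r.squared
adj_int  <- summary(fit_int)$adj.r.squared
print(c(main = adj_main, interaction = adj_int))
print(coef(fit_main)[["log(num_cases)"]])
```

If the interaction barely moves adjusted R2, as it does here by construction, the simpler model is usually the one to keep.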
Rating + year + color + factor(year) + factor(continent) + log(num_cases) + factor(single_varietal): finally, adding varietal as factor(single_varietal) brings us to 61%. All wines with 2 or more grapes were defined as varietal “blend”.
To recap,
log(price) ~ rating + year + factor(color) + factor(year) + factor(continent) + log(num_cases) * factor(case_import) + factor(single_varietal)
provides a model that is not overfit in which the coefficients make directional sense.
Conclusions: how do these features rank in importance? Due to a singular matrix, I was not able to obtain the relative importance with this complete model. However, after removing the relatively unimportant factor(color), I obtained the following:
Feature | Relative importance (%) |
---|---|
rating | 38 |
log(num_cases) | 29 |
factor(year) or “year quality” | 11 |
factor(single_varietal) | 10 |
year or “wine age” | 8 |
So there we have it. For these data, the most important features are rating (38%), number of cases (29%), year (as factor, 11%), varietal (10%), and year (as linear feature, 8%), with the remaining features and interaction terms contributing relatively little.
In theory, year as a factor should not be important, as that ought to be captured in the ratings, assuming rating is a proxy for quality. That is, if it were a banner year, that ought to be reflected in higher ratings for the wines. However, this analysis shows that this is not the case and that both rating and year determine price.
This model explained only 61% of the variance in log(price), which means that a lot remains unexplained. Region and country (not shown here) explain a little, and possibly currency fluctuations do too. I suspect, though, that a large chunk of the remainder is the subjective noise of an industry that effectively spans from commodity to global high-end luxury good: vendors can get away with overpriced wine in the mid-price range, and collectors especially are simply willing to pay a premium for certain wines from certain wineries that they can lay down and enjoy for years to come.
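The post does not say which method produced these relative importance figures. As one possible approach, the sketch below averages each term's R2 gain over every order in which the terms can be added (often called LMG or Shapley-style importance), using base R only. The data, the three terms, and the helper names `r2_of` and `perms` are all assumptions for illustration.

```r
# Synthetic data with three predictors (assumed coefficients).
set.seed(7)
n <- 3000
wines <- data.frame(
  rating    = sample(85:100, n, replace = TRUE),
  num_cases = ceiling(exp(rnorm(n, mean = 7, sd = 1.5))),
  year      = sample(2006:2015, n, replace = TRUE)
)
wines$log_price <- 3 + 0.12 * (wines$rating - 85) -
  0.15 * log(wines$num_cases) - 0.03 * (wines$year - 2006) +
  rnorm(n, sd = 0.5)

terms <- c("rating", "log(num_cases)", "year")

# R^2 of the model containing exactly the given terms (0 for the empty model).
r2_of <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "log_price"), data = wines))$r.squared
}

# All orderings of a character vector.
perms <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (rest in perms(x[-i])) out[[length(out) + 1]] <- c(x[i], rest)
  }
  out
}

# Average each term's sequential R^2 gain over all orderings.
orderings <- perms(terms)
shares <- setNames(numeric(length(terms)), terms)
for (ord in orderings) {
  for (k in seq_along(ord)) {
    gain <- r2_of(ord[seq_len(k)]) - r2_of(ord[seq_len(k - 1)])
    shares[ord[k]] <- shares[ord[k]] + gain / length(orderings)
  }
}

importance_pct <- round(100 * shares / sum(shares), 1)
print(importance_pct)
```

The shares add up to the full model's R2, so dividing by their sum gives percentages of explained variance. The number of orderings grows factorially with the number of terms, so this brute-force version only suits a handful of predictors.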
Cheers!