Is Robert Parker high? Comparing wine raters

In wine, ratings are everything. They drive price, sales, and reputation, at the very least. The state of a winery's balance sheet might hang on getting a good rating from an influential outfit such as Wine Spectator or Robert Parker. However, perceptions of wine are incredibly subjective; a fantastic Pinotage to me might taste like dirt to you. Presumably, professional raters are more objective than us amateurs. But do these trained raters agree with each other? Does Robert Parker tend to give high ratings, or is he about average compared to other raters scoring the same wine? In this post, we compare the variability among eleven influential raters for the same set of wines.

The data set: the data comes from a single major US online wine merchant that, to help sell the wine, aggregates reviews from other major raters such as Wine Spectator, Wine Enthusiast, and James Suckling. Any given wine might have up to eight different raters, although here we only compare pairs of raters. I obtained roughly 22,000 wines that had 2 or more raters. The aggregated reviews are not guaranteed to be for the same vintage of each wine, but I am going to assume that vintage is an unbiased random effect (a particular rater doesn't tend to rate only good vintages) that will wash out across the data.
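For concreteness, here is a minimal pandas sketch of this filtering step. It assumes the scraped data sits in a hypothetical long-format file wine_ratings.csv with one row per (wine, rater, score); the file and column names are mine, not the merchant's.

```python
# Hypothetical sketch: assume a long-format table with one row per (wine, rater, score).
import pandas as pd

ratings = pd.read_csv("wine_ratings.csv")  # assumed columns: wine_id, rater, score

# Keep only wines scored by two or more raters.
n_raters = ratings.groupby("wine_id")["rater"].nunique()
multi_rated = n_raters[n_raters >= 2].index
ratings = ratings[ratings["wine_id"].isin(multi_rated)]

print(f"{ratings['wine_id'].nunique()} wines with 2 or more raters")
```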

For each pair of raters, I counted the number of wines they had co-rated. Using a threshold of 100 co-rated wines, I removed 9 raters that had too few co-ratings across the board. Eleven raters remained:

Code Rater
rater_WS Wine Spectator
rater_WE Wine Enthusiast
rater_W_S Wine & Spirits
rater_VN Vinous
rater_ST Stephen Tanzer
rater_RP Robert Parker
rater_JS James Suckling
rater_JH James Halliday
rater_DC Decanter
rater_CG Connoisseurs Guide
rater_BH Allen Meadows – Burghound

This plot shows the sample sizes for each pair of raters:

[Figure: sample_sizes, co-rating counts for each pair of raters]

For instance, Wine Enthusiast (WE) and Stephen Tanzer (ST) co-rated 2954 different wines. Thus, this is a pretty respectable data set.
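A sketch of how such pairwise co-rating counts (and the 100-wine threshold mentioned above) might be tallied, reusing the hypothetical ratings DataFrame from the earlier sketch:

```python
# Sketch: count co-rated wines for every pair of raters by pivoting to a
# wide wine-by-rater matrix; names follow the earlier hypothetical sketch.
wide = ratings.pivot_table(index="wine_id", columns="rater", values="score")

present = wide.notna().astype(int)
co_counts = present.T @ present  # co_counts.loc[r1, r2] = wines rated by both r1 and r2

# Drop raters that never reach 100 co-ratings with any other rater.
keep = [r for r in co_counts.columns if (co_counts[r].drop(r) >= 100).any()]
co_counts = co_counts.loc[keep, keep]
print(co_counts)
```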

Exploratory data analysis: one starting point is to examine the correlation between each pair of raters:

[Figure: correlation, pairwise rating correlations]

This highlights a couple of findings: i) some pairs of raters have very low correlations, such as Connoisseurs Guide (CG) and Burghound (BH); ii) no pair of raters has a correlation of 0.8 or higher, and correlations typically fall in the range 0.4–0.6.
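Continuing the same sketch, pandas can compute each pairwise Pearson correlation over the wines both raters scored; the `wide` matrix is the hypothetical one built above.

```python
# Sketch: Pearson correlation for each pair of raters over their co-rated wines.
# pandas handles the pairwise non-null overlap automatically.
corr = wide.corr(method="pearson", min_periods=100)  # require >= 100 co-rated wines
print(corr.round(2))
```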

Correlation does not capture absolute differences, so another informative metric is the mean difference of ratings (over the set of co-rated wines) between each pair of raters, taking rater 1's score minus rater 2's score. If rater 1 tends to rate each wine higher than rater 2, the mean difference will be positive. If they rate the wines the same (or if their disagreements cancel out symmetrically), the mean difference will be zero.
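Under the same assumptions as the earlier sketches, the pairwise mean differences might be computed like this:

```python
# Sketch: mean rating difference (rater1 minus rater2) over co-rated wines,
# using the hypothetical wide wine-by-rater matrix from the earlier sketch.
import numpy as np
import pandas as pd

raters = wide.columns
mean_diff = pd.DataFrame(index=raters, columns=raters, dtype=float)
for r1 in raters:
    for r2 in raters:
        diff = (wide[r1] - wide[r2]).dropna()  # only wines scored by both raters
        mean_diff.loc[r1, r2] = diff.mean() if len(diff) else np.nan
```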

In the heat map below, brighter colors are higher mean differences (rater1 – rater2). One rater that stands out is James Halliday (rater_JH). If we go across his row, we see that each number is positive, going as high as 2.9 points compared to Vinous (VN). (Another way to see this result is to scan his column in which the mean differences are all negative.)

Stephen Tanzer (ST), Wine & Spirits (W_S), and Burghound (BH) tend to be tougher than the rest of the crowd, as their row values are mostly negative. Robert Parker, along with several others, shows a reasonable mix of positive and negative values.

[Figure: mean_difference, heat map of pairwise mean rating differences]

We can grasp this more clearly if we compute, for each row, the proportion of cells that are positive (mean_prop_positive) and the mean of the mean differences (mean_mean_diff); the latter captures how a rater's typical rating compares to the other raters, averaged over all of them. (A sketch of this computation follows the table below.)

rater1      mean_prop_positive   mean_mean_diff (points)
rater_JH    1.00                  1.80
rater_JS    0.90                  1.06
rater_WE    0.80                  0.52
rater_RP    0.70                  0.30
rater_DC    0.40                 -0.05
rater_CG    0.56                 -0.21
rater_VN    0.50                 -0.32
rater_WS    0.40                 -0.43
rater_BH    0.11                 -0.73
rater_W_S   0.20                 -0.73
rater_ST    0.00                 -0.94

Thus, this is one way to rank our raters: tougher raters are at the bottom, more generous raters at the top.
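The two summary columns could be derived from the mean-difference matrix roughly as follows; the column names mirror the table above, but the code is an illustrative sketch, not the original analysis.

```python
# Sketch: summarise each row of the mean-difference matrix into the two
# columns shown in the table above.
import numpy as np
import pandas as pd

off_diag = mean_diff.where(~np.eye(len(mean_diff), dtype=bool))  # blank out self-comparisons
summary = pd.DataFrame({
    "mean_prop_positive": (off_diag > 0).sum(axis=1) / off_diag.notna().sum(axis=1),
    "mean_mean_diff": off_diag.mean(axis=1),
}).sort_values("mean_mean_diff", ascending=False)
print(summary.round(2))
```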

Additional Data Visualization: single metrics such as these don’t capture the true behavior of a pair of raters co-rating thousands of bottles of wine. One aid is to plot the joint rating distributions.

For instance, this is the joint distribution of the 9408 ratings for Robert Parker (rater r1, y-axis) and Stephen Tanzer (rater r2, x-axis):

[Figure: rp_versus_st, joint distribution of Robert Parker vs. Stephen Tanzer ratings]

This distribution is positively correlated and roughly bivariate normal, but skewed towards higher ratings, with a peak around 92/93 points and a range of 87–100 points on each axis. Importantly, note the breadth of this distribution: for some wines, Robert Parker gave 100 points while Stephen Tanzer gave just 88.
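A rough sketch of how such a joint distribution could be plotted as a 2D histogram with matplotlib; the rater column names follow the codes defined earlier, and the bin range is an assumption based on the description above.

```python
# Sketch: 2D histogram of one rater pair's co-ratings, roughly what the figure shows.
import matplotlib.pyplot as plt

pair = wide[["rater_RP", "rater_ST"]].dropna()  # wines scored by both RP and ST
plt.hist2d(pair["rater_ST"], pair["rater_RP"], bins=list(range(85, 102)), cmap="viridis")
plt.xlabel("Stephen Tanzer (rater_ST)")
plt.ylabel("Robert Parker (rater_RP)")
plt.colorbar(label="number of wines")
plt.show()
```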

Plotting the joint distributions for all pairs of raters:

[Figure: bivariate_matrix, joint rating distributions for all pairs of raters]

The high variability among raters is obvious. The pairs are positively correlated, but raters rarely agree exactly and the distributions are fairly broad.

What's one to do? If you care about and believe in ratings, then one strategy is to stick with a single rater. Find a rater you tend to agree with, or can at least calibrate against, and stick to that single source. However, and this is the kicker, studies have shown that the same rater is not consistent when rating the same wine multiple times. The ratings (same person, same wine) can vary by ±4 points! Ratings might indeed be bullshit.

Cheers!

 
