The data I am using comes from the Kaggle website: https://www.kaggle.com/zynicide/wine-reviews/data
It is a collection of many wine reviews. Each review contains information about the winery, the type of wine and its region of origin, as well as the name of the reviewer, the price of the bottle, the number of points awarded and a written description of the wine.
The goal is to understand how much information about the quality of the wine can be extracted from the rest of the data. First we will look only at the "objective" characteristics, such as the type of wine, its origin and its price. These are not as objective as they seem, as they have a lot to do with how the reviewers pick the wines to review in the first place. In a second step, we will try to extract information from the description of the wine, a long string of words. This will require some feature engineering, as we need to convert the text into numbers that correlate with the quality of the tasted wine.
We will always use the same machine learning algorithm and compare the scores obtained by fitting it to different subsets of the data. This will be our measure of how much information about the target (the number of points of the review) is contained in each subset.
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno
import category_encoders as ce
import sklearn as skl
import warnings
warnings.filterwarnings("ignore")
# plot parameters
plt.rcParams["figure.figsize"] = (10, 5)
plt.rcParams["font.size"] = 14
# Open data
datafile = 'data/wine-reviews/winemag-data-130k-v2.csv'
data_small = pd.read_csv(datafile, nrows=10)
data_small
We start by looking at the distributions and some correlations between the different "features" of the data.
First, we take a look at the distribution of the wine reviews according to the country of origin of the wine. We see that the reviewers are based in the US, and a majority of the wines come from there.
countries = pd.read_csv(datafile, usecols=[1])
print(countries.country.unique(), countries.nunique())
countries['country'].value_counts().plot.bar(figsize=(10, 6), logy=True)
count_nan = len(countries) - countries.count()
print('number of nans = {}'.format(count_nan))
# Designation
col_desig = pd.read_csv(datafile, usecols=[3])
print(col_desig.nunique())
As we see below, the reviewers award points on a 100-point scale, but in practice only within the window from 80 to 100 points.
# Points
col_points = pd.read_csv(datafile, usecols=[4])
col_points.hist(bins=20)
count_nan = len(col_points) - col_points.count()
print('number of nans = {}'.format(count_nan))
Prices vary wildly, with most bottles under \$100 but a few costing over \$3000. I expect this to be the feature that correlates most strongly with the points and is the most useful for prediction. In the end, I decided to use a log scale to renormalize the prices. Interestingly, this gives a nice bell shape for the distribution, with a most likely price of about $10^{1.4} \approx \$25$.
# Price
col_price = pd.read_csv(datafile, usecols=[5])
col_price.hist(bins=20, log=True)
count_nan = len(col_price) - col_price.count()
print('number of nans = {}'.format(count_nan))
# Log of price histogram
col_price['price'].apply(np.log10).hist(bins=20)
Here, as for the countries, the distribution is uneven: we observe that a few tasters have done most of the reviews.
# Taster name
col_taster = pd.read_csv(datafile, usecols=[9])
print(col_taster.nunique())
col_taster['taster_name'].value_counts().plot.bar(figsize=(10, 6), logy=True)
# Varieties
col_vars = pd.read_csv(datafile, usecols=[12])
print(col_vars.nunique())
# Wineries
col_wineries = pd.read_csv(datafile, usecols=[13])
print(col_wineries.nunique())
# Provinces
col_prov = pd.read_csv(datafile, usecols=[6])
print(col_prov.nunique())
count_nan = len(col_prov) - col_prov.count()
print('number of nans = {}'.format(count_nan))
# Region 1
col_reg1 = pd.read_csv(datafile, usecols=[7])
print(col_reg1.nunique())
count_nan = len(col_reg1) - col_reg1.count()
print('number of nans = {}'.format(count_nan))
# Region 2
col_reg2 = pd.read_csv(datafile, usecols=[8])
print(col_reg2.nunique())
count_nan = len(col_reg2) - col_reg2.count()
print('number of nans = {}'.format(count_nan))
# Region 2 is missing in most entries, so this column is probably useless
As expected, we see a strong positive correlation between the price of the wine and the number of points it received. Is this a tribute to the objectivity of wine pricing (the better the wine, the more expensive it is), or are the reviewers swayed by the price of a costly drink? Unfortunately it is impossible to tell from this data alone; it is certainly a bit of both.
price_points = pd.read_csv(datafile, usecols=[4,5])
price_points['price'] = price_points['price'].apply(np.log10)
plt.scatter(price_points['price'], price_points['points'])
plt.xlabel('price')
plt.ylabel('points')
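To put a number on this, we can compute the rank correlation between price and points. This is a quick sketch, not part of the original pipeline; it assumes scipy is available, and uses Spearman's rho because the relationship is clearly non-linear.
from scipy.stats import spearmanr
# Drop the rows with missing prices before computing the correlation
pp = price_points.dropna()
rho, pval = spearmanr(pp['price'], pp['points'])
print('Spearman correlation between price and points: {:.2f}'.format(rho))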
A few of the features are missing for quite a few entries. In the end, we decided to drop the "region_2" and "designation" features, as they are missing for most data points and do not seem very informative anyway.
fulldata = pd.read_csv(datafile)
msno.matrix(fulldata)
I tried to be very careful about information leakage by splitting the data into a training and a testing set as the very first step. The treatment of the training data is then wrapped in functions, so that exactly the same treatment can be applied to the testing data.
The main problem is how to treat the many features that are strings. The various origins and types of wine represent very diverse categories. A one-hot encoding would be very impractical, as some features take thousands of different values. I chose to use a hashing encoder instead. This loses a bit of information, but keeps the data from becoming unwieldy.
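To illustrate the scale of the problem, here is a rough sketch comparing the width a one-hot encoding would need for the worst offender, the winery column, with the fixed output width of the hashing encoder.
# One-hot encoding would create one column per distinct winery...
print('one-hot columns for the winery column alone:', fulldata['winery'].nunique())
# ...while HashingEncoder maps all categorical columns into a fixed
# number of components (8 by default in category_encoders)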
As a first step, we try to guess the number of points from all the other characteristics. I chose to encode the grade as three categories (bad, average, good). I made sure that the distribution among those categories is more or less even (with about a third of the reviews falling into each category), so that it is not possible to get a good accuracy score by simply betting on the most likely category.
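As an aside, pandas can produce such an (approximately) equal-sized binning in one line with qcut; here is a sketch on the full points column, for illustration only. Because the points are integers with many ties, the bins are only roughly even, and the explicit thresholds computed below are used instead, since they are easy to re-apply to the test set.
# Three roughly equal-sized bins of the points column (codes 0, 1, 2)
points_binned = pd.qcut(fulldata['points'], q=3, labels=False, duplicates='drop')
print(points_binned.value_counts())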
It might also be interesting to use the price as a target, to see whether we can guess how much a wine costs from how much the reviewers appreciated it.
from sklearn.model_selection import train_test_split
fulldata_train, fulldata_test = train_test_split(fulldata, test_size=0.1,
random_state=18)
# As a first approach, we simply drop rows with missing values,
# after removing the two columns with the most missing values
def treat_data(dat):
    # Wrapped in a function so that the same procedure can be
    # applied to the test data later
    newdat = dat.drop(columns=['region_2', 'designation'])
    newdat = newdat.dropna()
    return newdat
fulldata_nona = treat_data(fulldata_train)
len(fulldata_nona)
# Preparing data: separated in features (X) and label (y)
def make_xy(dat):
    # Forget the description for now
    X = dat.drop(
        columns=['description', 'title', 'taster_twitter_handle', 'points'])
    # Price on a log scale for nicer numbers
    X['price'] = X['price'].apply(np.log10)
    y = dat['points']
    return X, y
X_data, y_data = make_xy(fulldata_nona)
# First approach: use hashing on categorical features
hashing = ce.HashingEncoder()
# The hashing encoder is fit on the training data only,
# so that the test data can be treated in exactly the same way
hashing.fit(X_data, y_data)
def hashing_features(X):
    newX = hashing.transform(X)
    return newX
X_data_hashed = hashing_features(X_data)
# Let's split scores into three equivalent bins: bad, average, good
length_data = len(y_data)
split1 = y_data.sort_values().values[length_data//3]
split2 = y_data.sort_values().values[-length_data//3]
print(split1, split2)
def splitting_func(points):
    # The thresholds are defined on the training data only;
    # the testing data will be binned in the same way
    if points <= split1:
        return 0
    elif points <= split2:
        return 1
    else:
        return 2
def rescale_points(y):
    newy = y.apply(splitting_func)
    return newy
y_data_binned = rescale_points(y_data)
y_data_binned.hist(bins=20)
Since there are many features of different kinds, tree classifiers and random forests seem like the best option here. We set up a random forest classifier (with a grid search to optimize its parameters) and apply the very same model to different subsets of the data, so that we are only comparing the information content of the data, not different models.
We find that, fitting on our original features, we guess the right points category 58% of the time. As we will see below, this is much better than random guessing.
# Prepare testing data in the same way
X_train, y_train = X_data_hashed, y_data_binned
X_test_init, y_test_init = make_xy(treat_data(fulldata_test))
X_test = hashing_features(X_test_init)
y_test = rescale_points(y_test_init)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Testing classification problem with random forests
param_grid = {
'n_estimators': [100],
'max_depth': [3,5,10]
}
gs = GridSearchCV(RandomForestClassifier(), param_grid, cv=10,
return_train_score=True, verbose=2, n_jobs=6)
gs.fit(X_train, y_train)
print(gs.best_estimator_.score(X_train, y_train),
gs.best_estimator_.score(X_test, y_test))
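To see where this 58% comes from, we can look at the confusion matrix on the test set; a quick sketch (the exact numbers depend on the random seed).
from sklearn.metrics import confusion_matrix
# Rows: true categories (0 = bad, 1 = average, 2 = good); columns: predictions
y_pred = gs.best_estimator_.predict(X_test)
print(confusion_matrix(y_test, y_pred))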
As mentioned above, the target labels were split evenly, so there is a 1/3 chance of guessing the right category purely by accident. We check this with the dummy classifiers from sklearn, which cannot get a score above 36% by uninformed guessing.
This provides the baseline for the scoring: if we get a score larger than this, we have learnt something from the data. The higher the score, the more information was contained in the data.
# Make a baseline for guesses
from sklearn.dummy import DummyClassifier
strategies = ['stratified', 'most_frequent', 'prior', 'uniform']
for strat in strategies:
    dummy = DummyClassifier(strategy=strat)
    dummy.fit(X_train, y_train)
    print('strategy {} gives a score of {}'.format(strat, dummy.score(X_test, y_test)))
As expected, using just the price already gives quite a good guess, with an accuracy of 56%. The price is indeed the feature that correlates most strongly with the points.
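We can also read this off directly from the forest fitted above on the full feature set; a sketch of its impurity-based feature importances (with the default hashing, the columns col_0 to col_7 are the hashed categorical features).
# Impurity-based feature importances from the fit on the full feature set
for name, imp in zip(X_train.columns, gs.best_estimator_.feature_importances_):
    print('{}: {:.3f}'.format(name, imp))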
# Drop all but price
gs.fit(X_train[['price']], y_train)
print(gs.best_estimator_.score(X_train[['price']], y_train),
gs.best_estimator_.score(X_test[['price']], y_test))
# Drop the price
gs.fit(X_train.drop(columns='price'), y_train)
print(gs.best_estimator_.score(X_train.drop(columns='price'), y_train),
gs.best_estimator_.score(X_test.drop(columns='price'), y_test))
It is interesting that just by knowing the location and type of a wine (and the identity of the reviewer), we can still make a good guess at whether the wine is good or not. This is most likely due to the fact that the list of wines is not random: the wines are chosen to be reviewed. If a reviewer picks an exotic wine from a faraway place that is hard to obtain, they will most likely pick one with a good reputation rather than the local cheap wine.
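As a first, very superficial look at this hypothesis (a sketch with no control for price or variety): the mean points per country of origin, computed on the cleaned training data.
# Mean points per country, for countries with at least 100 reviews
by_country = fulldata_nona.groupby('country')['points']
frequent = by_country.count() >= 100
print(by_country.mean()[frequent].sort_values(ascending=False).head(10))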
The description is a long string of words, of a different length for each wine. We need to process it in some way before a random forest can make sense of it.
The easiest way to turn a long text into a number is simply to count the words. The assumption is that the length of the review has something to do with how much the reviewer enjoyed the wine.
As expected, better wines receive longer reviews. Surprisingly, the very best wines do not get the absolute longest reviews; rather, it is the wines between 92 and 98 points that get the longest ones.
As a further check, it would be interesting to study how the description length correlates with the individual reviewers, since some reviewers certainly write longer reviews than others (see the quick check after the plots below).
Surprisingly, there seems to be much less of a clear correlation between the description length and the price. If so, price and description length should be complementary pieces of information for guessing the number of points.
def count_words(d):
    # Some descriptions are missing and show up as NaN (a float),
    # so we guard against non-string entries and count them as zero words
    if not isinstance(d, str):
        return 0
    return len(d.split())
def get_descr_length(dat):
    # Work on a copy to avoid pandas' SettingWithCopyWarning
    descr = dat[['description']].copy()
    descr['descr_length'] = descr['description'].apply(count_words)
    return descr.drop(columns='description')
descr_length = get_descr_length(fulldata_nona)
descr_length['descr_length'].hist(bins=20)
descr_length.head()
descr_length.isnull().sum()
# Correlation between points of wine and number of words
plt.scatter(descr_length['descr_length'], fulldata_nona['points'])
plt.xlabel('Description length')
plt.ylabel('points')
plt.scatter(descr_length['descr_length'], fulldata_nona['price'].apply(np.log10))
plt.xlabel('Description length')
plt.ylabel('log10(price)')
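As suggested above, a quick sketch of the reviewer effect: the mean description length per taster, computed on the cleaned training data.
# Mean description length per reviewer (the two series share the same index)
mean_length = descr_length['descr_length'].groupby(
    fulldata_nona['taster_name']).mean()
print(mean_length.sort_values(ascending=False))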
# Adding the new feature to the data
X_train['description_length'] = descr_length['descr_length'].values
descr_le_test = get_descr_length(treat_data(fulldata_test))
X_test['description_length'] = descr_le_test['descr_length'].values
X_test.head()
# Fit with description length
gs.fit(X_train, y_train)
print(gs.best_estimator_.score(X_train, y_train),
gs.best_estimator_.score(X_test, y_test))
Indeed we increase the score significantly by adding the description length as a new feature.
# What if we only keep this new feature
gs.fit(descr_length, y_train)
print(gs.best_estimator_.score(descr_length, y_train),
gs.best_estimator_.score(descr_le_test, y_test))
This is almost as good as the price for predicting the score of the wine.
The idea was two-fold. First, I wanted to check whether it is possible to guess how much a reviewer will appreciate a wine just by looking at its objective characteristics. I found that we can guess correctly almost 2 out of 3 times whether the reviewer liked the wine or not, just by looking at its origin, its price and the name of the reviewer.
The second objective was to try to teach the machine learning algorithm to "understand" the written description, that is, whether it indicates that the wine is good or bad. The description length alone, without analysing its content, already gives a good idea: good wines inspire much longer reviews than bad ones.
There are several points that I would still like to explore. First, does a given feature tend to increase or decrease the score? For instance, I would like to check my assumption that an exotic wine tends to get a better score than a domestic one. Second, I would also like to look at the characteristics of each wine taster. Are certain tasters more critical? Do they write longer or shorter reviews? Are they fond of a certain type of wine?
Finally, by analysing the preferences of each taster, one could perhaps construct a wine suggestion algorithm: for a given taster, based on their past preferences, it would predict whether a new wine is likely to please them or not.