Assignment - 1 (Machine Learning)

Linear Regression

Author: Jimut Bahan Pal

As usual, importing the necessary libraries

In [76]:
import numpy as np                                          # array and matrix operations
import matplotlib.pyplot as plt                             # plotting library
import matplotlib.patches as mpatches                       # patches used for custom legend entries
from mpl_toolkits.mplot3d import Axes3D                     # 3D plotting support
from sklearn import datasets, linear_model                  # machine learning library
from sklearn.datasets import make_regression                # synthetic regression dataset generator
from sklearn.linear_model import LinearRegression           # linear regression model
from sklearn.preprocessing import PolynomialFeatures        # polynomial feature expansion
from sklearn.metrics import mean_squared_error, r2_score    # metrics used to compute RMSE and R^2

Part - A

Creation of 1D dataset

An artificially generated 1D dataset of the form $y = mx + \epsilon$, where $\epsilon$ is Gaussian noise with standard deviation 6.

In [77]:
X, y = make_regression(n_samples=100, n_features=1, noise=6)
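
For reference, `make_regression` can also return the slope it used to build the targets, and it accepts a `random_state` so the dataset is the same on every run. The sketch below is not the cell used in this notebook; the seed `0` and the names `X_demo`, `y_demo`, `true_coef` and `slope` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# The seed is arbitrary and only for illustration; the notebook cell above fixes none.
X_demo, y_demo, true_coef = make_regression(
    n_samples=100, n_features=1, noise=6, coef=True, random_state=0
)

# With one feature, the returned coefficient is the slope m in y = m*x + eps.
slope = float(np.ravel(true_coef)[0])

# Removing the linear part leaves only the Gaussian noise term eps.
residual = y_demo - slope * X_demo[:, 0]
print("true slope m:", slope)
print("standard deviation of the noise:", residual.std())
```

The printed noise standard deviation should come out close to the `noise=6` argument, though not exactly 6 because only 100 samples are drawn.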
In [78]:
# plot regression dataset

fig, ax = plt.subplots()

ax.set_title('The plot for the whole dataset')
scatter = ax.scatter(X,y,color='#4224eb')

# A single patch in the scatter colour serves as the legend entry
legend_patch = mpatches.Patch(color='#4224eb', label='1D dataset')
plt.legend(handles=[legend_patch],loc="lower right", title="Legend")

ax.grid(True)
plt.show()
In [79]:
# Split the data into training/testing sets
X_train = X[:-20]
X_test = X[-20:]

# Split the targets into training/testing sets
y_train = y[:-20]
y_test = y[-20:]
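
The slices above hold out the last 20 samples. That works here because the generated samples are in no particular order, but for ordered data a shuffled split is usually safer. A possible alternative, sketched under the assumption that scikit-learn's `train_test_split` is acceptable for this assignment (the names `X_tr`, `X_te`, `y_tr`, `y_te` and the seeds are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Stand-in dataset with the same settings as the notebook; the seed is arbitrary.
X_demo, y_demo = make_regression(n_samples=100, n_features=1, noise=6, random_state=0)

# test_size=20 holds out 20 samples, matching the 80/20 slicing used above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=20, random_state=0
)
print(X_tr.shape, X_te.shape)
```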
In [80]:
# Create linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)
Out[80]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [81]:
# Make predictions using the testing set
y_pred = regr.predict(X_test)
print(y_pred)
[-50.22764936  -8.23980391  21.37777125 -49.43122707 -17.46700868
 -26.89924953  -2.26277484   9.91762861  14.55016599  27.72411913
 -35.53898008   7.19011314  -7.13377952 -21.22369842 -20.21989661
   6.58208755  33.6071665  -16.80304129  10.52432491 -49.7280335 ]

Root mean squared error

In [82]:
# The coefficients
print('Coefficients: \n', regr.coef_)

# The root mean squared error
print('Root Mean squared error [RMSE]: %.2f'
      % np.sqrt(mean_squared_error(y_test, y_pred)))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(y_test, y_pred))
Coefficients: 
 [27.68789241]
Root Mean squared error [RMSE]: 6.62
Coefficient of determination: 0.93
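
To make the two metrics concrete, both can be computed by hand: RMSE is $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ and the coefficient of determination is $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below checks these formulas against scikit-learn on a small made-up vector; the numbers are illustrative and are not the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: one minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("manual RMSE :", rmse_manual)
print("sklearn RMSE:", np.sqrt(mean_squared_error(y_true, y_hat)))
print("manual R^2  :", r2_manual)
print("sklearn R^2 :", r2_score(y_true, y_hat))
```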
In [83]:
# Plot outputs

fig, ax = plt.subplots()

ax.set_title('The plot for the test set')
scatter = ax.scatter(X_test, y_test,color='#4224eb')

# A single patch in the scatter colour serves as the legend entry
legend_patch = mpatches.Patch(color='#4224eb', label='Test set')
plt.legend(handles=[legend_patch],loc="lower right", title="Legend")

plt.plot(X_test, y_pred, color='#f90909', linewidth=3)

# plt.xticks(())
# plt.yticks(())

ax.grid(True)
plt.show()

Part - B

Creation of 4D dataset

${h _\theta }\left( x \right) = {\theta _0} + {\theta _1}{x _1} + {\theta _2}x _1^2 + {\theta _3}x _1^3 + ... + {\theta _n}x _1^n$
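
As a small illustration of the hypothesis above, `PolynomialFeatures` applied to a single feature $x_1$ produces exactly the columns $1, x_1, x_1^2, \dots, x_1^n$ that the $\theta$ coefficients multiply. The values of `x1` and the choice of degree 3 are assumptions made to keep the output small.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples of one feature, stored as a column (shape (2, 1)).
x1 = np.array([[2.0], [3.0]])

# Each output row holds [1, x1, x1^2, x1^3].
expanded = PolynomialFeatures(degree=3).fit_transform(x1)
print(expanded)
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```

A linear regression fitted on these columns then learns one $\theta_j$ per column, which is what makes the model polynomial in $x_1$ while remaining linear in the parameters.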

In [84]:
X, y = make_regression(n_samples=100, n_features=4, noise=6)

The dataset has 4 features, i.e. 4 columns, as the array below shows.

In [85]:
X
Out[85]:
array([[-2.15226364e+00, -5.00218872e-01,  1.11450604e+00,
        -1.86073162e-01],
       [ 1.64817412e+00,  4.57225473e-01,  1.06333539e+00,
         6.80776250e-01],
       [ 6.64231789e-01,  2.33461926e-01, -2.36015528e+00,
        -1.52245598e+00],
       [ 7.30918837e-01, -1.93838555e+00, -1.78745978e+00,
         4.49551570e-01],
       [-1.24331388e+00, -7.45916438e-01,  4.60316519e-02,
        -1.35155844e+00],
       [ 3.53479852e-01,  4.84011368e-01,  4.41739355e-01,
        -8.61682521e-01],
       [ 1.15591173e+00, -1.11363617e-01,  2.39132012e-01,
         4.88286470e-01],
       [-4.24086950e-01,  9.38457761e-01, -1.44039846e-02,
         7.81324962e-01],
       [-1.34629031e-02, -9.71058921e-02, -3.53644981e-01,
        -9.17978580e-01],
       [-1.16412511e+00,  6.88415286e-01,  4.17606506e-01,
        -2.76615899e-01],
       [ 4.85460259e-01,  2.43031371e-01, -1.11334016e+00,
         6.01926654e-01],
       [-2.54451502e-01,  7.63103671e-01, -4.55841314e-01,
        -1.55381053e-01],
       [ 3.19170344e-01,  1.22487609e-01,  1.49503533e+00,
         6.37572177e-01],
       [ 2.88328675e+00,  1.22237030e+00,  5.26779793e-02,
         7.65774957e-01],
       [-4.05392026e-02,  5.72163262e-01, -3.41063107e-02,
        -1.13875386e-01],
       [-2.13404672e+00,  1.72155742e+00,  1.49449048e+00,
        -1.08673295e+00],
       [ 1.69036265e+00,  1.86084179e+00, -1.24654269e+00,
        -1.13435681e-02],
       [ 1.08217900e+00,  6.89223773e-01, -6.78134000e-01,
        -8.45502157e-01],
       [-3.59947193e-01,  1.28714488e+00,  6.23223458e-01,
        -6.76078880e-03],
       [-1.72884257e+00,  1.12888536e+00, -2.68053548e-01,
        -7.87643896e-01],
       [ 6.56253367e-01, -8.40176324e-01,  6.43854496e-01,
        -4.21448437e-01],
       [ 1.03747720e-01,  3.79241140e-01,  9.83050155e-03,
         1.33327100e-01],
       [ 5.52313935e-01,  6.51768329e-01, -1.29521143e+00,
        -6.66365450e-01],
       [ 8.07707427e-01, -3.35888001e-01, -1.52336236e+00,
         2.24134765e-01],
       [-1.22058277e+00, -1.34366773e+00, -1.09389393e+00,
        -1.68325553e+00],
       [-1.74534555e+00, -5.00860429e-01, -1.52566922e+00,
        -4.94549825e-01],
       [ 3.50983275e-01,  1.02479025e+00,  8.74753518e-01,
        -1.08811830e+00],
       [ 5.20978381e-01, -1.33231760e+00,  4.36977244e-01,
        -6.92352060e-01],
       [-6.63279332e-02, -2.46981849e-01, -1.68199571e+00,
        -1.61651664e+00],
       [-1.41889718e-01,  6.89547755e-01, -2.13565001e+00,
         2.65634632e-01],
       [-2.85630842e+00, -3.50224209e-01,  9.08656516e-01,
         8.44727465e-02],
       [ 1.36188542e+00,  8.76056358e-01,  4.51394180e-01,
        -9.71135395e-01],
       [ 3.02878923e-01,  7.16234699e-01, -3.86521118e-01,
        -5.16436657e-01],
       [ 4.83117416e-01,  1.72018971e+00,  1.61107131e-02,
         6.31090527e-01],
       [ 4.52327016e-01, -8.85877538e-02, -3.58624458e-01,
        -1.48417908e+00],
       [ 1.59097160e+00, -1.35215537e-01,  1.75539877e+00,
         5.17554082e-01],
       [-7.51783966e-01,  9.68171995e-01,  7.05644106e-02,
        -2.77752848e-01],
       [-1.37971920e+00, -1.50070770e-01,  6.55180546e-01,
        -2.14567210e-01],
       [-2.45851105e-01,  4.49981738e-01, -3.58871190e-01,
         2.69410640e-01],
       [ 5.71508808e-01,  1.02695909e+00,  1.11993548e+00,
         9.00201213e-01],
       [ 2.09271120e-02,  1.55049959e+00, -2.74624604e-02,
        -6.04130550e-01],
       [ 1.00331223e+00,  1.36316321e+00,  5.19125543e-01,
        -1.31892657e+00],
       [-5.28583143e-01, -4.87691046e-01, -1.89407355e-02,
        -2.10848559e-02],
       [ 5.31978428e-02,  7.25835111e-01, -5.03126279e-01,
        -4.42512301e-01],
       [-1.76896653e-01, -1.70475295e+00, -1.00151923e+00,
        -1.67071238e+00],
       [ 4.19768969e-01, -2.02679311e-01,  1.83020106e+00,
        -8.45983666e-01],
       [-8.86406716e-01, -6.36805833e-01,  7.55045777e-01,
        -5.27394930e-01],
       [ 2.39448259e-01, -8.06192304e-01, -7.26038773e-01,
        -1.42152315e+00],
       [ 1.00186033e-01, -1.36418249e+00, -1.76839465e+00,
        -3.50716188e-01],
       [ 2.50251720e+00, -1.65518879e+00, -1.81735305e+00,
         6.12381091e-01],
       [-1.58234011e+00,  1.55114139e-01, -5.65145733e-01,
        -1.28566440e+00],
       [-4.98851514e-01,  4.29546411e-02,  2.75565586e+00,
        -7.53010670e-02],
       [-8.77913308e-01, -2.33862873e+00, -1.65321549e+00,
         2.38381464e-01],
       [ 9.00509503e-01, -3.72472976e-01, -7.25409342e-01,
         1.04173182e+00],
       [ 1.91239952e-01, -1.15870114e+00, -3.92051705e-01,
         5.52834544e-01],
       [ 2.56605327e-01, -4.37541118e-01,  1.49085735e+00,
        -1.83099714e+00],
       [ 1.20855702e-01, -7.07881844e-01,  4.49057527e-01,
         8.35676689e-01],
       [-3.86048850e-01,  4.80052531e-01, -2.26207965e-01,
        -9.61446455e-01],
       [ 2.40354500e-01, -2.40473212e-02, -1.86783641e+00,
         8.26644290e-01],
       [-1.66465102e-01,  3.91746726e-01, -1.67217356e-02,
         1.36953626e+00],
       [ 2.33912348e-01, -1.37956476e+00,  3.22619228e-01,
         8.89264189e-01],
       [-1.64541711e+00,  9.01387144e-01,  3.54810658e-01,
        -2.70661064e+00],
       [ 4.66275336e-01,  2.33165234e-01,  6.11576337e-02,
         1.35665892e+00],
       [ 1.18808196e+00,  1.97019025e-01, -1.63685593e-01,
         3.32389356e-01],
       [-2.74494881e-01,  2.18527870e-01, -1.80884899e-01,
        -2.63057105e+00],
       [-1.76205879e-01, -4.07640846e-01,  6.94985207e-01,
        -3.92819542e-01],
       [ 2.70828521e-01, -1.00266838e+00, -5.54521348e-01,
         7.99104370e-02],
       [-1.79407460e+00,  4.78881354e-01, -6.76150927e-01,
         1.70456091e-02],
       [-1.89910570e+00,  9.47252218e-02,  1.74514896e+00,
        -1.92177264e-02],
       [ 4.94222265e-02, -2.75817902e-01, -3.37935177e-01,
         1.48845863e+00],
       [-9.30560887e-01, -1.95583738e-01, -5.83841197e-01,
        -3.70034944e-02],
       [-9.87162127e-01, -7.55916422e-01,  1.66539822e+00,
         1.93640518e-01],
       [ 7.34641823e-01,  7.06665912e-01, -1.82597565e-01,
         3.77040013e-01],
       [-1.00055869e+00, -1.01865534e+00, -7.85726483e-01,
        -4.11130178e-02],
       [-7.36877814e-02, -4.58235374e-01, -2.76192248e-01,
         1.07475524e+00],
       [ 1.05573477e+00, -1.32636841e+00,  2.47557280e-01,
         7.69136468e-01],
       [ 7.80093732e-01,  3.67022095e-01,  5.08964736e-01,
        -1.08716820e+00],
       [ 6.24471891e-02,  5.42960358e-01, -7.53280399e-01,
        -1.39819689e+00],
       [-3.33480469e-01,  1.38460065e+00,  5.95074597e-01,
         4.31377397e-02],
       [-3.58789186e-01,  1.14804762e+00,  2.58758155e+00,
         2.79470067e-03],
       [ 5.02146006e-01, -1.65527900e-02, -1.96531125e+00,
        -2.76420270e-01],
       [ 9.11283711e-01, -3.26176905e-01, -6.46981203e-01,
        -2.05907625e-01],
       [-3.90782122e-01, -4.98475456e-01,  7.70287904e-01,
         8.53090261e-01],
       [-7.33982548e-01,  2.95111530e-01,  7.97163433e-01,
         3.55037810e-01],
       [ 1.44987187e+00,  4.49916902e-01, -4.97805188e-01,
        -1.93468856e-01],
       [ 2.24409794e+00,  3.32210035e-01,  1.27992200e+00,
         1.01063191e+00],
       [ 5.72422145e-01, -1.89716574e+00, -6.27763946e-01,
        -3.27675743e+00],
       [ 1.00922463e-01, -1.45093800e+00,  4.93831092e-01,
        -2.06935173e-01],
       [-6.43096378e-01, -4.96703457e-01, -1.62274605e-01,
         5.70829733e-01],
       [ 5.29607327e-02, -1.00024341e+00, -5.74702800e-01,
        -3.07398431e-01],
       [-3.19110626e-01, -1.67902646e+00,  7.03483173e-02,
         4.01231914e-01],
       [ 1.83195254e+00,  6.40664718e-01,  1.43223491e+00,
         1.49748714e+00],
       [ 7.91302457e-02,  1.90003270e+00,  1.46342212e+00,
        -7.80340601e-01],
       [-5.09217416e-01,  5.60948404e-01, -1.92189475e-01,
        -4.62160977e-01],
       [ 1.19007237e+00,  6.37657859e-01,  1.20232798e+00,
        -9.35678943e-01],
       [-3.82683914e-04, -1.38379257e+00,  2.92127963e-01,
         1.28462685e+00],
       [-8.54449643e-02, -4.22945675e-01,  3.97237834e-01,
        -1.10395443e+00],
       [-5.61794356e-01,  1.71159075e+00,  9.00551243e-01,
        -1.04954797e+00],
       [ 7.48473223e-01, -4.91214033e-02,  1.86163582e+00,
         9.81122588e-01],
       [ 1.44305313e+00, -4.42956362e-01, -1.44308819e+00,
        -2.12201400e+00]])

We can select the first column, i.e. $x_1$:

In [86]:
X[:,0]
Out[86]:
array([-2.15226364e+00,  1.64817412e+00,  6.64231789e-01,  7.30918837e-01,
       -1.24331388e+00,  3.53479852e-01,  1.15591173e+00, -4.24086950e-01,
       -1.34629031e-02, -1.16412511e+00,  4.85460259e-01, -2.54451502e-01,
        3.19170344e-01,  2.88328675e+00, -4.05392026e-02, -2.13404672e+00,
        1.69036265e+00,  1.08217900e+00, -3.59947193e-01, -1.72884257e+00,
        6.56253367e-01,  1.03747720e-01,  5.52313935e-01,  8.07707427e-01,
       -1.22058277e+00, -1.74534555e+00,  3.50983275e-01,  5.20978381e-01,
       -6.63279332e-02, -1.41889718e-01, -2.85630842e+00,  1.36188542e+00,
        3.02878923e-01,  4.83117416e-01,  4.52327016e-01,  1.59097160e+00,
       -7.51783966e-01, -1.37971920e+00, -2.45851105e-01,  5.71508808e-01,
        2.09271120e-02,  1.00331223e+00, -5.28583143e-01,  5.31978428e-02,
       -1.76896653e-01,  4.19768969e-01, -8.86406716e-01,  2.39448259e-01,
        1.00186033e-01,  2.50251720e+00, -1.58234011e+00, -4.98851514e-01,
       -8.77913308e-01,  9.00509503e-01,  1.91239952e-01,  2.56605327e-01,
        1.20855702e-01, -3.86048850e-01,  2.40354500e-01, -1.66465102e-01,
        2.33912348e-01, -1.64541711e+00,  4.66275336e-01,  1.18808196e+00,
       -2.74494881e-01, -1.76205879e-01,  2.70828521e-01, -1.79407460e+00,
       -1.89910570e+00,  4.94222265e-02, -9.30560887e-01, -9.87162127e-01,
        7.34641823e-01, -1.00055869e+00, -7.36877814e-02,  1.05573477e+00,
        7.80093732e-01,  6.24471891e-02, -3.33480469e-01, -3.58789186e-01,
        5.02146006e-01,  9.11283711e-01, -3.90782122e-01, -7.33982548e-01,
        1.44987187e+00,  2.24409794e+00,  5.72422145e-01,  1.00922463e-01,
       -6.43096378e-01,  5.29607327e-02, -3.19110626e-01,  1.83195254e+00,
        7.91302457e-02, -5.09217416e-01,  1.19007237e+00, -3.82683914e-04,
       -8.54449643e-02, -5.61794356e-01,  7.48473223e-01,  1.44305313e+00])
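
One detail worth noting: indexing with a plain integer, as in `X[:,0]`, returns a 1D array, whereas scikit-learn estimators and transformers expect a 2D array of shape `(n_samples, n_features)`. A minimal sketch of the difference, using a small stand-in array `X_demo` (an assumption, not the dataset above):

```python
import numpy as np

# Stand-in for a dataset with 3 samples and 4 features.
X_demo = np.arange(12.0).reshape(3, 4)

col_1d = X_demo[:, 0]     # integer index drops the feature axis
col_2d = X_demo[:, [0]]   # list index keeps it as a single-column matrix

print(col_1d.shape, col_2d.shape)
# (3,) (3, 1)
```

The second form is the one to pass to `PolynomialFeatures` or `LinearRegression` when working with $x_1$ alone.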

Creating the train and test dataset

In [87]:
# Split the data into training/testing sets
X_train = X[:-20]
X_test = X[-20:]

# Split the targets into training/testing sets
y_train = y[:-20]
y_test = y[-20:]

Training

In [88]:
polynomial_features = PolynomialFeatures(degree=4)
x_poly = polynomial_features.fit_transform(X_train)       # fit the expansion on the training data
x_test_poly = polynomial_features.transform(X_test)       # reuse the fitted expansion on the test data

model = LinearRegression()
model.fit(x_poly, y_train)
y_poly_pred = model.predict(x_test_poly)

Root Mean Squared Error

In [89]:
rmse = np.sqrt(mean_squared_error(y_test,y_poly_pred))
r2 = r2_score(y_test,y_poly_pred)
print("Root Mean Squared Error [RMSE] : ",rmse)
print("R-2 score : ",r2)
Root Mean Squared Error [RMSE] :  48.28276313969695
R-2 score :  0.9155798975007788
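
To put the score in context, note that a degree-4 expansion of 4 features includes all cross terms, giving 70 columns for only 80 training samples. The sketch below compares a degree-1 and a degree-4 fit on a separately generated dataset; the seed, the helper `fit_and_score` and all variable names are assumptions for illustration, and the numbers will differ from the notebook's unseeded run.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Stand-in dataset with the notebook's settings; the seed is arbitrary.
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=6, random_state=0)
X_tr, X_te = X_demo[:-20], X_demo[-20:]
y_tr, y_te = y_demo[:-20], y_demo[-20:]

def fit_and_score(degree):
    """Fit a polynomial regression; return (n_columns, train RMSE, test RMSE)."""
    poly = PolynomialFeatures(degree=degree)
    A_tr = poly.fit_transform(X_tr)   # fit the expansion on training data only
    A_te = poly.transform(X_te)       # apply the same expansion to test data
    model = LinearRegression().fit(A_tr, y_tr)
    rmse_tr = np.sqrt(mean_squared_error(y_tr, model.predict(A_tr)))
    rmse_te = np.sqrt(mean_squared_error(y_te, model.predict(A_te)))
    return A_tr.shape[1], rmse_tr, rmse_te

for degree in (1, 4):
    n_cols, rmse_tr, rmse_te = fit_and_score(degree)
    print(f"degree={degree}: {n_cols} columns, "
          f"train RMSE={rmse_tr:.2f}, test RMSE={rmse_te:.2f}")
```

Because the degree-4 columns contain the degree-1 columns, the training RMSE can only go down as the degree increases; a test RMSE that is much larger than the training RMSE is a sign of overfitting.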


One problem with 4D data is that it cannot be visualised spatially, so as an experiment I use colour (a heat map) as the fourth dimension. This is an attempt to recreate an existing example of this kind of plot. Columns 1, 2 and 3 are mapped to the three spatial axes, and column 4 is mapped to the colour of each point.

In [90]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Pass the colormap by name; plt.hot() returns None and only changes the global default
img = ax.scatter(X_test[:,0], X_test[:,1], X_test[:,2], c=X_test[:,3], cmap='hot')
fig.colorbar(img)
plt.show()
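
The plot is easier to read when each axis and the colour bar are labelled with the feature they represent. A minimal sketch on random stand-in data follows; the array `pts`, the label strings and the `Agg` backend line are assumptions, the latter only so that the code also runs without a display (it can be dropped inside a notebook).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older Matplotlib)

# Random stand-in for a test set with 20 samples and 4 features.
rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 4))

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# First three features give the position, the fourth gives the colour.
img = ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=pts[:, 3], cmap="hot")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_zlabel("feature 3")

cbar = fig.colorbar(img, ax=ax)
cbar.set_label("feature 4")
fig.canvas.draw()
```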

Next is an attempt to display the data as a 4D surface, again recreating an existing example for visualising 4D data.

In [91]:
import matplotlib
from scipy.interpolate import griddata

name_color_map = 'seismic'
list_name_variables = ['x', 'y', 'z', 'c']
index_x, index_y, index_z, index_c = 0, 1, 2, 3

# X-Y are transformed into a regular 2D grid, which is a form of interpolation.
# The grid axes get their own names so that X_test and y_test stay intact
# and can still serve as the scattered input points below.
x_lin = np.linspace(X_test[:,0].min(), X_test[:,0].max(), len(np.unique(X_test[:,0])))
y_lin = np.linspace(y_test.min(), y_test.max(), len(np.unique(y_test)))
x2, y2 = np.meshgrid(x_lin, y_lin)

# Interpolation of Z: old X-Y to the new X-Y grid.
# Note: cubic interpolation can produce values < z.min, so values that are
# too low are set to the true minimum value.
z2 = griddata((X_test[:,0], y_test), X_test[:,1], (x2, y2), method='cubic', fill_value=0)
z2[z2 < X_test[:,1].min()] = X_test[:,1].min()

# Interpolation of C: old X-Y on the new X-Y grid (as we did for Z).
# The only problem is that the interpolation of C does not take Z into
# account, so the representation is less valid than the scatter plot above.
c2 = griddata((X_test[:,0], y_test), X_test[:,2], (x2, y2), method='cubic', fill_value=0)
c2[c2 < X_test[:,2].min()] = X_test[:,2].min()

#--------
color_dimension = c2  # It must be 2D, as for "X, Y, Z".
minn, maxx = color_dimension.min(), color_dimension.max()
norm = matplotlib.colors.Normalize(minn, maxx)
m = plt.cm.ScalarMappable(norm=norm, cmap=name_color_map)
m.set_array([])
fcolors = m.to_rgba(color_dimension)

# At this point X-Y-Z-C are all 2D and we can use "plot_surface".
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # fig.gca(projection='3d') no longer works in recent Matplotlib
surf = ax.plot_surface(x2, y2, z2, facecolors=fcolors, linewidth=0, rstride=1, cstride=1,
                       antialiased=False)
cbar = fig.colorbar(m, ax=ax, shrink=0.5, aspect=5)
cbar.ax.get_yaxis().labelpad = 15
cbar.ax.set_ylabel(list_name_variables[index_c], rotation=270)
ax.set_xlabel(list_name_variables[index_x])
ax.set_ylabel(list_name_variables[index_y])
ax.set_zlabel(list_name_variables[index_z])
plt.title('%s in fcn of %s, %s and %s' % (list_name_variables[index_c], list_name_variables[index_x], list_name_variables[index_y], list_name_variables[index_z]))
plt.show()
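
The key step in the cell above is the interpolation from scattered points onto a regular grid. The sketch below isolates that step on a small synthetic surface; the sample points `px`, `py`, the function $z = x^2 + y^2$ and the 10 by 10 grid size are assumptions chosen for illustration.

```python
import numpy as np
from scipy.interpolate import griddata

# Scattered sample points of a known surface z = x^2 + y^2.
rng = np.random.default_rng(0)
px = rng.uniform(-1, 1, size=30)
py = rng.uniform(-1, 1, size=30)
pz = px ** 2 + py ** 2

# Regular grid on which the surface will be evaluated.
gx, gy = np.meshgrid(np.linspace(-1, 1, 10), np.linspace(-1, 1, 10))

# Cubic interpolation; grid nodes outside the convex hull of the samples get fill_value.
gz = griddata((px, py), pz, (gx, gy), method="cubic", fill_value=0)

# Clip undershoots (and the fill value) up to the smallest observed z.
gz = np.maximum(gz, pz.min())
print(gz.shape)
# (10, 10)
```

The resulting `gx`, `gy`, `gz` arrays are all 2D with the same shape, which is the form `plot_surface` requires.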

References

  1. Dripta Maharaj, (2020), Slides, available on web https://sites.google.com/view/da220-2019-20 , last accessed on 16.1.2020.
  2. Brownlee, J., (2018), How to Generate Test Datasets in Python with scikit-learn, available on web https://machinelearningmastery.com/generate-test-datasets-python-scikit-learn/ , last accessed on 16.1.2020.
  3. Scikit-learn documentation, Linear Regression Example, available on web https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py , last accessed on 16.1.2020.
  4. user2386081, (2013), Matplotlib scatter plot legend, https://stackoverflow.com/questions/17411940/matplotlib-scatter-plot-legend , last accessed on 16.1.2020.
  5. Dixon & Moe, (2015), Searching for that perfect color has never been easier, use our HTML color picker to browse millions of colors and color harmonies. https://htmlcolorcodes.com/color-picker/ , last accessed on 16.1.2020.
  6. Matplotlib documentation, Scatter plots with a legend, https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/scatter_with_legend.html , last accessed on 16.1.2020.
  7. Tengis, (2013), How to make a 4d plot with matplotlib using arbitrary data, https://stackoverflow.com/questions/14995610/how-to-make-a-4d-plot-with-matplotlib-using-arbitrary-data , last accessed on 16.1.2020.
  8. Agarwal, A., (2018), Polynomial Regression, Towards Data Science, https://towardsdatascience.com/polynomial-regression-bbe8b9d97491 , last accessed on 16.1.2020.

Acknowledgements

  • Dripta Maharaj