Lecture 03: Machine Learning

 

 

Recap of Lecture 02:

We previously discussed the idea of exploratory data analysis (EDA) and how to implement it. We attached a completed EDA project on the Iris flower dataset, and now we want you to understand what its lines of code are doing.

Continuing Lecture 03:

To be clear, we don’t expect you to know the meaning of every line of code, but we do want you to get an idea of the methods that can be used for EDA. After working through the EDA, we will also learn how to run a linear regression model.

2.1.2 Overview of Iris flower EDA project

The Iris flower EDA project uses a simple dataset to help us grasp the basics of EDA. There are three species of Iris flowers:



Iris virginica



Iris versicolor



Iris setosa

Our objective is to classify a new flower as one of the three species given four features: sepal length, sepal width, petal length and petal width.

Through the activity we will find out the role of domain knowledge in machine learning.

2.1.3 Downloading the CSV File

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Download iris.csv from
# https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
# Load iris.csv into a pandas DataFrame.
iris = pd.read_csv("iris.csv")

From this portion of Python code in the Jupyter notebook, we learn how to load the data from a CSV (comma-separated values) file into the notebook.

Here’s a short description of the libraries used in the code (We encourage you to go through the documentation for more details):

Pandas:

Pandas is an open-source library that is made mainly for working with relational or labeled data both easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series.

Numpy:

Numpy is a general-purpose array-processing package. It provides a high-performance multidimensional array object, and tools for working with these arrays. It is the fundamental package for scientific computing with Python.

Besides its obvious scientific uses, Numpy can also be used as an efficient multi-dimensional container of generic data.

Seaborn:

Seaborn is a library mostly used for statistical plotting in Python. It is built on top of Matplotlib and provides beautiful default styles and color palettes to make statistical plots more attractive.

Matplotlib.pyplot:

Matplotlib is a plotting library for creating static, animated, and interactive visualizations in Python. Matplotlib can be used in Python scripts, the Python and IPython shell, web application servers, and various graphical user interface toolkits like Tkinter, wxPython, etc.

Further down the Jupyter notebook, there are some print statements that show us the data contained in the CSV file that we previously downloaded.
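As a rough sketch of what such inspection statements can look like, the snippet below loads a six-row excerpt of the Iris data from an in-memory string (an assumption made here so that it runs without iris.csv on disk; the variable name `sample` is ours, not the notebook's). The full file has 150 rows, 50 per species.

```python
import io
import pandas as pd

# Six rows of the Iris data, two per species, kept in memory so the
# example runs without the iris.csv file.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                      # (number of rows, number of columns)
print(sample.columns.tolist())           # names of the columns
print(sample.head())                     # first rows of the table
print(sample["species"].value_counts())  # how many flowers per species
```

The same four calls work unchanged on the full `iris` DataFrame loaded with `pd.read_csv("iris.csv")`.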

2.1.4 2-D Scatter Plots

#2-D scatter plot:
#ALWAYS understand the axis: labels and scale.

iris.plot(kind='scatter', x='sepal_length', y='sepal_width')
plt.show()

Here is a basic plot of two characteristics of the flowers. We use the ‘.plot’ method of the pandas DataFrame to create the scatter plot from the data in the CSV file.

Since we want a scatter plot, we set kind = ‘scatter’. We can plot any two characteristics of the flowers against each other. In this example we use sepal_length vs sepal_width. After setting x = sepal_length and y = sepal_width, we call plt.show(), which displays the scatter plot.



This scatter plot is hard to read: all the points are blue, so we cannot tell which point belongs to which species.

As a solution, we can color-code the points by flower type.

sns.set_style("whitegrid")
sns.FacetGrid(iris, hue="species") \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend()
plt.show()

Here sns is the alias under which we imported the Python library seaborn.

The ‘sns.set_style’ function is a styling feature; "whitegrid" gives the graph a white background with grid lines.

The arguments of ‘sns.FacetGrid’ are

  1. The data, which is the ‘iris’ DataFrame that we loaded from the CSV file.
  2. The column and row specifications (col and row), which split the data into multiple graphs. For example, if a dataset had a “gender” column, setting col = “gender” would give two scatter plots, one for each gender.
  3. The hue specification, which colors the points by the values of a column. In this graph, the points are colored according to their species.

For those interested, we highly recommend reading the documentation for seaborn.

The ‘.map’ method draws the graph: it takes the plotting function (plt.scatter) and the columns being graphed (sepal_length and sepal_width).

The ‘.add_legend’ method does exactly what it says. It adds a legend so the viewer knows which color is mapped to which species.

‘plt.show()’ presents us with the following graph.



Here we can see that by using the sepal_length and sepal_width features, we can distinguish the Setosa flowers from the others. Separating Versicolor and Virginica is much harder as they have considerable overlap.
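To see what the hue argument does behind the scenes, the same kind of colored plot can be sketched by hand with plain matplotlib: group the rows by species and draw one scatter per group. This is only an illustration under our own assumptions (a six-row excerpt in a DataFrame we call `sample`), not how seaborn is implemented internally.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small excerpt of the Iris data (two rows per species) so the example is self-contained.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

fig, ax = plt.subplots()

# One scatter call per species: matplotlib moves to the next color for each call.
for name, group in sample.groupby("species"):
    ax.scatter(group["sepal_length"], group["sepal_width"], label=name)

ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
ax.legend(title="species")
plt.show()
```

FacetGrid with hue saves us from writing this loop and the legend by hand.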

2.1.5 Pair-Plot

plt.close()
sns.set_style("whitegrid")
# Recent seaborn versions use height= for the panel size (older versions called it size=).
sns.pairplot(iris, hue="species", height=3)
plt.show()

In this section we will use a different plot to represent the data. As the name suggests, a pair-plot shows every pair of features plotted against each other.

The disadvantage of a pair-plot is that it cannot visualise higher-dimensional patterns in 3-D or 4-D. It only shows 2-D patterns.

In the code, we use the familiar sns.set_style("whitegrid") to create a grid behind each graph.

Please refer to the previous section for more details.

Now we use the ‘sns.pairplot’ function, which in our case takes the same data and hue arguments as ‘sns.FacetGrid’. For the differences, we recommend reading the Seaborn documentation.

Presenting the graph with the ever familiar ‘plt.show()’ we get:



The plots on the diagonal show the distribution of each single feature for each species. Head down to the next section to read about histograms and PDFs.

And finally, to close a plot, we use the ‘plt.close()’ function as shown in the code.
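To get a feel for how many panels a pair-plot has to draw, the short sketch below lists every pair of the four Iris features. The list name `features` is our own; the column names are the ones used in iris.csv.

```python
from itertools import combinations

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Every unordered pair of distinct features corresponds to one scatter plot.
pairs = list(combinations(features, 2))
for x_name, y_name in pairs:
    print(f"{x_name} vs {y_name}")

n = len(features)
print("distinct pairs:", len(pairs))   # n * (n - 1) / 2
print("panels in the grid:", n * n)    # mirrored pairs and the diagonal included
```

With four features there are 6 distinct pairs in a 4 x 4 grid of 16 panels. The panels above the diagonal mirror the ones below it, so it is enough to read one half.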

2.1.6 Histogram, PDF, CDF

A histogram is the graphical representation of data where data is grouped into continuous number ranges and each range corresponds to a vertical bar.

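A probability density function (PDF) describes how likely a continuous random variable is to take values near a given point; the probability that the variable falls in a range is the area under the PDF over that range. For discrete data, the corresponding function is the probability mass function (PMF), which gives the probability of each individual value.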

A cumulative distribution function(CDF) tells us the probability that a random variable takes on a value less than or equal to a particular value(x). CDF can be used for both discrete and continuous data.

We will first plot the petal length of each species along a single axis, and then visualize the same data as histograms.

import numpy as np

# One DataFrame per species, selected with a condition on the "species" column.
iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

# Petal length on the x-axis; every point is placed at y = 0.
plt.plot(iris_setosa["petal_length"],
         np.zeros_like(iris_setosa["petal_length"]), 'o')
plt.plot(iris_versicolor["petal_length"],
         np.zeros_like(iris_versicolor["petal_length"]), 'o')
plt.plot(iris_virginica["petal_length"],
         np.zeros_like(iris_virginica["petal_length"]), 'o')

plt.show()

Here,

.loc selects rows (and optionally columns) of a DataFrame by label or by a True/False condition. Here the condition keeps only the rows whose species matches the given name.

The first three lines of code are used to create three separate datasets (one for each species) by filtering the rows of the original dataset based on the "species" column.

The next three lines of code are used to create a simple scatter plot using the matplotlib library. Each of the three species is plotted separately, with the petal length on the x-axis and zeros on the y-axis. The numpy function “zeros_like()” is used to create an array of zeros with the same shape as the petal length data. This is done to ensure that the data points are plotted at the same y-coordinate.
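The filtering step can be looked at in isolation. The sketch below uses a tiny hand-made DataFrame (our own `sample`, with a few Iris petal lengths) to show the True/False mask that the comparison produces, the rows that `.loc` keeps, and the array of zeros used as y-coordinates.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# The comparison gives one True/False value per row.
mask = sample["species"] == "setosa"
print(mask.tolist())

# .loc keeps only the rows where the mask is True.
setosa = sample.loc[mask]
print(setosa)

# An array of zeros with one entry per selected row.
zeros = np.zeros_like(setosa["petal_length"])
print(zeros)
```

Plotting `setosa["petal_length"]` against `zeros` puts every flower of that species on the same horizontal line, which is the 1-D scatter plot described above.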




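# height= sets the panel size in recent seaborn versions (older versions called it size=).
# sns.distplot is deprecated in recent seaborn versions; sns.histplot with kde=True is the suggested replacement.
sns.FacetGrid(iris, hue="species", height=5) \
   .map(sns.distplot, "petal_length") \
   .add_legend()
plt.show()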

The first line of code creates a FacetGrid object, which takes the iris dataset as input and specifies the "hue" argument to be the "species" column. This means that the different species will be color-coded in the resulting plot.

The second line of code uses the “map()” method to apply the “distplot” function from seaborn to the "petal_length" column of each subset of the data (one for each species). This draws a histogram of the petal length for each species, with a smooth density curve on top of it.



You can similarly plot the PDF and CDF of any feature to see how its values are distributed for each species.
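One possible way to compute these quantities, sketched here with NumPy only, is to count the values per bin with `np.histogram` and accumulate the counts with `np.cumsum`. The data below is synthetic (randomly generated stand-in values, not real Iris measurements), and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the petal lengths of one species (not real Iris values).
petal_length = rng.normal(loc=1.5, scale=0.2, size=50)

# Histogram: how many values fall into each of 10 equal-width bins.
counts, bin_edges = np.histogram(petal_length, bins=10)

# Fraction of samples in each bin. Dividing these fractions by the
# bin width would turn them into a density estimate.
pdf = counts / counts.sum()

# Running total of the fractions: the empirical CDF at each right bin edge.
cdf = np.cumsum(pdf)

for right_edge, p, c in zip(bin_edges[1:], pdf, cdf):
    print(f"x <= {right_edge:.2f}: bin fraction = {p:.2f}, cumulative = {c:.2f}")
```

Passing `bin_edges[1:]` with `pdf` and `cdf` to `plt.plot` draws the two curves. The last cumulative value is 1 because every sample lies at or below the largest bin edge.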

2.1.7 Mean, Variance, Standard Deviation

The mean is a measure of central tendency that represents the average of a set of numerical values.

Mean=\frac{\sum x}{n}

The variance is a measure of how spread out a set of values is from their mean.

Variance=\frac{\sum x^2}{n}-(\frac{\sum x}{n})^2

The standard deviation is the square root of the variance, so it measures the spread in the same units as the data.

SD=(\frac{\sum x^2}{n}-(\frac{\sum x}{n})^2)^{\frac{1}{2}}
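As a quick check, the variance can be computed as the mean of the squares minus the square of the mean and compared with NumPy's built-in functions. The values in `x` are a few example petal lengths in centimetres chosen for this sketch.

```python
import numpy as np

# A few example petal lengths (cm) used as input for the check.
x = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])
n = len(x)

mean = x.sum() / n
variance = (x ** 2).sum() / n - mean ** 2   # mean of squares minus square of mean
sd = variance ** 0.5

print("mean:", mean, "variance:", variance, "sd:", sd)

# np.var and np.std divide by n by default, so they should agree with the formulas.
print(np.isclose(mean, np.mean(x)),
      np.isclose(variance, np.var(x)),
      np.isclose(sd, np.std(x)))
```

Note that `np.var` and `np.std` use the divide-by-n definition unless `ddof=1` is passed, which matches the formulas used in this section.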

# Mean, Variance, Std-deviation
print("Means:")
print(np.mean(iris_setosa["petal_length"]))
# Mean with an outlier.
print(np.mean(np.append(iris_setosa["petal_length"], 50)))
print(np.mean(iris_virginica["petal_length"]))
print(np.mean(iris_versicolor["petal_length"]))

print("\nStd-dev:")
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))

Here,

“np.mean” and “np.std” are used to compute the mean and the standard deviation.

If you are into applied statistics, you will already have an intuition for why knowing the mean and standard deviation is useful. The mean-with-outlier example also shows a weakness: a single extreme value can pull the mean far away from the typical values and give a misleading picture of the dataset.

This is why people also describe data using the median, and measure the spread around the median instead of around the mean. The code for the median and other statistics that describe the spread is included in the Jupyter file. You can have a look at it.
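The effect of an outlier on these statistics can be sketched with a few lines of NumPy. The spread around the median is measured here with the median absolute deviation (MAD); this is one common choice that we assume for illustration, and the helper `mad` is our own function, not taken from the notebook.

```python
import numpy as np

# A few example petal lengths (cm), and the same values plus one outlier of 50.
x = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])
x_out = np.append(x, 50)

def mad(values):
    """Median absolute deviation: median distance of the values from their median."""
    med = np.median(values)
    return np.median(np.abs(values - med))

for label, data in [("clean", x), ("with outlier", x_out)]:
    print(label,
          "| mean:", round(float(np.mean(data)), 3),
          "| median:", round(float(np.median(data)), 3),
          "| std:", round(float(np.std(data)), 3),
          "| MAD:", round(float(mad(data)), 3))
```

The mean and standard deviation change a lot when the outlier is added, while the median and the MAD barely move. This is why the median-based statistics are called robust.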

But why are they necessary? Why is it necessary for us to analyze the data for machine learning?

3.1.0 Executing the Regression Model

Here is a link to the Jupyter file. We expect you to go through it and understand the execution process.

Class_2_Regression_test.ipynb

We will try to explain each line of code, but we highly encourage you to read the documentation of any function you find hard to understand.

from sklearn.datasets import make_regression
X,y  = make_regression(n_samples=50, n_features=1, noise=50, random_state=1)

This code uses the “make_regression” function from the “sklearn.datasets” module to create a random dataset for regression analysis with 50 samples and 1 feature. The noise parameter is the standard deviation of the Gaussian noise added to the target, here 50. The dataset is assigned to the variables X and y, representing the input features and target values, respectively. The random_state parameter ensures that the generated dataset can be reproduced. sklearn.datasets provides various functions for generating synthetic datasets for machine learning and statistical analysis.
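A small sketch of how the generated data can be checked: it prints the shapes of X and y and confirms that a second call with the same random_state gives the same numbers. The names `X_again`, `y_again` and `true_coef` are ours; `coef=True` is an optional argument of make_regression that also returns the slope used to generate the data.

```python
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=50, n_features=1, noise=50, random_state=1)

# X is a 2-D array (one row per sample, one column per feature); y is 1-D.
print(X.shape, y.shape)

# Same random_state, same data: the dataset is reproducible.
X_again, y_again = make_regression(n_samples=50, n_features=1, noise=50, random_state=1)
print((X == X_again).all(), (y == y_again).all())

# With coef=True the function also returns the slope of the underlying
# linear model, i.e. the value the regression is trying to recover.
X_c, y_c, true_coef = make_regression(n_samples=50, n_features=1, noise=50,
                                      random_state=1, coef=True)
print(true_coef)
```

Because X is two-dimensional even with a single feature, the manual code later reads each value as `X[i][0]`.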

import matplotlib.pyplot as plt
plt.scatter(x=X, y=y,c='b')
plt.show()


The first line of this code imports the “pyplot” module from the “matplotlib” library under the alias “plt” for convenience.

The second line creates a scatter plot using “plt.scatter” function. The “x” and “y” parameters specify the data to be plotted on the x and y axis, respectively. The “c” parameter sets the color of the points to blue.

Now that we have created a dataset to which we can comfortably apply our linear regression model, we will move forward with two methods (manual mathematical programming and the built-in LinearRegression class from scikit-learn).

Let’s first go with the manual mathematical programming and verify our last discussion on the linear regression model.

import matplotlib.pyplot as plt

# Block 1: mean of x and mean of y.
x_mean = 0.0
for i in range(X.shape[0]):
    x_mean += X[i][0]
x_mean /= X.shape[0]

y_mean = 0.0
for i in range(y.shape[0]):
    y_mean += y[i]
y_mean /= y.shape[0]

# Block 2: slope a and intercept b of the least-squares line.
num = 0.0
denom = 0.0
for i in range(X.shape[0]):
    num = num + X[i][0]*(y[i] - y_mean)
    denom = denom + X[i][0]*(X[i][0] - x_mean)
a = num / denom
b = y_mean - a*x_mean

# Block 3: points on the fitted line, drawn over the data.
x_values = []
y_values = []
for i in range(X.shape[0]):
    x_values.append(X[i][0])
    y_values.append(a*X[i][0] + b)

plt.scatter(x=X, y=y, c='b')
plt.plot(x_values, y_values)
plt.show()


The first block of code calculates the mean values of x and y using a for loop.

The second block of code calculates the slope a and the y-intercept b of the line using the mean values and the data (x,y).

The third block of code creates two empty lists x_values and y_values, which are then populated using a for loop to generate the (x,y) coordinates for the line that represents the regression model. Finally, a scatter plot of the data (x,y) is created with blue points using plt.scatter, and the regression line is plotted on top of it using plt.plot .
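The slope computed in the second block may look different from the usual textbook formula, which uses the deviations from the mean in both factors. The short derivation below, written in our own notation with \bar{x} and \bar{y} for the two means, shows that both forms give the same value, because the deviations from a mean always sum to zero.

```latex
\begin{align*}
\sum_i (x_i-\bar{x})(y_i-\bar{y})
  &= \sum_i x_i (y_i-\bar{y}) - \bar{x}\sum_i (y_i-\bar{y})
   = \sum_i x_i (y_i-\bar{y}),
   && \text{since } \sum_i (y_i-\bar{y}) = 0 \\
\sum_i (x_i-\bar{x})^2
  &= \sum_i x_i (x_i-\bar{x}) - \bar{x}\sum_i (x_i-\bar{x})
   = \sum_i x_i (x_i-\bar{x}),
   && \text{since } \sum_i (x_i-\bar{x}) = 0 \\
a &= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
   = \frac{\sum_i x_i (y_i-\bar{y})}{\sum_i x_i (x_i-\bar{x})},
   \qquad b = \bar{y} - a\,\bar{x}
\end{align*}
```

The last line is exactly what the loop computes: num is the numerator, denom is the denominator, and b follows from the two means.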

If we observe carefully, the regression line follows the overall trend of the points in our 2-D plot, although the points scatter around it because of the noise we added. The same fitted line can now be used to predict the target value for an unknown data point.

Similarly, the linear regression model can be applied to our dataset using the built-in LinearRegression class from scikit-learn as follows:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
# Print explicitly: a notebook cell only displays the value of its last expression.
print(model.coef_)
print(model.intercept_)

This prints the coefficient (slope) and the intercept. You can compare them with the values of a and b computed manually above to verify that our theory for the linear regression model works.
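The comparison can also be done in code. The sketch below recomputes the slope and intercept with the manual formula (written in vectorized NumPy form instead of loops), fits scikit-learn's LinearRegression on the same data, and checks that both agree. The input `x_new = 0.5` is a hypothetical value chosen only to illustrate a prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=50, n_features=1, noise=50, random_state=1)

# Manual least-squares formulas, vectorized.
x = X[:, 0]
a = np.sum(x * (y - y.mean())) / np.sum(x * (x - x.mean()))
b = y.mean() - a * x.mean()

# The same fit with scikit-learn.
model = LinearRegression().fit(X, y)

print("manual :", a, b)
print("sklearn:", model.coef_[0], model.intercept_)
print("match  :", np.allclose([a, b], [model.coef_[0], model.intercept_]))

# R^2 on the training data: 1.0 would be a perfect fit, lower values
# mean more of the variation is left unexplained (here, by the noise).
print("R^2    :", model.score(X, y))

# Prediction for a new, hypothetical input value.
x_new = np.array([[0.5]])
print(model.predict(x_new)[0], a * 0.5 + b)
```

Both approaches minimise the same squared error, so the two pairs of numbers should agree up to floating-point rounding.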
