Classification
In machine learning, classification
is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training
set of data containing observations (or instances) whose
category membership is known.
Example
Consider the following data:
Score1 |
29 |
22 |
10 |
31 |
17 |
33 |
32 |
20 |
Score2 |
43 |
29 |
47 |
55 |
18 |
54 |
40 |
41 |
Result |
Pass Fail Fail Pass Fail Pass Pass Pass |
Table 1.1: Example data for a classification problem
Data in Table 1.1 is the training set of data. There are two attributes “Score1” and “Score2”. The class label is called
“Result”. The class label has two
possible values “Pass” and “Fail”. The
data can be divided into two categories or classes:
The set of data for which the class label is “Pass” and
the set of data for which the class label is “Fail”.
Let us
assume that we have no knowledge about the data other than what is given in the
table. Now, the problem can be posed as follows:
If we have some new data, say “Score1
= 25” and “Score2 = 36”, what value should be assigned to “Result”
corresponding to the new data; in other words,
to which of the two categories or classes the new observation should be assigned? See Figure 1.1: for a graphical representation of the problem.
Figure 1.1: Graphical representation
of data in Table 1.1. Solid dots
represent data in “Pass” class and hollow dots data in “Fail” class. The class
label of the square dot is to be determined.
To answer
this question, using the given data alone we need to find the rule, or the
formula, or the method that has been used in assigning the values to the class
label “Result”. The problem of
finding this rule or formula
or the method is the classification problem. In general, even the general
form of the rule or function or method will not be known. So several different
rules, etc. may have
to be tested to obtain the correct rule or function or method.
Examples
IF Score1 ≥ 20 AND Score2 ≥ 40 THEN “Pass” ELSE “Fail”.
Or, we may consider
the following rules with unspecified values for M, m1, m2 and then by some method
estimate their values.
IF Score1 + Score2 ≥ M, THEN “Pass” ELSE “Fail”. |
IF Score1 ≥ m1 AND Score2 ≥ m2 THEN “Pass” ELSE “Fail”.
Algorithms
There are several machine learning
algorithms for classification. The
following are some of the well-known
algorithms.
a)
Logistic regression
b)
Naive Bayes algorithm
c)
k-NN algorithm
d)
Decision tree algorithm
e)
Support vector
machine algorithm
f)
Random forest
algorithm
Remarks
•
A classification problem requires that examples be classified into one of two or more classes.
•
A classification can have real-valued or discrete input
variables.
•
A problem
with two classes
is often called
a two-class or binary classification problem.
•
A problem
with more than two classes
is often called
a multi-class classification problem.
•
A problem
where an example
is assigned multiple
classes is called a multi-label
classification problem.
Regression
In
machine learning, a regression problem
is the problem of predicting the value of a numeric
variable based on observed values of the variable. The value of the output variable may be a
number, such as an integer or a floating point value. These are often quantities, such as amounts and sizes. The input
variables may be discrete or real-valued.
Example
Consider the data on car prices
given in Table
1.2.
Price |
Age |
Distance |
Weight |
(US$) |
(years) |
(KM) |
(pounds) |
13500 |
23 |
46986 |
1165 |
13750 |
23 |
72937 |
1165 |
13950 |
24 |
41711 |
1165 |
14950 |
26 |
48000 |
1165 |
13750 |
30 |
38500 |
1170 |
12950 |
32 |
61000 |
1170 |
16900 |
27 |
94612 |
1245 |
18600 |
30 |
75889 |
1245 |
21500 |
27 |
19700 |
1185 |
12950 |
23 |
71138 |
1105 |
Table 1.2: Prices of used cars: example data for regression
Suppose
we are required to estimate the price of a car aged 25 years with distance
53240 KM and weight 1200 pounds.
This is an example of a regression problem because we have to predict the value of the numeric variable
“Price”.
General approach
Let x denote the set of input
variables and y the output variable. In
machine learning, the general approach to regression is to assume a model, that
is, some mathematical relation between x and y, involving some parameters say,
θ, in the following form:
y = f (x, θ)
The function y = f (x, θ) is
called the regression function. The
machine learning algorithm optimizes the parameters in the set θ such that the approximation error is minimized;
that is, the estimates of the
values of the dependent variable y are as close as possible to the correct
values given in the training set.
Example
y = f (x, θ) |
Price = a0 + a1 × (Age) + a2 × (Distance) + a3 × (Weight)
Where x = (Age, Distance,
Weight) denotes the set of input variables and θ =
(a0, a1, a2, a3) denotes the set of parameters of the model.
regression Algorithms
•
Simple
linear regression
•
Multivariate linear regression
•
Polynomial regression
•
Logistic
regression
•
Decision
tree
•
Random
forest
Different regression models:
There are various types of regression techniques available to
make predictions. These techniques mostly differ in three aspects, namely, the
number and type of independent variables, the type of dependent variables and
the shape of regression line. Some of these are listed below.
• Simple linear regression: There is only one continuous independent variable x and the
assumed relation between the independent variable and the dependent variable y
is
y = a + bx.
• Multivariate linear regression: There are more than one independent variable, say x1,
. . . , xn, and the assumed relation between the independent
variables and the dependent variable is
y = a0 + a1x1
+ ⋯ + anxn.
• Polynomial regression: There is only one continuous independent variable x and the
assumed model is
y = a0 + a1x + ⋯ + anxn.
• Logistic regression: The dependent variable is binary, that is, a variable which takes only
the values 0 and 1. The assumed model involves certain probability
distributions.
0 Comments