Classification and Regression

Classification

In machine learning, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Example

Consider the following data:

 

Score1

29

22

10

31

17

33

32

20

Score2

43

29

47

55

18

54

40

41

Result

Pass   Fail   Fail   Pass   Fail   Pass   Pass   Pass

Table 1.1: Example data for a classification problem

                       

Data in Table 1.1 is the training set of data. There are two attributes “Score1” and “Score2”. The class label is called “Result”. The class label has two possible values “Pass” and “Fail”. The data can be divided into two categories or classes: The set of data for which the class label is “Pass” and the set of data for which the class label is “Fail”.

Let us assume that we have no knowledge about the data other than what is given in the table. Now, the problem can be posed as follows:

 If we have some new data, say “Score1 = 25” and “Score2 = 36”, what value should be assigned to “Result” corresponding to the new data; in other words, to which of the two categories or classes the new observation should be assigned? See Figure 1.1:  for a graphical representation of the problem.



Figure 1.1: Graphical representation of data in Table 1.1. Solid dots represent data in “Pass” class and hollow dots data in “Fail” class. The class label of the square dot is to be determined.

 

To answer this question, using the given data alone we need to find the rule, or the formula, or the method that has been used in assigning the values to the class label “Result”. The problem of finding this rule or formula or the method is the classification problem. In general, even the general form of the rule or function or method will not be known. So several different rules, etc. may have to be tested to obtain the correct rule or function or method.

 

Examples


Consider the data given in Table 1.1 and the associated classification problem. We may consider the following rules for the classification of the new data:

 

 IF Score1 + Score2 60, THEN “Pass” ELSE “Fail”.

IF Score1 20 AND Score2 40 THEN “Pass” ELSE “Fail”.

 

Or, we may consider the following rules with unspecified values for M, m1, m2 and then by some method estimate their values.


 

 

IF Score1 + Score2  M, THEN “Pass” ELSE “Fail”.


IF Score1 m1 AND Score2 m2 THEN “Pass” ELSE “Fail”.

 

Algorithms

There are several machine learning algorithms for classification. The following are some of the well-known algorithms.

a)      Logistic regression

b)      Naive Bayes algorithm

c)      k-NN algorithm

d)      Decision tree algorithm

e)      Support vector machine algorithm

f)      Random forest algorithm

 

Remarks

    A classification problem requires that examples be classified into one of two or more classes.

    A classification can have real-valued or discrete input variables.

    A problem with two classes is often called a two-class or binary classification problem.

    A problem with more than two classes is often called a multi-class classification problem.

    A problem where an example is assigned multiple classes is called a multi-label

classification problem.

 

Regression

In machine learning, a regression problem is the problem of predicting the value of a numeric variable based on observed values of the variable. The value of the output variable may be a number, such as an integer or a floating point value. These are often quantities, such as amounts and sizes. The input variables may be discrete or real-valued.

 

Example

Consider the data on car prices given in Table 1.2.

 

Price

Age

Distance

Weight

(US$)

(years)

(KM)

(pounds)

13500

23

46986

1165

13750

23

72937

1165

13950

24

41711

1165

14950

26

48000

1165

13750

30

38500

1170

12950

32

61000

1170

16900

27

94612

1245

18600

30

75889

1245

21500

27

19700

1185

12950

23

71138

1105

 

Table 1.2: Prices of used cars: example data for regression

 

Suppose we are required to estimate the price of a car aged 25 years with distance 53240 KM and weight 1200 pounds. This is an example of a regression problem because we have to predict the value of the numeric variable “Price”.

 

General approach

Let x denote the set of input variables and y the output variable. In machine learning, the general approach to regression is to assume a model, that is, some mathematical relation between x and y, involving some parameters say, θ, in the following form:

y = f (x, θ)

The function y = f (x, θ) is called the regression function. The machine learning algorithm optimizes the parameters in the set θ such that the approximation error is minimized; that is, the estimates of the values of the dependent variable y are as close as possible to the correct values given in the training set.

 

Example

 


For example, if the input variables are “Age”, “Distance” and “Weight” and the output variable is “Price”, the model may be

 

                                                                                               y = f (x, Î¸)

 

 

Price = a0 + a1 × (Age) + a2 × (Distance) + a3 × (Weight)

 

Where x =    (Age, Distance, Weight) denotes the set of input variables and θ  = (a0, a1, a2, a3) denotes the set of parameters of the model.

 

 regression Algorithms

    Simple linear regression

    Multivariate linear regression

    Polynomial regression

    Logistic regression

    Decision tree

    Random forest

 

Different regression models:

 

There are various types of regression techniques available to make predictions. These techniques mostly differ in three aspects, namely, the number and type of independent variables, the type of dependent variables and the shape of regression line. Some of these are listed below.

    Simple linear regression: There is only one continuous independent variable x and the assumed relation between the independent variable and the dependent variable y is

y = a + bx.

 

    Multivariate linear regression: There are more than one independent variable, say x1, . . . , xn, and the assumed relation between the independent variables and the dependent variable is

y = a0 + a1x1 + + anxn.

    Polynomial regression: There is only one continuous independent variable x and the assumed model is

y = a0 + a1x + + anxn.

    Logistic regression: The dependent variable is binary, that is, a variable which takes only the values 0 and 1. The assumed model involves certain probability distributions.

Post a Comment

0 Comments