Classification and Regression

Classification

In machine learning, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Example

Consider the following data:

Score1	29	22	10	31	17	33	32	20
Score2	43	29	47	55	18	54	40	41
Result	Pass Fail Fail Pass Fail Pass Pass Pass

Table 1.1: Example data for a classification problem

Data in Table 1.1 is the training set of data. There are two attributes “Score1” and “Score2”. The class label is called “Result”. The class label has two possible values “Pass” and “Fail”. The data can be divided into two categories or classes: The set of data for which the class label is “Pass” and the set of data for which the class label is “Fail”.

Let us assume that we have no knowledge about the data other than what is given in the table. Now, the problem can be posed as follows:

If we have some new data, say “Score1 = 25” and “Score2 = 36”, what value should be assigned to “Result” corresponding to the new data; in other words, to which of the two categories or classes the new observation should be assigned? See Figure 1.1: for a graphical representation of the problem.

Figure 1.1: Graphical representation of data in Table 1.1. Solid dots represent data in “Pass” class and hollow dots data in “Fail” class. The class label of the square dot is to be determined.

To answer this question, using the given data alone we need to find the rule, or the formula, or the method that has been used in assigning the values to the class label “Result”. The problem of finding this rule or formula or the method is the classification problem. In general, even the general form of the rule or function or method will not be known. So several different rules, etc. may have to be tested to obtain the correct rule or function or method.

Examples

Consider the data given in Table 1.1 and the associated classification problem. We may consider the following rules for the classification of the new data:

IF Score1 + Score2 ≥ 60, THEN “Pass” ELSE “Fail”.

IF Score1 ≥ 20 AND Score2 ≥ 40 THEN “Pass” ELSE “Fail”.

Or, we may consider the following rules with unspecified values for M, m₁, m₂ and then by some method estimate their values.

IF Score1 + Score2 ≥ M, THEN “Pass” ELSE “Fail”.

IF Score1 ≥ m₁ AND Score2 ≥ m₂ THEN “Pass” ELSE “Fail”.

Algorithms

There are several machine learning algorithms for classification. The following are some of the well-known algorithms.

a) Logistic regression

b) Naive Bayes algorithm

c) k-NN algorithm

d) Decision tree algorithm

e) Support vector machine algorithm

f) Random forest algorithm

Remarks

• A classification problem requires that examples be classified into one of two or more classes.

• A classification can have real-valued or discrete input variables.

• A problem with two classes is often called a two-class or binary classification problem.

• A problem with more than two classes is often called a multi-class classification problem.

• A problem where an example is assigned multiple classes is called a multi-label

classification problem.

Regression

In machine learning, a regression problem is the problem of predicting the value of a numeric variable based on observed values of the variable. The value of the output variable may be a number, such as an integer or a floating point value. These are often quantities, such as amounts and sizes. The input variables may be discrete or real-valued.

Example

Consider the data on car prices given in Table 1.2.

Price	Age	Distance	Weight
(US$)	(years)	(KM)	(pounds)
13500	23	46986	1165
13750	23	72937	1165
13950	24	41711	1165
14950	26	48000	1165
13750	30	38500	1170
12950	32	61000	1170
16900	27	94612	1245
18600	30	75889	1245
21500	27	19700	1185
12950	23	71138	1105

Table 1.2: Prices of used cars: example data for regression

Suppose we are required to estimate the price of a car aged 25 years with distance 53240 KM and weight 1200 pounds. This is an example of a regression problem because we have to predict the value of the numeric variable “Price”.

General approach

Let x denote the set of input variables and y the output variable. In machine learning, the general approach to regression is to assume a model, that is, some mathematical relation between x and y, involving some parameters say, θ, in the following form:

y = f (x, θ)

The function y = f (x, θ) is called the regression function. The machine learning algorithm optimizes the parameters in the set θ such that the approximation error is minimized; that is, the estimates of the values of the dependent variable y are as close as possible to the correct values given in the training set.

Example

For example, if the input variables are “Age”, “Distance” and “Weight” and the output variable is “Price”, the model may be

y = f (x, θ)

Price = a₀ + a₁ × (Age) + a₂ × (Distance) + a₃ × (Weight)

Where x = (Age, Distance, Weight) denotes the set of input variables and θ = (a₀, a₁, a₂, a₃) denotes the set of parameters of the model.

regression Algorithms

• Simple linear regression

• Multivariate linear regression

• Polynomial regression

• Logistic regression

• Decision tree

• Random forest

Different regression models:

There are various types of regression techniques available to make predictions. These techniques mostly differ in three aspects, namely, the number and type of independent variables, the type of dependent variables and the shape of regression line. Some of these are listed below.

• Simple linear regression: There is only one continuous independent variable x and the assumed relation between the independent variable and the dependent variable y is

y = a + bx.

• Multivariate linear regression: There are more than one independent variable, say x₁, . . . , x_n, and the assumed relation between the independent variables and the dependent variable is

y = a₀ + a₁x₁ + ⋯ + a_nx_n.

• Polynomial regression: There is only one continuous independent variable x and the assumed model is

y = a₀ + a₁x + ⋯ + a_nxⁿ.

• Logistic regression: The dependent variable is binary, that is, a variable which takes only the values 0 and 1. The assumed model involves certain probability distributions.

Classification and Regression

Classification

Example

Examples

Algorithms

Remarks

Regression

Example

General approach

Example

regression Algorithms

Post a Comment

0 Comments

Popular Posts

Information Security

Overview of information security

Database System Development Lifecycle

Categories

Tags

Random Posts

Popular Posts

Information Security

Overview of information security

Database System Development Lifecycle

Menu Footer Widget

Classification and Regression

Classification

Example

Examples

Algorithms

Remarks

Regression

Example

General approach

Example

regression Algorithms

You may like these posts

Post a Comment

0 Comments

Popular Posts

Information Security

Overview of information security

Database System Development Lifecycle

Categories

Tags

Random Posts

Popular Posts

Information Security

Overview of information security

Database System Development Lifecycle

Menu Footer Widget