Understanding data

 Because data storage is an important component of the machine learning process, this section briefly considers the different types and forms of data encountered in that process.

1.     Unit of observation

 By a unit of observation we mean the smallest entity with measured properties of interest for a study.

 Examples

       A person, an object or a thing

       A time point

       A geographic region

       A measurement

 Sometimes, units of observation are combined to form units such as person-years.
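
 As a rough illustration of a combined unit, the sketch below computes person-years by adding up the time each person was observed. The follow-up durations and the function name person_years are made up for this example and do not come from any real study.

```python
# Hypothetical follow-up times, in years, for five study participants.
follow_up_years = [2.0, 3.5, 1.0, 4.0, 0.5]

def person_years(durations):
    """Total observed time: each person contributes the years they were followed."""
    return sum(durations)

total = person_years(follow_up_years)
print(total)  # 11.0
```

 Here five persons observed for different lengths of time together contribute 11 person-years, so the unit of observation is no longer a single person or a single year but a combination of the two.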

2.     Examples and features

 Datasets that store the units of observation and their properties can be imagined as collections of data consisting of the following:

·       Examples

An “example” is an instance of the unit of observation for which properties have been recorded.

An “example” is also referred to as an “instance”, “case” or “record.” (Note that the word “example” is used here in a technical sense.)

·       Features

A “feature” is a recorded property or characteristic of examples. It is also referred to as an “attribute” or “variable.”

Examples for “examples” and “features”

 1. Cancer detection

 Consider the problem of developing an algorithm for detecting cancer. In this study we note the following.

 (a) The units of observation are the patients.

 (b) The examples are members of a sample of cancer patients.

 (c) The following attributes of the patients may be chosen as the features:

 • gender

 • age

 • blood pressure

 • the findings of the pathology report after a biopsy

 2. Pet selection

 Suppose we want to predict the type of pet a person will choose.

 (a) The units of observation are the persons.

 (b) The examples are members of a sample of persons who own pets.

 (c) The features might include the age, home region, family income, etc. of persons who own pets.

Figure 1: “Examples” and “features” collected in a matrix format (the data relate to automobiles and their features)

 3. Spam e-mail

 Suppose we need to build a learning algorithm to identify spam e-mail.

 (a) The unit of observation could be an e-mail message.

 (b) The examples would be specific messages.

 (c) The features might consist of the words used in the messages.
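
 One simple way to turn the words of a message into features is to count how often each word occurs. The sketch below is only one possible approach, not a prescribed method; the two messages and the helper name word_features are invented for illustration.

```python
from collections import Counter

def word_features(message):
    """Count how often each lowercase word occurs in one message."""
    return Counter(message.lower().split())

# Two hypothetical messages (the examples).
messages = ["Win a free prize now", "Meeting moved to noon"]

# The vocabulary (one feature per distinct word), in alphabetical order.
vocabulary = sorted({word for m in messages for word in word_features(m)})

# One row of word counts per message, one column per vocabulary word.
rows = [[word_features(m)[word] for word in vocabulary] for m in messages]

print(vocabulary)
for row in rows:
    print(row)
```

 Each message becomes one row of counts, and each distinct word becomes one feature. A real spam filter would normally also handle punctuation, very common words and so on, which this sketch ignores.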

 Examples and features are generally collected in a “matrix format”. Fig. 1 shows such a dataset.
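
 The matrix format can be sketched in code as a list of rows, with one row per example and one column per feature. The feature names below follow the automobile dataset described in the text; the values in the rows are invented for illustration and are not the figure's actual contents.

```python
# Column headings: one per feature.
features = ["year", "model", "price", "mileage", "color", "transmission"]

# One row per example (values are hypothetical).
examples = [
    [2011, "SEL", 21992, 7413, "Yellow", "AUTO"],
    [2011, "SEL", 20995, 10926, "Gray", "AUTO"],
    [2010, "SE", 19995, 7351, "Silver", "MANUAL"],
]

def column(name):
    """Return the values of one feature across all examples."""
    idx = features.index(name)
    return [row[idx] for row in examples]

n_examples, n_features = len(examples), len(features)
print(n_examples, n_features)  # 3 6
print(column("year"))          # [2011, 2011, 2010]
```

 Reading across a row gives all recorded properties of one example; reading down a column gives one feature for every example.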

3.     Different forms of data

 1. Numeric data

 If a feature represents a characteristic measured in numbers, it is called a numeric feature.

 2. Categorical or nominal data

 A categorical feature is an attribute that can take on one of a limited, and usually fixed, number of possible values on the basis of some qualitative property. A categorical feature is also called a nominal feature.

 3. Ordinal data

 An ordinal feature is a categorical feature whose categories fall in an ordered list. Examples include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from “not at all happy” to “very happy.”

 Examples

 In the data given in Fig. 1, the features “year”, “price” and “mileage” are numeric and the features “model”, “color” and “transmission” are categorical.
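
 The three forms of data can be told apart in code as follows. This is a minimal sketch under stated assumptions: the sample values, the ordering in SIZE_ORDER and the helper names is_numeric and encode_ordinal are all hypothetical and chosen only to mirror the examples in the text.

```python
# Ordered categories for a hypothetical "size" feature (the order is an assumption).
SIZE_ORDER = ["small", "medium", "large"]

def is_numeric(values):
    """True if every value is an int or float (booleans excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )

def encode_ordinal(values, order):
    """Map each category to its rank in the ordered list."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

prices = [21992, 20995, 19995]          # numeric feature
colors = ["Yellow", "Gray", "Silver"]   # categorical (nominal) feature
sizes = ["medium", "small", "large"]    # ordinal feature

print(is_numeric(prices))                 # True
print(is_numeric(colors))                 # False
print(encode_ordinal(sizes, SIZE_ORDER))  # [1, 0, 2]
```

 Only the ordinal feature is given numeric ranks, because its categories have a meaningful order. Assigning ranks to a purely nominal feature such as color would suggest an ordering that does not exist.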


