Understanding data
Since an important component of the machine
learning process is data storage, we briefly consider in this section the
different types and forms of data that are encountered in the machine learning
process.
1. Unit of observation
By a unit of observation we mean the smallest
entity with measured properties of interest for a study.
Examples
• A person, an object or a thing
• A time point
• A geographic region
• A measurement
Sometimes, units of observation are combined
to form units such as person-years.
2. Examples and features
Datasets that store the units of observation
and their properties can be imagined as collections of data consisting of the
following:
· Examples
An “example” is an instance of the unit of observation for which properties
have been recorded. An “example” is also referred to as an “instance”, “case”
or “record”. (Note that the word “example” is used here in a technical sense.)
· Features
A “feature” is a recorded property or characteristic of examples. It is also
referred to as an “attribute” or “variable”.
Examples of “examples” and “features”
1. Cancer detection
Consider the problem of developing an
algorithm for detecting cancer. In this study we note
the following.
(a) The units of observation are the patients.
(b) The examples are members of a sample of
cancer patients.
(c) The following attributes of the patients
may be chosen as the features:
• gender
• age
• blood pressure
• the findings of the pathology report after a
biopsy
2. Pet selection
Suppose we want to predict the type of pet a
person will choose.
(a) The units of observation are the persons.
(b) The examples are members of a sample of
persons who own pets.
(c) The features might include age,
home region, family income, etc. of persons who own
pets.

3. Spam e-mail
Suppose we want to build a learning
algorithm to identify spam e-mail.
(a) The unit of observation could be an e-mail
message.
(b) The examples would be specific messages.
(c) The features might consist of the words
used in the messages.
Examples and features are generally collected
in a “matrix format”, with one row per example and one column per feature.
Fig. 1 shows such a dataset.
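To make the matrix format concrete, here is a minimal sketch in Python based on the spam e-mail example. The three messages are invented for illustration, and counting word occurrences is only one assumed way of turning words into features. Each row is an example (a message) and each column is a feature (a word).

```python
from collections import Counter

# Hypothetical messages; each one is an "example" (one row of the matrix).
messages = [
    "win a free prize now",
    "meeting moved to friday",
    "free free offer win now",
]

# The features (columns) are the distinct words seen across all messages.
vocabulary = sorted({word for text in messages for word in text.split()})

# Each row records how often every vocabulary word occurs in one message.
matrix = []
for text in messages:
    counts = Counter(text.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length as the vocabulary, which is what lets the examples be stacked into a single table.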
3. Different forms of data
1. Numeric data
If a feature represents a characteristic
measured in numbers, it is called a numeric feature.
2. Categorical or nominal data
A categorical feature is an attribute that can
take on one of a limited, and usually fixed, number of possible values on the
basis of some qualitative property. A categorical feature is also called a
nominal feature.
3. Ordinal data
An ordinal feature is a categorical variable with
categories falling in an ordered list. Examples include clothing sizes such as
small, medium, and large, or a measurement of customer satisfaction on a scale
from “not at all happy” to “very happy.”
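As a rough illustration of what the ordering adds, the Python sketch below assigns each clothing size a rank and uses it for sorting. The names SIZE_ORDER and size_rank and the observed values are assumptions made for this sketch.

```python
# Ordered categories for a clothing-size feature, smallest first.
SIZE_ORDER = ["small", "medium", "large"]

# Map each category to its rank so that comparisons respect the order.
size_rank = {size: rank for rank, size in enumerate(SIZE_ORDER)}

# Hypothetical recorded values of the feature for four examples.
observed = ["large", "small", "medium", "small"]

# Sorting by rank follows the category order rather than alphabetical order.
ordered = sorted(observed, key=size_rank.get)
print(ordered)

# Ranks encode order only; the numeric gap between ranks has no meaning.
print(size_rank["medium"] < size_rank["large"])
```

A purely nominal feature such as color has no comparable ranking, so a rank table like this would be arbitrary for it.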
Examples
In the data given in Fig. 1, the features
“year”, “price” and “mileage” are numeric and the features “model”, “color” and
“transmission” are categorical.
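The same split can be sketched in Python. The column names below follow the features named in the text, but the rows are invented for illustration and are not the values from Fig. 1. The helper feature_kind is a hypothetical name, and its rule (a feature is numeric if every recorded value is a number) is only a heuristic: codes such as postal codes are stored as numbers but behave as categories.

```python
# Hypothetical rows: the values are invented and are not taken from Fig. 1.
cars = [
    {"year": 2012, "model": "A", "price": 15000, "mileage": 30000,
     "color": "red", "transmission": "manual"},
    {"year": 2009, "model": "B", "price": 9500, "mileage": 82000,
     "color": "blue", "transmission": "automatic"},
    {"year": 2015, "model": "A", "price": 18250, "mileage": 12500,
     "color": "white", "transmission": "automatic"},
]


def feature_kind(values):
    """Return 'numeric' if every value is a number, else 'categorical'."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numeric" if is_number else "categorical"


# Classify each feature (column) from the values recorded in all examples.
kinds = {name: feature_kind([row[name] for row in cars]) for name in cars[0]}

for name, kind in kinds.items():
    print(f"{name}: {kind}")
```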