Example C.2. Consider the random experiment described in Example C.1. The average number of tails expected to show up when a fair coin is tossed four times is

$$\mu_X = 0 \times \frac{1}{16} + 1 \times \frac{4}{16} + 2 \times \frac{6}{16} + 3 \times \frac{4}{16} + 4 \times \frac{1}{16} = 2. \quad (C.10)$$

The variance for the number of tails expected to show up is

$$\sigma_X^2 = (0-2)^2 \times \frac{1}{16} + (1-2)^2 \times \frac{4}{16} + (2-2)^2 \times \frac{6}{16} + (3-2)^2 \times \frac{4}{16} + (4-2)^2 \times \frac{1}{16} = 1.$$
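The arithmetic in Example C.2 can be checked with a short sketch. The number of tails in four fair tosses follows a Binomial(4, 1/2) distribution, whose probabilities 1/16, 4/16, 6/16, 4/16, 1/16 match the counts used above:

```python
from math import comb

# Number of tails in four fair coin tosses is Binomial(4, 0.5); the
# probabilities below are the 1, 4, 6, 4, 1 counts out of 16.
N, p = 4, 0.5
pmf = {x: comb(N, x) * p**x * (1 - p)**(N - x) for x in range(N + 1)}

mu = sum(x * q for x, q in pmf.items())               # mean, Equation C.10
var = sum((x - mu) ** 2 * q for x, q in pmf.items())  # variance

print(mu)   # 2.0
print(var)  # 1.0
```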

For pairs of random variables, a useful expected value to compute is the

covariance function, Cov, which is defined as follows:

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] \quad (C.11)$$

Note that the variance of a random variable $X$ is equivalent to $\mathrm{Cov}(X, X)$. The

expected value of a function also has the following properties:

1. $E[a] = a$, if $a$ is a constant.

2. $E[aX] = aE[X]$.

3. $E[aX + bY] = aE[X] + bE[Y]$.

Based on these properties, Equations C.9 and C.11 can be rewritten as follows:

$$\sigma_X^2 = E[(X - \mu_X)^2] = E[X^2] - E[X]^2 \quad (C.12)$$

$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] \quad (C.13)$$
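Equations C.12 and C.13 can be verified empirically. The sketch below uses simulated data with invented parameters (a Gaussian $X$ and a linearly dependent $Y$): the shortcut forms $E[X^2] - E[X]^2$ and $E[XY] - E[X]E[Y]$ should land near the true variance and covariance.

```python
import random

random.seed(0)
# Empirical check of Equations C.12 and C.13 on simulated data.
n = 100_000
xs = [random.gauss(1.0, 2.0) for _ in range(n)]          # Var(X) = 4
ys = [2.0 * x + random.gauss(0.0, 1.0) for x in xs]      # Cov(X, Y) = 2 * 4 = 8

def mean(v):
    return sum(v) / len(v)

var_x = mean([x * x for x in xs]) - mean(xs) ** 2                       # C.12
cov_xy = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)    # C.13

# var_x should be close to 4, cov_xy close to 8.
```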

C.2 Statistics

To draw conclusions about a population, it is generally not feasible to gather

data from the entire population. Instead, we must make reasonable conclusions

about the population based on evidence gathered from sampled data. The

process of drawing reliable conclusions about the population based on sampled

data is known as statistical inference.

724 Appendix C Probability and Statistics

C.2.1 Point Estimation

The term statistic refers to a numeric quantity derived from sampled data. Two examples of useful statistics include the sample mean ($\bar{x}$) and the sample variance ($s_X^2$):

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} X_i \quad (C.14)$$

$$s_X^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (X_i - \bar{x})^2 \quad (C.15)$$

The process of estimating the parameters of a population using sample statistics is known as point estimation.

Example C.3. Let $X_1, X_2, \ldots, X_N$ be a random sample of $N$ independent and identically distributed observations drawn from a population with mean $\mu_X$ and variance $\sigma_X^2$. Let $\bar{x}$ be the sample mean. Then

$$E[\bar{x}] = E\left[\frac{1}{N}\sum_{i=1}^{N} X_i\right] = \frac{1}{N}\sum_{i=1}^{N} E[X_i] = \frac{1}{N} \cdot N\mu_X = \mu_X, \quad (C.16)$$

where $E[X_i] = \mu_X$ since all the observations come from the same distribution with mean $\mu_X$. This result suggests that the sample mean $\bar{x}$ approaches the population mean $\mu_X$, especially when $N$ is sufficiently large. In statistical terms, the sample mean is called an unbiased estimator of the population mean. It is possible to show that the variance of the sample mean is

$$\sigma_{\bar{x}}^2 = E\left[(\bar{x} - \mu_X)^2\right] = \frac{\sigma_X^2}{N}. \quad (C.17)$$

Because the population variance is usually unknown, the variance of the sample mean is often approximated by replacing $\sigma_X^2$ with the sample variance $s_X^2$. The quantity $s_X/\sqrt{N}$ is known as the standard error of the mean.
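A minimal sketch of these point estimates, with an invented population (mean 5, standard deviation 3, so the true standard error is $3/\sqrt{400} = 0.15$):

```python
import random
import statistics

random.seed(1)
# Draw N observations from a population with mean 5 and variance 9, then
# compute the point estimates of Equations C.14 and C.15 and the
# standard error s_X / sqrt(N).
N = 400
sample = [random.gauss(5.0, 3.0) for _ in range(N)]

x_bar = sum(sample) / N               # sample mean (C.14)
s2 = statistics.variance(sample)      # sample variance (C.15); divides by N - 1
std_err = (s2 ** 0.5) / N ** 0.5      # standard error of the mean

# x_bar should be within a few standard errors of 5; std_err near 0.15.
```

Note that `statistics.variance` uses the $N - 1$ denominator of Equation C.15, while `statistics.pvariance` would divide by $N$.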

C.2.2 Central Limit Theorem

The normal distribution is perhaps one of the most widely used probability distributions because there are many random phenomena that can be modeled using this distribution. This is a consequence of a statistical principle known as the central limit theorem.


Theorem C.1 (Central Limit Theorem). Consider a random sample of size $N$ drawn from a probability distribution with mean $\mu_X$ and variance $\sigma_X^2$. If $\bar{x}$ is the sample mean, then the distribution of $\bar{x}$ approaches a normal distribution with mean $\mu_X$ and variance $\sigma_X^2/N$ when the sample size is large.

The central limit theorem holds regardless of the distribution from which the random variable is drawn. For example, suppose we randomly draw $N$ independent examples from a data set with an unknown distribution. Let $X_i$ be a random variable that denotes whether the $i$th example is predicted correctly by a given classifier; i.e., $X_i = 1$ if the example is classified correctly, and 0 otherwise. The sample mean, $\bar{X}$, denotes the expected accuracy of the classifier. The central limit theorem suggests that the expected accuracy (i.e., the sample mean) tends to be normally distributed even though the distribution from which the examples are drawn may not be normally distributed.
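The classifier-accuracy example can be simulated directly. The sketch below (with an invented true accuracy of 0.8) repeats the sample-mean computation many times: each $X_i$ is a very non-normal Bernoulli variable, yet the distribution of the sample mean concentrates around $p$ with variance close to $p(1-p)/N$, as the theorem predicts.

```python
import random
import statistics

random.seed(7)
# X_i = 1 if the i-th example is classified correctly, simulated as
# Bernoulli with p = 0.8.  The sample mean (accuracy) over N examples
# should be approximately normal with mean p and variance p(1 - p)/N.
p, N, trials = 0.8, 100, 20_000
accuracies = [
    sum(1 if random.random() < p else 0 for _ in range(N)) / N
    for _ in range(trials)
]

mean_acc = statistics.mean(accuracies)
var_acc = statistics.pvariance(accuracies)
# mean_acc should be close to 0.8, var_acc close to 0.8 * 0.2 / 100 = 0.0016.
```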

C.2.3 Interval Estimation

When estimating the parameters of a population, it is useful to indicate the reliability of the estimate. For example, suppose we are interested in estimating the population mean $\mu_X$ from a set of randomly drawn observations. Using a point estimate such as the sample mean, $\bar{x}$, may not be sufficient, especially when the sample size is small. Instead, it may be useful to provide an interval that contains the population mean with high probability. The task of estimating an interval in which the population parameter can be found is termed interval estimation. Let $\theta$ be the population parameter to be estimated. If

$$P(\theta_1 < \theta < \theta_2) = 1 - \alpha, \quad (C.18)$$

then $(\theta_1, \theta_2)$ is the confidence interval for $\theta$ at a confidence level of $1 - \alpha$. Figure C.1 shows the 95% confidence interval for a parameter derived from a normal distribution with mean 0 and variance 1. The shaded region under the normal distribution has an area equal to 0.95. In other words, if we generate a sample from this distribution, there is a 95% chance that the estimated parameter falls between $-2$ and $+2$.
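The areas under the standard normal curve quoted here can be checked with the standard library's `statistics.NormalDist`:

```python
from statistics import NormalDist

# Area under the standard normal density between symmetric limits.
Z = NormalDist(mu=0, sigma=1)
print(round(Z.cdf(2) - Z.cdf(-2), 4))        # about 0.9545 ("roughly 95%")
print(round(Z.cdf(1.96) - Z.cdf(-1.96), 4))  # about 0.95 (exact 95% limits)
print(round(Z.cdf(1) - Z.cdf(-1), 4))        # about 0.6827 (the 68% interval)
```

So the $\pm 2$ limits in Figure C.1 are a slight rounding of the exact 95% limits $\pm 1.96$.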

Consider a sequence of randomly drawn observations, $X_1, X_2, \ldots, X_N$. We would like to estimate the population mean, $\mu_X$, based upon the sample mean, $\bar{x}$, at a 68% confidence level. According to the central limit theorem, $\bar{x}$ approaches a normal distribution with mean $\mu_X$ and variance $\sigma_X^2/N$ when $N$ is sufficiently large. Such a distribution can be transformed into a standard normal distribution (i.e., a normal distribution with mean 0 and variance 1)


Figure C.1. Confidence interval of a parameter.

in the following way.

$$Z = \frac{\bar{x} - \mu_X}{\sigma_X/\sqrt{N}} \approx \frac{\bar{x} - \mu_X}{s_X/\sqrt{N}} \sim N(0, 1), \quad (C.19)$$

where the population standard deviation is approximated by the standard error of the sample mean. From the probability table of a standard normal distribution, $P(-1 < Z < 1) = 0.68$. The probability can be rewritten in the following way:

$$P(-s_X/\sqrt{N} < \bar{x} - \mu_X < s_X/\sqrt{N}) = 0.68,$$

or equivalently,

$$P(\bar{x} - s_X/\sqrt{N} < \mu_X < \bar{x} + s_X/\sqrt{N}) = 0.68.$$

Therefore, the 68% confidence interval for $\mu_X$ is $\bar{x} \pm s_X/\sqrt{N}$.
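Putting the pieces together, the 68% interval can be computed from a sample. The sketch below uses an invented population with true mean 10 and standard deviation 4, so the half-width $s_X/\sqrt{N}$ should come out near $4/\sqrt{250} \approx 0.25$:

```python
import random
import statistics

random.seed(3)
# Construct the 68% confidence interval x_bar +/- s_X / sqrt(N) for the
# population mean (the true mean here is 10).
N = 250
sample = [random.gauss(10.0, 4.0) for _ in range(N)]

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)          # sample standard deviation (N - 1 denominator)
half_width = s / N ** 0.5             # standard error of the mean
lo, hi = x_bar - half_width, x_bar + half_width
print((round(lo, 2), round(hi, 2)))   # interval of width 2 * s / sqrt(N)
```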

C.3 Hypothesis Testing

Hypothesis testing is a statistical inference procedure to determine whether a conjecture or hypothesis should be accepted or rejected based on the evidence gathered from data. Examples of hypothesis tests include verifying the quality of patterns extracted by data mining algorithms and validating the significance of the performance difference between two classification models.


In hypothesis testing, we are usually presented with two contrasting hypotheses, which are known, respectively, as the null hypothesis and the alternative hypothesis. The general procedure for hypothesis testing consists of the following four steps:

1. Formulate the null and alternative hypotheses to be tested.

2. Define a test statistic $\theta$ that determines whether the null hypothesis should be accepted or rejected. The probability distribution associated with the test statistic should be known.

3. Compute the value of $\theta$ from the observed data. Use the knowledge of the probability distribution to determine a quantity known as the p-value.

4. Define a significance level, $\alpha$, which controls the range of $\theta$ values in which the null hypothesis should be rejected. The range of values for $\theta$ is known as the rejection region.

Consider an association pattern $X$ derived using the algorithms presented in Chapter 6. Suppose we are interested in evaluating the quality of the pattern from a statistical perspective. The criterion for judging whether the pattern is interesting depends on a quantity known as the support of the pattern (see Equation 6.2), $s(X)$. Support measures the fraction of records in which the pattern is actually observed. $X$ is considered interesting if $s(X) > minsup$, where $minsup$ is a user-specified minimum threshold.

The problem can be formulated into the hypothesis testing framework in the following way. To validate the pattern $X$, we need to decide whether to accept the null hypothesis, $H_0: s(X) = minsup$, or the alternative hypothesis, $H_1: s(X) > minsup$. If the null hypothesis is rejected, then $X$ is considered an interesting pattern. To perform the test, the probability distribution for $s(X)$ must also be known. We can apply the binomial distribution to model this problem because determining the number of times pattern $X$ appears in $N$ records is analogous to determining the number of heads that show up when tossing $N$ coins. The former can be described by a binomial distribution with mean $s(X)$ and variance $s(X) \times (1 - s(X))/N$. The binomial distribution can be further approximated using a normal distribution if $N$ is sufficiently large, which is typically the case in most market basket analysis problems.

Under the null hypothesis, $s(X)$ is assumed to be normally distributed with mean $minsup$ and variance $minsup \times (1 - minsup)/N$. To test whether the null hypothesis should be accepted or rejected, the following $Z$-statistic can be used:

$$Z = \frac{s(X) - minsup}{\sqrt{minsup \times (1 - minsup)/N}}$$
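A sketch of this one-sample proportion test, assuming the standard $Z$-statistic $Z = (s(X) - minsup)/\sqrt{minsup(1 - minsup)/N}$ implied by the null distribution above. The support, threshold, and record count plugged in below are hypothetical values chosen purely for illustration:

```python
from statistics import NormalDist

# One-sided test of H0: s(X) = minsup against H1: s(X) > minsup, using
# the normal approximation to the binomial described above.
def support_z_test(support, minsup, n_records):
    """Return the Z-statistic and one-sided p-value for an observed support."""
    se = (minsup * (1 - minsup) / n_records) ** 0.5   # null standard deviation
    z = (support - minsup) / se
    p_value = 1 - NormalDist().cdf(z)                 # P(Z > z) under H0
    return z, p_value

# Hypothetical numbers: observed support 0.11, minsup 0.10, 10,000 records.
z, p = support_z_test(support=0.11, minsup=0.10, n_records=10_000)
# z = 0.01 / sqrt(0.1 * 0.9 / 10000) = 3.33..., so p is very small and
# H0 would be rejected at common significance levels.
```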