Statistics/Preliminaries
Template:Nav This chapter discusses some preliminary knowledge (related to statistics) for the following chapters in the advanced part.
Empirical distribution
Template:Colored definition Template:Colored remark Since all these random variables follow the same cdf, we may expect their distribution to be somewhat similar to the underlying distribution, and indeed this is true. Before showing how this is true, we need to define "the distribution of these random variables" more precisely, as follows: Template:Colored definition Template:Colored remark Template:Colored example Template:Colored remark Template:Colored theorem Template:Colored remark We have mentioned how we can approximate the cdf, and now we would like to estimate the pmf/pdf. Let us first discuss how to estimate the pmf.
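To make the idea concrete, here is a minimal sketch in Python (the sample values are illustrative, not from the text): the empirical cdf evaluated at a point x is simply the proportion of sample values not exceeding x.

```python
import numpy as np

def empirical_cdf(sample, x):
    """Empirical cdf: fraction of sample values that are <= x,
    i.e. (1/n) * #{i : X_i <= x}."""
    sample = np.asarray(sample)
    return np.mean(sample <= x)

sample = [2.1, 0.5, 3.3, 1.7, 2.1]
print(empirical_cdf(sample, 2.0))   # 2 of the 5 values are <= 2.0, so 0.4
print(empirical_cdf(sample, 10.0))  # all values are <= 10.0, so 1.0
```

As expected for a cdf, the function is nondecreasing in x, jumps by 1/n at each observed value, and equals 1 beyond the sample maximum.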
For the discrete random variable , from the empirical cdf, we know that each is "assigned" the probability . Also, considering the previous example, the empirical pmf is . Template:Colored remark To discuss the estimation of the pdf of a continuous random variable, we need to define Template:Colored em first. Template:Colored definition For the continuous random variable , construct class intervals for which are a non-overlapping partition of the interval , in which and are the minimum and maximum values in the sample. Then, the pdf when and are close, i.e. when the length of each class interval is small. (Although the union of the above class intervals is and thus the value is not included in the interval, this does not matter since the value of the pdf at does not affect the calculation of probability.) Here, is and is .
Since is the relative frequency of occurrences of the event , we can rewrite the above expression as in which is called the Template:Colored em.
Since there are many possible ways to construct the class intervals, the value of can differ even with the same and . When is large and the length of each class interval is small, we expect to be a good estimate of (the theoretical pdf).
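The construction above can be illustrated numerically. The following sketch (the choice of distribution, sample size, and number of class intervals is arbitrary) builds equal-length class intervals over [min, max] and computes the relative frequency in each interval divided by the interval length:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=10_000)  # theoretical pdf: e^(-x) on x >= 0

# Partition [min, max] into k equal-length class intervals.
k = 30
edges = np.linspace(sample.min(), sample.max(), k + 1)
counts, _ = np.histogram(sample, bins=edges)

# Relative frequency histogram: relative frequency / interval length.
h = counts / sample.size / np.diff(edges)

# Like a genuine pdf, h is nonnegative and integrates to 1 over [min, max].
print(np.sum(h * np.diff(edges)))
```

Increasing the sample size while shrinking the interval length makes the histogram track the theoretical pdf more closely, matching the informal statement in the text.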
There are some properties related to the relative frequency histogram, as follows: Template:Colored proposition
Proof.
(i) Since the indicator function is nonnegative (its value is either 0 or 1), is positive, and so is positive, we have by definition.
(ii) Here, is and is .
(iii) We can "split" the integral in a similar way as in (ii), and then eventually the integral equals , and it can approximate since it is the relative frequency of occurrences of the event .
Expectation
In this section, we will discuss some results about expectation, which involve some sort of inequalities. Let and be constants. Also, let be the sample space of .
Proof. Assume .
Case 1: is discrete.
By definition of expectation, . Then, we have because of the condition .
Case 2: is continuous.
We have similarly because of the condition .
Template:Colored remark Template:Colored proposition
Proof. as desired.
Proof. First, observe that is a nonnegative random variable. Then, by Markov's inequality, for each (positive) , we have , since is positive.
Proof. Let be the tangent of the function at . Then, since is convex, we have for each (informally, we can observe this graphically). As a result, we have as desired.
Proof. Since , we must have .
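The inequalities in this section can also be checked numerically. Since Markov's and Chebyshev's inequalities hold for every distribution, they hold in particular for the empirical distribution of a simulated sample, so the empirical tail probabilities below must respect the bounds exactly (the distribution and constants are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)  # nonnegative sample

# Markov's inequality: P(X >= a) <= E[X] / a for nonnegative X and a > 0.
a = 3.0
p_tail = np.mean(x >= a)
markov_bound = x.mean() / a

# Chebyshev's inequality: P(|X - mu| >= eps) <= Var(X) / eps^2.
eps = 2.0
p_dev = np.mean(np.abs(x - x.mean()) >= eps)
cheb_bound = x.var() / eps**2  # ddof=0: variance of the empirical distribution

print(p_tail, "<=", markov_bound)
print(p_dev, "<=", cheb_bound)
```

The bounds are typically far from tight, which is expected: both inequalities trade sharpness for complete generality.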
Convergence
Before discussing convergence, we will define some terms that will be used later. Template:Colored definition Template:Colored remark In a Template:Colored em, say , we observe Template:Colored em values of their sample mean, , and sample variance, . Template:Colored em, each of these values is only one realization of the respective random variables and . We should notice the difference between these definite values (not random variables) and the statistics (random variables).
To explain the definitions of the sample mean and sample variance more intuitively, consider the following.
Recall that the empirical cdf assigns probability to each of the random sample . Thus, by the definition of mean and variance, the Template:Colored em of a random variable, say , with this cdf (and hence with the corresponding pmf ) is . Similarly, the Template:Colored em of is . In other words, the mean and variance of the empirical distribution, which corresponds to the random sample, are the sample mean and the sample variance respectively, which is quite natural, right? Template:Colored remark Also, recall that the empirical cdf can well approximate the cdf of when is large. Since and are the mean and variance of a random variable with cdf , it is natural to expect that and can well approximate the mean and variance of .
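The claim above can be verified directly: computing the mean and variance of a random variable that puts probability 1/n on each observed value reproduces the sample mean and the sample variance. (Note the convention assumed here: the variance of the empirical distribution uses divisor n; texts that define the sample variance with divisor n−1 get the empirical variance times n/(n−1) instead.)

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
n = x.size

# The empirical distribution puts probability 1/n on each observation.
p = np.full(n, 1.0 / n)
emp_mean = np.sum(p * x)                   # mean of the empirical distribution
emp_var = np.sum(p * (x - emp_mean) ** 2)  # variance of the empirical distribution

print(emp_mean, x.mean())  # both equal the sample mean
print(emp_var, x.var())    # x.var() uses divisor n, matching the empirical variance
```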
Convergence in probability
Template:Colored definition Template:Colored remark The following theorem, namely the weak law of large numbers, is an important theorem related to convergence in probability. Template:Colored theorem
Proof. We use to denote .
By definition, as is equivalent to as .
By Chebyshev's inequality, we have
Since are Template:Colored em (and hence functions of them are also independent) and the expectation is multiplicative under independence, the probability is bounded above by an expression that tends to 0 as . Since the probability is nonnegative (), it follows that the probability also tends to 0 as .
Template:Colored remark There are also some properties of convergence in probability that help us determine what a complex expression converges to. Template:Colored proposition
Proof. Template:Colored em: Assume and . The continuous mapping theorem is proved first so that we can use it in the proofs of the other properties (the proof is omitted here). Also, it can be shown that (joint convergence in probability; the definition is similar, except that the random variables become ordered pairs, so "" is interpreted as the distance between the two points in the Cartesian coordinate system represented by the ordered pairs)
After that, we define , , and respectively, where each of these functions is continuous, and are constants. Then, applying the continuous mapping theorem with each of these functions gives us the first three results.
Convergence in distribution
Template:Colored definition Template:Colored remark A very important theorem in statistics related to convergence in distribution is the central limit theorem. Template:Colored theorem
Proof. A (lengthy) proof can be found in Probability/Transformation of Random Variables#Central limit theorem.
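A quick simulation conveys the content of the central limit theorem: standardized sample means of a clearly non-normal distribution behave approximately like a standard normal random variable once the sample size is large. The choices below (Exponential(1), sample size, replication count) are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 2000

# Exponential(1) has mean 1 and variance 1, and is clearly non-normal (skewed).
x = rng.exponential(scale=1.0, size=(reps, n))

# Standardize the sample means: sqrt(n) * (Xbar - mu) / sigma.
z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0

# By the CLT, z is approximately standard normal, so P(Z <= 0) should be near 0.5.
print(np.mean(z <= 0))
```

Comparing further empirical probabilities of z against standard normal values (e.g. near 0.84 for z <= 1) shows the same agreement.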
There are some properties of convergence in distribution, but they are a bit different from the properties of convergence in probability. These properties are given by Template:Colored em, and also by the continuous mapping theorem. Template:Colored theorem
Proof. Omitted.
Proof. Template:Colored em: Assume and . Then, it can be shown that (joint convergence in distribution; the definition is similar, except that the cdf's become joint cdf's of ordered pairs). After that, we define , , and respectively, where each of the functions is continuous, and then applying the continuous mapping theorem with each of these functions gives us the three desired results.
Resampling
By Template:Colored em, we mean creating new samples based on an existing sample. Now, let us consider the following for a general overview of the procedure of resampling.
Suppose is a Template:Colored em from a distribution of a random variable with cdf . Let be a corresponding Template:Colored em of the random sample . Based on this realization, we also have a realization of the empirical cdf: [1]. Since this is a realization of the empirical cdf, by the Glivenko-Cantelli theorem, it is a good estimate of the cdf when is large [2]. In other words, if we denote by the random variable whose cdf is that realization of the empirical cdf, then and have similar distributions when is large.
Notice that a realization of the empirical cdf is a Template:Colored em cdf (since the support is countable). We now draw a Template:Colored em (called the bootstrap (or resampling) random sample) with sample size (called the Template:Colored em) from the distribution of the random variable ( comes from Template:Colored em from , so the behaviour of sampling from is called resampling).
Then, the relative frequency histogram of should be close to that of the corresponding realization of the empirical pmf of (found from the realization of the empirical cdf of ), which is close to the pdf of . This means the relative frequency histogram of is close to the pdf of .
In particular, since the cdf of , , assigns probability to each of [3], the pmf of is . Notice that this pmf is quite simple, and therefore it makes related calculations simpler. For example, in the following, we want to know the distribution of , and this simple pmf makes the resulting distribution also quite simple.
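Because the bootstrap random variable puts probability 1/n on each observed value, for a tiny sample we can even enumerate all ordered bootstrap samples and obtain the exact distribution of a bootstrapped statistic. The sketch below (sample values chosen arbitrarily) does this for the bootstrapped sample mean:

```python
import itertools

x = [1.0, 2.0, 4.0]   # a tiny realized sample, n = 3
n = len(x)

# The bootstrap variable puts probability 1/n on each observed value, so an
# ordered bootstrap sample of size n is one of n**n equally likely draws.
# Enumerating all of them gives the exact pmf of the bootstrapped sample mean.
pmf = {}
for draw in itertools.product(x, repeat=n):
    m = round(sum(draw) / n, 10)   # round to merge equal means safely
    pmf[m] = pmf.get(m, 0) + 1 / n**n

print(dict(sorted(pmf.items())))
```

For realistic sample sizes the n^n draws cannot be enumerated, which is why the Monte Carlo procedure described next is used instead.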
Template:Colored remark In the following, we will discuss an application of the bootstrap method (or Template:Colored em) mentioned above, namely using the bootstrap method to Template:Colored em the distribution of a statistic (the inputs of the function are random variables and is a function). The reason for approximating, rather than finding the distribution exactly, is that the latter is usually infeasible (or too complicated).
To do this, consider the "bootstrapped statistic" and the statistic . is the bootstrap random sample (with bootstrap sample size ) from the distribution of and is the random sample from the distribution of . When is large, since the distribution of is similar to that of , the bootstrap random sample and the random sample are also similar. It follows that and are similar as well, or to be more precise, the Template:Colored em of and are close. As a result, we can utilize the distribution of (which is easier to find and simpler, since the pmf of is simple, as above) to approximate the distribution of . A procedure to do this is as follows:
1. Generate a Template:Colored em from the Template:Colored em , which is from the distribution of .
2. Calculate a realization of the bootstrapped statistic , .
3. Repeat steps 1 and 2 times to get a sequence of realizations of : .
4. Plot the relative frequency histogram of the realizations .
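The procedure above can be sketched in a few lines of Python. Here the statistic is taken to be the sample median and the observed sample is simulated, both purely for illustration; drawing with replacement from the observed sample is exactly sampling from the realized empirical cdf:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=10.0, scale=2.0, size=50)   # the one observed sample

# Approximate the distribution of the sample median by the bootstrap.
B, m = 5000, x.size
stats = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=m, replace=True)  # step 1: draw from the empirical cdf
    stats[b] = np.median(resample)                  # step 2: compute the statistic

# Steps 3-4: the histogram of `stats` approximates the distribution of the
# sample median; here we just summarize it by its mean and standard deviation.
print(stats.mean(), stats.std())
```

The spread of the bootstrapped values estimates the sampling variability of the statistic, which is the usual practical payoff of the method.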
This histogram of the realizations (which are a realization of a random sample from with sample size ) is close to the pmf of [4], and thus close to the pmf of . Template:Nav Template:BookCat
- ↑ This is different from the empirical cdf .
- ↑ By the Glivenko-Cantelli theorem, the empirical cdf is a good estimate of the cdf regardless of the actual values (realization) of the random sample, i.e. each realization of the empirical cdf is a good estimate of the cdf when is large.
- ↑ That is, for a realization of the random sample , say , the probability for to equal (which correspond to the realization of respectively) is each.
- ↑ A similar reason is mentioned above: the histogram should be close to the pmf of since the cdf corresponding to the histogram (i.e. the realization of the empirical cdf of the random sample ) is close to the cdf of .