The **exponential distribution** is one of the most
popular and useful distributions in statistical analysis. It is used to
model the number of events^{2} occurring over a period of time. Let \(X\) be the number of events over the
observation period \(t\): \(X \sim {\sf Exp}(\lambda t)\) with \(\lambda\) representing the expected number
of events per time unit (minute, hour, etc.).\(\lambda\) is often referred to as the
*rate parameter*. The probability density function (PDF) is given
by \(f(x;\lambda) = \lambda e^{-\lambda
x}\) for \(x \geq 0\) and \(f(x;\lambda) = 0\) for \(x <0\). By integrating \(f(x,\lambda)\) one receives \(F(x;\lambda) = 1-e^{-\lambda x}\), i. e.
the cumulative distribution function (CDF). Interestingly, the mean and
the standard deviation of X are both \(\frac{1}{\lambda}\), with the variance
being \(Var(X) =
\frac{1}{\lambda^2}\).

The goal of this study lies in observing the behavior of the exponential distribution, especially regarding its mean. Using R, it is possible to compare theoretical findings to a simulation. Let the rate parameter be \(\lambda = 0.2\) and the sample size \(n\) be equal to 20. It is known, that the expected value of the distribution is \(E[X] = \frac{1}{\lambda}\), one would therefore expect a sample mean of \(\bar{X} = \frac{1}{0.2} = 5\).

```
set.seed(101)
n <- 40
lambda <- 0.2
mean(rexp(n,rate = lambda))
```

`## [1] 4.034012`

We see that the sample mean is close to its theoretical counterpart.
The **law of large numbers** would predict, that the sample
mean limits to the theoretical mean as the sample size grows.

```
n <- 400
mean(rexp(n,rate = lambda))
```

`## [1] 4.818306`

And indeed, the distance between the theoretical and sample mean did
shrink markedly by increasing the sample size by a factor of 10. This
finding, as intuitive as it might be, poses a theoretical question about
the distribution of sample means. Using simple algebra to receive the
sample (arithmetic^{3}-) means expected value and variance, one
arrives at \(E[\bar{X}] = \mu\) and
\(E[Var(\bar{X})] =
\frac{\sigma^2}{n}\). It is essential to realize that statistics,
such as the sample mean, are calculated based on realizations drawn
randomly from a population, are themselves *random variables*.
The first moment, \(E[\bar{X}]\) of the
distribution of the sample means implies a centering around the expected
value of the underlying distribution. This can be tested by simulating
1.000 samples of size 40 and 80 from the exponential function
characterized above and taking the mean of each. The resulting
distributions are depicted as a histogram (A) and as a density (B).

```
mns_40 = NULL
mns_80 = NULL
for (i in 1 : 1000) mns_40 = c(mns_40, mean(rexp(40, rate = lambda)))
for (i in 1 : 1000) mns_80 = c(mns_80, mean(rexp(80, rate = lambda)))
mns <- data.frame(mean = c(mns_40,mns_80),size = rep(c("n = 40", "n = 80"),each =1000))
library(ggplot2)
library(ggpubr)
g = ggplot(mns, aes(x=mean, fill=size))
g = g + geom_histogram(alpha=0.4, position="identity", bins = 100)
d = ggplot(mns, aes(x=mean, fill=size))
d = d + geom_density(alpha = 0.4, aes(color = size),linetype = 2)
ggarrange(g,d, nrow = 2, align = "hv",common.legend = TRUE,
legend = "bottom", labels = c("A","B"))
```

By visually examining the mean distributions, it is possible to make 3 distinct observations:

- both distributions are centered around the theoretical mean,
- increasing the sample size leads to a less flat distribution, i. e. lower variance,
- the distributions of \(\bar{X}\) appear quite Gaussian since they take a bell curve shape.

Findings 1-3 are in line with the **central limit
theorem** (CLT) which would predict such a behavior of sample
means. This enables the statistician to apply techniques for normal
distributions to sample means and other select statistics and make
useful inferences about the underlying data-creating processes.

```
sim_variance <- c(var(mns_40), var(mns_80))
sim_mean <- c(mean(mns_40), mean(mns_80))
theo_variance <- c((1/0.04)/40, (1/0.04)/80)
theo_mean <- c(1/lambda, 1/lambda)
df <- data.frame(size = c("40","80"),sim_mean, theo_mean, sim_variance, theo_variance)
p = ggplot(mns, aes(x =mean, fill=size)) + geom_boxplot(aes(fill = size)) + coord_flip()
t = ggtexttable(df)
ggarrange(p,t, nrow = 2, align = "hv",common.legend = TRUE,
legend = "right", heights = c(2,1), widths = c(1,2) )
```

A comparison between the realized first moments of the simulated distribution of means to their theoretical counterparts, as suggested by the CLT, is conducted (see above). Indeed, the simulated moments are very close to what the theory would suggest, indicating that the CLT holds true.

To illustrate the application of inferential statistics a brief
analysis of the “ToothGrowth” data
set is performed. The data is based on an experiment^{4} where the effect of
vitamin c on teeth growth in guinea pigs was studied. The data comprises
60 observations which are split into two groups of 30 observations each,
categorized by the delivery method of the vitamin. The first group
received their doses via orange juice (OJ), the second via ascorbic acid
(coded VC). The researchers varied the doses given (0.5, 1, and 2
mg/day), this originally numerical column is converted to a
categorical.

```
TG = ToothGrowth
TG[,3] = as.factor(TG[,3])
colnames(TG) = c("ToothLength", "DeliveryMethod", "Dose")
summary(TG)
```

```
## ToothLength DeliveryMethod Dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
```

`ggplot(TG, aes(x=Dose, y = ToothLength)) + geom_point(aes(color = DeliveryMethod)) `

By visually examining the data in the plot above, it is easy to observe a positive relationship between teeth growth and the dosage of vitamin c. The guinea pigs which received their dosage via orange juice (OJ) also seem to have experienced stronger growth as those which received ascorbic acid (VC). We can test these findings by applying standard t-tests to the mean teeth growth between the two groups of delivery methods and the low and the high dosage level (0.5 mg vs. 2 mg).

```
# Testing for differences between the delivery methods.
OJ = TG$ToothLength[TG$DeliveryMethod == "OJ"]
VC = TG$ToothLength[TG$DeliveryMethod == "VC"]
pvalues_Delivery = c(t.test(OJ,VC, alternative = "two.sided")$p.value,
t.test(OJ,VC, alternative = "greater")$p.value)
# Testing for differences between the low and the high dosage (LD, HD).
LD <- TG$ToothLength[TG$Dose == "0.5"]
HD <- TG$ToothLength[TG$Dose == "2"]
pvalues_Dose = c(t.test(HD,LD, alternative = "two.sided")$p.value,
t.test(HD,LD, alternative = "greater")$p.value)
# Compiling the results
pvals <- data.frame(pvalues_Dose, pvalues_Delivery,
row.names = c("two-sided", "greater"))
colnames(pvals) <- c("H0: High Dosage of Vitamin C has no effect on teeth growth.",
"H0: Delivery Method has no effect on teeth growth.")
pvals
```

It is possible to reject the hypothesis, that the higher dosage of vitamin c is not associated with increased teeth growth, in favor of the alternative in both the one-sided as well as the two-sided test at an confidence level of less than 1%. This strongly indicates, that vitamin c does have an positive effect on teeth growth. The effect of the different delivery method does remain ambiguous as we cannot reject the null-hypothesis of the two-sided test at a statistical significance of 5%. These findings are based on the assumption, that the data is independent and identically distributed (i.i.d) and that the central limit theorem holds true.

Humboldt University of Berlin↩︎

E. g. the failure of an electric or mechanical component, the death of a patient in a hospital or the outbreak of measles in a daycare.↩︎

\(\bar{X} = \frac{1}{n}\sum_{i = 1}^{n}x_i\)↩︎

Crampton, E. W. (1947). The growth of the odontoblast of the incisor teeth as a criterion of vitamin C intake of the guinea pig. The Journal of Nutrition, 33(5), 491–504. doi:10.1093/jn/33.5.491.↩︎