Lab 02 Describing data

1 Objectives

Investigate sampling error; see that larger samples have less sampling error
Visualize confidence intervals
Calculate basic summary statistics using R
Calculate confidence intervals for the mean with R

Data for all labs (unzip to your working directory)

2 Start a script!

For this and every lab or project, begin by:

starting a new script
create a good header section and table of contents
save the script file with an informative name
set your working directory

Aim to make the script useful as a future reference for doing things in R - this will come in handy for projects and assessments!

3 Missing data

Sometimes we do not have all variables measured on all individuals in the data set. When this happens, we need a space holder in our data files so that R knows that the data is missing. The standard way of doing this in R is to put NA in the location that the data would have gone. NA is short for “not available”.

For example, in the Titanic data set, we do not know the age of several passengers. Let’s look at it. Load the Titanic data set:

# Load data
# NB your file path may be different than mine
titanicData <- read.csv("data/titanic.csv" )

Have R print out the vector of the age variable, which you can do easily by just typing its name:

# print the variable age
titanicData$age

   [1] 29.0000  2.0000 30.0000 25.0000  0.9167 47.0000 63.0000 39.0000 58.0000
  [10] 71.0000 47.0000 19.0000      NA      NA      NA 50.0000 24.0000 36.0000
  [19] 37.0000 47.0000 26.0000 25.0000 25.0000 19.0000 28.0000 45.0000 39.0000
  [28] 30.0000 58.0000      NA 45.0000 22.0000      NA 41.0000 48.0000      NA
  [37] 44.0000 59.0000 60.0000 45.0000      NA 53.0000 58.0000 36.0000 33.0000
  [46]      NA      NA 36.0000 36.0000 14.0000 11.0000 49.0000      NA 36.0000
  [55]      NA 46.0000 47.0000 27.0000 31.0000      NA      NA      NA      NA
  [64] 27.0000 26.0000      NA      NA 64.0000 37.0000 39.0000 55.0000      NA
  [73] 70.0000 69.0000 36.0000 39.0000 38.0000      NA 27.0000 31.0000 27.0000
  [82]      NA 31.0000 17.0000      NA      NA  4.0000 27.0000 50.0000 48.0000
  [91] 49.0000 48.0000 39.0000 23.0000 53.0000 36.0000      NA      NA 30.0000
 [100] 24.0000 19.0000 28.0000 23.0000 64.0000 60.0000      NA 49.0000      NA
 [109] 44.0000 22.0000 60.0000 48.0000 37.0000 35.0000 47.0000 22.0000 45.0000
 [118] 49.0000      NA 71.0000 54.0000 38.0000 19.0000 58.0000 45.0000 23.0000
 [127] 46.0000 25.0000 21.0000 48.0000 49.0000 45.0000 36.0000      NA 55.0000
 [136] 52.0000 24.0000      NA      NA      NA 16.0000 44.0000 51.0000 42.0000
 [145] 35.0000 35.0000 38.0000 35.0000      NA 50.0000 49.0000 46.0000      NA
 [154] 58.0000 41.0000      NA 42.0000 40.0000      NA      NA      NA 42.0000
 [163] 55.0000 50.0000 16.0000      NA 29.0000 21.0000 30.0000 15.0000 30.0000
 [172]      NA      NA      NA 46.0000 54.0000 36.0000 28.0000      NA 65.0000
 [181] 33.0000 44.0000 37.0000      NA 55.0000 47.0000 36.0000 58.0000 31.0000
 [190] 23.0000 19.0000 64.0000      NA 64.0000 22.0000 28.0000      NA      NA
 [199] 22.0000      NA      NA 18.0000 17.0000 52.0000 46.0000 56.0000      NA
 [208]      NA 43.0000 31.0000      NA      NA 33.0000      NA 27.0000 55.0000
 [217] 54.0000      NA 61.0000 48.0000 18.0000 13.0000 21.0000      NA      NA
 [226]      NA 34.0000 40.0000 36.0000 50.0000 39.0000 56.0000 28.0000 56.0000
 [235] 56.0000 24.0000 18.0000      NA 24.0000 23.0000 45.0000 40.0000  6.0000
 [244] 57.0000      NA 32.0000 62.0000 54.0000 43.0000 52.0000      NA 62.0000
 [253] 67.0000 63.0000 61.0000 46.0000 52.0000 39.0000 18.0000 48.0000      NA
 [262] 49.0000 39.0000 17.0000 46.0000      NA 31.0000      NA 61.0000 47.0000
 [271] 64.0000 60.0000 60.0000 55.0000 54.0000 21.0000 57.0000 45.0000 31.0000
 [280] 50.0000 50.0000 27.0000 20.0000 51.0000      NA 21.0000      NA      NA
 [289] 36.0000      NA      NA      NA      NA      NA      NA      NA      NA
 [298]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [307] 40.0000      NA      NA 32.0000      NA      NA      NA      NA      NA
 [316]      NA 33.0000      NA      NA      NA      NA      NA 30.0000 28.0000
 [325] 18.0000      NA 34.0000 32.0000 57.0000 18.0000 23.0000 36.0000 28.0000
 [334] 51.0000 32.0000 19.0000 28.0000 36.0000  4.0000  1.0000 12.0000 34.0000
 [343] 19.0000 23.0000 26.0000      NA 27.0000 15.0000 45.0000 40.0000 20.0000
 [352] 25.0000 36.0000 25.0000      NA 42.0000 26.0000 26.0000  0.8333 31.0000
 [361]      NA 19.0000 54.0000 44.0000 52.0000 30.0000 30.0000      NA      NA
 [370] 29.0000      NA 29.0000 27.0000 24.0000 35.0000 31.0000  8.0000 22.0000
 [379] 30.0000      NA 20.0000      NA 21.0000 49.0000  8.0000 28.0000 18.0000
 [388]      NA 28.0000 22.0000 25.0000 18.0000 32.0000 18.0000      NA 42.0000
 [397] 34.0000  8.0000      NA      NA 23.0000 21.0000 19.0000      NA      NA
 [406]      NA 38.0000      NA 38.0000 35.0000 35.0000 38.0000 24.0000 16.0000
 [415] 26.0000 45.0000 24.0000 21.0000 22.0000      NA 34.0000 30.0000 50.0000
 [424] 30.0000 23.0000  1.0000 44.0000 28.0000  6.0000 30.0000      NA 43.0000
 [433] 45.0000  7.0000 24.0000 24.0000 49.0000 48.0000      NA 34.0000 32.0000
 [442] 21.0000 18.0000 53.0000 23.0000 21.0000      NA 52.0000 42.0000 36.0000
 [451] 21.0000 41.0000      NA      NA 33.0000 17.0000      NA      NA      NA
 [460]      NA      NA      NA 23.0000 34.0000      NA 22.0000      NA      NA
 [469] 45.0000      NA      NA 31.0000 30.0000 26.0000      NA 34.0000 26.0000
 [478] 22.0000  1.0000  3.0000      NA      NA      NA 25.0000      NA 48.0000
 [487]      NA 57.0000      NA      NA      NA  2.0000      NA 27.0000 19.0000
 [496] 30.0000 20.0000 45.0000      NA 46.0000 41.0000 13.0000 19.0000 30.0000
 [505] 48.0000 71.0000 54.0000      NA      NA 64.0000 32.0000 18.0000  2.0000
 [514] 32.0000  3.0000 26.0000 19.0000      NA 20.0000 29.0000 39.0000 22.0000
 [523]      NA 24.0000      NA 28.0000      NA 50.0000 20.0000 40.0000 42.0000
 [532] 21.0000 32.0000 34.0000      NA      NA 33.0000  2.0000  8.0000 36.0000
 [541] 34.0000 30.0000 28.0000 23.0000  0.8333 25.0000  3.0000 50.0000      NA
 [550] 21.0000      NA      NA 25.0000 18.0000 20.0000 30.0000 59.0000 30.0000
 [559] 35.0000 22.0000      NA 25.0000 41.0000 25.0000 14.0000 50.0000 22.0000
 [568]      NA 27.0000 29.0000 27.0000 30.0000 22.0000 35.0000 30.0000 28.0000
 [577] 23.0000      NA 12.0000 40.0000 36.0000 28.0000 32.0000 29.0000  4.0000
 [586]  2.0000      NA      NA 36.0000 33.0000      NA      NA      NA 32.0000
 [595]      NA      NA 26.0000      NA 30.0000 24.0000      NA 18.0000 42.0000
 [604] 13.0000 16.0000 35.0000 16.0000 25.0000 18.0000 20.0000 30.0000 26.0000
 [613] 40.0000 24.0000 41.0000 18.0000  0.8333 23.0000 20.0000 25.0000 35.0000
 [622] 17.0000 32.0000 20.0000 39.0000 39.0000  6.0000  2.0000 17.0000 38.0000
 [631]  9.0000 26.0000 11.0000  4.0000 20.0000 26.0000 25.0000 18.0000 24.0000
 [640] 35.0000 40.0000 38.0000  5.0000  9.0000  3.0000 13.0000 23.0000  5.0000
 [649]      NA 45.0000 23.0000 17.0000 27.0000 23.0000 20.0000 32.0000 33.0000
 [658]  3.0000      NA      NA      NA 18.0000 40.0000 26.0000 15.0000 45.0000
 [667] 18.0000 27.0000 22.0000 19.0000 26.0000 22.0000 20.0000 32.0000 21.0000
 [676] 18.0000 26.0000  6.0000      NA      NA  9.0000 40.0000 32.0000      NA
 [685] 26.0000 18.0000 20.0000      NA 29.0000 22.0000 22.0000 35.0000 21.0000
 [694] 20.0000 19.0000 18.0000 18.0000 38.0000      NA 30.0000 17.0000 21.0000
 [703] 21.0000 21.0000      NA      NA 24.0000 33.0000 33.0000 28.0000 16.0000
 [712] 37.0000 28.0000      NA 24.0000 21.0000      NA 32.0000 29.0000 26.0000
 [721] 18.0000 20.0000 19.0000 24.0000 24.0000 36.0000 31.0000 31.0000 30.0000
 [730] 22.0000      NA 43.0000 35.0000 27.0000 19.0000 30.0000 36.0000  3.0000
 [739]  9.0000 59.0000 19.0000 44.0000 17.0000      NA 45.0000 22.0000 19.0000
 [748] 29.0000 30.0000 34.0000 28.0000  0.3333 27.0000 25.0000 24.0000 22.0000
 [757] 21.0000 17.0000      NA      NA 26.0000 33.0000  1.0000  0.1667 25.0000
 [766] 36.0000 36.0000 30.0000      NA 23.0000 26.0000 19.0000 65.0000      NA
 [775] 42.0000 43.0000 32.0000 19.0000 30.0000 24.0000 23.0000      NA 24.0000
 [784] 24.0000 23.0000 22.0000      NA 18.0000 16.0000 45.0000      NA      NA
 [793]      NA 47.0000  5.0000      NA      NA      NA      NA      NA      NA
 [802]      NA      NA      NA      NA      NA 21.0000 18.0000  9.0000 48.0000
 [811] 16.0000      NA      NA 25.0000      NA      NA 22.0000 16.0000      NA
 [820] 33.0000      NA  9.0000 41.0000 38.0000 40.0000 43.0000 14.0000 16.0000
 [829]  9.0000 10.0000  6.0000 11.0000 40.0000 32.0000      NA 20.0000 37.0000
 [838] 28.0000 19.0000      NA      NA      NA      NA      NA      NA      NA
 [847]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [856]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [865]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [874]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [883]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [892]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [901]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [910]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [919]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [928]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [937]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [946]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [955]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [964]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [973]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [982]      NA      NA      NA      NA      NA      NA      NA      NA      NA
 [991]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1000]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1009]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1018]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1027]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1036]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1045]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1054]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1063]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1072]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1081]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1090]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1099]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1108]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1117]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1126]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1135]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1144]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1153]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1162]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1171]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1180]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1189]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1198]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1207]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1216]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1225]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1234]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1243]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1252]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1261]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1270]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1279]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1288]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1297]      NA      NA      NA      NA      NA      NA      NA      NA      NA
[1306]      NA      NA      NA      NA      NA      NA      NA      NA

If you look through the results, you will see that most individuals have numbers in this list, but some have NA. These NA values are the people for which we do not have age information - missing data.

By the way, the titanic.csv file simply has nothing in the places where there is missing data. When R loaded it, it replaced the empty spots with NA automatically.

4 Measures of location

For this lab we will use R to give some basic descriptive statistics for numerical data.

4.1 `mean()`

We have already seen in lab 1 how to calculate the mean of a vector of data using mean(). Unfortunately, if there are missing data we need to tell R how to deal with it.

A (somewhat annoying) quirk of R is that if we try to take the mean of a list of numbers that include missing data, we get an NA for the result!

# mean with missing data - doh!
mean(titanicData$age)

[1] NA

4.2 `na.rm = TRUE`

To get the mean of all the numbers that we do have, we have to add an argument to the mean() function, na.rm = TRUE.

# tell R to exclude NAs
mean(titanicData$age, na.rm = TRUE)

[1] 31.19418

This tells R to remove (“rm”) the NAs (“na”) before calculating the mean. It turns out that the mean age of passengers that we have information for was about 31.2.

na.rm = TRUE can be added to many functions in R, including median(), as we shall see next.

4.3 `median()`

The median of a series of numbers is the “middle” number – half of the numbers in the list are greater the median and half are below it. It can be calculated in R by using median().

# calculate median
median(titanicData$age, na.rm = TRUE)

[1] 30

4.4 `summary()`

A handy function that will return both the mean and median at the same time, along with other information such as the first and third quartiles, is summary().

# automatic summary information
summary(titanicData$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.1667 21.0000 30.0000 31.1942 41.0000 71.0000     680

From left to right, this output gives us

the smallest (minimum) value in the list (“Min.”)
the first quartile (“1st Qu.” - the value where 25% of the values are smaller)
the median
the mean
the third quartile (“3rd Qu.” - 75% of values are smaller)
the largest (maximum) value (“Max.”)
the number of individuals with missing values (”NA’s”)

The first quartile is the value in the data that is larger than a quarter of the data points. The third quartile is larger than 3/4 of the data. These are also called the 25th percentile and the 75th percentile, respectively. (You may remember these from boxplots, where the top and bottom of the box mark the 75th and 25th percentiles, respectively.)

5 Measures of variability

R can also calculate measures of the variability of a sample. In this section we’ll learn how to calculate the variance, standard deviation, coefficient of variation and interquartile range of a set of data.

5.1 `var()`

To calculate the variance of a list of numbers, use var().

# variance is big here!
var(titanicData$age, na.rm = TRUE)

[1] 217.4895

Note that var(), as well as sd() below, have the same need for na.rm = TRUE when analyzing data that include missing values.

5.2 `sd()`

The standard deviation can be calculated by sd().

# standard deviation
sd(titanicData$age, na.rm = TRUE)

[1] 14.74753

Of course, the standard deviation is the same as the square root of the variance.

sqrt(var(titanicData$age, na.rm = TRUE))

[1] 14.74753

5.3 Coefficient of variation

The variance of a measure is positively correlated with it’s mean. We can standardize the measure of variation according to the mean (so that variation can be compared between variables with different means) using the coefficient of variation, the `c.v.’.

$c.v. = (standard\ deviation) / mean * 100$

There is no standard function in R to calculate the coefficient of variation. You can do this yourself, though, directly from the definition:

# calculate the cv
sd(titanicData$age, na.rm = TRUE) / mean(titanicData$age, na.rm = TRUE) * 100

[1] 47.27653

5.4 `IQR()`

The interquartile range (or IQR) is the difference between the third quartile and the first quartile; in other words the range covered by the middle half of the data. It can be calculated easily with IQR().

# Interquartile range
IQR(titanicData$age, na.rm = TRUE)

[1] 20

Note that this is the same as we could calculate by the results from summary() above. The third quartile is 41 and the first quartile is 21, so the difference is 41 – 21 = 20.

5.5 Confidence intervals (CI) of the mean

The confidence interval for an estimate tells us a range of values that is likely to contain the true value of the parameter. For example, in 95% of random samples from a population, the 95% confidence interval of the mean will contain the true value of the population mean.

R does not have a built-in function to calculate only the confidence interval of the mean, but the function that calculates t-tests will give us this information. The function t.test() has many results in its output. By adding $conf.int to this function we can slice out only the confidence interval for the mean. By default it gives us the low and high values for the 95% confidence interval of a variable.

# 95 CI
t.test(titanicData$age)$conf.int

[1] 30.04312 32.34524
attr(,"conf.level")
[1] 0.95

As the result above shows, the 95% confidence interval of the mean of age in the titanicData data set is from about 30.0 to 32.3. R also tells us that it used a 95% confidence level for its calculation. (The confidence interval is not so useful in this case, because we actually have information for nearly all the individuals on the Titanic.)

To calculate confidence intervals with a different level of confidence, we can add the option conf.level to the t.test function. For example, for a 99% confidence interval we can use the following.

# Specify confidence interval with conf.level argument
t.test(titanicData$age, conf.level = 0.99)$conf.int

[1] 29.67976 32.70861
attr(,"conf.level")
[1] 0.99

6 Sampling exercise

6.1 Distribution of sample means

Go to the web and open the page at Gaussian sampling exercise. This page contains some interactive visualizations that let you play around with sampling to see the distribution of sampling means. Click the button that says “Tutorial” near the bottom of the page and follow along with the instructions.

6.2 Confidence intervals

Go back to the web, and open the page at Confidence interval exercise. This page draws confidence intervals for the mean of a known population. Click “Tutorial” again, and follow along.

7 Challenge questions

Make a script in RStudio that collects all your R code required to answer the following questions. Include answers to the qualitative questions using comments.

7.1

Use the data file MancHeight.csv for this question - it contains height, gender and the number of siblings for university statistics students from Manchester. These data were collected using measuring tapes in units of cm. Open that file in R.

Plot the distribution of heights in the class. Describe the shape of the distribution. Is it symmetric or skewed? Is it unimodal (one distinct peak), bimodal (two discernible peaks) or something else? ? Key point: A distribution is skewed if it is asymmetric. A distribution is skewed right if there is a long tail to the right, and skewed left if there is a long tail to the left.

Are there any large or small outliers that look as though a student used the wrong units for their height measurement. (e.g., are there any that are more plausibly a height given in inches rather than the requested centimeters?) If so, and if this is not likely to be an accurate description of an individual in your class, use filter() from the package dplyr to create a new data set without those rows.

Use R to calculate the mean height of all students in the class.

Use sd() to calculate the standard deviation of height.

7.2

The file caffeine.csv contains data on the amount of caffeine in a 16 oz. cup of coffee obtained from various vendors. For context, doses of caffeine over 25 mg are enough to increase anxiety in some people, and doses over 300 to 360 mg are enough to significantly increase heart rate in most people. A can of Red Bull contains 80mg of caffeine.

What is the mean amount of caffeine in 16 oz. coffees?

What is the 95% confidence interval for the mean?

Plot the frequency distribution of caffeine levels for these data in a histogram. Comment on whether you think the amount of caffeine in a cup of coffee is relatively consistent from one vendor to another? What is the standard deviation of caffeine level? What is the coefficient of variation?

The file caffeineCosta.csv has data on six 16 oz. cups of Breakfast Blend coffee sampled on six different days from a Costa location. Calculate the mean (and the 95% confidence interval for the mean) for these data. Compare these results to the data taken on the broader sample of vendors in the first file. Describe the difference.

7.3

A confidence interval is a range of values that are likely to contain the true value of a parameter. Consider the caffeine.csv data again.

Calculate the 99% confidence interval for the mean caffeine level.

Compare this 99% confidence interval to the 95% confidence interval you calculate in question 2 above. Which confidence interval is wider (i.e., spans a broader range)? Why should this one be wider?

Let’s compare the quantiles of the distribution of caffeine to this confidence interval. Approximately 95% of the data values should fall between the 2.5% and 97.5% quantiles of the distribution of caffeine. (Explain why this is true.) We can use R to calculate the 2.5% and 97.5% quantiles with a command like the following. (Replace “datavector” with the name of the vector of your caffeine data.)

quantile(datavector, c(0.025, 0.975), na.rm =TRUE)

Are these the same as the boundaries of the 95% confidence interval? If not, why not? Which should bound a smaller region, the quantile or the confidence interval of the mean?

7.4

Return to the class data set Mancheight.csv. Find the mean value of sibs. Add one to this to find the mean number of children per family in the class.

The mean number of offspring per family twenty years ago was about 2. Is the value for this class similar, greater, or smaller? If different, think of reasons for the difference.

Are the families represented in this class systematically different from the population at large? Is there a potential sampling bias?

Consider the way in which the data were collected. How many families with zero children are represented? Why? What effect does this have on the estimated mean family size of all couples?

7.5

Return to the data on countries of the world, in countries.csv. Plot the distributions for ecological_footprint_2000, cell_phone_subscriptions_per_100_people_2012, and life_expectancy_at_birth_female.

For each variable, plot a histogram of the distribution using hist(). Is the variable skewed? If so, in which direction?

For each variable, calculate the mean and median and add them to their respective plots using the function abline() using the v argument. Are the mean and median similar for each? Match the difference in mean and median to the direction of skew on the histogram. Do you see a pattern?

Harper Adams Data Science

Harper Data Science

This module is a part of the MSc in Data Science for Global Agriculture, Food, and Environment at Harper Adams University, led by Ed Harris.