Anja's Stats Information

I took some free statistics and data science courses through Harvard University and have decided to write down my learnings.

The courses I took are:

Statistics and R
Visualization

Probability
Linear Regression

Machine Learning
Inference and Modeling

R Stats

Statistics and R

The course link can be found here. This course is an introduction to R coding and surface-level statistics concepts.

R Basics

R Install

R-Studio Install

Swirl

Installing R file

Basic commands

To use a package in R, you must install it (once) and then load it (in each .R file where you want to use it):
install.packages("")
library()

Download CSV File

First, install downloader package:
install.packages("downloader")
Then in the file we want to:
library(downloader)
url <- "https..."
filename <- "...csv"
download(url, destfile = filename)

dplyr

There is also a great package for data manipulation: dplyr
install.packages("dplyr")
library(dplyr)

You can use the following (some of these, such as View(), unlist() and basename(), ship with R itself rather than dplyr, but they work well alongside it):
- View(data) lets you open a new tab to see the data
- filter() lets you keep only the rows that match a condition
- select() lets you select and work with a specific subset of columns
- unlist() flattens a list into a vector
- %>% is the infix operator that works like a pipe, i.e., it passes the left-hand side of the operator as the first argument of the function on the right-hand side. The point of it is to chain steps together without nesting calls or creating intermediate variables
- basename(url) returns name.csv from a url
- summarize(data, aggregate_fn(column)) collapses a column into a single summary value
An example using a lot of these would be:
primate_sleep_list <- filter(m_sleep_df, order == "Primates") %>%
select(sleep_total) %>%
unlist

You can find my dplyr exercises here.
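
To make the pipeline above easier to try out, here is a minimal sketch that runs end to end. The small m_sleep_df data frame is made up for illustration (it is only a stand-in for the real sleep data), and it assumes dplyr is already installed.

```r
library(dplyr)

# Hypothetical stand-in for the sleep data frame used in the notes
m_sleep_df <- data.frame(
  name = c("Owl monkey", "Human", "Cow", "Dog"),
  order = c("Primates", "Primates", "Artiodactyla", "Carnivora"),
  sleep_total = c(17.0, 8.0, 4.0, 10.1)
)

# Keep the primate rows, keep one column, and flatten it to a vector
primate_sleep <- m_sleep_df %>%
  filter(order == "Primates") %>%
  select(sleep_total) %>%
  unlist()

# summarize() collapses the column to one value and keeps a data frame
primate_summary <- m_sleep_df %>%
  filter(order == "Primates") %>%
  summarize(avg_sleep = mean(sleep_total))

mean(primate_sleep)
primate_summary$avg_sleep
```

Both approaches give the same average; the unlist() route returns a plain vector, while summarize() returns a one-row data frame.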
Installing RStudio for Mac

There are four panels in RStudio:
- top left source: where you open files and work on them
- bottom left console: where output is printed out and you can do operations
- top right environment/history: you can see what variables you have created
- bottom right help/plots/etc.: has all information and plots from the source/console

Swirl is a great open-source library that walks you through the basics of R. Here you can find a couple of exercises I completed from the swirl tutorial. It goes through the following:

Basic building blocks

Matrix and dataframes

Basic building blocks

R has a lot of functions. As a coding language it is best used for mathematical and statistical calculations, along with simulations.

The math operations are:

<- : assigns a value to a variable
+ : adding
- : subtracting
/ : dividing
* : multiplying
^ : exponent
%% : gives remainder
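remainder(num, divisor = #) : not built into base R; you have to define it yourself (e.g. using %%)
sqrt() : square root
abs() : absolute value
mean() : average
median() : middle value
mode() : returns the storage mode of an object (e.g. "numeric"); base R has no built-in function for the statistical mode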
c() : "concatenate"; creates a vector combining the objects within the brackets

Operations between vectors are done element by element if the vectors are the same size. If the vectors are different sizes, the shorter vector is recycled until it is the same length as the longer one.
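
A quick sketch of the recycling rule, using made-up numbers:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20)

# Same length: element-by-element addition
same_length <- x + c(10, 20, 30, 40)   # 11 22 33 44

# Different lengths: y is recycled to 10 20 10 20 before adding
recycled <- x + y                      # 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still recycles but prints a warning.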

Workspace & files

Some commands that are useful for working on your system:

getwd() : get working directory
ls() : list objects in the workspace (variables created)
list.files() : list files in directory
args() : when used on a function name can let you see what arguments go into the function
dir.create("name") : this lets you create a directory with a set "name"
setwd("") : set which directory you want to work in
file.create("name.R") : creates an empty file
file.rename("old.R", "new.R") : renames a file
file.remove("remove-name.R") : deletes a file
file.copy("original.R", "copy.R") : copies a file
file.path("directory", "name.R") : builds a file path from its parts in a platform-independent way
unlink("directory", recursive = TRUE) : deletes directory

Sequence of numbers

There are several different ways to get a sequence of numbers:

: e.g. 1:3 gives us 1 2 3
seq(#, #, by = increment) OR seq(#, #, length.out = # of #s you want)
seq(along.with = x) OR seq_along(x) : gives you the integers from 1 up to the length of x
rep(x, times = how many times) : repeats x the given number of times
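
Here is a short sketch of these sequence helpers with made-up inputs (the variable names are only for illustration):

```r
one_to_five <- 1:5

# Fixed step size
by_half <- seq(0, 2, by = 0.5)             # 0.0 0.5 1.0 1.5 2.0

# Fixed number of values instead of a step size
five_values <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# One index per element of an existing vector
letters_vec <- c("a", "b", "c")
positions <- seq_along(letters_vec)        # 1 2 3

# Repeat the whole vector, or each element in turn
repeated <- rep(c(0, 1), times = 3)        # 0 1 0 1 0 1
each_repeated <- rep(c(0, 1), each = 2)    # 0 0 1 1
```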

Vector

An atomic vector is a vector that contains exactly one data type (i.e. logical, character, integer, etc.). A list is a vector that can contain multiple data types. A logical vector holds TRUE, FALSE or NA values; it is usually produced by checks using: !A, A|B, A&B, ==, !=, <=, >=, <, >. In character vectors, double quotes are used to distinguish character objects.

Two useful vector things to know:

LETTERS : is a predefined R variable for all the letters in the English alphabet.
paste(vector, collapse = " ") : puts characters together from vector with a space in between; you can also specify something else to go in between each part of the vector using this function.

Missing Values

Working with missing values is a regular occurrence when handling data. It is important to understand how to find out whether data is missing, and also whether to fill it in and with what. Here are a couple of good-to-know functions when working with missing data:

is.na(data_vector) : returns TRUE or FALSE for each element
rnorm(#) : draws # values from a standard normal distribution - which can be useful for filling in missing data
sample(data, n) : lets you pull a random sample of size n from your data - which can be useful for filling in missing data

Helpful to know:

NA : not available (a missing value); not the same as NULL
NaN : not a number
Inf : infinity

Subsetting vectors

You can create subsets of vectors using vec[#:#] which uses index to pull out values from vector. We can also exclude indices: x[c(-2, -10)] or x[-c(2,10)].

You can also check which elements of a vector are missing or not: is.na(vector) or !is.na(vector). The result can be used to subset, e.g. x[!is.na(x)] keeps only the non-missing values.

We can get the names of a vector's elements (or the column names of a dataframe) using names(x). And we can also check if two vectors are the same using identical(vector1, vector2) - which returns TRUE or FALSE.
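
The subsetting ideas above can be sketched with a small named vector (the values are made up):

```r
x <- c(a = 5, b = NA, c = 12, d = 7, e = NA)

first_three <- x[1:3]             # elements 1 to 3
without_2_and_5 <- x[-c(2, 5)]    # drop elements 2 and 5

# Logical vector marking the missing values
missing <- is.na(x)

# Keep only the values that are not missing
observed <- x[!is.na(x)]
names(observed)                   # "a" "c" "d"

# Both ways of excluding indices give the same result
same <- identical(x[c(-2, -5)], x[-c(2, 5)])
```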

Matrix and dataframes

Matrices can only contain one class of data, whereas, dataframes can contain more than one class. Here are some useful functions:

length(df) : number of columns of a dataframe (for a matrix, the total number of elements)
attributes(df) : lists things like dimensions, names, class, etc.
matrix(data = , nrow = #, ncol = #) : creates a matrix with set row and columns
cbind(vector, vector/matrix) : combines columns of two objects
data.frame(vector/matrix, vector/matrix) : combines columns into dataframe
colnames(df) <- vector : lets you assign column names from a character vector
rownames(df) <- vector : lets you assign row names from a character vector

A neat fact is that you can access columns to perform aggregate functions by using df$column_name. For instance: sum(df$column_name).
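
A minimal sketch of the difference between cbind() and data.frame(), using made-up patient data (all names and values here are hypothetical):

```r
# 2 x 3 numeric matrix, filled column by column
my_matrix <- matrix(data = c(35, 42, 70, 85, 120, 135), nrow = 2, ncol = 3)
patients <- c("Bill", "Gina")

# cbind() keeps a matrix, so the numbers are coerced to character
with_names <- cbind(patients, my_matrix)

# data.frame() lets each column keep its own class
my_df <- data.frame(patients, my_matrix)
colnames(my_df) <- c("patient", "age", "weight", "bp")

# Aggregate over a single column with $
total_age <- sum(my_df$age)
```

This is why dataframes are usually the better fit when a table mixes text and numbers.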

Logic

Above there were some logic functions that were spoken of already. Here are some more:

isTRUE(expression) : returns TRUE only if the expression evaluates to a single TRUE
xor(expression, expression) : returns TRUE if exactly one of the expressions is true
which(logical_vector) : returns the indices that are TRUE
any(logical_vector) : returns TRUE if at least one component is true
all(logical_vector) : returns TRUE if all of the components are true

Note that the number of typographical symbols is important:

& : evaluates across a vector
&& : compares single values only; older versions of R used just the 1st member of each vector, while R 4.3.0 and later raise an error for longer vectors
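
A small sketch of these logic helpers on a made-up vector:

```r
scores <- c(3, 8, 5, 10)

high <- scores > 4                 # FALSE TRUE TRUE TRUE
high_positions <- which(high)      # 2 3 4

any_above_9 <- any(scores > 9)     # TRUE
all_above_4 <- all(scores > 4)     # FALSE

# & works element by element across both vectors
in_range <- (scores > 4) & (scores < 9)   # FALSE TRUE TRUE FALSE

# xor() is TRUE only when exactly one side is TRUE
one_side <- xor(TRUE, FALSE)       # TRUE
both_sides <- xor(TRUE, TRUE)      # FALSE
```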

Functions

Functions are small pieces of reusable code, so it is worth creating one whenever you plan on reusing some logic. You can create a function in the same .R file or in a separate one. Within swirl, after you have saved your function be sure to type submit() in your console so that swirl can check it; outside of swirl, you run the function definition before calling it. Here is an example of one:

boring_function <- function(x) {
x
}

R has some very useful functions built in: lapply, sapply, vapply, and tapply - which I will go into more detail for.

Lapply and sapply

These are loop functions. They allow us to apply a function to every element of a list.

lapply: applies a function to each element of a list and returns a list.

sapply: applies a function to each element of a list but simplifies the result where it can instead of always returning a list. If every result has length one, it returns a vector. If every result is a vector of the same length (greater than one), it returns a matrix. If it cannot simplify the results, it returns a list.

Vapply and tapply

vapply: does the same thing as sapply but requires you to specify the format of each result. It raises an error if a result does not match, and it can speed up the process for larger datasets.

tapply: allows the splitting of data by a specific value of some variable. An example:

tapply(flags$animals, flags$landmass, mean)

This call takes the mean of animals separated by landmass.
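
To compare the four loop functions side by side, here is a sketch on a tiny made-up flags data frame (a hypothetical stand-in for the real flags data, with invented values):

```r
# Hypothetical stand-in for the flags data
flags <- data.frame(
  landmass = c(1, 1, 2, 2, 2),
  animals  = c(0, 1, 1, 1, 0),
  colours  = c(3, 4, 2, 5, 3)
)

# lapply always returns a list (one entry per column here)
col_classes <- lapply(flags, class)

# sapply simplifies the same kind of result to a named vector
col_sums <- sapply(flags, sum)

# vapply does the same but checks each result is one number
col_means <- vapply(flags, mean, numeric(1))

# tapply splits animals by landmass, then averages each group
animals_by_landmass <- tapply(flags$animals, flags$landmass, mean)
```

If the function passed to vapply returned, say, a character value, the call would stop with an error instead of silently changing the output type.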

Looking at data

Some useful dataframe functions are:

head(df, #)
tail(df, #)
summary(df) : gives various details based on datatype per column
dim(df)
nrow(df)
ncol(df)
names(df) : column names - character vector
str() : returns structure of df/function/etc.
class(object)
read.csv() : default stores it into a dataframe
read.table() : also store in dataframe
object.size(df) : size occupying memory
unique(vector) : returns vector with duplicates removed

Simulation

Simulations are super useful for running experiments that are randomized to collect data. Here are the common functions used for simulations:

sample(data, size, replace = FALSE, prob = NULL) : you can specify probability as a vector: c(0.3, 0.7) but it does not have to be known.
replicate(number of times, distribution) : returns matrix of n trials of the specified distribution. You can also specify your own function instead of a distribution by using { }.
colMeans(matrix) : takes mean of each column
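
Here is a minimal simulation sketch combining these functions. The probabilities and sizes are arbitrary choices for illustration, and set.seed() is only there to make the run repeatable.

```r
set.seed(1)

# 100 flips of a biased coin: 0 with prob 0.3, 1 with prob 0.7
coin_flips <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.3, 0.7))

# 1000 trials of 10 biased flips each; one column per trial
trials <- replicate(1000, rbinom(10, size = 1, prob = 0.7))
trial_means <- colMeans(trials)

# replicate() also accepts your own code block in { }
custom_means <- replicate(1000, {
  flips <- sample(c(0, 1), 10, replace = TRUE)
  mean(flips)
})

mean(trial_means)   # should land close to 0.7
```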

We have a lot of different distributions/functions and variables that can be used for creating simulations. Here is a table that breaks down some of them:

	r* random function	d* density function	p* cumulative probability function	q* quantile function
binom : binomial distribution (R.V.)	rbinom()	dbinom()	pbinom()	qbinom()
norm : normal distribution (R.V.)	rnorm()	dnorm()	pnorm()	qnorm()
pois : poisson distribution (R.V.)	rpois()	dpois()	ppois()	qpois()
exp : exponential distribution (R.V.)	rexp()	dexp()	pexp()	qexp()
chisq : chi-square distribution (R.V.)	rchisq()	dchisq()	pchisq()	qchisq()
gamma : gamma distribution (R.V.)	rgamma()	dgamma()	pgamma()	qgamma()

hist(data) : creates histogram of data - more on visualization in a different course.

Dates and time

Dates and times are represented by POSIXct (the number of seconds since 1970-01-01) and POSIXlt (list of seconds, minutes, hours, etc.); plain dates use the Date class (the number of days since 1970-01-01). Some useful functions to know:

Sys.Date() : today's date (stored internally as days since 1970-01-01)
Sys.time() : current time in POSIXct format
weekdays()
months()
quarters()
strptime() : converts character strings to date-times; ?strptime gives you a list of format codes and what they correspond to (e.g. %B is the full month name)
difftime(time1, time2, units = "days") : gives the number of days between the two times
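
A short sketch of how these date and time classes behave, using arbitrary example dates and an explicit UTC time zone so the results do not depend on the machine's settings:

```r
# Dates are stored as days since 1970-01-01
d1 <- as.Date("1969-01-01")
days_since_epoch <- unclass(d1)    # -365

# POSIXct stores seconds; POSIXlt exposes the pieces as a list
t1 <- as.POSIXct("2012-12-25 10:30:00", tz = "UTC")
t1_lt <- as.POSIXlt(t1)
minutes_part <- t1_lt$min          # 30

# strptime() parses text using format codes
parsed <- strptime("2013-01-17 23:22:58",
                   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Number of days between two dates
gap <- difftime(as.Date("2013-01-17"), as.Date("2012-12-25"), units = "days")
as.numeric(gap)                    # 23
```

weekdays(), months() and quarters() can be called on any of these objects; their text output depends on the system language.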

Base graphics

As mentioned before, we will go over this in more detail in the visualization course section later on. Here are some basic functions to get us started:

data(name) : loads the built-in (or package) dataset with the given name
plot(data) : creates scatterplot; can access parameters of plot using ?par. An example of a basic scatterplot:
plot(x=df$name, y=df$name, xlab = "xlabel", xlim = c(#beginx, #endx), ylab = "ylabel", main = "Main Title", sub = "subtitle", col = number representing color for plot)
boxplot()

Exploratory Data Analysis
Histogram

QQ-Plot

Boxplot

Scatterplot

Symmetry of log ratios
- The following is useful for all types of plots:
  - load("data.RData") : this is for when you want data that exists as an .RData file
  - par(mfrow = c(#, #)) : this changes the view of the subplots, where the first number is the number of rows and second is number of columns
  Here is a sample of making a histogram (where "breaks" tells hist() where to draw the interval boundaries):
  hist(data, main = "Main Title", xlab = "X-Axis", ylab = "Y-Axis", breaks = seq(floor(min(data)), ceiling(max(data))))
  
  Related to a histogram, we can use an empirical cumulative distribution function to show us the proportion of values falling at or below set thresholds. For a normal distribution, you would expect to see something like an S shape.
- A QQ-plot (aka quantile-quantile plot) is a plot that displays the observed quantiles vs the quantiles predicted by a normal distribution: qqnorm(data). We are also able to draw a qqline() which makes it easier to see if our data is close to normal.
  
  qqnorm(data)
  qqline(data, col = "steelblue")
  It would look like this: This data is not normal.
  
  If our data is not normally distributed, we can have data that is right (or positively) skewed (long tail to right)
  
  histogram:
  
  qqnorm:
  
  Or we could have data that is left (or negatively) skewed (long tail to left)
  histogram:
  
  qqnorm:
  
  Here is a basic example of a qqnorm plot.
- Boxplots are commonly used to see distributions (especially if not normal). We call the x-axis: factors and the y-axis: values. A basic example would be:
  boxplot(data, ylab = "Y-Axis", ylim = c(#, #))
  Or we can:
  boxplot(split(values, factor))
  Or even:
  boxplot(values~ factor)
  
  Here is a basic coding example of a boxplot.
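- Usually scatterplots can show correlation better than other summary stats.
  
  It is a good idea to add a line to the plot to make the correlation easier to see. If x and y have both been standardized, the line through the origin with slope equal to the correlation does this:
  abline(0, cor(x,y))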
  
  Here is a basic coding example of a scatterplot.
- Sometimes we have a graph that doesn't show a linear relationship but rather a multiplicative change - making it hard to understand whether there is a relationship in general. For instance, if we have a plot
  But if we take log base 2 of our function, we can see that there is a very linear relationship:
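
One reason the log transform helps with multiplicative changes is that it makes ratios symmetric around 0. A small sketch with made-up ratios:

```r
# Ratios below 1 are squeezed between 0 and 1, ratios above 1 are not
ratios <- c(1/32, 1/4, 1/2, 1, 2, 4, 32)

# On the log2 scale, halving and doubling are the same distance from 0
log_ratios <- log2(ratios)    # -5 -2 -1 0 1 2 5
```

So a 32-fold decrease and a 32-fold increase end up equally far from "no change", which is not the case on the original scale.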
Random Variable and Probability Distribution
Random variable

Null distribution

Probability distributions
- Inferential statistics lets us check whether our results are significant when looking at the relationship between variables.
  
  A random variable is a variable whose values are numerical and determined by chance.
  
  Here are the R exercises for random variables.
- A null distribution is the set of all possible realizations of a statistic under the null hypothesis. The null hypothesis states that there is no effect between the variables being studied.
  
  Here are the R exercises for null distribution.
- A probability distribution shows the probability of each occurrence of the possible outcomes of a random variable.
  
  Here are the R exercises for probability distribution.
Central Limit Theorem
Normal distribution

Populations, parameters and sample estimates

CLT

t-tests
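- As mentioned before, a normal distribution is a bell-shaped probability curve. Specific proportions of the data fall within each number of standard deviations of the mean: about 68% within 1, 95% within 2, and 99.7% within 3.
  The formula for the probability that a value falls between a and b is:
  $\Pr(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx$
  
  We can also standardize the units (which is very useful) by subtracting the mean from each point and dividing by the standard deviation:
  $Z_i = \frac{X_i - \mu}{\sigma}$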
  
  Here are the R exercises for normal distributions.
- We have our population mean, mu. However, in the real world it is extremely difficult to get the whole population so we usually have to rely on a sample. We also want to make sure our sample is a good representation of the population in order for us to be able to have our results be valid.
  
  The Central Limit Theorem (CLT) lets us see how close the sample average is to the population average.
  
  The calculation for population vs sample standard deviation varies slightly. Rafael Irizarry created some shortcuts for data exploration in his library: rafalib. One of the functions is for calculating the population sd: popsd(data).
  
  Here are the R exercises for population samples estimates.
- The Central Limit Theorem states that if the sample size is big enough, the distribution of the sample average will approximate the normal distribution (regardless of the population's distribution). This is a very powerful result as it allows us to calculate a lot of information about our sample average without knowing the distribution of the population. In addition, the bigger the sample size, the smaller the spread of the sample average (its standard deviation is the population sd divided by the square root of the sample size, so a bigger denominator leads to a smaller overall sd).
  
  How big is big enough?
  A common rule of thumb is that sample sizes greater than 30 are sufficient - but it does depend on the data.
  
  Under the CLT, the average of the sample means will equal the population mean, and the standard deviation of the sample means will equal the population standard deviation divided by the square root of the sample size.
  
  Here are the R exercises for CLT.
- A t-test is used to compare the means of two groups to see if there was a change between them. This is usually done to see if the treatment group mean is actually different than the control group. In order to use this test, the data must be randomly selected and normally distributed. This is why it can be used in conjunction with the CLT.
  
  We must consider the degrees of freedom when looking at our t-test. This website has one of the simplest explanations for degrees of freedom and why they are needed. Usually for t-tests our degrees of freedom will be the two sample sizes minus 2: df = m + n - 2.
  
  For a t-test, there are several different kinds depending on the data you have. This website has a really good breakdown of the different kinds you can calculate. Overall, we can have the calculation for the t-statistic between two populations be:
  
  For programming we can use:
  t.test(sample1, sample2)
  This computes the difference in means, estimates its standard error, and reports the t-statistic and p-value.
  
  Here are the R exercises for the CLT and t-tests.
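
To see what t.test() does with two samples, here is a sketch on simulated data. The group names, means and sizes are made up, and the manual calculation uses the unpooled standard error, which is what t.test() uses by default (the Welch version); this is one common form of the t-statistic, not necessarily the only one used in the course.

```r
set.seed(1)

# Made-up control and treatment measurements
control <- rnorm(12, mean = 24, sd = 3)
treatment <- rnorm(12, mean = 27, sd = 3)

# t-statistic by hand: difference in means over its standard error
diff_means <- mean(treatment) - mean(control)
se <- sqrt(var(treatment) / length(treatment) + var(control) / length(control))
tstat_manual <- diff_means / se

# Default t.test() is the Welch test (var.equal = FALSE)
result <- t.test(treatment, control)
result$statistic
result$p.value

# With var.equal = TRUE the degrees of freedom are m + n - 2
pooled <- t.test(treatment, control, var.equal = TRUE)
pooled$parameter   # 22
```

With the default settings the reported degrees of freedom are usually not a whole number, because the Welch test adjusts them for unequal variances.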
Inference
Inference allows us to draw conclusions about our population with only sample data. However, in order to do this properly there is a lot to consider: distribution, size of sample, etc. We looked at some of these already - in particular with the CLT. Here we will look at more things to consider when trying to find out important results about our sample/population.

P-value

Confidence Intervals

Power Calculations

Monte Carlo Sims

Create Null

Association Tests
- P-values are important in hypothesis testing. A p-value is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is actually true. The smaller the p-value, the stronger the evidence is to reject the null hypothesis.
  
  There are different significance levels that p-values are compared against. Most often a level of 0.05 or 0.01 is used. The level to use is dependent on the data and type of question you are trying to answer.
  
  For programming we can use: pnorm(a) which gives us the probability that a standard normal random variable falls below a. Example:
  pval <- 2*(1 - pnorm(abs(tstat)))
  for a 2 tailed test using our t-statistic result. For a right tail test:
  pval <- 1 - pnorm(tstat)
  And for a left tail test:
  pval <- pnorm(tstat)
  
  Note: we can change the t-stat to any distribution value we want, e.g. z-score and the process still remains the same when calculating the p-value.
- Sometimes very large sample sizes can lead to a very small p-value, implying statistical significance even when the effect itself is too small to matter in practice. This is why confidence intervals (CI) are more informative, as they include the estimate itself.
  
  We can calculate our confidence interval:
  $\bar{X} \pm z_{1-\alpha/2} \frac{s}{\sqrt{N}}$
  Where: $\bar{X}$ is the sample mean, $s$ is the sample standard deviation, $N$ is the sample size, and $z_{1-\alpha/2}$ is the standard normal quantile for significance level $\alpha$.
  
  To calculate using R (if we have a normal distribution), where se is the standard error of the mean, sd(x)/sqrt(N):
  upper_ci <- xbar + (qnorm(1 - alpha/2) * se)
  lower_ci <- xbar - (qnorm(1 - alpha/2) * se)
  ci <- c(lower_ci, upper_ci)
  If our sample size is too small for the CLT (and the population is approximately normal), we must use the t-distribution:
  upper_ci <- xbar + (qt(1 - alpha/2, df = #) * se)
  lower_ci <- xbar - (qt(1 - alpha/2, df = #) * se)
  ci <- c(lower_ci, upper_ci)
  
  Here are the R exercises for t-tests and confidence interval calculations.
- Statistical power is the probability of rejecting the null hypothesis when the alternative is actually true. The larger the sample size, the larger your power.
  
  We have two main errors when it comes to our hypothesis testing:
  1. Type I error: reject the null when it is actually true. Meaning we think that there is an effect, but there is not.
  2. Type II error: fail to reject the null when it is actually false. Meaning we think there is no effect, when there actually is.
  Here are the R exercises for the power calculations.
- Monte Carlo simulations are used to model the probability of different outcomes in a process that cannot easily be predicted due to the intervention of a random variable.
  
  There are many different ways to generate data sets to model our experiments. The most basic building blocks for a Monte Carlo simulation are:
  N <- 15
  B <- 10000
  tstats <- replicate(B,{
  X <- sample(c(-1,1), N, replace=TRUE)
  sqrt(N)*mean(X)/sd(X)
  })
  
  Here are the R exercises for the Monte Carlo simulations.
- We can create a null distribution from the data we have using the following steps:
  1. Get new null mean:
    - a) collect data:
      dat <- c(dataset1, dataset2)
    - b) randomize data:
      shuffle <- sample(dat)
    - c) disperse to 2 groups (each of size n):
      group1 <- shuffle[1:n]
      group2 <- shuffle[(n + 1):(2 * n)]
    - d) take mean for null:
      mean(group1) - mean(group2)
  2. replicate (where null_mean is a function you write that wraps steps a to d):
    replicate(B, null_mean(n))
  We can calculate the p-value given the null distribution and the observed difference in means:
  mean( abs(null_distribution) >= abs(observed) )
  
  Here are the R exercises for the permutations.
- There are many different association tests that are used for just what the name suggests: finding out if there is any association between two random variables.
  
  There are several useful methods to look at association:
  - chisq.test(data): can tell you whether 2 variables are independent of one another. If the p-value is less than alpha then the observed counts differ significantly from the counts expected under independence.
  - fisher.test(data): can tell you if there are non-random associations between two variables (it is based on the hypergeometric distribution).
  An example from genetics is the Manhattan plot, which plots chromosome position on the x-axis vs the statistical significance of the association, as -log(pvalue), on the y-axis. It would look something like this:
  The y-axis is actually plotting the p-value, but the negative log is taken of the p-value to exaggerate any p-values that are very small (like chromosome 6, 8, 12, and 19 in the picture).
  
  Here are the R exercises for the association tests.
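
The permutation steps for creating a null distribution (from the Create Null item above) can be sketched end to end as follows. The two groups are simulated stand-ins for real data, and null_mean is a hypothetical helper name chosen to match the notes.

```r
set.seed(1)

# Made-up data for two groups of equal size
control <- rnorm(12, mean = 24, sd = 3)
treatment <- rnorm(12, mean = 27, sd = 3)
n <- length(control)
observed <- mean(treatment) - mean(control)

# One permutation: pool, shuffle, split in two, compare means
null_mean <- function(n) {
  shuffle <- sample(c(control, treatment))
  group1 <- shuffle[1:n]
  group2 <- shuffle[(n + 1):(2 * n)]
  mean(group1) - mean(group2)
}

# Repeat many times to build the null distribution
B <- 1000
null_distribution <- replicate(B, null_mean(n))

# Share of shuffled differences at least as extreme as the observed one
pval <- mean(abs(null_distribution) >= abs(observed))
```

Note the parentheses in (n + 1):(2 * n): without them, n + 1:2 * n is read as n + ((1:2) * n), which selects the wrong elements.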
Robust Statistics
What are robust summary statistics?
They provide valid results even in the face of less than ideal conditions (e.g. data with outliers).

Example of robust summary statistics include: median, MAD, and Spearman Correlation.

Median isn't as sensitive to outliers as mean because it is more determined by rank of data as opposed to purely values like mean.

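MAD (aka median absolute deviation) is a robust estimate of the standard deviation. It is calculated by $1.4826 \times \mathrm{median}(|X_i - \mathrm{median}(X_i)|)$, where the factor 1.4826 makes it comparable to the standard deviation for normally distributed data.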

Spearman correlation is a more robust summary statistic for correlation (compared to Pearson). This is because it does not look directly at the values of the points - which would make it more susceptible to outliers. Rather, it looks at the rank of the values along both axes. It is calculated in two steps.
1. Compute rank of the vectors (both x and y)
2. Compute correlation of the newly ranked vectors
For instance, we might have something that initially looks like this:
Here we can see that the data looks correlated. But if we take the rank of the values we might see that it actually looks something like this: which shows no correlation.

This is how the Spearman correlation works. Here are some exercises I completed for the robust summary statistics.
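
A small sketch of these robust statistics on made-up data with one extreme outlier:

```r
# Six ordinary values plus one outlier
x <- c(2.1, 2.5, 2.3, 2.8, 2.4, 2.6)
x_out <- c(x, 100)

# The mean and sd are dragged by the outlier; median and MAD barely move
mean(x_out)
median(x_out)    # 2.5
sd(x_out)
mad(x_out)       # uses the 1.4826 scaling factor by default

# Spearman correlation is Pearson correlation computed on the ranks
a <- c(1, 2, 3, 4, 5, 6, 7)
b <- c(2.0, 1.5, 3.1, 3.9, 5.2, 5.8, -40)
spearman <- cor(a, b, method = "spearman")
rank_based <- cor(rank(a), rank(b))
```

The last two values are identical, which is the two-step calculation described above.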

The Mann-Whitney-Wilcoxon Test (aka Mann-Whitney U Test or Wilcoxon Rank Sum Test) is used to determine if two samples are likely to be derived from the same population.

Usually our hypothesis tests check the equality of means between two independent samples drawn from a population that is either normally distributed or sampled in large enough numbers to be approximately normal. However, in cases where we have a small sample size that is not normally distributed we want to use a nonparametric test. The Mann-Whitney-Wilcoxon test is a popular nonparametric test with the hypotheses being:
- H0: The two populations are equal.
- H1: The two populations are not equal.
It first ranks the y-values between the two groups. Then it looks at how many values are smaller in 1 group compared to the other:
After the sum has been taken of how many are smaller then it divides it by the total number of points within the group. This is then averaged to get an overall score to determine if the two populations are equal or not.

Here are some exercises I completed for the Mann-Whitney-Wilcoxon Test.
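
In R this test is available as wilcox.test(). Below is a minimal sketch on two small made-up groups that do not overlap at all, so the result is easy to check by hand.

```r
# Every value in group_a is smaller than every value in group_b
group_a <- c(1.1, 2.3, 3.2, 4.8, 5.5)
group_b <- c(6.1, 7.4, 8.0, 9.9, 10.2)

result <- wilcox.test(group_a, group_b)

# W counts the pairs where a group_a value exceeds a group_b value
result$statistic
pairs_a_above_b <- sum(outer(group_a, group_b, ">"))

# Small p-value: evidence that the two populations differ
result$p.value
```

For small samples without ties, wilcox.test() computes an exact p-value rather than relying on a normal approximation.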

Visuals

Visualization

The course link can be found here. Within this course we look at different R functions for visualizing data and what scenarios to use them in.

Introduction to Data Visualization
Data Types

We have two different types of data: categorical and numerical. Categorical can be further divided into ordinal (ordered) or non-ordinal. Numerical can be further divided into discrete (countable values, such as integers) or continuous.

Practice Problems
1. The type of data we are working with will often influence the data visualization technique we use. We will be working with two types of variables: categorical and numeric. Each can be divided into two other groups: categorical can be ordinal or not, whereas numerical variables can be discrete or continuous. We will review data types using some of the examples provided in the dslabs package. For example, the heights dataset.
  library(dslabs)
  data(heights)
  
  names(heights)
  ## "sex" "height"
2. We saw that sex is the first variable. We know what values are represented by this variable and can confirm this by looking at the first few entries:
  library(dslabs)
  data(heights)
  head(heights)
  
  What data type is the sex variable? Categorical
3. Keep in mind that discrete numeric data can be considered ordinal. Although this is technically true, we usually reserve the term ordinal data for variables belonging to a small number of different groups, with each group having many members. The height variable could be ordinal if, for example, we report a small number of values such as short, medium, and tall. Let's explore how many unique values are used by the heights variable. For this we can use the unique function:
  x <- c(3, 3, 3, 3, 4, 4, 2)
  unique(x)
  
  library(dslabs)
  data(heights)
  x <- heights$height
  
  length(unique(x))
  ## 139
4. One of the useful outputs of data visualization is that we can learn about the distribution of variables. For categorical data we can construct this distribution by simply computing the frequency of each unique value. This can be done with the function table.
  library(dslabs)
  data(heights)
  x <- heights$height
  
  tab <- table(x)
5. To see why treating the reported heights as an ordinal value is not useful in practice we note how many values are reported only once.
  library(dslabs)
  data(heights)
  tab <- table(heights$height)
  
  sum(tab==1)
  ## 63
6. Since there are a finite number of reported heights and technically the height can be considered ordinal, what is true:
  It is more effective to consider heights to be numerical given the number of unique values we observe and the fact that if we keep collecting data even more will be observed.
Introduction to Distributions
Cumulative Distribution Function

Every continuous distribution has a cumulative distribution function (CDF) which defines the proportion of the data at or below a given value a. It is written as: $F(a) = \Pr(x \le a)$

For datasets that are not normal, the CDF can be calculated manually by defining a function to compute the probability above. Like this:
# define range of values spanning the dataset:
a <- seq(min(my_data), max(my_data), length = 100)

# computes prob. for a single value
cdf_function <- function(x) {
mean(my_data <= x)
}
cdf_values <- sapply(a, cdf_function)
plot(a, cdf_values)
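
As a cross-check, the manual CDF can be compared with R's built-in ecdf(). This sketch uses simulated skewed data as a stand-in for my_data (the distribution and sample size are arbitrary choices):

```r
set.seed(1)

# Made-up, clearly non-normal data
my_data <- rexp(500, rate = 1 / 10)
a <- seq(min(my_data), max(my_data), length.out = 100)

# Manual CDF: proportion of the data at or below x
cdf_function <- function(x) {
  mean(my_data <= x)
}
cdf_values <- sapply(a, cdf_function)

# Built-in version: ecdf() returns a function we can evaluate at a
builtin_values <- ecdf(my_data)(a)
matches <- isTRUE(all.equal(cdf_values, builtin_values))

# Proportion of the data between two values
prop_10_to_20 <- cdf_function(20) - cdf_function(10)
```

Subtracting two CDF values, as in the last line, gives the proportion of data in any interval.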

Smooth Density Plots

Smooth density plots are similar to histograms but more appealing: the curve is what you would get by drawing a smooth line through the tops of very small histogram bins, and the y-axis is on the frequency (density) scale rather than the count scale. A histogram is assumption free, whereas the smooth density is based on assumptions/choices you make (i.e. you can control the smoothness using ggplot). The smooth density is scaled so that the area under the density curve adds up to 1, meaning the area over a range gives us the proportion of data within that range. This all makes it easier to compare two datasets.

Practice Problems: Distributions
1. You may have noticed that numerical data is often summarized with the average value. For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation. So, for example, you might read a report stating that scores were 680 plus or minus 50 (the standard deviation). The report has summarized an entire vector of scores with just two numbers. Is this appropriate? Is there any important piece of information that we are missing by only looking at this summary rather than the entire list? We are going to learn when these 2 numbers are enough and when we need more elaborate summaries and plots to describe the data.
  Our first data visualization building block is learning to summarize lists of factors or numeric vectors. The most basic statistical summary of a list of objects or numbers is its distribution. Once a vector has been summarized as distribution, there are several data visualization techniques to effectively relay this information. In later assessments we will practice to write code for data visualization. Here we start with some multiple choice questions to test your understanding of distributions and related basic plots.
  In the murders dataset, the region is a categorical variable and on the right you can see its distribution. To the closest 5%, what proportion of the states are in the North Central region?
  25%
2. In the murders dataset, the region is a categorical variable and to the right is its distribution. What is true:
  The graph shows only four numbers with a bar plot.
3. The plot shows the eCDF for male heights. Based on the plot, what percentage of males are shorter than 75 inches? 95%
4. The plot shows the eCDF for male heights. To the closest inch, what height m has the property that 1/2 of the male students are taller than m and 1/2 are shorter?
  69 in
5. Here is an eCDF of the murder rates across states. Knowing that there are 51 states (counting DC) and based on this plot, how many states have murder rates larger than 10 per 100,000 people?
  1
6. What is true:
  with the exception of 4 states, the murder rates are below 5 per 100,000.
7. Here is a histogram of male heights in our heights dataset. Based on this plot, how many males are between 62.5 and 65.5?
  58
8. From the histogram above, about what percentage are shorter than 60 inches?
  1%
9. Based on this density plot, about what proportion of US states have populations larger than 10 million?
  0.15
10. Here are three density plots. Is it possible that they are from the same dataset? What is true:
  They are the same dataset, but the first does not have the x-axis in the log scale, the second undersmooths and the third oversmooths.
Normal Distribution aka Gaussian distribution

The normal distribution has the classic bell-shaped curve where: 1 standard deviation on either side of the mean contains 68% of the data, 2 sds contain 95%, and 3 sds contain 99.7% of the data.

The equation for the distribution is given by $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. While the equation to convert from a random variable (X) to the standard normal (Z) is given by $Z = \frac{X - \mu}{\sigma}$.

The scale function converts a vector of approximately normally distributed values into z-scores:
z <- scale(x)

You can compute the proportion of observations that are within 2 standard deviations of the mean like this:
mean(abs(z) < 2)
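
Here is a sketch that checks these two lines on simulated data. The "heights" are randomly generated with an assumed mean of 69 and sd of 3, purely for illustration.

```r
set.seed(1)

# Simulated, approximately normal "heights"
x <- rnorm(10000, mean = 69, sd = 3)

# scale() subtracts the mean and divides by the sd (returns a matrix)
z <- scale(x)
manual_z <- (x - mean(x)) / sd(x)
same_result <- isTRUE(all.equal(as.numeric(z), manual_z))

# Proportion within 2 standard deviations of the mean
within_2 <- mean(abs(z) < 2)

# Value the normal distribution predicts, about 0.954
predicted <- pnorm(2) - pnorm(-2)
```

For data that is close to normal, within_2 should land near the predicted value.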

The Normal CDF and pnorm

Discretization: although the true height distribution is continuous, the reported heights tend to be more common at discrete values (usually due to rounding), meaning that the normal approximation from pnorm is not as accurate when looking over ranges that do not include an integer.
F(a) = pnorm(a, mu, sigma)

Practice Problems: Normal Distribution
1. Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.
  The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. A reason for this is that approximately normal distributions occur in many situations. Examples include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. Often data visualization is needed to confirm that our data follows a normal distribution.
  Here we focus on how the normal distribution helps us summarize data and can be useful in practice.
  One way the normal distribution is useful is that it can be used to approximate the distribution of a list of numbers without having access to the entire list. We will demonstrate this with the heights dataset. Load the height data set and create a vector x with just the male heights:
  library(dslabs)
  data(heights)
  x <- heights$height[heights$sex == "Male"]
  
  mean(x<=72) - mean(x <= 69)
  ## 0.3337438
2. Suppose all you know about the height data from the previous exercise is the average and the standard deviation and that its distribution is approximated by the normal distribution. Suppose you only have avg and stdev below, but no access to x, can you approximate the proportion of the data that is between 69 and 72 inches?
  library(dslabs)
  data(heights)
  x <- heights$height[heights$sex=="Male"]
  avg <- mean(x)
  stdev <- sd(x)
  
  pnorm(72, avg, stdev) - pnorm(69, avg, stdev)
  ## 0.3061779
3. Notice that the approximation calculated in the second question is very close to the exact calculation in the first question. The normal distribution was a useful approximation for this case. However, the approximation is not always useful. An example is for the more extreme values, often called the "tails" of the distribution. Let's look at an example. We can compute the proportion of heights between 79 and 81.
  library(dslabs)
  data(heights)
  x <- heights$height[heights$sex == "Male"]
  exact <- mean(x > 79 & x <= 81)
  approx <- pnorm(81, mean(x), sd(x)) - pnorm(79, mean(x), sd(x))
  
  exact/approx
  ## 1.614261
4. Someone asks you what percent of seven footers are in the National Basketball Association (NBA). Can you provide an estimate? Let's try using the normal approximation to answer this question.
  First, we will estimate the proportion of adult men that are taller than 7 feet.
  Assume that the heights of adult men in the world are normally distributed with an average of 69 inches and a standard deviation of 3 inches.
  # use pnorm to calculate the proportion over 7 feet (7*12 inches)
  1 - pnorm(7*12, 69, 3)
  ## 2.866516e-07
5. Now we have an approximation for the proportion, call it p, of men that are 7 feet tall or taller. We know that there are about 1 billion men between the ages of 18 and 40 in the world, the age range for the NBA. Can we use the normal distribution to estimate how many of these 1 billion men are at least seven feet tall?
  p <- 1 - pnorm(7*12, 69, 3)
  
  round(10^9*p)
  ## 287
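6. There are about 10 National Basketball Association (NBA) players that are 7 feet tall or higher. Use the count N from the previous exercise to estimate the proportion of seven footers who are in the NBA.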
  p <- 1 - pnorm(7*12, 69, 3)
  N <- round(10^9*p)
  
  10/N
  ## 0.03484321
7. In the previous exercise we estimated the proportion of seven footers in the NBA using this simple code. Repeat the calculations performed in the previous question for Lebron James' height: 6 feet 8 inches. There are about 150 players, instead of 10, that are at least that tall in the NBA.
  ## Change the solution to previous answer
  p <- 1 - pnorm((6*12)+8, 69, 3)
  N <- round(p * 10^9)
  
  150/N
  ## 0.001220842
8. In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player. What would be a fair critique of our calculations?
  As seen in exercise 3, the normal approximation tends to underestimate the extreme values. It's possible that there are more seven footers than we predicted.
Quantiles, Percentiles, and Boxplots
Definition of quantiles

Quantiles are "cutoff points that divide a dataset into intervals with set probabilities - where the qth quantile is the value at which q% of the observations are equal to or less than that value".

We can get a quantile by simply using quantile(data, desired_quantile). Furthermore, we can get percentiles - which are the quantiles that divide a dataset into 100 intervals each with 1% probability:
p <- seq(0.01, 0.99, 0.01)
quantile(data, p)

Finding quantiles with qnorm

If we assume that a dataset follows a normal distribution, we can compute its theoretical quantiles with qnorm(p, mu, sigma). This returns the value below which a proportion p of a normal distribution with mean mu and standard deviation sigma falls.
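
A short sketch showing that qnorm is the inverse of pnorm. The values of mu and sigma are assumed for illustration:

```r
mu <- 69; sigma <- 3                    # assumed mean and sd
p  <- seq(0.05, 0.95, 0.05)

theoretical_q <- qnorm(p, mu, sigma)    # theoretical quantiles
pnorm(theoretical_q, mu, sigma)         # recovers p: pnorm and qnorm are inverses

qnorm(0.975)                            # about 1.96, the familiar 95% cutoff
```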

Quantile-Quantile Plots

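A quantile-quantile (Q-Q) plot plots the sample quantiles against the theoretical quantiles from qnorm. We can add abline(0,1) to draw the identity line: if the points fall close to it, the data is approximately normally distributed.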
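
A minimal Q-Q plot sketch in base R, using simulated data (an assumption, not the heights data):

```r
set.seed(3)
x <- rnorm(1000, mean = 69, sd = 3)          # simulated data
p <- seq(0.05, 0.95, 0.05)

sample_q      <- quantile(x, p)              # observed quantiles
theoretical_q <- qnorm(p, mean(x), sd(x))    # quantiles a normal distribution would have

plot(theoretical_q, sample_q)
abline(0, 1)                                 # identity line: intercept 0, slope 1
```

Since the data was drawn from a normal distribution, the points should sit close to the line.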

For normal distribution, mean and median are the same.

Boxplots

Boxplots show the five number summary: minimum, Q1, median (Q2), Q3, and maximum. The interquartile range is calculated using Q3 and Q1: IQR = Q3 - Q1
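
Here is a small sketch of the five number summary and the IQR on a made-up sample:

```r
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 30)    # made-up sample with one large value

summary_5 <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))   # min, Q1, median, Q3, max
summary_5

iqr <- unname(summary_5["75%"] - summary_5["25%"])   # IQR = Q3 - Q1
iqr == IQR(x)                                        # TRUE: matches the built-in IQR()

boxplot(x)                                           # 30 is drawn as an outlying point
```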

Practice Problems: Quantiles, percentiles, and boxplots
1. When analyzing data it's often important to know the number of measurements you have for each category.
  library(dslabs)
  data(heights)
  male <- heights$height[heights$sex=="Male"]
  female <- heights$height[heights$sex=="Female"]
  
  length(male)
  ## 812
  length(female)
  ## 238
2. Suppose we can't make a plot and want to compare the distributions side by side. If the number of data points is large, listing all the numbers is impractical. A more practical approach is to look at the percentiles.
  library(dslabs)
  data(heights)
  male <- heights$height[heights$sex=="Male"]
  female <- heights$height[heights$sex=="Female"]
  
  percentiles <- c(0.1, 0.3, 0.5, 0.7, 0.9)
  
  female_percentiles <- quantile(female, percentiles)
  male_percentiles <- quantile(male, percentiles)
  
  df <- data.frame(female = female_percentiles, male = male_percentiles)
  df
3. Study the boxplots summarizing the distributions of population sizes by country. Which continent has the country with the largest population size?
  Asia
4. Which continent has the largest median population?
  Africa
5. To the nearest million, what is the median population size for Africa?
  10 million
6. Approximately what proportion of countries in Europe have populations below 14 million?
  0.75
7. Which continent shown below has the largest interquartile range for log(population)?
  Americas
Exploratory Data Analysis
Practice Problems: Robust Summaries with Outliers
1. For this chapter, we will use height data collected by Francis Galton for his genetics studies. Here we just use height of the children in the dataset. Compute the average and median of these data.
  library(HistData)
  data(Galton)
  x <- Galton$child
  
  mean(x)
  ## 68.08847
  median(x)
  ## 68.2
2. Now for the same data compute the standard deviation and the median absolute deviation (MAD).
  library(HistData)
  data(Galton)
  x <- Galton$child
  
  sd(x)
  ## 2.517941
  
  #median absolute deviation:
  mad(x)
  ## 2.9652
3. In the previous exercises we saw that the mean and median are very similar and so are the standard deviation and MAD. This is expected since the data is approximated by a normal distribution which has this property. Now suppose that Galton made a mistake when entering the first value, forgetting to use the decimal point. The data now has an outlier that the normal approximation does not account for. Let's see how this affects the average.
  library(HistData)
  data(Galton)
  x <- Galton$child
  x_with_error <- x
  x_with_error[1] <- x_with_error[1]*10
  
  mean(x_with_error) - mean(x)
  ## 0.5983836
4. In the previous exercise we saw how a simple mistake in 1 out of over 900 observations can result in the average of our data increasing more than half an inch, which is a large difference in practical terms. Now let's explore the effect this outlier has on the standard deviation.
  x_with_error <- x
  x_with_error[1] <- x_with_error[1]*10
  
  sd(x_with_error) - sd(x)
  ## 15.6746
5. In the previous exercises we saw how one mistake can have a substantial effect on the average and the standard deviation. Now we are going to see how the median and MAD are much more resistant to outliers. For this reason we say that they are robust summaries.
  x_with_error <- x
  x_with_error[1] <- x_with_error[1]*10
  
  median(x_with_error) - median(x)
  ## 0
6. We saw that the median barely changes. Now let's see how the MAD is affected.
  x_with_error <- x
  x_with_error[1] <- x_with_error[1]*10
  
  mad(x_with_error) - mad(x)
  ## 0
7. How could you use exploratory data analysis to detect that an error was made?
  A boxplot, histogram, or qq-plot would reveal a clear outlier.
8. We have seen how the average can be affected by outliers. But how large can this effect get? This of course depends on the size of the outlier and the size of the dataset. To see how outliers can affect the average of a dataset, let's write a simple function that takes the size of the outlier as input and returns the average.
  x <- Galton$child
  error_avg <- function(k){
    x_error <- x
    x_error[1] <- k
    mean(x_error)
  }
  
  error_avg(10000)
  ## 78.79784
  error_avg(-10000)
  ## 57.24612
Basics of ggplot2

ggplot

“Grammar of graphics” (gg) lets you build a wide range of plots from a small set of components. It works with data tables in which rows are observations and columns are variables. There is a really good cheatsheet for graphing. Tidyverse is a collection of packages that includes dplyr and ggplot2, which makes it easier to work with data frames and create graphics.

Graph Components

Break graph into components: data (what are you summarizing), geometry (scatterplot, bar plot, histograms, smooth densities, q-q plots, and box plots), aesthetic mappings, scale component, labels, titles, legends, etc.

Creating a New Plot

A plot starts from a ggplot object: ggplot(data = dataset) OR pipe the data: dataset %>% ggplot().
Customizing Plots
Layers

In order to make a graph you add layers, component by component, using "+". Order of layers matters!! Layers are drawn in the order they are added, so later layers sit on top of earlier ones:
data %>% ggplot() + layer1 + layer2 + ... + layerN
- Geometry: geom_point will let you create the actual graph using the loaded data
- geom_abline(intercept = a, slope = b) draws a line with intercept a and slope b
- The most common one used is aes (i.e. aesthetic mappings)
  aes(x = … , y = …)
- geom_label adds a label to the plot with a little rectangle and geom_text adds text to the plot. E.g. geom_text(aes(x, y, label = column))
- nudge_x moves labels to the right; used outside of aes
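
The layering described above can be sketched as follows. I use the mtcars data that ships with R as a stand-in for the course datasets, and the line's intercept and slope are rough values picked for illustration:

```r
library(ggplot2)

# mtcars ships with R; it stands in for the course datasets here
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +     # data + global aesthetic mapping
  geom_point(size = 3) +                         # layer 1: scatter plot
  geom_text(aes(label = rownames(mtcars)),       # layer 2: text labels
            nudge_x = 0.3, size = 2) +           # nudge_x sits outside aes()
  geom_abline(intercept = 37, slope = -5)        # layer 3: reference line (rough values)

p
```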
Tinkering

Through tinkering around with the settings the following was found:
- When you set something such as size outside of aes, every point changes in the same way, independent of the data given
- Putting aes inside ggplot() makes it a global aes (meaning that you do not need to give x and y each time for geom_… )
- The local mappings in a geom can still override the global mapping
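
A small sketch of these three observations, again using mtcars as a stand-in dataset (my choice, not the course's):

```r
library(ggplot2)

# Global mapping: x, y and colour are inherited by every geom
p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl)))

p_fixed  <- p + geom_point(size = 3)                   # size outside aes(): identical for all points
p_mapped <- p + geom_point(aes(size = hp))             # size inside aes(): driven by the hp column
p_local  <- p + geom_point(aes(color = factor(gear)))  # local mapping overrides the global colour

p_fixed
```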
Scales, Labels, and Colors

It is beneficial to change scales to logarithmic to spread out clusters of data when a few values are very high. We can do this using scale_x_continuous(trans = "log10"), which also accepts other logs, or the scale_x_log10() function within ggplot (because log base 10 is used so often).

In order to label the axes and title we use: xlab(), ylab(), ggtitle().

For colours, we use the color argument in geom_point. To colour by a variable, aes needs to be used to create a mapping: geom_point(aes(col = region)). And to change the legend title: scale_color_discrete(name = "Region").
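
Putting scales, labels, and colours together might look like this. mtcars and the label text are placeholders I chose for illustration:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(disp, hp)) +
  geom_point(aes(color = factor(cyl))) +      # map colour to a variable
  scale_x_log10() +                           # log base 10 on the x-axis
  scale_y_log10() +                           # log base 10 on the y-axis
  xlab("Displacement (log scale)") +
  ylab("Horsepower (log scale)") +
  ggtitle("Engine size and power") +
  scale_color_discrete(name = "Cylinders")    # legend title

p
```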

Add-on Packages

There are add-on packages that extend what ggplot2 can do:
- ggthemes provides extra themes, theme_economist() being one of them. The dslabs package also has its own theme, set with ds_theme_set().
- ggrepel adds labels but ensures they don't fall on top of each other. In order to use it, change geom_text() to geom_text_repel().
- gridExtra lets you create several plots and then pass them to grid.arrange(), which lays out the graphs together.
Other Examples

There are other types of graphs, the most commonly used ones being: geom_histogram(), geom_density(), geom_qq().

Practice Problems
1. Start by loading the dplyr and ggplot2 libraries as well as the murders data. With ggplot2 plots can be saved as objects. For example we can associate a dataset with a plot object like this
  p <- ggplot(data = murders)
  Because data is the first argument we don't need to spell it out. So we can write this instead:
  p <- ggplot(murders)
  or, if we load dplyr, we can use the pipe:
  p <- murders %>% ggplot()
  Remember the pipe sends the object on the left of %>% to be the first argument for the function on the right of %>%.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  
  data(heights)
  data(murders)
  
  p <- ggplot(murders)
  class(p)
  ## "gg" "ggplot"
2. Remember that to print an object you can use the command print or simply type the object. For example, instead of
  x <- 2
  print(x)
  you can simply type
  x <- 2
  x
  Print the object p defined in exercise one
  p <- ggplot(murders)
  and describe what you see.
  A blank slate plot.
3. Now we are going to review the use of pipes by seeing how they can be used with ggplot. Using the pipe %>%, create an object p associated with the heights dataset instead of with the murders dataset as in previous exercises.
  data(heights)
  # define ggplot object called p like in the previous exercise but using a pipe
  p <- heights %>% ggplot()
4. Now we are going to add layers and the corresponding aesthetic mappings. For the murders data, we plotted total murders versus population sizes in the videos. Explore the murders data frame to remind yourself of the names for the two variables (total murders and population size) we want to plot and select the correct answer.
  total and population
5. To create a scatter plot, we add a layer with the function geom_point. The aesthetic mappings require us to define the x-axis and y-axis variables respectively.
  murders %>% ggplot(aes(x = population, y = total)) +
  geom_point()
6. Switch order
  murders %>% ggplot(aes(total, population)) +
  geom_point()
7. If instead of points we want to add text, we can use the geom_text() or geom_label() geometries. However, note that the following code
  murders %>% ggplot(aes(population, total)) +
  geom_label()
  
  will give us the error message: Error: geom_label requires the following missing aesthetics: label. Why is this?
  We need to map a character to each point through the label argument in aes
8. You can also add labels to the points on a plot.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  
  data(murders)
  # Add the label
  murders %>% ggplot(aes(population, total, label = abb)) +
  geom_label()
9. geom_point colors. Now let's change the color of the labels to blue. How can we do this?
  By using the color argument in geom_label because we want all colors to be blue so we do not need to map colors
10. Now let's go ahead and make the labels blue.
  murders %>% ggplot(aes(population, total,label= abb)) +
  geom_label(color="blue")
11. Now suppose we want to use color to represent the different regions. So the states from the West will be one color, states from the Northeast another, and so on. In this case, which of the following is most appropriate:
  Mapping the colors through the color argument of aes because each label needs a different color
12. We previously used this code to make a plot using the state abbreviations as labels:
  murders %>% ggplot(aes(population, total, label = abb)) +
  geom_label()
  
  We are now going to add color to represent the region.
  murders %>% ggplot(aes(population, total, label = abb, color = region)) +
  geom_label()
13. Now we are going to change the axes to log scales to account for the fact that the population distribution is skewed. Let's start by defining an object p that holds the plot we have made up to now:
  p <- murders %>% ggplot(aes(population, total, label = abb, color = region)) +
  geom_label()
  
  To change the x-axis to a log scale we learned about the scale_x_log10() function. We can change the axis by adding this layer to the object p to change the scale and render the plot using the following code p + scale_x_log10()
  p <- murders %>% ggplot(aes(population, total, label = abb, color = region)) + geom_label()
  # add log scale to p
  p + scale_x_log10() +scale_y_log10()
14. In the previous exercises we created a plot using the following code:
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  
  data(murders)
  
  p<- murders %>% ggplot(aes(population, total, label = abb, color = region)) +
  geom_label()
  p + scale_x_log10() + scale_y_log10()
  We are now going to add a title to this plot. We will do this by adding yet another layer, this time with the function ggtitle.
  p <- murders %>% ggplot(aes(population, total, label = abb, color = region)) +
  geom_label()
  # add a layer to add title to the next line
  p + scale_x_log10() +
  scale_y_log10() +
  ggtitle("Gun murder data")
15. We are going to shift our focus from the murders dataset to explore the heights dataset.
  We use the geom_histogram function to make a histogram of the heights in the heights data frame. When reading the documentation for this function we see that it requires just one mapping, the values to be used for the histogram.
  What is the variable containing the heights in inches in the heights data frame?
  height
16. We are now going to make a histogram of the heights so we will load the heights dataset.
  p <- heights %>% ggplot(aes(height)) +
  geom_histogram()
  p
17. Note that when we run the code from the previous exercise we get the following warning: stat_bin() using bins = 30. Pick better value with binwidth.
  p <- heights %>%
  ggplot(aes(height))
  # add the geom_histogram layer but with the better binwidth
  p + geom_histogram(binwidth = 1)
18. Now instead of a histogram we are going to make a smooth density plot. In this case, we will not make an object p. Instead we will render the plot using a single line of code. In the previous exercise, we could have created a histogram using one line of code like this:
  heights %>%
  ggplot(aes(height)) +
  geom_histogram()
  
  Now instead of geom_histogram we will use geom_density to create a smooth density plot.
  heights %>%
  ggplot(aes(height)) +
  geom_density()
19. Now we are going to make density plots for males and females separately. We can do this using the group argument within the aes mapping. Because each point will be assigned to a different density depending on a variable from the dataset, we need to map within aes.
  heights %>%
  ggplot(aes(height, group = sex)) +
  geom_density()
20. In the previous exercise we made the two density plots, one for each sex, using:
  heights %>%
  ggplot(aes(height, group = sex)) +
  geom_density()
  
  We can also assign groups through the color or fill argument. For example, if you type color = sex ggplot knows you want a different color for each sex. So two densities must be drawn. You can therefore skip the group = sex mapping. Using color has the added benefit that it uses color to distinguish the groups.
  heights %>%
  ggplot(aes(height, color = sex)) +
  geom_density()
21. We can also assign groups using the fill argument. When using the geom_density geometry, color creates a colored line for the smooth density plot while fill colors in the area under the curve. We can see what this looks like by running the following code:
  heights %>%
  ggplot(aes(height, fill = sex)) +
  geom_density()
  
  However, here the second density is drawn over the other. We can change this by using something called alpha blending.
  heights %>%
  ggplot(aes(height, fill = sex)) +
  geom_density(alpha=0.2)
Introduction to Gapminder

Gapminder.org is an organization dedicated to educating the public by using data to dispel common myths about the so-called developing world. You can access its data from the dslabs library using data(gapminder). It was started because many people's view of the world, split into wealthy and developing countries, was cemented about 50 years ago (1962) and has not been updated since. Gapminder looks to inform people about how the world looks now, using data.
Using the Gapminder Dataset
Faceting

Faceting lets you make side by side comparisons across subsets of the data. It keeps the scale the same through all plots so that comparisons are easy to make. Faceting is added as a layer and facet_grid() stratifies by up to 2 variables. Example: facet_grid(continent ~ year) puts continents in rows and years in columns.

We can also divide by only 1 variable: facet_grid(. ~ year).

If we want to compare many years we can, but we should use facet_wrap(), which automatically wraps the panels onto multiple rows and columns: facet_wrap(~year).
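
The three faceting variants can be sketched like this, with mtcars standing in for gapminder (an assumption for the sake of a self-contained example):

```r
library(ggplot2)

base <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

p_grid2 <- base + facet_grid(gear ~ cyl)   # rows by gear, columns by cyl
p_grid1 <- base + facet_grid(. ~ cyl)      # one row, columns by cyl
p_wrap  <- base + facet_wrap(~ carb)       # many panels, wrapped onto several rows

p_wrap
```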

Time Series Plots

When you want to look at data across time, your time should be on the x-axis and what you want to look at on the y-axis. You can use group = category or color = category within the ggplot aes to break the data up into categories over time. Labels placed directly on the lines can replace the legend, where labels is a data frame holding the label text and positions:
geom_text(data = labels, aes(x, y, label = country), size = 5) +
theme(legend.position = "none")
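
A complete time series sketch with direct labels. The data frame ts_dat and the label positions are made up for illustration and are not gapminder values:

```r
library(ggplot2)

# Made-up life expectancy values, for illustration only
ts_dat <- data.frame(
  year            = rep(c(1960, 1980, 2000), times = 2),
  country         = rep(c("A", "B"), each = 3),
  life_expectancy = c(50, 58, 66, 68, 72, 76)
)
# Where to put each label (positions chosen by eye)
labels <- data.frame(country = c("A", "B"), x = c(1985, 1975), y = c(56, 74))

p <- ggplot(ts_dat, aes(year, life_expectancy, color = country)) +
  geom_line() +                                                      # one line per country
  geom_text(data = labels, aes(x, y, label = country), size = 5) +   # labels instead of a legend
  theme(legend.position = "none")

p
```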

Transformations

The way data is given may be useful when looking at the numbers, but you might need a transformation in order to make the graph clear to read. For instance, GDP is more useful as dollars per day, mutate(dollars_per_day = gdp/population/365), shown on a log scale: either take log2(dollars_per_day) or use scale_x_continuous(trans = "log2").
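
The arithmetic of the transformation, sketched in base R on made-up figures (not gapminder data):

```r
# Made-up GDP and population figures, for illustration only
dat <- data.frame(
  gdp        = c(3.65e9, 7.3e10, 1.46e12),
  population = c(1e7, 5e7, 2.5e8)
)

dat$dollars_per_day <- dat$gdp / dat$population / 365
dat$dollars_per_day          # 1, 4, 16
log2(dat$dollars_per_day)    # 0, 2, 4: each step of 1 is a doubling
```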

Stratify and Boxplot

When you are creating a graph - particularly boxplots - it is useful to know how to rearrange the boxplots so that there is some sort of ordering that makes sense for the data you are trying to convey. We can do this by using reorder(), which orders the levels of a factor based on a numeric vector.
mutate(region = reorder(region, dollars_per_day, FUN = median)) %>%
ggplot(aes(region, dollars_per_day)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
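
What reorder() does to the factor levels can be seen in a small base R sketch on made-up data:

```r
# Made-up incomes for three regions, for illustration only
dat <- data.frame(
  region          = factor(rep(c("North", "South", "West"), each = 3)),
  dollars_per_day = c(8, 9, 40,  1, 2, 3,  4, 5, 6)
)

levels(dat$region)     # alphabetical by default: "North" "South" "West"
dat$region <- reorder(dat$region, dat$dollars_per_day, FUN = median)
levels(dat$region)     # ordered by median: "South" "West" "North"

boxplot(dollars_per_day ~ region, data = dat)   # boxes now run from low to high median
```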

Comparing Distributions

Within the data you can separate data into new groupings using mutate: mutate(group = ifelse(region %in% west, "West", "Developing")). You can also find the intersection between two lists using intersect() (tidyverse has a simpler version to do this). And a third way you can compare is by adding a factor for year to your aes: fill = factor(year).

Density Plots

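To have the areas of the densities be proportional to the size of the groups, we can simply multiply the y-axis values by the size of the group within geom_density. Variables computed by the geom, such as count, are accessed by surrounding the name with ".." (recent ggplot2 versions also accept after_stat(count)):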
geom_density(aes(x=dollars_per_day, y = ..count..))

We can also use case_when() to group certain aspects together:
mutate(group = case_when(
.$region %in% west ~ "West",
.$region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
.$region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
.$continent == "Africa" & .$region != "Northern Africa" ~ "Sub-Saharan Africa",
TRUE ~ "Others"
))
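
A self-contained sketch of the same grouping on a tiny made-up data frame. The contents of west are assumed here, and columns are referenced by name inside mutate:

```r
library(dplyr)

west <- c("Western Europe", "Northern America")   # assumed definition of "west"
dat <- data.frame(
  region    = c("Western Europe", "Eastern Asia", "South America",
                "Eastern Africa", "Northern Africa"),
  continent = c("Europe", "Asia", "Americas", "Africa", "Africa")
)

# Conditions are checked top to bottom; the first match wins
dat <- dat %>%
  mutate(group = case_when(
    region %in% west ~ "West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"
  ))

dat$group
```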

A couple more useful arguments to know:
- levels = c() lets us dictate the ordering in which to plot data
- position = "stack" shows the different factors on top of one another to see differences
- the weight argument lets you weight each data point, for example by population, so that the number of observations behind it is taken into account
Ecological Fallacy

Be careful about making assumptions at an individual level when it could be a trend that only occurs at the population level. We can sometimes check this using logit transformations.

Practice Problems: Exploring the Gapminder Dataset
1. Using ggplot and the points layer, create a scatter plot of life expectancy versus fertility for the African continent in 2012.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data(gapminder)
  
  gapminder %>% filter(continent == "Africa" & year == "2012") %>%
  ggplot(aes(fertility, life_expectancy)) +
  geom_point()
2. Note that there is quite a bit of variability in life expectancy and fertility with some African countries having very high life expectancies. There also appear to be three clusters in the plot. Remake the plot from the previous exercises but this time use color to distinguish the different regions of Africa to see if this explains the clusters. Remember that you can explore the gapminder data to see how the regions of Africa are labeled in the data frame!
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data(gapminder)
  
  gapminder %>% filter(continent == "Africa" & year == "2012") %>%
  ggplot(aes(y= life_expectancy, x= fertility, color = region)) +
  geom_point()
3. While many of the countries in the high life expectancy/low fertility cluster are from Northern Africa, three countries are not. Create a table showing the country and region for the African countries (use select) that in 2012 had fertility rates of 3 or less and life expectancies of at least 70.
  library(dplyr)
  library(dslabs)
  data(gapminder)
  
  df <- gapminder %>% filter(continent=="Africa", year == "2012", fertility <= 3, life_expectancy >=70) %>% select(country, region) %>% data.frame()
4. The Vietnam War lasted from 1955 to 1975. Do the data support war having a negative effect on life expectancy? We will create a time series plot that covers the period from 1960 to 2010 of life expectancy for Vietnam and the United States, using color to distinguish the two countries. In this exercise we start the analysis by generating a table.
  library(dplyr)
  library(dslabs)
  data(gapminder)
  
  tab <- gapminder %>% filter(year %in% 1960:2010, country %in% c("Vietnam", "United States"))
5. Now that you have created the data table in Exercise 4, it is time to plot the data for the two countries.
  p <- tab %>% ggplot(aes(x=year,y=life_expectancy,color=country)) +
  geom_line()
  p
6. Cambodia was also involved in this conflict and, after the war, Pol Pot and his communist Khmer Rouge took control and ruled Cambodia from 1975 to 1979. He is considered one of the most brutal dictators in history. Does the data support this claim?
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data(gapminder)
  
  gapminder %>% filter(year %in% 1960:2010, country == "Cambodia") %>%
  ggplot(aes(year, life_expectancy)) +
  geom_line()
  
  Yes, it does.
7. Now we are going to calculate and plot dollars per day for African countries in 2010 using GDP data. In the first part of this analysis, we will create the dollars per day variable.
  library(dplyr)
  library(dslabs)
  data(gapminder)
  
  daydollars <- gapminder %>%
  mutate(dollars_per_day = gdp/population/365) %>%
  filter(continent == "Africa", year == "2010", !is.na(dollars_per_day))
  daydollars
8. Now we are going to calculate and plot dollars per day for African countries in 2010 using GDP data. In the second part of this analysis, we will plot the smooth density plot using a log (base 2) x axis.
  daydollars %>% ggplot(aes(dollars_per_day)) +
  geom_density() +
  scale_x_continuous(trans = "log2")
9. Now we are going to combine the plotting tools we have used in the past two exercises to create density plots for multiple years.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data(gapminder)
  
  gapminder %>% filter(continent == "Africa", year %in% c(1970,2010)) %>%
  mutate(dollars_per_day = gdp/population/365) %>%
  filter(!is.na(dollars_per_day)) %>%
  ggplot(aes(dollars_per_day)) +
  geom_density() +
  scale_x_continuous(trans = "log2") +
  facet_grid(year~.)
10. Now we are going to edit the code from Exercise 9 to show a stacked density plot of each region in Africa.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data(gapminder)
  
  dat <- gapminder %>% filter(continent == "Africa", year %in% c(1970,2010)) %>% mutate(dollars_per_day = gdp/population/365) %>%
  filter(!is.na(dollars_per_day)) %>%
  ggplot(aes(dollars_per_day, color = region, fill = region)) +
  geom_density(bw=0.5, position = "stack" ) +
  scale_x_continuous(trans = "log2") +
  facet_grid(year~.)
  dat
11. We are going to continue looking at patterns in the gapminder dataset by plotting infant mortality rates versus dollars per day for African countries.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data(gapminder)
  
  gapminder_Africa_2010 <- gapminder %>% filter(continent == "Africa", year == "2010") %>% mutate(dollars_per_day = gdp/population/365) %>% filter(!is.na(dollars_per_day))
  
  # scatter plot
  gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region)) +
  geom_point()
12. Now we are going to transform the x axis of the plot from the previous exercise.
  gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region)) +
  geom_point() +
  scale_x_continuous(trans = "log2")
13. Note that there is a large variation in infant mortality and dollars per day among African countries. As an example, one country has infant mortality rates of less than 20 per 1000 and dollars per day of 16, while another country has infant mortality rates over 10% and dollars per day of about 1. In this exercise, we will remake the plot from Exercise 12 with country names instead of points so we can identify which countries are which.
  gapminder_Africa_2010 %>% ggplot(aes(dollars_per_day, infant_mortality, color = region,label = country)) +
  geom_point() +
  scale_x_continuous(trans = "log2") +
  geom_text()
14. Now we are going to look at changes in the infant mortality and dollars per day patterns in African countries between 1970 and 2010.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data(gapminder)
  
  africa_year_comp <- gapminder %>%
  filter(continent == "Africa", year %in% c(1970,2010)) %>%
  mutate(dollars_per_day = gdp/population/365) %>%
  filter(!is.na(dollars_per_day)& !is.na(infant_mortality))
  
  africa_year_comp %>% ggplot(aes(dollars_per_day, infant_mortality, color = region,label = country)) +
  geom_point() +
  scale_x_continuous(trans = "log2") +
  geom_text() + facet_grid(year~.)
Data Visualization Principles
Encoding Data Using Visual Cues

Some useful suggestions:
- Visual cues include position, aligned lengths, angles, area, brightness, and colour hue
- Area and angles are a lot harder to read than length and position
- Know When to Include Zero: it can be dishonest not to start at 0 when bar lengths are being compared
- Do Not Distort Quantities: when comparing quantities with circles, make the area proportional to the quantity rather than the radius, which exaggerates differences (ggplot2 automatically uses area)
- Order by a Meaningful Value: reorder()
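
The distortion from using radius instead of area can be worked out with a little arithmetic. The two values here are made up:

```r
values <- c(1, 4)                    # second quantity is 4 times the first

# Radius proportional to value: the areas differ by a factor of 16
r_radius    <- values
area_radius <- pi * r_radius^2
area_radius[2] / area_radius[1]      # 16

# Area proportional to value: the radius grows with the square root
r_area    <- sqrt(values / pi)
area_area <- pi * r_area^2
area_area[2] / area_area[1]          # 4
```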
Practice Problems: Data Visualization Principles 1
1. Pie charts are appropriate:
  Never
2. What is the problem with this plot?
  The axis does not start at 0. Judging by the length, it appears Trump received 3 times as many votes when in fact it was about 30% more.
3. Take a look at the following two plots. They show the same information: rates of measles by state in the United States for 1928.
  The plot on the right is better because it orders the states by disease rate so we can quickly see the states with highest and lowest rates.
Show the Data

Two features that can be added that are helpful to make your data more clear, especially if points are overlapped:
- jitter adds a small random shift to each point so that overlapping points separate (geom_jitter() in ggplot2)
- Alpha blending makes points semi-transparent, so the colour gets stronger where more points overlap
We can also put plots on common axes to ease comparison. Align plots vertically to see horizontal changes and vice versa.

Default colours in ggplot are usually hard for colour blind people to tell apart, so you can select your own colours:
colour_blind_friendly_cols <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
Add it with + scale_color_manual(values = colour_blind_friendly_cols).
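
A full example of applying the palette, with mtcars as a stand-in dataset (my choice for a self-contained sketch):

```r
library(ggplot2)

colour_blind_friendly_cols <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
                                "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Only the first three colours are used, one per cylinder group
p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_manual(values = colour_blind_friendly_cols, name = "Cylinders")

p
```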

Practice Problems: Data Visualization Principles 2
1. To make the plot on the right in the exercise from the last set of assessments, we had to reorder the levels of the states' variables. Redefine the state object so that the levels are re-ordered by rate. Print the new object state and its levels (using levels) so you can see that the vector is now re-ordered by the levels.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  dat <- us_contagious_diseases %>%
  filter(year == 1967 & disease=="Measles" & !is.na(population)) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting)
  # rate must exist before it is used to reorder the states
  rate <- dat$count/(dat$population/10000)*(52/dat$weeks_reporting)
  state <- dat$state
  state <- reorder(state, rate)
  
  print(state)
  levels(state)
2. Now we are going to customize this plot a little more by creating a rate variable and reordering by that variable instead.
  Add a single line of code to the definition of the dat table that uses mutate to reorder the states by the rate variable.
  The sample code provided will then create a bar plot using the newly defined dat.
  
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data(us_contagious_diseases)
  
  dat <- us_contagious_diseases %>%
  filter(year == 1967 & disease=="Measles" & count>0 & !is.na(population)) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
  mutate(state = reorder(state, rate))
  dat %>% ggplot(aes(state, rate)) +
  geom_bar(stat="identity") +
  coord_flip()
3. Say we are interested in comparing gun homicide rates across regions of the US. We see this plot:
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data("murders")
  
  murders %>% mutate(rate = total/population*100000) %>%
  group_by(region) %>%
  summarize(avg = mean(rate)) %>%
  mutate(region = factor(region)) %>%
  ggplot(aes(region, avg)) +
  geom_bar(stat="identity") +
  ylab("Murder Rate Average")
  
  and decide to move to a state in the western region. What is the main problem with this interpretation?
  It does not show all the data. We do not see the variability within a region and it's possible that the safest states are not in the West.
4. To further investigate whether moving to the western region is a wise decision, let's make a box plot of murder rates by region, showing all points. Order the regions by their median murder rate by using mutate and reorder. Make a box plot of the murder rates by region. Show all of the points on the box plot.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  data("murders")
  
  m <- murders %>%
  mutate(rate = total/population*100000) %>%
  mutate(region = reorder(region, rate, FUN = median))
  m %>% ggplot(aes(region, rate)) +
  geom_boxplot() + geom_point()
Slope Charts

We can make slope charts using geom_line:
library(dplyr)
library(ggplot2)
library(dslabs)
data(gapminder)

west <- c("Western Europe", "Northern Europe", "Southern Europe", "Northern America", "Australia and New Zealand")
# keep the two years we compare, large western countries only
dat <- gapminder %>%
  filter(year %in% c(2010, 2015) & region %in% west & !is.na(life_expectancy) & population > 10^7)
dat %>%
  # location and hjust position the country labels on either side of the lines
  mutate(location = ifelse(year == 2010, 1, 2),
         location = ifelse(year == 2015 & country %in% c("United Kingdom", "Portugal"), location + 0.22, location),
         hjust = ifelse(year == 2010, 1, 0)) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(year, life_expectancy, group = country)) +
  geom_line(aes(color = country), show.legend = FALSE) +
  geom_text(aes(x = location, label = country, hjust = hjust), show.legend = FALSE) +
  xlab("") +
  ylab("Life Expectancy")

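Bland-Altman plot (also known as the Tukey mean-difference plot or MA plot): plots the difference between two measurements against their average, which makes it easy to see whether the disagreement depends on the size of the values.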
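
A minimal sketch of such a plot in base R. The two sets of measurements are simulated here purely for illustration (the names truth, method_a, and method_b are hypothetical and not from the course):

```r
set.seed(1)
# Hypothetical paired measurements of the same 50 items by two methods
truth    <- rnorm(50, mean = 100, sd = 10)
method_a <- truth + rnorm(50, sd = 2)
method_b <- truth + rnorm(50, sd = 2)

avg        <- (method_a + method_b) / 2   # x-axis: average of the two measurements
difference <- method_a - method_b         # y-axis: difference between them

plot(avg, difference, xlab = "Average", ylab = "Difference")
abline(h = mean(difference), lty = 2)     # dashed line at the mean difference
```

If the two methods agree, the points should scatter around zero with no trend across the x-axis.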

Encoding a Third Variable

Encoding a third variable with shape, colour, or hue is very useful for telling many data points apart. There are many palettes to pick from; the RColorBrewer package (library(RColorBrewer)) provides two popular kinds:
- Sequential palettes: used for data that go from low to high values
- Diverging palettes: used for values that diverge from a centre, putting emphasis on both ends

Case Study: Vaccines

You can look at data collected in the US using data(us_contagious_diseases) from the dslabs package.

Practice Problems: Data Visualization Principles 3
1. The sample code given creates a tile plot showing the rate of measles cases per population. We are going to modify the tile plot to look at smallpox cases instead. Modify the tile plot to show the rate of smallpox cases instead of measles cases. Exclude years in which cases were reported in fewer than 10 weeks from the plot.
  library(dplyr)
  library(ggplot2)
  library(RColorBrewer)
  library(dslabs)
  data(us_contagious_diseases)
  
  the_disease = "Smallpox"
  dat <- us_contagious_diseases %>%
  filter(!state%in%c("Hawaii","Alaska") & disease == the_disease & weeks_reporting>=10) %>%
  mutate(rate = count / population * 10000) %>%
  mutate(state = reorder(state, rate))
  
  dat %>% ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  ggtitle(the_disease) +
  ylab("") +
  xlab("")
2. The sample code given creates a time series plot showing the rate of measles cases per population by state. We are going to again modify this plot to look at smallpox cases instead. Modify the sample code for the time series plot to plot data for smallpox instead of for measles. Once again, restrict the plot to years in which cases were reported in at least 10 weeks.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  library(RColorBrewer)
  data(us_contagious_diseases)
  head(us_contagious_diseases)
  
  the_disease = "Smallpox"
  dat <- us_contagious_diseases %>%
  filter(!state%in%c("Hawaii","Alaska") & disease == the_disease & weeks_reporting>=10) %>%
  mutate(rate = count / population * 10000) %>%
  mutate(state = reorder(state, rate))
  
  avg <- us_contagious_diseases %>%
  filter(disease==the_disease) %>% group_by(year) %>%
  summarize(us_rate = sum(count, na.rm=TRUE)/sum(population, na.rm=TRUE)*10000)
  dat %>% ggplot() +
  geom_line(aes(year, rate, group = state), color = "grey50", show.legend = FALSE, alpha = 0.2, size = 1) +
  geom_line(mapping = aes(year, us_rate), data = avg, size = 1, color = "black") +
  scale_y_continuous(trans = "sqrt", breaks = c(5,25,125,300)) +
  ggtitle("Cases per 10,000 by state") +
  xlab("") +
  ylab("") +
  geom_text(data = data.frame(x=1955, y=50), mapping = aes(x, y, label="US average"), color="black") +
  geom_vline(xintercept=1963, col = "blue")
3. Now we are going to look at the rates of all diseases in one state. Again, you will be modifying the sample code to produce the desired plot. For the state of California, make a time series plot showing rates for all diseases. Include only years with 10 or more weeks reporting. Use a different color for each disease. Include your aes function inside of ggplot rather than inside your geom layer.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  library(RColorBrewer)
  data(us_contagious_diseases)
  
  us_contagious_diseases %>% filter(state=="California" & weeks_reporting>=10) %>%
  group_by(year, disease) %>%
  summarize(rate = sum(count)/sum(population)*10000) %>%
  ggplot(aes(year, rate, color = disease)) +
  geom_line()
4. Now we are going to make a time series plot for the rates of all diseases in the United States. For this exercise, we have provided less sample code - you can take a look at the previous exercise to get you started. Compute the US rate by using summarize to sum over states. Call the variable rate. The US rate for each disease will be the total number of cases divided by the total population. Remember to convert to cases per 10,000. You will need to filter for !is.na(population) to get all the data. Plot each disease in a different color.
  library(dplyr)
  library(ggplot2)
  library(dslabs)
  library(RColorBrewer)
  data(us_contagious_diseases)
  
  us_contagious_diseases %>% filter(!is.na(population)) %>%
  group_by(year, disease) %>%
  summarize(rate = sum(count)/sum(population)*10000) %>%
  ggplot(aes(year, rate, color = disease)) +
  geom_line()
Assessment: Titanic Survival

Practice Problems

Find R code on GitHub.

Prob

Probability

The course link can be found here. Within this course we look at probability theory.

Intro to Discrete
Discrete Probability

In probability, we talk about events. An event is something that occurs by chance. The probability of an event is the proportion of times the event occurs when we repeat the experiment over and over, independently and under the same conditions. Discrete probability applies when the possible outcomes can be listed, such as the faces of a die or the cards in a deck.

Monte Carlo Simulations

A Monte Carlo simulation repeats a random experiment many times and uses the proportion of repetitions in which an event occurs as an estimate of that event's probability.
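
As a rough sketch of the idea, the code below estimates the probability of drawing a blue bead from an urn. The urn contents (2 red, 3 blue) and the names beads, events, and p_blue are made up for this illustration:

```r
# Hypothetical urn: 2 red beads and 3 blue beads (values chosen for illustration)
beads <- rep(c("red", "blue"), times = c(2, 3))

set.seed(1)   # make the random draws reproducible
B <- 10000    # number of times the experiment is repeated

# Draw one bead at random, B times
events <- replicate(B, sample(beads, 1))

# Proportion of draws that came up blue; should be close to 3/5 = 0.6
p_blue <- mean(events == "blue")
p_blue
```

The larger B is, the closer the estimate tends to be to the exact value.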

Probability Distribution

The probability of outcomes is drawn as a distribution (can be discrete or continuous random variable). This shows the proportion of each outcome occurring.

Independence

Two events are independent if the outcome of one does not affect the probability of the other.

For independent events, conditioning on one event does not change the probability of the other: P(A | B) = P(A), and therefore P(A and B) = P(A) × P(B).
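
A small worked example may help, assuming a standard 52-card deck with 4 Kings (an illustration, not part of the course material). Drawing with replacement leaves the second draw independent of the first, while drawing without replacement does not:

```r
# Standard 52-card deck: 4 Kings
p_king <- 4 / 52

# With replacement, the first card goes back, so the deck is unchanged:
# P(second is King | first is King) = P(King)
p_second_given_first_replace <- 4 / 52

# Without replacement, one King is already gone and 51 cards remain
p_second_given_first_no_replace <- 3 / 51

p_second_given_first_replace == p_king      # TRUE: the draws are independent
p_second_given_first_no_replace == p_king   # FALSE: the draws are not independent
```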

Practice Problems
1. One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls. What is the probability that the ball will be cyan?
  Mathematically:
  
  3/(3+5+7) = 0.2
  
  R:
  
  cyan <- 3
  magenta <- 5
  yellow <- 7
  
  # Assign a variable `p` as the probability of choosing a cyan ball from the box
  p <- cyan/sum(cyan, magenta, yellow)
  
  # Print the variable `p` to the console
  p
  ## 0.2
2. One ball will be drawn at random from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls. What is the probability that the ball will not be cyan?
  Mathematically:
  
  1 - P(cyan) = 1 - 0.2 = 0.8
  
  R:
  
  # `p` is defined as the probability of choosing a cyan ball from a box containing: 3 cyan balls, 5 magenta balls, and 7 yellow balls.
  # Using variable `p`, calculate the probability of choosing any ball that is not cyan from the box
  1- p
  ##0.8
3. Instead of taking just one draw, consider taking two draws. You take the second draw without returning the first draw to the box. We call this sampling without replacement. What is the probability that the first draw is cyan and that the second draw is not cyan? Provide at least 3 significant digits.
  Mathematically:
  
  (cyan/total)(not cyan/(total - 1)) = (3/15)(12/14) = 6/35 = 0.171428571
  
  R:
  
  cyan <- 3
  magenta <- 5
  yellow <- 7
  
  # The variable `p_1` is the probability of choosing a cyan ball from the box on the first draw.
  p_1 <- cyan / (cyan + magenta + yellow)
  
  # Assign a variable `p_2` as the probability of not choosing a cyan ball on the second draw without replacement.
  p_2 <- sum(magenta, yellow)/ (cyan + magenta + yellow - 1)
  
  # Calculate the probability that the first draw is cyan and the second draw is not cyan using `p_1` and `p_2`.
  p_1*p_2
  ## 0.1714286
4. Now repeat the experiment, but this time, after taking the first draw and recording the color, return it back to the box and shake the box. We call this sampling with replacement. What is the probability that the first draw is cyan and that the second draw is not cyan?
  Mathematically:
  
  (cyan/total)(not cyan/total) = (3/15)(12/15) = 4/25 = 0.16
  
  R:
  
  cyan <- 3
  magenta <- 5
  yellow <- 7
  
  # The variable 'p_1' is the probability of choosing a cyan ball from the box on the first draw.
  p_1 <- cyan / (cyan + magenta + yellow)
  
  # Assign a variable 'p_2' as the probability of not choosing a cyan ball on the second draw with replacement.
  p_2 <- sum(magenta, yellow) / (cyan + magenta + yellow)
  
  # Calculate the probability that the first draw is cyan and the second draw is not cyan using `p_1` and `p_2`.
  p_1*p_2
  ## 0.16

Combination and Permutation

Setting up card deck

In order to set up a card deck you can use two different functions: paste() and expand.grid().

paste() joins strings together, separated by a space by default. In the case of the deck, we want to pair each number (Ace, 2-10, Jack, Queen, King) with each suit (hearts, diamonds, spades, clubs). An example of how we could use paste:

paste(letters[1:5], as.character(1:5))

This would return "a 1" "b 2" "c 3" "d 4" "e 5"

expand.grid() gives all combinations of the items in two (or more) vectors. This allows us to match every suit with every number. An example of this would be:

expand.grid(pants = c("blue", "black"), shirt = c("white", "grey", "plaid"))

This would return

pants	shirt
blue	white
black	white
blue	grey
black	grey
blue	plaid
black	plaid

Now to apply it to our cards:

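suits <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Ten", "Jack", "Queen", "King")
# every number paired with every suit, then pasted into one string per card
deck <- expand.grid(number = numbers, suit = suits)
deck <- paste(deck$number, deck$suit)

Now we can check our probabilities:

kings <- paste("King", suits)
# proportion of the deck that is a King (4/52)
mean(deck %in% kings)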

gtools package

This package is very useful because it has functions for combinations and permutations. Combinations are selections of "r" items from "n" where order does not matter (i.e. 1 then 3 is the same as 3 then 1). Permutations are selections where order matters (i.e. 1 then 3 is different from 3 then 1). We can create all the different ordered 2-card hands from a deck of cards using permutations:

library(gtools)
hands <- permutations(52, 2, v = deck)

Then we can select the first and second card of each hand by column:

first_card <- hands[,1]
second_card <- hands[,2]

Checking for duplicates

duplicated() checks, for each element of a vector, whether it has already appeared earlier in that vector, and returns TRUE or FALSE for each. To check whether a vector contains any duplicates at all, use any(duplicated()), which returns TRUE if at least one element is a duplicate.

Note: for loops are often avoided in R in favour of operations on entire vectors. sapply() applies a function to every element of a vector.
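
One way to see duplicated(), any(), and sapply() working together is the classic birthday problem. The sketch below is an illustration under simple assumptions (365 equally likely birthdays, no leap years); the function name same_birthday and the group sizes are chosen for this example:

```r
set.seed(1)
B <- 10000

# Estimate the probability that at least two people in a group of n share a birthday
same_birthday <- function(n) {
  shared <- replicate(B, {
    bdays <- sample(1:365, n, replace = TRUE)  # one random birthday per person
    any(duplicated(bdays))                     # TRUE if any birthday repeats
  })
  mean(shared)
}

# sapply() applies the function to each group size without an explicit for loop
n <- c(10, 23, 50)
prob <- sapply(n, same_birthday)
prob
```

With 23 people the estimate should land a little above one half.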

How many Monte Carlo Experiments are Enough?

When we run simulations, we need enough Monte Carlo repetitions for the estimate to be reliable. To find a "good enough" number, check the stability of the estimate: run the simulation with a range of values of B and plot the estimates to see where they stop changing noticeably.
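
A possible way to run this stability check is sketched below. The experiment (seeing at least one 6 in six rolls of a die) and the helper name estimate are assumptions made for the illustration:

```r
set.seed(1)

# Estimate P(at least one 6 in six rolls of a die) using B simulations
estimate <- function(B) {
  hits <- replicate(B, any(sample(1:6, 6, replace = TRUE) == 6))
  mean(hits)
}

# Try a range of simulation sizes, evenly spaced on the log scale from 10 to 10^4
B <- round(10^seq(1, 4, length.out = 30))
prob <- sapply(B, estimate)

# The estimates wander for small B and settle near the exact value as B grows
exact <- 1 - (5/6)^6
plot(log10(B), prob, type = "l")
abline(h = exact, lty = 2)
```

Where the line flattens out around the dashed reference line, adding more simulations changes the estimate very little.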

Practice Problems

1. Imagine you draw two balls from a box containing colored balls. You either replace the first ball before you draw the second or you leave the first ball out of the box when you draw the second ball. Under which situation are the two draws independent of one another?
  Replace
2. Say you’ve drawn 5 balls from a box that has 3 cyan balls, 5 magenta balls, and 7 yellow balls, with replacement, and all have been yellow. What is the probability that the next one is yellow?
  cyan <- 3
  magenta <- 5
  yellow <- 7
  
  # Assign the variable 'p_yellow' as the probability that a yellow ball is drawn from the box.
  p_yellow <- yellow/sum(cyan, magenta, yellow)
  
  # Using the variable 'p_yellow', calculate the probability of drawing a yellow ball on the sixth draw. Print this value to the console.
  p_yellow
  ## 0.4666667
3. If you roll a 6-sided die once, what is the probability of not seeing a 6? If you roll a 6-sided die six times, what is the probability of not seeing a 6 on any of those rolls?
  # Assign the variable 'p_no6' as the probability of not seeing a 6 on a single roll.
  p_no6 <- 5/6
  
  # Calculate the probability of not seeing a 6 on six rolls using `p_no6`. Print your result to the console: do not assign it to a variable.
  p_no6^6
  ## 0.334898
4. Two teams, say the Celtics and the Cavs, are playing a seven game series. The Cavs are a better team and have a 60% chance of winning each game. What is the probability that the Celtics win at least one game? Remember that the Celtics must win one of the first four games, or the series will be over!
  # Assign the variable `p_cavs_win4` as the probability that the Cavs will win the first four games of the series.
  p_cavs_win4 <- (.6)^4
  
  # Using the variable `p_cavs_win4`, calculate the probability that the Celtics win at least one game in the first four games of the series.
  1 - p_cavs_win4
  ## 0.8704
5. Create a Monte Carlo simulation to confirm your answer to the previous problem by estimating how frequently the Celtics win at least 1 of 4 games. Use B <- 10000 simulations. The provided sample code simulates a single series of four random games, simulated_games.
  # This line of example code simulates four independent random games where the Celtics either lose or win. Copy this example code to use within the `replicate` function.
  simulated_games <- sample(c("lose","win"), 4, replace = TRUE, prob = c(0.6, 0.4))
  # The variable 'B' specifies the number of times we want the simulation to run. Let's run the Monte Carlo simulation 10,000 times.
  B <- 10000
  
  # Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
  set.seed(1)
  # Create an object called `celtic_wins` that replicates two steps for B iterations: (1) generating a random four-game series `simulated_games` using the example code, then (2) determining whether the simulated series contains at least one win for the Celtics.
  celtic_wins <- replicate(B, {
  simulated_games <- sample(c("lose","win"), 4, replace = TRUE, prob = c(0.6, 0.4))
  any(simulated_games=="win")
  })
  
  # Calculate the frequency out of B iterations that the Celtics won at least one game. Print your answer to the console.
  mean(celtic_wins)
  ## 0.8757

Addition Rule and Monty Hall
Addition rule (A or B) : P(A or B) = P(A) + P(B) - P(A and B)
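
As a quick check of the addition rule, the sketch below counts cards in a standard 52-card deck that are a King or a Heart. The events and the names is_king and is_heart are chosen for this illustration:

```r
suits   <- c("Diamonds", "Clubs", "Hearts", "Spades")
numbers <- c("Ace", 2:10, "Jack", "Queen", "King")
deck    <- expand.grid(number = numbers, suit = suits)

is_king  <- deck$number == "King"    # event A
is_heart <- deck$suit == "Hearts"    # event B

# Direct count of cards that are a King or a Heart
p_direct <- mean(is_king | is_heart)

# Addition rule: 4/52 + 13/52 - 1/52 = 16/52
p_rule <- mean(is_king) + mean(is_heart) - mean(is_king & is_heart)

c(p_direct, p_rule)
```

Subtracting P(A and B) stops the King of Hearts from being counted twice.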

The Monty Hall problem is a counter-intuitive probability problem in which you are given a choice of 3 doors, one of which hides a prize. After you pick a door, the host opens one of the other two doors to reveal no prize. The question is: do you switch doors from your initial pick?

When you work through the probabilities, you should switch: your initial pick wins 1/3 of the time, so switching wins the remaining 2/3 of the time. The host's choice of door gives you additional information.
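
The result can be checked with a Monte Carlo simulation. The following is a sketch rather than the course's own code; the function name monty_hall and its switch_door argument are hypothetical:

```r
set.seed(1)
B <- 10000

# Simulate one game; returns TRUE if the player wins the prize
monty_hall <- function(switch_door) {
  doors <- 1:3
  prize <- sample(doors, 1)   # door hiding the prize
  pick  <- sample(doors, 1)   # player's first pick
  # The host opens a door that is neither the pick nor the prize
  candidates <- setdiff(doors, c(pick, prize))
  # Guard: sample() on a single number n would draw from 1:n instead
  opened <- if (length(candidates) == 1) candidates else sample(candidates, 1)
  # If switching, take the one remaining closed door
  final <- if (switch_door) setdiff(doors, c(pick, opened)) else pick
  final == prize
}

p_stick  <- mean(replicate(B, monty_hall(FALSE)))
p_switch <- mean(replicate(B, monty_hall(TRUE)))
c(stick = p_stick, switch = p_switch)
```

Sticking should win roughly one third of the time and switching roughly two thirds.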

Practice Problems
1. Two teams, say the Cavs and the Warriors, are playing a seven game championship series. The first to win four games wins the series. The teams are equally good, so they each have a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
  # Assign a variable 'n' as the number of remaining games.
  n <- 6
  
  # Assign a variable `outcomes` as a vector of possible game outcomes, where 0 indicates a loss and 1 indicates a win for the Cavs.
  outcomes <- c(0,1)
  
  # Assign a variable `l` to a list of all possible outcomes in all remaining games. Use the `rep` function on `list(outcomes)` to create list of length `n`.
  l <- rep(list(outcomes), n)
  
  # Create a data frame named 'possibilities' that contains all combinations of possible outcomes for the remaining games.
  possibilities <- expand.grid(l)
  
  # Create a vector named 'results' that indicates whether each row in the data frame 'possibilities' contains enough wins for the Cavs to win the series.
  results <- rowSums(possibilities) >=4
  
  # Calculate the proportion of 'results' in which the Cavs win the series. Print the outcome to the console.
  mean(results)
  ## 0.34375
2. Confirm the results of the previous question with a Monte Carlo simulation to estimate the probability of the Cavs winning the series after losing the first game.
  # The variable `B` specifies the number of times we want the simulation to run. Let's run the Monte Carlo simulation 10,000 times.
  B <- 10000
  
  # Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
  set.seed(1)
  
  # Create an object called `results` that replicates for `B` iterations a simulated series and determines whether that series contains at least four wins for the Cavs.
  results <- replicate(B, {sum(sample(c(0,1), 6, replace = TRUE))>=4})
  
  # Calculate the frequency out of `B` iterations that the Cavs won at least four games in the remainder of the series. Print your answer to the console.
  mean(results)
  ## 0.3453
3. Two teams, A and B, are playing a seven game series. Team A is better than team B and has a p > 0.5 chance of winning each game.
  # Let's assign the variable 'p' as the vector of probabilities that team A will win.
  p <- seq(0.5, 0.95, 0.025)
  # Given a value 'p', the probability of winning the series for the underdog team B can be computed with the following function based on a Monte Carlo simulation:
  prob_win <- function(p){
  B <- 10000
  result <- replicate(B, {
  b_win <- sample(c(1,0), 7, replace = TRUE, prob = c(1-p, p))
  sum(b_win)>=4
  })
  mean(result)
  }
  
  # Apply the 'prob_win' function across the vector of probabilities that team A will win to determine the probability that team B will win. Call this object 'Pr'.
  Pr <- sapply(p, prob_win)
  
  # Plot the probability 'p' on the x-axis and 'Pr' on the y-axis.
  plot(p, Pr)
4. Repeat the previous exercise, but now keep the probability that team A wins fixed at p <- 0.75 and compute the probability for different series lengths. For example, wins in best of 1 game, 3 games, 5 games, and so on through a series that lasts 25 games.
  # Given a value 'p', the probability of winning the series for the underdog team B can be computed with the following function based on a Monte Carlo simulation:
  prob_win <- function(N, p=0.75){
  B <- 10000
  result <- replicate(B, {
  b_win <- sample(c(1,0), N, replace = TRUE, prob = c(1-p, p))
  sum(b_win)>=(N+1)/2
  
  })
  mean(result)
  }
  
  # Assign the variable 'N' as the vector of series lengths. Use only odd numbers ranging from 1 to 25 games.
  N <- seq(from=1, to=25, by=2)
  
  # Apply the 'prob_win' function across the vector of series lengths to determine the probability that team B will win. Call this object `Pr`.
  Pr <- sapply(N, prob_win)
  
  # Plot the number of games in the series 'N' on the x-axis and 'Pr' on the y-axis.
  plot(N, Pr)
Discrete Probability Assessment

Practice Problems

Find R code on GitHub.
Continuous Probability
Continuous probability

Continuous probability is defined on intervals rather than single values. There are two related functions: the CDF (cumulative distribution function) and the eCDF (empirical cumulative distribution function), which is a basic summary of a list of numerical values. Both give the proportion of observations (or the probability) falling at or below a specific value a, and the eCDF is an estimator of the CDF. To create the eCDF of a vector x: function(a) mean(x <= a). The CDF defines a probability distribution for our variable: the probability of an interval (a, b] is F(b) - F(a).
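
A short sketch of an empirical CDF in use. The data are simulated heights invented for this example (not the course data set), and F_hat is a hypothetical name:

```r
set.seed(1)
# Hypothetical data: 1,000 simulated heights in inches
x <- rnorm(1000, mean = 69, sd = 3)

# Empirical CDF: proportion of values less than or equal to a
F_hat <- function(a) mean(x <= a)

F_hat(69)               # proportion at or below 69 inches
1 - F_hat(72)           # proportion taller than 72 inches
F_hat(72) - F_hat(66)   # proportion in the interval (66, 72]

# Base R's ecdf() builds the same function
all.equal(F_hat(70), ecdf(x)(70))
```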

Theoretical Distribution

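The normal distribution has a CDF that is available in R as pnorm(): pnorm(a, avg, s) gives the probability that a normally distributed value with mean avg and standard deviation s is less than or equal to a. When the normal approximation fits, it is very useful because we only need the mean and standard deviation to describe the whole distribution, but it is defined for continuous data.

Data scientists work with data that is, technically speaking, discrete (for example, heights rounded to the nearest inch), so we look at probabilities of intervals instead of exact values. This allows us to use the normal approximation, given a large enough data set.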

Discretization: using continuous distribution (by way of intervals) even though the data is technically discrete

Probability Density

You get the probability density function for the normal distribution using dnorm(z, mu, sigma). Knowing the density allows us to fit models to data for which predefined functions are not available.

We can use dnorm() to plot the density curve for the normal distribution. dnorm(z) gives the probability density f(z) of a certain z-score, so we can draw a curve by calculating the density over a range of possible values of z.

Since we know 99.7% of observations will be within -3 <= z <= 3, we can use a value of z slightly larger than 3 and this will cover most likely values of the normal distribution. Then, we calculate f(z), which is dnorm() of the series of z-scores. Last, we plot z against f(z).
# to use %>% we must import tidyverse
library(tidyverse)

x <- seq(-4, 4, length = 100)
# creating dataframe for the normal distribution for z values between -4 and 4, then plotting it
data.frame(x, f = dnorm(x)) %>%
ggplot(aes(x, f)) +
geom_line()

Monte Carlo Simulation

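To generate normally distributed data for Monte Carlo simulations, use rnorm(), which takes the sample size, the average (default 0), and the standard deviation (default 1):

# assumes `heights` is a numeric vector of observed heights
x <- heights
n <- length(x)
avg <- mean(x)
s <- sd(x)
simulated_heights <- rnorm(n, avg, s)
This generates normally distributed data with the same size, average, and standard deviation as the observed heights.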

Other continuous distributions
- R has built-in support for the normal (norm), Student's t (t), chi-squared (chisq), exponential (exp), gamma (gamma), and beta (beta) distributions
- For each distribution, R provides functions to compute the density (dblah), the cumulative distribution function (pblah), the quantiles (qblah), and random draws (rblah)
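
The four prefixes can be seen side by side for the normal distribution. This is a small sketch; the mean of 64 and standard deviation of 3 are the female-height values used in the practice problems:

```r
avg <- 64
s   <- 3

dnorm(64, avg, s)           # d: density at the mean (height of the curve, not a probability)
p <- pnorm(67, avg, s)      # p: cumulative distribution function, P(X <= 67)
qnorm(p, avg, s)            # q: quantile function, the inverse of the CDF, gives back 67

set.seed(1)
x <- rnorm(10000, avg, s)   # r: random draws from the same distribution
mean(x <= 67)               # proportion of draws at or below 67, close to p
```

The same naming pattern applies to the other distributions, for example pt() and qt() for the t distribution.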
Practice Problems
1. Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is 5 feet or shorter?
  # Assign a variable 'female_avg' as the average female height.
  female_avg <- 64
  
  # Assign a variable 'female_sd' as the standard deviation for female heights.
  female_sd <- 3
  
  # Using variables 'female_avg' and 'female_sd', calculate the probability that a randomly selected female is shorter than 5 feet. Print this value to the console.
  pnorm(60, female_avg, female_sd)
  ## 0.09121122
2. Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is 6 feet or taller?
  # Assign a variable 'female_avg' as the average female height.
  female_avg <- 64
  
  # Assign a variable 'female_sd' as the standard deviation for female heights.
  female_sd <- 3
  
  # Using variables 'female_avg' and 'female_sd', calculate the probability that a randomly selected female is 6 feet or taller. Print this value to the console.
  1- pnorm(72, female_avg, female_sd)
  ## 0.003830381
3. Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is between 61 and 67 inches?
  # Assign a variable 'female_avg' as the average female height.
  female_avg <- 64
  
  # Assign a variable 'female_sd' as the standard deviation for female heights.
  female_sd <- 3
  
  # Using variables 'female_avg' and 'female_sd', calculate the probability that a randomly selected female is between the desired height range. Print this value to the console.
  pnorm(67, female_avg, female_sd) - pnorm(61, female_avg, female_sd)
  ## 0.6826895
4. Repeat the previous exercise, but convert everything to centimeters. That is, multiply every height, including the standard deviation, by 2.54. What is the answer now?
  # Assign a variable 'female_avg' as the average female height. Convert this value to centimeters.
  female_avg <- 64*2.54
  
  # Assign a variable 'female_sd' as the standard deviation for female heights. Convert this value to centimeters.
  female_sd <- 3*2.54
  
  # Using variables 'female_avg' and 'female_sd', calculate the probability that a randomly selected female is between the desired height range. Print this value to the console.
  pnorm(67*2.54, female_avg, female_sd) - pnorm(61*2.54, female_avg, female_sd)
  ## 0.6826895
5. Compute the probability that the height of a randomly chosen female is within 1 SD from the average height.
  # Assign a variable 'female_avg' as the average female height.
  female_avg <- 64
  
  # Assign a variable 'female_sd' as the standard deviation for female heights.
  female_sd <- 3
  
  # To a variable named 'taller', assign the value of a height that is one SD taller than average.
  taller <- female_avg + female_sd
  
  # To a variable named 'shorter', assign the value of a height that is one SD shorter than average.
  shorter <- female_avg - female_sd
  
  # Calculate the probability that a randomly selected female is between the desired height range. Print this value to the console.
  pnorm(taller, female_avg, female_sd) - pnorm(shorter, female_avg, female_sd)
  ## 0.6826895
6. Imagine the distribution of male adults is approximately normal with an average of 69 inches and a standard deviation of 3 inches. How tall is a male in the 99th percentile?
  # Assign a variable 'male_avg' as the average male height.
  male_avg <- 69
  
  # Assign a variable 'male_sd' as the standard deviation for male heights.
  male_sd <- 3
  
  # Determine the height of a man in the 99th percentile of the distribution.
  qnorm(.99, male_avg, male_sd)
  ## 75.97904
7. The distribution of IQ scores is approximately normally distributed. The average is 100 and the standard deviation is 15. Suppose you want to know the distribution of the person with the highest IQ in your school district, where 10,000 people are born each year. Generate 10,000 IQ scores 1,000 times using a Monte Carlo simulation. Make a histogram of the highest IQ scores.
  # The variable `B` specifies the number of times we want the simulation to run.
  B <- 1000
  
  # Use the `set.seed` function to make sure your answer matches the expected result after random number generation.
  set.seed(1)
  
  avg <- 100
  s <- 15
  
  # Create an object called `highestIQ` that contains the highest IQ score from each random distribution of 10,000 people.
  highestIQ <- replicate(B, {max(rnorm(10000, avg, s))})
  
  # Make a histogram of the highest IQ scores.
  hist(highestIQ)
Continuous Probability Assessment

Practice Problems

Find R code on GitHub.
Random Variables and Sampling Models
Random Variable

A random variable is a numeric outcome of a random process, so its value is left to chance.

We can check the distribution of our data by doing the following code:
# first we create a sequence spanning the range of the data
s <- seq(min(data), max(data), length = 100)

# next we create a data frame holding the normal density with the same mean and sd as our data set:
normal_density <- data.frame(s = s, f = dnorm(s, mean(data), sd(data)))

# finally, we can use our original data to create a histogram with a line graph representing the distribution if it was a normal distribution. This will help us see if our distribution is approximately normal.
data.frame(data = data) %>% ggplot(aes(data, ..density..)) +
geom_histogram(color = "black", binwidth = 10) +
ylab("Probability") +
geom_line(data = normal_density, mapping = aes(s, f), color = "blue")

Sampling Models

Sampling models create ways to randomly sample populations to answer questions. We want to model random behaviour of a population with sample draws.
- rep(c("B", "R", "G"), c(18, 18, 2)) repeats each element the specified number of times, which builds the urn
- To draw the random variable X: X <- sample(ifelse(color == "Red", -1, 1), n, replace = TRUE), or, if you know the probabilities: X <- sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))
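
Putting these pieces together, a sampling model for a casino taking $1 bets on red might look like the sketch below. The number of bets n, the number of repetitions B, and the names S and S_sim are assumptions for the illustration:

```r
set.seed(1)

# Roulette wheel as an urn: 18 black, 18 red, 2 green pockets
color <- rep(c("Black", "Red", "Green"), c(18, 18, 2))

# Casino's winnings over n bets on red: lose $1 when red comes up, win $1 otherwise
n <- 1000
X <- sample(ifelse(color == "Red", -1, 1), n, replace = TRUE)
S <- sum(X)   # total winnings for one run of n bets

# Monte Carlo: repeat the n-bet experiment B times to see the distribution of the total
B <- 10000
S_sim <- replicate(B, sum(sample(c(-1, 1), n, replace = TRUE, prob = c(9/19, 10/19))))
mean(S_sim)       # near n * (10/19 - 9/19), about 52.6
mean(S_sim < 0)   # estimated probability that the casino loses money
```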
Distributions vs. Probability Distributions

Distribution of a list of numbers: what proportion of the list is less than or equal to a value a. Probability distribution of a random variable: what is the probability that X is less than or equal to a, or more generally that the observed value falls in any given interval. A probability distribution has an expected value and a standard error.

The average of a random variable's distribution is called the expected value, and its standard deviation is called the standard error of the random variable. For a list of numbers x, the average and standard deviation are:
avg <- sum(x)/length(x)
sd <- sqrt(sum((x - avg)^2)/length(x))

Notation for Random Variables

Capital letters are for RV and lower case are for observed values.

CLT

The Central Limit Theorem states that when the number of independent draws is large, the probability distribution of their sum is approximately normal. This is a big result because if it is approximately normal we only need the expected value and standard error to describe the distribution.

For the sum X of draws from an urn, we have:
- Expected value: E[X] = number of draws x average of the numbers in the urn
- Standard error: SE[X] = sqrt(number of draws) x standard deviation of the numbers in the urn. When the urn holds only two values, a (with proportion 1 - p) and b (with proportion p), that standard deviation is |b - a|sqrt(p(1-p)).
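
These formulas can be compared with a simulation. The sketch below assumes 1,000 draws from a two-value urn with outcomes -1 and 1 and p = 10/19; the variable names are chosen for this illustration:

```r
n <- 1000            # number of draws (hypothetical)
a <- -1; b <- 1      # the two possible outcomes of a single draw
p <- 10/19           # probability of drawing b

mu <- n * (a * (1 - p) + b * p)                 # expected value of the sum
se <- sqrt(n) * abs(b - a) * sqrt(p * (1 - p))  # standard error of the sum

# CLT: the sum is approximately normal, so P(sum < 0) is approximately
pnorm(0, mu, se)

# Check against a Monte Carlo simulation of the sum
set.seed(1)
S <- replicate(10000, sum(sample(c(a, b), n, replace = TRUE, prob = c(1 - p, p))))
c(mean(S), sd(S))   # should be close to mu and se
mean(S < 0)         # should be close to the normal approximation
```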
Practice Problems
1. An American roulette wheel has 18 red, 18 black, and 2 green pockets. Each red and black pocket is associated with a number from 1 to 36. The two remaining green slots feature "0" and "00". Players place bets on which pocket they think a ball will land in after the wheel is spun. Players can bet on a specific number (0, 00, 1-36) or color (red, black, or green). What are the chances that the ball lands in a green pocket?
  # The variables `green`, `black`, and `red` contain the number of pockets for each color
  green <- 2
  black <- 18
  red <- 18
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- green/sum(green, black, red)
  
  # Print the variable `p_green` to the console
  p_green
  ## 0.05263158
2. In American roulette, the payout for winning on green is $17. This means that if you bet $1 and it lands on green, you get $17 as a prize. Create a model to predict your winnings from betting on green one time.
  # Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
  set.seed(1)
  
  # The variables 'green', 'black', and 'red' contain the number of pockets for each color
  green <- 2
  black <- 18
  red <- 18
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- green / (green+black+red)
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1 - p_green
  
  # Create a model to predict the random variable `X`, your winnings from betting on green. Sample one time.
  X <- sample(c(17, -1), 1, replace = TRUE, prob=c(p_green, p_not_green))
  
  # Print the value of `X` to the console
  X
  ## -1
3. In American roulette, the payout for winning on green is $17. This means that if you bet $1 and it lands on green, you get $17 as a prize. In the previous exercise, you created a model to predict your winnings from betting on green. Now, compute the expected value of X, the random variable you generated previously.
  # The variables 'green', 'black', and 'red' contain the number of pockets for each color
  green <- 2
  black <- 18
  red <- 18
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- green / (green+black+red)
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1-p_green
  
  # Calculate the expected outcome if you win $17 if the ball lands on green and you lose $1 if the ball doesn't land on green
  17*p_green + (-1)*p_not_green
  ## -0.05263158
4. The standard error of a random variable X tells us the typical size of the difference between a random variable and its expected value. You calculated a random variable X in exercise 2 and the expected value of that random variable in exercise 3. Now, compute the standard error of that random variable, which represents a single outcome after one spin of the roulette wheel.
  # The variables 'green', 'black', and 'red' contain the number of pockets for each color
  green <- 2
  black <- 18
  red <- 18
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- green / (green+black+red)
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1-p_green
  
  # Compute the standard error of the random variable
  abs(-1-17)*sqrt(p_green*p_not_green)
  ## 4.019344
5. You modeled the outcome of a single spin of the roulette wheel, X, in exercise 2. Now create a random variable S that sums your winnings after betting on green 1,000 times.
  # The variables 'green', 'black', and 'red' contain the number of pockets for each color
  green <- 2
  black <- 18
  red <- 18
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- green / (green+black+red)
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1-p_green
  
  # Use the `set.seed` function to make sure your answer matches the expected result after random sampling
  set.seed(1)
  
  # Define the number of bets using the variable 'n'
  n <- 1000
  
  # Create a vector called 'X' that contains the outcomes of 1000 samples
  X <- sample(c(17, -1), n, replace = TRUE, prob=c(p_green, p_not_green))
  
  # Assign the sum of all 1000 outcomes to the variable 'S'
  S <- sum(X)
  
  # Print the value of 'S' to the console
  S
  ## -10
6. In the previous exercise, you generated a vector of random outcomes, X, after betting on green 1,000 times. What is the expected value of S?
  # The variables 'green', 'black', and 'red' contain the number of pockets for each color
  green <- 2
  black <- 18
  red <- 18
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- green / (green+black+red)
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1-p_green
  
  # Define the number of bets using the variable 'n'
  n <- 1000
  
  # Calculate the expected outcome of 1,000 spins if you win $17 when the ball lands on green and you lose $1 when the ball doesn't land on green
  E_S <- n*(17*p_green + -1*p_not_green)
  E_S
  ## -52.63158
7. You generated the expected value of S, the outcomes of 1,000 bets that the ball lands in the green pocket, in the previous exercise. What is the standard error of S?
  # The variables 'green', 'black', and 'red' contain the number of pockets for each color
  green <- 2
  black <- 18
  red <- 18
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- green / (green+black+red)
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1-p_green
  
  # Define the number of bets using the variable 'n'
  n <- 1000
  
  # Compute the standard error of the sum of 1,000 outcomes
  se_S <- sqrt(n)*(abs(-1-17)*sqrt(p_green*p_not_green))
  se_S
  ## 127.1028
The Central Limit Theorem Continued
Averages and Proportions

Rules for expected value and standard error:
- The expected value of a sum of RVs is the sum of the expected values of the individual RVs
- The expected value of an RV times a constant c is c*E(X)
- The square of the SE of a sum of independent RVs is the sum of the squares of the SE of each RV
- The SE of an RV times a constant c is |c|*SE(X)
- The SE of the average of n independent draws is sigma/sqrt(n)
- If X is normally distributed, then aX + b is also normally distributed
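The rules above can be checked numerically. This is a minimal sketch using simulated normal variables; the means, standard deviations, and sample sizes are arbitrary illustrative choices.

```r
set.seed(1)
B <- 100000

# Two independent random variables with known spread (illustrative choices)
X <- rnorm(B, mean = 2, sd = 3)
Y <- rnorm(B, mean = -1, sd = 4)

# Expected value of a sum is the sum of the expected values: 2 + (-1) = 1
mean(X + Y)

# SE of a sum of independent variables: sqrt(3^2 + 4^2) = 5
sd(X + Y)

# Multiplying by a constant scales both the expected value and the SE
mean(10 * X)   # close to 10 * 2
sd(10 * X)     # close to 10 * 3

# The average of n independent draws has SE sigma / sqrt(n)
n <- 25
avgs <- replicate(10000, mean(rnorm(n, mean = 2, sd = 3)))
sd(avgs)       # close to 3 / sqrt(25) = 0.6
```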
Law of Large Numbers

When n is large enough, the average of the draws converges to the true E(X), because the SE of the average shrinks as n grows.
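A small sketch of the Law of Large Numbers, reusing the bet on green from the practice problems. The particular values of `n` are arbitrary; the point is only that the average gets closer to the expected value as `n` grows.

```r
set.seed(1)
p_green <- 2 / 38
expected <- 17 * p_green + (-1) * (1 - p_green)   # about -0.0526

# Average winnings per bet for increasingly long runs of bets
for (n in c(10, 1000, 100000, 1000000)) {
  X <- sample(c(17, -1), n, replace = TRUE, prob = c(p_green, 1 - p_green))
  cat("n =", format(n, scientific = FALSE), " average =", mean(X), "\n")
}
```

With only 10 bets the average can be far from the expected value, while with a million bets it should sit very close to it.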

How Large is Large in CLT

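A sample size of 30 is a common rule of thumb, but how large is large enough depends heavily on your data: when the probability of success is very small, much larger samples are needed.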

Practice Problems
1. The exercises in the previous chapter explored winnings in American roulette. In this chapter of exercises, we will continue with the roulette example and add in the Central Limit Theorem. In the previous chapter of exercises, you created a random variable S that is the sum of your winnings after betting on green a number of times in American Roulette. What is the probability that you end up winning money if you bet on green 100 times?
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- 2 / 38
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1-p_green
  
  # Define the number of bets using the variable 'n'
  n <- 100
  
  # Calculate 'avg', the expected outcome of 100 spins if you win $17 when the ball lands on green and you lose $1 when the ball doesn't land on green
  avg <- n * (17*p_green + -1*p_not_green)
  
  # Compute 'se', the standard error of the sum of 100 outcomes
  se <- sqrt(n) * (17 - -1)*sqrt(p_green*p_not_green)
  
  # Using the expected value 'avg' and standard error 'se', compute the probability that you win money betting on green 100 times.
  1- pnorm(0, avg, se)
  ## 0.4479091
2. Create a Monte Carlo simulation that generates 10,000 outcomes of S, the sum of 100 bets. Compute the average and standard deviation of the resulting list and compare them to the expected value (-5.263158) and standard error (40.19344) for S that you calculated previously.
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- 2 / 38
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1-p_green
  
  # Define the number of bets using the variable 'n'
  n <- 100
  
  # The variable `B` specifies the number of times we want the simulation to run. Let's run the Monte Carlo simulation 10,000 times.
  B <- 10000
  
  # Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
  set.seed(1)
  
  # Create an object called `S` that replicates the sample code for `B` iterations and sums the outcomes.
  S <- replicate(B, {
    X <- sample(c(17, -1), n, replace = TRUE, prob=c(p_green, p_not_green))
    sum(X)
  })
  
  # Compute the average value for 'S'
  mean(S)
  ## -5.9086
  
  # Calculate the standard deviation of 'S'
  sd(S)
  ## 40.30608
3. In this chapter, you calculated the probability of winning money in American roulette using the CLT. Now, calculate the probability of winning money from the Monte Carlo simulation. The Monte Carlo simulation from the previous exercise has already been pre-run for you, resulting in the variable S that contains a list of 10,000 simulated outcomes.
  # Calculate the proportion of outcomes in the vector `S` that exceed $0
  mean(S>0)
  ## 0.4232
4. The Monte Carlo result and the CLT approximation for the probability of losing money after 100 bets are close, but not that close. What could account for this?
  The CLT does not work as well when the probability of success is small.
5. Now create a random variable Y that contains your average winnings per bet after betting on green 10,000 times.
  # Use the `set.seed` function to make sure your answer matches the expected result after random sampling.
  set.seed(1)
  
  # Define the number of bets using the variable 'n'
  n <- 10000
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- 2 / 38
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1 - p_green
  
  # Create a vector called `X` that contains the outcomes of `n` bets
  X <- sample(c(17, -1), n, replace=TRUE, prob=c(p_green, p_not_green))
  
  # Define a variable `Y` that contains the mean outcome per bet. Print this mean to the console.
  Y <- mean(X)
  Y
  ## 0.008
6. What is the expected value of Y, the average outcome per bet after betting on green 10,000 times?
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- 2 / 38
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1 - p_green
  
  # Calculate the expected outcome of `Y`, the mean outcome per bet in 10,000 bets
  E_Y <- (17*p_green + (-1)*p_not_green)
  E_Y
  ## -0.05263158
7. What is the standard error of Y, the average result of 10,000 spins?
  # Define the number of bets using the variable 'n'
  n <- 10000
  
  # Assign a variable `p_green` as the probability of the ball landing in a green pocket
  p_green <- 2 / 38
  
  # Assign a variable `p_not_green` as the probability of the ball not landing in a green pocket
  p_not_green <- 1 - p_green
  
  # Compute the standard error of 'Y', the mean outcome per bet from 10,000 bets.
  SE_Y <- abs(-1-17)*sqrt(p_green*p_not_green)/sqrt(n)
  SE_Y
  ## 0.04019344
8. What is the probability that your winnings are positive after betting on green 10,000 times?
  # We defined the average using the following code
  avg <- 17*p_green + -1*p_not_green
  
  # We defined standard error using this equation
  se <- 1/sqrt(n) * (17 - -1)*sqrt(p_green*p_not_green)
  
  # Given this average and standard error, determine the probability of winning more than $0. Print the result to the console.
  1 - pnorm(0, avg, se)
  ## 0.0951898
9. Create a Monte Carlo simulation that generates 10,000 outcomes of S, the average outcome from 10,000 bets on green. Compute the average and standard deviation of the resulting list to confirm the results from previous exercises using the Central Limit Theorem.
  # The variable `n` specifies the number of independent bets on green
  n <- 10000
  
  # The variable `B` specifies the number of times we want the simulation to run
  B <- 10000
  
  # Use the `set.seed` function to make sure your answer matches the expected result after random number generation
  set.seed(1)
  
  # Generate a vector `S` that contains the average outcomes of 10,000 bets modeled 10,000 times
  S <- replicate(B, {
    X <- sample(c(17,-1), n, replace=TRUE, prob=c(p_green, p_not_green))
    mean(X)
  })
  
  # Compute the average of `S`
  mean(S)
  ## -0.05223142
  
  # Compute the standard deviation of `S`
  sd(S)
  ## 0.03996168
10. In a previous exercise, you found the probability of winning more than $0 after betting on green 10,000 times using the Central Limit Theorem. Then, you used a Monte Carlo simulation to model the average result of betting on green 10,000 times over 10,000 simulated series of bets. What is the probability of winning more than $0 as estimated by your Monte Carlo simulation? The code to generate the vector S that contains the average outcomes of 10,000 bets modeled 10,000 times has already been run for you.
  # Compute the proportion of outcomes in the vector 'S' where you won more than $0
  mean(S>0)
  ## 0.0977
11. The Monte Carlo result and the CLT approximation are now much closer than when we calculated the probability of winning for 100 bets on green. What could account for this difference?
  The CLT works better when the sample size is larger.
Assessment: Random Variables, Sampling Models, and the Central Limit Theorem

Practice Problems

Find R code on GitHub.
The Big Short
Interest Rates Explained

Banks want to cover the losses from the people they predict will not pay back their loans. This is why they charge different interest rates. Let p be the probability of default, q = 1 - p, loss the (negative) amount lost on a default, and x the amount earned on a loan that is paid back. Since our total earnings are a sum of n independent draws, we have E(X) = n*(loss*p + x*q) and SE(X) = sqrt(n)*abs(x - loss)*sqrt(p*q), and by the CLT the total is approximately normal. If the expected value is zero, we break even on average. This leads to our equation loss*p + x*q = 0.

Next, we want to take into account that we only want to lose money 1% of the time (meaning that 99% of the time we at least break even). We do this by requiring P(X < 0) = 0.01. When we standardize this, we get P(Z < -n*(loss*p + x*q)/(abs(x - loss)*sqrt(n*p*q))) = 0.01. Using qnorm(), we can see that qnorm(0.01) = -2.33. This means we are looking for our z-value to be -2.33.

This gives us the equation -n*(loss*p + x*q)/(abs(x - loss)*sqrt(n*p*q)) = -2.33, which determines the value of x we need.
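Solving that equation for x gives a closed form. The sketch below computes it and then checks it against the CLT approximation. The values of `n`, `loss`, and `p` are assumptions borrowed from the practice problems, and the short names are chosen here for illustration.

```r
n <- 10000          # number of loans (assumed)
loss <- -200000     # loss per default (assumed)
p <- 0.03           # default probability (assumed)
q <- 1 - p
z <- qnorm(0.01)    # about -2.33

# Closed-form solution of -n*(loss*p + x*q) / (abs(x - loss)*sqrt(n*p*q)) = z for x
x <- -loss * (n * p - z * sqrt(n * p * q)) / (n * q + z * sqrt(n * p * q))
x

# Check: with this x, the CLT approximation of P(S < 0) should be 0.01
E_S <- n * (loss * p + x * q)
SE_S <- abs(x - loss) * sqrt(n * p * q)
pnorm(0, E_S, SE_S)
```

Dividing `x` by the loan amount converts it into an interest rate, as in the practice problems.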

Big Short

The financial crisis described in The Big Short happened in part because the models assumed that the chance of defaulting on a high-risk mortgage was independent from person to person. However, when there is a global event (such as a recession), many people are affected at once, so defaults rise together and are not independent. Financial experts assumed independence when there was none.
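To see why the independence assumption matters, here is a rough simulation sketch. Every number in it (`n`, `p`, the size of the shared `shock`) is a made-up illustration and not data from the course or from the crisis.

```r
set.seed(1)
n <- 1000        # loans per simulated year (assumed)
p <- 0.04        # baseline default probability (assumed)
B <- 5000        # number of simulated years

# Independent defaults: every borrower defaults with the same fixed probability
indep <- replicate(B, sum(rbinom(n, 1, p)))

# Dependent defaults: a shared shock (e.g. a recession) moves everyone's
# default probability up or down together before the individual draws
dep <- replicate(B, {
  shock <- sample(c(-0.02, 0, 0.02), 1)   # hypothetical economy-wide shift
  sum(rbinom(n, 1, p + shock))
})

# Similar average number of defaults, very different spread
c(mean_indep = mean(indep), mean_dep = mean(dep))
c(sd_indep = sd(indep), sd_dep = sd(dep))
```

The averages are about the same, but the spread is much larger when defaults move together, so a bank that plans for the independent case underestimates how bad a bad year can be.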

Practice Problems
1. Say you manage a bank that gives out 10,000 loans. The default rate is 0.03 and you lose $200,000 in each foreclosure. Create a random variable S that contains the earnings of your bank. Calculate the total amount of money lost in this scenario.
  # Assign the number of loans to the variable `n`
  n <- 10000
  
  # Assign the loss per foreclosure to the variable `loss_per_foreclosure`
  loss_per_foreclosure <- -200000
  
  # Assign the probability of default to the variable `p_default`
  p_default <- 0.03
  
  # Use the `set.seed` function to make sure your answer matches the expected result after random sampling
  set.seed(1)
  
  # Generate a vector called `defaults` that contains the default outcomes of `n` loans
  defaults <- sample(c(0,1), n, replace=TRUE, prob = c(1-p_default, p_default))
  
  # Generate `S`, the total amount of money lost across all foreclosures. Print the value to the console.
  S <- sum(defaults)*loss_per_foreclosure
  S
  ## -6.3e+07
2. Run a Monte Carlo simulation with 10,000 outcomes for S, the sum of losses over 10,000 loans. Make a histogram of the results.
  # Assign the number of loans to the variable `n`
  n <- 10000
  
  # Assign the loss per foreclosure to the variable `loss_per_foreclosure`
  loss_per_foreclosure <- -200000
  
  # Assign the probability of default to the variable `p_default`
  p_default <- 0.03
  
  # Use the `set.seed` function to make sure your answer matches the expected result after random sampling
  set.seed(1)
  
  # The variable `B` specifies the number of times we want the simulation to run
  B <- 10000
  
  # Generate a list of summed losses 'S'. Replicate the code from the previous exercise over 'B' iterations to generate a list of summed losses for 'n' loans. Ignore any warnings for now.
  S <- replicate(B, {
    X <- sample(c(0,1), n, replace=TRUE, prob = c(1-p_default, p_default))
    sum(X)*loss_per_foreclosure
  })
  
  # Plot a histogram of 'S'. Ignore any warnings for now.
  hist(S)
3. What is the expected value of S, the sum of losses over 10,000 loans? For now, assume a bank makes no money if the loan is paid.
  # Assign the number of loans to the variable `n`
  n <- 10000
  
  # Assign the loss per foreclosure to the variable `loss_per_foreclosure`
  loss_per_foreclosure <- -200000
  
  # Assign the probability of default to the variable `p_default`
  p_default <- 0.03
  
  # Calculate the expected loss due to default out of 10,000 loans
  n*p_default*loss_per_foreclosure
  ## -6e+07
4. What is the standard error of S?
  # Assign the number of loans to the variable `n`
  n <- 10000
  
  # Assign the loss per foreclosure to the variable `loss_per_foreclosure`
  loss_per_foreclosure <- -200000
  
  # Assign the probability of default to the variable `p_default`
  p_default <- 0.03
  
  # Compute the standard error of the sum of 10,000 loans
  abs(loss_per_foreclosure)*sqrt(n*p_default*(1-p_default))
  ## 3411744
5. So far, we've been assuming that we make no money when people pay their loans and we lose a lot of money when people default on their loans. Assume we give out loans for $180,000. How much money do we need to make when people pay their loans so that our net loss is $0? In other words, what interest rate do we need to charge in order to not lose money?
  # Assign the loss per foreclosure to the variable `loss_per_foreclosure`
  loss_per_foreclosure <- -200000
  
  # Assign the probability of default to the variable `p_default`
  p_default <- 0.03
  
  # Assign a variable `x` as the total amount necessary to have an expected outcome of $0
  # lp + x(1-p) = 0
  x <- -p_default*loss_per_foreclosure/(1-p_default)
  
  # Convert `x` to a rate, given that the loan amount is $180,000. Print this value to the console.
  x/180000
  ## 0.03436426
6. With the interest rate calculated in the last example, we still lose money 50% of the time. What should the interest rate be so that the chance of losing money is 1 in 20? In math notation, what should the interest rate be so that P(S<0) = 0.05? Let z = qnorm(0.05) be the value of z for which P(Z < z) = 0.05.
  # Assign the number of loans to the variable `n`
  n <- 10000
  
  # Assign the loss per foreclosure to the variable `loss_per_foreclosure`
  loss_per_foreclosure <- -200000
  
  # Assign the probability of default to the variable `p_default`
  p_default <- 0.03
  
  # Generate a variable `z` using the `qnorm` function
  z <- qnorm(0.05)
  
  # Generate a variable `x` using `z`, `p_default`, `loss_per_foreclosure`, and `n`
  x <- -loss_per_foreclosure*( ( n*p_default - z*sqrt(n*p_default*(1-p_default)) ) / (n*(1-p_default) + z*sqrt(n*p_default*(1-p_default))) )
  
  # Convert `x` to an interest rate, given that the loan amount is $180,000. Print this value to the console.
  x/180000
  ## 0.03768738
7. The bank wants to minimize the probability of losing money. Which of the following achieves their goal without making interest rates go up?
  A reduced default rate

Inference

Inference and Modeling

The course link can be found here. Within this course we look at how to model data and draw inferences from our samples. This section will be updated soon!

Parameters and Estimates
Central Limit Theorem in Practice
Confidence Intervals and p-Values
Statistical Models
Bayesian Statistics
Election Forecasting
Association Tests

Regression

Linear Regression

The course link can be found here. Within this course we look at linear models and how to understand relationships between various factors within an analysis.

Baseball as a Motivating Example
Baseball Basics

Some things to know about baseball and the notation used in this section:
- 9 innings with 3 outs per team per inning
- There is a set order for 9 batters
- Bases form a cycle
- There are 6 ways to succeed:
  1. BB (base on balls aka "walk")
  2. SB (stolen bases)
  3. 1B (single base run)
  4. 2B (double bases run)
  5. 3B (triple bases run)
  6. HR (home run)
- We define a hit (H) as any of the four outcomes 1B, 2B, 3B, HR, and an at bat (AB) as having just two outcomes (a hit or an out)
- PA (plate appearance): a turn at bat; it has a binary outcome, either the batter is out or not out.
- Batting average, defined as H/AB, has long been considered the most important offensive stat, but it ignores BB.
- The statistics analyzed are team-level because players are not independent of one another.
We can make a scatterplot of home runs per game vs. runs per game to further see the relationship:
library(tidyverse)
library(dslabs)
library(Lahman)

ds_theme_set()
Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR/G, R_per_game = R/G) %>%
  ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha=0.5)

Assessment: Baseball as a Motivating Example
1. What is the application of statistics and data science to baseball called?
  Sabermetrics
2. Which of the following outcomes is not included in the batting average?
  A base on balls
3. Why do we consider team statistics as well as individual player statistics?
  The success of any individual player also depends on the strength of their team.
4. You want to know whether teams with more at-bats per game have more runs per game. What R code below correctly makes a scatter plot for this relationship?
  Teams %>% filter(yearID %in% 1961:2001 ) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
5. What does the variable “SOA” stand for in the Teams table?
  Strikeouts by pitchers
6. Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001. Make a scatterplot of runs per game versus at bats (AB) per game.
  Teams %>% filter(yearID %in% 1961:2001 ) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
  
  As the number of at bats per game increases, the number of runs per game tends to increase.
7. Use the filtered Teams data frame from Question 6. Make a scatterplot of win rate (number of wins per game) versus number of fielding errors (E) per game. Which of the following is true?
  Teams %>% filter(yearID %in% 1961:2001 ) %>%
  mutate(W_per_game = W/G, E_per_game = E/G) %>%
  ggplot(aes(W_per_game, E_per_game)) +
  geom_point(alpha = 0.5)
  
  As the number of errors per game increases, the win rate tends to decrease.
8. Use the filtered Teams data frame from Question 6. Make a scatterplot of triples (X3B) per game versus doubles (X2B) per game. Which of the following is true?
  Teams %>% filter(yearID %in% 1961:2001 ) %>%
  mutate(X3_per_game = X3B/G, X2_per_game = X2B/G) %>%
  ggplot(aes(X3_per_game, X2_per_game)) +
  geom_point(alpha = 0.5)
  
  There is no clear relationship between doubles per game and triples per game.
RStudio on GitHub: 1.1_BaseballExample_Assessment.R
Correlation
Correlation

For univariate data, each observation includes just one value. However, in most data analysis cases we will not be working with just one variable.

For instance, we can look at the HistData library, which contains heights of people by gender and family. Here is how to filter for fathers and sons:
library(tidyverse)
library(HistData)
data("GaltonFamilies")
set.seed(1983)

galton_heights <- GaltonFamilies %>%
  filter(gender == "male") %>%
  group_by(family) %>%
  sample_n(1) %>%
  ungroup() %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

Usually, correlation coefficients summarize the trend effectively.

Correlation Coefficient

We usually refer to the correlation coefficient with the Greek letter rho. The equation is:
rho = (1/n) * sum of ((xi - mu_x)/sigma_x) * ((yi - mu_y)/sigma_y)
Here xi and yi are the values of the two variables for observation i, and mu and sigma are the mean and standard deviation of each variable.

If xi and yi are unrelated, then the product will be positive as often as it is negative, and the average will be about 0.

The correlation coefficient ranges from -1 to 1. The magnitude tells you how strong the relationship is, and the sign tells you the direction of the relationship.
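As a sanity check on the definition, the correlation can be computed by hand as the average product of standardized values and compared with R's built-in `cor()`. This is a minimal sketch on simulated heights (not the Galton data); the helper `pop_sd` is defined here only for illustration.

```r
set.seed(1)
# Simulated heights in inches (illustrative, not the Galton data)
father <- rnorm(200, mean = 69, sd = 2.5)
son <- 35 + 0.5 * father + rnorm(200, mean = 0, sd = 2)

# Standardize each variable with its mean and (population) standard deviation
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))
z_father <- (father - mean(father)) / pop_sd(father)
z_son <- (son - mean(son)) / pop_sd(son)

# The correlation is the average product of the standardized values
rho_manual <- mean(z_father * z_son)
rho_manual

# Same value from the built-in function
cor(father, son)
```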

Sample Correlation is a Random Variable

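Sample correlation is the most commonly used estimate of the population correlation. Because it is computed from a random sample, it is itself a random variable. When the CLT applies (large enough N), the sample correlation R is approximately normal with expected value rho and standard error sqrt((1 - r^2)/(N - 2)).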
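A Monte Carlo sketch makes the "random variable" point concrete. It uses a simulated population with correlation of about 0.5 (an assumed value, not the Galton data) and a hypothetical helper `sample_cor` that repeatedly draws N pairs and records their correlation.

```r
set.seed(1)
rho <- 0.5
# Simulated "population" of standardized pairs with correlation about 0.5
x <- rnorm(10000)
y <- rho * x + sqrt(1 - rho^2) * rnorm(10000)

# Monte Carlo: correlation of a random sample of N pairs, repeated B times
sample_cor <- function(N, B = 1000) {
  replicate(B, {
    idx <- sample(length(x), N, replace = TRUE)
    cor(x[idx], y[idx])
  })
}

R_25 <- sample_cor(25)
R_50 <- sample_cor(50)

c(mean_25 = mean(R_25), mean_50 = mean(R_50))   # both near the population value
c(sd_25 = sd(R_25), sd_50 = sd(R_50))           # spread shrinks as N grows
```

Doubling the sample size leaves the average of the sample correlation roughly unchanged but reduces its standard deviation.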

Assessment: Correlation
1. While studying heredity, Francis Galton developed what important statistical concept?
  Correlation
2. The correlation coefficient is a summary of what?
  The trend between two variables
3. Below is a scatter plot showing the relationship between two variables, x and y. From this figure, the correlation between x and y appears to be about: -0.9
4. Instead of running a Monte Carlo simulation with a sample size of 25 from the 179 father-son pairs described in the videos, imagine we now run the simulation with a sample size of 50. Note: You do not need to run any code to determine the answer to this exercise. Would you expect the mean of the sample correlation to increase, decrease, or stay approximately the same?
  Stay approximately the same; Because the expected value of the sample correlation is the population correlation, it should stay approximately the same even if the sample size is increased.
5. Instead of running a Monte Carlo simulation with a sample size of 25 from the 179 father-son pairs described in the videos, imagine we now run the simulation with a sample size of 50. Note: You do not need to run any code to determine the answer to this exercise. Would you expect the standard deviation of the sample correlation to increase, decrease, or stay approximately the same?
  Decrease; As the sample size N increases, the standard deviation of the sample correlation should decrease.
6. If X and Y are completely independent, what do you expect the value of the correlation coefficient to be?
  0
7. RStudio on GitHub: 1.2_Correlation_assement.R
8. RStudio on GitHub: 1.2_Correlation_assement.R
Stratification and Variance Explained
Anscombe's Quartet

Although correlation is sometimes a good summary, it isn't always. A well-known example is Anscombe's Quartet, four data sets that have nearly identical simple descriptive statistics:
All four have a rho of 0.816, but their scatterplots look very different!
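The quartet ships with base R as the `anscombe` data frame, so the claim is easy to reproduce. A minimal sketch:

```r
# `anscombe` ships with base R (datasets package): columns x1..x4 and y1..y4
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
r

# Plot the four data sets side by side to see how different they are
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```

All four correlations agree to about three decimal places, yet only one of the four plots looks like a noisy straight line.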

Stratification

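Correlation is only meaningful in a particular context, so we use stratification. Stratifying lets you take a conditional average, that is, the average within a specific subgroup of your data. The conditional averages approximately follow a line, the regression line:
(E(Y | X = x) - mu_y)/sigma_y = rho * (x - mu_x)/sigma_x
This equation says that for every standard deviation sigma_x that x increases above the average mu_x, the prediction for Y increases by rho standard deviations sigma_y above the average mu_y. With perfect correlation (rho = 1), both increase by the same number of standard deviations.

We can write down both the slope:
m = rho * sigma_y/sigma_x
and the intercept:
b = mu_y - m * mu_x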
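The slope and intercept of the regression line can be computed from summary statistics alone (slope = rho times the ratio of standard deviations, intercept = mean of y minus slope times mean of x) and compared with what `lm()` returns. This sketch uses simulated heights, not the Galton data.

```r
set.seed(1)
# Simulated heights in inches (illustrative, not the Galton data)
father <- rnorm(200, mean = 69, sd = 2.5)
son <- 35 + 0.5 * father + rnorm(200, mean = 0, sd = 2)

# Slope and intercept from the correlation, sds, and means
r <- cor(father, son)
m <- r * sd(son) / sd(father)
b <- mean(son) - m * mean(father)
c(intercept = b, slope = m)

# Least squares gives the same line
coef(lm(son ~ father))
```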

Fun fact: it is called regression because the son “regresses” to the average height as it is not a perfect correlation. In general, we see heights regress to mediocrity.

Bivariate Normal Distribution

Looking at scatterplots we can also see if there is correlation:

If X and Y are each normally distributed and, for any stratum X = x, Y is approximately normal within the stratum, then the pair (X, Y) is approximately bivariate normal. The distribution of Y within a stratum is called the conditional distribution of Y given X = x, written Y | X = x.

If there are 3 or more variables in which each pair is bivariate normal, we say it follows a multivariate normal distribution (i.e. jointly normal).

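Within the stratum, the expected value is given by this formula:
E(Y | X = x) = mu_y + rho * ((x - mu_x)/sigma_x) * sigma_y
and the variance, which shows how much of the variation X explains:
Var(Y | X = x) = sigma_y^2 * (1 - rho^2)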

Some things to note:
- Regression lines based on strata matter.
- E(Y|X=x) cannot be rearranged to find the regression line going the other way, but rather E(X|Y=y) must be calculated separately
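The last note can be illustrated numerically: the slope for predicting the father from the son is not the inverse of the slope for predicting the son from the father. A minimal sketch on simulated heights (not the Galton data), with variable names chosen here for illustration:

```r
set.seed(1)
# Simulated heights in inches (illustrative, not the Galton data)
father <- rnorm(200, mean = 69, sd = 2.5)
son <- 35 + 0.5 * father + rnorm(200, mean = 0, sd = 2)
r <- cor(father, son)

# Slope for predicting son from father: E(son | father = x)
m_son_on_father <- r * sd(son) / sd(father)

# Slope for predicting father from son: E(father | son = y)
m_father_on_son <- r * sd(father) / sd(son)

# Inverting the first line would give slope 1 / m_son_on_father,
# which is not the slope of the second regression line
c(inverted = 1 / m_son_on_father, actual = m_father_on_son)

# The two slopes multiply to rho^2 rather than to 1
m_son_on_father * m_father_on_son
r^2
```

Only when the correlation is perfect do the two lines coincide.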
Assessment:
1. The slope of the regression line in this figure is equal to what, in words? Slope = (correlation coefficient of son and father heights) * (standard deviation of sons’ heights / standard deviation of fathers’ heights)
2. Why does the regression line simplify to a line with intercept zero and slope rho when we standardize our x and y variables?
  When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: yi=rho*xi
3. What is a limitation of calculating conditional means?
  Each stratum we condition on (e.g., a specific father’s height) may not have many data points.
  Because there are limited data points for each stratum, our average values have large standard errors.
  Conditional means are less stable than a regression line.
4. A regression line is the best prediction of Y given we know the value of X when
  X and Y follow a bivariate normal distribution.
5. Which one of the following scatterplots depicts an x and y distribution that is NOT well-approximated by the bivariate normal distribution? The v-shaped distribution of points from the first plot means that the x and y variables do not follow a bivariate normal distribution.
  When a pair of random variables is approximated by a bivariate normal, the scatter plot looks like an oval (as in the 2nd, 3rd, and 4th plots) - it is okay if the oval is very round (as in the 3rd plot) or long and thin (as in the 4th plot).
6. We previously calculated that the correlation coefficient rho between fathers’ and sons’ heights is 0.5. Given this, what percent of the variation in sons’ heights is explained by fathers’ heights?
  25%
  
  When two variables follow a bivariate normal distribution, the variation explained can be calculated as 100*rho^2.
7. Suppose the correlation between father and son’s height is 0.5, the standard deviation of fathers’ heights is 2 inches, and the standard deviation of sons’ heights is 3 inches. Given a one inch increase in a father’s height, what is the predicted change in the son’s height?
  rho = 0.5
  sd_f = 2
  sd_s = 3
  f = +1
  s = ?
  3/2*0.5 = 0.75
  
  The slope of the regression line is calculated by multiplying the correlation coefficient by the ratio of the standard deviation of son heights and standard deviation of father heights: sigma_son/sigma_father.
RStudio on GitHub questions 8-11: 1.3_Stratification-VarianceExplained_assessment.R
Introduction to Linear Models
Using the tidy(), glance(), and augment() functions from the broom package will be helpful in data analysis.

Confounding: Are BBs More Predictive?

A really important fact is that association is not causation!

In baseball, home runs (HR) are a confounding variable for the relationship between BB and runs:
BB are confounded with HR.

In real life this happens because pitchers sometimes avoid throwing strikes to home run hitters, so home run hitters end up with more bases on balls.

Regression can help us adjust for the confounding to see whether there is still an effect.

Stratification and Multivariate Regression

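The first approach we can take is to keep HR fixed and then examine the relationship between R and BB. After stratifying, we can check whether there is still an effect. Here we can get the expected value:
E[R | BB = x_1, HR = x_2] = beta_0 + beta_1*x_1 + beta_2*x_2
where beta_1 is the slope for BB within each HR stratum (assumed to be the same across strata) and beta_2 is the slope for HR.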

Linear Models

Second, we can look at a linear model with regression. A linear model describes the relationship between two or more variables, and regression allows us to find the relationship between two variables while adjusting for others. This is particularly useful when we are not able to randomly assign groups to treatment vs. control.
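To show what "adjusting for others" looks like in practice, here is a sketch with simulated team-level data in which home runs drive both walks and runs. The numbers are hypothetical and are not taken from the Lahman data; the true BB effect is set to 0.4 by construction.

```r
set.seed(1)
n_teams <- 500
# Simulated per-game rates (hypothetical numbers, not the Lahman data)
HR <- rnorm(n_teams, mean = 1, sd = 0.2)              # home runs per game
BB <- 2 + 1.0 * HR + rnorm(n_teams, sd = 0.3)         # walks rise with home runs
R  <- 1 + 0.4 * BB + 1.6 * HR + rnorm(n_teams, sd = 0.3)

# Ignoring HR: the BB slope absorbs part of the HR effect
naive <- unname(coef(lm(R ~ BB))["BB"])
naive

# Adjusting for HR: the BB slope is close to the 0.4 used to simulate the data
adjusted <- unname(coef(lm(R ~ BB + HR))["BB"])
adjusted
```

The unadjusted slope overstates the effect of walks, which is the same pattern the stratified analysis reveals.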

For our Galton height example, we can predict any son's height (Y_i) from his father's height (x_i) with this equation:
Y_i = beta_0 + beta_1*x_i + epsilon_i, for i = 1, ..., N
The epsilon_i is an error term that accounts for things like the mother’s genetic effect, environmental factors, and other biological randomness. The epsilon_i are assumed to be independent of each other, to have expected value 0, and to have a standard deviation that does not depend on i (the same for each individual).

Linear models are meant to be interpretable, so we can rewrite the equation to make the intercept more meaningful:
Y_i = beta_0 + beta_1*(x_i - x_bar) + epsilon_i
where x_bar is the average father's height. Now beta_0 is the predicted height for the son of the average father.
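To see what subtracting the average father's height does to the intercept, here is a small sketch on simulated heights. The data are made up (not the Galton data) and the variable names are my own.

```r
set.seed(2)
n <- 200
# Simulated heights in inches (not the Galton data)
father <- rnorm(n, mean = 69, sd = 2.5)
son    <- 35 + 0.5 * father + rnorm(n, sd = 2)

fit_raw <- lm(son ~ father)
father_centered <- father - mean(father)
fit_centered <- lm(son ~ father_centered)

# The slope is unchanged; only the intercept moves
print(coef(fit_raw))
print(coef(fit_centered))
# With a centered predictor, the intercept equals the mean of the outcome
print(mean(son))
```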

Assessment:
1. As described in the videos, when we stratified our regression lines for runs per game vs. bases on balls by the number of home runs, what happened?
  The slope of runs per game vs. bases on balls within each stratum was reduced because we removed confounding by home runs.
2. We run a linear model for sons’ heights vs. fathers’ heights using the Galton height data, and get the following results:
  > lm(son ~ father, data = galton_heights)
  
  Call:
  lm(formula = son ~ father, data = galton_heights)
  
  Coefficients:
  (Intercept) father
  35.71 0.50
  Interpret the numeric coefficient for "father."
  For every inch we increase the father’s height, the predicted son’s height grows by 0.5 inches.
3. We want the intercept term for our model to be more interpretable, so we run the same model as before but now we subtract the mean of fathers’ heights from each individual father’s height to create a new variable centered at zero.
  galton_heights <- galton_heights %>%
    mutate(father_centered = father - mean(father))
  
  We run a linear model using this centered fathers’ height variable.
  > lm(son ~ father_centered, data = galton_heights)
  
  Call:
  lm(formula = son ~ father_centered, data = galton_heights)
  
  Coefficients:
  (Intercept) father_centered
  70.45 0.50
  Interpret the numeric coefficient for the intercept.
  The height of a son of a father of average height is 70.45 inches.
4. Suppose we fit a multivariable regression model for expected runs based on BB and HR: E[R | BB = x_1, HR = x_2] = beta_0 + beta_1*x_1 + beta_2*x_2. Suppose we fix BB = x_1. Then we observe a linear relationship between runs and HR with an intercept of:
  beta_0 + beta_1*x_1
  If x_1 is fixed, then beta_1*x_1 is fixed and becomes part of the intercept for this regression model. This is the basis of stratification.
5. Which of the following are assumptions for the errors epsilon_i in a linear regression model?
  The epsilon_i are independent of each other
  The epsilon_i have expected value 0
  The variance of epsilon_i is a constant
Least Squares Estimates (LSE)

Least Squares Estimates (LSE)

For linear models to be useful we need to estimate the unknown parameters (the betas). The least squares estimates are the parameter values that minimize the residual sum of squares (RSS):
RSS = sum_{i=1}^{N} (Y_i - (beta_0 + beta_1*x_i))^2

To minimize the RSS, we take the partial derivatives with respect to each parameter, set them equal to zero, and solve.

A useful fact to know is that, when the errors are normally distributed, beta_hat/SE(beta_hat) follows a t-distribution with N - p degrees of freedom, where p is the number of parameters in the model.

To fit a linear model we can use the lm() function, written as lm(value being predicted ~ value used for prediction). Also do not forget summary(), which gives more information: the estimate, standard error, t value, and p-value (Pr(>|t|)) for each coefficient.
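As a check on the idea of minimizing the residual sum of squares, the sketch below computes the least squares estimates for one predictor by hand and compares them with lm(). The data are simulated and the helper rss() is my own, not something from the course.

```r
set.seed(3)
n <- 100
x <- rnorm(n, mean = 69, sd = 2.5)
y <- 35 + 0.5 * x + rnorm(n, sd = 2)

# Residual sum of squares for a candidate intercept and slope
rss <- function(beta0, beta1) sum((y - (beta0 + beta1 * x))^2)

# Setting the partial derivatives of the RSS to zero gives these solutions
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

fit <- lm(y ~ x)
print(c(beta0_hat, beta1_hat))
print(coef(fit))

# Moving the slope away from its estimate increases the RSS
print(rss(beta0_hat, beta1_hat) < rss(beta0_hat, beta1_hat + 0.1))
```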

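An example of calculating LSE for a repeated experiment would be:
library(dplyr)
# galton_heights is the father/son data frame used above
B <- 1000   # number of Monte Carlo replicates
N <- 50     # sample size in each replicate
lse <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    mutate(father = father - mean(father)) %>%   # center the predictor
    lm(son ~ father, data = .) %>% .$coef
})
# lse is a 2 x B matrix: row 1 holds the intercepts, row 2 the slopes
cor(lse[1,], lse[2,])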

Predicted Variables are Random Variables

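Plugging the estimates into our model gives the predicted values, which we can obtain with predict(). Because the estimates are random variables, so are the predictions, and we can construct a confidence interval around them. Using ggplot2:
geom_smooth(method = "lm")
plots the regression line with the CI around the predicted Y.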
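The same confidence bounds can be obtained without a plot. This is a minimal base R sketch on simulated heights (not the Galton data); it uses the interval argument of predict() for lm fits, and the object names are my own.

```r
set.seed(4)
n <- 150
father <- rnorm(n, mean = 69, sd = 2.5)
son    <- 35 + 0.5 * father + rnorm(n, sd = 2)
heights <- data.frame(father, son)

fit <- lm(son ~ father, data = heights)

# Predicted son heights for new father heights, with 95% confidence bounds
new_fathers <- data.frame(father = c(65, 69, 73))
pred <- predict(fit, newdata = new_fathers, interval = "confidence", level = 0.95)
print(cbind(new_fathers, pred))
```

The result has the columns fit, lwr, and upr, one row per new father height.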

Assessment: Least Squares Estimates

RStudio on GitHub: 2.2_LSE_assessment.R
Advanced dplyr: summarize with functions and broom
Important to know: lm() ignores group_by() because lm() is not part of the tidyverse and does not know how to handle grouped tibbles. We can work around it by calling lm() inside summarize():
dat %>% group_by(HR) %>% summarize(slope = lm(R ~ BB)$coef[2])
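For comparison, the same one-slope-per-stratum idea can be sketched in base R with split() and sapply(). This is an alternative to the dplyr line, not the course's approach, and the data frame dat here is simulated.

```r
set.seed(5)
# Simulated data: three HR strata, each with its own BB values
dat <- data.frame(HR = rep(c(0.5, 1.0, 1.5), each = 50))
dat$BB <- 3 + dat$HR + rnorm(nrow(dat), sd = 0.4)
dat$R  <- 2 + 0.3 * dat$BB + 1.5 * dat$HR + rnorm(nrow(dat), sd = 0.3)

# Fit one regression per stratum and keep the BB slope from each
slopes <- sapply(split(dat, dat$HR), function(d) coef(lm(R ~ BB, data = d))[["BB"]])
print(slopes)
```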

What is really great about the broom package is that it connects lm() to the tidyverse and makes it easy to compute things like confidence intervals.

Three main functions to know and use:
- tidy() returns estimates and related information as a data frame.
  tidy(fit) returns the estimate, standard error, test statistic, and p-value; you can ask for more, e.g. tidy(fit, conf.int = TRUE) also returns the CI. Combine this with summarize() to work with group_by().
- glance() returns model-level summaries, one row per model.
- augment() returns observation-level results such as fitted values and residuals.
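A quick way to get a feel for the three functions is to run them on one small fit. The sketch below assumes the broom package is installed and uses simulated heights; the object names are my own.

```r
library(broom)

set.seed(6)
n <- 100
father <- rnorm(n, mean = 69, sd = 2.5)
son    <- 35 + 0.5 * father + rnorm(n, sd = 2)
fit <- lm(son ~ father)

# Coefficient-level table with confidence intervals
coef_table <- tidy(fit, conf.int = TRUE)
print(coef_table)

# One-row model summary (R squared, residual standard error, ...)
model_summary <- glance(fit)
print(model_summary)

# One row per observation with fitted values and residuals
obs_table <- augment(fit)
print(head(obs_table))
```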
Useful to know for lm():
Inside summarize(), passing the data with the dot operator (data = .) instead of the across() function leads R to ignore our grouping. You may notice that, as a result, the estimate for both lgID values is the same.

Assessment: Advanced dplyr
1. As seen in the videos, what problem do we encounter when we try to run a linear model on our baseball data, grouping by home runs?
  The lm() function does not know how to handle grouped tibbles.
2. Tibbles are similar to what other class in R?
  Dataframes
3. What are some advantages of tibbles compared to data frames?
  All of the listed answers are advantages of tibbles when compared to data frames: tibbles display better, they always return tibbles when subsetted, they can have complex entries, and they can be grouped.
4. What are two advantages of the summarize() command, when applied to the tidyverse?
  The summarize() function can understand grouped tibbles.
  The summarize() function always returns a type of data frame (tibble or data.frame).
5. RStudio on GitHub questions 5: 2.3_Advanced-dplyr_assessment.R
6. The output of a broom function is always what?
  A tibble
RStudio on GitHub questions 7-10: 2.3_Advanced-dplyr_assessment.R
Regression and Baseball
Building a Better Offensive Metric for Baseball

To let lm() know there are two predictor variables, use "+":
fit <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB = BB/G, HR = HR/G, R = R/G) %>%
  lm(R ~ BB + HR, data = .)

The tidy() function from the broom package lets us see a summary of the model fitted with lm().

Side note: if the variables are jointly normal, then when the other predictors are held constant, the remaining predictor has a linear relationship with the outcome, and its slope does not change with the values of the other predictors.

predict() should be used on data that was not used to fit the model. For more reliable predictions, it is beneficial to filter out players with few plate appearances, so that each remaining player's rates are based on enough data. For example, to get the number of runs we predict a team would score if all batters were exactly like that one player:
# pa_per_game is the average number of plate appearances per team per game
players <- Batting %>% filter(yearID %in% 1961:2001) %>%
  group_by(playerID) %>%
  mutate(PA = BB + AB) %>%
  summarize(G = sum(PA)/pa_per_game,
            BB = sum(BB)/G,
            singles = sum(H-X2B-X3B-HR)/G,
            doubles = sum(X2B)/G,
            triples = sum(X3B)/G,
            HR = sum(HR)/G,
            AVG = sum(H)/sum(AB),
            PA = sum(PA)) %>%
  filter(PA >= 300) %>%
  select(-G) %>%
  mutate(R_hat = predict(fit, newdata = .))

A useful example is how to actually pick the players for the team, which can be done using what computer scientists call linear programming:
library(reshape2)
library(lpSolve)

# players must also contain the debut, POS and salary columns used below
players <- players %>% filter(debut <= "1997-01-01" & debut > "1988-01-01")
constraint_matrix <- acast(players, POS ~ playerID, fun.aggregate = length)
npos <- nrow(constraint_matrix)
constraint_matrix <- rbind(constraint_matrix, salary = players$salary)
constraint_dir <- c(rep("==", npos), "<=")
constraint_limit <- c(rep(1, npos), 50*10^6)
lp_solution <- lp("max", players$R_hat,
                  constraint_matrix, constraint_dir,
                  constraint_limit,
                  all.int = TRUE)

Then we can use this algorithm to choose 9 players:
our_team <- players %>%
  filter(lp_solution$solution == 1) %>%
  arrange(desc(R_hat))
our_team %>% select(nameFirst, nameLast, POS, salary, R_hat)
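To make the constraint setup easier to follow, here is a toy version of the same idea with six hypothetical candidates. It assumes the lpSolve package is installed; the names, scores, salaries, and the salary cap of 15 are all made up, and all.bin = TRUE is used here so each candidate is either picked or not.

```r
library(lpSolve)

# Hypothetical candidates: two per position, with made-up scores and salaries
candidates <- data.frame(
  name   = c("A", "B", "C", "D", "E", "F"),
  pos    = c("C", "C", "1B", "1B", "SS", "SS"),
  score  = c(5.0, 4.0, 6.0, 5.5, 4.5, 4.0),
  salary = c(8, 3, 9, 4, 5, 2)
)

# One row per position (1 if the candidate plays it), plus a salary row
positions <- unique(candidates$pos)
pos_rows <- t(sapply(positions, function(p) as.numeric(candidates$pos == p)))
constraint_matrix <- rbind(pos_rows, candidates$salary)
constraint_dir    <- c(rep("==", length(positions)), "<=")
constraint_limit  <- c(rep(1, length(positions)), 15)   # salary cap of 15

# Maximize total score with binary pick / do not pick decisions
sol <- lp("max", candidates$score, constraint_matrix,
          constraint_dir, constraint_limit, all.bin = TRUE)
team <- candidates[round(sol$solution) == 1, ]
print(team)
```

Each position row forces exactly one pick for that position, and the last row keeps the total salary under the cap.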

We can actually see that these players all have above average BB and HR rates while the same is not true for singles:
my_scale <- function(x) (x - median(x))/mad(x)
players %>% mutate(BB = my_scale(BB),
                   singles = my_scale(singles),
                   doubles = my_scale(doubles),
                   triples = my_scale(triples),
                   HR = my_scale(HR),
                   AVG = my_scale(AVG),
                   R_hat = my_scale(R_hat)) %>%
  filter(playerID %in% our_team$playerID) %>%
  select(nameFirst, nameLast, BB, singles, doubles, triples, HR, AVG, R_hat) %>%
  arrange(desc(R_hat))

Regression Fallacy

Sophomore slump: refers to an instance in which a second effort fails to live up to the standard of the first effort.

The correlation between performance in two separate years is high but not perfect. Through further analysis we see that the sophomore slump is simply regression to the mean.
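A small simulation shows why the best first-year performers tend to look worse in their second year. The numbers below are made up (a stable talent level plus independent luck in each season) and are not real batting data.

```r
set.seed(7)
n <- 5000
# Each player has a stable talent level plus independent luck each season
talent <- rnorm(n, mean = 0.270, sd = 0.020)
year1  <- talent + rnorm(n, sd = 0.020)
year2  <- talent + rnorm(n, sd = 0.020)

# Correlation between the two seasons is well below 1
season_cor <- cor(year1, year2)
print(season_cor)

# Take the top 1% of first-year performers and compare their two seasons
top <- year1 >= quantile(year1, 0.99)
top_means <- c(year1 = mean(year1[top]), year2 = mean(year2[top]))
print(top_means)
```

The top group still beats the overall average in year two, just by less than in year one, with no change in talent at all.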

Measurement Error Models

Sometimes the covariates are not random (e.g. time), and the randomness in the data comes from measurement error. To check whether the fitted values match the data, we use augment() (from broom):
augment(fit) %>%
  ggplot() +
  geom_point(aes(time, y)) +
  geom_line(aes(time, .fitted))
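Here is a minimal base R sketch of a measurement error model for a dropped object. The starting height of 55 metres, the noise level, and the time grid are assumptions for illustration, not the values from the video.

```r
set.seed(8)
# Simulated drop: true height is 55 - 4.9 * t^2 metres, observed with noise
time <- seq(0, 3.2, length.out = 14)
observed_distance <- 55 - 4.9 * time^2 + rnorm(length(time), sd = 1)

# Time is fixed by design; only the measurement error is random
fit <- lm(observed_distance ~ time + I(time^2))
print(coef(fit))

# Fitted values to compare against the observations
fitted_distance <- fitted(fit)
print(round(cbind(time, observed_distance, fitted_distance), 2))
```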

Assessment: Regression and Baseball
1. What is the final linear model (in the video "Building a Better Offensive Metric for Baseball") we used to predict runs scored per game?
  lm(R ~ BB + singles + doubles + triples + HR)
2. We want to estimate runs per game scored by individual players, not just by teams. What summary metric do we calculate to help estimate this?
  pa_per_game <- Batting %>%
    filter(yearID == 2002) %>%
    group_by(teamID) %>%
    summarize(pa_per_game = sum(AB+BB)/max(G)) %>%
    .$pa_per_game %>%
    mean
  
  pa_per_game: the number of plate appearances per team per game, averaged across all teams
3. Imagine you have two teams. Team A is comprised of batters who, on average, get two bases on balls, four singles, one double, no triples, and one home run. Team B is comprised of batters who, on average, get one base on balls, six singles, two doubles, one triple, and no home runs. (For convenience, the coefficients for the model are as follows: BB 0.371, singles 0.519, doubles 0.771, triples 1.24, and home runs 1.44.). Which team scores more runs, as predicted by our model?
  RStudio on GitHub question 3: 2.4_RegressionBaseball_assessment.R
  Team B
4. The formula for on-base-percentage plus slugging percentage (OPS) is: OPS = BB/PA + (singles + 2*doubles + 3*triples + 4*HR)/AB. The OPS metric gives the most weight to:
  HR
5. What statistical concept properly explains the "sophomore slump"?
  Regression to the mean
6. In our model of time vs. observed_distance in the video "Measurement Error Models," the randomness of our data was due to:
  measurement error
7. Which of the following are important assumptions about the measurement errors in the experiment presented in the video "Measurement Error Models"?
  The measurement error is random
  The measurement error is independent
  The measurement error has the same distribution for each time
8. Which of the following scenarios would violate an assumption of our measurement error model?
  There was one position where it was particularly difficult to see the dropped ball.
RStudio on GitHub questions 9-11: 2.4_RegressionBaseball_assessment.R
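The Team A versus Team B comparison in question 3 can be checked by hand with the coefficients quoted there. This sketch leaves out the intercept, which is the same for both teams and so does not affect which team is predicted to score more.

```r
# Coefficients quoted in question 3
coefs <- c(BB = 0.371, singles = 0.519, doubles = 0.771, triples = 1.24, HR = 1.44)

# Average counts for each team, as given in the question
team_A <- c(BB = 2, singles = 4, doubles = 1, triples = 0, HR = 1)
team_B <- c(BB = 1, singles = 6, doubles = 2, triples = 1, HR = 0)

# Weighted sums of the counts (intercept omitted)
score_A <- sum(coefs * team_A)
score_B <- sum(coefs * team_B)
print(c(A = score_A, B = score_B))
```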
Correlation is Not Causation
Spurious Correlation

A spurious correlation is a connection between two variables that appears to be causal but is not. This can lead to misinterpreting associations.

These kinds of associations can arise from data dredging, data fishing, or data snooping (i.e. cherry picking). P-hacking is a particular form of data dredging: looking for small p-values and reporting only the experiments that produced them. There are statistical methods that adjust for this.
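The effect of running many tests can be simulated directly. The sketch below is a scaled-down version of the idea (2,000 tests instead of one million, with sample sizes I picked) in which the two variables are always independent.

```r
set.seed(9)
n_tests <- 2000   # far fewer than one million, to keep the run short
n <- 25

# Each test correlates two independent samples, so there is never a real relationship
p_values <- replicate(n_tests, cor.test(rnorm(n), rnorm(n))$p.value)

# Share of tests that look "significant" purely by chance
prop_significant <- mean(p_values <= 0.05)
print(prop_significant)
```

Roughly 5% of the tests should come out significant even though nothing is going on, which is why reporting only those is misleading.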

Outliers

Outliers can produce a high correlation even when there is no actual relationship. Spearman correlation computes the correlation on the ranks of the values, which makes it robust to outliers.
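A single extreme point is enough to show the difference between the two correlations. In this sketch the data are simulated and the outlier at (20, 20) is placed by hand.

```r
set.seed(10)
# Two independent variables, so there is no real relationship
x <- rnorm(100)
y <- rnorm(100)
# Replace the first observation with one extreme point
x[1] <- 20
y[1] <- 20

pearson  <- cor(x, y)
spearman <- cor(x, y, method = "spearman")
print(c(pearson = pearson, spearman = spearman))
```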

Reversing Cause and Effect

Cause and effect reversal means assuming the wrong direction of the effect (e.g. that a son's height affects his father's height).

Confounders

A confounder is a variable that causes changes in two other variables, making them correlated. The way to analyze confounders is to stratify the data by the suspected confounder and see whether the observed correlation changes.

Simpson's Paradox

This is when the sign of the correlation flips between the whole population and the stratified population, for example when a scatterplot of the whole population shows one trend but each stratum shows the opposite trend.
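A small simulation can make the flip concrete. The data below are made up: five strata whose averages rise together, while within each stratum y falls as x rises.

```r
set.seed(11)
# Five strata; the stratum means rise together, but within a stratum y falls with x
z <- rep(1:5, each = 100)
x <- 2 * z + rnorm(length(z))
y <- 3 * z - 0.8 * (x - 2 * z) + rnorm(length(z))

# Correlation in the whole population
overall <- cor(x, y)
# Correlation within each stratum
within <- sapply(split(data.frame(x, y), z), function(d) cor(d$x, d$y))

print(overall)
print(within)
```

The overall correlation is positive while every within-stratum correlation is negative.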

Assessment: Correlation is Not Causation
1. In the videos, we ran one million tests of correlation for two random variables, X and Y. How many of these correlations would you expect to have a significant p-value (p<=0.05), just by chance?
  50,000. In this example, the chance of finding a significant correlation when none exists is 0.05, so we expect 0.05 * 1,000,000 = 50,000 significant results.
2. Which of the following are examples of p-hacking?
  Looking for associations between an outcome and several exposures and only reporting the one that is significant.
  Trying several different models and selecting the one that yields the smallest p-value.
  Repeating an experiment multiple times and only reporting the one with the smallest p-value.
3. The Spearman correlation coefficient is robust to outliers because:
  It calculates correlation between ranks, not values.
4. What can you do to determine if you are misinterpreting results because of a confounder?
  More closely examine the results by stratifying and plotting the data.
  Although you can sometimes use linear models, you can't always, and exploratory data analysis (stratifying and plotting data) will help determine if there is a confounder.
5. Look again at the admissions data presented in the confounders video using ?admissions. What important characteristic of the table variables do you need to know to understand the calculations used in this video?
  The column admitted is the percent of students admitted, while the column applicants is the total number of applicants.
6. In the example in the confounders video, major selectivity confounds the relationship between UC Berkeley admission rates and gender because:
  Major selectivity is associated with both admission rates and with gender, as women tended to apply to more selective majors.
7. Admission rates at UC Berkeley are an example of Simpson’s Paradox because:
  It appears that men have a higher admission rate than women; however, after we stratify by major, we see that on average women have a higher admission rate than men.

Anja Wu
