
Lecture 7: Introduction to R for Hypothesis Testing
Modern data science uses free and open-source languages:
While Python is the most used language by computer engineers for web and app development, R has some advantages:
There are some key concepts you need to understand and to remember:
R is usually used via RStudio, and first-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard.
R: The engine

RStudio: The dashboard

In your web browser (Chrome, Firefox, …), go to: https://posit.cloud/
Most of the R code displayed in this lecture is included in these slides. Rather than typing it manually, open these slides in another tab to copy-paste the code.
Access these slides from the URL:
https://damien-dupre.github.io/STA1005/lectures/lecture_7
When you create a new project, you will see the following 3 windows (also called panes):
The last window, the Code Editor, opens when you create a new R Script.
The console displays
The Status of R is indicated by the symbol in the console prompt:
> means ready to process code
+ means incomplete command (escape with Esc)
The Console Doesn’t Save!
You can execute code by typing it directly into the Console. However, it will not be saved. And if you make a mistake you will have to re-type everything all over again.
Instead, write all your code in a document (R script or Quarto file) in the Code Editor.
The Environment tab of this pane shows you the names of all the data objects (like vectors, matrices, and data frames) that you have defined in your current R session.
You can also see information like the number of observations (rows) and variables (columns) in data objects.
The Files panel gives access to the file directory.
The Plots panel shows all your plots. There are buttons for opening the plot in a pop up and exporting as a pdf or jpeg.
The Packages panel shows a list of all the R packages installed on the local or remote machine and indicates whether or not they are currently loaded.
The Help panel gives access to the documentation of R functions, which contains the essential information needed to use them. Open the help page of a function by pressing F1 with the cursor on its name, by using the help() function, or by typing ? followed by the function name, such as:
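For instance, the last two routes can be tried on lm(), a function used later in this lecture. This is only a minimal sketch: any other function name could be used in place of lm.

```r
# Open the help page of lm() with the ? shortcut
?lm

# Same result with the help() function
help("lm")
```

Both lines should open the same documentation page in the Help panel of RStudio.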
The Code Editor makes the link between all the previous panes and allows you to reproduce actions and behaviours.
You can open as many R Script / Quarto files as you want.
These documents are the only documents that have to be saved. No need to save your data, figures and calculations as you can reproduce them every time instantaneously with the code.
Save your eyes and look like a nerd by changing the code’s appearance
.R is the extension for an R script (document including R code):
Lines starting with # are comments: non-active code that R ignores
All other lines are active code that R runs
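As a small sketch of the difference between the two (the object name example_sum is only an illustration):

```r
# This whole line is a comment: R does not run it
example_sum <- 1 + 1  # the code before the # symbol is run, this note is ignored

# Print the content of the object
example_sum
```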
.qmd is the extension for a Quarto file:
.html (web page, slides, books, and dashboards)
.pdf (academic LaTeX papers and reports)
.docx (MS Word documents)

In an R Script, place your cursor anywhere on the line you want to run and either:
press Ctrl + Enter (Cmd + Return on macOS), or
click the Run button on RStudio’s interface
R packages extend the functionality of R. They are written by a worldwide community of R users and can be downloaded for free from the internet.
A good analogy for R packages is the apps you can download onto a mobile phone.
R: A new phone

R Packages: Apps you can download

Say you have purchased a new phone: to use Instagram, you need to install the app once and then open it every time you want to use it.
The process is very similar for using an R package. You need to:
Install the package once with install.packages().
Load the package in every new session with library().
Once the package is loaded you can use all the functions from this package such as:
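The install-once, load-every-time routine could look like the sketch below, which uses the {praise} package from the exercise that follows. The requireNamespace() check is an addition made here so that the code also runs when the package has not been installed yet; it is not part of the usual workflow.

```r
# Step 1: install once (uncomment the next line the first time)
# install.packages("praise")

# Check whether the package is available on this machine
praise_installed <- requireNamespace("praise", quietly = TRUE)

# Step 2: load the package in every new session, then use its functions
if (praise_installed) {
  library(praise)
  print(praise())
} else {
  message("The {praise} package is not installed yet: run install.packages(\"praise\") first.")
}
```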
Open an R Script in RStudio. In this document:
Install and load the {praise} package
Run the function praise() as it is, without arguments
Functions are algorithms (or lines of code) which transform data to something else. For example, the function lm(), uses data to compute the result of a linear regression model.
Functions have a name and several arguments that require some information.
For example, the function seq() makes a sequence of numbers:
from is the number starting the sequence
to is the last number of the sequence
Warning
Arguments don’t need to be explicitly called, they can also be matched by position:
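The two ways of calling seq() can be compared side by side. A minimal sketch, where the values 1 and 5 and the object names are chosen only for illustration:

```r
# Arguments called explicitly by name
named_call <- seq(from = 1, to = 5)

# Same arguments matched by position: the first is from, the second is to
positional_call <- seq(1, 5)

named_call
positional_call

# Check that both calls give the same sequence
identical(named_call, positional_call)
```

Named arguments can be given in any order, for example seq(to = 5, from = 1), whereas arguments matched by position must follow the order expected by the function.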
Usually the arguments of functions expect an Object name to access the data.
An object is a box that can include anything (e.g., values, dataframes, figures, models, functions, …) and has a name that you have to choose.
To create an object, you need to assign something to a name using the <- operator. If you type the name of the object, R will print out its content.
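A minimal sketch of this assignment, where my_object is a hypothetical name:

```r
# Assign the value 10 to an object named my_object
my_object <- 10

# Typing the name of the object prints its content
my_object

# The object can then be used in place of the value it contains
my_object * 2
```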
It is very important to distinguish values and objects in R:
| Type | Class | Example |
|---|---|---|
| Number | Numeric Value | 1, 2, ... |
| Word with quotes | Character Value | "one", "two", ... |
| Word without quotes | Object Name | function name, data name, ... |
These types of values are then stored in different objects:
All object assignments have the same form:
You want your object names to be descriptive, so you will need a convention for multiple words. I recommend snake_case where you separate lower-case words with _.
numeric_value <- 1
character_value <- "one"
vectors_with_numeric_values <- c(1, 2)
vectors_with_character_values <- c("one", "two")
dataframe_example <- data.frame(col1 = c("one", "two"), col2 = c(1, 2))
dataframe_example <- data.frame(
  col1 = vectors_with_character_values,
  col2 = vectors_with_numeric_values
)
In the same R Script in RStudio, copy, paste, and run the following code:
my_power <- c(0.5, 99.5)
my_knowledge <- c("without R", "with R")
barplot(height = my_power, names.arg = my_knowledge)
Once it works, try changing the values in my_power and my_knowledge to customise the chart.
Posit Cloud runs on a free remote computer: the computing does not happen on your own machine.
To open data on Posit Cloud, you first need to Upload your file to this remote computer and then Import the data into R.
Step 1: Upload your File

Step 2: Import your Data

Remember that .csv files are basically text files
For early beginners on the Desktop version, directly open data with RStudio’s Import Dataset button.
If you see your data in the preview, you can click Import to create an object containing your data. The code that performs the import is then executed in the Console. Copy and paste the first line of this code into your R script so that you do not have to import the data manually next time.
To ensure code reproducibility, open data with the appropriate function (e.g., read.csv() for csv files).
The main argument of these functions is file which corresponds to the path to a file, followed by the name of the file and its extension:
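The sketch below shows the file argument in action. Because no data file can be assumed to exist here, it first writes a tiny text file in a temporary folder; the file name demo_data.csv and its content are made up for illustration.

```r
# Create a small text file so that the example runs on its own
# (demo_data.csv and its content are made up for illustration)
demo_path <- file.path(tempdir(), "demo_data.csv")
writeLines(c("name,score", "a,1", "b,2"), demo_path)

# file = path to the file, followed by its name and extension, in quotes
demo_data <- read.csv(file = demo_path)
demo_data
```

If organisation_beta.csv sits in the working directory of your project, the equivalent call would presumably be read.csv(file = "organisation_beta.csv").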
The following code will generate an error:
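Two typical mistakes are sketched here. The calls are wrapped in try() only so that the script keeps running after each error, and the sketch assumes that no object called organisation_beta.csv exists in your session.

```r
# Mistake 1: file name without quotes, R looks for an object with that name
no_quotes <- try(read.csv(file = organisation_beta.csv), silent = TRUE)

# Mistake 2: file name without its extension, no file with that exact name exists
# (a temporary folder is used so that the file is certain to be missing)
no_extension <- try(
  suppressWarnings(read.csv(file = file.path(tempdir(), "organisation_beta"))),
  silent = TRUE
)

# Both calls failed
inherits(no_quotes, "try-error")
inherits(no_extension, "try-error")

# Error message of the first mistake
cat(no_quotes)
```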
Click on Upload to upload the “organisation_beta.csv” file on your Posit Cloud
Click on Import to import these data in R
Check that the data appears in the Environment pane. How many observations and variables does it have?
Copy the code that ran and paste it in your R script
Usually, only R Script files (.R) or Quarto files (.qmd) have to be saved as they allow the full replicability of transformations and results.
However, if you want to use the data that have been transformed, joined or pivoted, a function has to be used according to the type of export.
The simplest export is a .csv file with the function write.csv(). It has two main arguments:
x which is the name of the object to save
file which is the name of the output file
Note: don’t forget the file extension in the argument file
# Saved in the current directory
write.csv(x = my_file_object, file = "my_file_name.csv")
write.csv(my_file_object, "my_file_name.csv")
# Saved in the directory you prefer
write.csv(my_file_object, "C:/path/to/my/my_file_name.csv") # Windows
write.csv(my_file_object, "/Users/path/to/my/my_file_name.csv") # macOS
Save the data contained in the object “organisation_beta” in a new .csv file (give it a different name than “organisation_beta.csv”, otherwise the original file will be overwritten)
Plenty of free learning materials are available online:
Video tutorials on YouTube, TikTok
Interactive tutorials, see for example:
Book tutorials, see for example:
Find all the free books in the Big Book of R!
Check the structure of an object with str(ObjectName)
Check the documentation of a function (F1 or ?)
A model contains:
To evaluate their relationship with the outcome, each effect hypothesis is associated with a coefficient called Estimate and represented with \(b\) as follows:
\[Outcome = b_0 + b_1 Pred1 + b_2 Pred2 + b_3 Pred1 * Pred2 + e\]
Testing for the significance of the effect means evaluating if this estimate \(b\) value is significantly different from, higher than, or lower than 0, as hypothesised in \(H_a\) by the scientist.
The lm() function calculates each estimate and tests them against 0 for you.
lm() has only two arguments that you should care about: formula and data.
formula is the translation of the equation of the model
data is the name of the data frame object containing the variables
Here is a generic example:
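A generic call could look like the sketch below. The data frame my_data and its values are made up so that the code runs on its own; only the names Outcome, Pred1 and Pred2 come from the equations of this lecture.

```r
# Made-up data, only to make the example runnable
my_data <- data.frame(
  Outcome = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2),
  Pred1   = c(1, 2, 3, 4, 5, 6),
  Pred2   = c(3, 1, 4, 1, 5, 9)
)

# formula: Outcome according to Pred1 and Pred2; data: the data frame object
my_model <- lm(formula = Outcome ~ Pred1 + Pred2, data = my_data)

# Print the estimates b0, b1 and b2
my_model
```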
Here is an example with organisation_beta.csv:
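A possible version of this example is sketched below. It assumes that organisation_beta.csv has been uploaded to the working directory and that it contains the variables js_score, salary and perf used in the equations of this lecture. The file.exists() check is added here only so that the code does not stop with an error when the file is missing.

```r
# Assumed location of the file: the working directory of the project
csv_path <- "organisation_beta.csv"

if (file.exists(csv_path)) {
  # Import the data, then test the model on it
  organisation_beta <- read.csv(file = csv_path)
  model_js <- lm(formula = js_score ~ salary + perf, data = organisation_beta)
  print(model_js)
} else {
  message("Upload organisation_beta.csv to the working directory first.")
}
```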
lm() has only one difficulty, the formula. The formula is the direct translation of the equation tested but with its own representation:
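the = sign is replaced by ~ (read “according to” or “by”)
predictors are added with the + sign
an interaction between two predictors is written with : instead of *
Here are some generic equations and their conversion in formula: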
\[Outcome = b_0 + b_1 Pred1 + b_2 Pred2 + e\]
\[Outcome = b_0 + b_1 Pred1 + b_2 Pred2 + b_3 Pred3 + e\]
\[Outcome = b_0 + b_1 Pred1 + b_2 Pred2 + b_3 Pred1*Pred2 + e\]
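The three equations above can be written as R formulas as sketched below. A formula can be stored in an object without any data, so this code runs on its own; the object names are chosen only for illustration.

```r
# Two main effects
formula_two <- Outcome ~ Pred1 + Pred2

# Three main effects
formula_three <- Outcome ~ Pred1 + Pred2 + Pred3

# Two main effects and their interaction, written with :
formula_interaction <- Outcome ~ Pred1 + Pred2 + Pred1:Pred2

# List the terms that R reads in the last formula
attr(terms(formula_interaction), "term.labels")
```

The intercept \(b_0\) and the error \(e\) do not appear in the formula: lm() adds the intercept by default and the residuals are computed from the fitted model.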
Here are some equations from the organisation_beta.csv dataset and their conversion in formula:
\[js\_score = b_0 + b_1 salary + b_2 perf + e\]
\[js\_score = b_0 + b_1 salary + b_2 perf + b_3 salary * perf + e\]
Test the following models in Posit Cloud:
\[js\_score = b_0 + b_1 salary + b_2 gender + e\]
\[js\_score = b_0 + b_1 salary + b_2 gender + b_3 salary * gender + e\]
Exactly as in Jamovi, lm() by default investigates continuous predictors or categorical predictors having 2 categories:
However, to test the hypothesis of a categorical predictor having 3 or more categories, the ANOVA omnibus test is required.
It can be obtained by using the aov() function with the lm model as input:
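The sketch below shows this sequence on iris, a dataset that ships with R and is used here only as a stand-in: its variable Species has 3 categories and plays the role of a predictor such as location.

```r
# Step 1: compute the model with lm()
model_species <- lm(formula = Sepal.Length ~ Species, data = iris)

# Step 2: use the lm model as input of aov() to obtain the omnibus test
anova_species <- aov(model_species)

# Step 3: display the ANOVA table
summary(anova_species)
```

The row for Species has 2 degrees of freedom (3 categories minus 1) and a single F-test for the predictor as a whole.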
While the function lm() computes the model, the function summary() displays the results
Call:
lm(formula = js_score ~ salary + gender, data = organisation_beta)
Residuals:
Min 1Q Median 3Q Max
-2.04185 -0.49565 0.06529 0.61611 1.71635
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.1865081 7.6889114 -6.397 0.00000663 ***
salary 0.0018837 0.0002575 7.316 0.00000121 ***
gendermale -0.5946699 0.4055990 -1.466 0.161
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8854 on 17 degrees of freedom
Multiple R-squared: 0.7613, Adjusted R-squared: 0.7333
F-statistic: 27.12 on 2 and 17 DF, p-value: 0.000005141
The output of the summary() function is pretty dense, but let’s analyse it line by line.
The first line reminds us of what the actual regression model is:
Call:
lm(formula = js_score ~ salary + gender, data = organisation_beta)
The next part provides a quick summary of the residuals (i.e., the \(e\) values),
Residuals:
Min 1Q Median 3Q Max
-2.04185 -0.49565 0.06529 0.61611 1.71635
This can be convenient as a quick check that the model is okay. Linear regression assumes that these residuals are normally distributed, with mean 0. In particular it’s worth quickly checking to see if the median is close to zero, and to see if the first quartile is about the same size as the third quartile. If they look badly off, there’s a good chance that the assumptions of regression are violated.
The next part of the R output looks at the coefficients of the regression model:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.1865079 7.6889114 -6.397 0.00000663 ***
salary 0.0018837 0.0002575 7.316 0.00000121 ***
gendermale -0.5946699 0.4055990 -1.466 0.161
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Each row in this table refers to one of the coefficients estimated in the regression model.
The first row is the intercept term, and the later ones look at each of the predictors. The columns give you all of the relevant information:
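The link between the columns Estimate, Std. Error, t value and Pr(>|t|) can be checked by hand. The sketch below copies the values of the gendermale row from the table above and uses the 17 residual degrees of freedom reported in the summary() output; the object names are chosen only for illustration.

```r
# Values copied from the gendermale row of the coefficients table
estimate  <- -0.5946699
std_error <- 0.4055990
df_resid  <- 17  # residual degrees of freedom reported in the output

# t value = Estimate / Std. Error
t_value <- estimate / std_error

# Two-sided p-value: probability of a t at least this far from 0
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(t_value, 3)
round(p_value, 3)
```

The two rounded results should agree with the t value and Pr(>|t|) columns of the gendermale row.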
The only thing that the previous table doesn’t list is the degrees of freedom used in the t-test, which is always N−K−1 and is listed immediately below, in this line:
Residual standard error: 0.8854 on 17 degrees of freedom
The value of df = 17 is equal to N−K−1 (here N = 20 observations and K = 2 predictors), so that’s what we use for our t-tests. In the final part of the output we have the F-test and the R² values, which assess the performance of the model as a whole:
Multiple R-squared: 0.7613, Adjusted R-squared: 0.7333
F-statistic: 27.12 on 2 and 17 DF, p-value: 0.000005141
So in this case, the model performed significantly better than you’d expect by chance (F(2,17) = 27.12, p < 0.001), which isn’t all that surprising: the R² = 0.7613 value indicates that the regression model accounts for 76.1% of the variability in the outcome measure (adjusted R² = 0.7333).
When we look back up at the t-tests for each of the individual coefficients, we have pretty strong evidence that salary has a significant effect (p < 0.001), whereas the effect of gender is not significant (p = 0.161).
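The model-level numbers can also be checked by hand from the values printed in the output. In the sketch below, N = 20 is not printed anywhere: it is inferred from the 17 residual degrees of freedom, the 2 predictors and the intercept. The adjusted R² uses the usual formula 1 − (1 − R²)(N − 1)/(N − K − 1).

```r
# Values read from (or inferred from) the summary() output
n_obs     <- 20      # inferred: 17 residual df + 2 predictors + 1 intercept
k_pred    <- 2       # number of predictors in the model
r_squared <- 0.7613  # Multiple R-squared
f_value   <- 27.12   # F-statistic

# Residual degrees of freedom: N - K - 1
df_resid <- n_obs - k_pred - 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# p-value of the F-test on K and N - K - 1 degrees of freedom
f_p_value <- pf(f_value, df1 = k_pred, df2 = df_resid, lower.tail = FALSE)

df_resid
round(adj_r_squared, 3)
f_p_value
```

Small differences in the last decimals are expected because the inputs are the rounded values printed by summary().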
To communicate about your statistical analyses in an academic report, the simplest method is to find the values in the summary() output and to copy-paste them into the text, following the reporting format seen in the previous lectures.
However, this task can be long, difficult and lead to human errors. Thankfully, R has additional packages that provide alternative functions to read linear regression models and communicate results.
Because there are too many packages, I will focus only on two additional packages: {performance} and {report}.
To install {performance} use the usual install.packages() function:
The package {performance} will print visualisations that let you check the model’s assumptions (see https://easystats.github.io/performance/).
To print the performance diagnosis:
Load the package {performance}
Create an object with the result of lm()
Use this object in the function check_model() from the {performance} package
To install {report} use the usual install.packages() function:
The package {report} will print a text containing all the statistics already in sentences ready to be interpreted (see https://easystats.github.io/report/).
To print the statistical analyses:
Load the package {report}
Create an object with the result of lm()
Use this object in the function report() from the {report} package
Note: If used in a Quarto document, the chunk containing report() has to include the chunk option results='asis'
We fitted a linear model (estimated using OLS) to predict js_score with salary and perf (formula: js_score ~ salary + perf). The model explains a statistically significant and substantial proportion of variance (R2 = 0.74, F(2, 17) = 24.16, p < .001, adj. R2 = 0.71). The model’s intercept, corresponding to salary = 0 and perf = 0, is at -49.49 (95% CI [-66.60, -32.37], t(17) = -6.10, p < .001). Within this model:
Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using a Wald t-distribution approximation.
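The steps for {performance} and {report} can be sketched together as below. Several choices are assumptions made here so that the code runs anywhere: the model uses mtcars, a dataset that ships with R, as a stand-in for organisation_beta; the requireNamespace() checks skip a step when a package is not installed; and the {see} package is included because, as far as I know, check_model() relies on it to draw its plots.

```r
# Install once (uncomment on first use)
# install.packages(c("performance", "see", "report"))

# Create an object with the result of lm()
# (mtcars ships with R and is used here only as a stand-in dataset)
model_cars <- lm(formula = mpg ~ wt + hp, data = mtcars)

# Assumption checks with check_model() from {performance}
if (requireNamespace("performance", quietly = TRUE) &&
    requireNamespace("see", quietly = TRUE)) {
  library(performance)
  print(check_model(model_cars))
}

# Text report with report() from {report}
if (requireNamespace("report", quietly = TRUE)) {
  library(report)
  print(report(model_cars))
}
```

With your own data, replace the formula and the data argument with the model you want to test, for example js_score ~ salary + perf and organisation_beta.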
In Posit Cloud, check the check_model() and report() output from the lm() function testing the following models:
\[js\_score = b_0 + b_1 salary + b_2 gender + e\]
\[js\_score = b_0 + b_1 salary + b_2 gender + b_3 salary * gender + e\]
\[js\_score = b_0 + b_1 salary + b_2 location + b_3 salary * location + e\]
Thanks for your attention
and don’t hesitate to ask if you have any questions!
