class: center, middle, inverse, title-slide .title[ # MT611 - Quantitative Research Methods ] .subtitle[ ## Lecture 5: Categories in the General Linear Model ] .author[ ### Damien Dupré ] .date[ ### Dublin City University ]
---
# Vocabulary "Linear Model", "Linear Regression", "Multiple Regression" or simply "Regression" all refer to the same model: **The General Linear Model**. It contains: - Only one Outcome/Dependent Variable - One or more Predictor/Independent Variables of any type (categorical or continuous) - Main and/or Interaction Effects `$$Y = b_{0} + b_{1}\,Predictor\,1 + b_{2}\,Predictor\,2 + ... + b_{n}\,Predictor\,n + e$$` A Linear Regression is used **to test all the hypotheses at once** and to calculate the predictors' estimates. Specific tests are available for certain types of hypotheses, such as the t-test or ANOVA, but as they are special cases of Linear Regressions, their importance is limited (see [Jonas Kristoffer Lindeløv's blog post: Common statistical tests are linear models](https://lindeloev.github.io/tests-as-linear/)).
---
# General Linear Model Everywhere .pull-left[ Most of the common statistical models (t-test, correlation, ANOVA, chi-square, etc.) are **special cases of linear models**. This beautiful simplicity means that there is less to learn. In particular, it all comes down to `\(y = ax + b\)`, which most students know from secondary school. Unfortunately, **stats intro courses are usually taught as if each test is an independent tool**, needlessly making life more complicated for students and teachers alike. Here, only **one test is taught to rule them all**: the General Linear Model (GLM). ] .pull-right[ <img src="https://psyteachr.github.io/msc-data-skills/images/memes/glm_meme.png" width="100%" style="display: block; margin: auto;" /> ]
---
# Analysis of the Estimate Once the best line is found, each estimate of the tested equation is calculated by the software (i.e., `\(b_0, b_1, ..., b_n\)`). - `\(b_0\)` is the intercept and is of no interest for hypothesis testing - `\(b_1, ..., b_n\)` are the predictors' effect estimates and each of them is used to test a hypothesis The predictors' effect estimates `\(b_1, ..., b_n\)` are **the values of the slope of the best line between each predictor** and the outcome.
It indicates **by how many units the outcome variable increases/decreases/changes when the predictor increases by 1 unit**. Technically, `\(b\)` is a weight or multiplier applied to the Predictor's values to obtain the Outcome's expected values.
---
# Analysis of the Estimate - If `\(b_1, ..., b_n = 0\)`, then: - The regression line is horizontal (no slope) - When the Predictor increases by 1 unit, the Outcome variable does not change - **The null hypothesis is not rejected** -- - If `\(b_1, ..., b_n > 0\)`, then: - The regression line is positive (slope up) - When the Predictor increases by 1 unit, the Outcome variable increases by `\(b\)` - **The null hypothesis is rejected and the alternative hypothesis is considered plausible** -- - If `\(b_1, ..., b_n < 0\)`, then: - The regression line is negative (slope down) - When the Predictor increases by 1 unit, the Outcome variable decreases by `\(b\)` - **The null hypothesis is rejected and the alternative hypothesis is considered plausible**
---
# Significance of Effect's Estimate The statistical significance of an effect estimate depends on the **strength of the relationship** and on the **sample size**: - An estimate of `\(b_1 = 0.02\)` can be very small but still significantly different from `\(b_1 = 0\)` - Whereas an estimate of `\(b_1 = 0.35\)` can be larger but in fact not significantly different from `\(b_1 = 0\)` -- The significance is the probability of obtaining your results with your sample under the null hypothesis scenario: - Also called the `\(p\)`-value - It ranges from 0% to 100%, which corresponds to a value between 0.0 and 1.0 **If the `\(p\)`-value is lower than 5% (i.e., 0.05), then the probability of obtaining your results under the null hypothesis scenario is low enough to reject the null hypothesis and to conclude that there must be a link between the variables.** -- Remember that the `\(p\)`-value is the probability of the data given the null hypothesis: `\(P(data|H_0)\)`.
---
class: inverse, mline, center, middle # 1. Hypotheses with Categorical Predictors having 2 Categories
---
# Hypotheses with Categorical Predictors We will take a deep dive into the handling of Categorical predictor variables in linear regressions: - How to analyse a Categorical predictor with only 2 categories? - How to analyse a Categorical predictor with more than 2 categories? <img src="https://media.makeameme.org/created/what-if-i-8de3498d3a.jpg" width="50%" style="display: block; margin: auto;" />
---
# Example of Categorical Coding .pull-left[ Imagine we sample male and female employees to see if the difference between their job satisfaction averages is due **to sampling luck or reflects a real difference in the population**.
That is, **is the difference between male and female employees statistically significant?** ] .pull-right[ <table> <thead> <tr> <th style="text-align:right;"> employee </th> <th style="text-align:left;"> gender </th> <th style="text-align:right;"> js_score </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 5.057311 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 6.642440 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 6.119694 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 9.482198 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 8.883347 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 7.015606 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 4.633738 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 7.919998 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 9.028004 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 5.860449 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 10.000000 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 3.617721 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 6.948510 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 7.429012 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 7.292992 </td> </tr> <tr> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 7.765043 </td> </tr> <tr> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 6.380634 </td> </tr> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 5.962925 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 5.607226 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 4.635931 </td> </tr> </tbody> </table> ] --- # Example of Categorical Coding .pull-left[ .center[Using a Categorical variable having 2 category, **e.g., comparing female vs. male** ...] <img src="lecture_5_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ .center[... 
is the same as **comparing female coded 1 and male coded 2**] <img src="lecture_5_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" /> ] By default, Categorical variables **are coded using the alphabetical order** (e.g., here Female first then Male) using 1, 2, 3 and so on. However, you can recode the variable yourself with your own order by creating a new variable using IF statement (e.g., `IF(gender == "female", 2, 1)` in Jamovi) --- # Categorical Coding in Linear Regression .pull-left[ .center[Default: **Female = 1** and **Male = 2**] <img src="lecture_5_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" /> <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 7.36 </td> <td style="text-align:right;"> 1.34 </td> <td style="text-align:right;"> 5.50 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> -0.34 </td> <td style="text-align:right;"> 0.80 </td> <td style="text-align:right;"> -0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ] .pull-right[ .center[Manual: **Male = 1** and **Female = 2**] <img src="lecture_5_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" /> <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.34 </td> <td style="text-align:right;"> 1.19 </td> <td style="text-align:right;"> 5.34 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> 0.34 </td> <td style="text-align:right;"> 0.80 </td> <td style="text-align:right;"> 0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ] --- # Categorical Predictor with 2 Categories Let's use another example with the `organisation_beta.csv` file ### Variable transformation Instead of using `\(salary\)` as a **continuous variable**, let's convert it as `\(salary\_c\)` which is a **categorical variable**: - Everything higher than or equal to salary average is labelled "**high**" salary - Everything lower than salary average is labelled "**low**" salary ### Hypothesis The `\(js\_score\)` of employees having a **high** `\(salary\_c\)` is different than the `\(js\_score\)` of employees having a **low** `\(salary\_c\)` ### In mathematical terms `$$H_a: \mu(js\_score)_{high\,salary} \neq \mu(js\_score)_{low\,salary}$$` `$$H_0: \mu(js\_score)_{high\,salary} = \mu(js\_score)_{low\,salary}$$` --- # Categorical Predictor with 2 Categories An hypothesis of differences between two groups is easily tested with a Linear Regression: - If `\(\mu_{1} \neq \mu_{2}\)`, the slope of the line between these averages is not null (i.e., `\(b_{1} \neq 
0\)`) - If `\(\mu_{1} = \mu_{2}\)`, the slope of the line between these averages is null (i.e., `\(b_{1} = 0\)` ) ### Explanation .pull-left[ **Comparing the difference between two averages is the same as comparing the slope of the line crossing these two averages** - If two averages are **not equal**, then **the slope of the line crossing these two averages is not 0** - If two averages are **equal**, then the **slope of the line crossing these two averages is 0** ] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-10-1.png" width="288" style="display: block; margin: auto;" /> ] --- # Categorical Predictor with 2 Categories ###
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Warning JAMOVI and other software **automatically code categorical variables in alphabetical order**, but sometimes you need to change these codes. .pull-left[ For example, here **low coded with the value 1** and **high coded with the value 2** would make more sense. The way categorical variables are coded influences the sign of the estimate (positive vs. negative), but **it doesn't change the value of the statistical test** nor the `\(p\)`-value obtained ] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-11-1.png" width="288" style="display: block; margin: auto;" /> ]
---
# Categorical Predictor with 2 Categories ### To sum up **To test the influence of a categorical predictor** variable, either nominal or ordinal, **having two categories** (e.g., high vs. low, male vs. female, France vs. Ireland), it is possible to **test if the `\(b\)` associated with this predictor is significantly higher, lower, or different from 0**. ### Equation `$$js\_score = b_{0} + b_{1}\,salary\_c + e$$` ### Communicating results Exactly the same template as for Continuous Predictors: > The predictions provided by the alternative model are significantly better than those provided by the null model ( `\(R^2 = .31\)`, `\(F(1, 18) = 8.27\)`, `\(p = .010\)`). > The effect of `\(salary\_c\)` on `\(js\_score\)` is statistically significant, therefore `\(H_0\)` can be rejected ( `\(b = 1.87\)`, 95\% CI `\([0.50, 3.24]\)`, `\(t(18) = 2.88\)`, `\(p = .010\)`).
---
class: title-slide, middle ## Testing Main Effects with Categorical Predictors
---
# Testing Categorical Predictors ### In JAMOVI 1. Open your file 2. Set variables in their **correct type** (continuous, cat. nominal or cat. ordinal) 3. **Analyses > Regression > Linear Regression** 4. Set `\(js\_score\)` as DV (i.e., Outcome) and `\(salary\_c\)` as Factors (i.e., Categorical Predictor) <img src="https://raw.githubusercontent.com/damien-dupre/img/main/jamovi_lm_main_c2.png" width="100%" style="display: block; margin: auto;" />
---
# Testing Categorical Predictors ### Model > The prediction provided by the model with all predictors is significantly better than a model without predictors (** `\(R^2 = .31\)`, `\(F(1, 18) = 8.27\)`, `\(p = .010\)`**). ### Hypothesis with Default Coding (high = 1 vs. low = 2) > The effect of `\(salary\_c\)` on `\(js\_score\)` is statistically significant, therefore `\(H_{0}\)` can be rejected (** `\(b = -1.87\)`, 95\% CI `\([-3.24, -0.50]\)`, `\(t(18) = -2.88\)`, `\(p = .010\)`**). ### Hypothesis with Manual Coding (low = 1 vs. high = 2) > The effect of `\(salary\_c\)` on `\(js\_score\)` is statistically significant, therefore `\(H_{0}\)` can be rejected (** `\(b = 1.87\)`, 95\% CI `\([0.50, 3.24]\)`, `\(t(18) = 2.88\)`, `\(p = .010\)`**).
---
# Coding of Categorical Predictors Choosing 1 and 2 is **just an arbitrary choice of numerical values**: any other possibility will produce the same `\(p\)`-value. However, choosing codes separated by 1 is handy because it is easily interpretable: the **estimate corresponds to the change from one category to another**: > The `\(js\_score\)` of "high" `\(salary\_c\)` employees is 1.87 higher than the `\(js\_score\)` of "low" `\(salary\_c\)` employees (when "low" is coded 1 and "high" coded 2).
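---
# Coding of Categorical Predictors

The same point can be checked outside JAMOVI. Below is a minimal sketch in R (the language these slides are built with), using a small **simulated** dataset rather than the lecture data: the estimate of a 2-category predictor coded 1 vs. 2 is the difference between the two group means, and swapping which category gets 1 and which gets 2 only flips the sign of the estimate, not the test statistic or the `\(p\)`-value.

```r
# Hypothetical data: 10 "low" and 10 "high" salary employees (not organisation_beta.csv)
set.seed(42)
js_score <- c(rnorm(10, mean = 6), rnorm(10, mean = 8))
salary_c <- rep(c("low", "high"), each = 10)

code_a <- ifelse(salary_c == "high", 1, 2)  # default-style coding: high = 1, low = 2
code_b <- ifelse(salary_c == "low", 1, 2)   # manual coding: low = 1, high = 2

summary(lm(js_score ~ code_a))  # estimate = mean(low) - mean(high), negative here
summary(lm(js_score ~ code_b))  # same magnitude, same t and p-value, opposite sign
```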
---
# Coding of Categorical Predictors ### Special case called **Dummy Coding**: one category is coded 0 and the other 1 - The intercept (the value of `\(js\_score\)` when the predictor is 0) corresponds to the average of the category coded 0 - The test of the intercept is the test of the average of the category coded 0 against an average of 0 - This is called a simple effect ### Special case called **Deviation Coding**: one category is coded 1 and the other -1 - The intercept corresponds to the average of the two categories - The test of the intercept is the test of the overall average of the variable - However, the distance between 1 and -1 is 2 units, so the estimate is not as easy to interpret; therefore it is possible to choose categories coded 0.5 vs. -0.5 instead
---
# Dummy Coding in Linear Regression Dummy Coding is when one category is coded 0 and the other coded 1. For example, in JAMOVI recode female as 0 and male as 1 (Dummy Coding): ``` IF(gender == "female", 0, 1) ``` Dummy Coding is useful because one of the categories becomes the intercept and is tested against 0. .pull-left[ <img src="lecture_5_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 7.02 </td> <td style="text-align:right;"> 0.62 </td> <td style="text-align:right;"> 11.33 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> -0.34 </td> <td style="text-align:right;"> 0.80 </td> <td style="text-align:right;"> -0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ]
---
# Deviation Coding in Linear Regression Deviation Coding is when the intercept is situated between the codes of the categories. For example, in JAMOVI recode female as -1 and male as 1 (Deviation Coding): ``` IF(gender == "female", -1, 1) ``` Deviation Coding is useful because **the average of the categories becomes the intercept** and is tested against 0. However, with a Deviation Coding **using -1 vs. +1, the distance between the categories is 2**, not 1. Therefore, even if the test of the slope is exactly the same, the value of the slope (the estimate) is halved. Consequently, it is possible to use **a Deviation Coding with -0.5 vs. +0.5 to keep the distance of 1** between the categories.
For example, in JAMOVI recode female as -0.5 and male as 0.5 (Deviation Coding): ``` IF(gender == "female", -0.5, 0.5) ```
---
# Deviation Coding in Linear Regression .pull-left[ .center[Female = -1 and Male = 1] <img src="lecture_5_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" /> <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.85 </td> <td style="text-align:right;"> 0.4 </td> <td style="text-align:right;"> 17.12 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> -0.17 </td> <td style="text-align:right;"> 0.4 </td> <td style="text-align:right;"> -0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ] .pull-right[ .center[Female = -0.5 and Male = 0.5] <img src="lecture_5_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" /> <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.85 </td> <td style="text-align:right;"> 0.4 </td> <td style="text-align:right;"> 17.12 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> -0.34 </td> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> -0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ]
---
class: title-slide, middle ## Testing Interaction Effects with Categorical Predictors
---
# Interaction with Categorical Predictors ### In JAMOVI 1. Open your file 2. Set variables according to their type 3. **Analyses > Regression > Linear Regression** 4. Set `\(js\_score\)` as DV and `\(salary\_c\)` as well as `\(gender\)` as Factors 5. In the **Model Builder** option: - Select both `\(salary\_c\)` and `\(gender\)` to bring them into the Factors at once ### Model Tested `$$js\_score = b_{0} + b_{1}\,salary\_c + b_{2}\,gender + b_{3}\,salary\_c*gender + e$$` Note: The test of the interaction effect corresponds to the test of a variable resulting from the multiplication between the codes of `\(salary\_c\)` and the codes of `\(gender\)`.
---
# Interaction with Categorical Predictors <img src="https://raw.githubusercontent.com/damien-dupre/img/main/jamovi_lm_main_cint.png" width="100%" style="display: block; margin: auto;" />
---
class: title-slide, middle ## Live Demo
---
class: title-slide, middle ## Exercise With the `organisation_beta.csv` data, test the following models and conclude on each effect: Model 1: `\(js\_score = b_{0} + b_{1}\,perf + b_{2}\,gender + b_{3}\,perf*gender + e\)` Model 2: `\(js\_score = b_{0} + b_{1}\,perf + b_{2}\,location + b_{3}\,perf*location + e\)`
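---
# Interaction with Categorical Predictors

To illustrate the note above, here is a minimal R sketch on **simulated** data (hypothetical 0/1 codes, not the `organisation_beta.csv` values): the interaction term is simply the product of the codes of the two categorical predictors, so adding that product as a third predictor by hand gives exactly the same estimate and `\(p\)`-value as the built-in `salary_c*gender` term.

```r
# Hypothetical dummy-coded predictors: 0/1 for salary_c and gender
set.seed(1)
n        <- 40
salary_c <- rbinom(n, 1, 0.5)  # 0 = low, 1 = high
gender   <- rbinom(n, 1, 0.5)  # 0 = female, 1 = male
js_score <- 5 + 1.5 * salary_c + 0.5 * gender + 1 * salary_c * gender + rnorm(n)

interaction_code <- salary_c * gender  # hand-made interaction variable

coef(summary(lm(js_score ~ salary_c * gender)))                     # built-in interaction term
coef(summary(lm(js_score ~ salary_c + gender + interaction_code)))  # identical last row
```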
--- class: inverse, mline, center, middle # 2. Hypotheses with Categorical Predictor having 3+ Categories --- # Categorical Predictor with 3+ Categories ### Problem with more than 2 groups I would like to test the effect of the variable `\(location\)` which has 3 categories: "Ireland", "France" and "Australia". <img src="https://raw.githubusercontent.com/damien-dupre/img/main/jamovi_lm_main_c31.png" style="display: block; margin: auto;" /> In the Model Coefficient Table, to test the estimate of `\(location\)`, there is not 1 result for `\(location\)` but 2! - Comparison of "Australia" vs. "France" - Comparison of "Australia" vs. "Ireland" **Why multiple `\(p\)`-value are provided for the same predictor?** --- # Coding Predictors with 3+ categories ### Variables - Outcome = `\(js\_score\)` (from 0 to 10) - Predictor = `\(location\)` (3 categories: *Australia*, *France* and *Ireland*) .pull-left[ <table class="table" style="font-size: 14px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> employee </th> <th style="text-align:left;"> location </th> <th style="text-align:right;"> js_score </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 5.057311 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 6.642440 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 6.119694 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 9.482198 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 8.883347 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 7.015606 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 4.633738 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 7.919998 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 9.028004 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 5.860449 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 10.000000 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 3.617721 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 6.948510 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 7.429012 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 7.292992 </td> </tr> <tr> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 7.765043 </td> </tr> <tr> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> Ireland </td> <td 
style="text-align:right;"> 6.380634 </td> </tr> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 5.962925 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 5.607226 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 4.635931 </td> </tr> </tbody> </table> ] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-23-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Coding Predictors with 3+ categories `\(t\)`-test can only compare 2 categories. Because Linear Regression Models are (kind of) `\(t\)`-test, categories will be compared 2-by-2 with one category as the reference to compare all the others. For example a linear regression of `\(location\)` on `\(js\_score\)` will display not one effect for the `\(location\)` but the effect of the 2-by-2 comparison using a reference group by alphabetical order: <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.21 </td> <td style="text-align:right;"> 0.54 </td> <td style="text-align:right;"> 11.42 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> locationFrance </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> 0.83 </td> <td style="text-align:right;"> 0.07 </td> <td style="text-align:left;"> 0.948 </td> </tr> <tr> <td style="text-align:left;"> locationIreland </td> <td style="text-align:right;"> 1.95 </td> <td style="text-align:right;"> 0.83 </td> <td style="text-align:right;"> 2.35 </td> <td style="text-align:left;"> 0.031 </td> </tr> </tbody> </table> In our case the reference is the group "Australia" (first letter). Here is our problem: **How to test the overall effect of a variable with 3 or more Categories?** --- # ANOVA Test for Overall Effects BesideLinear Regression and `\(t\)`-test, researchers are using ANOVA a lot. ANOVA, stands for Analysis of Variance and is also a sub category of Linear Regression Models. ANOVA is used to calculate the overall effect of categorical variable having more that 2 categories as `\(t\)`-test cannot cope. In the case of testing 1 categorical variable, a "one-way" ANOVA is performed. **How ANOVA is working?** ### In real words - `\(H_a\)`: at least one group is different from the others - `\(H_0\)`: all the groups are the same ### In mathematical terms - `\(H_a\)`: it is **not true** that `\(\mu_{1} = \mu_{2} = \mu_{3}\)` - `\(H_0\)`: it is **true** that `\(\mu_{1} = \mu_{2} = \mu_{3}\)` --- # ANOVA Test for Overall Effects I won't go too much in the details but to check if at least one group is different from the others, the distance of each value to the overall mean (Between−group variation) is compared to the distance of each value to their group mean (Within−group variation). 
**If the Between-group variation is the same as the Within-group variation, all the groups are the same.** <img src="https://raw.githubusercontent.com/damien-dupre/img/main/one_way_anova_basics.png" width="100%" style="display: block; margin: auto;" />
---
# ANOVA in our Example A hypothesis for a categorical predictor with 3 or more categories predicts that **at least one group among the 3 groups will have an average significantly different from the other averages**. ### Hypothesis Formulation > The `\(js\_score\)` of employees working in at least one specific `\(location\)` will be significantly different from the `\(js\_score\)` of employees working in the other `\(location\)` categories. ### In mathematical terms - `\(H_0\)`: it is true that `\(\mu(js\_score)_{Ireland} = \mu(js\_score)_{France} = \mu(js\_score)_{Australia}\)` - `\(H_a\)`: it is **not** true that `\(\mu(js\_score)_{Ireland} = \mu(js\_score)_{France} = \mu(js\_score)_{Australia}\)` This analysis is usually performed using a one-way ANOVA, but as ANOVAs are special cases of the General Linear Model, let's keep this approach.
---
# ANOVA in our Example <img src="lecture_5_files/figure-html/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" />
---
# ANOVA in our Example ### In JAMOVI 1. Open your file 2. Set variables according to their type 3. Analyses > Regression > Linear Regression 4. Set `\(js\_score\)` as DV and `\(location\)` as Factors 5. In the **Model Coefficients** option: - Select **Omnibus Test ANOVA test** <img src="https://raw.githubusercontent.com/damien-dupre/img/main/jamovi_lm_main_c32.png" width="40%" style="display: block; margin: auto;" /> ### Results > There is no significant effect of employee's `\(location\)` on their average `\(js\_score\)` ( `\(F(2, 17) = 3.30\)`, `\(p = .062\)`)
---
class: title-slide, middle ## Live Demo
---
class: title-slide, middle ## Exercise Using the `organisation_beta.csv` file, test the following models and conclude on the hypothesis related to each estimate: Model 1: `\(js\_score = b_{0} + b_{1}\,salary + b_{2}\,location + b_{3}\,perf + e\)` Model 2: `$$js\_score = b_{0} + b_{1}\,salary + b_{2}\,location + b_{3}\,perf + b_{4}\,salary*location +$$` `$$b_{5}\,perf*location + b_{6}\,perf*salary + b_{7}\,salary*location*perf + e$$`
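---
# ANOVA in our Example

For readers who want to see the omnibus test outside JAMOVI, here is a minimal R sketch on **simulated** data (not the `organisation_beta.csv` values): the coefficient table of the fitted linear model shows the 2-by-2 comparisons against the reference category, while `anova()` on the same model gives the single F-test for the overall effect of `\(location\)`.

```r
# Simulated js_score for three locations (reference category = Australia, by alphabetical order)
set.seed(7)
location <- factor(rep(c("Australia", "France", "Ireland"), each = 7))
js_score <- rnorm(21, mean = rep(c(6.2, 6.3, 8.2), each = 7))

model <- lm(js_score ~ location)
summary(model)  # locationFrance and locationIreland: 2-by-2 comparisons vs Australia
anova(model)    # one F-test for the overall effect of location (one-way ANOVA)
```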
---
class: inverse, mline, left, middle # 3. Manipulating Contrasts with Categorical Predictors
---
# Post-hoc Tests Imagine you want to test the specific difference between France and Ireland: **how can you obtain a test of specific categories when using a categorical variable with 3 or more categories?** A "Post-hoc" test runs a separate `\(t\)`-test for every pairwise category comparison:

|location1 |sep |location2 | md| se| df| t| ptukey|
|:---------|:---|:---------|-----:|----:|--:|-----:|------:|
|Australia |- |France | -0.06| 0.83| 17| -0.07| 1.00|
|Australia |- |Ireland | -1.95| 0.83| 17| -2.35| 0.08|
|France |- |Ireland | -1.90| 0.89| 17| -2.13| 0.11|

Even if it looks useful, "Post-hoc" tests can be considered `\(p\)`-Hacking because **there is no specific hypothesis testing, everything is compared**. Some corrections for multiple tests are available, such as Tukey, Scheffé, Bonferroni or Holm, but they are still very close to the bad science boundary.
---
# Contrasts or Factorial ANOVA By using specific codes for the categories (also called **contrasts**), it is possible to test more precise hypotheses. Actually, you are already using contrasts. When a recoding is done on a variable with 2 categories (like Dummy Coding or Deviation Coding), a contrast is applied. When a recoding is used on more than 2 categories, three rules have to be applied: -- - Rule 1: **Categories with the same code are tested together** > Coding Ireland 1, France 1 and Australia 2 compares Ireland and France versus Australia -- - Rule 2: **The number of possible contrasts is the number of categories - 1** > `\(location\)` has 3 categories so 2 contrast comparisons can be performed -- - Rule 3: **The value 0 means the category is not taken into account** > Coding Ireland 1, France 0 and Australia 2 compares Ireland versus Australia -- To understand contrasts, the best approach is to create them manually in your spreadsheet.
---
# Sum to Zero Contrasts Also called "Simple" contrasts, each contrast encodes the difference between one of the groups and a baseline category, which in this case is the first group: .pull-left[

|Predictor's categories | Contrast1| Contrast2|
|:----------------------|---------:|---------:|
|Placebo | -1| -1|
|Vaccine 1 | 1| 0|
|Vaccine 2 | 0| 1|

] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-31-1.png" width="504" style="display: block; margin: auto;" /> ] In this example: - Contrast 1 compares Placebo with Vaccine 1 - Contrast 2 compares Placebo with Vaccine 2 However, I won't be able to compare Vaccine 1 and Vaccine 2.
---
# Polynomial Contrasts They are the most powerful of all the contrasts to test linear and non-linear effects: Contrast 1 is called Linear, Contrast 2 is Quadratic, Contrast 3 is Cubic, Contrast 4 is Quartic ...
.pull-left[ |Predictor's categories | Contrast_1| Contrast_2| |:----------------------|----------:|----------:| |Low | -1| 1| |Medium | 0| -2| |High | 1| 1| ] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-33-1.png" width="504" style="display: block; margin: auto;" /> ] In this example: - Contrast 1 checks the linear increase between **Low**, **Medium**, **High** - Contrast 2 checks the quadratic change between **Low**, **Medium**, **High** If the hypothesis specified a linear increase, we would expect Contrast 1 to be significant but Contrast 2 to be non-significant --- class: title-slide, middle ## Live Demo --- # Comparison of Contrasts Results Let's see what happens with different contrast to compare the average `\(js\_score\)` according employee's `\(location\)`: **France**, **Ireland**, **Australia** ### Sum to Zero Contrasts .pull-left[ <table class="table" style="font-size: 17px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> category </th> <th style="text-align:right;"> sum_c1 </th> <th style="text-align:right;"> sum_c2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> France </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> </tr> <tr> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table" style="font-size: 17px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.88 </td> <td style="text-align:right;"> 0.35 </td> <td style="text-align:right;"> 19.82 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> sum_c1 </td> <td style="text-align:right;"> 1.28 </td> <td style="text-align:right;"> 0.50 </td> <td style="text-align:right;"> 2.55 </td> <td style="text-align:left;"> 0.021 </td> </tr> <tr> <td style="text-align:left;"> sum_c2 </td> <td style="text-align:right;"> -0.67 </td> <td style="text-align:right;"> 0.47 </td> <td style="text-align:right;"> -1.43 </td> <td style="text-align:left;"> 0.171 </td> </tr> </tbody> </table> ] ### Polynomial Contrasts .pull-left[ <table class="table" style="font-size: 17px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> category </th> <th style="text-align:right;"> poly_c1 </th> <th style="text-align:right;"> poly_c2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> France </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -2 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table" style="font-size: 17px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th 
style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.88 </td> <td style="text-align:right;"> 0.35 </td> <td style="text-align:right;"> 19.82 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> poly_c1 </td> <td style="text-align:right;"> -0.03 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> -0.07 </td> <td style="text-align:left;"> 0.948 </td> </tr> <tr> <td style="text-align:left;"> poly_c2 </td> <td style="text-align:right;"> -0.64 </td> <td style="text-align:right;"> 0.25 </td> <td style="text-align:right;"> -2.55 </td> <td style="text-align:left;"> 0.021 </td> </tr> </tbody> </table> ] --- class: title-slide, middle ## Exercise 1. Using the `organisation_beta.csv` file, create contrast variables to reproduce the results obtained with Sum to Zero Contrasts 2. When it's done, explicit the hypotheses tested, the representation of the models and their corresponding equation
---
# Solution - Sum to Zero Contrasts Variables: - Outcome = `\(js\_score\)` (from 0 to 10) - Predictor 1 = `\(sum\_c1\)` (Ireland vs France) - Predictor 2 = `\(sum\_c2\)` (Australia vs France) Hypotheses: - `\(H_{a_1}\)`: The average `\(js\_score\)` of Irish employees is different from the average `\(js\_score\)` of French employees - `\(H_{0_1}\)`: The average `\(js\_score\)` of Irish employees is the same as the average `\(js\_score\)` of French employees - `\(H_{a_2}\)`: The average `\(js\_score\)` of Australian employees is different from the average `\(js\_score\)` of French employees - `\(H_{0_2}\)`: The average `\(js\_score\)` of Australian employees is the same as the average `\(js\_score\)` of French employees
---
# Solution - Sum to Zero Contrasts Model:
Equation: - `\(js\_score = b_{0} + b_{1}\,sum\_c1 + b_{2}\,sum\_c2 + e\)` --- class: inverse, mline, left, middle <img class="circle" src="https://github.com/damien-dupre.png" width="250px"/> # Thanks for your attention and don't hesitate to ask if you have any questions! [
@damien_dupre](http://twitter.com/damien_dupre) [
@damien-dupre](http://github.com/damien-dupre) [
damien-datasci-blog.netlify.app](https://damien-datasci-blog.netlify.app) [
damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)