class: center, middle, inverse, title-slide .title[ # MT611 - Quantitative Research Methods ] .subtitle[ ## Lecture 5: Categories in the General Linear Model ] .author[ ### Damien Dupré ] .date[ ### Dublin City University ]
---
# Vocabulary "Linear Model", "Linear Regression", "Multiple Regression" or simply "Regression" all refer to the same model: **The General Linear Model**. It contains: - Only one Outcome/Dependent Variable - One or more Predictor/Independent Variables of any type (categorical or continuous) - Main and/or Interaction Effects `$$Y = b_{0} + b_{1}\,Predictor\,1 + b_{2}\,Predictor\,2 + ... + b_{n}\,Predictor\,n + e$$` A Linear Regression is used **to test all the hypotheses at once** and to calculate the predictors' estimates. Specific tests are available for certain types of hypotheses, such as the t-test or ANOVA, but as they are special cases of Linear Regressions, their importance is limited (see [Jonas Kristoffer Lindeløv's blog post: Common statistical tests are linear models](https://lindeloev.github.io/tests-as-linear/)).
---
# General Linear Model Everywhere .pull-left[ Most of the common statistical models (t-test, correlation, ANOVA, chi-square, etc.) are **special cases of linear models**. This beautiful simplicity means that there is less to learn. In particular, it all comes down to `\(y = ax + b\)`, which most students know from secondary school. Unfortunately, **stats intro courses are usually taught as if each test is an independent tool**, needlessly making life more complicated for students and teachers alike. Here, only **one test is taught to rule them all**: the General Linear Model (GLM). ] .pull-right[ <img src="https://psyteachr.github.io/msc-data-skills/images/memes/glm_meme.png" width="100%" style="display: block; margin: auto;" /> ]
---
# Analysis of the Estimate Once the best line is found, each estimate of the tested equation is calculated by the software (i.e., `\(b_0, b_1, ..., b_n\)`). - `\(b_0\)` is the intercept and is of no interest for hypothesis testing - `\(b_1, ..., b_n\)` are the predictors' effect estimates and each of them is used to test a hypothesis The predictors' effect estimates `\(b_1, ..., b_n\)` are **the values of the slope of the best line between each predictor** and the outcome.
It indicates **by how many units the outcome variable increases/decreases/changes when the predictor increases by 1 unit**. Technically, `\(b\)` is a weight or multiplier applied to the Predictor's values to obtain the Outcome's expected values.
---
# Analysis of the Estimate - If `\(b_1, ..., b_n = 0\)`, then: - The regression line is horizontal (no slope) - When the Predictor increases by 1 unit, the Outcome variable does not change - **The null hypothesis is not rejected** -- - If `\(b_1, ..., b_n > 0\)`, then: - The regression line is positive (slope up) - When the Predictor increases by 1 unit, the Outcome variable increases by `\(b\)` - **The null hypothesis is rejected and the alternative hypothesis is considered plausible** -- - If `\(b_1, ..., b_n < 0\)`, then: - The regression line is negative (slope down) - When the Predictor increases by 1 unit, the Outcome variable decreases by `\(b\)` - **The null hypothesis is rejected and the alternative hypothesis is considered plausible**
---
# Significance of Effect's Estimate The statistical significance of an effect estimate depends on the **strength of the relationship** and on the **sample size**: - An estimate of `\(b_1 = 0.02\)` can be very small but still significantly different from `\(b_1 = 0\)` - Whereas an estimate of `\(b_1 = 0.35\)` can be larger but in fact not significantly different from `\(b_1 = 0\)` -- The significance is the probability of obtaining your results with your sample under the null hypothesis scenario: - Also called the `\(p\)`-value - It ranges from 0% to 100%, which corresponds to a value between 0.0 and 1.0 **If the `\(p\)`-value is lower than 5% (i.e., 0.05), then the probability of obtaining your results under the null hypothesis scenario is low enough to reject the null hypothesis and to conclude that there must be a link between the variables.** -- Remember that the `\(p\)`-value is the probability of the data given the null hypothesis: `\(P(data|H_0)\)`.
---
class: inverse, mline, center, middle # 1. Hypotheses with Categorical Predictors having 2 Categories
---
# Hypotheses with Categorical Predictors We will take a deep dive into the handling of Categorical predictor variables in linear regressions: - How to analyse a Categorical predictor with only 2 categories? - How to analyse a Categorical predictor with more than 2 categories? <img src="https://media.makeameme.org/created/what-if-i-8de3498d3a.jpg" width="50%" style="display: block; margin: auto;" />
---
# Example of Categorical Coding .pull-left[ Imagine we sample male and female employees to see if the difference between their job satisfaction averages is due **to sampling luck or reflects a real difference in the population**.
That is, **is the difference between male and female employees statistically significant?** ] .pull-right[ <table> <thead> <tr> <th style="text-align:right;"> employee </th> <th style="text-align:left;"> gender </th> <th style="text-align:right;"> js_score </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 5.057311 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 6.642440 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 6.119694 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 9.482198 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 8.883347 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 7.015606 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 4.633738 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 7.919998 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 9.028004 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 5.860449 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 10.000000 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 3.617721 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 6.948510 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 7.429012 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 7.292992 </td> </tr> <tr> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 7.765043 </td> </tr> <tr> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 6.380634 </td> </tr> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 5.962925 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 5.607226 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 4.635931 </td> </tr> </tbody> </table> ] --- # Example of Categorical Coding .pull-left[ .center[Using a Categorical variable having 2 category, **e.g., comparing female vs. male** ...] <img src="lecture_5_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ .center[... 
is the same as **comparing female coded 1 and male coded 2**] <img src="lecture_5_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" /> ] By default, Categorical variables **are coded using the alphabetical order** (e.g., here Female first then Male) using 1, 2, 3 and so on. However, you can recode the variable yourself with your own order by creating a new variable using IF statement (e.g., `IF(gender == "female", 2, 1)` in Jamovi) --- # Categorical Coding in Linear Regression .pull-left[ .center[Default: **Female = 1** and **Male = 2**] <img src="lecture_5_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" /> <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 7.36 </td> <td style="text-align:right;"> 1.34 </td> <td style="text-align:right;"> 5.50 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> -0.34 </td> <td style="text-align:right;"> 0.80 </td> <td style="text-align:right;"> -0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ] .pull-right[ .center[Manual: **Male = 1** and **Female = 2**] <img src="lecture_5_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" /> <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.34 </td> <td style="text-align:right;"> 1.19 </td> <td style="text-align:right;"> 5.34 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> 0.34 </td> <td style="text-align:right;"> 0.80 </td> <td style="text-align:right;"> 0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ] --- # Categorical Predictor with 2 Categories Let's use another example with the `organisation_beta.csv` file ### Variable transformation Instead of using `\(salary\)` as a **continuous variable**, let's convert it as `\(salary\_c\)` which is a **categorical variable**: - Everything higher than or equal to salary average is labelled "**high**" salary - Everything lower than salary average is labelled "**low**" salary ### Hypothesis The `\(js\_score\)` of employees having a **high** `\(salary\_c\)` is different than the `\(js\_score\)` of employees having a **low** `\(salary\_c\)` ### In mathematical terms `$$H_a: \mu(js\_score)_{high\,salary} \neq \mu(js\_score)_{low\,salary}$$` `$$H_0: \mu(js\_score)_{high\,salary} = \mu(js\_score)_{low\,salary}$$` --- # Categorical Predictor with 2 Categories An hypothesis of differences between two groups is easily tested with a Linear Regression: - If `\(\mu_{1} \neq \mu_{2}\)`, the slope of the line between these averages is not null (i.e., `\(b_{1} \neq 
0\)`) - If `\(\mu_{1} = \mu_{2}\)`, the slope of the line between these averages is null (i.e., `\(b_{1} = 0\)` ) ### Explanation .pull-left[ **Comparing the difference between two averages is the same as comparing the slope of the line crossing these two averages** - If two averages are **not equal**, then **the slope of the line crossing these two averages is not 0** - If two averages are **equal**, then the **slope of the line crossing these two averages is 0** ] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-10-1.png" width="288" style="display: block; margin: auto;" /> ] --- # Categorical Predictor with 2 Categories ###
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Warning JAMOVI and other software **automatically code categorical variables in alphabetical order**, but sometimes you need to change these codes. .pull-left[ For example, here **low coded with the value 1** and **high coded with the value 2** would make more sense. The way categorical variables are coded influences the sign of the estimate (positive vs. negative), but **it doesn't change the value of the statistical test** nor the `\(p\)`-value obtained ] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-11-1.png" width="288" style="display: block; margin: auto;" /> ]
---
# Categorical Predictor with 2 Categories ### To sum up **To test the influence of a categorical predictor** variable, either nominal or ordinal, **having two categories** (e.g., high vs. low, male vs. female, France vs. Ireland), it is possible to **test if the `\(b\)` associated with this predictor is significantly higher, lower, or different from 0**. ### Equation `$$js\_score = b_{0} + b_{1}\,salary\_c + e$$` ### Communicating results Exactly the same template as for Continuous Predictors: > The predictions provided by the alternative model are significantly better than those provided by the null model ( `\(R^2 = .31\)`, `\(F(1, 18) = 8.27\)`, `\(p = .010\)`). > The effect of `\(salary\_c\)` on `\(js\_score\)` is statistically significant, therefore `\(H_0\)` can be rejected ( `\(b = 1.87\)`, 95\% CI `\([0.50, 3.24]\)`, `\(t(18) = 2.88\)`, `\(p = .010\)`).
---
class: title-slide, middle ## Testing Main Effects with Categorical Predictors
---
# Testing Categorical Predictors ### In JAMOVI 1. Open your file 2. Set variables in their **correct type** (continuous, cat. nominal or cat. ordinal) 3. **Analyses > Regression > Linear Regression** 4. Set `\(js\_score\)` as DV (i.e., Outcome) and `\(salary\_c\)` as Factors (i.e., Categorical Predictor) <img src="https://raw.githubusercontent.com/damien-dupre/img/main/jamovi_lm_main_c2.png" width="100%" style="display: block; margin: auto;" />
---
# Testing Categorical Predictors ### Model > The prediction provided by the model with all predictors is significantly better than a model without predictors (** `\(R^2 = .31\)`, `\(F(1, 18) = 8.27\)`, `\(p = .010\)`**). ### Hypothesis with Default Coding (high = 1 vs. low = 2) > The effect of `\(salary\_c\)` on `\(js\_score\)` is statistically significant, therefore `\(H_{0}\)` can be rejected (** `\(b = -1.87\)`, 95\% CI `\([-3.24, -0.50]\)`, `\(t(18) = -2.88\)`, `\(p = .010\)`**). ### Hypothesis with Manual Coding (low = 1 vs. high = 2) > The effect of `\(salary\_c\)` on `\(js\_score\)` is statistically significant, therefore `\(H_{0}\)` can be rejected (** `\(b = 1.87\)`, 95\% CI `\([0.50, 3.24]\)`, `\(t(18) = 2.88\)`, `\(p = .010\)`**).
---
# Coding of Categorical Predictors Choosing 1 and 2 is **just an arbitrary choice of numerical values**: any other possibility will produce the same `\(p\)`-value. However, choosing codes separated by 1 is handy because it is easily interpretable: the **estimate corresponds to the change from one category to another**: > The `\(js\_score\)` of "high" `\(salary\_c\)` employees is 1.87 higher than the `\(js\_score\)` of "low" `\(salary\_c\)` employees (when "low" is coded 1 and "high" coded 2).
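---
# Coding of Categorical Predictors

The same point can be checked outside JAMOVI. Below is a minimal sketch in R (the language these slides are built with), using a small **simulated** dataset rather than the lecture data: the estimate of a 2-category predictor coded 1 vs. 2 is the difference between the two group means, and swapping which category gets 1 and which gets 2 only flips the sign of the estimate, not the test statistic or the `\(p\)`-value.

```r
# Hypothetical data: 10 "low" and 10 "high" salary employees (not organisation_beta.csv)
set.seed(42)
js_score <- c(rnorm(10, mean = 6), rnorm(10, mean = 8))
salary_c <- rep(c("low", "high"), each = 10)

code_a <- ifelse(salary_c == "high", 1, 2)  # default-style coding: high = 1, low = 2
code_b <- ifelse(salary_c == "low", 1, 2)   # manual coding: low = 1, high = 2

summary(lm(js_score ~ code_a))  # estimate = mean(low) - mean(high), negative here
summary(lm(js_score ~ code_b))  # same magnitude, same t and p-value, opposite sign
```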
---
# Coding of Categorical Predictors ### Special case called **Dummy Coding**: one category is coded 0 and the other 1 - The intercept (the value of `\(js\_score\)` when the predictor is 0) corresponds to the average of the category coded 0 - The test of the intercept is the test of the average of the category coded 0 against an average of 0 - This is called a simple effect ### Special case called **Deviation Coding**: one category is coded 1 and the other -1 - The intercept corresponds to the average of the two categories - The test of the intercept is the test of the overall average of the variable - However, the distance between 1 and -1 is 2 units, so the estimate is not as easy to interpret; therefore it is possible to choose categories coded 0.5 vs. -0.5 instead
---
# Dummy Coding in Linear Regression Dummy Coding is when one category is coded 0 and the other coded 1. For example, in JAMOVI recode female as 0 and male as 1 (Dummy Coding): ``` IF(gender == "female", 0, 1) ``` Dummy Coding is useful because one of the categories becomes the intercept and is tested against 0. .pull-left[ <img src="lecture_5_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 7.02 </td> <td style="text-align:right;"> 0.62 </td> <td style="text-align:right;"> 11.33 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> -0.34 </td> <td style="text-align:right;"> 0.80 </td> <td style="text-align:right;"> -0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ]
---
# Deviation Coding in Linear Regression Deviation Coding is when the intercept is situated between the codes of the categories. For example, in JAMOVI recode female as -1 and male as 1 (Deviation Coding): ``` IF(gender == "female", -1, 1) ``` Deviation Coding is useful because **the average of the categories becomes the intercept** and is tested against 0. However, with a Deviation Coding **using -1 vs. +1, the distance between the categories is 2**, not 1. Therefore, even if the test of the slope is exactly the same, the value of the slope (the estimate) is halved. Consequently, it is possible to use **a Deviation Coding with -0.5 vs. +0.5 to keep the distance of 1** between the categories.
For example, in JAMOVI recode female as -0.5 and male as 0.5 (Deviation Coding): ``` IF(gender == "female", -0.5, 0.5) ```
---
# Deviation Coding in Linear Regression .pull-left[ .center[Female = -1 and Male = 1] <img src="lecture_5_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" /> <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.85 </td> <td style="text-align:right;"> 0.4 </td> <td style="text-align:right;"> 17.12 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> -0.17 </td> <td style="text-align:right;"> 0.4 </td> <td style="text-align:right;"> -0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ] .pull-right[ .center[Female = -0.5 and Male = 0.5] <img src="lecture_5_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" /> <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.85 </td> <td style="text-align:right;"> 0.4 </td> <td style="text-align:right;"> 17.12 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> gender_c </td> <td style="text-align:right;"> -0.34 </td> <td style="text-align:right;"> 0.8 </td> <td style="text-align:right;"> -0.43 </td> <td style="text-align:left;"> 0.675 </td> </tr> </tbody> </table> ]
---
class: title-slide, middle ## Testing Interaction Effects with Categorical Predictors
---
# Interaction with Categorical Predictors ### In JAMOVI 1. Open your file 2. Set variables according to their type 3. **Analyses > Regression > Linear Regression** 4. Set `\(js\_score\)` as DV and `\(salary\_c\)` as well as `\(gender\)` as Factors 5. In the **Model Builder** option: - Select both `\(salary\_c\)` and `\(gender\)` to bring them into the Factors at once ### Model Tested `$$js\_score = b_{0} + b_{1}\,salary\_c + b_{2}\,gender + b_{3}\,salary\_c*gender + e$$` Note: The test of the interaction effect corresponds to the test of a variable resulting from the multiplication between the codes of `\(salary\_c\)` and the codes of `\(gender\)`.
---
# Interaction with Categorical Predictors <img src="https://raw.githubusercontent.com/damien-dupre/img/main/jamovi_lm_main_cint.png" width="100%" style="display: block; margin: auto;" />
---
class: title-slide, middle ## Live Demo
---
class: title-slide, middle ## Exercise With the `organisation_beta.csv` data, test the following models and conclude on each effect: Model 1: `\(js\_score = b_{0} + b_{1}\,perf + b_{2}\,gender + b_{3}\,perf*gender + e\)` Model 2: `\(js\_score = b_{0} + b_{1}\,perf + b_{2}\,location + b_{3}\,perf*location + e\)`
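---
# Interaction with Categorical Predictors

To illustrate the note above, here is a minimal R sketch on **simulated** data (hypothetical 0/1 codes, not the `organisation_beta.csv` values): the interaction term is simply the product of the codes of the two categorical predictors, so adding that product as a third predictor by hand gives exactly the same estimate and `\(p\)`-value as the built-in `salary_c*gender` term.

```r
# Hypothetical dummy-coded predictors: 0/1 for salary_c and gender
set.seed(1)
n        <- 40
salary_c <- rbinom(n, 1, 0.5)  # 0 = low, 1 = high
gender   <- rbinom(n, 1, 0.5)  # 0 = female, 1 = male
js_score <- 5 + 1.5 * salary_c + 0.5 * gender + 1 * salary_c * gender + rnorm(n)

interaction_code <- salary_c * gender  # hand-made interaction variable

coef(summary(lm(js_score ~ salary_c * gender)))                     # built-in interaction term
coef(summary(lm(js_score ~ salary_c + gender + interaction_code)))  # identical last row
```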
--- class: inverse, mline, center, middle # 2. Hypotheses with Categorical Predictor having 3+ Categories --- # Categorical Predictor with 3+ Categories ### Problem with more than 2 groups I would like to test the effect of the variable `\(location\)` which has 3 categories: "Ireland", "France" and "Australia". <img src="https://raw.githubusercontent.com/damien-dupre/img/main/jamovi_lm_main_c31.png" style="display: block; margin: auto;" /> In the Model Coefficient Table, to test the estimate of `\(location\)`, there is not 1 result for `\(location\)` but 2! - Comparison of "Australia" vs. "France" - Comparison of "Australia" vs. "Ireland" **Why multiple `\(p\)`-value are provided for the same predictor?** --- # Coding Predictors with 3+ categories ### Variables - Outcome = `\(js\_score\)` (from 0 to 10) - Predictor = `\(location\)` (3 categories: *Australia*, *France* and *Ireland*) .pull-left[ <table class="table" style="font-size: 14px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> employee </th> <th style="text-align:left;"> location </th> <th style="text-align:right;"> js_score </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 5.057311 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 6.642440 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 6.119694 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 9.482198 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 8.883347 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 7.015606 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 4.633738 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 7.919998 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 9.028004 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 5.860449 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 10.000000 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 3.617721 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 6.948510 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 7.429012 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 7.292992 </td> </tr> <tr> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 7.765043 </td> </tr> <tr> <td style="text-align:right;"> 17 </td> <td style="text-align:left;"> Ireland </td> <td 
style="text-align:right;"> 6.380634 </td> </tr> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 5.962925 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 5.607226 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 4.635931 </td> </tr> </tbody> </table> ] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-23-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Coding Predictors with 3+ categories `\(t\)`-test can only compare 2 categories. Because Linear Regression Models are (kind of) `\(t\)`-test, categories will be compared 2-by-2 with one category as the reference to compare all the others. For example a linear regression of `\(location\)` on `\(js\_score\)` will display not one effect for the `\(location\)` but the effect of the 2-by-2 comparison using a reference group by alphabetical order: <table class="table" style="font-size: 16px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.21 </td> <td style="text-align:right;"> 0.54 </td> <td style="text-align:right;"> 11.42 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> locationFrance </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> 0.83 </td> <td style="text-align:right;"> 0.07 </td> <td style="text-align:left;"> 0.948 </td> </tr> <tr> <td style="text-align:left;"> locationIreland </td> <td style="text-align:right;"> 1.95 </td> <td style="text-align:right;"> 0.83 </td> <td style="text-align:right;"> 2.35 </td> <td style="text-align:left;"> 0.031 </td> </tr> </tbody> </table> In our case the reference is the group "Australia" (first letter). Here is our problem: **How to test the overall effect of a variable with 3 or more Categories?** --- # ANOVA Test for Overall Effects BesideLinear Regression and `\(t\)`-test, researchers are using ANOVA a lot. ANOVA, stands for Analysis of Variance and is also a sub category of Linear Regression Models. ANOVA is used to calculate the overall effect of categorical variable having more that 2 categories as `\(t\)`-test cannot cope. In the case of testing 1 categorical variable, a "one-way" ANOVA is performed. **How ANOVA is working?** ### In real words - `\(H_a\)`: at least one group is different from the others - `\(H_0\)`: all the groups are the same ### In mathematical terms - `\(H_a\)`: it is **not true** that `\(\mu_{1} = \mu_{2} = \mu_{3}\)` - `\(H_0\)`: it is **true** that `\(\mu_{1} = \mu_{2} = \mu_{3}\)` --- # ANOVA Test for Overall Effects I won't go too much in the details but to check if at least one group is different from the others, the distance of each value to the overall mean (Between−group variation) is compared to the distance of each value to their group mean (Within−group variation). 
**If the Between-group variation is the same as the Within-group variation, all the groups are the same.** <img src="https://raw.githubusercontent.com/damien-dupre/img/main/one_way_anova_basics.png" width="100%" style="display: block; margin: auto;" />
---
# ANOVA in our Example A hypothesis for a categorical predictor with 3 or more categories predicts that **at least one group among the 3 groups will have an average significantly different from the other averages**. ### Hypothesis Formulation > The `\(js\_score\)` of employees working in at least one specific `\(location\)` will be significantly different from the `\(js\_score\)` of employees working in the other `\(location\)` categories. ### In mathematical terms - `\(H_0\)`: it is true that `\(\mu(js\_score)_{Ireland} = \mu(js\_score)_{France} = \mu(js\_score)_{Australia}\)` - `\(H_a\)`: it is **not** true that `\(\mu(js\_score)_{Ireland} = \mu(js\_score)_{France} = \mu(js\_score)_{Australia}\)` This analysis is usually performed using a one-way ANOVA, but as ANOVAs are special cases of the General Linear Model, let's keep this approach.
---
# ANOVA in our Example <img src="lecture_5_files/figure-html/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" />
---
# ANOVA in our Example ### In JAMOVI 1. Open your file 2. Set variables according to their type 3. Analyses > Regression > Linear Regression 4. Set `\(js\_score\)` as DV and `\(location\)` as Factors 5. In the **Model Coefficients** option: - Select **Omnibus Test ANOVA test** <img src="https://raw.githubusercontent.com/damien-dupre/img/main/jamovi_lm_main_c32.png" width="40%" style="display: block; margin: auto;" /> ### Results > There is no significant effect of employee's `\(location\)` on their average `\(js\_score\)` ( `\(F(2, 17) = 3.30\)`, `\(p = .062\)`)
---
class: title-slide, middle ## Live Demo
---
class: title-slide, middle ## Exercise Using the `organisation_beta.csv` file, test the following models and conclude on the hypothesis related to each estimate: Model 1: `\(js\_score = b_{0} + b_{1}\,salary + b_{2}\,location + b_{3}\,perf + e\)` Model 2: `$$js\_score = b_{0} + b_{1}\,salary + b_{2}\,location + b_{3}\,perf + b_{4}\,salary*location +$$` `$$b_{5}\,perf*location + b_{6}\,perf*salary + b_{7}\,salary*location*perf + e$$`
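---
# ANOVA in our Example

For readers who want to see the omnibus test outside JAMOVI, here is a minimal R sketch on **simulated** data (not the `organisation_beta.csv` values): the coefficient table of the fitted linear model shows the 2-by-2 comparisons against the reference category, while `anova()` on the same model gives the single F-test for the overall effect of `\(location\)`.

```r
# Simulated js_score for three locations (reference category = Australia, by alphabetical order)
set.seed(7)
location <- factor(rep(c("Australia", "France", "Ireland"), each = 7))
js_score <- rnorm(21, mean = rep(c(6.2, 6.3, 8.2), each = 7))

model <- lm(js_score ~ location)
summary(model)  # locationFrance and locationIreland: 2-by-2 comparisons vs Australia
anova(model)    # one F-test for the overall effect of location (one-way ANOVA)
```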
---
class: inverse, mline, left, middle # 3. Manipulating Contrasts with Categorical Predictors
---
# Post-hoc Tests Imagine you want to test the specific difference between France and Ireland: **how can you obtain a test of specific categories when using a categorical variable with 3 or more categories?** A "Post-hoc" test runs a separate `\(t\)`-test for every pairwise category comparison:

|location1 |sep |location2 | md| se| df| t| ptukey|
|:---------|:---|:---------|-----:|----:|--:|-----:|------:|
|Australia |- |France | -0.06| 0.83| 17| -0.07| 1.00|
|Australia |- |Ireland | -1.95| 0.83| 17| -2.35| 0.08|
|France |- |Ireland | -1.90| 0.89| 17| -2.13| 0.11|

Even if it looks useful, "Post-hoc" tests can be considered `\(p\)`-Hacking because **there is no specific hypothesis testing, everything is compared**. Some corrections for multiple tests are available, such as Tukey, Scheffé, Bonferroni or Holm, but they are still very close to the bad science boundary.
---
# Contrasts or Factorial ANOVA By using specific codes for the categories (also called **contrasts**), it is possible to test more precise hypotheses. Actually, you are already using contrasts. When a recoding is done on a variable with 2 categories (like Dummy Coding or Deviation Coding), a contrast is applied. When a recoding is used on more than 2 categories, three rules have to be applied: -- - Rule 1: **Categories with the same code are tested together** > Coding Ireland 1, France 1 and Australia 2 compares Ireland and France versus Australia -- - Rule 2: **The number of possible contrasts is the number of categories - 1** > `\(location\)` has 3 categories so 2 contrast comparisons can be performed -- - Rule 3: **The value 0 means the category is not taken into account** > Coding Ireland 1, France 0 and Australia 2 compares Ireland versus Australia -- To understand contrasts, the best approach is to create them manually in your spreadsheet.
---
# Sum to Zero Contrasts Also called "Simple" contrasts, each contrast encodes the difference between one of the groups and a baseline category, which in this case is the first group: .pull-left[

|Predictor's categories | Contrast1| Contrast2|
|:----------------------|---------:|---------:|
|Placebo | -1| -1|
|Vaccine 1 | 1| 0|
|Vaccine 2 | 0| 1|

] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-31-1.png" width="504" style="display: block; margin: auto;" /> ] In this example: - Contrast 1 compares Placebo with Vaccine 1 - Contrast 2 compares Placebo with Vaccine 2 However, I won't be able to compare Vaccine 1 and Vaccine 2.
---
# Polynomial Contrasts They are the most powerful of all the contrasts to test linear and non-linear effects: Contrast 1 is called Linear, Contrast 2 is Quadratic, Contrast 3 is Cubic, Contrast 4 is Quartic ...
.pull-left[ |Predictor's categories | Contrast_1| Contrast_2| |:----------------------|----------:|----------:| |Low | -1| 1| |Medium | 0| -2| |High | 1| 1| ] .pull-right[ <img src="lecture_5_files/figure-html/unnamed-chunk-33-1.png" width="504" style="display: block; margin: auto;" /> ] In this example: - Contrast 1 checks the linear increase between **Low**, **Medium**, **High** - Contrast 2 checks the quadratic change between **Low**, **Medium**, **High** If the hypothesis specified a linear increase, we would expect Contrast 1 to be significant but Contrast 2 to be non-significant --- class: title-slide, middle ## Live Demo --- # Comparison of Contrasts Results Let's see what happens with different contrast to compare the average `\(js\_score\)` according employee's `\(location\)`: **France**, **Ireland**, **Australia** ### Sum to Zero Contrasts .pull-left[ <table class="table" style="font-size: 17px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> category </th> <th style="text-align:right;"> sum_c1 </th> <th style="text-align:right;"> sum_c2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> France </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> </tr> <tr> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table" style="font-size: 17px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.88 </td> <td style="text-align:right;"> 0.35 </td> <td style="text-align:right;"> 19.82 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> sum_c1 </td> <td style="text-align:right;"> 1.28 </td> <td style="text-align:right;"> 0.50 </td> <td style="text-align:right;"> 2.55 </td> <td style="text-align:left;"> 0.021 </td> </tr> <tr> <td style="text-align:left;"> sum_c2 </td> <td style="text-align:right;"> -0.67 </td> <td style="text-align:right;"> 0.47 </td> <td style="text-align:right;"> -1.43 </td> <td style="text-align:left;"> 0.171 </td> </tr> </tbody> </table> ] ### Polynomial Contrasts .pull-left[ <table class="table" style="font-size: 17px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> category </th> <th style="text-align:right;"> poly_c1 </th> <th style="text-align:right;"> poly_c2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> France </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Ireland </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> -2 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> </tr> </tbody> </table> ] .pull-right[ <table class="table" style="font-size: 17px; color: black; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> term </th> <th 
style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 6.88 </td> <td style="text-align:right;"> 0.35 </td> <td style="text-align:right;"> 19.82 </td> <td style="text-align:left;"> <0.001 </td> </tr> <tr> <td style="text-align:left;"> poly_c1 </td> <td style="text-align:right;"> -0.03 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> -0.07 </td> <td style="text-align:left;"> 0.948 </td> </tr> <tr> <td style="text-align:left;"> poly_c2 </td> <td style="text-align:right;"> -0.64 </td> <td style="text-align:right;"> 0.25 </td> <td style="text-align:right;"> -2.55 </td> <td style="text-align:left;"> 0.021 </td> </tr> </tbody> </table> ] --- class: title-slide, middle ## Exercise 1. Using the `organisation_beta.csv` file, create contrast variables to reproduce the results obtained with Sum to Zero Contrasts 2. When it's done, explicit the hypotheses tested, the representation of the models and their corresponding equation
---
# Solution - Sum to Zero Contrasts Variables: - Outcome = `\(js\_score\)` (from 0 to 10) - Predictor 1 = `\(sum\_c1\)` (Ireland vs France) - Predictor 2 = `\(sum\_c2\)` (Australia vs France) Hypotheses: - `\(H_{a_1}\)`: The average `\(js\_score\)` of Irish employees is different from the average `\(js\_score\)` of French employees - `\(H_{0_1}\)`: The average `\(js\_score\)` of Irish employees is the same as the average `\(js\_score\)` of French employees - `\(H_{a_2}\)`: The average `\(js\_score\)` of Australian employees is different from the average `\(js\_score\)` of French employees - `\(H_{0_2}\)`: The average `\(js\_score\)` of Australian employees is the same as the average `\(js\_score\)` of French employees
---
# Solution - Sum to Zero Contrasts Model:
Equation: - `\(js\_score = b_{0} + b_{1}\,sum\_c1 + b_{2}\,sum\_c2 + e\)` --- class: inverse, mline, left, middle <img class="circle" src="https://github.com/damien-dupre.png" width="250px"/> # Thanks for your attention and don't hesitate to ask if you have any questions! [
@damien_dupre](http://twitter.com/damien_dupre) [
@damien-dupre](http://github.com/damien-dupre) [
damien-datasci-blog.netlify.app](https://damien-datasci-blog.netlify.app) [
damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)