class: center, middle, inverse, title-slide .title[ # MT612 - Advanced Quant. Research Methods ] .subtitle[ ## Lecture 1: The General Linear Model ] .author[ ### Damien Dupré ] .date[ ### Dublin City University ] --- class: inverse, mline, center, middle # 1. General Information --- # Who am I? #### Development of the DynEmo Facial Expression Database (Master) * Dynamic and spontaneous emotions * Assessed with self-reports and by observers #### Analysis of Emotional User Experience of Innovative Tech. (Industrial PhD) * Understand users' acceptance of technologies from their emotional response * Based on multivariate self-reports #### Evaluation of Emotions from Facial and Physiological Measures (Industrial PostDoc) * Applications to marketing, sports and automotive industries * Dynamic changes with trend extraction techniques (2 patents) #### Performance Prediction using Machine Learning (Academic PostDoc) * Application to sport analytics * Big Data treatment (> 1 million users with activities recorded in the past 5 years) --- class: title-slide, middle ## Who are you? Please introduce yourself: - What is your first name? - Which school are you in? - What is your PhD about (in a few words)? --- class: inverse, mline, center, middle # Aims and Assignment --- # What to Expect? This lecture focuses on a new way to teach statistics: 1. Understanding advanced statistical models 2. Using new open source software (JAMOVI and R) 3.
Applying this knowledge and these skills to writing research papers In the end, I want you to become a Data Scientist with enough knowledge and skills to: - Challenge bad science and wrong ideas from your supervisor - Apply to Data Science positions --- # Helpful Readings (Beginners) - **Teacups, giraffes, and statistics** by Hasse Walum and Desirée De Leon (2021) https://tinystats.github.io/teacups-giraffes-and-statistics/index.html - **Introduction to Modern Statistics** by Mine Çetinkaya-Rundel and Johanna Hardin (2021) https://openintro-ims.netlify.app/ - **Learning statistics with jamovi: A tutorial for psychology students and other beginners** by Danielle Navarro and David Foxcroft (2019) https://www.learnstatswithjamovi.com/ - **Learning statistics with R: A tutorial for psychology students and other beginners** by Danielle Navarro (2018) https://learningstatisticswithr-bookdown.netlify.app/ - **Statistical Thinking for the 21st Century** by Russell A. Poldrack (2022) https://statsthinking21.github.io/statsthinking21-core-site/ --- # Helpful Readings (Advanced) - **Advanced Regression Methods** by Cheng Hua, Youn-Jeng Choi, and Qingzhou Shi (2021) https://bookdown.org/chua/ber642_advanced_regression/ - **Analysing Data using Linear Models** by Stéphanie M. van den Berg (2022) https://bookdown.org/pingapang9/linear_models_bookdown/ - **Regression and Other Stories** by Andrew Gelman, Jennifer Hill, and Aki Vehtari (2022) https://avehtari.github.io/ROS-Examples/index.html - **An Introduction to Statistical Learning** by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2021) https://www.statlearning.com/ - **Mixed Models with R, Getting started with random effects** by Michael Clark (2021) https://m-clark.github.io/mixed-models-with-R/ --- # Details on the Assignment Based on your research topic, I will give you some data (in January).
Your task will be **to write a ready-to-be-published research paper** that includes: - A short introduction with a couple of references leading to your hypotheses - An extended method section presenting the variables, your model with a graphic representation, the equation to test your hypotheses and the test you choose to use - A results section of publication quality and additional results justifying that the conditions of application of the tests used are met - A short discussion and conclusion This paper will have **a maximum of 6 pages** and **a publication-ready design** (any journal/conference final style but no draft manuscript design). Appendices are possible, especially if they include code to reproduce the results. They are not included in the page count. **The deadline is June 21st, 2023.** --- class: title-slide, middle ## Exercise If you haven't done it already, please __look for the quantitative academic journal paper which is closest to your PhD__. You need to download the pdf version of this paper and __send it to my email damien.dupre@dcu.ie__.
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Notes: - This paper should not be one of yours if you have already published some - This paper should include a statistical analysis (e.g., regression analysis, ANOVA, t-test) and, if possible, the corresponding `\(p\)`-values --- class: inverse, mline, center, middle # 2. Essential Concepts to Master --- # Essential Concepts to Master In Academic Reports, all sections are linked: .center[**Introduction ➡️ Literature Review ➡️ Method ➡️ Results ➡️ Discussion & Conclusion**] -- To understand the statistics in the results section, it is essential to identify the concepts presented in each section:
--- class: title-slide, middle ## Variables' Role and Type --- # Academic Papers' Introduction An introduction is a section **presenting your variables and why you investigate them**. There is little reference to previous academic research, just a description of actual facts. It should end with your **Research Question**, a question that includes all the main variables investigated and asks about a potential relationship between them. For example: - "What is the relationship between Job Satisfaction, Salary and Gender?" - "How does sales experience influence the performance of sales managers and sales representatives?"
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
**Warning**: Each variable has a **Role** and a **Type**; it is essential to learn how to identify them. --- # Type of Variables Variables can have different types: - **Categorical**: If the variable's possibilities are words or sentences (character string) - if the possibilities cannot be ordered: Categorical Nominal (*e.g.*, `\(gender\)` male, female, other) - if the possibilities can be ordered: Categorical Ordinal (*e.g.*, `\(size\)` S, M, L) - **Continuous**: If the variable's possibilities are numbers (*e.g.*, `\(age\)`, `\(temperature\)`, ...)
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
**Warning**: Variables can be converted to either Categorical or Continuous, but it is always better to keep them in their correct scale. <img src="img/jamovi_icons.png" width="30%" style="display: block; margin: auto;" /> --- # Role of Variables It's important to keep the two roles "variable doing the explaining" and "variable being explained" distinct. Let's denote the: - **Outcome**: "variable to be explained" (also called `\(Y\)`, Dependent Variable, or DV) - **Predictor**: "variable doing the explaining" (also called `\(X\)`, Independent Variable, or IV) -- Statistics is only about identifying relationships between Predictor and Outcome variables, also called **effects** > An effect between 2 variables means that the changes in the values of a predictor variable are related to changes in the values of an outcome variable. > The aim of an Academic Report is to investigate if the **Variability of the Outcome Variable** is related to the variability of Predictor Variables. --- # Predictors, Outcomes and Controls An effect between a predictor variable and an outcome variable corresponds to the following model:
This arrow does not suggest causation but indicates correlation between `\(Predictor\)` and `\(Outcome\)`; there is no assumption of one causing the other. **An "effect" is reciprocal and does not involve causality**. Causality analysis is another kind of test that involves: - Being sure that the 2 variables are correlated - That one variable is the antecedent of the other - That no other variable explains this relationship --- # Predictors, Outcomes and Controls A significant effect of a `\(Predictor\)` on an `\(Outcome\)` variable means that **a predictor is explaining enough variance of the outcome** variable to show a significant relationship. .pull-left[ - If there is no effect between the variables, they are not sharing enough of their variability <img src="lecture_1_files/figure-html/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ - If there is an effect between the variables, they are sharing a big part of their variability <img src="lecture_1_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" /> ] To decide if the part of the shared variability is big enough, a statistical test is required. --- class: title-slide, middle ## Formulating Hypotheses --- # Hypotheses in a Nutshell Hypotheses are: 1. Predictions supported by theory/literature 2. Affirmations designed to precisely describe the relationships between variables > *“Hypothesis statements contain two or more variables that are measurable or potentially measurable and that specify how the variables are related”* (Kerlinger, 1986) Hypotheses include: - Predictor(s) / Independent Variable(s) - Outcome / Dependent Variable (DV) - Direction of the outcome if the predictor increases
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
**Warning:** Hypotheses cannot test equality between groups or modalities; they can only test differences or effects --- # Alternative *vs.* Null Hypotheses Every hypothesis has to state a difference (between groups or according to values) also called `\(H_a\)` (for alternative hypothesis) or `\(H_1\)` Every alternative hypothesis has a null hypothesis counterpart (no difference between groups or according to values) also called `\(H_0\)` (pronounced H naught or H zero) -- `\(H_a\)` is viewed as a “challenger” hypothesis to the null hypothesis `\(H_0\)`. > **Statistics are used to test the probability of obtaining your results if the Null Hypothesis is true. If this probability is low, then we reject the Null Hypothesis (and consider the Alternative Hypothesis as credible).** But there are only two kinds of alternative hypotheses: **Main Effect Hypotheses** and **Interaction Effect Hypotheses** --- # Main Effect Hypothesis This is the **predicted relationship between one `\(Predictor\)` and one `\(Outcome\)` variable** The `\(Outcome\)` needs to be Continuous (but some models can use a Categorical Outcome) The `\(Predictor\)` can be either Continuous or Categorical, but the hypothesis formulation will change with its type - Effect representation:
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
**Warning**: The direction of the arrow does not involve causality, only correlation. --- # Main Effect Hypothesis Templates In the following formulation templates, **replace the variable names with yours** and *select the direction of the effect expected* ... - #### Case 1: Predictor is Continuous .small[{**outcome**} {*increases/decreases/changes*} when {**predictor**} increases] > .small[**Job satisfaction** *increases* when **salary** increases] -- - #### Case 2: Predictor is Categorical (2 Categories) .small[The {**outcome**} of {**predictor category 1**} is {*higher/lower/different*} than the {**outcome**} of {**predictor category 2**}] > .small[The **Job satisfaction** of **EU employees** is *higher* than the **job satisfaction** of **Non-EU employees**] -- - #### Case 3: Predictor is Categorical (3 or more Categories) .small[The {**outcome**} of at least one of the {**predictor**} is {*higher/lower/different*} than the {**outcome**} of the other {**predictor**}] > .small[The **Job satisfaction** of at least one of the **company's departments** is *higher* than the **Job satisfaction** of the other **company's departments**] --- # Interaction Effect Hypothesis **It predicts the influence of a second predictor on the relationship between a first predictor and an outcome variable** Notes: - The second predictor is also called a moderator. - The main effect of each predictor must be hypothesised as well - The roles of the first and second predictors can be inverted with exactly the same statistical results .pull-left[ Effects representation:
] .pull-right[ Exactly the same results:
] --- # Interaction Effect Hypothesis Templates In the following formulation templates, **replace the variable names with yours** and *select the direction of the effect expected* ... -- - #### Case 1: Predictor 2 is Continuous .small[The effect of {**predictor 1**} on {**outcome**} is {*higher/lower/different*} when {**predictor 2**} increases] -- - #### Case 2: Predictor 2 is Categorical (2 Categories) .small[The effect of {**predictor 1**} on {**outcome**} is {*higher/lower/different*} for {**predictor 2 category 1**} than for {**predictor 2 category 2**}] -- - #### Case 3: Predictor 2 is Categorical (3 or more Categories) .small[The effect of {**predictor 1**} on {**outcome**} is {*higher/lower/different*} for at least one of {**predictor 2**}] --
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
**Warning**: 1. An interaction effect hypothesis is also called a moderation effect 2. By default, an interaction effect involves the test of the main effect hypotheses of all Predictors involved 3. Predictors 1 and 2 are commutable (can be inverted and produce the same hypothesis) --- class: title-slide, middle ## Model Representation --- # Model Representation Models are an overview of the predicted relationships between variables stated in the hypotheses You must follow these rules: - Rule 1: All the arrows correspond to a hypothesis to be tested - Rule 2: All the tested hypotheses have to be represented with an arrow - Rule 3: Hypotheses using the same Outcome variable should be included in the same model - Rule 4: Only one Outcome variable is included in each model (except for SEM models) --- # Model Representation .pull-left[ .center[**A simple arrow is a main effect**]
] .pull-right[ .center[**A crossing arrow is an interaction effect**]
.center[Note: By default, an interaction effect involves the test of the main effect hypotheses of all Predictors involved] ] --- # Structure of Models Distinguish squares from circles - **squares** are actual **measures/items** - **circles** are **latent variables** related to measures/items Example: - `\(Salary\)` is directly measured (in $, €, or £) so it's a square. - `\(Job\,Satisfaction\)` is a latent variable with several questions so it's a circle. Items used for latent variables can be omitted in a model; the variables are the most important. We can distinguish 2 types of relationships in a model: - Main effect relationship - Interaction effect relationship --- # Main Effect Relationship .pull-left[ .center[Relationship between one Predictor and one Outcome variable]
This model tests one hypothesis: - 1 main effect ] .pull-right[ .center[Relationship between two Predictors and one Outcome variable]
This model tests two hypotheses: - 2 main effects ] --- # Interaction Effect Relationship An interaction means that **the effect of a Predictor 1 on the Outcome variable will be different according the possibilities of a Predictor 2** (also called Moderation). .pull-left[ classic representation:
] .pull-right[ is the same as:
] This model tests three hypotheses: - 2 main effects - 1 interaction effect --- class: title-slide, middle ## Equation Corresponding to a Model --- # A Basic Equation Let's imagine the perfect scenario: **your Predictor variable perfectly explains the Outcome variable**. The corresponding equation is: `\(Outcome = Predictor\)` .pull-left[ <table class="table table-striped" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:center;"> Observation </th> <th style="text-align:center;"> Outcome </th> <th style="text-align:center;"> Predictor </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> a </td> <td style="text-align:center;"> 10 </td> <td style="text-align:center;"> 10 </td> </tr> <tr> <td style="text-align:center;"> b </td> <td style="text-align:center;"> 9 </td> <td style="text-align:center;"> 9 </td> </tr> <tr> <td style="text-align:center;"> c </td> <td style="text-align:center;"> 8 </td> <td style="text-align:center;"> 8 </td> </tr> <tr> <td style="text-align:center;"> d </td> <td style="text-align:center;"> 7 </td> <td style="text-align:center;"> 7 </td> </tr> <tr> <td style="text-align:center;"> e </td> <td style="text-align:center;"> 6 </td> <td style="text-align:center;"> 6 </td> </tr> <tr> <td style="text-align:center;"> f </td> <td style="text-align:center;"> 5 </td> <td style="text-align:center;"> 5 </td> </tr> <tr> <td style="text-align:center;"> g </td> <td style="text-align:center;"> 4 </td> <td style="text-align:center;"> 4 </td> </tr> <tr> <td style="text-align:center;"> h </td> <td style="text-align:center;"> 3 </td> <td style="text-align:center;"> 3 </td> </tr> <tr> <td style="text-align:center;"> i </td> <td style="text-align:center;"> 2 </td> <td style="text-align:center;"> 2 </td> </tr> <tr> <td style="text-align:center;"> j </td> <td style="text-align:center;"> 1 </td> <td style="text-align:center;"> 1 </td> </tr> <tr> <td style="text-align:center;"> k </td> <td
style="text-align:center;"> 0 </td> <td style="text-align:center;"> 0 </td> </tr> </tbody> </table> ] .pull-right[ <img src="lecture_1_files/figure-html/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" /> ] --- # A Basic Equation In the equation `\(Outcome = Predictor\)`, **three coefficients are hidden** because they are unused: - the **intercept coefficient** `\(b_{0}\)` (i.e., the value of the Outcome when the Predictor = 0) which is 0 in our case - the **estimate coefficient** `\(b_{1}\)` (i.e., how much the Outcome increases when the Predictor increases by 1) which is 1 in our case - the **error coefficient** `\(e\)` (i.e., how far from the prediction line the values of the Outcome are) which is 0 in our case So in general, the relation between a predictor and an outcome can be written as: `$$Outcome = b_{0} + b_{1}\,Predictor + e$$` which is in our case: `$$Outcome = 0 + 1 * Predictor + 0$$` --- # A Basic Equation The equation `\(Outcome = b_{0} + b_{1}\,Predictor + e\)` is the same as the good old `\(y = ax + b\)` (here ordered as `\(y = b + ax\)`) where `\(b_{0}\)` is `\(b\)` and `\(b_{1}\)` is `\(a\)`. It is very important to know that under **EVERY** statistical test, a similar equation is used (t-test, ANOVA, Chi-square are all linear regressions).
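A quick numeric check of this perfect scenario, where the three hidden coefficients are `\(b_0 = 0\)`, `\(b_1 = 1\)` and `\(e = 0\)`. A minimal sketch in plain Python (an illustration only; the module's tools are jamovi and R):

```python
# Perfect scenario: Outcome = b0 + b1 * Predictor + e
# with intercept b0 = 0, estimate b1 = 1 and error e = 0.
b0, b1, e = 0, 1, 0

predictor = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]  # observations a to k from the table
outcome = [b0 + b1 * p + e for p in predictor]

# With these coefficients, the outcome is identical to the predictor
print(outcome == predictor)  # True
```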
--- # Relationship between Variables The relationship between a `\(Predictor\)` and an `\(Outcome\)` variable (stated in a main effect hypothesis or in an interaction effect hypothesis) is analysed in terms of: .center[**"How many units of the Outcome variable increases/decreases/changes when the Predictor increases by 1 unit?"**] For example: > How much does Job Satisfaction increase when the Salary increases by €1? The value of how much of the Outcome variable changes: - Is called the **Estimate** (also called Unstandardised Estimate) - Uses the letter `\(b\)` in equations (e.g., `\(b_1\)`, `\(b_2\)`, `\(b_3\)`, ...) For example: > If Job Satisfaction increases by 0.1 on a scale from 0 to 5 when the Salary increases by €1, then the *b* associated with Salary is 0.1 --- # Notes on the Equations #### 1. Greek or Latin alphabet? `$$Y = \beta_{0} + \beta_{1}\,X_{1} + \epsilon \; vs. \; Y = b_{0} + b_{1}\,X_{1} + e$$` #### 2. Subscript `\(i\)` or not? `$$Y = b_{0} + b_{1}\,X_{1} + e \; vs. \; Y_{i} = b_{0} + b_{1}\,X_{1_{i}} + e_{i}$$` #### 3. Which sign between estimates and predictors? `$$Y = b_{0} + b_{1}.X_{1} + b_{2}*X_{2} + b_{3}\,X_{3} + e$$` #### 4. Hat on `\(Y\)` or not? Capital letter or not? .center[ `$$\hat{Y}\; or\; \hat{y}\; vs.\; Y\; or\; y$$` ] --- class: inverse, mline, center, middle # 3. The General Linear Model --- # The General Linear Model Now the time has come to test these hypotheses using our equation(s)! <img src="lecture_1_files/figure-html/unnamed-chunk-19-1.png" width="50%" style="display: block; margin: auto;" /> --- # Vocabulary "Linear Model", "Linear Regression", "Multiple Regression" or simply "Regression" are all referring to the same model: **The General Linear Model**. It contains: - Only one Outcome/Dependent Variable - One or more Predictor/Independent Variables of any type (categorical or continuous) - Made of Main and/or Interaction Effects `$$Y = b_{0} + b_{1}\,Predictor\,1 + b_{2}\,Predictor\,2+ ...
+ b_{n}\,Predictor\,n + e$$` A Linear Regression is used **to test all the hypotheses at once** and to calculate the predictors' estimates. Specific tests are available for certain types of hypotheses, such as the t-test or ANOVA, but as they are special cases of Linear Regressions, their importance is limited (see [Jonas Kristoffer Lindeløv's blog post: Common statistical tests are linear models](https://lindeloev.github.io/tests-as-linear/)). --- # General Linear Model Everywhere .pull-left[ Most of the common statistical models (t-test, correlation, ANOVA, chi-square, etc.) are **special cases of linear models**. This beautiful simplicity means that there is less to learn. In particular, it all comes down to `\(y = ax + b\)`, which most students know from secondary school. Unfortunately, **stats intro courses are usually taught as if each test is an independent tool**, needlessly making life more complicated for students and teachers alike. Here, only **one test is taught to rule them all**: the General Linear Model (GLM). ] .pull-right[ <img src="https://psyteachr.github.io/msc-data-skills/images/memes/glm_meme.png" width="100%" style="display: block; margin: auto;" /> ] --- # Applied Example ### Imagine the following case study... > The CEO of Organisation Beta has problems with the well-being of employees and wants to investigate the relationship between **Job Satisfaction (js_score)**, **salary** and **performance (perf)**. -- ### Therefore the CEO formulates 3 hypotheses: - `\(H_{a1}\)`: `\(js\_score\)` increases when `\(salary\)` increases - `\(H_{a2}\)`: `\(js\_score\)` increases when `\(perf\)` increases - `\(H_{a3}\)`: The effect of `\(salary\)` on `\(js\_score\)` increases when `\(perf\)` increases -- ### The corresponding model is: `$$js\_score = b_{0} + b_{1}\,salary + b_{2}\,perf + b_{3}\,salary*perf + e$$` --- # Where Does the Regression Line Come From? Draw all the possible lines on the frame.
The best line, also called the best fit, is the one which has the lowest amount of error. .pull-left[ <img src="lecture_1_files/figure-html/unnamed-chunk-21-1.png" width="360" style="display: block; margin: auto;" /> ] .pull-right[ There are 200 models on this plot, but a lot are really bad! We need to find the good models by making precise our intuition that a good model is "close" to the data. Therefore, we need a way to quantify the distance between the data and a model. Then we can fit the model by finding the values of `\(b_0\)` and `\(b_1\)` that generate the model with the smallest distance from this data. ] --- # Best Model, Lowest Error For each point, this specific prediction error is called the **Residual** `\(e_i\)`, where `\(i\)` is a specific observation (e.g., an employee here). The error of the model is the sum of the prediction errors for each point (distance between actual value and predicted value). .pull-left[
] .pull-right[ The line which obtains the lowest error has the smallest residuals. This line is the one chosen by the linear regression. One common way to do this in statistics is to use the "Mean-Square Error" (aka `\(MSE\)`) or the "Root-Mean-Square Error" (aka `\(RMSE\)`). We compute the differences between actual and predicted values, square them, sum them and divide them by the `\(n\)` observations (and then take the square root for the `\(RMSE\)`). ] --- # The (Root-)Mean-Square Error <img src="lecture_1_files/figure-html/unnamed-chunk-23-1.png" width="864" style="display: block; margin: auto;" /> `$$MSE = \frac{\sum_{i=1}^{N}(y\,predicted_{i} - y\,actual_{i})^{2}}{N} \qquad RMSE = \sqrt{\frac{\sum_{i=1}^{N}(y\,predicted_{i} - y\,actual_{i})^{2}}{N}}$$` These calculations have lots of appealing mathematical properties, which we are not going to talk about here. You will just have to take my word for it! --- # Analysis of the Estimate Once the best line is found, each estimate of the tested equation is calculated by the software (i.e., `\(b_0, b_1, ..., b_n\)`). - `\(b_0\)` is the intercept and has no interest for hypothesis testing - `\(b_1, ..., b_n\)` are the predictors' effect estimates and each of them is used to test a hypothesis The predictors' effect estimates `\(b_1, ..., b_n\)` are **the value of the slope of the best line between each predictor** and the outcome.
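As a side note, the `\(MSE\)` and `\(RMSE\)` defined above are easy to compute by hand. A minimal sketch in plain Python (invented data and a hand-picked candidate line, not jamovi output):

```python
import math

# Made-up observations and a candidate prediction line y = b0 + b1 * x
x_actual = [1, 2, 3, 4, 5]
y_actual = [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1 = 0.0, 2.0  # candidate intercept and slope

y_predicted = [b0 + b1 * x for x in x_actual]
residuals = [yp - ya for yp, ya in zip(y_predicted, y_actual)]

n = len(y_actual)
mse = sum(r ** 2 for r in residuals) / n  # Mean-Square Error
rmse = math.sqrt(mse)                     # Root-Mean-Square Error
print(round(mse, 4), round(rmse, 4))      # 0.02 0.1414
```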
It indicates **how many units of the outcome variable increases/decreases/changes when the predictor increases by 1 unit**. Technically, `\(b\)` is a weight or multiplier applied to the Predictor's values to obtain the Outcome's expected values --- # Analysis of the Estimate - If `\(b_1, ..., b_n = 0\)`, then: - The regression line is horizontal (no slope) - When the Predictor increases by 1 unit, the Outcome variable does not change - **The null hypothesis is not rejected** -- - If `\(b_1, ..., b_n > 0\)`, then: - The regression line is positive (slope up) - When the Predictor increases by 1 unit, the Outcome variable increases by `\(b\)` - **The null hypothesis is rejected and the alternative hypothesis considered plausible** -- - If `\(b_1, ..., b_n < 0\)`, then: - The regression line is negative (slope down) - When the Predictor increases by 1 unit, the Outcome variable decreases by `\(b\)` - **The null hypothesis is rejected and the alternative hypothesis considered plausible** --- # Significance of Effect's Estimate The statistical significance of an effect estimate depends on the **strength of the relationship** and on the **sample size**: - An estimate of `\(b_1 = 0.02\)` can be very small but still significantly different from `\(b_1 = 0\)` - Whereas an estimate of `\(b_1 = 0.35\)` can be stronger but in fact not significantly different from `\(b_1 = 0\)` -- The significance is the probability of obtaining your results with your sample in the null hypothesis scenario: - Also called the `\(p\)`-value - Is between 0% and 100%, which corresponds to a value between 0.0 and 1.0 **If the `\(p\)`-value is lower than 5% or 0.05, then the probability of obtaining your results in the null hypothesis scenario is low enough to say that the null hypothesis scenario is rejected and there must be a link between the variables.** -- Remember that the `\(p\)`-value is the probability of the data given the null hypothesis: `\(P(data|H_0)\)`.
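The logic of estimate and significance can be sketched end to end: estimate `\(b_1\)` by least squares, then build the null-hypothesis scenario by shuffling the outcome. This is a plain-Python illustration using a permutation test as a stand-in for the `\(t\)`-test that jamovi actually reports; the data and names are invented:

```python
import random
import statistics

# Invented, perfectly linear data: outcome = 3 + 2 * predictor
predictor = list(range(1, 21))
outcome = [3 + 2 * x for x in predictor]

def slope(xs, ys):
    # Least-squares estimate b1 = sum((x - mx)(y - my)) / sum((x - mx)^2)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

b1 = slope(predictor, outcome)  # 2.0: outcome rises by 2 units per unit of predictor
b0 = statistics.mean(outcome) - b1 * statistics.mean(predictor)  # intercept: 3.0

# Null-hypothesis scenario (b1 = 0): shuffling the outcome destroys any real
# link, so permuted slopes show what "no effect" looks like. The p-value is
# the share of null slopes at least as extreme as the observed slope.
rng = random.Random(612)
n_perm = 999
extreme = 0
for _ in range(n_perm):
    shuffled = outcome[:]
    rng.shuffle(shuffled)
    if abs(slope(predictor, shuffled)) >= abs(b1):
        extreme += 1
p_value = (extreme + 1) / (n_perm + 1)  # tiny: the null scenario is rejected
print(b1, b0, p_value)
```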
--- # Estimating Regression's Coefficients The output of any software is two tables: - Model Fit Measure Table - Model Coefficients Table The **Model Fit Measure** table tests the prediction **accuracy of your overall model** (all predictors taken into account). The **Model Coefficients** table provides an estimate for each predictor `\(b_1, ..., b_n\)` (as well as the intercept `\(b_0\)`). The value of the estimate is statistically tested with a `\(p\)`-value to see if it is statistically different from 0 (null hypothesis). Therefore, this table is **used to test each hypothesis** separately. --- class: title-slide, middle ## Hypothesis with Continuous Predictor --- # Main Effect Example .pull-left[ ### Variables: - Outcome = `\(js\_score\)` (from 0 to 10) - Predictor = `\(salary\)` (from 0 to Inf.) ### Hypothesis: - `\(H_a\)`: `\(js\_score\)` increases when `\(salary\)` increases (i.e., `\(b_1>0\)`) - `\(H_0\)`: `\(js\_score\)` stays the same when `\(salary\)` increases (i.e., `\(b_1=0\)`) ### Equation: `$$js\_score = b_{0} + b_{1}\,salary + e$$` ] .pull-right[ <table> <thead> <tr> <th style="text-align:right;"> employee </th> <th style="text-align:right;"> salary </th> <th style="text-align:right;"> js_score </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 28876.89 </td> <td style="text-align:right;"> 5.057311 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 29597.12 </td> <td style="text-align:right;"> 6.642440 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 29533.34 </td> <td style="text-align:right;"> 6.119694 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 30779.97 </td> <td style="text-align:right;"> 9.482198 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 29916.63 </td> <td style="text-align:right;"> 8.883347 </td> </tr> <td
style="text-align:right;"> 6 </td> <td style="text-align:right;"> 30253.32 </td> <td style="text-align:right;"> 7.015606 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 29971.45 </td> <td style="text-align:right;"> 4.633738 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 29957.13 </td> <td style="text-align:right;"> 7.919998 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 31368.60 </td> <td style="text-align:right;"> 9.028004 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 29774.23 </td> <td style="text-align:right;"> 5.860449 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 31516.47 </td> <td style="text-align:right;"> 10.000000 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 28451.25 </td> <td style="text-align:right;"> 3.617721 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 30584.61 </td> <td style="text-align:right;"> 6.948510 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 30123.85 </td> <td style="text-align:right;"> 7.429012 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 30215.94 </td> <td style="text-align:right;"> 7.292992 </td> </tr> <tr> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 30379.64 </td> <td style="text-align:right;"> 7.765043 </td> </tr> <tr> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 29497.68 </td> <td style="text-align:right;"> 6.380634 </td> </tr> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 29666.79 </td> <td style="text-align:right;"> 5.962925 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 28981.42 </td> <td style="text-align:right;"> 5.607226 </td> </tr> <tr> <td 
style="text-align:right;"> 20 </td> <td style="text-align:right;"> 28928.21 </td> <td style="text-align:right;"> 4.635931 </td> </tr> </tbody> </table> ] --- # Main Effect Example .pull-left[ <img src="lecture_1_files/figure-html/unnamed-chunk-25-1.png" width="504" style="display: block; margin: auto;" /> ] .pull-right[ <img src="lecture_1_files/figure-html/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Model Fit Measure Table The **Model Fit Measure** table tests the prediction **accuracy of your overall model** (all predictors taken into account). `\(Model_{a}: js\_score = b_{0} + b_{1}\;salary + e\;vs.\; Model_{0}: js\_score = b_{0} + e\)` <img src="img/jamovi_mfm.png" width="25%" style="display: block; margin: auto;" /> -- Default Columns: - The **Model** column indicates the reference of the model in case you want to compare multiple models - `\(R\)` is the correlation between the outcome variable and all predictors taken into account (i.e., the closer to 1 or -1 the better; however, in social science, models with more than 0.2 or less than -0.2 are already excellent) - `\(R^2\)` is the % of variance from the outcome explained by the model (e.g., `\(R^2 = 0.73\)` means the model explains 73% of the variance of the outcome variable). `\(R^2\)` is also called the **Coefficient of Determination** --- # Model Coefficients Table The **Model Coefficients** table provides an estimate for each predictor `\(b_1, ..., b_n\)` (as well as the intercept `\(b_0\)`). The value of the estimate is statistically tested with a `\(p\)`-value to see if it is statistically different from 0 (null hypothesis).
<img src="img/jamovi_mc.png" width="302" style="display: block; margin: auto;" /> -- Default Columns: - **Predictor** is the list of variables associated with parameters in your model (main and interaction), which includes the intercept - **Estimate** is the non-standardized relationship estimate of the best prediction line (expressed in the unit of the variable) - **SE** is the Standard Error and indicates how spread out the values are around the estimate - `\(t\)` is the value of the statistical test comparing the estimate obtained with this sample to an estimate of 0 (i.e., `\(H_0\)`) - `\(p\)` is the p-value, i.e., the probability of obtaining our prediction with our sample under the null hypothesis scenario --- class: title-slide, middle ## Hypotheses with Categorical Predictor having 2 Categories --- # Categorical Predictor with 2 Categories A hypothesis of differences between two groups is easily tested with a Linear Regression: - If `\(\mu_{1} \neq \mu_{2}\)`, the slope of the line between these averages is not null (i.e., `\(b_{1} \neq 0\)`) - If `\(\mu_{1} = \mu_{2}\)`, the slope of the line between these averages is null (i.e., `\(b_{1} = 0\)`) ### Explanation .pull-left[ **Comparing the difference between two averages is the same as comparing the slope of the line crossing these two averages** - If two averages are **not equal**, then **the slope of the line crossing these two averages is not 0** - If two averages are **equal**, then the **slope of the line crossing these two averages is 0** ] .pull-right[ <img src="lecture_1_files/figure-html/unnamed-chunk-29-1.png" width="288" style="display: block; margin: auto;" /> ] --- class: title-slide, middle ## Hypotheses with Categorical Predictor having 3+ Categories --- # ANOVA Test for Overall Effects Besides Linear Regression and the `\(t\)`-test, researchers use ANOVA a lot. ANOVA stands for Analysis of Variance and is also a subcategory of Linear Regression Models.
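As a quick aside, the two-group equivalence shown above can be verified in R: a linear model with a 2-category predictor gives the same test as a classic pooled-variance `\(t\)`-test. A sketch using the built-in `mtcars` data (not our course data):

```r
# a linear model with a 2-category predictor matches the pooled-variance t-test
model <- lm(mpg ~ factor(am), data = mtcars)
ttest <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# same |t| (the sign only depends on which group is the reference)
all.equal(abs(summary(model)$coefficients[2, "t value"]),
          abs(unname(ttest$statistic)))

# and the same p-value
all.equal(summary(model)$coefficients[2, "Pr(>|t|)"], ttest$p.value)
```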
ANOVA is used to calculate the overall effect of a categorical variable having more than 2 categories, which a `\(t\)`-test cannot handle. In the case of testing 1 categorical variable, a "one-way" ANOVA is performed. **How does ANOVA work?** ### In plain words - `\(H_a\)`: at least one group is different from the others - `\(H_0\)`: all the groups are the same ### In mathematical terms - `\(H_a\)`: it is **not true** that `\(\mu_{1} = \mu_{2} = \mu_{3}\)` - `\(H_0\)`: it is **true** that `\(\mu_{1} = \mu_{2} = \mu_{3}\)` --- # ANOVA Test for Overall Effects I won't go too much into the details, but to check if at least one group is different from the others, the distance of each value to the overall mean (Between-group variation) is compared to the distance of each value to their group mean (Within-group variation). **If the Between-group variation is the same as the Within-group variation, all the groups are the same.** <img src="img/one_way_anova_basics.png" width="100%" style="display: block; margin: auto;" /> --- class: title-slide, middle ## Assumptions of General Linear Regression Models --- # 4 Assumptions Statistical tests are widely used to test hypotheses, exactly as we just did, but all statistical tests have requirements to meet before being applied. The General Linear Model has 4 requirements: ## 1. **L**inearity (of the effects) ## 2. **I**ndependence (of observations) ## 3. **N**ormality (of the residuals) ## 4. **E**qual Variance (of the residuals) While the assumptions of a Linear Model are never perfectly met in reality, we must check whether they are reasonable enough to work with. --- class: inverse, mline, center, middle # 4. GLM with JAMOVI --- # JAMOVI: Stats. Open. Now. Jamovi is a statistical spreadsheet software designed to be **easy to use**. Jamovi is a compelling alternative to costly statistical products such as SPSS, SAS and JMP, to name a few.
Jamovi will always be **free and open** because Jamovi is made by the scientific community, for the scientific community. - It can be **downloaded from its website** https://www.jamovi.org/ - It can also be **used without installation**, in a web browser, at https://cloud.jamovi.org/ as an **online demo**; however, this demo undergoes periods of downtime and may cease functioning (without warning) at any time.
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Book "Learning Statistics with JAMOVI" free here: https://www.learnstatswithjamovi.com/ <img src="https://www.jamovi.org/assets/header-logo.svg" width="100%" style="display: block; margin: auto;" /> --- # JAMOVI GUI <img src="img/jamovi_gui.png" width="100%" style="display: block; margin: auto;" /> --- # Anatomy of JAMOVI ### 1. Different symbols for **variable types** <img src="img/jamovi_icons.png" width="15%" style="display: block; margin: auto;" /> ### 2. Distinction between **Factors** and **Covariates**: - A Factor is a predictor of type categorical (nominal or ordinal) - A Covariate is a predictor of type continuous
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
The expected variable type is displayed in the bottom right corner of the boxes ### 3. Customise your analysis by **unfolding optional boxes** ### 4. Two linear regression **tables by default**: - Model Fit Measures - Model Coefficients --- class: inverse, mline, center, middle # 5. GLM with R --- # Estimates and Linear Regression in R The `lm()` function calculates each estimate and tests it against 0 for you. `lm()` has only two arguments that you should care about: `formula` and `data`. - `formula` is the translation of the equation of the model - `data` is the name of the data frame object containing the variables. Here is a generic example: ```r lm(formula = Outcome ~ Pred1 + Pred2, data = my_data_object) ``` Here is an example with {gapminder}: ```r lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) ``` --- # Mastering the Formula `lm()` has only one difficulty, the `formula`. The `formula` is the direct translation of the equation tested, but with its own representation: 1. The = sign is replaced by `~` (read "according to" or "by") 2. Each predictor is added with the `+` sign 3.
An interaction effect uses the symbol `:` instead of `*` (note that, in an R formula, `Pred1 * Pred2` is a shorthand for `Pred1 + Pred2 + Pred1:Pred2`) -- Here are some generic equations and their conversion into a `formula`: `$$Outcome = b_0 + b_1 Pred1 + b_2 Pred2 + e$$` ```r lm(formula = Outcome ~ Pred1 + Pred2, data = my_data_object) ``` `$$Outcome = b_0 + b_1 Pred1 + b_2 Pred2 + b_3 Pred3 + e$$` ```r lm(formula = Outcome ~ Pred1 + Pred2 + Pred3, data = my_data_object) ``` `$$Outcome = b_0 + b_1 Pred1 + b_2 Pred2 + b_3 Pred1*Pred2 + e$$` ```r lm(formula = Outcome ~ Pred1 + Pred2 + Pred1:Pred2, data = my_data_object) ``` --- # Mastering the Formula Here are some equations from the gapminder dataset and their conversion into a `formula`: -- `$$lifeExp = b_0 + b_1 gdpPercap + b_2 year + e$$` ```r lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) ``` -- `$$lifeExp = b_0 + b_1 gdpPercap + b_2 year + b_3 gdpPercap * year + e$$` ```r lm(formula = lifeExp ~ gdpPercap + year + gdpPercap:year, data = gapminder) ``` --- # Categorical Predictor Exactly as in Jamovi, `lm()` by default investigates continuous predictors or categorical predictors having 2 categories: ```r model_gapminder <- lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) ``` However, to test the hypothesis of a categorical predictor having 3 or more categories, the ANOVA omnibus test is required.
It can be obtained by using the `aov()` function with the lm model as input: ```r model_gapminder <- lm(formula = lifeExp ~ country + year, data = gapminder) aov(model_gapminder) ``` To make the code shorter, it is possible to pipe this `aov()` call: ```r model_gapminder <- lm(formula = lifeExp ~ country + year, data = gapminder) %>% aov() ``` --- # LM Summary While the function `lm()` computes the model, the function `summary()` displays the results: ```r model_gapminder <- lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) summary(model_gapminder) ``` ``` Call: lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) Residuals: Min 1Q Median 3Q Max -67.262 -6.954 1.219 7.759 19.553 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -418.42425945 27.61713769 -15.15 <0.0000000000000002 *** gdpPercap 0.00066973 0.00002447 27.37 <0.0000000000000002 *** year 0.23898275 0.01397107 17.11 <0.0000000000000002 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 9.694 on 1701 degrees of freedom Multiple R-squared: 0.4375, Adjusted R-squared: 0.4368 F-statistic: 661.4 on 2 and 1701 DF, p-value: < 0.00000000000000022 ``` --- # LM Summary The output of the `summary()` function is pretty dense, but let's analyse it line by line. The first line reminds us of what the actual regression model is: ``` Call: lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) ``` The next part provides a quick summary of the residuals (i.e., the ε values): ``` Residuals: Min 1Q Median 3Q Max -67.262 -6.954 1.219 7.759 19.553 ``` This can be convenient as a quick check that the model is okay. **Linear regression assumes that these residuals were normally distributed, with mean 0.** In particular, it's worth quickly checking to see if the median is close to zero, and to see if the first quartile is about the same size as the third quartile. If they look badly off, there's a good chance that the assumptions of regression are violated.
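That quick residual check can also be done by hand rather than read off the printout. A minimal sketch (using the built-in `mtcars` data, not our gapminder model):

```r
# summary()-style five-number check of the residuals
model <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(model)

# the median should be close to 0, and 1Q/3Q roughly symmetric
round(quantile(res, probs = c(0, 0.25, 0.5, 0.75, 1)), 3)

# with an intercept, their mean is 0 by construction (up to floating point error)
mean(res)
```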
--- # LM Summary The next part of the R output looks at the coefficients of the regression model: ``` Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -418.42425945 27.61713769 -15.15 <0.0000000000000002 *** gdpPercap 0.00066973 0.00002447 27.37 <0.0000000000000002 *** year 0.23898275 0.01397107 17.11 <0.0000000000000002 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``` Each row in this table refers to one of the coefficients estimated in the regression model. The first row is the intercept term, and the later ones look at each of the predictors. The columns give you all of the relevant information: - The first column is the actual estimate of b (e.g., -418.42425945 for the intercept, 0.00066973 for gdpPercap and 0.23898275 for year). - The second column is the standard error estimate (SE). - The third column gives you the t-statistic. - Finally, the fourth column gives you the actual p-value for each of these tests. --- # LM Summary The only thing that the previous table doesn't list is the degrees of freedom used in the t-test, which is always N−K−1 and is listed immediately below, in this line: ``` Residual standard error: 9.694 on 1701 degrees of freedom ``` The value of df=1701 is equal to N−K−1, so that's what we use for our t-tests. In the final part of the output we have the F-test and the R<sup>2</sup> values, which assess the performance of the model as a whole: ``` Multiple R-squared: 0.4375, Adjusted R-squared: 0.4368 F-statistic: 661.4 on 2 and 1701 DF, p-value: < 0.00000000000000022 ``` So in this case, the model performed significantly better than you'd expect by chance (F(2,1701) = 661.4, p < 0.001): the R<sup>2</sup> = 0.4375 value indicates that the regression model accounts for 43.7% of the variability in the outcome measure.
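All of these quantities can also be extracted from the summary object programmatically rather than read off the printout. A sketch with the built-in `mtcars` data (N = 32, K = 2 predictors):

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(model)

s$r.squared        # Multiple R-squared
s$adj.r.squared    # Adjusted R-squared
s$fstatistic       # F value with its numerator and denominator df
df.residual(model) # residual degrees of freedom, N - K - 1 = 29
```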
When we look back up at the t-tests for each of the individual coefficients, we have pretty strong evidence that gdpPercap and year have a significant effect. --- # Reporting Clean Results To communicate about your statistical analyses in an academic report, the simplest method is to find the values in the `summary()` output and to copy-paste them into the text according to the expected format that we have seen in the previous lectures. However, this task can be long, difficult, and prone to human error. Thankfully, R has additional packages that provide alternative functions to read linear regression models and communicate results. Because there are too many packages, I will focus only on one additional package: {report}. <img src="https://memegenerator.net/img/instances/73408711/whoa-i-know-linear-regression.jpg" style="display: block; margin: auto;" /> --- # Automatic Results with {report} To install {report}, use the usual `install.packages()` function: ```r install.packages("report") ``` The package {report} prints a text containing all the statistics, already in sentences ready to be interpreted (see https://easystats.github.io/report/). To print the statistical analyses: 1. Load the package {report} 2. Create an object containing the output of the function `lm()` 3. Use this object as input of the function `report()` from the {report} package **Note: If used in an R Markdown document, the chunk containing `report()` has to include the chunk option `results='asis'`** --- # Automatic Results with {report} ```r library(report) model_gapminder <- lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) report(model_gapminder) ``` We fitted a linear model (estimated using OLS) to predict lifeExp with gdpPercap (formula: lifeExp ~ gdpPercap + year). The model explains a statistically significant and substantial proportion of variance (R2 = 0.44, F(2, 1701) = 661.44, p < .001, adj. R2 = 0.44).
The model's intercept, corresponding to gdpPercap = 0, is at -418.42 (95% CI [-472.59, -364.26], t(1701) = -15.15, p < .001). Within this model: - The effect of gdpPercap is statistically significant and positive (beta = 6.70e-04, 95% CI [6.22e-04, 7.18e-04], t(1701) = 27.37, p < .001; Std. beta = 0.51, 95% CI [0.47, 0.55]) - The effect of year is statistically significant and positive (beta = 0.24, 95% CI [0.21, 0.27], t(1701) = 17.11, p < .001; Std. beta = 0.32, 95% CI [0.28, 0.36]) Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using a Wald t-distribution approximation.
--- # Automatic Checks with {performance} ```r library(performance) model_gapminder <- lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) check_model(model_gapminder) ``` <img src="lecture_1_files/figure-html/unnamed-chunk-48-1.png" width="50%" style="display: block; margin: auto;" /> --- # Automatic Everything with {easystats} The libraries {report} and {performance} are in fact included in a meta-package called {easystats} (exactly like {dplyr} and {tidyr} are included in {tidyverse}). By installing {easystats} you will install all the packages within this ecosystem; see its website for more details on all the packages: https://easystats.github.io/easystats/ Among the functions available, one is extremely relevant because it includes all the ones previously seen: `model_dashboard()` ```r library(easystats) model_gapminder <- lm(formula = lifeExp ~ gdpPercap + year, data = gapminder) model_dashboard(model_gapminder) ``` Have a try! --- class: title-slide, middle ## Live Demo: How to write a paper in 5 min! https://damien-dupre.github.io/mt612/doc/voodoo_mock --- class: title-slide, middle ## Exercise In the coming days, I will send you simulated data using the variables presented in your reference paper. I will also send you hypotheses using these variables. Your task will be to test these hypotheses using the General Linear Model and to report their results exactly like in a research paper. Every student will present their results at the beginning of the next lecture. --- class: inverse, mline, left, middle <img class="circle" src="https://github.com/damien-dupre.png" width="250px"/> # Thanks for your attention and don't hesitate if you have any questions! [
@damien_dupre](http://twitter.com/damien_dupre) [
@damien-dupre](http://github.com/damien-dupre) [
damien-datasci-blog.netlify.app](https://damien-datasci-blog.netlify.app) [
damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)