class: center, middle, inverse, title-slide

.title[
# MT612 - Advanced Quant. Research Methods
]
.subtitle[
## Lecture 2: The Generalised Linear Model
]
.author[
### Damien Dupré
]
.date[
### Dublin City University
]

---

# General Linear Models

So far, we have seen the equation of General Linear Models written as follows:

`$$Y = b_{0} + b_{1}\,X_{1} + e$$`

As mentioned previously, it is more rigorous practice to include the subscripts, as follows:

`$$Y_{i} = b_{0} + b_{1}\,X_{1_{i}} + e_{i}$$`

where `\(i\)` is an observation number between 1 and N (your sample size).

The residual of the `\(i^{th}\)` observation `\((x_i, y_i)\)` is the difference between the observed response `\((y_i)\)` and the response predicted by the fitted model `\((\hat{y}_{i})\)`:

`$$e_{i} = y_{i} - \hat{y}_{i}$$`

We typically obtain `\(\hat{y}_{i}\)` by plugging `\(x_i\)` into the model.

---

# General Linear Models

.pull-left[
For instance, to test the main effect hypothesis of `\(salary\)` on `\(js\_score\)`, the model including subscripts looks like:

`$$js\_score_{i} = b_{0} + b_{1}\,salary_{i} + e_{i}$$`

where `\(i\)` is an employee (observation unit) between 1 and 20 (our sample size).

<img src="lecture_2_files/figure-html/unnamed-chunk-1-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> employee </th> <th style="text-align:right;"> salary </th> <th style="text-align:right;"> js_score </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 28876.89 </td> <td style="text-align:right;"> 5.057311 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 29597.12 </td> <td style="text-align:right;"> 6.642440 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 29533.34 </td> <td style="text-align:right;"> 6.119694 </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 30779.97 </td> <td style="text-align:right;"> 9.482198 </td> </tr>
<tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 29916.63 </td> <td style="text-align:right;"> 8.883347 </td> </tr>
<tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 30253.32 </td> <td style="text-align:right;"> 7.015606 </td> </tr>
<tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 29971.45 </td> <td style="text-align:right;"> 4.633738 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 29957.13 </td> <td style="text-align:right;"> 7.919998 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 31368.60 </td> <td style="text-align:right;"> 9.028004 </td> </tr>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 29774.23 </td> <td style="text-align:right;"> 5.860449 </td> </tr>
<tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 31516.47 </td> <td style="text-align:right;"> 10.000000 </td> </tr>
<tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 28451.25 </td> <td style="text-align:right;"> 3.617721 </td> </tr>
<tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 30584.61 </td> <td style="text-align:right;"> 6.948510 </td> </tr>
<tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 30123.85 </td> <td style="text-align:right;"> 7.429012 </td> </tr>
<tr> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 30215.94 </td> <td style="text-align:right;"> 7.292992 </td> </tr>
<tr> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 30379.64 </td> <td style="text-align:right;"> 7.765043 </td> </tr>
<tr> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 29497.68 </td> <td style="text-align:right;"> 6.380634 </td> </tr>
<tr> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 29666.79 </td> <td style="text-align:right;"> 5.962925 </td> </tr>
<tr> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 28981.42 </td> <td style="text-align:right;"> 5.607226 </td> </tr>
<tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 28928.21 </td> <td style="text-align:right;"> 4.635931 </td> </tr>
</tbody>
</table>
]

---

# General Linear Models

Let's replace `\(e_i\)` in the equation for some of the `\(i\)` employees.

<img src="lecture_2_files/figure-html/unnamed-chunk-3-1.png" width="864" style="display: block; margin: auto;" />

The estimates `\(b_{0}\)` and `\(b_{1}\)` stay the same, but the error `\(e_{i}\)` changes for each employee:

`$$js\_score_{1} = -48.58 + 0.001\times salary_{1} + 0.17$$`

`$$js\_score_{2} = -48.58 + 0.001\times salary_{2} + 0.42$$`

and so on...

---

# General Linear Models

Besides the subscripts, there is something I have deliberately hidden from you: what the distribution of the residuals `\(e_{i}\)` is expected to be. The full specification of a General Linear Model is the following:

`$$Y_{i} = b_{0} + b_{1}\,X_{1_{i}} + e_{i}$$`

`$$e_{i} \sim \mathcal{N}(0, \sigma)$$`

This last line looks scary but is actually very simple: it states one of the assumptions of the General Linear Model, the Normality of the residuals. It means that the error `\(e_{i}\)` is assumed to be normally distributed with mean 0 and some constant standard deviation `\(\sigma\)`. For example, with `\(\sigma=2\)`, if we arrange all the `\(e_{i}\)` errors by size, they should follow this trend:

<img src="lecture_2_files/figure-html/unnamed-chunk-4-1.png" width="864" style="display: block; margin: auto;" />

---

# General Linear Models

In our case involving `\(salary\)` and `\(js\_score\)`, in the figure showing the residual error of every observation, the order is given by the `\(X\)` variable.

<img src="lecture_2_files/figure-html/unnamed-chunk-5-1.png" width="864" style="display: block; margin: auto;" />

But if we sort the residuals from the lowest to the highest and count them by section (i.e., plot a histogram), we should see that most of the residual errors are around 0.

<img src="lecture_2_files/figure-html/unnamed-chunk-6-1.png" width="864" style="display: block; margin: auto;" />

---

# Ecologically-Relevant Distributions

The Gaussian distribution is rarely adequate in real life; GLMs offer ecologically meaningful alternatives:

- **Poisson** — counts; integers, non-negative, variance increases with mean

- **Binomial** — observed proportions from a total; integers, non-negative, bounded at 0 and 1, variance largest at `\(\pi = 0.5\)`

- **Binomial** — presence/absence data; discrete values, 0 and 1, models the probability of success

- **Gamma** — concentrations; non-negative (strictly positive with a log link) real values, variance increases with mean, right-skewed with many low values and a few high values

---

# Generalized Linear Models

So far, all the models we have tested assumed that the residuals follow a normal distribution.
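To make this assumption concrete, here is a minimal sketch of how the residuals of the `js_score ~ salary` model could be inspected in R (assuming the `organisation_beta` data frame shown on the earlier slides):

```r
# Fit the general linear model js_score ~ salary
model_lm <- lm(js_score ~ salary, data = organisation_beta)

# Residuals e_i = y_i - y_hat_i
e <- residuals(model_lm)

# Visual checks of normality: histogram and normal Q-Q plot
hist(e, breaks = 10, main = "Distribution of residuals", xlab = "Residual")
qqnorm(e)
qqline(e)
```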
Normally distributed residuals are typically plausible when the outcome variable is continuous. However, linear models can also be used when the outcome variable is not continuous. In this case, a **Generalized Linear Model** is used. Three different kinds of outcome variable can be tested:

- if the outcome variable has only 2 possibilities (e.g., survive: "yes" or "no"), this is a **Logistic Regression**
- if the outcome variable is a count of events (a non-negative integer), this is a **Poisson Regression**
- if the outcome variable has more than 2 possibilities, this is a **Multinomial Regression**

---

class: inverse, mline, center, middle

# 1. The Logistic Regression

---

# Limitations of the GLM

One of the requirements of the GLM is a continuous outcome. What happens if the outcome is not continuous?

Suppose that we are trying to predict the department of each employee based on their salary. In this simplified example, there are three possible departments: Marketing, Sales, and HR. We could consider encoding these values in an outcome variable, `\(Y\)`, as follows:

`$$Y = \begin{cases} 1 & if\, \color{orange}{Marketing}\\ 2 & if\, \color{orange}{Sales}\\ 3 & if\, \color{orange}{HR} \end{cases}$$`

Using this coding, a linear regression model could be used to predict `\(Y\)` with salary as a predictor. Unfortunately, this coding implies an ordering on the outcomes, putting `\(Sales\)` in between `\(Marketing\)` and `\(HR\)`, and insisting that the difference between `\(Marketing\)` and `\(Sales\)` is the same as the difference between `\(Sales\)` and `\(HR\)`. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding:

`$$Y = \begin{cases} 1 & if\, \color{orange}{Sales}\\ 2 & if\, \color{orange}{Marketing}\\ 3 & if\, \color{orange}{HR} \end{cases}$$`

---

# Limitations of the GLM

Each numeric coding of a categorical variable implies a totally different set of relationships among the conditions. Each of these codings would produce a fundamentally different linear model, ultimately leading to different sets of predictions on test observations.

If the variable were Categorical Ordinal (e.g., shirt size: S, M, L), then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.

<img src="http://www.alexanderdemos.org/RegressionClass/OLS.png" width="50%" style="display: block; margin: auto;" />

---

class: title-slide, middle

## Logistic Regression: Theory

---

# Case with a Binary Outcome

For a binary (two categories) outcome variable, the situation is better. For instance, perhaps there are only two possible employee departments: `\(Marketing\)` and `\(Sales\)`. We could then recode this variable as a dummy variable as follows:

`$$Y = \begin{cases} 0 & if\, \color{orange}{Marketing}\\ 1 & if\, \color{orange}{Sales} \end{cases}$$`

Then consider the hypothesis that salary predicts/explains the employee department (i.e., the effect of salary on employee department), where the outcome variable department has two categories, `\(Marketing\)` or `\(Sales\)`.
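As a quick illustration, here is a sketch of how such a 0/1 dummy could be created in R, assuming `organisation_beta` contains a character column `department` with the values "Marketing" and "Sales":

```r
# 0/1 dummy variable: 1 if Sales, 0 if Marketing
organisation_beta$department_dummy <-
  ifelse(organisation_beta$department == "Sales", 1, 0)

# Equivalent factor approach: glm() treats the first level ("Marketing") as 0
organisation_beta$department <-
  factor(organisation_beta$department, levels = c("Marketing", "Sales"))
```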
For a binary outcome variable with a 0/1 coding as above, a General Linear Model is not completely unreasonable at first sight:

<img src="lecture_2_files/figure-html/unnamed-chunk-8-1.png" width="864" style="display: block; margin: auto;" />

---

Now, let's have a closer look at how good the General Linear Model approach is at predicting `\(Department = Sales\)` using `\(salary\)`:

<img src="lecture_2_files/figure-html/unnamed-chunk-9-1.png" width="864" style="display: block; margin: auto;" />

Here we see the problem with this approach: for salaries below €29,000 we predict a negative probability of being in the Sales department; if we were to predict for very large salaries, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of being in the Sales department, regardless of salary, must fall between 0 and 1.

This problem is not unique to these data. Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict `\(p(Y) < 0\)` for some values of `\(X\)` and `\(p(Y) > 1\)` for others (unless the range of `\(X\)` is limited).

---

# Case with a Binary Outcome

Now let's have a look at the distribution of the residuals:

<img src="lecture_2_files/figure-html/unnamed-chunk-10-1.png" width="864" style="display: block; margin: auto;" />

When the outcome variable has only 2 possible values, a linear model will never work, because the residuals can never have a distribution that looks even remotely normal. When the outcome variable is categorical, so not continuous, we generally use logistic regression.

---

# Logistic Regression

Logistic regression is a generalized linear model where the outcome is a two-level categorical variable. The outcome variable ...

- ... is denoted by `\(Y_i\)`, where the index `\(i\)` is used to represent observation `\(i\)`.

- ... takes the value 1 with probability `\(p_i\)` and the value 0 with probability `\(1 - p_i\)`.

In the organisation_beta data, `\(Y_i\)` will be used to represent whether an employee `\(i\)` is part of the Sales department `\((Y_i=1)\)` or part of the Marketing department `\((Y_i=0)\)`.

The predictor variables are represented as follows: `\(X1_{i}\)` is the value of variable 1 for observation `\(i\)`, `\(X2_{i}\)` is the value of variable 2 for observation `\(i\)`, and so on.

`$$transformation(p_i) = b_0 + b_1 X1_{i} + b_2 X2_{i} + \cdots + b_k Xk_{i} \nonumber \\ Y_i \sim Bern(p_{i})$$`

---

class: inverse

# Logistic Regression

We want to choose a **transformation** in the equation that makes practical and mathematical sense. For example, we want a transformation that makes the range of possibilities on the left hand side of the equation equal to the range of possibilities for the right hand side; if there were no transformation, the left hand side could only take values between 0 and 1, but the right hand side could take values outside of this range.
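A quick numerical illustration of such a transformation: base R's `qlogis()` and `plogis()` implement the logit transformation defined just below and its inverse:

```r
# The logit maps probabilities in (0, 1) to the whole real line
qlogis(c(0.01, 0.50, 0.99))
#> [1] -4.59512  0.00000  4.59512

# Its inverse maps any real value back into (0, 1)
plogis(c(-5, 0, 5))
#> [1] 0.006692851 0.500000000 0.993307149
```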
A common transformation for `\(p_i\)` is the **logit transformation**, which may be written as

`$$logit(p_i) = \log_{e}\left( \frac{p_i}{1-p_i} \right)$$`

We can rewrite the equation relating `\(Y_i\)` to its predictors using the logit transformation of `\(p_i\)`:

`$$\log_{e}\left( \frac{p_i}{1-p_i} \right) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \cdots + b_k x_{k,i}$$`

---

# First step

`$$Odds =\frac{p_{i}}{1-p_{i}}$$`

<img src="lecture_2_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />

---

# Second step

`$$logOdds =\log\left(\frac{p_{i}}{1-p_{i}}\right)$$`

<img src="lecture_2_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

---

# What is the model doing?

<img src="img/image2a.png" width="100%">

---

class: inverse

# Logistic Regression

To convert from values on the logistic regression scale to the probability scale, we need to back-transform and then solve for `\(p_i\)`:

`$$\begin{aligned} \log_{e}\left( \frac{p_i}{1-p_i} \right) &= b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i} \\ \frac{p_i}{1-p_i} &= e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i &= \left( 1 - p_i \right) e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i &= e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} - p_i \times e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i + p_i \text{ } e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} &= e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i(1 + e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}}) &= e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}} \\ p_i &= \frac{e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}}}{1 + e^{b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}}} \end{aligned}$$`

As with most applied data problems, we substitute the point estimates for the parameters (the `\(b_i\)`) so that we can make use of this formula.

---

# Logistic Regression

Rather than modelling the outcome `\(Y\)` directly, logistic regression models the probability that `\(Y\)` belongs to a particular category. For the department variable, logistic regression models the probability of belonging to a specific department. For example, the probability of belonging to a specific department given the employee's salary can be written as:

`$$Pr(Department = Sales | Salary)$$`

The values of `\(Pr(Department = Sales | Salary)\)`, which we abbreviate `\(p(Sales)\)`, will range between 0 and 1. Then, for any given value of salary, a prediction can be made for the department `\(Sales\)`.

---

class: inverse

# Logistic Regression

In logistic regression, we model `\(p(X)\)` using the logistic function, which gives outputs between 0 and 1 for all values of `\(X\)`:

`$$p(\texttt{Sales}) = \textrm{logistic}(b_0 + b_1 \texttt{Salary}_{i}) = \frac{\textrm{exp}(b_0 + b_1 \texttt{Salary}_{i})}{1+\textrm{exp}(b_0+ b_1 \texttt{Salary}_{i})} \nonumber \\ Department_i \sim Bern(p_{\texttt{Sales}})$$`

or, in its more intimidating version:

`$$p(\texttt{Sales}) = \textrm{logistic}(\beta_0 + \beta_1 \texttt{Salary}_{i}) = \frac{\mathrm{e}^{\beta_0 + \beta_1 \texttt{Salary}_{i}}}{1+\mathrm{e}^{\beta_0+ \beta_1 \texttt{Salary}_{i}}} \nonumber \\ Department_i \sim Bern(p_{\texttt{Sales}})$$`

---

# Logistic Regression

The Figure below illustrates the fit of the logistic regression model to our data:

- for low `\(salary\)` the predicted probability of being part of the `\(Sales\)` department is close to, but never exactly, zero (the employee is then most likely in `\(Marketing\)`);
- for high `\(salary\)` the predicted probability of being part of the `\(Sales\)` department is close to, but never above, one.

<img src="lecture_2_files/figure-html/unnamed-chunk-13-1.png" width="864" style="display: block; margin: auto;" />

The logistic function will always produce an S-shaped curve of this form, so regardless of the value of `\(X\)`, we will obtain a sensible prediction. We also see that the logistic model is better able to capture the range of probabilities than the linear regression model.

---

# More Generalised Linear Models

Logistic regression is one specific form of a generalised linear model. Here we have applied a generalised linear model with a so-called logit link function: instead of modelling the dependent variable `\(Y\)` directly, we have modelled the logit of the probability of obtaining a `\(Y\)`-value of 1.

Many other link functions are possible. We will see one of them in the section on generalised linear models for count data. But first, let's see how logistic regression can be performed in Jamovi and in R, and how we should interpret the output.

---

class: title-slide, middle

## Logistic Regression with Jamovi

---

# Logistic Regression with Jamovi

1. You need one categorical dependent variable (nominal or ordinal) and at least one explanatory variable (continuous or categorical)

2. In the “Analyses” tab, click the “Regression” button and, from the menu that appears, in the “Logistic Regression” section, select “2 Outcomes (Binomial)”

3. Drag and drop your outcome variable to Dependent Variable and your predictor to Covariates (if continuous) or Factors (if categorical)

4. The results are shown in a Model Fit Measures table and in a Model Coefficients table

<img src="https://docs.jamovi.org/_images/jg_select_regression_logistic.jpg" width="50%" style="display: block; margin: auto;" />

---

# Logistic Regression with Jamovi

The result from a logistic regression analysis looks very much like that of the ordinary linear model. An important difference is that the statistics shown are no longer `\(t\)`-statistics, but `\(z\)`-statistics. This is because with logistic models, the ratio `\(b_1/SE\)` does not have a `\(t\)`-distribution.

- In ordinary linear models, the ratio `\(b_1/SE\)` has a `\(t\)`-distribution because the variance of the residuals, `\(\sigma^2_e\)`, has to be estimated (as it is unknown). If the residual variance were known, `\(b_1/SE\)` would have a standard normal distribution.

- In logistic models, there is no `\(\sigma^2_e\)` that needs to be estimated (it is fixed at 1 by default), so the ratio `\(b_1/SE\)` has a standard normal distribution. One can therefore calculate a `\(z\)`-statistic `\(z=b_1/SE\)` and check whether that value is smaller than −1.96 or larger than +1.96, which corresponds to a `\(p\)`-value of 0.05.

---

# Logistic Regression with Jamovi

The interpretation of the slope parameters is very similar to that of other linear models. Note that we have the following equation for the logistic model:

`$$\textrm{logit}(p_{\texttt{Sales}}) = b_0 + b_1 \texttt{salary} \nonumber \\ \texttt{Department} \sim Bern(p_{\texttt{Sales}})$$`

If we fill in the values from the output, we get

`$$\textrm{logit}(p_{\texttt{Sales}}) = -70.7 + 0.002 \times \texttt{salary} \nonumber \\ \texttt{Department} \sim Bern(p_{\texttt{Sales}})$$`

We can interpret these results by making some predictions. Imagine an employee with a yearly income of €25,000. Then the predicted log-odds equal `\(-70.7 + 0.002 \times 25000= -20.7\)`.
When we transform this back to a probability, we get `\(\textrm{exp}(-20.7)/(1+ \textrm{exp}(-20.7)) \approx 0.000000001\)`. So this model predicts that for people with a yearly income of €25,000, virtually none (far less than 0.001%) are part of the Sales department.

---

class: title-slide, middle

## Logistic Regression with R

---

# Logistic Regression with R

The function that we use in R is the `glm()` function, which stands for Generalised Linear Model. We can use the following code:

```r
model_analysed <- glm(
  formula = department ~ salary, 
  data = organisation_beta, 
  family = binomial(link = logit)
)
```

`department` is our outcome variable, `salary` is our predictor variable, and these variables are stored in the data frame called `organisation_beta`. We further have to specify that we want to use the binomial (Bernoulli) distribution and a logit link function, so `link = logit`.

Actually, the code can be a little bit shorter, because the logit link function is the default option with the binomial distribution:

```r
model_analysed <- glm(department ~ salary, organisation_beta, family = binomial)
```

or

```r
model_analysed <- glm(department ~ salary, organisation_beta, family = "binomial")
```

---

# Logistic Regression with R

Below, we see the parameter estimates from this generalised linear model:

```r
summary(model_analysed)
```

```
Call:
glm(formula = department ~ salary, family = "binomial", data = organisation_beta)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2495  -0.8303  -0.2893   0.4911   1.9786  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
(Intercept) -70.660141  34.675115  -2.038   0.0416 *
salary        0.002331   0.001152   2.023   0.0430 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25.898  on 19  degrees of freedom
Residual deviance: 18.154  on 18  degrees of freedom
AIC: 22.154

Number of Fisher Scoring iterations: 5
```

---

# Logistic Regression with R

Below, we see the output of the `report()` function:

```r
library(report)

report(model_analysed)
```

We fitted a logistic model (estimated using ML) to predict department with salary (formula: department ~ salary). The model's explanatory power is substantial (Tjur's R2 = 0.35). The model's intercept, corresponding to salary = 0, is at -70.66 (95% CI [-161.66, -17.59], p = 0.042). Within this model:

- The effect of salary is statistically significant and positive (beta = 2.33e-03, 95% CI [5.67e-04, 5.35e-03], p = 0.043; Std. beta = 1.85, 95% CI [0.45, 4.24])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using a Wald z-distribution approximation.

---

class: title-slide, middle

## Exercise: Logistic Regression with the Titanic Data

In 1912, the ship Titanic sank after colliding with an iceberg. There were 2201 people on board. Some of these were male, others were female. Some were passengers, others were crew; some survived, and some did not. The passengers travelled in three classes: first, second, and third. There were also children on board.

Let's use the titanic dataset. The most interesting outcome variable is the survived variable, which indicates whether the passenger survived (coded 1) or not (coded 0). Because this outcome variable is categorical (even if coded with numbers), its residuals cannot be expected to be normally distributed, so a logistic regression is appropriate.
Test the **main effect hypotheses of Gender and Age** as well as the **interaction effect hypothesis of Gender and Age** on the **probability of Survival**.
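A minimal sketch of how this model could be specified in R, assuming the data are loaded as a data frame named `titanic` with columns `survived` (0/1), `gender`, and `age` (the actual column names in your copy of the data may differ):

```r
# Main effects of gender and age plus their interaction
# (gender * age expands to gender + age + gender:age)
model_titanic <- glm(
  formula = survived ~ gender * age,
  data = titanic,
  family = binomial
)

summary(model_titanic)
```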
---

class: inverse, mline, center, middle

# 2. The Poisson Regression

---

# More Different Distributions

Count data are inherently discrete, and when using linear models on them we often see non-normal distributions of residuals.

Let's count how many of the 20 employees measured have their salaries in intervals of €1,000.

<table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> salary_bracket </th> <th style="text-align:right;"> salary_center </th> <th style="text-align:right;"> Count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (28000,29000] </td> <td style="text-align:right;"> 28500 </td> <td style="text-align:right;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> (29000,30000] </td> <td style="text-align:right;"> 29500 </td> <td style="text-align:right;"> 8 </td> </tr>
<tr> <td style="text-align:left;"> (30000,31000] </td> <td style="text-align:right;"> 30500 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> (31000,32000] </td> <td style="text-align:right;"> 31500 </td> <td style="text-align:right;"> 2 </td> </tr>
</tbody>
</table>

If we want to use the variable `\(Salary\)` to predict/explain how many employees fall within each interval, then again a General Linear Model might not be the right approach. The outcome variable here is `\(Count\)`.

As with logistic regression, perhaps we can find a distribution other than the normal distribution that is more suitable for this kind of outcome variable. For dichotomous data (1/0) we found the Bernoulli distribution very useful. For count data like the variable `\(Count\)`, the traditional distribution is the Poisson distribution.

---

class: title-slide, middle

## Poisson Regression: Theory

---

# Poisson Regression

The normal distribution has two parameters, the mean and the variance. The Bernoulli distribution has only one parameter (the probability), and the Poisson distribution also has only one parameter, lambda or `\(\lambda\)`.

`\(\lambda\)` is a tendency parameter: the values of a Poisson distribution are centred around `\(\lambda\)` (hence the name). We see only discrete values, and no values below 0. The mean of the distribution equals `\(\lambda\)`, and so does its variance!

A Poisson model could be suitable for our data: a linear equation could predict the parameter `\(\lambda\)`, and the actual data would then show a Poisson distribution:

`$$\lambda = b_0 + b_1 X \\ Y \sim Poisson(\lambda)$$`

However, because of the additivity assumption, the equation `\(b_0 + b_1 X\)` can yield negative values. A negative value for `\(\lambda\)` is not logical: it would imply a tendency to observe values like −2 and −4, which is impossible for count data, which consist of non-negative integers. A Poisson distribution always shows integers of at least 0, so one way or another we have to make sure that `\(\lambda\)` is always at least 0.

---

# Poisson Regression

Remember that we saw the reverse problem with logistic regression: there we wanted to allow negative values for our dependent variable, the log-odds, so we used the logarithm. Here we want to ensure positive values for our dependent variable, so we can use the inverse of the logarithm: the exponential.
Then we have the following model:

`$$\lambda = exp(b_0 + b_1 X)= e^{b_0+b_1X} \\ Y \sim Poisson(\lambda)$$`

This is a generalized linear model, now with a Poisson distribution and a log link function, whose inverse (the exponential) makes any value positive.

Let's analyse our count data with this generalized linear model. Our dependent variable is the `\(Count\)` of employees within each salary interval (a number between 0 and 20), and the predictor variable is `\(Salary\)`. We expect the intervals around the mean `\(Salary\)` to contain larger numbers of employees. When we run the analysis, the model is as follows:

`$$\lambda = exp(b_0 + b_1 \times Salary) \\ Count \sim Poisson(\lambda)$$`

---

class: title-slide, middle

## Poisson Regression with JAMOVI

---

# Poisson Regression

In JAMOVI, a Generalized Linear Model with a Poisson distribution can be computed using the **GAMLj** module, as follows:

<img src="img/poisson_regression_jamovi.png" width="100%" style="display: block; margin: auto;" />

---

class: title-slide, middle

## Poisson Regression with R

---

# Poisson Regression with R

The function that we use in R is again the `glm()` function, which stands for Generalised Linear Model. We can use the following code:

```r
model_analysed <- glm(
  formula = Count ~ salary_center, 
  data = organisation_beta_poisson, 
  family = poisson
)
```

`Count` is our outcome variable, `salary_center` is our predictor variable, and these variables are stored in the data frame called `organisation_beta_poisson`.

---

# Poisson Regression with R

Below, we see the parameter estimates from this generalised linear model:

```r
summary(model_analysed)
```

```
Call:
glm(formula = Count ~ salary_center, family = poisson, data = organisation_beta_poisson)

Deviance Residuals: 
      1        2        3        4  
-0.9699   1.0746   0.6532  -1.0455  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    6.4285001  6.0299210   1.066    0.286
salary_center -0.0001612  0.0002022  -0.797    0.425

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4.2576  on 3  degrees of freedom
Residual deviance: 3.6153  on 2  degrees of freedom
AIC: 21.09

Number of Fisher Scoring iterations: 4
```

---

# Poisson Regression with R

Below, we see the output of the `report()` function:

```r
library(report)

report(model_analysed)
```

We fitted a poisson model (estimated using ML) to predict Count with salary_center (formula: Count ~ salary_center). The model's explanatory power is moderate (Nagelkerke's R2 = 0.23). The model's intercept, corresponding to salary_center = 0, is at 6.43 (95% CI [-5.40, 18.49], p = 0.286). Within this model:

- The effect of salary center is statistically non-significant and negative (beta = -1.61e-04, 95% CI [-5.69e-04, 2.32e-04], p = 0.425; Std. beta = -0.21, 95% CI [-0.73, 0.30])

Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using a Wald z-distribution approximation.

---

class: title-slide, middle

## Exercise: Poisson Regression with the Titanic Data

In the previous subsection we looked at a count variable that we wanted to predict from the employee salary. Let's now look at an example where we want to predict a count variable using two categorical predictors from the titanic dataset.

If we focus only on the adults, suppose we want to know whether there is a **relationship between sex and the counts of people that survived the disaster**.
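A sketch of how such counts could be built in base R, assuming an individual-level data frame `titanic` with columns `sex`, `age`, and `survived` (these column names are hypothetical):

```r
# Keep only the adult survivors, then count them by sex
adult_survivors <- subset(titanic, age == "adult" & survived == 1)

titanic_count <- as.data.frame(table(Sex = adult_survivors$sex))
names(titanic_count)[2] <- "Count"
```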
---

# Poisson Regression Example

Let's analyse these data, treating `\(sex\)` as a categorical predictor with female as the reference category. Our dependent variable is the count of adult survivors on the Titanic, and the predictor variable is `\(sex\)`.

<table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 385 </td> </tr>
<tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 109 </td> </tr>
</tbody>
</table>

Let's do a Poisson regression:

```r
model <- glm(Count ~ Sex, family = poisson, data = titanic_count)
```

Here is the Model Coefficient table:

<table>
<thead>
<tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 5.95 </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> 116.81 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
<tr> <td style="text-align:left;"> Sexmale </td> <td style="text-align:right;"> -1.26 </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> -11.63 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
</tbody>
</table>

---

# Poisson Regression Example

From the output we see that the expected count for females is `\(exp(5.95)=383.7\)` and the expected count for males is `\(exp(5.95 - 1.26)=108.8\)`. These expected counts are close to the observed counts of males and females. From the `\(z\)`-statistic, we see that the difference in counts between males and females is significant, `\(z =-11.6, p<0.001\)`.

Note that a hypothesis test is a bit odd here: there is no clear population that we want to generalise the results to, as there was only one Titanic disaster. Also, here we have data on the entire population of people on board the Titanic; there is no random sample.

The difference in these counts is very large. But does this tell us that women were more likely than men to survive? Note that we have only looked at those who survived. How about the people that perished: were there more men that died than women?

---

# Poisson Regression Example

Then we see a different story: on the whole there were many more men than women, and only a relatively small proportion of the men survived. In our sample of the data, most of the men perished: 734 perished and only 109 survived. Of the women, most survived: 81 perished and 385 survived, yielding a survival rate of 83%.

Does this tell us that women are much more likely than men to survive collisions with icebergs?
<table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Sex </th> <th style="text-align:right;"> Survived </th> <th style="text-align:right;"> Count </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 81 </td> </tr>
<tr> <td style="text-align:left;"> female </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 385 </td> </tr>
<tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 734 </td> </tr>
<tr> <td style="text-align:left;"> male </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 109 </td> </tr>
</tbody>
</table>

Let's first run a multiple Poisson regression analysis including the effects of both `\(sex\)` and `\(survived\)`. We first need to restructure the data a bit, so that we have both variables `\(sex\)` and `\(survived\)`:

```r
model <- glm(Count ~ Sex + Survived, family = poisson, data = titanic_count)
```

---

# Poisson Regression Example

Here is the Model Coefficient table:

<table>
<thead>
<tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 5.67 </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> 111.01 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
<tr> <td style="text-align:left;"> Sexmale </td> <td style="text-align:right;"> 0.59 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> 10.27 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
<tr> <td style="text-align:left;"> Survived </td> <td style="text-align:right;"> -0.50 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> -8.78 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
</tbody>
</table>

`\(sex\)` is treated as a factor, since it is a string variable, and `\(survived\)` as a numeric variable (since it is coded as a dummy). From the parameter values in the output, we can calculate the predicted numbers of males `\((sex = Male)\)` and females `\((sex = Female)\)` that survived and perished. The reference group consists of individuals that perished `\((survived = 0)\)` and are female:

- for perished females we have `\(exp(5.67)=290.03\)`,
- for female survivors we have `\(exp(5.67-0.5)=175.91\)`,
- for male survivors we have `\(exp(5.67-0.5+0.592)=317.98\)`, and
- for male non-survivors we have `\(exp(5.67+0.592)=524.26\)`.

---

# Poisson Regression Example

The pattern that is observed is clearly different from the one predicted by this generalized linear model. The model predicts that there are fewer survivors than non-survivors, irrespective of sex, but we observe that among females there are more survivors than non-survivors. It seems that sex is a moderator of the effect of survival on counts.

<img src="lecture_2_files/figure-html/unnamed-chunk-33-1.png" width="864" style="display: block; margin: auto;" />

---

# Poisson Regression Example

In order to test this moderation effect, we run a new generalised linear model for counts including an interaction effect of sex by survived.
This is done in R by adding `Sex:Survived` to the formula:

```r
model <- glm(
  Count ~ Sex + Survived + Sex:Survived, 
  family = poisson, 
  data = titanic_count
)
```

Here is the Model Coefficient table:

<table>
<thead>
<tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:left;"> p.value </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 4.39 </td> <td style="text-align:right;"> 0.11 </td> <td style="text-align:right;"> 39.55 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
<tr> <td style="text-align:left;"> Sexmale </td> <td style="text-align:right;"> 2.20 </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 18.83 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
<tr> <td style="text-align:left;"> Survived </td> <td style="text-align:right;"> 1.56 </td> <td style="text-align:right;"> 0.12 </td> <td style="text-align:right;"> 12.75 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
<tr> <td style="text-align:left;"> Sexmale:Survived </td> <td style="text-align:right;"> -3.47 </td> <td style="text-align:right;"> 0.16 </td> <td style="text-align:right;"> -21.71 </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
</tbody>
</table>

---

# Poisson Regression Example

When we plot the predicted counts from this new model with an interaction effect, we see that they are exactly equal to the counts actually observed in the data.

<img src="lecture_2_files/figure-html/unnamed-chunk-36-1.png" width="864" style="display: block; margin: auto;" />

From the output we see that the interaction effect is significant, `\(z=−21.7, p<.001\)`. If we regard this data set as a random sample of all ships that sink after a collision with an iceberg, we may conclude that in such situations, sex is a significant moderator of the difference in the numbers of survivors and non-survivors. One could also say: the proportion of people that survive a disaster like this is different for females than for males.

---

class: title-slide, middle

## Homework Exercise

In the random data that you received, **use the categorical variable with two categories as outcome** and **select 2 continuous variables as predictors**.

**Test their main effects and their interaction effect** on the categorical variable with a logistic regression.

---

# Generalized linear models

<img src="img/links.png" width="80%">

---

class: inverse, mline, left, middle

<img class="circle" src="https://github.com/damien-dupre.png" width="250px"/>

# Thanks for your attention and don't hesitate if you have any questions!
[@damien_dupre](http://twitter.com/damien_dupre)

[@damien-dupre](http://github.com/damien-dupre)

[damien-datasci-blog.netlify.app](https://damien-datasci-blog.netlify.app)

[damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)