class: center, middle, inverse, title-slide .title[ # MT611 - Quantitative Research Methods ] .subtitle[ ## Lecture 3: Collect, Clean, and Transform Data ] .author[ ### Damien Dupré ] .date[ ### Dublin City University ] --- # Always Prepare the Data .center[**To efficiently perform any kind of analysis on data, it is essential to format the data file in the most suitable way.**] To do so, 3 steps are required: 1. Understanding how data have been collected 2. Cleaning the data from any noise 3. Transforming the data in order to obtain the relevant variables --- class: inverse, mline, center, middle # 1. Collecting Data --- # Data Everywhere <img src="https://www.memesmonkey.com/images/memesmonkey/f1/f1a0708d5337e2afa78b515520c6c5fd.jpeg" width="35%" style="display: block; margin: auto;" /> What are Big Data? - In general: Spreadsheet too large or too complex to be handled by conventional tools - For me: Spreadsheet larger then Microsoft Excel’s Limits (2016) - Total number of rows: 1,048,576 rows - Total number columns: 16,384 columns --- # Collecting Data A tool is required to collect data, but there are tools to measure anything: - Naturalistic Objects - Economic Status - Social Events - Psychological States - ... <img src="https://i.imgflip.com/3jtxnx.jpg" width="70%" style="display: block; margin: auto;" /> --- # Focus on Psychometric Scales .center[**A psychometric scale is a series of questions (items) used to measure a latent variable**] -- **A score has to be calculated** from all the items to produce the variable to analyse: - Average of all items (equal contribution to the latent variable score) - Factor analysis (unequal contribution to the latent variable score) -- **A scale should have been validated** by its authors: - Don't change the items except when possible - Don't change items' range of possibilities -- To be analysed, **a scale has some requirements**: - All the items must have the same range of possibilities/modalities - All the items must be analysed in the same direction - All the items must correlate together (scale reliability) - A unique theoretical construct score has to be calculated from all the items --- # Focus on Psychometric Scales Online Survey Software are replacing the paper-pencil method. Here is a list of some: - Google Form (from your google drive, DCU hosted) - Qualtrics (https://dcubusinessschool.eu.qualtrics.com, DCU Business School account) - Survey Monkey (https://www.surveymonkey.com/, 40 participants free) - LimeSurvey (https://www.limesurvey.org/, 25 participants free/month ) - Zoho (https://www.zoho.com, 100 participants free) - Gorilla Experiment (https://gorilla.sc, €1 per participants) --
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Remember: - Always ask for age and gender in a survey - Participant recruitment by list of contact/email has between 15% and 10% of answer rate -- Platforms to find and reward participants: - Amazon MTurk (https://www.mturk.com/ mainly US and India) - Prolific Academic (https://www.prolific.co/ world wide) - Qualtrics (https://www.qualtrics.com/uk/research-services/online-sample/) - Survey Monkey (https://www.surveymonkey.com/mp/find-survey-participants/) --- # Latent Variables Scale to measure the **"Perceived Ease-of-use" of MS EXCEL** in 4 items were measured from 1 "totally disagree" to 7 "totally agree": .pull-left[ - **q1.** I think learning MS EXCEL is easy - **q2.** Understanding MS EXCEL is easy - **q3.** I am good at using MS EXCEL - **q4.** I think using MS EXCEL is easy ] .pull-right[
] -- Here are the results with 3 students. The score of the "Perceived Ease-of-use of MS EXCEL" latent variable is calculated using the average of all items. |employee | q1| q2| q3| q4| peo_score| |:--------|--:|--:|--:|--:|---------:| |Sinead | 7| 5| 7| 7| 6.50| |Patrick | 5| 4| 6| 6| 5.25| |Damien | 3| 1| 2| 3| 2.25| --- # Validity and Reliability **Validity = is my variable measuring the construct that I think I am measuring?** - Does the measurement make sense? - Would the results be reproduced with another scale measuring the same latent variable? - Are the results correlated to latent variables that are related? Validity test is only performed when a scale is created (no need for existing scales) **Reliability = consistency of items inside a measurement** - Test-retest reliability - Inter-rater reliability - Correlation inter-item (Cronbach's alpha) Reliability test is performed every time a scale is used but only using Cronbach's alpha --- # Validity and Reliability <img src="https://www.publichealthnotes.com/wp-content/uploads/2018/08/560px-Reliability_and_validity.svg_.png" width="50%" style="display: block; margin: auto;" /> --- # Survey Example **Perceived Usefulness:** - *(PU1)* Using MS EXCEL would improve my performance in statistical analysis - *(PU2)* Using MS EXCEL would increase my productivity in statistical analysis - *(PU3)* Using MS EXCEL would enhance my effectiveness in statistical analysis - *(PU4)* Using MS EXCELR would make it easier for me to engage in statistical analysis - *(PU5)* I think using MS EXCEL is very useful for me to engage in statistical analysis **Perceived Ease-Of-Use:** - *(PEOU1)* I think learning to use MS EXCEL is easy - *(PEOU2)* I think doing what I want via MS EXCEL is easy - *(PEOU3)* I think becoming skilful at using MS EXCEL is easy - *(PEOU4)* I think using MS EXCEL is easy **Behavioural Intention:** - *(BI1)* Assuming I had access to MS EXCEL, I intend to use it - *(BI2)* Given that I had access to MS EXCEL, I predict that I would use it --- # Survey Example
Google form: https://forms.gle/epRj45iR1jL7r3wV7 Qualtrics: https://dcubusinessschool.eu.qualtrics.com/jfe/form/SV_0JSAjOvXVxzJvgh --- # Accessing your Data From a Google Form 1. Login Google Account (be sure to use the DCU account) 2. Go to Drive 3. Click on your Survey form 4. Click on Response 5. Download Response Spreadsheet as .csv From a Qualtrics survey 1. Login Qualtrics (DCU login/pwd) https://dcubusinessschool.eu.qualtrics.com 2. Click on Data & Analysis 3. Click on Export & Import > Export Data...> CSV (Use numeric values)
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Remember: - Always save the unmodified raw version of your data on the cloud (e.g., google drive, dropbox, one drive, ...) - Make sure you can download them as a .csv file. --- class: inverse, mline, center, middle # 2. Cleaning Data --- # An Essential Step .center[**Cleaning Data means removing any non-essential feature included in the spreadsheet**] - If you don't have big data... then use Microsoft Excel! <img src="https://miro.medium.com/max/624/1*FAzumPnvzKUDolMG7SNcHw.png" width="50%" style="display: block; margin: auto;" /> - If you do have big data, then use R or Python. --- # Question from Kareem <img src="https://raw.githubusercontent.com/damien-dupre/img/main/clean_tweet.png" width="100%" style="display: block; margin: auto;" /> --- # Exercise ## Find what is going wrong with this table <img src="https://raw.githubusercontent.com/damien-dupre/img/main/clean_zoom.jpg" width="100%" style="display: block; margin: auto;" /> --- # Answer from Michael <img src="https://raw.githubusercontent.com/damien-dupre/img/main/clean_answer.png" width="100%" style="display: block; margin: auto;" /> --- # Use a Name Convention An addition to Michael's list would be to transform headers with a proper naming convention My suggestion is **snake_case**: all small letter and words separated by "_" <img src="https://raw.githubusercontent.com/damien-dupre/img/main/name_convention.jpg" width="70%" style="display: block; margin: auto;" /> --- # To Clean Data ... ### 1. Ditch the chart and all non values Charts can mess up with other software ### 2. Column headings in row 1 No more than 1 heading row and remove blanks ### 3. Columns start at column A Remove blanks before data ### 4. Use a naming convention snake_case is preferable but any would do ### 5. Save as .csv file Better format and keeps only the current sheet --- class: title-slide, middle ## Exercise: Clean unicef.xlsx On the module Loop page, got to "Lecture Data" and download the document called "unicef.xlsx" **Open and clean the 1st sheet of this file but only until column P, remove all other columns**
−
+
15
:
00
--- class: inverse, mline, center, middle # 3. Transform Data --- # Master the Key Transformations Most important work is to tidy your data: - Takes time to saves time and solves problems - Only 5 key transformations need to be mastered .pull-left[ ## 1. Extension ## 2. Reduction ## 3. Direction ## 4. Aggregation ## 5. Combination ] .pull-right[ <img src="https://raw.githubusercontent.com/damien-dupre/img/main/key_movements.png" width="70%" style="display: block; margin: auto;" /> ] --- # Extension .center[**Extension = Create a new column**] <img src="https://raw.githubusercontent.com/damien-dupre/img/main/extension_img.png" width="50%" style="display: block; margin: auto;" /> In MS Excel: - First row is row name (name convention) - Second row is the function (starts with = sign) - Following rows are applying the function (double click bottom right corner of the cell) --- # Excel Functions ### For numeric values - Numeric operator ( + - / *) - $ (freeze row and/or column) - COUNT(), MIN(), MAX(), SUM(), AVERAGE (), STDEV() ### For character strings - LEFT() - CONCATENATE() ### Extra function - IF(condition, value if true, value if false) --- # Reduction .center[**Reduction = Keep only certain values**] <img src="https://raw.githubusercontent.com/damien-dupre/img/main/reduction_img.png" width="50%" style="display: block; margin: auto;" /> In MS Excel: - Select header row - In Data tab, use Filter - Click the drop-down arrow for the column you want to filter - Choose values to filter --- # Excel Filters
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Remember: - Rows already filtered have a row index are coloured in blue - Copy-Paste filtered table in a new document if you want to work only on these values <div class="figure" style="text-align: center"> <img src="https://www.excel-easy.com/data-analysis/images/filter/filter-result.png" alt="Example of data filtered buy the column Country to keep only values corresponding to USA" width="50%" /> <p class="caption">Example of data filtered buy the column Country to keep only values corresponding to USA</p> </div> --- # Direction - Direction = Arrange Row Order - In MS Excel: - Select table - In Data tab, use Sort - Choose column to sort and how to sort <img src="https://raw.githubusercontent.com/damien-dupre/img/main/direction_img.png" width="40%" style="display: block; margin: auto;" /> Be careful of taking into account all the table (all rows & all columns) Double check if all columns changed! --- # Aggregation - Aggregation = Summary of Column - In MS Excel: - Simple = use function at the end of a table - Complex = use pivot table <img src="https://raw.githubusercontent.com/damien-dupre/img/main/aggregation_img.png" width="50%" style="display: block; margin: auto;" /> --- # Pivot Table in Excel 1. Select data 2. In Insert, use Pivot Table 3. Drag columns to sort by row/column 4. Choose value column to be aggregated 5. Choose type of aggregation If you want to use the Pivot Table for further analysis: - Copy-Paste it in another document - Paste as value (removes dynamic link) --- # Combination - Combination = Join two tables - In MS Excel: - One Column = vlookup function - Multiple Columns = Power Query (Windows only) <img src="https://raw.githubusercontent.com/damien-dupre/img/main/combination_img.png" width="50%" style="display: block; margin: auto;" /> -- ### =VLOOKUP(value, table, col_index, [range_lookup]) - value: The value to look for in the first column of a table - table: The table from which to retrieve a value - col_index: The column in the table from which to retrieve a value - range_lookup: TRUE = approximate match / FALSE = exact match --- class: title-slide, middle ## Exercise: Transformations On the module Loop page, got to "Lecture Data" and download the document called "organisation_alpha.xls" 1. **Extension**: Create a new variable/column which is the average response to all the questions from the survey for each employee (q1 to q9) 2. **Reduction**: Filter employee’s 2019 salary to keep only employees with a salary higher than 30k 3. **Aggregation**: Calculate the average salary by gender and by location 4. **Combination**: Using the VLOOKUP function, add to the table a column corresponding to the 2017 salary located in the 2nd sheet
−
+
15
:
00
--- class: inverse, mline, center, middle # Extra Analytic Tips --- # Tidy Data - Each variable has its own column - Each observation is placed in its own row - Each value is placed in its own cell <img src="https://raw.githubusercontent.com/damien-dupre/img/main/tidy_full.png" width="100%" style="display: block; margin: auto;" /> --- # Long or Wide? ## Long Format <img src="https://raw.githubusercontent.com/damien-dupre/img/main/long_table.png" width="80%" style="display: block; margin: auto;" /> ## Wide Format <img src="https://raw.githubusercontent.com/damien-dupre/img/main/wide_table.png" width="80%" style="display: block; margin: auto;" /> --- # Long or Wide? <img src="https://raw.githubusercontent.com/damien-dupre/img/main/long_or_wide.png" width="100%" style="display: block; margin: auto;" /> -- <img src="https://raw.githubusercontent.com/damien-dupre/img/main/long_and_wide.png" width="100%" style="display: block; margin: auto;" /> --- # Reshape Table in Excel In Data tab: 1. Get Data/New Query> From File > From [Workbook/CSV] 2. Select your file > Edit 3. Select columns to be reshaped 4. Transform - Pivot Columns: from long table to wide table - Unpivot Columns: from wide to long table <img src="https://raw.githubusercontent.com/damien-dupre/img/main/reshape_excel.gif" width="50%" style="display: block; margin: auto;" /> --- # Repeat Action Automatically In Excel, the Macro button allows to record a sequence of actions and to reproduce these actions: - VBA Code automatically recorded - Useful to process similar data files Recording a Macro: 1. View Tab 2. Use Macros > Record Macro 3. Do your actions 4. Save the macro with a keyboard shortcut 5. Use the macro again to reproduce your actions --- class: title-slide, middle ## Exercise: Reshape chess_llm.csv Researchers are measuring how good at chess are LLM Algorithms (ChatGPT, Gemini, ...), they want to test the following hypotheses: > H1: The average number of move to win at a chess game is different for at least 1 of the LLM Algorithms **Reshape the datafile to have 1 predictor variable called "llm_algorithm" and 1 outcome variable called n_move_to_win"**
−
+
10
:
00
--- class: inverse, mline, left, middle <img class="circle" src="https://github.com/damien-dupre.png" width="250px"/> # Thanks for your attention and don't hesitate to ask if you have any questions! [
@damien_dupre](http://twitter.com/damien_dupre) [
@damien-dupre](http://github.com/damien-dupre) [
damien-datasci-blog.netlify.app](https://damien-datasci-blog.netlify.app) [
damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)