class: center, middle, inverse, title-slide

# Urban and socio-economic correlates of property prices in Dublin’s area

## 2nd GoutAI Seminar - Rennes School of Business

### Damien Dupré

### Dublin City University - November 19th, 2021

---

# My Journey into Data Science

#### Development of the DynEmo Facial Expression Database (Master)

* Dynamic and spontaneous emotions
* Assessed with self-reports and by observers

#### Analysis of Emotional User Experience of Innovative Tech. (Industrial PhD)

* Understand users' acceptance of technologies from their emotional response
* Based on multivariate self-reports

#### Evaluation of Emotions from Facial and Physiological Measures (Industrial PostDoc)

* Applications to marketing, sports and automotive industries
* Dynamic changes with trend extraction techniques (2 patents)

#### Performance Prediction using Machine Learning (Academic PostDoc)

* Application to sport analytics
* Big Data treatment (> 1 million users with activities recorded in the past 5 years)

---
class: inverse, mline, center, middle

# 1. A Closer Look at Property Prices

---

# Market Characteristics

A housing market is specific to a country and its context.

.pull-left[
Countries are very different in terms of:

- Architecture Styles/Composition
- Building Materials
- Legislation
- Urban features
- Population characteristics
]

.pull-right[
<img src="goutai_2021_files/img/slides_insertimage_1.png" width="100%" style="display: block; margin: auto;" />
.center.tiny[Ratio of Housing prices adjusted for inflation (base year 2015)<br />Credit: Jeff Desjardins - Visual Capitalist (2019) [🔗](https://www.visualcapitalist.com/mapped-the-countries-with-the-highest-housing-bubble-risks/)]
]

Ireland can be considered to be in a housing bubble, as it has one of the highest ratios of housing prices adjusted for inflation in the world.
Almost 40% of the Irish population lives in greater Dublin (1.9M of 4.9M inhabitants), spread over a 318 km² urban area, resulting in one of the lowest population densities of any European capital (4,811 inhabitants/km²).

---

# Market History

Housing prices are embedded in an economic context. Despite having one of Europe's highest economic growth rates (5.5% in 2019), Ireland suffered badly from the economic crash between 2008 and 2012.

<img src="goutai_2021_files/slides_files/figure-html/unnamed-chunk-2-1.png" width="864" style="display: block; margin: auto;" />

.center.tiny[Distribution of properties sold in Dublin's area between 2010 and 2018]

But since 2013, prices have risen constantly, by an average of 7%, leading to a social crisis in Dublin where the number of homeless people has gone through the roof.

---

# A Data Science Insight

Every property purchased has to be published online on the [Property Price Register](https://propertypriceregister.ie):

- Since January 2010
- Filed by the owners
- Contains only the address and the price

In its current form, the Property Price Register is only used by buyers to compare the price of their future acquisition with the prices at which houses have been sold in the same street.

However, it could also be used to obtain an overall view of the price distribution by geocoding the addresses using the [Nominatim REST API from Open Street Map](http://nominatim.openstreetmap.org):

```r
address <- "10 Downing St, Westminster, London SW1A 2AA, United Kingdom"
api_url <- "http://nominatim.openstreetmap.org/search/@addr@?format=json&addressdetails=0&limit=1"

# Insert the address into the URL template, encoding every run of whitespace as %20
api_url |>
  stringr::str_replace("\\@addr\\@", stringr::str_replace_all(address, "\\s+", "%20")) |>
  jsonlite::fromJSON() |>
  dplyr::select(lat, lon)
#           lat                  lon
# 1 51.50344025 -0.12770820958562096
```

---

# Spatial Density with GAM

Once the lat/lon has been obtained, it is possible to estimate the typical property price in a 2D space:

- Use of a Generalized Additive Model (GAM, see Wood, 2017, *Generalized additive models: An introduction with R*)
- Application of a soap film smoother to solve the "finite area smoothing" problem due to the coastal shape of Dublin's area

<img src="goutai_2021_files/img/slides_insertimage_2.png" width="60%" style="display: block; margin: auto;" />

.center.tiny[Illustration of the "finite area smoothing" problem.<br />Credit: Gavin Simpson (2016) [🔗](https://fromthebottomoftheheap.net/2016/03/27/soap-film-smoothers/)]

---

# Spatial Density with GAM

.pull-left[
**Step 1.** A matrix of key knots has to be drawn within the shape to predict
]

.pull-right[
<img src="goutai_2021_files/slides_files/figure-html/unnamed-chunk-6-1.png" width="144" style="display: block; margin: auto;" />
.center.tiny[Key knots within Dublin's coastal area.]
]

**Step 2.** A Bayesian spline smoothing using restricted maximum likelihood is applied to estimate the smoothness of the property price variation ([Wood, 2011](https://doi.org/10.1111/j.1467-9868.2010.00749.x); [Lin & Zhang, 1999](https://doi.org/10.1111/1467-9868.00183)):

```r
# Soap film smoother ("so") over longitude/latitude, bounded by the polygon `bound`.
# gam_density() is not part of {mgcv}; it is assumed to be a helper defined elsewhere.
gam_pred <- mgcv::gam(
  price ~ s(lng, lat, bs = "so", xt = list(bnd = bound)),
  data = data_dublin_2018_features,
  method = "REML",
  knots = knots,
  family = gaussian(link = "identity")
) |>
  gam_density(too.far = 0.05, n.grid = 100)
```

---

# Spatial Density with GAM

<img src="goutai_2021_files/slides_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

The price differences between areas suggest that underlying factors are driving these variations.

.center[**Can we predict the property prices with Urban and Socio-Economic features? And can we identify the most relevant features?**]

---
class: inverse, mline, center, middle

# 2. Urban and Socio-Economic Correlates

---

# Method

In order to predict the property prices with Urban and Socio-Economic features, a machine learning algorithm is used: XGBoost (https://xgboost.readthedocs.io).

> XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In order to make the final prediction, the training continues iteratively, inserting new trees that predict the residuals or errors of previous trees that are then combined with previous trees. It's called gradient boosting since, when introducing new models, it uses a gradient descent algorithm to minimise the loss.

> XGBoost is a great choice for a wide variety of real-world machine learning problems and is the most used for Kaggle Data Science Challenges (see [🔗](https://towardsdatascience.com/xgboost-lightgbm-and-other-kaggle-competition-favorites-6212e8b0e835))

The price of each property is related to:

- Its closest distance to each of 161 urban elements (e.g., bar, university, park, stadium, ...)
- Each of the 48 socio-economic characteristics of the geographic small area where the property is situated

---
class: title-slide, middle

## Analysis 1: Urban Correlates

---

# Urban Correlates

Do property prices correlate with their distance to urban landmarks?

The shortest distance between each property's GPS location and each of 160 urban features is calculated using the [Overpass API of Open Street Map](https://wiki.openstreetmap.org/wiki/Overpass_API).

For example, here is the query for all the pubs in Dublin:

.pull-left[
```r
library(osmdata)
library(magrittr) # provides use_series()

# Query every OSM object tagged amenity = pub within Dublin and plot the points
opq("Dublin") |>
  add_osm_feature(
    key = "amenity",
    value = "pub"
  ) |>
  osmdata_sp() |>
  use_series("osm_points") |>
  plot()
```
]

.pull-right[
<img src="goutai_2021_files/slides_files/figure-html/unnamed-chunk-9-1.png" width="288" style="display: block; margin: auto;" />
.center.tiny[Location of results obtained from the query `amenity = pub` in Dublin using the R package {osmdata}.]
]

---

# Urban Correlates

These distances are used as predictors of the property price in an XGBoost model (tree-based boosted regression, see Chen & Guestrin, 2016).

The dataset was randomly split into a training and a test dataset (80% vs. 20%).

The results reveal that the **model explains 44% of the variance of the property prices** of the test dataset ...

<img src="goutai_2021_files/slides_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />

.center.tiny[Model's prediction accuracy using urban features *vs.* test dataset.]

... but some features are more important than others!

---

# Urban Correlates

The contribution of each feature to the property price estimate is then identified. This contribution (also called "importance") is a measure of the improvement in accuracy brought by a feature.

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> Feature Category </th> <th style="text-align:left;"> Feature Type </th> <th style="text-align:left;"> Importance </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> amenity </td> <td style="text-align:left;width: 10em; "> embassy </td> <td style="text-align:left;"> 14.8% </td> </tr>
  <tr> <td style="text-align:left;"> natural </td> <td style="text-align:left;width: 10em; "> grassland </td> <td style="text-align:left;"> 5.3% </td> </tr>
  <tr> <td style="text-align:left;"> amenity </td> <td style="text-align:left;width: 10em; "> bar </td> <td style="text-align:left;"> 2.3% </td> </tr>
  <tr> <td style="text-align:left;"> place </td> <td style="text-align:left;width: 10em; "> island </td> <td style="text-align:left;"> 2.2% </td> </tr>
  <tr> <td style="text-align:left;"> route </td> <td style="text-align:left;width: 10em; "> bus </td> <td style="text-align:left;"> 2.0% </td> </tr>
  <tr> <td style="text-align:left;"> boundary </td> <td style="text-align:left;width: 10em; "> administrative </td> <td style="text-align:left;"> 1.9% </td> </tr>
  <tr> <td style="text-align:left;"> power </td> <td style="text-align:left;width: 10em; "> line </td> <td style="text-align:left;"> 1.7% </td> </tr>
  <tr> <td style="text-align:left;"> boundary </td> <td style="text-align:left;width: 10em; "> political </td> <td style="text-align:left;"> 1.6% </td> </tr>
  <tr> <td style="text-align:left;"> barrier </td> <td style="text-align:left;width: 10em; "> wall </td> <td style="text-align:left;"> 1.5% </td> </tr>
  <tr> <td style="text-align:left;"> power </td> <td style="text-align:left;width: 10em; "> portal </td> <td style="text-align:left;"> 1.3% </td> </tr>
 </tbody>
</table>

.center.tiny[Top 10 most important features according to their contribution to the model.]

The distances to **embassies** and **grasslands** are the most important urban predictors of property prices.

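As a rough illustration of how such an importance table could be produced, the sketch below fits a boosted regression on simulated data and reads the gain-based importance with `xgb.importance()`. The data, the feature names (`dist_embassy`, `dist_grassland`, `dist_bar`) and the hyperparameters are assumptions made up for this example, not the model behind the table above.

```r
library(xgboost)

set.seed(42)
n <- 500

# Simulated distances (km) to three hypothetical urban features
x <- cbind(
  dist_embassy = runif(n, 0, 10),
  dist_grassland = runif(n, 0, 10),
  dist_bar = runif(n, 0, 10)
)

# Simulated price: by construction it depends mostly on the embassy distance
price <- 500000 - 30000 * x[, "dist_embassy"] -
  5000 * x[, "dist_grassland"] + rnorm(n, sd = 10000)

# Fit a small boosted regression (hyperparameters chosen arbitrarily)
dtrain <- xgb.DMatrix(data = x, label = price)
bst <- xgb.train(
  params = list(objective = "reg:squarederror", max_depth = 3, eta = 0.3),
  data = dtrain,
  nrounds = 50
)

# Gain: share of the total loss reduction attributed to each feature
importance <- xgb.importance(model = bst)
as.data.frame(importance)[, c("Feature", "Gain")]
```

The `Gain` column sums to 1 across features, so it can be read as a percentage in the same way as the "Importance" column.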
---
class: title-slide, middle

## Analysis 2: Socio-economic Correlates

---

# Socio-economic Correlates

Does the distribution of population characteristics correlate with property prices?

Each property is located within its "small area" (i.e. the highest-resolution administrative map) used for the [Irish Census 2016](https://www.cso.ie/en/).

Each small area is summarised by the percentages of 48 socio-economic features (e.g. age, employment, civil status, religion, ...).

<img src="https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2017/10/electoral-divisions-dublin-city-sapmap.png" width="60%" style="display: block; margin: auto;" />

.center.tiny[Divisions of Dublin small areas.<br />Credit: Shane Lynn (2018) [🔗](https://www.shanelynn.ie/the-irish-property-price-register-geocoded-to-small-areas/)]

---

# Socio-economic Correlates

The census characteristics of the small area containing each property are used as predictors of the property price in an XGBoost model.

Again, the dataset was randomly split into a training and a test dataset (80% vs. 20%).

The results reveal that **the 48 socio-economic features explain 42.8% of the property price variance** of the test dataset.

<img src="goutai_2021_files/slides_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" />

.center.tiny[Model's prediction accuracy using socio-economic features *vs.* test dataset.]

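The 80%/20% split and the share of test-set variance explained can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, `pct_no_religion` is a hypothetical predictor, a linear model stands in for XGBoost, and computing the explained variance as 1 - SSE/SST is one common convention that may differ from the one used for the figure above.

```r
set.seed(2021)
n <- 1000

# Simulated data: one hypothetical socio-economic percentage and a noisy price
area <- data.frame(pct_no_religion = runif(n, 0, 40))
area$price <- 200000 + 5000 * area$pct_no_religion + rnorm(n, sd = 60000)

# Random 80% / 20% split into training and test datasets
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- area[train_idx, ]
test <- area[-train_idx, ]

# Stand-in model (linear regression, not XGBoost) fitted on the training data only
fit <- lm(price ~ pct_no_religion, data = train)
pred <- predict(fit, newdata = test)

# Share of test-set variance explained: 1 - SSE / SST
sse <- sum((test$price - pred)^2)
sst <- sum((test$price - mean(test$price))^2)
r_squared <- 1 - sse / sst
r_squared
```

Evaluating on the held-out 20% rather than on the training data keeps the reported share of variance from being inflated by overfitting.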
---

# Socio-economic Correlates

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> Feature Category </th> <th style="text-align:left;"> Feature Type </th> <th style="text-align:left;"> Importance </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> carers </td> <td style="text-align:left;width: 20em; "> Provides No Care </td> <td style="text-align:left;"> 28.8% </td> </tr>
  <tr> <td style="text-align:left;"> housing rooms </td> <td style="text-align:left;width: 20em; "> 8 or more Rooms </td> <td style="text-align:left;"> 17.6% </td> </tr>
  <tr> <td style="text-align:left;"> carers </td> <td style="text-align:left;width: 20em; "> Total Care Providers </td> <td style="text-align:left;"> 10.2% </td> </tr>
  <tr> <td style="text-align:left;"> religion </td> <td style="text-align:left;width: 20em; "> No Religion </td> <td style="text-align:left;"> 3.8% </td> </tr>
  <tr> <td style="text-align:left;"> disability age group </td> <td style="text-align:left;width: 20em; "> Persons with a disability aged 0 - 14 </td> <td style="text-align:left;"> 3.2% </td> </tr>
  <tr> <td style="text-align:left;"> population </td> <td style="text-align:left;width: 20em; "> Age 0 - 14 </td> <td style="text-align:left;"> 2.7% </td> </tr>
  <tr> <td style="text-align:left;"> religion </td> <td style="text-align:left;width: 20em; "> Other Catholic </td> <td style="text-align:left;"> 1.7% </td> </tr>
  <tr> <td style="text-align:left;"> housing tenure </td> <td style="text-align:left;width: 20em; "> Owner Occupier with Mortgage </td> <td style="text-align:left;"> 1.5% </td> </tr>
  <tr> <td style="text-align:left;"> housing rooms </td> <td style="text-align:left;width: 20em; "> 7 Rooms </td> <td style="text-align:left;"> 1.4% </td> </tr>
  <tr> <td style="text-align:left;"> housing tenure </td> <td style="text-align:left;width: 20em; "> Social Rented </td> <td style="text-align:left;"> 1.4% </td> </tr>
 </tbody>
</table>

.center.tiny[Top 10 most important features according to their contribution to the model.]

The **density of individuals not providing regular unpaid personal help** in the area is the most important socio-economic feature for predicting property prices. **Large properties** in the area and the proportion of **individuals reporting no religion** are also relatively important.

---
class: inverse, mline, center, middle

# 3. Conclusion

---

# Conclusion

By performing a feature analysis with urban and socio-economic features, it is possible to evaluate and predict the potential price of a property.

--

- The presence of embassies or parks is a criterion that significantly influences the price of properties.

--

- Similarly, the characteristics of inhabitants in the area, such as religion, health and age, are correlated with housing prices.

--

These results help explain why some areas have higher prices than others, which is relevant information not only for real estate agents in charge of property valuations but also for buyers seeking to estimate the real value of a property.

---

# Limitations and Perspectives

This is still ongoing work, more to follow in the coming weeks...

<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Essential features haven't been taken into account:

- House characteristics (e.g., square meter size, number of bedrooms, ...)
- Macroeconomic trends (e.g., economic growth, unemployment, ...)

<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Essential features are difficult to assess:

- Market size
- Market demand

A more global view is required to confirm the results:

- Full country perspective
- Comparison between cities

A temporal perspective would also be interesting to study.

---

# Acknowledgment

Many thanks to the contributors and developers of the OpenStreetMap project.

.pull-left[
Many thanks to the developers and contributors of R and RStudio as well as to those of the packages used:

- {tidyverse}
- {magrittr}
- {mgcv}
- {xgboost}
- {OpenStreetMap}
- {osmdata}
- {rgdal}
- {sf}
- {sp}

... to name the main ones.
]

.pull-right[
More in "Geocomputation with R" (Lovelace, Nowosad, & Muenchow, 2021), a book on geographic data analysis, visualization and modelling.

<img src="https://geocompr.robinlovelace.net/images/cover.png" width="50%" style="display: block; margin: auto;" />

.center.tiny[Geocomputation with R.<br />Credit: Lovelace, Nowosad, & Muenchow (2021) [🔗](https://geocompr.robinlovelace.net)]
]

---
class: inverse, mline, left, middle

<img class="circle" src="https://github.com/damien-dupre.png" width="250px"/>

# Thanks for your attention, find me at...

[@damien_dupre](http://twitter.com/damien_dupre)

[@damien-dupre](http://github.com/damien-dupre)

[damien-datasci-blog.netlify.app](https://damien-datasci-blog.netlify.app)

[damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)