class: center, middle, inverse, title-slide

.title[
# BAA1030 - Data Analytics and Story Telling
]
.subtitle[
## Lecture 9: Visualisation in Python with plotnine
]
.author[
### Damien Dupré - Dublin City University
]

---

# Python so far

We have already seen how to:

- Subset rows with `.filter()`
- Subset columns with `.select()`
- Compute new columns with `.with_columns()`
- Aggregate per group using `.group_by()` and `.agg()`

More importantly, we have seen how to chain multiple transformations.

Let's start making some fancy visualisations!

---

# Google Colab

In your web browser (Chrome, Firefox, ...):

1. Open these same slides in a tab to copy-paste the examples
  - From Loop: Lectures > Lecture 9
  - Or from the URL: https://damien-dupre.github.io/BAA1030/lectures/lecture_9
2. In another tab, go to: https://colab.research.google.com
  - Sign in or Sign up (if not already done)
  - In your workspace, click "New Notebook" (if not already done)

---

# The gapminder Dataset

The dataset used today is called `gapminder`.

Each row in this table corresponds to a country at a specific year. For each row, we have 6 columns:

- **country**: Name of country.
- **year**: Year of the observation (between 1952 and 2007).
- **pop**: Number of people living in the country.
- **continent**: Which of the five continents the country is part of.
- **lifeExp**: Life expectancy in years.
- **gdpPercap**: Gross domestic product per capita (in US dollars).

To use the `gapminder` dataset, apply the function `pl.read_csv()` to the URL:

> https://raw.githubusercontent.com/damien-dupre/damien-dupre.github.io/refs/heads/main/gapminder.csv

``` python
import polars as pl

gapminder = pl.read_csv("https://raw.githubusercontent.com/damien-dupre/damien-dupre.github.io/refs/heads/main/gapminder.csv")
```

---

class: inverse, mline, center, middle

# 1. Basic Visualisations with plotnine

---

# Using plotnine

The plotnine package provides an **easy way to create and to customise your plots**.
The most widely used plotting packages include Matplotlib and Seaborn. However, plotnine is based on the Grammar of Graphics, which makes it much easier to use and more powerful.

``` python
from plotnine import *
```

--

**"The grammar of graphics"** defines a set of rules for constructing statistical graphics by combining different types of layers. From the first (top) to the last (bottom), the mandatory layers are:

<img src="https://miro.medium.com/v2/resize:fit:1400/1*hd6-LkI_sy4b4nu720eV_A.png" width="100%" style="display: block; margin: auto;" />

---

# Data

In plotnine, the **data** layer is the name of the object that contains the variables to plot.

The visualisation is initiated with the function `ggplot()`:

``` python
(ggplot(data=my_dataframe) # the argument name can be omitted too
)
```

--

Let's try to use the gapminder data:

``` python
(ggplot(data=gapminder) # or ggplot(gapminder)
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

Only a grey frame is displayed! We need to define the axes (aesthetic mapping) and their corresponding layers.

**Let's add additional layers with the symbol `+`.**

---

# Aesthetic Mapping

The **aesthetic mapping** refers to the frame of the plot: `x` for the variable on the x-axis and `y` for the variable on the y-axis.

There are many more aesthetics, such as `color`, `fill` and `group`, but let's focus on the x and y axes for the moment.

The mapping is performed by a function called `aes()`, for aesthetics. This is an essential concept: the mapping of a plot (frame axes) is built from aesthetics.

``` python
(ggplot(data=my_dataframe)
  + aes(x="my_xaxis_variable", y="my_yaxis_variable")
)
```

--

The argument names of `aes()` can be omitted if the values are given in the right order:

``` python
(ggplot(my_dataframe)
  + aes("my_xaxis_variable", "my_yaxis_variable")
)
```

Not every type of layer is suitable for every aesthetic; it depends on how many variables are included and on their type (categorical or continuous).

---

# Aesthetic Mapping

.pull-left[
**Aesthetic mapping represents** not only the variables defined as **x-axis** and **y-axis** but also **colours** of borders (`color`), colours of shapes (`fill`), **shapes**, **size**, ...
]

.pull-right[
``` python
(ggplot(data=my_dataframe)
  + aes(
    x="my_xaxis_variable",
    y="my_yaxis_variable",
    color="my_color_variable",
    fill="my_fill_variable",
    shape="my_shape_variable",
    size="my_size_variable"
  )
)
```
]

--

The following code adds not only an x- and a y-axis to a plot, but also a colour mapping:

.pull-left[
``` python
(ggplot(data=gapminder)
  + aes(
    x="gdpPercap",
    y="lifeExp",
    color="continent"
  )
)
```
]

.pull-right[
<img src="lecture_9_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />
]

---

# Geometries

**Geometries** are the shapes we use to represent our data.

There is a dedicated function for every type of shape to plot, and they all start with `geom_`:

|function          |shape    |
|------------------|---------|
|`geom_point()`    |point    |
|`geom_line()`     |line     |
|`geom_bar()`      |bar      |
|`geom_histogram()`|histogram|
|`geom_boxplot()`  |boxplot  |
|...               |...      |

See the full list of available geometry functions here: <https://plotnine.org/reference/#geoms>

<i class="fas fa-arrow-circle-right faa-horizontal animated faa-slow " style=" color:blue;"></i>
Note: Not all data is suitable for all types of geometries. You have to find the geometry that corresponds to your data.

---

# Geometries

Example for **scatter plots**:

``` python
(ggplot(data=my_dataframe)
  + aes(x="my_xaxis_variable", y="my_yaxis_variable")
  + geom_point()
)
```

Example for **line graph**:

``` python
(ggplot(data=my_dataframe)
  + aes(x="my_xaxis_variable", y="my_yaxis_variable")
  + geom_line()
)
```

Example for **bar graph**:

``` python
(ggplot(data=my_dataframe)
  + aes(x="my_xaxis_variable", y="my_yaxis_variable")
  + geom_col()
)
```

---

# Geometries Applied to gapminder

Example for **scatter plots**:

``` python
(ggplot(data=gapminder)
  + aes(x="gdpPercap", y="lifeExp", color="continent")
  + geom_point()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-18-3.png" style="display: block; margin: auto;" />

---

# Geometries Applied to gapminder

Example for **line graph**:

``` python
(ggplot(data=gapminder)
  + aes(x="year", y="lifeExp", color="country")
  + geom_line()
  + guides(color="none") # removes legend because too many countries
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-20-5.png" style="display: block; margin: auto;" />

---

# Geometries Applied to gapminder

Example for **bar graph**:

``` python
gapminder_avg_continent_2007 = (gapminder
  .filter(pl.col("year") == 2007)
  .group_by(pl.col("continent"))
  .agg(pl.col("lifeExp").mean().alias("m_lifeExp"))
)

(ggplot(data=gapminder_avg_continent_2007)
  + aes(x="continent", y="m_lifeExp", fill="continent")
  + geom_col()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-22-7.png" style="display: block; margin: auto;" />

---

class: title-slide, middle

## Live Demo

---

class: title-slide, middle

## Exercise

Create:

- A `ggplot()` layer with the `gapminder` data,
- An `aes()` layer containing `continent` as x, `lifeExp` as y, and `continent` as color,
- And a `geom_boxplot()` layer:

``` python
(_ _ _(_ _ _)
  + aes(x=_ _ _, y=_ _ _, color=_ _ _)
  + _ _ _()
)
```

- A `ggplot()` layer with the `gapminder` data,
- An `aes()` layer containing `year` as x, `pop` as y, and `continent` as fill,
- And a `geom_col()` layer:

``` python
(_ _ _(_ _ _)
  + aes(x=_ _ _, y=_ _ _, fill=_ _ _)
  + _ _ _()
)
```
---

class: inverse, mline, center, middle

# 2. Advanced Visualisations with plotnine

---

# Inherited Properties of Geometries

You can add as many geometry layers as you want; however, repeating the mapping for each geometry layer is very redundant.

Thankfully, if all your geometry layers use the same aesthetic mapping, __it is possible to include this mapping inside `ggplot()`__. All the geometry layers will then share the same mapping:

``` python
(ggplot(gapminder, aes(x="year", y="lifeExp", color="country"))
  + geom_point()
  + geom_line()
)
```

--

If the aesthetics differ between geometry layers, it is also __possible to declare the aesthetics in the geometry__:

``` python
(ggplot(gapminder)
  + geom_point(aes(x="year", y="lifeExp", shape="continent"))
  + geom_line(aes(x="year", y="lifeExp", color="country"))
)
```

---

# Themes

Now, to make the plot more professional, let's remove that standard grey background by using a different theme.

Many themes come built into the plotnine package. My preference is `theme_bw()`, but once you start typing `theme_` a list of options will pop up.

``` python
(ggplot(gapminder)
  + aes(x="gdpPercap", y="lifeExp", color="continent")
  + geom_point()
  + theme_bw()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---

# Themes

.pull-left[
Built-in themes include:

- `theme_538`
- `theme_bw`
- `theme_classic`
- `theme_dark`
- `theme_gray`
- `theme_light`
- `theme_linedraw`
- `theme_matplotlib`
- `theme_minimal`
- `theme_seaborn`
- `theme_tufte`
- `theme_void`
- `theme_xkcd`
]

.pull-right[
<img src="https://sakaluk.wordpress.com/wp-content/uploads/2016/02/4282183.jpg" style="display: block; margin: auto;" />
]

---

# Facets

Faceting is used to __split a particular visualisation by the values of another variable__.

This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.
For example, suppose we were interested in looking at the evolution of life expectancy by continent from 1952. We could "split" this figure for each continent. In other words, we would plot a scatter plot for each continent separately.

We do this by adding a `facet_wrap("continent")` layer.

---

# Facets

``` python
(ggplot(gapminder)
  + aes(x="year", y="lifeExp", color="continent")
  + geom_point()
  + facet_wrap("continent")
  + theme_classic()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />

---

# Facets

We can also specify the number of rows and columns in the grid by using the `nrow` and `ncol` arguments inside `facet_wrap()`.

For example, suppose we would like our faceted figure to have 1 row instead of 2. We simply add an `nrow=1` argument to `facet_wrap("continent")`:

``` python
(ggplot(gapminder)
  + aes(x="year", y="lifeExp", color="continent")
  + geom_point()
  + facet_wrap("continent", nrow=1)
  + theme_classic()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-34-3.png" style="display: block; margin: auto;" />

---

# Labels

plotnine has a layer called `labs()` to change the axis labels very quickly.

``` python
(ggplot(gapminder)
  + aes(x="year", y="lifeExp", color="continent")
  + geom_point()
  + facet_wrap("continent", nrow=1)
  + labs(
    x="Year (from 1952 to 2007)",
    y="Life Expectancy",
    title="Evolution of life expectancy from 1952 to 2007 per continent."
  )
  + theme_classic()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-36-5.png" style="display: block; margin: auto;" />

`labs()` can rename x, y and title, as well as other text elements such as the colour legend title, subtitle, or caption.

---

# Statistics and Special Effects

Instead of creating summaries inside the data frame object, plotnine has some functions that calculate and display them automatically.

The first special effect is the `geom_smooth()` layer. `geom_smooth()` is a classic geometry layer that displays linear or non-linear trends.

``` python
(ggplot(gapminder)
  + aes(x="year", y="lifeExp", color="continent")
  + geom_point()
  + geom_smooth()
  + facet_wrap("continent", nrow=1)
  + theme_classic()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-38-7.png" style="display: block; margin: auto;" />

---

# Statistics and Special Effects

`geom_smooth()` has one important extra argument called `method`. If `method` has the value __"lm"__, a linear regression will be shown. If `method` has the value __"loess"__ or __"lowess"__, a non-linear trend will be shown.

``` python
(ggplot(gapminder)
  + aes(x="year", y="lifeExp", color="continent")
  + geom_point()
  + geom_smooth(method="lm")
  + facet_wrap("continent", nrow=1)
  + theme_classic()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-40-9.png" style="display: block; margin: auto;" />

---

# Additional Options

``` python
(ggplot(gapminder)
  + aes(x="year", y="lifeExp", color="continent")
  + geom_point(alpha=0.2, size=1)
  + geom_smooth(method="lm")
  + facet_wrap("continent", nrow=1)
  + scale_x_continuous(breaks=[1960, 1980, 2000])
  + labs(
    x="Year (from 1952 to 2007)",
    y="Life Expectancy",
    title="Evolution of life expectancy from 1952 to 2007 per continent."
  )
  + theme_classic()
  + theme(text=element_text(size=20))
  + guides(color="none")
)
```

---

# Additional Options

<img src="lecture_9_files/figure-html/unnamed-chunk-42-11.png" style="display: block; margin: auto;" />

---

# Map Visualisations

You will need a new package to read the world shape file:

``` python
import geopandas as gp
```

To create a map visualisation with plotnine, we __need the shape__ of each country.

``` python
shape_world = gp.read_file("https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/world-administrative-boundaries/exports/shp")
```

.pull-left[
Then, you can plot this map with a default black filling colour:

``` python
(ggplot()
  + geom_map(shape_world)
)
```
]

.pull-right[
<img src="lecture_9_files/figure-html/unnamed-chunk-46-13.png" style="display: block; margin: auto;" />
]

---

# Map Visualisations

To fill the countries with colours that match the values of a specific variable, this variable needs to be joined to the shape data that we want to display on the map. `geom_map()` will use it to fill the countries with a colour gradient.

For example, let's display the population for each country in 2007:

``` python
# Keep only the 2007 observations (one row per country)
pop_2007 = (gapminder
  .filter(pl.col("year") == 2007)
)

# Join the filtered data to the shapes; geopandas expects a pandas data frame
map_pop_2007 = (shape_world
  .merge(pop_2007.to_pandas(), left_on="name", right_on="country", how="outer")
)

(ggplot(data=map_pop_2007)
  + aes(fill="pop")
  + geom_map()
  + coord_fixed()
)
```
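
A quick way to see which countries fail to match in a merge like the one above is the `indicator=True` option of the pandas `merge()` method. This is only a minimal sketch: `shapes` and `pop_2007_toy` are made-up stand-ins for `shape_world` and the filtered gapminder data, and the population values are rounded for illustration.

``` python
import pandas as pd

# Hypothetical stand-in for the shape file (only the country name column)
shapes = pd.DataFrame({"name": ["Ireland", "France", "United States of America"]})

# Hypothetical stand-in for gapminder filtered on 2007 (rounded values)
pop_2007_toy = pd.DataFrame({
    "country": ["Ireland", "France", "United States"],
    "pop": [4_000_000, 61_000_000, 301_000_000],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = shapes.merge(
    pop_2007_toy, left_on="name", right_on="country", how="outer", indicator=True
)

# Rows that are not "both" are names found in only one of the two tables
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["name", "country", "_merge"]])
```

Countries listed in `unmatched` would appear without a fill colour on the map, which is usually a sign that their names are spelled differently in the two tables.
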
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Do NOT merge `shape_world` with a non-filtered object and do NOT use the resulting object for any other visualisation.

---

# Map Visualisations

<img src="lecture_9_files/figure-html/unnamed-chunk-48-15.png" style="display: block; margin: auto;" />

Use the `coord_fixed()` layer to keep a fixed ratio between the latitude and longitude scales.

---

# Map Visualisations

The `gapminder` and `shape_world` objects spell some country names differently. For example, "United States of America" can be "USA" or just "United States". To solve this, we can join them with a new object that provides the ISO code of each country regardless of its spelling, and use this code to match countries across datasets.

``` python
# 1. Read the csv file containing the iso3 convention for all spellings
country_code = pl.read_csv("https://raw.githubusercontent.com/adrivsh/country_names/refs/heads/master/names_to_iso.csv")

# 2. Join country_code and gapminder to identify the iso3 code for each country
gapminder_code = (gapminder.join(country_code, on="country", how="left"))

# 3. Filter gapminder_code to have only values for 2007
gapminder_code_2007 = (gapminder_code.filter(pl.col("year") == 2007))

# 4. Merge the values with the shape file for each country using the iso3
shape_gapminder_code_2007 = (shape_world
  .merge(gapminder_code_2007.to_pandas(), on="iso3", how="outer")
)

# 5. Plot the map
(ggplot(data=shape_gapminder_code_2007)
  + aes(fill="pop")
  + geom_map()
  + coord_fixed()
)
```

---

# Map Visualisations

<img src="lecture_9_files/figure-html/unnamed-chunk-50-17.png" style="display: block; margin: auto;" />

---

# Map Visualisations

``` python
(ggplot(data=shape_gapminder_code_2007)
  + aes(fill="pop")
  + geom_map()
  + scale_fill_gradient(low="yellow", high="red")
  + labs(
    title="Differences between countries regarding their population in 2007",
    subtitle="Countries in grey have no data due to a mismatch with their names",
    caption="Source: gapminder",
    x="Longitude",
    y="Latitude",
    fill="Country Population"
  )
  + coord_fixed()
  + theme_bw()
)
```

---

# Map Visualisations

<img src="lecture_9_files/figure-html/unnamed-chunk-52-19.png" style="display: block; margin: auto;" />

---

class: title-slide, middle

## Live Demo

---

class: title-slide, middle

## Exercise

Build a ggplot with the `gapminder` dataset and one `aes()` layer which contains the x as `continent`, y as `lifeExp` and colour as `continent`. Use `geom_boxplot()` as geometry and `year` as a facet variable. Use a theme of your choice as well, and change the axis labels:

``` python
(ggplot(...
```

Build a ggplot with the `gapminder` dataset and:

- One `geom_line()` layer which contains the x as `year`, y as `lifeExp` and group as `country` in its `aes()`,
- One `geom_smooth()` layer which contains the x as `year`, y as `lifeExp` and colour as `continent` in its `aes()`,
- One `facet_wrap()` for each continent.

``` python
(ggplot(_ _ _)
  + geom_line(aes(_ _ _))
  + geom_smooth(aes(_ _ _))
  + facet_wrap(_ _ _)
)
```
---

class: inverse, mline, center, middle

# 3. Combine ggplot() and Transformations

---

# Combine ggplot() and Transformations

A very powerful way to create figures is to use a __ggplot at the end of data transformations__.

Indeed, passing a data frame object as the first argument of the `ggplot()` function is equivalent to chaining it to the `ggplot()` function using `>>`:

``` python
# this classic representation:
(ggplot(data=gapminder)
  + aes(x="gdpPercap", y="lifeExp", color="continent")
  + geom_point()
)

# is the same as:
(gapminder
  >> ggplot()
  + aes(x="gdpPercap", y="lifeExp", color="continent")
  + geom_point()
)

# is the same as:
(gapminder
  >> ggplot(aes(x="gdpPercap", y="lifeExp", color="continent"))
  + geom_point()
)
```

The layers are still added with the `+` symbol.

---

# filter() to ggplot()

You can easily display only the data for a specific subset of interest.

For example, let's filter the data only for Ireland:

``` python
(gapminder
  .filter(pl.col("country") == "Ireland")
  >> ggplot(aes("year", "lifeExp", color="country"))
  + geom_line()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-58-1.png" style="display: block; margin: auto;" />

---

# with_columns() to ggplot()

If you need to display a variable that has to be created beforehand, you can always include a `with_columns()` step in the chain.

For example, let's create the variable `gdpPercountry`, which is the product of each country's population and its `gdpPercap`. Then let's display this information for Ireland:

``` python
(gapminder
  .with_columns(gdpPercountry=pl.col("gdpPercap")*pl.col("pop"))
  .filter(pl.col("country") == "Ireland")
  >> ggplot(aes("gdpPercountry", "lifeExp", color="country"))
  + geom_line()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-60-3.png" style="display: block; margin: auto;" />

---

# group_by() and agg() to ggplot()

Finally, one of the most useful possibilities is to aggregate variables per group and to display this information in figures.

For example, let's compute the average population per continent and display how this average evolves with time:

``` python
(gapminder
  .group_by(pl.col("year"), pl.col("continent"))
  .agg(pl.col("pop").mean().alias("m_pop"))
  >> ggplot(aes("year", "m_pop", color="continent"))
  + geom_line()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-62-5.png" style="display: block; margin: auto;" />

---

# group_by() and agg() to ggplot()

We can also compare the sum of the population by continent for the year 2007:

``` python
(gapminder
  .filter(pl.col("year") == 2007)
  .group_by(pl.col("continent"))
  .agg(pl.col("pop").sum().alias("s_pop"))
  >> ggplot(aes("continent", "s_pop", fill="continent"))
  + geom_col()
)
```

<img src="lecture_9_files/figure-html/unnamed-chunk-64-7.png" style="display: block; margin: auto;" />

---

class: inverse, mline, left, middle

<img class="circle" src="https://github.com/damien-dupre.png" width="250px"/>

# Thanks for your attention and don't hesitate to ask if you have any questions!

[
@damien-dupre](http://github.com/damien-dupre) [
https://damien-dupre.github.io](https://damien-dupre.github.io) [
damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)