class: center, middle, inverse, title-slide .title[ # BAA1030 - Data Analytics and Story Telling ] .subtitle[ ## Lecture 8: Visualisation in Python with plotnine ] .author[ ### Damien Dupré - Dublin City University ] --- # Python so far We have already seen how to: - Subset values with `pl.filter()` - Subset columns with `pl.select()` - Compute new columns with `pl.with_columns()` - Aggregate data per group using `pl.group_by()` and `agg()` More importantly, we have seen how to chain multiple transformations. Let's start to make some fancy visualisations! --- # Google Colab In your web browser (Chrome, Firefox, ...): 1. Open these same slides in a tab to copy-paste the examples - From Loop: Lectures > Lecture 8 - Or from the URL: https://damien-dupre.github.io/BAA1030/lectures/lecture_8 2. In another tab, go to: https://colab.research.google.com - Sign in or Sign up (if not already done) - In your workspace, click "New Notebook" (if not already done) --- # The gapminder Dataset The dataset used today is called `gapminder`. Each row in this table corresponds to a country at a specific year. For each row, we have 6 columns: - **country**: Name of country. - **year**: Year of the observation (between 1952 and 2007). - **pop**: Number of people living in the country. - **continent**: Which of the five continents the country is part of. - **lifeExp**: Life expectancy in years. - **gdpPercap**: Gross domestic product per capita (in US dollars). To use the `gapminder` dataset, call the function `pl.read_csv()` on the URL: > https://raw.githubusercontent.com/damien-dupre/data/refs/heads/master/gapminder.csv ``` python import polars as pl gapminder=pl.read_csv("https://raw.githubusercontent.com/damien-dupre/data/refs/heads/master/gapminder.csv") ``` --- class: inverse, mline, center, middle # 1. Basic Visualisations with plotnine --- # Using plotnine The plotnine package provides an **easy way to create and to customise your plots**. The most widely used Python plotting packages include Matplotlib and Seaborn. 
However, plotnine is based on the Grammar of Graphics, which makes plots easier to build and to customise because every plot is assembled from the same kinds of layers. ``` python from plotnine import * ``` -- **“the grammar of graphics”** defines a set of rules for constructing statistical graphics by combining different types of layers. From the first (top) to the last (bottom), the mandatory layers are: <img src="https://miro.medium.com/v2/resize:fit:1400/1*hd6-LkI_sy4b4nu720eV_A.png" width="100%" style="display: block; margin: auto;" /> --- # Data In plotnine, the **data** layer is the name of the object that contains the variables to plot. The visualisation is initiated with the function `ggplot()` ``` python (ggplot(data=my_dataframe) # the argument name can be omitted too ) ``` -- Let's try to use the gapminder data: ``` python (ggplot(data=gapminder) # or ggplot(gapminder) ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> Only a grey frame is displayed! We need to define the axes (aesthetic mapping) and their corresponding layers. **Let's add additional layers with the symbol `+`.** --- # Aesthetic Mapping The **aesthetic mapping** refers to the frame of the plot: `x` for the variable on the x-axis and `y` for the variable on the y-axis. There are many more aesthetics such as `color`, `fill` and `group` but let's focus on the x and y axes for the moment. The mapping is performed by a function called `aes()` for aesthetics. This is an essential concept: the mapping of a plot (its frame and axes) is built from aesthetics. ``` python (ggplot(data=my_dataframe) + aes(x="my_xaxis_variable", y="my_yaxis_variable") ) ``` -- The argument names of `aes()` can be omitted if the values are given in the right order: ``` python (ggplot(my_dataframe) + aes("my_xaxis_variable", "my_yaxis_variable") ) ``` Not every type of layer is suitable for every aesthetic: it depends on how many variables are included and on their type (categorical or continuous). 
--- # Aesthetic Mapping .pull-left[ **Aesthetic mapping represents** not only the variables defined as **x-axis** and **y-axis** but also, **colours** of borders (colors), colours of shapes (fill), **shapes**, **size**, ... ] .pull-right[ ``` python (ggplot(data=my_dataframe) + aes( x="my_xaxis_variable", y="my_yaxis_variable", color="my_color_variable", fill="my_fill_variable", shape="my_shape_variable", size="my_size_variable" ) ) ``` ] -- The following code is adding not only a x- and y-axis to a plot, but also colours to the points: .pull-left[ ``` python (ggplot(data=gapminder) + aes( x="gdpPercap", y="lifeExp", color="continent" ) ) ``` ] .pull-right[ <img src="lecture_8_files/figure-html/unnamed-chunk-13-3.png" style="display: block; margin: auto;" /> ] --- # Geometries **Geometries**, which are shapes we use to represent our data. There is a dedicated function for every type of shape to plot but all start with `geom_` |function |shape | |------------------|---------| |`geom_point()` |point | |`geom_line()` |line | |`geom_bar()` |bar | |`geom_histogram()`|histogram| |`geom_boxplot()` |boxplot | |... |... | See here the exhaustive list of all available geometry functions: <https://plotnine.org/reference/#geoms>
<i class="fas fa-arrow-circle-right faa-horizontal animated faa-slow " style=" color:blue;"></i>
Note: Not all data is suitable for all types of geometries. You have to find the geometry that corresponds to your data. --- # Geometries Example for **scatter plots**: ``` python (ggplot(data=my_dataframe) + aes(x="my_xaxis_variable", y="my_yaxis_variable") + geom_point() ) ``` Example for **line graph**: ``` python (ggplot(data=my_dataframe) + aes(x="my_xaxis_variable", y="my_yaxis_variable") + geom_line() ) ``` Example for **bar graph**: ``` python (ggplot(data=my_dataframe) + aes(x="my_xaxis_variable", y="my_yaxis_variable") + geom_col() ) ``` --- # Geometries Applied to gapminder Example for **scatter plots**: ``` python (ggplot(data=gapminder) + aes(x="gdpPercap", y="lifeExp", color="continent") + geom_point() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-18-5.png" style="display: block; margin: auto;" /> --- # Geometries Applied to gapminder Example for **line graph**: ``` python (ggplot(data=gapminder) + aes(x="year", y="lifeExp", color="country") + geom_line() + guides(color="none") # removes legend because too many countries ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-20-7.png" style="display: block; margin: auto;" /> --- # Geometries Applied to gapminder Example for **bar graph**: ``` python gapminder_avg_continent_2007 = (gapminder .filter(pl.col("year") == 2007) .group_by(pl.col("continent")) .agg(pl.col("lifeExp").mean().alias("m_lifeExp")) ) (ggplot(data=gapminder_avg_continent_2007) + aes(x="continent", y="m_lifeExp", fill="continent") + geom_col() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-22-9.png" style="display: block; margin: auto;" /> --- class: title-slide, middle ## Live Demo --- class: title-slide, middle ## Exercise Create: - A `ggplot()` layer with the `gapminder` data, - An `aes()` layer containing `continent` as x, `lifeExp` as y, and `continent` as color, - And a `geom_boxplot()` layer: ``` python (ggplot(gapminder) + aes(x="continent", y="______", color="________") + geom_boxplot() ) ``` - A 
`ggplot()` layer with the `gapminder` data, - An `aes()` layer containing `year` as x, `pop` as y, and `continent` as fill, - And a `geom_col()` layer: ``` python (ggplot(gapminder) + aes(x="____", y="___", fill="________") + geom_col() ) ```
--- class: inverse, mline, center, middle # 2. Advanced Visualisations with plotnine --- # Inherited Properties of Geometries You can add as many geometry layers as you want, however repeating the mapping for each geometry layer is very redundant. Thankfully, if all your geometry layers are using the same aesthetics mapping, __it is possible to include this mapping inside the `ggplot()`__, then all the geometry layers will have the same mapping: ``` python (ggplot(gapminder, aes(x="year", y="lifeExp", color="country")) + geom_point() + geom_line() ) ``` -- If aesthetics are different for several geometry layers, it is also __possible to declare the aesthetics in the geometry__: ``` python (ggplot(gapminder) + geom_point(aes(x="year", y="lifeExp", shape="continent")) + geom_line(aes(x="year", y="lifeExp", color="country")) ) ``` --- # Themes Now, to make the plot more professional, let's remove that standard grey background using a different theme. Many themes come built into the plotnine package. My preference is `theme_bw()` but once you start typing `theme_` a list of options will pop up. ``` python (ggplot(gapminder) + aes(x="gdpPercap", y="lifeExp", color="continent") + geom_point() + theme_bw() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" /> --- # Themes .pull-left[ Built-in themes include: - `theme_538` - `theme_bw` - `theme_classic` - `theme_dark` - `theme_gray` - `theme_light` - `theme_linedraw` - `theme_matplotlib` - `theme_minimal` - `theme_seaborn` - `theme_tufte` - `theme_void` - `theme_xkcd` ] .pull-right[ <img src="https://sakaluk.wordpress.com/wp-content/uploads/2016/02/4282183.jpg" style="display: block; margin: auto;" /> ] --- # Facets Faceting is used to __split a particular visualisation by the values of another variable__. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ. 
For example, suppose we were interested in looking at the evolution of life expectancy by continent from 1952. We could “split” this figure for each continent. In other words, we would plot a scatter plot for each continent separately. We do this by adding a `facet_wrap("continent")` layer. --- # Facets ``` python (ggplot(gapminder) + aes(x="year", y="lifeExp", color="continent") + geom_point() + facet_wrap("continent") + theme_classic() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> --- # Facets We can also specify the number of rows and columns in the grid by using the `nrow` and `ncol` arguments inside of `facet_wrap()`. For example, suppose we would like our faceted figure to have 1 row instead of 2. We simply add an `nrow=1` argument to `facet_wrap("continent")`: ``` python (ggplot(gapminder) + aes(x="year", y="lifeExp", color="continent") + geom_point() + facet_wrap("continent", nrow=1) + theme_classic() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-34-3.png" style="display: block; margin: auto;" /> --- # Labels plotnine has a layer called `labs()` to change the axis labels and add titles. ``` python (ggplot(gapminder) + aes(x="year", y="lifeExp", color="continent") + geom_point() + facet_wrap("continent", nrow=1) + labs( x="Year (from 1952 to 2007)", y="Life Expectancy", title="Evolution of life expectancy from 1952 to 2007 per continent." ) + theme_classic() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-36-5.png" style="display: block; margin: auto;" /> `labs()` can set the x, y and title texts as well as many others, such as colour, subtitle, or caption. --- # Statistics and Special Effects Instead of creating summaries inside the data frame object, plotnine has some functions to calculate and display them automatically. The first special effect is the `geom_smooth()` layer. `geom_smooth()` is a classic geometry layer, but one that displays linear and non-linear trends. 
``` python (ggplot(gapminder) + aes(x="year", y="lifeExp", color="continent") + geom_point() + geom_smooth() + facet_wrap("continent", nrow=1) + theme_classic() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-38-7.png" style="display: block; margin: auto;" /> --- # Statistics and Special Effects `geom_smooth()` has one important extra argument called `method`. If method has the value __"lm"__, a linear regression will be shown. If method has the value __"loess"__ or __"lowess"__, a non-linear (locally weighted) regression will be shown. ``` python (ggplot(gapminder) + aes(x="year", y="lifeExp", color="continent") + geom_point() + geom_smooth(method="lm") + facet_wrap("continent", nrow=1) + theme_classic() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-40-9.png" style="display: block; margin: auto;" /> --- # Additional Options ``` python (ggplot(gapminder) + aes(x="year", y="lifeExp", color="continent") + geom_point(alpha=0.2, size=1) + geom_smooth(method="lm") + facet_wrap("continent", nrow=1) + scale_x_continuous(breaks=[1960, 1980, 2000]) + labs( x="Year (from 1952 to 2007)", y="Life Expectancy", title="Evolution of life expectancy from 1952 to 2007 per continent." ) + theme_classic() + theme(text=element_text(size=20)) + guides(color="none") ) ``` --- # Additional Options <img src="lecture_8_files/figure-html/unnamed-chunk-42-11.png" style="display: block; margin: auto;" /> --- # Map Visualisations You will need a new package to read the world shape file: ``` python import geopandas as gp ``` To make a map visualisation with plotnine, we __need the shape__ of each country. 
``` python shape_world = gp.read_file("https://public.opendatasoft.com/api/explore/v2.1/catalog/datasets/world-administrative-boundaries/exports/shp") ``` .pull-left[ Then, you can plot this map with a default black filling colour: ``` python (ggplot() + geom_map(shape_world) ) ``` ] .pull-right[ <img src="lecture_8_files/figure-html/unnamed-chunk-46-13.png" style="display: block; margin: auto;" /> ] --- # Map Visualisations To obtain colours that match the values of a specific variable, the data containing this variable needs to be joined to the shape data that we want to display on the map. `geom_map()` will use this variable to fill the countries with a colour gradient. For example, let's display the population for each country in 2007: ``` python pop_2007 = (gapminder .filter(pl.col("year") == 2007) ) map_pop_2007 = (shape_world .merge(pop_2007.to_pandas(), left_on="name", right_on="country", how="outer") ) (ggplot(data=map_pop_2007) + aes(fill="pop") + geom_map() + coord_fixed() ) ```
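To see why the yearly data is filtered to a single year before the merge, here is a minimal sketch using pandas only. The tables `shapes` and `observations` are hypothetical toy stand-ins with made-up values (not the real shape file or the real gapminder figures); only the `merge()` call mirrors the one used on the map data.

```python
import pandas as pd

# Toy stand-in for the shape file: one row per country (made-up example)
shapes = pd.DataFrame({"name": ["Ireland", "France"]})

# Toy stand-in for gapminder: several years per country (made-up values)
observations = pd.DataFrame({
    "country": ["Ireland", "Ireland", "France", "France"],
    "year": [2002, 2007, 2002, 2007],
    "pop": [10, 11, 20, 21],
})

# Unfiltered merge: each country shape is repeated once per year
unfiltered = shapes.merge(observations, left_on="name", right_on="country", how="outer")

# Filtered merge: one row per country, which is what a map needs
only_2007 = observations[observations["year"] == 2007]
filtered = shapes.merge(only_2007, left_on="name", right_on="country", how="outer")

print(len(unfiltered), len(filtered))
```

With an unfiltered table, every country would be drawn several times on top of itself, once per year, and the fill colour shown would depend on drawing order.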
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Do NOT merge `shape_world` with an unfiltered object and do NOT use the resulting object for any other visualisation. --- # Map Visualisations <img src="lecture_8_files/figure-html/unnamed-chunk-48-15.png" style="display: block; margin: auto;" /> Use the `coord_fixed()` layer to keep the relationship between the latitude and longitude scales. --- # Map Visualisations The `gapminder` and `shape_world` objects spell some country names differently. For example, "United States of America" can be "USA" or just "United States". To solve this, we can join them with a new object that provides the ISO code of each country regardless of its spelling, which you can then use to match countries across datasets. ``` python # 1. Read the csv file containing the iso3 convention for all spellings country_code = pl.read_csv("https://raw.githubusercontent.com/adrivsh/country_names/refs/heads/master/names_to_iso.csv") # 2. Join country_code and gapminder to identify the iso3 code for each country gapminder_code = (gapminder.join(country_code, on="country", how="left")) # 3. Filter gapminder_code to have only values for 2007 gapminder_code_2007 = (gapminder_code.filter(pl.col("year") == 2007)) # 4. Merge the values with the shape file for each country using the iso3 shape_gapminder_code_2007 = (shape_world .merge(gapminder_code_2007.to_pandas(), on="iso3", how="outer") ) # 5. 
Plot the map (ggplot(data=shape_gapminder_code_2007) + aes(fill="pop") + geom_map() + coord_fixed() ) ``` --- # Map Visualisations <img src="lecture_8_files/figure-html/unnamed-chunk-50-17.png" style="display: block; margin: auto;" /> --- # Map Visualisations ``` python (ggplot(data=shape_gapminder_code_2007) + aes(fill="pop") + geom_map() + scale_fill_gradient(low="yellow", high="red") + labs( title="Differences between countries regarding their population in 2007", subtitle="Countries in grey have no data due to a mismatch with their names", caption="Source: gapminder", x="Longitude", y="Latitude", fill="Country Population" ) + coord_fixed() + theme_bw() ) ``` --- # Map Visualisations <img src="lecture_8_files/figure-html/unnamed-chunk-52-19.png" style="display: block; margin: auto;" /> --- class: title-slide, middle ## Live Demo --- class: title-slide, middle ## Exercise Build a ggplot with the `gapminder` dataset and one `aes()` layer which contains x as `continent`, y as `lifeExp` and colour as `continent`. Use `geom_boxplot()` as geometry, `year` as a facet variable, a theme of your choice, and change the axis labels: ``` python (ggplot(gapminder) + aes(x="________", y="_______", color="________") + geom_boxplot() + facet_wrap("____") + labs(x="______", y="______", title="______") + theme_____() ) ``` Build a ggplot with the `gapminder` dataset and: - One `geom_line()` layer which contains x as `year`, y as `lifeExp` and group as `country` in its `aes()`, - One `geom_smooth()` layer which contains x as `year`, y as `lifeExp` and colour as `continent` in its `aes()`, - One `facet_wrap()` for each continent. ``` python (ggplot(gapminder) + geom_line(aes(x="____", y="_______", group="_______")) + geom_smooth(aes(x="____", y="_______", color="________")) + facet_wrap("________") ) ```
--- class: inverse, mline, center, middle # 3. Combine ggplot() and Transformations --- # Combine ggplot() and Transformations A very powerful way to create figures is to use a __ggplot at the end of data transformations__. Indeed, having a data frame object as first argument of the `ggplot()` function is similar to chaining it to the `ggplot()` function using `>>`: ``` python # this classic representation: (ggplot(data=gapminder) + aes(x="gdpPercap", y="lifeExp", color="continent") + geom_point() ) # is the same as: (gapminder >> ggplot() + aes(x="gdpPercap", y="lifeExp", color="continent") + geom_point() ) # is the same as: (gapminder >> ggplot(aes(x="gdpPercap", y="lifeExp", color="continent")) + geom_point() ) ``` The layers are still added with the `+` symbol. --- # filter() to ggplot() You can easily display only the data for a specific section of your interest. For example, let's filter the data only for Ireland: ``` python (gapminder .filter(pl.col("country") == "Ireland") >> ggplot(aes("year", "lifeExp", color="country")) + geom_line() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-58-1.png" style="display: block; margin: auto;" /> --- # with_columns() to ggplot() If you need to display a variable that has to be created beforehand, you can always include a `with_columns()` statement in the chain. For example, let's create the variable `gdpPercountry` which is the result of the multiplication between countries' population and countries' gdpPercap. 
Then let's display this information for Ireland: ``` python (gapminder .with_columns(gdpPercountry=pl.col("gdpPercap")*pl.col("pop")) .filter(pl.col("country") == "Ireland") >> ggplot(aes("gdpPercountry", "lifeExp", color="country")) + geom_line() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-60-3.png" style="display: block; margin: auto;" /> --- # group_by() and agg() to ggplot() Finally, one of the most useful possibilities is to aggregate variables per group and to display this information in figures. For example, let's create the average population per continent and display how this average evolves with time: ``` python (gapminder .group_by(pl.col("year"), pl.col("continent")) .agg(pl.col("pop").mean().alias("m_pop")) >> ggplot(aes("year", "m_pop", color="continent")) + geom_line() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-62-5.png" style="display: block; margin: auto;" /> --- # group_by() and agg() to ggplot() We can also compare the sum of the population by continent for the year 2007: ``` python (gapminder .filter(pl.col("year") == 2007) .group_by(pl.col("continent")) .agg(pl.col("pop").sum().alias("s_pop")) >> ggplot(aes("continent", "s_pop", fill="continent")) + geom_col() ) ``` <img src="lecture_8_files/figure-html/unnamed-chunk-64-7.png" style="display: block; margin: auto;" /> --- class: inverse, mline, center, middle # 4. Vibe Coding Visualisations --- # What is Vibe Coding? Vibe coding is the practice of **describing what you want in natural language** and letting an AI assistant (such as ChatGPT, Claude, or GitHub Copilot) generate the code for you. This is especially powerful for data visualisation because: - Plots have many customisation options that are hard to memorise - AI assistants are very good at generating plotnine/ggplot code - You can describe the **plot you want to see** instead of looking up every argument - Iterating on a design is faster: just describe what to change
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Vibe coding does **not** mean blindly trusting AI output. Always **read and understand** the code before running it! --- # Vibe Coding in Practice **Step 1:** Write a prompt describing the visualisation you want > "Using plotnine with the gapminder dataset, create a scatter plot of gdpPercap vs lifeExp for the year 2007 only, coloured by continent. Add a linear trend line, use theme_minimal, and label the axes 'GDP per Capita' and 'Life Expectancy'." **Step 2:** The AI generates plotnine code ``` python (gapminder .filter(pl.col("year") == 2007) >> ggplot(aes(x="gdpPercap", y="lifeExp", color="continent")) + geom_point() + geom_smooth(method="lm") + labs(x="GDP per Capita", y="Life Expectancy") + theme_minimal() ) ``` **Step 3:** Read the code, verify each layer matches your intent, and run it --- # Tips for Vibe Coding Visualisations .pull-left[ **Good prompts are specific:** - Name the dataset and column names - Specify the geometry (scatter, line, bar...) - Mention colours, facets, themes, and labels - State the library: "Using plotnine" (not matplotlib!) **Example of a good prompt:** > "Using plotnine, from gapminder filtered to 2007, create a bar chart of total population per continent, filled by continent, with theme_bw and a title." ] .pull-right[ **Common pitfalls to avoid:** - Not specifying plotnine (AI may use matplotlib or seaborn) - Forgetting to mention that data comes from Polars - Accepting code without checking column names - Not verifying the `>>` pipe is used correctly **Always verify:** - Are the column names in quotes and spelled correctly? - Is the geometry appropriate for your data? - Does the code chain Polars transformations with `>>` before `ggplot()`? - Do the layers make sense when read top to bottom? ] --- class: title-slide, middle ## Exercise Use an AI assistant (ChatGPT, Claude, or Copilot) to generate plotnine code for the following tasks. 
**Review the code before running it!** 1/ Ask the AI: *"Using plotnine and Polars, from a dataframe called gapminder, create a line chart showing the evolution of average life expectancy per continent over time. Use different colours for each continent and add a title."* 2/ Ask the AI: *"Using plotnine and Polars, from a dataframe called gapminder filtered to the year 2007, create a horizontal bar chart of GDP per capita by country for Europe only. Sort the bars from highest to lowest."* 3/ Copy the generated code into Google Colab, verify it looks correct, and run it. Did it produce the expected visualisation? If not, what did you need to fix?
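As a starting point for that review, here is a minimal sketch of a column-name check in plain Python. The helper `unknown_columns` and the sample list `ai_columns` are hypothetical, made up for illustration; the set of valid names comes from the dataset description earlier in these slides.

```python
# Columns of the gapminder dataset, as listed earlier in these slides
GAPMINDER_COLUMNS = {"country", "year", "pop", "continent", "lifeExp", "gdpPercap"}


def unknown_columns(used, available=GAPMINDER_COLUMNS):
    """Return the names in `used` that are not in `available`, sorted."""
    return sorted(set(used) - set(available))


# Column names copied from a hypothetical AI-generated answer
ai_columns = ["gdpPercap", "life_exp", "continent"]

# Any name printed here would make the generated plot fail
print(unknown_columns(ai_columns))  # ['life_exp']
```

This only catches misspelled column names; whether the geometry and the layers match your intent still has to be checked by reading the code.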
--- class: inverse, mline, left, middle <img class="circle" src="https://github.com/damien-dupre.png" width="250px"/> # Thanks for your attention and don't hesitate to ask if you have any questions!
[@damien-dupre](http://github.com/damien-dupre)
[https://damien-dupre.github.io](https://damien-dupre.github.io)
[damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)