class: center, middle, inverse, title-slide .title[ # BAA1030 - Data Analytics and Story Telling ] .subtitle[ ## Lecture 8: Data Wrangling with Polars ] .author[ ### Damien Dupré - Dublin City University ] --- # Python so far We have already seen how to: - **Use python** in Google Colab - **Install, load and use packages** - Run **code cells** - Use **keyboard shortcuts**: <kbd>Ctrl</kbd> & <kbd>Enter</kbd> (Win) / <kbd>Command</kbd> & <kbd>Enter</kbd> (Mac) - **Create object** with the `=` sign - **Upload and read data** in Google Colab Now let's introduce Polars! <img src="https://pola.rs/share.jpg" width="50%" style="display: block; margin: auto;" /> --- # Data Wrangling and Manipulation `pandas` is a powerful and widely used Python library for data manipulation and analysis. It provides the DataFrame structure (two-dimensional) that allow to efficiently store, manipulate, and analyse structured data. It allows Data Manipulation like: - Adding, deleting, and modifying columns and rows. - Merging, joining, and concatenating datasets. - Reshaping data with pivot tables and groupby operations. - Performing statistical and mathematical operations (mean, sum, count, etc.). - Grouping data using .groupby() for advanced aggregations. However, `pandas` not very nice to read or to write --- # Data Wrangling and Manipulation `polars` is a new project for DataFrame manipulation. It is lightning fast and a bit more user friendly than `pandas`. <img src="https://codecut.ai/wp-content/uploads/2024/07/pandas_vs_polars-1-1.png" style="display: block; margin: auto;" /> --- # Data Wrangling with Polars There are five operations that you will use to do the vast majority of data manipulations: - `Extension` i.e., Make new variables (create new variables with functions of existing variables) - `Reduction` i.e., Subset observations (pick observations by their values) - `Selection` i.e., Subset variables (pick variables by their names) - `Aggregation` i.e., Summarise data (collapse many values down to a single summary) - `Combination` i.e., Merge two or more datasets using a common variable --- # Load polars To load the polars package: ``` python import polars as pl ``` In the following code inside these slides: - `pl` is the acronym to call functions from the package - `pl.col("variable name")` will access the specific `variable` --- # The gapminder dataset The dataset used today is called `gapminder`. Each row corresponds to a country at a specific year. For each row, we have 6 columns: - **country**: Name of country. - **year**: Year of the observation (between 1952 and 2007). - **pop**: Number of people living in the country. - **continent**: Which of the five continents the country is part of. - **lifeExp**: Life expectancy in years. - **gdpPercap**: Gross domestic product (in US dollars). to use the `gapminder` dataset, use the function `pl.read_csv` on the url: > https://raw.githubusercontent.com/damien-dupre/damien-dupre.github.io/refs/heads/main/gapminder.csv ``` python gapminder = pl.read_csv("https://raw.githubusercontent.com/damien-dupre/damien-dupre.github.io/refs/heads/main/gapminder.csv") ``` --- # The polars Syntax .pull-left[ The following code displays the 5 first rows with the `head()` function: ``` python gapminder.head(5) ``` ] .pull-right[ You can create a new object with the `=` assignment symbol and print it out: ``` python gapminder_short = gapminder.head(5) gapminder_short ``` ] However, instead of the classic style, we will use a Polars syntax using `(` as first character and `)` as last character, and spreading the code over multiple lines: .pull-left[ ``` python (gapminder .head(5) ) ``` ] .pull-right[ ``` python gapminder_short = (gapminder .head(5) ) ``` ] This way looks longer but it will be very useful to chain different transformations! --- # Method Chaining Many people don't need to be told why method chaining is nice. Many languages make it easy to write `thing.min().abs().str()` instead of `str(abs(min(thing)))`. It's not always easy for libraries to accommodate method chaining - in fancy terms, to be [fluent interfaces](https://en.wikipedia.org/wiki/Fluent_interface). Even Pandas used to be much less fluent: when *Modern Pandas* was released, methods like `assign` and `pipe` were quite recent. Fortunately Polars is very fluent and can be read easily using Method Chaining. ``` python (gapminder .with_columns(gdp_total = pl.col("pop") * pl.col("gdpPercap")) .select( pl.col("gdp_total"), pl.col("year").alias("measure_year"), pl.col("pop").alias("total_pop") ) .group_by(pl.col("measure_year")) .agg(pl.col("gdp_total").mean().alias("gdp_total_average")) ) ``` --- class: title-slide, middle ## Live Demo --- class: title-slide, middle ## Exercise 1/ Load Polars: ``` python import polars as pl ``` 2/ Create a `gapminder` object using the `polars` function `read_csv`: ``` python gapminder = pl.read_csv("https://raw.githubusercontent.com/damien-dupre/damien-dupre.github.io/refs/heads/main/gapminder.csv") ``` 3/ Use the function `head()` on the gapminder object ``` python gapminder.head(5) ```
−
+
05
:
00
--- class: inverse, mline, center, middle # 1. The select() function --- # The select() function It is not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often **narrowing in on the variables** you are actually interested in. `select()` allows you to rapidly zoom in on a **useful subset** using operations based on the names of the variables. Again the first argument is the name of the **data frame** object to process and the following arguments are the **name of the columns to keep**. ``` python (gapminder .select( pl.col("country"), pl.col("year"), pl.col("pop") ) ) ```
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
Importantly, **Column/Variable names explicitly mentioned** with Polars using the function `pl.col()` --- # Select and rename variables While you are keeping only specify variables with `select()`, these variables can also be renamed on the same time. The function `alias()` with the **new name** overwrite the **old name**. Example: ``` python (gapminder .select( pl.col("country"), pl.col("year").alias("measure_year"), pl.col("pop").alias("total_pop") ) ) ``` --- class: title-slide, middle ## Live Demo --- class: title-slide, middle ## Exercise From the data frame object `gapminder`, select the columns `lifeExp` and `gdpPercap` and rename them as life_expectancy and gdp_per_capita: ``` python (gapminder .select( __.___(____).____(_____), __.___(____).____(_____) ) ) ```
−
+
03
:
00
--- class: inverse, mline, center, middle # 2. The filter() function --- # The filter() function You will want to isolate bits of your data; maybe you want to only look at a single country or a few years. Polars calls this subsetting and `filter()` allows you to **subset observations based on their values**. `filter()`'s transformation is a conditional statement, only observations TRUE to the condition are kept. For example: ``` python (gapminder .filter(pl.col("country") == "Ireland") ) ``` - A Column/Variable is an array containing multiple values inside the data frame object. - The string "Ireland" does not exist in our environment and will not be saved. It is here just as a value, which explains the quotation marks. --- # Comparisons To use filtering effectively, you have to know how to select the observations that you want using the **comparison operators**. Python provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (exactly equal). When you are starting out with python, the **easiest mistake to make** is to use `=` instead of `==` when testing for equality. When this happens you will get an informative error: ``` python (gapminder .filter(pl.col("country") = "Ireland") ) ```
<i class="fas fa-exclamation-triangle faa-flash animated faa-slow " style=" color:red;"></i>
The single equal sign `=` is only to assign values to argument in functions whereas the double equal sign `==` checks for equality (use the "exactly equal" term to remember). --- # Multiple filters and multipe values Multiple can be filtered using specific variables like `is_in([])` or `is_between()`. Multiple filters can be added one after the other, separated with a comma. ``` python (gapminder .filter( pl.col("country").is_in(["Germany", "United States"]), pl.col("year").is_between(2000, 2007) ) ) ``` --- class: title-slide, middle ## Live Demo --- class: title-slide, middle ## Exercise 1/ Be sure that the packages `polars` is loaded 2/ Be sure that the object `gapminder` is created 3/ Create a new object that only contains data for France ``` python _____ = (_____ .filter(__.___("country") == ______) ) ``` 4/ Create a new object using the previous object that only contains data for France in 1982
−
+
03
:
00
--- class: inverse, mline, center, middle # 3. The with_columns() function --- # The with_columns() function Besides selecting sets of existing columns, it is often useful to **add new columns** that are functions of existing columns. That is the job of `with_columns()`. Once again the first element is the **name of the data frame** object to modify, then the second element is `with_columns()`. Inside `with_columns()`, the name of the new variable is followed by the `=` sign and the **condition creating the new values**. For example we can create a new column called `gdp_total` which contains the values resulting from the multiplication between `pop` and `gdpPercap`: ``` python (gapminder .with_columns(gdp_total = pl.col("pop") * pl.col("gdpPercap")) ) ``` `with_columns()` can also create multiple columns in the same statement, they just have to be separated by a comma `,`. --- class: title-slide, middle ## Live Demo --- class: title-slide, middle ## Exercise From the data frame object `gapminder`, create a new column called `country_upper` with the function `.str.to_uppercase()` on the column `country`: ``` python ____ = (____ .____(____ = __.____("____")._____) ) ```
−
+
05
:
00
--- class: inverse, mline, center, middle # 4. The group_by() and agg() functions --- # Summarise your data You can summarise your data with the function `len()`, `sum()`, `mean()`, `std()`, `median()` when selecting variables. For example, if we applied exactly the same code to a data frame, we get the average population per observation: ``` python (gapminder .select(pl.col("pop").mean()) ) ``` This summary can be renamed for transparency: ``` python (gapminder .select(pl.col("pop").mean().alias("pop_average")) ) ``` --- # The group_by() and agg() functions Summarising is not terribly useful **unless done for multiple groups** with `group_by()` and `agg()`. For example, if we applied exactly the same code to a data frame **grouped by year**, we get the average world population per year: ``` python (gapminder .group_by(pl.col("year")) .agg(pl.col("pop").mean().alias("pop_average")) ) ``` --- class: title-slide, middle ## Live Demo --- class: title-slide, middle ## Exercise 1/ From the data frame object `gapminder`, summarise the population average with the `mean()` function by `year` and by `continent` by adding the second grouping variable after the first one (use a coma to separate them): ``` python (gapminder .group_by( pl.col("year"), __.___("_________") ) .agg(pl.col("pop").mean().alias("pop_average")) ) ``` 2/ From the data frame object `gapminder`, summarise the population standard deviation with the `std()` function by `year` and by `continent` by adding the second grouping variable after the first one (use a coma to separate them).
−
+
03
:
00
--- class: inverse, mline, center, middle # 5. More complex operations --- # More complex operations Functions can be chained to create more complex queries according to your needs. In the example below we combine some of the contexts we have seen so far to create a more complex query: ``` python (gapminder .with_columns(gdp_total = pl.col("pop") * pl.col("gdpPercap")) .select( pl.col("gdp_total"), pl.col("year").alias("measure_year"), pl.col("pop").alias("total_pop") ) .group_by(pl.col("measure_year")) .agg(pl.col("gdp_total").mean().alias("gdp_total_average")) ) ``` --- class: title-slide, middle ## Live Demo --- class: title-slide, middle ## Exercise 1/ From the data frame object `gapminder`, summarise the sum of the worldwide population by `year` 2/ From the data frame object `gapminder`, summarise the lifeExp average with the `mean()` function by `year` and by `continent`
−
+
05
:
00
--- class: inverse, mline, center, middle # 6. The join() functions --- # Join 2 Tables .pull-left[ The principle is simple, two different tables are sharing a **key variable**. By joining these two table by this key variable, it is possible to merge them into one table and to keep all variables. However, there are different cases, imagine that your X table (on the left) has more observations on the key variable than the Y table (on the right) ] .pull-right[ <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/static/png/original-dfs.png" width="100%" style="display: block; margin: auto;" /> ] You might want to keep: - Only the observations included in the left table - Only the observations included in the right table - Only the observations included in both tables - All observations For more visualisation see [Tidy Animated Verbs](https://www.garrickadenbuie.com/project/tidyexplain/) --- # Join 2 Tables with Same Index Name Create 2 dataframe objects sharing 2 key variables: country and year ``` python table_1 = (gapminder .select(pl.col("country"), pl.col("year"), pl.col("pop")) ) table_2 = (gapminder .select(pl.col("country"), pl.col("year"), pl.col("lifeExp")) ) ``` Different possibilities to automatically join the dataframe objects ``` r table_right_joined = (table_1 .join(table_2, on = "country", how = "right") ) ``` `how` can be left, right, inner, or full. Set the argument `on` to manually choose multiple key variables, example: ``` python table_full_joined = (table_1 .join(table_2, on = ["country", "year"], how = "full") ) ``` --- # Join 2 Tables with Different Index Name Sometimes the name of the variable index used to match the two tables is different, in this case it is necessary to __manually specify the matching variables__. ``` python table_1 = (gapminder .select(pl.col("country"), pl.col("year"), pl.col("pop")) ) table_2 = (gapminder .select( pl.col("country").alias("country_measure"), pl.col("year").alias("year_measure"), pl.col("lifeExp")) ) ``` Set the argument `by` to manually choose the key variables, and specify the association: ``` python (table_1 .join( table_2, left_on = ["country", "year"], right_on = ["country_measure", "year_measure"] ) ) ``` --- # References Here is a list of very useful documents to get further: - [Polars official user guide](https://docs.pola.rs/) - [Modern Polars - A side-by-side comparison of the Polars and Pandas libraries](https://kevinheavey.github.io/modern-polars/) - [Polars - A Refreshingly Great DataFrame Library](https://blog.londogard.com/posts/2022-11-30-why-polars.html) - [(Pretty) big data wrangling with DuckDB and Polars](https://grantmcdermott.com/duckdb-polars/slides/slides.html) - [DataFrames on steroids with Polars](https://mint.westdri.ca/python/wb_polars_slides) --- class: inverse, mline, left, middle <img class="circle" src="https://github.com/damien-dupre.png" width="250px"/> # Thanks for your attention and don't hesitate to ask if you have any question! [
@damien-dupre](http://github.com/damien-dupre) [
https://damien-dupre.github.io](https://damien-dupre.github.io) [
damien.dupre@dcu.ie](mailto:damien.dupre@dcu.ie)