Introduction to Polars

Written by Luke Chang

Polars is a blazing-fast DataFrame library for Python, written in Rust. It is designed for high-performance data manipulation and offers an expressive, consistent API that makes data wrangling both efficient and enjoyable. Unlike older tools, Polars was built from the ground up to take advantage of modern hardware through multi-threaded execution, lazy evaluation, and memory-efficient columnar storage.

Why Polars?

  • Speed: Polars is one of the fastest DataFrame libraries available, routinely outperforming alternatives by 10-100x on large datasets.
  • Memory efficiency: Its Apache Arrow-based columnar format minimizes memory usage and avoids unnecessary copies.
  • Expressive API: The expression system lets you write concise, readable queries that are easy to compose and optimize.
  • Lazy evaluation: Polars can build an optimized query plan before executing, enabling automatic optimizations like predicate pushdown and projection pruning.

In this tutorial, we will learn the fundamentals of Polars using a faculty salary dataset. By the end, you will be comfortable loading data, transforming columns, filtering rows, grouping and aggregating, and using advanced features like window functions and lazy evaluation.

For more details, check out the official Polars documentation.

import polars as pl
import numpy as np

Polars Objects

Polars provides two core data structures: Series and DataFrame. Understanding these is the first step to working with Polars effectively.

Series

A Series is a typed, one-dimensional array. Every element in a Series has the same data type, which is determined at creation time. You can think of it as a single column of data.

# Create a Series from a list of integers
ages = pl.Series("age", [25, 30, 35, 40, 45])
ages
shape: (5,)
age
i64
25
30
35
40
45
# Create a Series with an explicit type
scores = pl.Series("score", [88.5, 92.0, 76.3, 95.1], dtype=pl.Float32)
scores
shape: (4,)
score
f32
88.5
92.0
76.300003
95.099998

Notice the values above: 76.3 and 95.1 have no exact Float32 representation, so they display as 76.300003 and 95.099998.

# String Series
names = pl.Series("name", ["Alice", "Bob", "Charlie", "Diana"])
names
shape: (4,)
name
str
"Alice"
"Bob"
"Charlie"
"Diana"

DataFrame

A DataFrame is a two-dimensional table composed of multiple named Series (columns). Each column has its own data type, and all columns share the same length.

A key difference from some other tools: Polars DataFrames have no row index. Rows are identified by their position, and any row labeling must be done explicitly through a column. This keeps the API simple and avoids ambiguity.

# Create a DataFrame from a dictionary
df_example = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
})
df_example
shape: (4, 3)
name       age  salary
str        i64  i64
"Alice"    25   50000
"Bob"      30   60000
"Charlie"  35   70000
"Diana"    40   80000
# Create a DataFrame from numpy arrays
rng = np.random.default_rng(42)
df_from_numpy = pl.DataFrame({
    "x": rng.normal(0, 1, 5),
    "y": rng.normal(0, 1, 5),
})
df_from_numpy
shape: (5, 2)
x          y
f64        f64
0.304717   -1.30218
-1.039984  0.12784
0.750451   -0.316243
0.940565   -0.016801
-1.951035  -0.853044
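
As noted above, rows carry no labels of their own. If you need an explicit row identifier, a minimal sketch is to add one as a regular column; recent Polars versions expose with_row_index for this (older releases call it with_row_count), and the column name "row_id" below is our choice:

# Add an explicit row identifier as an ordinary column
df_example.with_row_index(name="row_id")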

Loading Data

Polars can read data from many formats including CSV, Parquet, JSON, and more. We will load a faculty salary dataset that contains information about salaries, departments, years of experience, and other attributes.

The pl.read_csv() function reads a CSV file eagerly (loading everything into memory immediately). Polars can also read directly from URLs.

url = "https://raw.githubusercontent.com/ljchang/dartbrains/master/data/salary/salary.csv"
df = pl.read_csv(url)
df
shape: (77, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23
…       …       …        …      …    …
53662   1       "neuro"  1      31   3
57185   1       "stat"   9      39   7
52254   1       "stat"   2      32   9
61885   1       "math"   23     60   9
49542   1       "math"   3      33   5

For very large files, you can use pl.scan_csv() instead to create a LazyFrame that defers execution until you call .collect(). We will explore lazy evaluation in detail later in this tutorial.

Inspecting Data

Before diving into analysis, it is important to understand the structure and contents of your data. Polars provides several handy methods for this.

# View the first few rows
df.head()
shape: (5, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23
# View the last few rows
df.tail()
shape: (5, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
53662   1       "neuro"  1      31   3
57185   1       "stat"   9      39   7
52254   1       "stat"   2      32   9
61885   1       "math"   23     60   9
49542   1       "math"   3      33   5
# Random sample of rows
df.sample(5, seed=42)
shape: (5, 6)
salary  gender  departm    years  age  publications
i64     i64     str        i64    i64  i64
58744   0       "physics"  20     50   9
44687   0       "chem"     4      34   19
57185   1       "stat"     9      39   7
51391   0       "stat"     5      35   8
54076   0       "physics"  19     49   12
# Summary statistics for all columns
df.describe()
shape: (9, 7)
statistic     salary        gender    departm  years      age        publications
str           f64           f64       str      f64        f64        f64
"count"       77.0          77.0      "77"     76.0       76.0       77.0
"null_count"  0.0           0.0       "0"      1.0        1.0        0.0
"mean"        67748.519481  0.142857  null     14.973684  45.486842  21.831169
"std"         15100.581435  0.387783  null     8.61777    9.005914   15.24053
"min"         44687.0       0.0       "bio"    1.0        31.0       3.0
"25%"         57185.0       0.0       null     8.0        38.0       9.0
"50%"         62607.0       0.0       null     14.0       44.0       19.0
"75%"         75382.0       0.0       null     23.0       53.0       33.0
"max"         112800.0      2.0       "stat"   34.0       65.0       72.0
# Column names and their data types
df.schema
Schema({'salary': Int64, 'gender': Int64, 'departm': String, 'years': Int64, 'age': Int64, 'publications': Int64})
# Shape of the DataFrame (rows, columns)
df.shape
(77, 6)
# List of column names
df.columns
['salary', 'gender', 'departm', 'years', 'age', 'publications']
# Check for missing values in each column
df.null_count()
shape: (1, 6)
salary  gender  departm  years  age  publications
u32     u32     u32      u32    u32  u32
0       0       0        1      1    0

Dealing with Missing Values

Polars uses null to represent missing data. This is distinct from NaN (Not a Number), which is a valid floating-point value. Unlike NaN, which exists only for float columns, null is handled uniformly across all data types, whether numeric, string, or boolean.
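
A minimal sketch of the distinction (the printed values assume recent Polars semantics, where is_nan propagates null rather than answering False for a missing value):

# null marks missing data; NaN is an ordinary float value
s = pl.Series("x", [1.0, None, float("nan")])
print(s.is_null().to_list())  # [False, True, False]
print(s.is_nan().to_list())   # [False, None, True]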

# See how many nulls each column has
df.null_count()
shape: (1, 6)
salary  gender  departm  years  age  publications
u32     u32     u32      u32    u32  u32
0       0       0        1      1    0
# Drop all rows that contain any null value
df_no_nulls = df.drop_nulls()
print(f"Original rows: {df.shape[0]}, After dropping nulls: {df_no_nulls.shape[0]}")
df_no_nulls
Original rows: 77, After dropping nulls: 75
shape: (75, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23
…       …       …        …      …    …
53662   1       "neuro"  1      31   3
57185   1       "stat"   9      39   7
52254   1       "stat"   2      32   9
61885   1       "math"   23     60   9
49542   1       "math"   3      33   5
# Fill nulls with a specific strategy (e.g., fill numeric nulls with the column mean)
df_filled = df.with_columns(
    pl.col("years").fill_null(strategy="mean"),
    pl.col("age").fill_null(strategy="mean"),
)
df_filled.null_count()
shape: (1, 6)
salary  gender  departm  years  age  publications
u32     u32     u32      u32    u32  u32
0       0       0        0      0    0
# Fill nulls with a constant value
df.with_columns(
    pl.col("years").fill_null(0),
).head()
shape: (5, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23

The Expression API

The expression API is the heart of Polars. Expressions describe computations on columns without immediately executing them. They are the building blocks you pass to methods like select(), filter(), with_columns(), and group_by().agg().

The most common starting point is pl.col("column_name"), which references a column. From there, you can chain operations to build up complex transformations.

# Reference a single column and compute its mean
df.select(pl.col("salary").mean())
shape: (1, 1)
salary
f64
67748.519481
# Chain multiple operations: compute several summary statistics at once
df.select(
    pl.col("salary").mean().alias("mean_salary"),
    pl.col("salary").median().alias("median_salary"),
    pl.col("salary").min().alias("min_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("salary").std().alias("std_salary"),
)
shape: (1, 5)
mean_salary   median_salary  min_salary  max_salary  std_salary
f64           f64            i64         i64         f64
67748.519481  62607.0        44687       112800      15100.581435
# Compute the mean of multiple columns at once
df.select(pl.col("salary", "years", "age", "publications").mean())
shape: (1, 4)
salary        years      age        publications
f64           f64        f64        f64
67748.519481  14.973684  45.486842  21.831169
# pl.all() references every column, including the string column
# "departm", so summing every column raises an error
df.select(pl.all().sum())
InvalidOperationError: `sum` operation not supported for dtype `str`
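
One way to restrict the aggregation to numeric columns is the selectors module (a sketch, assuming a Polars version that ships polars.selectors):

import polars.selectors as cs

# Sum only the numeric columns, skipping the string column "departm"
df.select(cs.numeric().sum())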
# Conditional expressions with when/then/otherwise
df.select(
    pl.col("salary"),
    pl.col("departm"),
    pl.when(pl.col("salary") > 80000)
      .then(pl.lit("high"))
      .otherwise(pl.lit("standard"))
      .alias("salary_tier"),
)
shape: (77, 3)
salary  departm  salary_tier
i64     str      str
86285   "bio"    "high"
77125   "bio"    "standard"
71922   "bio"    "standard"
70499   "bio"    "standard"
66624   "bio"    "standard"
…       …        …
53662   "neuro"  "standard"
57185   "stat"   "standard"
52254   "stat"   "standard"
61885   "math"   "standard"
49542   "math"   "standard"

Creating New Columns

The with_columns() method adds new columns or transforms existing ones. It always returns a new DataFrame, leaving the original unchanged. This immutability makes your code easier to reason about and debug.

Use .alias("name") to give your computed column a name.

# Create a column showing salary in thousands
df_enhanced = df.with_columns(
    (pl.col("salary") / 1000).round(1).alias("salary_thousands"),
)
df_enhanced.head()
shape: (5, 7)
salary  gender  departm  years  age  publications  salary_thousands
i64     i64     str      i64    i64  i64           f64
86285   0       "bio"    26     64   72            86.3
77125   0       "bio"    28     58   43            77.1
71922   0       "bio"    10     38   23            71.9
70499   0       "bio"    16     46   64            70.5
66624   0       "bio"    11     41   23            66.6
# Create multiple new columns at once
df_multi = df.with_columns(
    (pl.col("salary") > 80000).alias("high_salary"),
    (pl.col("salary") / pl.col("publications")).round(0).alias("salary_per_pub"),
    (pl.col("age") - pl.col("years")).alias("age_at_start"),
)
df_multi.head()
shape: (5, 9)
salary  gender  departm  years  age  publications  high_salary  salary_per_pub  age_at_start
i64     i64     str      i64    i64  i64           bool         f64             i64
86285   0       "bio"    26     64   72            true         1198.0          38
77125   0       "bio"    28     58   43            false        1794.0          30
71922   0       "bio"    10     38   23            false        3127.0          28
70499   0       "bio"    16     46   64            false        1102.0          30
66624   0       "bio"    11     41   23            false        2897.0          30
# The original DataFrame is unchanged (immutability)
print("Original columns:", df.columns)
df.head(3)
Original columns: ['salary', 'gender', 'departm', 'years', 'age', 'publications']
shape: (3, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23

Selecting and Filtering

select() chooses which columns to keep, while filter() chooses which rows to keep. Both accept Polars expressions.

# Select specific columns by name
df.select("salary", "departm", "gender")
shape: (77, 3)
salary  departm  gender
i64     str      i64
86285   "bio"    0
77125   "bio"    0
71922   "bio"    0
70499   "bio"    0
66624   "bio"    0
…       …        …
53662   "neuro"  1
57185   "stat"   1
52254   "stat"   1
61885   "math"   1
49542   "math"   1
# Select using expressions (allows renaming/transforming inline)
df.select(
    pl.col("departm"),
    (pl.col("salary") / 1000).alias("salary_k"),
)
shape: (77, 2)
departm  salary_k
str      f64
"bio"    86.285
"bio"    77.125
"bio"    71.922
"bio"    70.499
"bio"    66.624
…        …
"neuro"  53.662
"stat"   57.185
"stat"   52.254
"math"   61.885
"math"   49.542
# Filter rows where salary exceeds 80000
df.filter(pl.col("salary") > 80000)
shape: (15, 6)
salary  gender  departm    years  age  publications
i64     i64     str        i64    i64  i64
86285   0       "bio"      26     64   72
97630   0       "chem"     34     64   43
82444   0       "chem"     31     61   42
104828  0       "geol"     null   50   44
112800  0       "neuro"    14     44   33
…       …       …          …      …    …
106412  0       "stat"     23     53   29
86980   0       "stat"     23     53   42
96936   0       "physics"  15     50   17
83216   0       "physics"  11     37   19
82142   0       "math"     9      39   9
# Combine multiple filter conditions with & (and) and | (or)
df.filter(
    (pl.col("salary") > 70000) & (pl.col("departm") == "neuro")
)
shape: (8, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
112800  0       "neuro"  14     44   33
105761  0       "neuro"  9      39   30
92951   0       "neuro"  11     41   20
86621   0       "neuro"  19     49   10
85569   0       "neuro"  20     46   35
83896   0       "neuro"  10     40   22
79735   0       "neuro"  11     41   32
71518   0       "neuro"  7      37   34
# Filter using is_in to match against a set of values
df.filter(pl.col("departm").is_in(["neuro", "stat", "bio"]))
shape: (46, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23
…       …       …        …      …    …
52968   1       "bio"    18     48   32
58893   1       "neuro"  10     35   4
53662   1       "neuro"  1      31   3
57185   1       "stat"   9      39   7
52254   1       "stat"   2      32   9
# Get unique departments
df.select("departm").unique()
shape: (7, 1)
departm
str
"math"
"geol"
"stat"
"chem"
"physics"
"bio"
"neuro"
# Sort by salary descending
df.sort("salary", descending=True).head(10)
shape: (10, 6)
salary  gender  departm    years  age  publications
i64     i64     str        i64    i64  i64
112800  0       "neuro"    14     44   33
106412  0       "stat"     23     53   29
105761  0       "neuro"    9      39   30
104828  0       "geol"     null   50   44
97630   0       "chem"     34     64   43
96936   0       "physics"  15     50   17
92951   0       "neuro"    11     41   20
86980   0       "stat"     23     53   42
86621   0       "neuro"    19     49   10
86285   0       "bio"      26     64   72

Renaming Columns

Use rename() to change column names. Pass a dictionary mapping old names to new names.

df_renamed = df.rename({
    "departm": "department",
    "years": "years_experience",
})
df_renamed.head(3)
shape: (3, 6)
salary  gender  department  years_experience  age  publications
i64     i64     str         i64               i64  i64
86285   0       "bio"       26                64   72
77125   0       "bio"       28                58   43
71922   0       "bio"       10                38   23

Operations

Polars supports a wide range of operations on columns, including arithmetic, string manipulations, and type casting. Whenever possible, use native Polars expressions rather than custom Python functions for best performance.

# Arithmetic: give everyone a 10% raise
df.select(
    pl.col("departm"),
    pl.col("salary"),
    (pl.col("salary") * 1.10).round(0).cast(pl.Int64).alias("salary_with_raise"),
)
shape: (77, 3)
departm  salary  salary_with_raise
str      i64     i64
"bio"    86285   94914
"bio"    77125   84838
"bio"    71922   79114
"bio"    70499   77549
"bio"    66624   73286
…        …       …
"neuro"  53662   59028
"stat"   57185   62904
"stat"   52254   57479
"math"   61885   68074
"math"   49542   54496
# String operations: convert department names to uppercase
df.select(
    pl.col("departm").str.to_uppercase().alias("department_upper"),
    pl.col("salary"),
).head()
shape: (5, 2)
department_upper  salary
str               i64
"BIO"             86285
"BIO"             77125
"BIO"             71922
"BIO"             70499
"BIO"             66624
# Cast a column to a different type
df.select(
    pl.col("salary").cast(pl.Float64).alias("salary_float"),
    pl.col("gender").cast(pl.Utf8).alias("gender_str"),
).head()
shape: (5, 2)
salary_float  gender_str
f64           str
86285.0       "0"
77125.0       "0"
71922.0       "0"
70499.0       "0"
66624.0       "0"

For truly custom logic that cannot be expressed with native Polars expressions, you can use map_elements() to apply a Python function element-wise. However, this is significantly slower than native expressions because it bypasses Polars' optimized execution engine.

# map_elements example (slow — prefer native expressions when possible)
df.select(
    pl.col("departm"),
    pl.col("salary").map_elements(
        lambda x: f"${x:,}", return_dtype=pl.Utf8
    ).alias("salary_formatted"),
).head()
shape: (5, 2)
departm  salary_formatted
str      str
"bio"    "$86,285"
"bio"    "$77,125"
"bio"    "$71,922"
"bio"    "$70,499"
"bio"    "$66,624"

Joining Data

Polars supports several types of joins for combining DataFrames. The syntax is straightforward: call .join() on the left DataFrame and pass the right DataFrame along with the join key and type.

# Create two example DataFrames to demonstrate joins
departments = pl.DataFrame({
    "dept_code": ["bio", "chem", "neuro", "stat", "physics"],
    "full_name": ["Biology", "Chemistry", "Neuroscience", "Statistics", "Physics"],
    "building": ["LSC", "Burke", "Moore", "Kemeny", "Wilder"],
})

budgets = pl.DataFrame({
    "dept_code": ["bio", "chem", "neuro", "math", "geol"],
    "annual_budget": [500000, 750000, 900000, 300000, 400000],
})
# Inner join: only keeps rows where the key exists in both DataFrames
departments.join(budgets, on="dept_code", how="inner")
shape: (3, 4)
dept_code  full_name       building  annual_budget
str        str             str       i64
"bio"      "Biology"       "LSC"     500000
"chem"     "Chemistry"     "Burke"   750000
"neuro"    "Neuroscience"  "Moore"   900000
# Left join: keeps all rows from the left DataFrame
departments.join(budgets, on="dept_code", how="left")
shape: (5, 4)
dept_code  full_name       building  annual_budget
str        str             str       i64
"bio"      "Biology"       "LSC"     500000
"chem"     "Chemistry"     "Burke"   750000
"neuro"    "Neuroscience"  "Moore"   900000
"stat"     "Statistics"    "Kemeny"  null
"physics"  "Physics"       "Wilder"  null
# Full outer join: keeps all rows from both DataFrames
departments.join(budgets, on="dept_code", how="full", coalesce=True)
shape: (7, 4)
dept_code  full_name       building  annual_budget
str        str             str       i64
"bio"      "Biology"       "LSC"     500000
"chem"     "Chemistry"     "Burke"   750000
"neuro"    "Neuroscience"  "Moore"   900000
"math"     null            null      300000
"geol"     null            null      400000
"physics"  "Physics"       "Wilder"  null
"stat"     "Statistics"    "Kemeny"  null
# Vertical stacking (concatenating rows)
df_a = pl.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})
df_b = pl.DataFrame({"name": ["Charlie", "Diana"], "score": [78, 92]})
pl.concat([df_a, df_b])
shape: (4, 2)
name       score
str        i64
"Alice"    90
"Bob"      85
"Charlie"  78
"Diana"    92
# Horizontal stacking (concatenating columns)
df_left = pl.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
df_right = pl.DataFrame({"salary": [50000, 60000], "dept": ["bio", "chem"]})
pl.concat([df_left, df_right], how="horizontal")
shape: (2, 4)
name     age  salary  dept
str      i64  i64     str
"Alice"  25   50000   "bio"
"Bob"    30   60000   "chem"

Grouping and Aggregation

Grouping is one of the most powerful operations in data analysis. Polars uses group_by() to split a DataFrame by one or more columns, then agg() to compute summary statistics within each group.

# Average salary by department
df.group_by("departm").agg(
    pl.col("salary").mean().alias("avg_salary"),
).sort("avg_salary", descending=True)
shape: (7, 2)
departm    avg_salary
str        f64
"neuro"    76465.6
"geol"     73548.5
"physics"  67987.0
"stat"     67242.8
"chem"     66003.454545
"bio"      63094.6875
"math"     60920.875
# Multiple aggregations in a single call
df.group_by("departm").agg(
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("salary").min().alias("min_salary"),
    pl.col("publications").mean().alias("avg_publications"),
    pl.len().alias("count"),
).sort("avg_salary", descending=True)
shape: (7, 6)
departm    avg_salary    max_salary  min_salary  avg_publications  count
str        f64           i64         i64         f64               u32
"neuro"    76465.6       112800      53662       27.733333         15
"geol"     73548.5       104828      52766       30.0              4
"physics"  67987.0       96936       54076       11.5              8
"stat"     67242.8       106412      51391       17.4              15
"chem"     66003.454545  97630       44687       29.727273         11
"bio"      63094.6875    86285       52968       25.5625           16
"math"     60920.875     82142       49542       27.0              8
# Group by multiple columns
df.group_by("departm", "gender").agg(
    pl.col("salary").mean().alias("avg_salary"),
    pl.len().alias("count"),
).sort("departm", "gender")
shape: (13, 4)
departm    gender  avg_salary    count
str        i64     f64           u32
"bio"      0       64100.571429  14
"bio"      1       56053.5       2
"chem"     0       67008.9       10
"chem"     1       55949.0       1
"geol"     0       73548.5       4
…          …       …             …
"neuro"    0       79571.461538  13
"neuro"    1       56277.5       2
"physics"  0       67987.0       8
"stat"     0       69169.461538  13
"stat"     1       54719.5       2

Window Functions

Window functions compute values across groups without collapsing rows. In Polars, you use the over() expression to define the grouping. This is extremely powerful because it lets you add group-level statistics as new columns while keeping every individual row intact.

This replaces the common pattern of grouping, computing a statistic, and then joining the result back onto the original DataFrame; the sketch after the first example below shows the equivalence.

# Add each department's average salary as a column
df.with_columns(
    pl.col("salary").mean().over("departm").alias("dept_avg_salary"),
).head(10)
shape: (10, 7)
salary  gender  departm  years  age  publications  dept_avg_salary
i64     i64     str      i64    i64  i64           f64
86285   0       "bio"    26     64   72            63094.6875
77125   0       "bio"    28     58   43            63094.6875
71922   0       "bio"    10     38   23            63094.6875
70499   0       "bio"    16     46   64            63094.6875
66624   0       "bio"    11     41   23            63094.6875
64451   0       "bio"    23     60   44            63094.6875
64366   0       "bio"    23     53   22            63094.6875
59344   0       "bio"    5      40   11            63094.6875
58560   0       "bio"    8      38   8             63094.6875
58294   0       "bio"    20     50   12            63094.6875
# Each person's salary as a percentage of their department's mean
df_pct = df.with_columns(
    (pl.col("salary") / pl.col("salary").mean().over("departm") * 100)
    .round(1)
    .alias("pct_of_dept_mean"),
)
df_pct.sort("pct_of_dept_mean", descending=True).head(10)
shape: (10, 7)
salary  gender  departm    years  age  publications  pct_of_dept_mean
i64     i64     str        i64    i64  i64           f64
106412  0       "stat"     23     53   29            158.3
97630   0       "chem"     34     64   43            147.9
112800  0       "neuro"    14     44   33            147.5
96936   0       "physics"  15     50   17            142.6
104828  0       "geol"     null   50   44            142.5
105761  0       "neuro"    9      39   30            138.3
86285   0       "bio"      26     64   72            136.8
82142   0       "math"     9      39   9             134.8
86980   0       "stat"     23     53   42            129.4
82444   0       "chem"     31     61   42            124.9
# Rank salary within each department
df.with_columns(
    pl.col("salary").rank(descending=True).over("departm").alias("dept_salary_rank"),
).sort("departm", "dept_salary_rank").head(15)
shape: (15, 7)
salary  gender  departm  years  age  publications  dept_salary_rank
i64     i64     str      i64    i64  i64           f64
86285   0       "bio"    26     64   72            1.0
77125   0       "bio"    28     58   43            2.0
71922   0       "bio"    10     38   23            3.0
70499   0       "bio"    16     46   64            4.0
66624   0       "bio"    11     41   23            5.0
…       …       …        …      …    …             …
58294   0       "bio"    20     50   12            11.0
56092   0       "bio"    2      40   4             12.0
55125   0       "bio"    8      38   9             13.0
54452   0       "bio"    13     43   7             14.0
54269   0       "bio"    26     56   12            15.0

Reshaping Data

Polars provides unpivot() to go from wide to long format and pivot() to go from long to wide format. These are essential when preparing data for visualization or statistical modeling.

# Create a wide-format example
df_wide = pl.DataFrame({
    "department": ["bio", "chem", "neuro"],
    "q1_budget": [100, 200, 150],
    "q2_budget": [110, 190, 160],
    "q3_budget": [105, 210, 155],
})
df_wide
shape: (3, 4)
department  q1_budget  q2_budget  q3_budget
str         i64        i64        i64
"bio"       100        110        105
"chem"      200        190        210
"neuro"     150        160        155
# Unpivot (wide to long): melt the quarterly columns into rows
df_long = df_wide.unpivot(
    index="department",
    on=["q1_budget", "q2_budget", "q3_budget"],
    variable_name="quarter",
    value_name="budget",
)
df_long
shape: (9, 3)
department  quarter      budget
str         str          i64
"bio"       "q1_budget"  100
"chem"      "q1_budget"  200
"neuro"     "q1_budget"  150
"bio"       "q2_budget"  110
"chem"      "q2_budget"  190
"neuro"     "q2_budget"  160
"bio"       "q3_budget"  105
"chem"      "q3_budget"  210
"neuro"     "q3_budget"  155
# Pivot (long to wide): spread the quarter values back into columns
df_long.pivot(
    on="quarter",
    index="department",
    values="budget",
)
shape: (3, 4)
department  q1_budget  q2_budget  q3_budget
str         i64        i64        i64
"bio"       100        110        105
"chem"      200        190        210
"neuro"     150        160        155

Lazy Evaluation

One of Polars' most powerful features is lazy evaluation. Instead of executing each operation immediately, a LazyFrame records the operations as a query plan. Polars then optimizes this plan before execution, which can dramatically improve performance.

Key optimizations that Polars applies automatically:

  • Predicate pushdown: Filters are pushed as early as possible, reducing the amount of data processed.
  • Projection pushdown: Only the columns you actually need are loaded from disk.
  • Common subexpression elimination: Repeated computations are calculated once.
  • Parallel execution: Independent operations run on multiple CPU cores.

To use lazy mode, start with pl.scan_csv() instead of pl.read_csv(), or convert an existing DataFrame with .lazy(). When you are ready to execute, call .collect().

# Create a LazyFrame by scanning the CSV
lf = pl.scan_csv(url)
print(type(lf))
lf
<class 'polars.lazyframe.frame.LazyFrame'>
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

Csv SCAN [https://raw.githubusercontent.com/ljchang/dartbrains/master/data/salary/salary.csv]
PROJECT */6 COLUMNS
ESTIMATED ROWS: 82
# Build a query plan without executing it
query = (
    lf
    .filter(pl.col("salary") > 60000)
    .group_by("departm")
    .agg(
        pl.col("salary").mean().alias("avg_salary"),
        pl.len().alias("count"),
    )
    .sort("avg_salary", descending=True)
)

# View the optimized query plan
print(query.explain())
SORT BY [descending: [true]] [col("avg_salary")]
  AGGREGATE[maintain_order: false]
    [col("salary").mean().alias("avg_salary"), len().alias("count")] BY [col("departm")]
    FROM
    Csv SCAN [https://raw.githubusercontent.com/ljchang/dartbrains/master/data/salary/salary.csv]
    PROJECT 2/6 COLUMNS
    SELECTION: [(col("salary")) > (60000)]
    ESTIMATED ROWS: 82
# Execute the query plan and get a DataFrame
query.collect()
shape: (7, 3)
departm    avg_salary    count
str        f64           u32
"neuro"    81291.416667  12
"geol"     80476.0       3
"physics"  79061.0       4
"stat"     75247.333333  9
"chem"     74212.714286  7
"bio"      71610.285714  7
"math"     68714.0       4

You can also convert an existing DataFrame to a LazyFrame with .lazy() and back with .collect(). This is useful when you want to chain many operations and let Polars optimize them as a batch.

# Convert eager DataFrame to lazy, apply operations, then collect
result = (
    df.lazy()
    .with_columns(
        (pl.col("salary") / 1000).alias("salary_k"),
    )
    .filter(pl.col("salary_k") > 70)
    .select("departm", "salary_k", "publications")
    .collect()
)
result
shape: (28, 3)
departm    salary_k  publications
str        f64       i64
"bio"      86.285    72
"bio"      77.125    43
"bio"      71.922    23
"bio"      70.499    64
"chem"     97.63     43
…          …         …
"physics"  96.936    17
"physics"  83.216    19
"physics"  72.044    16
"math"     82.142    9
"math"     70.509    7

Exercises

Try these exercises to practice what you have learned. Each one uses the salary dataset.

Exercise 1: Filter the salary data to only include rows where the departm is "neuro" and salary is above 80,000. How many rows match? What is the average salary of this subset?

# Your code here

Exercise 2: Group the data by departm and gender. For each group, calculate the mean salary and the number of people. Sort the result by mean salary in descending order.

# Your code here

Exercise 3: Using over(), create a new column called pct_of_dept_mean that shows each person's salary as a percentage of their department's mean salary. Sort by this column in descending order. Who has the highest relative salary compared to their department?

# Your code here

Summary

In this tutorial, we covered the core concepts of Polars:

  • Series and DataFrames: Polars' fundamental data structures with strict typing and no row index.
  • Loading and inspecting data: Reading CSVs, checking shapes, schemas, and null counts.
  • Missing values: Using null_count(), drop_nulls(), and fill_null().
  • The expression API: Building computations with pl.col(), chaining operations, and using when/then/otherwise.
  • Creating columns: Adding new columns immutably with with_columns() and .alias().
  • Selecting and filtering: Choosing columns with select() and rows with filter().
  • Operations: Arithmetic, string methods, type casting, and map_elements().
  • Joins: Combining DataFrames with join() and concat().
  • Grouping and aggregation: Using group_by().agg() for summary statistics.
  • Window functions: Computing group-level values without collapsing rows using over().
  • Reshaping: Converting between wide and long formats with unpivot() and pivot().
  • Lazy evaluation: Building optimized query plans with scan_csv(), .lazy(), and .collect().

Polars' combination of speed, expressiveness, and lazy evaluation makes it an excellent tool for data analysis in Python. As your datasets grow larger and your queries more complex, these features will become increasingly valuable.