Introduction to Polars

Written by Luke Chang

Polars is a blazing-fast DataFrame library for Python, written in Rust. It is designed for high-performance data manipulation and offers an expressive, consistent API that makes data wrangling both efficient and enjoyable. Unlike older tools, Polars was built from the ground up to take advantage of modern hardware through multi-threaded execution, lazy evaluation, and memory-efficient columnar storage.

Why Polars?

  • Speed: Polars is one of the fastest DataFrame libraries available, routinely outperforming alternatives by 10-100x on large datasets.
  • Memory efficiency: Its Apache Arrow-based columnar format minimizes memory usage and avoids unnecessary copies.
  • Expressive API: The expression system lets you write concise, readable queries that are easy to compose and optimize.
  • Lazy evaluation: Polars can build an optimized query plan before executing, enabling automatic optimizations like predicate pushdown and projection pruning.

In this tutorial, we will learn the fundamentals of Polars using a faculty salary dataset. By the end, you will be comfortable loading data, transforming columns, filtering rows, grouping and aggregating, and using advanced features like window functions and lazy evaluation.

For more details, check out the official Polars documentation.

import polars as pl
import numpy as np

Polars Objects

Polars provides two core data structures: Series and DataFrame. Understanding these is the first step to working with Polars effectively.

Series

A Series is a typed, one-dimensional array. Every element in a Series has the same data type, which is determined at creation time. You can think of it as a single column of data.

# Create a Series from a list of integers
ages = pl.Series("age", [25, 30, 35, 40, 45])
ages
shape: (5,)
age
i64
25
30
35
40
45
# Create a Series with an explicit type
scores = pl.Series("score", [88.5, 92.0, 76.3, 95.1], dtype=pl.Float32)
scores
shape: (4,)
score
f32
88.5
92.0
76.300003
95.099998

Notice the values above: 76.3 and 95.1 have no exact Float32 representation, so they display as 76.300003 and 95.099998.

# String Series
names = pl.Series("name", ["Alice", "Bob", "Charlie", "Diana"])
names
shape: (4,)
name
str
"Alice"
"Bob"
"Charlie"
"Diana"

DataFrame

A DataFrame is a two-dimensional table composed of multiple named Series (columns). Each column has its own data type, and all columns share the same length.

A key difference from some other tools: Polars DataFrames have no row index. Rows are identified by their position, and any row labeling must be done explicitly through a column. This keeps the API simple and avoids ambiguity.

# Create a DataFrame from a dictionary
df_example = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
})
df_example
shape: (4, 3)
name       age  salary
str        i64  i64
"Alice"    25   50000
"Bob"      30   60000
"Charlie"  35   70000
"Diana"    40   80000
# Create a DataFrame from numpy arrays
rng = np.random.default_rng(42)
df_from_numpy = pl.DataFrame({
    "x": rng.normal(0, 1, 5),
    "y": rng.normal(0, 1, 5),
})
df_from_numpy
shape: (5, 2)
x          y
f64        f64
0.304717   -1.30218
-1.039984  0.12784
0.750451   -0.316243
0.940565   -0.016801
-1.951035  -0.853044
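
As noted above, rows carry no labels of their own. If you need an explicit row identifier, a minimal sketch is to add one as a regular column; recent Polars versions expose with_row_index for this (older releases call it with_row_count), and the column name "row_id" below is our choice:

# Add an explicit row identifier as an ordinary column
df_example.with_row_index(name="row_id")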

Loading Data

Polars can read data from many formats including CSV, Parquet, JSON, and more. We will load a faculty salary dataset that contains information about salaries, departments, years of experience, and other attributes.

The pl.read_csv() function reads a CSV file eagerly (loading everything into memory immediately). Polars can also read directly from URLs.

url = "https://raw.githubusercontent.com/ljchang/dartbrains/master/data/salary/salary.csv"
df = pl.read_csv(url)
df
shape: (77, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23
…       …       …        …      …    …
53662   1       "neuro"  1      31   3
57185   1       "stat"   9      39   7
52254   1       "stat"   2      32   9
61885   1       "math"   23     60   9
49542   1       "math"   3      33   5

For very large files, you can use pl.scan_csv() instead to create a LazyFrame that defers execution until you call .collect(). We will explore lazy evaluation in detail later in this tutorial.

Inspecting Data

Before diving into analysis, it is important to understand the structure and contents of your data. Polars provides several handy methods for this.

# View the first few rows
df.head()
shape: (5, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23
# View the last few rows
df.tail()
shape: (5, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
53662   1       "neuro"  1      31   3
57185   1       "stat"   9      39   7
52254   1       "stat"   2      32   9
61885   1       "math"   23     60   9
49542   1       "math"   3      33   5
# Random sample of rows
df.sample(5, seed=42)
shape: (5, 6)
salary  gender  departm    years  age  publications
i64     i64     str        i64    i64  i64
58744   0       "physics"  20     50   9
44687   0       "chem"     4      34   19
57185   1       "stat"     9      39   7
51391   0       "stat"     5      35   8
54076   0       "physics"  19     49   12
# Summary statistics for all columns
df.describe()
shape: (9, 7)
statistic     salary        gender    departm  years      age        publications
str           f64           f64       str      f64        f64        f64
"count"       77.0          77.0      "77"     76.0       76.0       77.0
"null_count"  0.0           0.0       "0"      1.0        1.0        0.0
"mean"        67748.519481  0.142857  null     14.973684  45.486842  21.831169
"std"         15100.581435  0.387783  null     8.61777    9.005914   15.24053
"min"         44687.0       0.0       "bio"    1.0        31.0       3.0
"25%"         57185.0       0.0       null     8.0        38.0       9.0
"50%"         62607.0       0.0       null     14.0       44.0       19.0
"75%"         75382.0       0.0       null     23.0       53.0       33.0
"max"         112800.0      2.0       "stat"   34.0       65.0       72.0
# Column names and their data types
df.schema
Schema({'salary': Int64, 'gender': Int64, 'departm': String, 'years': Int64, 'age': Int64, 'publications': Int64})
# Shape of the DataFrame (rows, columns)
df.shape
(77, 6)
# List of column names
df.columns
['salary', 'gender', 'departm', 'years', 'age', 'publications']
# Check for missing values in each column
df.null_count()
shape: (1, 6)
salary  gender  departm  years  age  publications
u32     u32     u32      u32    u32  u32
0       0       0        1      1    0

Dealing with Missing Values

Polars uses null to represent missing data. This is distinct from NaN (Not a Number), which is a valid floating-point value. Unlike NaN, which exists only for float columns, null is handled uniformly across all data types, whether numeric, string, or boolean.
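
A minimal sketch of the distinction (the printed values assume recent Polars semantics, where is_nan propagates null rather than answering False for a missing value):

# null marks missing data; NaN is an ordinary float value
s = pl.Series("x", [1.0, None, float("nan")])
print(s.is_null().to_list())  # [False, True, False]
print(s.is_nan().to_list())   # [False, None, True]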

# See how many nulls each column has
df.null_count()
shape: (1, 6)
salary  gender  departm  years  age  publications
u32     u32     u32      u32    u32  u32
0       0       0        1      1    0
# Drop all rows that contain any null value
df_no_nulls = df.drop_nulls()
print(f"Original rows: {df.shape[0]}, After dropping nulls: {df_no_nulls.shape[0]}")
df_no_nulls
Original rows: 77, After dropping nulls: 75
shape: (75, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23
…       …       …        …      …    …
53662   1       "neuro"  1      31   3
57185   1       "stat"   9      39   7
52254   1       "stat"   2      32   9
61885   1       "math"   23     60   9
49542   1       "math"   3      33   5
# Fill nulls with a specific strategy (e.g., fill numeric nulls with the column mean)
df_filled = df.with_columns(
    pl.col("years").fill_null(strategy="mean"),
    pl.col("age").fill_null(strategy="mean"),
)
df_filled.null_count()
shape: (1, 6)
salary  gender  departm  years  age  publications
u32     u32     u32      u32    u32  u32
0       0       0        0      0    0
# Fill nulls with a constant value
df.with_columns(
    pl.col("years").fill_null(0),
).head()
shape: (5, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23

The Expression API

The expression API is the heart of Polars. Expressions describe computations on columns without immediately executing them. They are the building blocks you pass to methods like select(), filter(), with_columns(), and group_by().agg().

The most common starting point is pl.col("column_name"), which references a column. From there, you can chain operations to build up complex transformations.

# Reference a single column and compute its mean
df.select(pl.col("salary").mean())
shape: (1, 1)
salary
f64
67748.519481
# Chain multiple operations: compute several summary statistics at once
df.select(
    pl.col("salary").mean().alias("mean_salary"),
    pl.col("salary").median().alias("median_salary"),
    pl.col("salary").min().alias("min_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("salary").std().alias("std_salary"),
)
shape: (1, 5)
mean_salary   median_salary  min_salary  max_salary  std_salary
f64           f64            i64         i64         f64
67748.519481  62607.0        44687       112800      15100.581435
# Compute the mean of multiple columns at once
df.select(pl.col("salary", "years", "age", "publications").mean())
shape: (1, 4)
salary        years      age        publications
f64           f64        f64        f64
67748.519481  14.973684  45.486842  21.831169
# pl.all() references every column, including the string column
# "departm", so summing every column raises an error
df.select(pl.all().sum())
InvalidOperationError: `sum` operation not supported for dtype `str`
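
One way to restrict the aggregation to numeric columns is the selectors module (a sketch, assuming a Polars version that ships polars.selectors):

import polars.selectors as cs

# Sum only the numeric columns, skipping the string column "departm"
df.select(cs.numeric().sum())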
# Conditional expressions with when/then/otherwise
df.select(
    pl.col("salary"),
    pl.col("departm"),
    pl.when(pl.col("salary") > 80000)
      .then(pl.lit("high"))
      .otherwise(pl.lit("standard"))
      .alias("salary_tier"),
)
shape: (77, 3)
salary  departm  salary_tier
i64     str      str
86285   "bio"    "high"
77125   "bio"    "standard"
71922   "bio"    "standard"
70499   "bio"    "standard"
66624   "bio"    "standard"
…       …        …
53662   "neuro"  "standard"
57185   "stat"   "standard"
52254   "stat"   "standard"
61885   "math"   "standard"
49542   "math"   "standard"

Creating New Columns

The with_columns() method adds new columns or transforms existing ones. It always returns a new DataFrame, leaving the original unchanged. This immutability makes your code easier to reason about and debug.

Use .alias("name") to give your computed column a name.

# Create a column showing salary in thousands
df_enhanced = df.with_columns(
    (pl.col("salary") / 1000).round(1).alias("salary_thousands"),
)
df_enhanced.head()
shape: (5, 7)
salary  gender  departm  years  age  publications  salary_thousands
i64     i64     str      i64    i64  i64           f64
86285   0       "bio"    26     64   72            86.3
77125   0       "bio"    28     58   43            77.1
71922   0       "bio"    10     38   23            71.9
70499   0       "bio"    16     46   64            70.5
66624   0       "bio"    11     41   23            66.6
# Create multiple new columns at once
df_multi = df.with_columns(
    (pl.col("salary") > 80000).alias("high_salary"),
    (pl.col("salary") / pl.col("publications")).round(0).alias("salary_per_pub"),
    (pl.col("age") - pl.col("years")).alias("age_at_start"),
)
df_multi.head()
shape: (5, 9)
salary  gender  departm  years  age  publications  high_salary  salary_per_pub  age_at_start
i64     i64     str      i64    i64  i64           bool         f64             i64
86285   0       "bio"    26     64   72            true         1198.0          38
77125   0       "bio"    28     58   43            false        1794.0          30
71922   0       "bio"    10     38   23            false        3127.0          28
70499   0       "bio"    16     46   64            false        1102.0          30
66624   0       "bio"    11     41   23            false        2897.0          30
# The original DataFrame is unchanged (immutability)
print("Original columns:", df.columns)
df.head(3)
Original columns: ['salary', 'gender', 'departm', 'years', 'age', 'publications']
shape: (3, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23

Selecting and Filtering

select() chooses which columns to keep, while filter() chooses which rows to keep. Both accept Polars expressions.

# Select specific columns by name
df.select("salary", "departm", "gender")
shape: (77, 3)
salary  departm  gender
i64     str      i64
86285   "bio"    0
77125   "bio"    0
71922   "bio"    0
70499   "bio"    0
66624   "bio"    0
…       …        …
53662   "neuro"  1
57185   "stat"   1
52254   "stat"   1
61885   "math"   1
49542   "math"   1
# Select using expressions (allows renaming/transforming inline)
df.select(
    pl.col("departm"),
    (pl.col("salary") / 1000).alias("salary_k"),
)
shape: (77, 2)
departm  salary_k
str      f64
"bio"    86.285
"bio"    77.125
"bio"    71.922
"bio"    70.499
"bio"    66.624
…        …
"neuro"  53.662
"stat"   57.185
"stat"   52.254
"math"   61.885
"math"   49.542
# Filter rows where salary exceeds 80000
df.filter(pl.col("salary") > 80000)
shape: (15, 6)
salary  gender  departm    years  age  publications
i64     i64     str        i64    i64  i64
86285   0       "bio"      26     64   72
97630   0       "chem"     34     64   43
82444   0       "chem"     31     61   42
104828  0       "geol"     null   50   44
112800  0       "neuro"    14     44   33
…       …       …          …      …    …
106412  0       "stat"     23     53   29
86980   0       "stat"     23     53   42
96936   0       "physics"  15     50   17
83216   0       "physics"  11     37   19
82142   0       "math"     9      39   9
# Combine multiple filter conditions with & (and) and | (or)
df.filter(
    (pl.col("salary") > 70000) & (pl.col("departm") == "neuro")
)
shape: (8, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
112800  0       "neuro"  14     44   33
105761  0       "neuro"  9      39   30
92951   0       "neuro"  11     41   20
86621   0       "neuro"  19     49   10
85569   0       "neuro"  20     46   35
83896   0       "neuro"  10     40   22
79735   0       "neuro"  11     41   32
71518   0       "neuro"  7      37   34
# Filter using is_in to match against a set of values
df.filter(pl.col("departm").is_in(["neuro", "stat", "bio"]))
shape: (46, 6)
salary  gender  departm  years  age  publications
i64     i64     str      i64    i64  i64
86285   0       "bio"    26     64   72
77125   0       "bio"    28     58   43
71922   0       "bio"    10     38   23
70499   0       "bio"    16     46   64
66624   0       "bio"    11     41   23
…       …       …        …      …    …
52968   1       "bio"    18     48   32
58893   1       "neuro"  10     35   4
53662   1       "neuro"  1      31   3
57185   1       "stat"   9      39   7
52254   1       "stat"   2      32   9
# Get unique departments
df.select("departm").unique()
shape: (7, 1)
departm
str
"math"
"geol"
"stat"
"chem"
"physics"
"bio"
"neuro"
# Sort by salary descending
df.sort("salary", descending=True).head(10)
shape: (10, 6)
salary  gender  departm    years  age  publications
i64     i64     str        i64    i64  i64
112800  0       "neuro"    14     44   33
106412  0       "stat"     23     53   29
105761  0       "neuro"    9      39   30
104828  0       "geol"     null   50   44
97630   0       "chem"     34     64   43
96936   0       "physics"  15     50   17
92951   0       "neuro"    11     41   20
86980   0       "stat"     23     53   42
86621   0       "neuro"    19     49   10
86285   0       "bio"      26     64   72

Renaming Columns

Use rename() to change column names. Pass a dictionary mapping old names to new names.

df_renamed = df.rename({
    "departm": "department",
    "years": "years_experience",
})
df_renamed.head(3)
shape: (3, 6)
salary  gender  department  years_experience  age  publications
i64     i64     str         i64               i64  i64
86285   0       "bio"       26                64   72
77125   0       "bio"       28                58   43
71922   0       "bio"       10                38   23

Operations

Polars supports a wide range of operations on columns, including arithmetic, string manipulations, and type casting. Whenever possible, use native Polars expressions rather than custom Python functions for best performance.

# Arithmetic: give everyone a 10% raise
df.select(
    pl.col("departm"),
    pl.col("salary"),
    (pl.col("salary") * 1.10).round(0).cast(pl.Int64).alias("salary_with_raise"),
)
shape: (77, 3)
departm  salary  salary_with_raise
str      i64     i64
"bio"    86285   94914
"bio"    77125   84838
"bio"    71922   79114
"bio"    70499   77549
"bio"    66624   73286
…        …       …
"neuro"  53662   59028
"stat"   57185   62904
"stat"   52254   57479
"math"   61885   68074
"math"   49542   54496
# String operations: convert department names to uppercase
df.select(
    pl.col("departm").str.to_uppercase().alias("department_upper"),
    pl.col("salary"),
).head()
shape: (5, 2)
department_upper  salary
str               i64
"BIO"             86285
"BIO"             77125
"BIO"             71922
"BIO"             70499
"BIO"             66624
# Cast a column to a different type
df.select(
    pl.col("salary").cast(pl.Float64).alias("salary_float"),
    pl.col("gender").cast(pl.Utf8).alias("gender_str"),
).head()
shape: (5, 2)
salary_float  gender_str
f64           str
86285.0       "0"
77125.0       "0"
71922.0       "0"
70499.0       "0"
66624.0       "0"

For truly custom logic that cannot be expressed with native Polars expressions, you can use map_elements() to apply a Python function element-wise. However, this is significantly slower than native expressions because it bypasses Polars' optimized execution engine.

# map_elements example (slow — prefer native expressions when possible)
df.select(
    pl.col("departm"),
    pl.col("salary").map_elements(
        lambda x: f"${x:,}", return_dtype=pl.Utf8
    ).alias("salary_formatted"),
).head()
shape: (5, 2)
departm  salary_formatted
str      str
"bio"    "$86,285"
"bio"    "$77,125"
"bio"    "$71,922"
"bio"    "$70,499"
"bio"    "$66,624"

Joining Data

Polars supports several types of joins for combining DataFrames. The syntax is straightforward: call .join() on the left DataFrame and pass the right DataFrame along with the join key and type.

# Create two example DataFrames to demonstrate joins
departments = pl.DataFrame({
    "dept_code": ["bio", "chem", "neuro", "stat", "physics"],
    "full_name": ["Biology", "Chemistry", "Neuroscience", "Statistics", "Physics"],
    "building": ["LSC", "Burke", "Moore", "Kemeny", "Wilder"],
})

budgets = pl.DataFrame({
    "dept_code": ["bio", "chem", "neuro", "math", "geol"],
    "annual_budget": [500000, 750000, 900000, 300000, 400000],
})
# Inner join: only keeps rows where the key exists in both DataFrames
departments.join(budgets, on="dept_code", how="inner")
shape: (3, 4)
dept_code  full_name       building  annual_budget
str        str             str       i64
"bio"      "Biology"       "LSC"     500000
"chem"     "Chemistry"     "Burke"   750000
"neuro"    "Neuroscience"  "Moore"   900000
# Left join: keeps all rows from the left DataFrame
departments.join(budgets, on="dept_code", how="left")
shape: (5, 4)
dept_code  full_name       building  annual_budget
str        str             str       i64
"bio"      "Biology"       "LSC"     500000
"chem"     "Chemistry"     "Burke"   750000
"neuro"    "Neuroscience"  "Moore"   900000
"stat"     "Statistics"    "Kemeny"  null
"physics"  "Physics"       "Wilder"  null
# Full outer join: keeps all rows from both DataFrames
departments.join(budgets, on="dept_code", how="full", coalesce=True)
shape: (7, 4)
dept_code  full_name       building  annual_budget
str        str             str       i64
"bio"      "Biology"       "LSC"     500000
"chem"     "Chemistry"     "Burke"   750000
"neuro"    "Neuroscience"  "Moore"   900000
"math"     null            null      300000
"geol"     null            null      400000
"physics"  "Physics"       "Wilder"  null
"stat"     "Statistics"    "Kemeny"  null
# Vertical stacking (concatenating rows)
df_a = pl.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})
df_b = pl.DataFrame({"name": ["Charlie", "Diana"], "score": [78, 92]})
pl.concat([df_a, df_b])
shape: (4, 2)
name       score
str        i64
"Alice"    90
"Bob"      85
"Charlie"  78
"Diana"    92
# Horizontal stacking (concatenating columns)
df_left = pl.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
df_right = pl.DataFrame({"salary": [50000, 60000], "dept": ["bio", "chem"]})
pl.concat([df_left, df_right], how="horizontal")
shape: (2, 4)
name     age  salary  dept
str      i64  i64     str
"Alice"  25   50000   "bio"
"Bob"    30   60000   "chem"

Grouping and Aggregation

Grouping is one of the most powerful operations in data analysis. Polars uses group_by() to split a DataFrame by one or more columns, then agg() to compute summary statistics within each group.

# Average salary by department
df.group_by("departm").agg(
    pl.col("salary").mean().alias("avg_salary"),
).sort("avg_salary", descending=True)
shape: (7, 2)
departm    avg_salary
str        f64
"neuro"    76465.6
"geol"     73548.5
"physics"  67987.0
"stat"     67242.8
"chem"     66003.454545
"bio"      63094.6875
"math"     60920.875
# Multiple aggregations in a single call
df.group_by("departm").agg(
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("salary").min().alias("min_salary"),
    pl.col("publications").mean().alias("avg_publications"),
    pl.len().alias("count"),
).sort("avg_salary", descending=True)
shape: (7, 6)
departm    avg_salary    max_salary  min_salary  avg_publications  count
str        f64           i64         i64         f64               u32
"neuro"    76465.6       112800      53662       27.733333         15
"geol"     73548.5       104828      52766       30.0              4
"physics"  67987.0       96936       54076       11.5              8
"stat"     67242.8       106412      51391       17.4              15
"chem"     66003.454545  97630       44687       29.727273         11
"bio"      63094.6875    86285       52968       25.5625           16
"math"     60920.875     82142       49542       27.0              8
# Group by multiple columns
df.group_by("departm", "gender").agg(
    pl.col("salary").mean().alias("avg_salary"),
    pl.len().alias("count"),
).sort("departm", "gender")
shape: (13, 4)
departm    gender  avg_salary    count
str        i64     f64           u32
"bio"      0       64100.571429  14
"bio"      1       56053.5       2
"chem"     0       67008.9       10
"chem"     1       55949.0       1
"geol"     0       73548.5       4
…          …       …             …
"neuro"    0       79571.461538  13
"neuro"    1       56277.5       2
"physics"  0       67987.0       8
"stat"     0       69169.461538  13
"stat"     1       54719.5       2

Window Functions

Window functions compute values across groups without collapsing rows. In Polars, you use the over() expression to define the grouping. This is extremely powerful because it lets you add group-level statistics as new columns while keeping every individual row intact.

This replaces the common pattern of grouping, computing a statistic, and then joining the result back onto the original DataFrame; the sketch after the first example below shows the equivalence.

# Add each department's average salary as a column
df.with_columns(
    pl.col("salary").mean().over("departm").alias("dept_avg_salary"),
).head(10)
shape: (10, 7)
salary  gender  departm  years  age  publications  dept_avg_salary
i64     i64     str      i64    i64  i64           f64
86285   0       "bio"    26     64   72            63094.6875
77125   0       "bio"    28     58   43            63094.6875
71922   0       "bio"    10     38   23            63094.6875
70499   0       "bio"    16     46   64            63094.6875
66624   0       "bio"    11     41   23            63094.6875
64451   0       "bio"    23     60   44            63094.6875
64366   0       "bio"    23     53   22            63094.6875
59344   0       "bio"    5      40   11            63094.6875
58560   0       "bio"    8      38   8             63094.6875
58294   0       "bio"    20     50   12            63094.6875
# Each person's salary as a percentage of their department's mean
df_pct = df.with_columns(
    (pl.col("salary") / pl.col("salary").mean().over("departm") * 100)
    .round(1)
    .alias("pct_of_dept_mean"),
)
df_pct.sort("pct_of_dept_mean", descending=True).head(10)
shape: (10, 7)
salary  gender  departm    years  age  publications  pct_of_dept_mean
i64     i64     str        i64    i64  i64           f64
106412  0       "stat"     23     53   29            158.3
97630   0       "chem"     34     64   43            147.9
112800  0       "neuro"    14     44   33            147.5
96936   0       "physics"  15     50   17            142.6
104828  0       "geol"     null   50   44            142.5
105761  0       "neuro"    9      39   30            138.3
86285   0       "bio"      26     64   72            136.8
82142   0       "math"     9      39   9             134.8
86980   0       "stat"     23     53   42            129.4
82444   0       "chem"     31     61   42            124.9
# Rank salary within each department
df.with_columns(
    pl.col("salary").rank(descending=True).over("departm").alias("dept_salary_rank"),
).sort("departm", "dept_salary_rank").head(15)
shape: (15, 7)
salary  gender  departm  years  age  publications  dept_salary_rank
i64     i64     str      i64    i64  i64           f64
86285   0       "bio"    26     64   72            1.0
77125   0       "bio"    28     58   43            2.0
71922   0       "bio"    10     38   23            3.0
70499   0       "bio"    16     46   64            4.0
66624   0       "bio"    11     41   23            5.0
…       …       …        …      …    …             …
58294   0       "bio"    20     50   12            11.0
56092   0       "bio"    2      40   4             12.0
55125   0       "bio"    8      38   9             13.0
54452   0       "bio"    13     43   7             14.0
54269   0       "bio"    26     56   12            15.0

Reshaping Data

Polars provides unpivot() to go from wide to long format and pivot() to go from long to wide format. These are essential when preparing data for visualization or statistical modeling.

# Create a wide-format example
df_wide = pl.DataFrame({
    "department": ["bio", "chem", "neuro"],
    "q1_budget": [100, 200, 150],
    "q2_budget": [110, 190, 160],
    "q3_budget": [105, 210, 155],
})
df_wide
shape: (3, 4)
department  q1_budget  q2_budget  q3_budget
str         i64        i64        i64
"bio"       100        110        105
"chem"      200        190        210
"neuro"     150        160        155
# Unpivot (wide to long): melt the quarterly columns into rows
df_long = df_wide.unpivot(
    index="department",
    on=["q1_budget", "q2_budget", "q3_budget"],
    variable_name="quarter",
    value_name="budget",
)
df_long
shape: (9, 3)
department  quarter      budget
str         str          i64
"bio"       "q1_budget"  100
"chem"      "q1_budget"  200
"neuro"     "q1_budget"  150
"bio"       "q2_budget"  110
"chem"      "q2_budget"  190
"neuro"     "q2_budget"  160
"bio"       "q3_budget"  105
"chem"      "q3_budget"  210
"neuro"     "q3_budget"  155
# Pivot (long to wide): spread the quarter values back into columns
df_long.pivot(
    on="quarter",
    index="department",
    values="budget",
)
shape: (3, 4)
department  q1_budget  q2_budget  q3_budget
str         i64        i64        i64
"bio"       100        110        105
"chem"      200        190        210
"neuro"     150        160        155

Lazy Evaluation

One of Polars' most powerful features is lazy evaluation. Instead of executing each operation immediately, a LazyFrame records the operations as a query plan. Polars then optimizes this plan before execution, which can dramatically improve performance.

Key optimizations that Polars applies automatically:

  • Predicate pushdown: Filters are pushed as early as possible, reducing the amount of data processed.
  • Projection pushdown: Only the columns you actually need are loaded from disk.
  • Common subexpression elimination: Repeated computations are calculated once.
  • Parallel execution: Independent operations run on multiple CPU cores.

To use lazy mode, start with pl.scan_csv() instead of pl.read_csv(), or convert an existing DataFrame with .lazy(). When you are ready to execute, call .collect().

# Create a LazyFrame by scanning the CSV
lf = pl.scan_csv(url)
print(type(lf))
lf
<class 'polars.lazyframe.frame.LazyFrame'>
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

Csv SCAN [https://raw.githubusercontent.com/ljchang/dartbrains/master/data/salary/salary.csv]
PROJECT */6 COLUMNS
ESTIMATED ROWS: 82
# Build a query plan without executing it
query = (
    lf
    .filter(pl.col("salary") > 60000)
    .group_by("departm")
    .agg(
        pl.col("salary").mean().alias("avg_salary"),
        pl.len().alias("count"),
    )
    .sort("avg_salary", descending=True)
)

# View the optimized query plan
print(query.explain())
SORT BY [descending: [true]] [col("avg_salary")]
  AGGREGATE[maintain_order: false]
    [col("salary").mean().alias("avg_salary"), len().alias("count")] BY [col("departm")]
    FROM
    Csv SCAN [https://raw.githubusercontent.com/ljchang/dartbrains/master/data/salary/salary.csv]
    PROJECT 2/6 COLUMNS
    SELECTION: [(col("salary")) > (60000)]
    ESTIMATED ROWS: 82
# Execute the query plan and get a DataFrame
query.collect()
shape: (7, 3)
departm    avg_salary    count
str        f64           u32
"neuro"    81291.416667  12
"geol"     80476.0       3
"physics"  79061.0       4
"stat"     75247.333333  9
"chem"     74212.714286  7
"bio"      71610.285714  7
"math"     68714.0       4

You can also convert an existing DataFrame to a LazyFrame with .lazy() and back with .collect(). This is useful when you want to chain many operations and let Polars optimize them as a batch.

# Convert eager DataFrame to lazy, apply operations, then collect
result = (
    df.lazy()
    .with_columns(
        (pl.col("salary") / 1000).alias("salary_k"),
    )
    .filter(pl.col("salary_k") > 70)
    .select("departm", "salary_k", "publications")
    .collect()
)
result
shape: (28, 3)
departm    salary_k  publications
str        f64       i64
"bio"      86.285    72
"bio"      77.125    43
"bio"      71.922    23
"bio"      70.499    64
"chem"     97.63     43
…          …         …
"physics"  96.936    17
"physics"  83.216    19
"physics"  72.044    16
"math"     82.142    9
"math"     70.509    7

Exercises

Try these exercises to practice what you have learned. Each one uses the salary dataset.

Exercise 1: Filter the salary data to only include rows where the departm is "neuro" and salary is above 80,000. How many rows match? What is the average salary of this subset?

# Your code here

Exercise 2: Group the data by departm and gender. For each group, calculate the mean salary and the number of people. Sort the result by mean salary in descending order.

# Your code here

Exercise 3: Using over(), create a new column called pct_of_dept_mean that shows each person's salary as a percentage of their department's mean salary. Sort by this column in descending order. Who has the highest relative salary compared to their department?

# Your code here

Summary

In this tutorial, we covered the core concepts of Polars:

  • Series and DataFrames: Polars' fundamental data structures with strict typing and no row index.
  • Loading and inspecting data: Reading CSVs, checking shapes, schemas, and null counts.
  • Missing values: Using null_count(), drop_nulls(), and fill_null().
  • The expression API: Building computations with pl.col(), chaining operations, and using when/then/otherwise.
  • Creating columns: Adding new columns immutably with with_columns() and .alias().
  • Selecting and filtering: Choosing columns with select() and rows with filter().
  • Operations: Arithmetic, string methods, type casting, and map_elements().
  • Joins: Combining DataFrames with join() and concat().
  • Grouping and aggregation: Using group_by().agg() for summary statistics.
  • Window functions: Computing group-level values without collapsing rows using over().
  • Reshaping: Converting between wide and long formats with unpivot() and pivot().
  • Lazy evaluation: Building optimized query plans with scan_csv(), .lazy(), and .collect().

Polars' combination of speed, expressiveness, and lazy evaluation makes it an excellent tool for data analysis in Python. As your datasets grow larger and your queries more complex, these features will become increasingly valuable.