.transform() vs .withColumns(): Right Tool, Right Grain
pyspark
Spark
best-practices
Using .transform() for column-level work hides intent and increases code volume. .withColumns() is explicit, compact, and generates a single plan node. Use .transform() for pipeline-level chaining instead.
Modified: 27/02/2026
Note: Live Execution
All code in this post is rendered against a Databricks workspace before publication.
Note: Context
Inspired by Denys K.’s LinkedIn post on using .transform() to break up long .withColumn() chains. The intent is spot on — nobody wants 500-line mega-chains. But there’s a better tool for column-level work.
Summary
- Wrapping individual .withColumn() calls in functions and chaining with .transform() hides column-level intent — you have to read every function definition to know what columns are being added
- .withColumns() (Spark 3.3+ / DBR 12.0+) keeps every column name and expression visible at the call site, in a single plan node
- .transform() shines at pipeline-level chaining: deduplication, SCD logic, audit metadata — not individual column expressions
- Both approaches are testable
Note: Setup: Create Sample Data
Faker-generated data matching the original post’s scenario: first, last, age, email, status.
To understand what columns such a chain adds, you need to read four separate function definitions: the column names and expressions are scattered. Each .transform() wrapping a single .withColumn() still creates a separate Project node — four nodes total, the same as a loop. And the total line count went up, not down.
Each .transform() function is a meaningful pipeline stage — deduplication, enrichment, auditing. Inside each stage, .withColumns() handles the column work. This is composable, readable, and generates an efficient plan.
What About Testing?
A fair point for .transform(): you can test each function in isolation with a tiny DataFrame. But .withColumns() is equally testable:
```python
from pyspark.sql import functions as F

# Define expressions as reusable variables
_age_group = (
    F.when(F.col("age") < 30, "young")
    .when(F.col("age") < 50, "mid")
    .otherwise("senior")
)

# Test against a 3-row DataFrame — same as you would with a function
test_df = spark.createDataFrame([(25,), (45,), (65,)], ["age"])
test_df.select(_age_group.alias("age_group")).show()

# Or test the full enrichment dict
ENRICHMENT_COLS = {
    "age_group": _age_group,
    "status_clean": F.when(F.col("status") == "active", "ACTIVE")
    .otherwise("INACTIVE"),
}
test_df2 = spark.createDataFrame(
    [(25, "active"), (45, "inactive")],
    ["age", "status"],
)
test_df2.withColumns(ENRICHMENT_COLS).show()
```