Calling .withColumn() in a loop creates a new DataFrame per iteration, bloating your logical plan and slowing down the Catalyst optimizer. Use .withColumns() to apply all transformations in a single pass.
Modified: 18/02/2026
Note: Live Execution
All code in this post is rendered against a Databricks workspace before publication.
Summary
Calling .withColumn() in a loop creates a new DataFrame per iteration, generating deeply nested logical plans that slow down the Catalyst optimizer
.withColumns() (Spark 3.3+ / DBR 12.0+) accepts a dictionary of {column_name: expression} and applies all transformations in a single plan step
One of the most common PySpark anti-patterns is wrapping .withColumn() in a for loop. It reads well, feels Pythonic, and works… until it doesn’t.
Every call to .withColumn() returns a brand-new DataFrame object. Spark stacks each transformation as a separate Project node in the logical plan. With 50 columns in your loop, that’s 50 nested Project nodes the Catalyst optimizer has to traverse, analyze, and optimize through. With hundreds of columns (common in wide fact tables or IoT data) you can even hit a StackOverflowError because plan tree traversal is recursive.
.withColumns() was introduced in PySpark 3.3 to solve exactly this. One dictionary in, one plan node out.
Every example below runs against this df, a realistic messy dataset that mirrors what you’d typically get from a legacy source system.
The column names are intentionally inconsistent: CustID (PascalCase), custName (camelCase), trnx_date (abbreviated snake_case), trnxAmt (camelCase + abbreviation), ACCT-bal (UPPER + hyphen), pmt$Status (dollar sign), addr.Line1 (dot + PascalCase). This is what real source data can look like.
+------+---------------------+----------+---------+--------+----------+------------------------------+
|CustID|custName |trnx_date |trnxAmt |ACCT-bal|pmt$Status|addr.Line1 |
+------+---------------------+----------+---------+--------+----------+------------------------------+
|93810 | Patrick Sanchez |2024-05-30|6396.0645|763.04 |failed |819 Johnson Course |
|38657 | Lance Hoffman |2025-07-25|2452.6916|6546.67 |pending |79402 Peterson Drives Apt. 511|
|59797 | Ryan Munoz |2025-09-30|6768.6046|44555.07|pending |161 Calderon River Suite 931 |
|16006 | Patricia Galloway |2025-11-04|5906.9668|1105.03 |pending |4752 Kelly Skyway |
|44993 | Melinda Jones |2025-05-27|2190.2844|25020.44|pending |76483 Cameron Trail |
+------+---------------------+----------+---------+--------+----------+------------------------------+
only showing top 5 rows
Code Examples
The Anti-Pattern
from pyspark.sql import functions as F

# Each iteration creates a new DataFrame with a deeper logical plan
columns_to_add = {
    "trnx_date_upper": F.upper(F.col("trnx_date")),
    "cust_name_trimmed": F.trim(F.col("custName")),
    "acct_balance_abs": F.abs(F.col("`ACCT-bal`")),
    "txn_amt_rounded": F.round(F.col("trnxAmt"), 2),
    "load_timestamp": F.current_timestamp(),
}

df_result = df
for col_name, expression in columns_to_add.items():
    df_result = df_result.withColumn(col_name, expression)

# 5 iterations = 5 nested Project nodes in the plan
df_result.show(5, truncate=False)
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
|CustID|custName |trnx_date |trnxAmt |ACCT-bal|pmt$Status|addr.Line1 |trnx_date_upper|cust_name_trimmed|acct_balance_abs|txn_amt_rounded|load_timestamp |
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
|93810 | Patrick Sanchez |2024-05-30|6396.0645|763.04 |failed |819 Johnson Course |2024-05-30 |Patrick Sanchez |763.04 |6396.06 |2026-02-18 12:13:29.416136|
|38657 | Lance Hoffman |2025-07-25|2452.6916|6546.67 |pending |79402 Peterson Drives Apt. 511|2025-07-25 |Lance Hoffman |6546.67 |2452.69 |2026-02-18 12:13:29.416136|
|59797 | Ryan Munoz |2025-09-30|6768.6046|44555.07|pending |161 Calderon River Suite 931 |2025-09-30 |Ryan Munoz |44555.07 |6768.6 |2026-02-18 12:13:29.416136|
|16006 | Patricia Galloway |2025-11-04|5906.9668|1105.03 |pending |4752 Kelly Skyway |2025-11-04 |Patricia Galloway|1105.03 |5906.97 |2026-02-18 12:13:29.416136|
|44993 | Melinda Jones |2025-05-27|2190.2844|25020.44|pending |76483 Cameron Trail |2025-05-27 |Melinda Jones |25020.44 |2190.28 |2026-02-18 12:13:29.416136|
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
only showing top 5 rows
The Fix: .withColumns()
from pyspark.sql import functions as F

# Single plan node, single pass, same result
df_result = df.withColumns({
    "trnx_date_upper": F.upper(F.col("trnx_date")),
    "cust_name_trimmed": F.trim(F.col("custName")),
    "acct_balance_abs": F.abs(F.col("`ACCT-bal`")),
    "txn_amt_rounded": F.round(F.col("trnxAmt"), 2),
    "load_timestamp": F.current_timestamp(),
})
df_result.show(5, truncate=False)
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
|CustID|custName |trnx_date |trnxAmt |ACCT-bal|pmt$Status|addr.Line1 |trnx_date_upper|cust_name_trimmed|acct_balance_abs|txn_amt_rounded|load_timestamp |
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
|93810 | Patrick Sanchez |2024-05-30|6396.0645|763.04 |failed |819 Johnson Course |2024-05-30 |Patrick Sanchez |763.04 |6396.06 |2026-02-18 12:13:31.643766|
|38657 | Lance Hoffman |2025-07-25|2452.6916|6546.67 |pending |79402 Peterson Drives Apt. 511|2025-07-25 |Lance Hoffman |6546.67 |2452.69 |2026-02-18 12:13:31.643766|
|59797 | Ryan Munoz |2025-09-30|6768.6046|44555.07|pending |161 Calderon River Suite 931 |2025-09-30 |Ryan Munoz |44555.07 |6768.6 |2026-02-18 12:13:31.643766|
|16006 | Patricia Galloway |2025-11-04|5906.9668|1105.03 |pending |4752 Kelly Skyway |2025-11-04 |Patricia Galloway|1105.03 |5906.97 |2026-02-18 12:13:31.643766|
|44993 | Melinda Jones |2025-05-27|2190.2844|25020.44|pending |76483 Cameron Trail |2025-05-27 |Melinda Jones |25020.44 |2190.28 |2026-02-18 12:13:31.643766|
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
only showing top 5 rows
Real-World: Standardise Messy Columns and Expand Abbreviations
Our setup DataFrame has columns like CustID, trnxAmt, ACCT-bal, pmt$Status, addr.Line1. Mixed casing, special characters, and cryptic abbreviations everywhere. Here’s how to clean them all in one pass:
from pyspark.sql import functions as F
import re


def to_snake_case(name: str) -> str:
    """Convert any naming convention to snake_case."""
    name = re.sub(r'[.$\-@#!]', '_', name)            # special chars to _
    name = re.sub(r'([a-z])([A-Z])', r'\1_\2', name)  # camelCase split
    name = re.sub(r'_{2,}', '_', name)                # collapse doubles
    return name.lower().strip('_')


ABBREVIATIONS = {
    "trnx": "transaction", "cust": "customer",
    "acct": "account", "pmt": "payment",
    "addr": "address", "amt": "amount",
    "bal": "balance", "qty": "quantity",
    "desc": "description", "dt": "date",
    "nbr": "number", "cd": "code",
    "nm": "name",
}


def expand_abbreviations(name: str) -> str:
    """Expand common data abbreviations in a column name."""
    parts = name.split("_")
    return "_".join(ABBREVIATIONS.get(p, p) for p in parts)


# Build the transformation dict dynamically
transformations = {}
for col_name in df.columns:
    clean_name = expand_abbreviations(to_snake_case(col_name))
    transformations[clean_name] = F.col(f"`{col_name}`")

df_clean = df.withColumns(transformations)
print(df_clean.columns)
import org.apache.spark.sql.functions._

// Scala uses a Map of column name to expression
val dfResult = df.withColumns(Map(
  "trnx_date_upper" -> upper(col("trnx_date")),
  "cust_name_trimmed" -> trim(col("custName")),
  "acct_balance_abs" -> abs(col("`ACCT-bal`")),
  "txn_amt_rounded" -> round(col("trnxAmt"), 2),
  "load_timestamp" -> current_timestamp()
))
Tip: Check Your Spark Version
.withColumns() requires Spark 3.3+ (Databricks Runtime 12.0+). If you're still on an older version, upgrade your Databricks Runtime, or keep the plan flat with the select() fallback sketched below.
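If upgrading isn't an option yet, a single select() gives you the same single-Project-node behaviour on older versions, because all the new expressions land in one projection. A minimal sketch, reusing a small hypothetical subset of the columns_to_add dictionary from the examples above:

from pyspark.sql import functions as F

# Pre-3.3 fallback: one select() call = one Project node, same as .withColumns()
columns_to_add = {
    "trnx_date_upper": F.upper(F.col("trnx_date")),
    "load_timestamp": F.current_timestamp(),
}

df_result = df.select(
    "*",  # keep every existing column
    *[expr.alias(name) for name, expr in columns_to_add.items()],
)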
Detail
Why .withColumn() in a Loop Is Costly
Each .withColumn() call returns a new DataFrame with a new Project node appended to the unresolved logical plan. With N loop iterations, you get N nested Project nodes.
The Catalyst optimizer has to traverse and optimize through each one. Analysis, resolution, and optimization passes all scale with plan depth. Beyond roughly 100 columns you can hit a StackOverflowError because plan-tree traversal is recursive. Even before that, planning time can balloon from milliseconds to seconds.
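You can see the depth difference yourself with .explain(True): the loop version's parsed logical plan shows one nested Project per iteration, while the .withColumns() version shows a single Project. A quick sketch against the setup df (the flag_ columns are made up purely for illustration):

from pyspark.sql import functions as F

# Loop version: the parsed logical plan nests one Project per iteration
df_loop = df
for i in range(5):
    df_loop = df_loop.withColumn(f"flag_{i}", F.lit(i))
df_loop.explain(True)

# .withColumns() version: all five expressions sit in a single Project
df_single = df.withColumns({f"flag_{i}": F.lit(i) for i in range(5)})
df_single.explain(True)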
How .withColumns() Fixes This
Single Plan Node: All column expressions merge into one Project node
Faster Optimization: Catalyst processes one flat set of expressions instead of a deeply nested tree
Cleaner Code: A dictionary is more readable and maintainable than a loop
Prevents StackOverflow: Even with hundreds of columns, the plan stays shallow
Idiomatic: This is the pattern the Spark developers intended for multi-column operations
When to Use .withColumns()
Adding metadata or audit columns to ETL outputs (see the sketch after this list)
Applying type casts or formatting to multiple columns
Building derived columns from business rules
Any time you find yourself writing a loop with .withColumn()
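As a concrete sketch of the first two cases, here is one .withColumns() call that stamps hypothetical audit columns and casts the numeric fields of the setup df in a single pass (the source_system value and the chosen types are illustrative assumptions, not from the original examples):

from pyspark.sql import functions as F

# Audit metadata and type casts applied together in one Project node
df_out = df.withColumns({
    # hypothetical audit columns
    "load_timestamp": F.current_timestamp(),
    "source_system": F.lit("legacy_crm"),
    # explicit casts for the numeric fields (types chosen for illustration)
    "trnxAmt": F.col("trnxAmt").cast("decimal(12,2)"),
    "ACCT-bal": F.col("`ACCT-bal`").cast("double"),
})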