From 20 withColumn() Calls to 1 withColumns()

pyspark
performance
Spark
Calling .withColumn() in a loop creates a new DataFrame per iteration, bloating your logical plan and slowing down the Catalyst optimizer. Use .withColumns() to apply all transformations in a single pass.
Modified

18/02/2026

Note: Live Execution

All code in this post is rendered against a Databricks workspace before publication.

Summary

  • Calling .withColumn() in a loop creates a new DataFrame per iteration, generating deeply nested logical plans that slow down the Catalyst optimizer
  • .withColumns() (Spark 3.3+ / DBR 12.0+) accepts a dictionary of {column_name: expression} and applies all transformations in a single plan step
  • Real-world use cases: metadata enrichment, column standardisation, abbreviation expansion, and cleaning messy column names

Introduction

One of the most common PySpark anti-patterns is wrapping .withColumn() in a for loop. It reads well, feels Pythonic, and works… until it doesn’t.

Every call to .withColumn() returns a brand-new DataFrame object. Spark stacks each transformation as a separate Project node in the logical plan. With 50 columns in your loop, that’s 50 nested Project nodes the Catalyst optimizer has to traverse, analyze, and optimize through. With hundreds of columns (common in wide fact tables or IoT data) you can even hit a StackOverflowError because plan tree traversal is recursive.

.withColumns() was introduced in PySpark 3.3 to solve exactly this. One dictionary in, one plan node out.

df.withColumns(colsMap: dict[str, Column]) -> DataFrame

Every example below runs against this df, a realistic messy dataset that mirrors what you’d typically get from a legacy source system.

The column names are intentionally inconsistent: CustID (PascalCase), custName (camelCase), trnx_date (abbreviated snake_case), trnxAmt (camelCase + abbreviation), ACCT-bal (UPPER + hyphen), pmt$Status (dollar sign), addr.Line1 (dot + PascalCase). This is what real source data can look like.

from faker import Faker
from pyspark.sql import functions as F
from pyspark.sql.types import *
import random

fake = Faker()
Faker.seed(42)
random.seed(42)

# Generate 1000 rows of realistic messy data
rows = []
for _ in range(1000):
    rows.append((
        fake.random_int(min=10000, max=99999),          # CustID
        f"  {fake.name()}  ",                            # custName (with whitespace)
        fake.date_between("-2y", "today").isoformat(),   # trnx_date
        round(random.uniform(5.0, 9999.99), 4),         # trnxAmt (extra decimals)
        round(random.uniform(-500.0, 50000.0), 2),      # ACCT-bal
        random.choice(["pending", "complete", "failed", "REVERSED"]),  # pmt$Status
        fake.street_address(),                           # addr.Line1
    ))

schema = StructType([
    StructField("CustID",     IntegerType()),
    StructField("custName",   StringType()),
    StructField("trnx_date",  StringType()),
    StructField("trnxAmt",    DoubleType()),
    StructField("ACCT-bal",   DoubleType()),
    StructField("pmt$Status", StringType()),
    StructField("addr.Line1", StringType()),
])

df = spark.createDataFrame(rows, schema)
df.show(5, truncate=False)
+------+---------------------+----------+---------+--------+----------+------------------------------+
|CustID|custName             |trnx_date |trnxAmt  |ACCT-bal|pmt$Status|addr.Line1                    |
+------+---------------------+----------+---------+--------+----------+------------------------------+
|93810 |  Patrick Sanchez    |2024-05-30|6396.0645|763.04  |failed    |819 Johnson Course            |
|38657 |  Lance Hoffman      |2025-07-25|2452.6916|6546.67 |pending   |79402 Peterson Drives Apt. 511|
|59797 |  Ryan Munoz         |2025-09-30|6768.6046|44555.07|pending   |161 Calderon River Suite 931  |
|16006 |  Patricia Galloway  |2025-11-04|5906.9668|1105.03 |pending   |4752 Kelly Skyway             |
|44993 |  Melinda Jones      |2025-05-27|2190.2844|25020.44|pending   |76483 Cameron Trail           |
+------+---------------------+----------+---------+--------+----------+------------------------------+
only showing top 5 rows

Code Examples

The Anti-Pattern

from pyspark.sql import functions as F

# Each iteration creates a new DataFrame with a deeper logical plan
columns_to_add = {
    "trnx_date_upper":   F.upper(F.col("trnx_date")),
    "cust_name_trimmed":  F.trim(F.col("custName")),
    "acct_balance_abs":   F.abs(F.col("`ACCT-bal`")),
    "txn_amt_rounded":    F.round(F.col("trnxAmt"), 2),
    "load_timestamp":     F.current_timestamp(),
}

df_result = df
for col_name, expression in columns_to_add.items():
    df_result = df_result.withColumn(col_name, expression)
# 5 iterations = 5 nested Project nodes in the plan

df_result.show(5, truncate=False)
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
|CustID|custName             |trnx_date |trnxAmt  |ACCT-bal|pmt$Status|addr.Line1                    |trnx_date_upper|cust_name_trimmed|acct_balance_abs|txn_amt_rounded|load_timestamp            |
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
|93810 |  Patrick Sanchez    |2024-05-30|6396.0645|763.04  |failed    |819 Johnson Course            |2024-05-30     |Patrick Sanchez  |763.04          |6396.06        |2026-02-18 12:13:29.416136|
|38657 |  Lance Hoffman      |2025-07-25|2452.6916|6546.67 |pending   |79402 Peterson Drives Apt. 511|2025-07-25     |Lance Hoffman    |6546.67         |2452.69        |2026-02-18 12:13:29.416136|
|59797 |  Ryan Munoz         |2025-09-30|6768.6046|44555.07|pending   |161 Calderon River Suite 931  |2025-09-30     |Ryan Munoz       |44555.07        |6768.6         |2026-02-18 12:13:29.416136|
|16006 |  Patricia Galloway  |2025-11-04|5906.9668|1105.03 |pending   |4752 Kelly Skyway             |2025-11-04     |Patricia Galloway|1105.03         |5906.97        |2026-02-18 12:13:29.416136|
|44993 |  Melinda Jones      |2025-05-27|2190.2844|25020.44|pending   |76483 Cameron Trail           |2025-05-27     |Melinda Jones    |25020.44        |2190.28        |2026-02-18 12:13:29.416136|
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
only showing top 5 rows

The Fix: .withColumns()

from pyspark.sql import functions as F

# Single plan node, single pass, same result
df_result = df.withColumns({
    "trnx_date_upper":   F.upper(F.col("trnx_date")),
    "cust_name_trimmed":  F.trim(F.col("custName")),
    "acct_balance_abs":   F.abs(F.col("`ACCT-bal`")),
    "txn_amt_rounded":    F.round(F.col("trnxAmt"), 2),
    "load_timestamp":     F.current_timestamp(),
})

df_result.show(5, truncate=False)
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
|CustID|custName             |trnx_date |trnxAmt  |ACCT-bal|pmt$Status|addr.Line1                    |trnx_date_upper|cust_name_trimmed|acct_balance_abs|txn_amt_rounded|load_timestamp            |
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
|93810 |  Patrick Sanchez    |2024-05-30|6396.0645|763.04  |failed    |819 Johnson Course            |2024-05-30     |Patrick Sanchez  |763.04          |6396.06        |2026-02-18 12:13:31.643766|
|38657 |  Lance Hoffman      |2025-07-25|2452.6916|6546.67 |pending   |79402 Peterson Drives Apt. 511|2025-07-25     |Lance Hoffman    |6546.67         |2452.69        |2026-02-18 12:13:31.643766|
|59797 |  Ryan Munoz         |2025-09-30|6768.6046|44555.07|pending   |161 Calderon River Suite 931  |2025-09-30     |Ryan Munoz       |44555.07        |6768.6         |2026-02-18 12:13:31.643766|
|16006 |  Patricia Galloway  |2025-11-04|5906.9668|1105.03 |pending   |4752 Kelly Skyway             |2025-11-04     |Patricia Galloway|1105.03         |5906.97        |2026-02-18 12:13:31.643766|
|44993 |  Melinda Jones      |2025-05-27|2190.2844|25020.44|pending   |76483 Cameron Trail           |2025-05-27     |Melinda Jones    |25020.44        |2190.28        |2026-02-18 12:13:31.643766|
+------+---------------------+----------+---------+--------+----------+------------------------------+---------------+-----------------+----------------+---------------+--------------------------+
only showing top 5 rows
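
Like .withColumn(), .withColumns() replaces an existing column whenever a dictionary key matches an existing column name, so you can also clean columns in place rather than adding new ones. A minimal sketch against the same df:

from pyspark.sql import functions as F

# Keys that match existing column names overwrite those columns in place
df_in_place = df.withColumns({
    "custName": F.trim(F.col("custName")),     # strip the padded whitespace
    "trnxAmt":  F.round(F.col("trnxAmt"), 2),  # normalise to 2 decimal places
})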

Real-World: Standardise Messy Columns and Expand Abbreviations

Our setup DataFrame has columns like CustID, trnxAmt, ACCT-bal, pmt$Status, addr.Line1. Mixed casing, special characters, and cryptic abbreviations everywhere. Here’s how to clean them all in one pass:

from pyspark.sql import functions as F
import re

def to_snake_case(name: str) -> str:
    """Convert any naming convention to snake_case."""
    name = re.sub(r'[.$\-@#!]', '_', name)            # special chars to _
    name = re.sub(r'([a-z])([A-Z])', r'\1_\2', name)  # camelCase split
    name = re.sub(r'_{2,}', '_', name)                 # collapse doubles
    return name.lower().strip('_')

ABBREVIATIONS = {
    "trnx": "transaction", "cust": "customer",
    "acct": "account",     "pmt":  "payment",
    "addr": "address",     "amt":  "amount",
    "bal":  "balance",     "qty":  "quantity",
    "desc": "description", "dt":   "date",
    "nbr":  "number",      "cd":   "code",
    "nm":   "name",
}

def expand_abbreviations(name: str) -> str:
    """Expand common data abbreviations in a column name."""
    parts = name.split("_")
    return "_".join(ABBREVIATIONS.get(p, p) for p in parts)

# Build the transformation dict dynamically
transformations = {}
for col_name in df.columns:
    clean_name = expand_abbreviations(to_snake_case(col_name))
    transformations[clean_name] = F.col(f"`{col_name}`")

df_clean = df.withColumns(transformations)
print(df_clean.columns)
['CustID', 'custName', 'trnx_date', 'trnxAmt', 'ACCT-bal', 'pmt$Status', 'addr.Line1', 'customer_id', 'customer_name', 'transaction_date', 'transaction_amount', 'account_balance', 'payment_status', 'address_line1']
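
Because .withColumns() adds columns rather than renaming them, the originals are still present in the output above. If you want only the standardised schema, follow up with a select over the clean names; a small sketch:

# Keep only the standardised, expanded column names
df_clean = df.withColumns(transformations).select(*transformations.keys())
print(df_clean.columns)

If all you need is a pure rename with no expression changes, Spark 3.4+ also offers .withColumnsRenamed(), which takes the same dictionary style.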

Real-World: Metadata Enrichment in One Pass

Adding audit columns is one of the most common ETL tasks. Do it in one call instead of five:

from pyspark.sql import functions as F

pipeline_run_id = "run-2026-02-16-001"

df_enriched = df.withColumns({
    "ingestion_timestamp": F.current_timestamp(),
    "source_system":       F.lit("CRM_PROD"),
    "pipeline_run_id":     F.lit(pipeline_run_id),
    "row_hash":            F.sha2(F.concat_ws("|", *[F.col(f"`{c}`").cast("string") for c in df.columns]), 256),
    "is_valid": F.when(
        F.col("trnxAmt").isNotNull() & (F.col("trnxAmt") > 0), True
    ).otherwise(False),
})

df_enriched.select("CustID", "ingestion_timestamp", "source_system", "row_hash", "is_valid").show(5, truncate=40)
+------+--------------------------+-------------+----------------------------------------+--------+
|CustID|       ingestion_timestamp|source_system|                                row_hash|is_valid|
+------+--------------------------+-------------+----------------------------------------+--------+
| 93810|2026-02-18 12:13:32.732663|     CRM_PROD|0a16938d05f57ecdac3a93290a6b091113435...|    true|
| 38657|2026-02-18 12:13:32.732663|     CRM_PROD|2852cc60ce03fa94327046eaa23647536574e...|    true|
| 59797|2026-02-18 12:13:32.732663|     CRM_PROD|8fdc5170c3c067b621370b9b015f84dd4e66c...|    true|
| 16006|2026-02-18 12:13:32.732663|     CRM_PROD|15e0ed67555e8ed6a965c5fa76a6071d3b953...|    true|
| 44993|2026-02-18 12:13:32.732663|     CRM_PROD|7dcd0fb0db5ffe761939d1de92d8b1798783e...|    true|
+------+--------------------------+-------------+----------------------------------------+--------+
only showing top 5 rows

Scala Equivalent

import org.apache.spark.sql.functions._

// Scala uses a Map of column name to expression
val dfResult = df.withColumns(Map(
  "trnx_date_upper"   -> upper(col("trnx_date")),
  "cust_name_trimmed"  -> trim(col("custName")),
  "acct_balance_abs"   -> abs(col("`ACCT-bal`")),
  "txn_amt_rounded"    -> round(col("trnxAmt"), 2),
  "load_timestamp"     -> current_timestamp()
))
Tip: Check Your Spark Version

.withColumns() requires PySpark 3.3+ (Databricks Runtime 12.0+). If you're not on that version yet, upgrade your Databricks Runtime, for goodness' sake.

Detail

Why .withColumn() in a Loop Is Costly

Each .withColumn() call returns a new DataFrame with a new Project node appended to the unresolved logical plan. With N loop iterations, you get N nested Project nodes.

The Catalyst optimizer has to traverse and optimize through each one. Analysis, resolution, and optimization passes all scale with plan depth. Beyond roughly 100 columns you can hit StackOverflowError because plan tree traversal is recursive. Even before that, planning time can balloon from milliseconds to seconds.
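
You can see this for yourself by printing the extended plan for both approaches (reusing df and the columns_to_add dictionary from the examples above). The loop version shows one Project per iteration in the parsed and analyzed plans; the optimizer eventually collapses adjacent Projects, but it still has to analyze and traverse every one of them first:

# Reusing columns_to_add from the anti-pattern example above
df_loop = df
for name, expr in columns_to_add.items():
    df_loop = df_loop.withColumn(name, expr)

df_loop.explain(extended=True)                         # stacked Project nodes in the parsed/analyzed plans
df.withColumns(columns_to_add).explain(extended=True)  # a single Project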

How .withColumns() Fixes This

  1. Single Plan Node: All column expressions merge into one Project node
  2. Faster Optimization: Catalyst processes one flat set of expressions instead of a deeply nested tree
  3. Cleaner Code: A dictionary is more readable and maintainable than a loop
  4. Prevents StackOverflow: Even with hundreds of columns, the plan stays shallow
  5. Idiomatic: This is the pattern the Spark developers intended for multi-column operations

When to Use .withColumns()

  • Adding metadata or audit columns to ETL outputs
  • Applying type casts or formatting to multiple columns (see the sketch after this list)
  • Building derived columns from business rules
  • Any time you find yourself writing a loop with .withColumn()
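
For the type-cast case, a minimal sketch against the setup df, assuming you want a real date, two-decimal amounts, and a lower-cased status:

from pyspark.sql import functions as F

# Cast and normalise several columns in a single plan step
df_typed = df.withColumns({
    "trnx_date":  F.to_date("trnx_date", "yyyy-MM-dd"),
    "trnxAmt":    F.round("trnxAmt", 2).cast("decimal(10,2)"),
    "pmt$Status": F.lower(F.col("`pmt$Status`")),
})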
