Using Delta Tables and schema evolution in Azure Synapse // DataFrame

If you come from a SQL background, Delta Tables might be a perfect introduction to the Data Lakehouse architecture.

A Delta Table, at its most basic level, is a collection of versioned Parquet files and related metadata, commonly stored in Azure Data Lake Storage (think of it as a cloud-based hard drive).

The main features offered by Delta Tables include ACID transactions, schema enforcement and evolution, time travel (data versioning with rollbacks and an audit trail), and data mutability through upsert and delete operations, which supports change data capture (CDC) and slowly changing dimension (SCD) patterns.

Schema evolution is the feature I’d like to highlight here, as it takes only a couple of steps to implement in a Synapse notebook using Python. It removes the need to manually handle a scenario most businesses want supported anyway, and lets developers focus on the more complex work.

Adding new columns is as simple as applying the new schema: existing rows get the new column set to null, and new rows have it populated from the source.

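# Build an empty DataFrame that carries only the desired schema.
# "df" is the source DataFrame that already contains the new columns.
new_schema = df.limit(0)

# Apply the desired schema to the existing Delta Table located at "table_path".
# Appending zero rows with mergeSchema enabled adds the new columns without writing any data.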
new_schema \
    .write.format('delta') \
    .mode('append').option('mergeSchema', 'true') \
    .save(table_path)
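To make the null-filling behaviour described above more concrete, here is a minimal plain-Python model of it. It does not call Spark or Delta Lake at all: the helper names (columns_to_add, append_with_merged_schema) and the sample rows are hypothetical, and it compares column names exactly, ignoring data types and any case-insensitivity rules the real engine may apply. Treat it as a sketch of the idea rather than a description of how Delta implements it.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def columns_to_add(table_columns: List[str], incoming_columns: List[str]) -> List[str]:
    """Return the incoming columns that the table does not have yet, in incoming order."""
    existing = set(table_columns)
    return [name for name in incoming_columns if name not in existing]


def append_with_merged_schema(
    table_rows: List[Row],
    table_columns: List[str],
    new_rows: List[Row],
    incoming_columns: List[str],
) -> Tuple[List[str], List[Row]]:
    """Model an append with schema merging: widen the schema, fill gaps with None."""
    # The merged schema keeps the existing columns and appends the new ones at the end.
    merged_columns = table_columns + columns_to_add(table_columns, incoming_columns)
    # dict.get returns None for a missing key, which stands in for SQL NULL here.
    merged_rows = [
        {name: row.get(name) for name in merged_columns}
        for row in table_rows + new_rows
    ]
    return merged_columns, merged_rows


# Hypothetical sample data: the table has two columns, the source adds "country".
table_columns = ["id", "name"]
table_rows = [{"id": 1, "name": "Ada"}]
incoming_columns = ["id", "name", "country"]
new_rows = [{"id": 2, "name": "Grace", "country": "US"}]

merged_columns, merged_rows = append_with_merged_schema(
    table_rows, table_columns, new_rows, incoming_columns
)
print(merged_columns)
for row in merged_rows:
    print(row)
```

In a notebook, the two column lists could be taken from spark.read.format('delta').load(table_path).columns and df.columns, so that columns_to_add works as a quick pre-flight check of what the mergeSchema append is about to add.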

If you’d like to read a bit more on the basics of Delta Tables in Azure, I recommend going through the following Microsoft Learn guide, as it’s a great introduction to Delta Lake and related technologies.