How to optimize DWH column updates // DataFrame

Column updates might not be an issue while the warehouse is new, but as rows accumulate the performance might start to drop drastically and refactoring will be a challenge. It’s best to plan ahead and design your warehouse properly, so let’s cover how to optimize DWH column updates.

SCD1 scenario

First, consider a simple DWH SCD1 update statement (assuming only 9 non-nullable, non-FK columns):

UPDATE tar SET
    Column1 = src.Column1
  , Column2 = src.Column2
  , Column3 = src.Column3
  ...
  , Column9 = src.Column9
  , uDate = SYSDATETIME()
FROM dbo.DWHTable tar
    INNER JOIN dbo.StagingTable src
 ON src.BusinessKey = tar.BusinessKey
WHERE
    tar.Column1 <> src.Column1
 OR tar.Column2 <> src.Column2
 OR tar.Column3 <> src.Column3
 ...
 OR tar.Column9 <> src.Column9;

Naturally, you need to match your staging and DWH table by relevant business key and update the row if any of the attributes changed. Obvious optimization is to index the BusinessKey column, though that will only help to an extent.

Persisted Checksum

In this case, you can help out the query engine by utilizing CHECKSUM function over the non-key attributes and PERSISTING the results. The CHECKSUM function is quite fast by itself, but persisting it makes it readily available. You would extend the tables as:

ALTER TABLE dbo.Table ADD RowChecksum AS CHECKSUM(Column1, Column2,..., Column9) PERSISTED;
ALTER TABLE dbo.StagingTable ADD RowChecksum AS CHECKSUM(Column1, Column2,..., Column9) PERSISTED;

Furthermore, the BusinessKey index should be extended to INCLUDE the RowChecksum column. The execution time (compared to no RowChecksum) should be reduced by at least 50% and even more as tables get larger.

Another nice thing about CHECKSUM function is that it handles NULLs, so you don’t need to check for them if some columns are nullable and your code will be much cleaner.

Caveats

Finally, as with most things in SQL, there are always some caveats and you need to be aware of them to know how to work around them.

CHECKSUM function will return different results for two logically equivalent entities if the data types don’t match (eg. nvarchar is not the same as varchar) or if you encounter some implicit rounding; work around this by carefully defining your data types or by applying column conversion inside of the CHECKSUM
be careful that the columns are ordered the same in both CHECKSUM functions, otherwise you’ll receive different results
the index will be updated every time one of the CHECKSUM columns is updated or rows are added; this adds some overhead to data load, but CHECKSUM function is quite fast and difference is negligible
every time you change the table schema, you need to update the CHECKSUM function definition; even if it doesn’t sound like much work, it adds to development overhead (sometimes requires deployment scripts)
you should update FK columns in a separate statement so it’s best not to include them in CHECKSUM (don’t forget that FK update also queries the PK table)
after a certain point, even RowChecksum might not be enough and you might need to refactor your updates to work in batches
the performance gains on smaller tables might be negligible but in those cases the negatives would also be negligible
it might be fairly complicated to maintain CHECKSUM function with a lot of transformations between staging and DWH table; in that case it might be better to only use the CHECKSUM function on the staging table and declare a regular integer column on the DWH table (in which you store the staging checksum)

Hopefully these caveats don’t turn you away from using CHECKSUM as I believe it will generally make your life easier, even if it complicates development.