Azure Data Lakehouse architecture combines the best elements of data lakes and data warehouses. The Delta file format, combined with low-cost storage, enables new ways of working with data pipelines and machine learning workloads.
Users can use Azure Synapse for data warehousing workloads, and Databricks for advanced analytics and ad-hoc data exploration. Both platforms can run on top of the same data stored in Azure Data Lake.
However, the two services overlap in capabilities and compete directly. In this article, I compare Azure Synapse Serverless and Databricks SQL Analytics as query engines on top of Azure Data Lake Storage Gen2 data.
This article is a vendor-neutral attempt to compare Azure Synapse and Databricks when using open data formats. The approach uses TPC-DS analytics queries to test performance and available functionality. However, this is not a thorough performance comparison: the underlying data, as well as both Azure Synapse and Databricks, could be tuned further to improve query results.
There are certainly hundreds of other questions related to SQL compatibility, caching, developer friendliness, cluster vs. endpoint behavior, and so on. For now, let’s limit the scope to the questions above.
Use TPC-DS benchmark data to compare Synapse Serverless and Databricks SQL Compute performance and execution cost.
Getting data for testing is always a challenge, but luckily there are bright people who created datasets for such benchmarks.
I have three datasets: 1 GB, 10 GB and 1 TB:
| Table | Avg Row Size in Bytes | 1 GB row counts | 10 GB row counts | 1 TB row counts |
|---|---|---|---|---|
The Azure Data Lake Storage Gen2 bronze zone stores the originally generated data (1 GB, 10 GB, and 1 TB datasets) in parquet format.
The silver zone stores the optimized datasets, converted to delta format. The benchmark tests run against the delta datasets.
The 1 GB database is not partitioned. The 10 GB and 1 TB database transaction tables are partitioned as follows:
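As an illustration, building a partitioned delta copy of one transaction table could look like the Spark SQL sketch below. The database and table names are placeholders, and the `ss_sold_date_sk` partition column is an assumption (spark-sql-perf conventionally partitions the sales tables on their sold-date surrogate key):

```sql
-- Hypothetical sketch: create a partitioned delta copy of one
-- TPC-DS transaction table. Database names and the partition
-- column are assumptions, not taken from the benchmark setup above.
CREATE TABLE tpcds_10gb_delta.store_sales
USING DELTA
PARTITIONED BY (ss_sold_date_sk)
AS SELECT * FROM tpcds_10gb_parquet.store_sales;
```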
All databases are registered in the Databricks metastore. Separate databases keep the 1 GB, 10 GB, and 1 TB datasets apart, and delta apart from parquet.
Spark-sql-perf created the tables on top of the parquet files automatically. I additionally copied the files to the silver zone and converted them to delta.
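One way to do that conversion in place is Databricks’ `CONVERT TO DELTA` command; the sketch below uses an illustrative mount path, not the actual storage location from this setup:

```sql
-- Sketch: convert a copied parquet directory to delta in place.
-- The path is illustrative; for partitioned parquet directories the
-- partition columns must be declared in the PARTITIONED BY clause.
CONVERT TO DELTA parquet.`/mnt/silver/TPC-DS/SourceFiles010GB_delta/store_sales`
PARTITIONED BY (ss_sold_date_sk INT);
```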
Delta is not yet supported with Synapse Serverless external tables, so only OPENROWSET will be used:
    CREATE VIEW dbo.web_site AS
    SELECT *
    FROM OPENROWSET(
        BULK 'TPC-DS/SourceFiles001GB_delta/web_site/',
        DATA_SOURCE = 'data_source_silver',
        FORMAT = 'DELTA'
    ) AS a
Also, I decided to check parquet performance. For parquet tables, I use external tables.
    CREATE EXTERNAL TABLE [dbo].[web_site] (
        web_site_sk bigint,
        web_site_id nvarchar(16),
        .....
    )
    WITH (
        LOCATION = 'TPC-DS/SourceFiles001GB_parquet/web_site/',
        DATA_SOURCE = 'data_source_bronze',
        FILE_FORMAT = parquet_format
    )
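For completeness, the external table above depends on a data source and a file format object. A minimal sketch of those supporting objects might look like the following; the storage URL is a placeholder and authentication setup is omitted:

```sql
-- Sketch of the supporting objects referenced by the external table.
-- The storage account name is a placeholder; credential configuration
-- (e.g. a database-scoped credential) is omitted for brevity.
CREATE EXTERNAL DATA SOURCE data_source_bronze
WITH ( LOCATION = 'https://<storageaccount>.dfs.core.windows.net/bronze' );

CREATE EXTERNAL FILE FORMAT parquet_format
WITH ( FORMAT_TYPE = PARQUET );
```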
There are 90 analytical queries plus 24 “warmup” queries (not included in the duration calculations). Nine queries were removed because some required rewriting specifically for Spark SQL and a few for Synapse; only queries identical on both platforms were kept.
You can find the queries here.
JMeter is used often in such testing scenarios. I followed the steps here to set it up on my machine.
I don’t test concurrent queries, so I disabled Multi-cluster Load Balancing.
Azure Databricks bills you for virtual machines (VMs) provisioned in clusters and Databricks Units (DBUs) based on the VM instance selected.
Total hourly price for SQL Endpoints:
You only pay for executed queries and the pricing is based on the amount of data processed by each query. Metadata-only queries (DDL statements) do not incur a cost. Queries will incur a minimum charge of 10 MB and each query will be rounded up to the nearest 1 MB of data processed.
Price: $5 per TB of data processed
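To make the pricing model concrete, here is a hypothetical worked example: a query that scans 4.3 MB is billed at the 10 MB minimum, which at $5 per TB comes to roughly $0.00005.

```sql
-- Hypothetical worked example of the serverless pricing model:
-- a 4.3 MB scan is billed at the 10 MB minimum, at $5 per TB.
SELECT CAST(10 AS float)   -- billed MB (scanned 4.3 MB, below the 10 MB minimum)
       / 1048576           -- MB per TB (1024 * 1024)
       * 5.0 AS cost_usd;  -- $5 per TB processed, ~ $0.0000477
```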
All executed queries are visible in the monitoring tab. To separate the tests, I used a different SQL user (Submitter) for each run, so I could tell which run processed which data. It was not possible to filter by the serverless pool name.
I am disappointed with Azure Synapse delta support: it’s much slower than everything else. On the other hand, Synapse Serverless external tables are around 7% slower than the Databricks 2X-Small cluster, but almost 8 times cheaper.
Winner - Azure Synapse Serverless with external tables on parquet files, if you don’t need top-tier performance and cost is a factor. Synapse Serverless also comes with access management and APIs familiar from the SQL Server world.
Even the least powerful Databricks endpoint leaves Synapse Serverless behind. Serverless delta is a pleasant surprise, running faster than external tables on parquet; I assume delta handles partitions better.
Winner - Databricks SQL Analytics is a faster and cheaper alternative.
Unfortunately, Synapse Serverless with OPENROWSET and external tables was not able to run queries against the 1 TB tables. Various timeouts and exceptions kept popping up, such as “There is insufficient system memory in resource pool ‘VDWFrontendPool’ to run this query”.
Winner - Databricks SQL Analytics
Use Databricks SQL Analytics when you are looking for a scalable and stable solution to query any data from Azure Data Lake.
Use Azure Synapse Serverless when your data is in the gigabytes, performance is not critical, and you care about integration with the rest of the Azure/Microsoft ecosystem.