
Azure Synapse Serverless vs Databricks SQL Analytics

By Valdas Maksimavicius
Published in Data Architecture
July 16, 2021
5 min read

The Azure Data Lakehouse architecture combines the best elements of data lakes and data warehouses. The Delta file format, combined with low-cost storage, enables new ways of working with data pipelines and machine learning workloads.

Users can use Azure Synapse for data warehousing workloads, and Databricks for advanced analytics and ad-hoc data exploration. Both platforms can run on top of the same data stored in Azure Data Lake.

However, the two services also overlap in capabilities and compete with each other. In this article, I compare Azure Synapse Serverless and Databricks SQL Analytics as query engines on top of Azure Data Lake Gen 2 data.

This article is a vendor-neutral attempt to compare Azure Synapse and Databricks when using open data formats. The approach uses TPC-DS analytical queries to test performance and available functionality. However, this is not a thorough performance comparison: the underlying data, Azure Synapse, and Databricks could all be tweaked further to optimize query results.

Questions:

  • Which is faster - Synapse Serverless or Databricks SQL endpoint?
  • Is Delta equally supported across Synapse Serverless and Databricks SQL?
  • When to use Synapse Serverless and when Databricks SQL?
  • What is the performance with OPENROWSET and EXTERNAL tables?
  • What is the cost of using Synapse Serverless and Databricks SQL?

There are certainly hundreds of other questions related to SQL compatibility, caching, developer friendliness, cluster vs. endpoint, and so on. For now, let’s limit the scope to the questions above.


Table Of Contents

Preparations

Round 1 - 1 GB

Round 2 - 10 GB

Round 3 - 1 TB

Verdict


Preparations

Define game rules

Use TPC-DS benchmark data to compare Synapse Serverless and Databricks SQL Compute performance and execution cost.

Choices:

  • All data in data lake for both platforms (no preloading to SQL pools or dbfs)
  • Both platforms run on the same data
  • Delta format over parquet
  • Use standard out of the box settings, no optimization
  • No concurrent queries
  • 90 analytical queries with warm-up queries, 3 runs, take average
  • Use JMeter to run all queries
  • Cost (Synapse provides per query pricing, while SQL Analytics has hourly pricing)
  • Comparing Synapse Serverless pricing with Databricks is a bit like comparing apples with oranges, but as a customer I want to know how much it will cost over time.
  • Databricks SQL endpoints take up to 5 minutes to start and have a minimum inactivity period of 10 minutes before the cluster shuts down. I decided not to include this in pricing, as I would then also have had to include the time in the duration, which didn’t make sense to me.
  • User-friendliness (ease of creating objects, tables, UI)

Deploy Azure resources

  1. Azure Synapse Workspace (serverless only)
  2. Databricks with enabled SQL Analytics
  3. Azure Data Lake Gen 2 storage

[Figure: deployed Azure resources]

Get data

Getting data for testing is always a challenge, but luckily there are bright people who created datasets for such benchmarks.

Here are a few links (link 1, link 2, link 3) that helped me generate the required data based on TPC-DS.

I have three datasets: 1 GB, 10 GB and 1 TB:

Table                  | Avg Row Size in Bytes | 1 GB row count | 10 GB row count | 1 TB row count
call_center            | 305                   | 6              | 24              | 42
catalog_page           | 139                   | 11,718         | 12,000          | 30,000
catalog_returns        | 166                   | 143,672        | 1,439,882       | 144,004,725
catalog_sales          | 226                   | 1,439,935      | 14,400,425      | 1,439,976,202
customer               | 132                   | 100,000        | 500,000         | 12,000,000
customer_address       | 110                   | 50,000         | 250,000         | 6,000,000
customer_demographics  | 42                    | 1,920,800      | 1,920,800       | 1,920,800
date_dim               | 141                   | 73,049         | 73,049          | 73,049
household_demographics | 21                    | 7,200          | 7,200           | 7,200
income_band            | 16                    | 20             | 20              | 20
inventory              | 16                    | 11,745,000     | 133,110,000     | 783,000,000
item                   | 281                   | 18,000         | 102,000         | 300,000
promotions             | 124                   | 300            | 500             | 1,500
reason                 | 38                    | 35             | 45              | 65
ship_mode              | 56                    | 20             | 20              | 20
store                  | 263                   | 12             | 102             | 1,002
store_returns          | 134                   | 288,279        | 2,877,632       | 288,009,578
store_sales            | 164                   | 2,879,789      | 28,800,501      | 2,879,966,589
time_dim               | 59                    | 86,400         | 86,400          | 86,400
warehouse              | 117                   | 5              | 10              | 20
web_page               | 96                    | 60             | 200             | 3,000
web_returns            | 162                   | 71,772         | 720,274         | 72,002,305
web_sales              | 226                   | 718,931        | 7,200,088       | 719,959,864
web_site               | 292                   | 30             | 42              | 54

The Azure Data Lake Gen 2 bronze zone stores the originally generated data (1 GB, 10 GB, and 1 TB datasets) in Parquet format.

[Figure: bronze zone folders in ADLS]

The silver zone stores the optimized datasets, converted to Delta format. The benchmark tests run against the datasets in Delta format.

[Figure: silver zone folders in ADLS]

Define partitions

The 1 GB database is not partitioned. The 10 GB and 1 TB transaction tables are partitioned as follows (a conversion sketch is shown after the list):

  • catalog_returns PARTITIONED BY cr_returned_date_sk
  • web_returns PARTITIONED BY wr_returned_date_sk
  • web_sales PARTITIONED BY ws_sold_date_sk
  • store_returns PARTITIONED BY sr_returned_date_sk
  • store_sales PARTITIONED BY ss_sold_date_sk
  • inventory PARTITIONED BY inv_date_sk
  • catalog_sales PARTITIONED BY cs_sold_date_sk
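
As an illustration, a partitioned Delta table can be created from the bronze Parquet files with a single Spark SQL statement. This is only a sketch; the database name and storage paths below are hypothetical:

-- Sketch: convert a bronze Parquet table to partitioned Delta in the silver zone
-- (database name and paths are illustrative, not the exact ones used here)
CREATE TABLE tpcds_010gb_delta.store_sales
USING DELTA
PARTITIONED BY (ss_sold_date_sk)
LOCATION 'abfss://silver@<storageaccount>.dfs.core.windows.net/TPC-DS/SourceFiles010GB_delta/store_sales/'
AS SELECT *
   FROM parquet.`abfss://bronze@<storageaccount>.dfs.core.windows.net/TPC-DS/SourceFiles010GB_parquet/store_sales/`;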

[Figure: partition folders in ADLS]

Register databases and tables

Databricks databases and tables

All databases are registered in the Databricks metastore. Separate databases keep the 1 GB, 10 GB, and 1 TB datasets apart, as well as Delta from Parquet.
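
In practice this means one database per dataset size and format; a minimal sketch with illustrative names:

-- Sketch of the database layout (names are illustrative)
CREATE DATABASE IF NOT EXISTS tpcds_001gb_delta;
CREATE DATABASE IF NOT EXISTS tpcds_001gb_parquet;
CREATE DATABASE IF NOT EXISTS tpcds_010gb_delta;
-- ...and so on for the 10 GB Parquet and 1 TB variants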

[Figure: databases in the Databricks metastore]

The spark-sql-perf library created the tables on top of Parquet automatically. In addition, I moved the files to the silver zone and converted them to Delta.

Synapse databases and tables

[Figure: databases in Synapse]

Delta is not yet supported with external tables, so only OPENROWSET is used:

CREATE VIEW dbo.web_site
AS

SELECT *
FROM OPENROWSET(BULK 'TPC-DS/SourceFiles001GB_delta/web_site/',
                DATA_SOURCE = 'data_source_silver',
                FORMAT = 'DELTA') as a

I also decided to check Parquet performance. For Parquet tables, I use external tables:

CREATE EXTERNAL TABLE [dbo].[web_site]
(
    web_site_sk           bigint,
    web_site_id           nvarchar(16),
    .....
    
)
WITH
(
    LOCATION = 'TPC-DS/SourceFiles001GB_parquet/web_site/',
    DATA_SOURCE = 'data_source_bronze',
    FILE_FORMAT = parquet_format
)
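
For completeness, the external table above assumes a few supporting objects already exist. A minimal sketch, with an illustrative storage URL:

-- Sketch: supporting objects assumed by the external table above
-- (the storage URL is illustrative)
CREATE EXTERNAL DATA SOURCE data_source_bronze
WITH ( LOCATION = 'https://<storageaccount>.dfs.core.windows.net/bronze' );

CREATE EXTERNAL FILE FORMAT parquet_format
WITH ( FORMAT_TYPE = PARQUET );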

Define queries

There are 90 analytical queries plus 24 “warm-up” queries (not included in the duration calculations). Nine queries were removed, as some required rewriting specifically for Spark SQL and a few specifically for Synapse. Hence, only queries that are identical on both platforms were kept.

You can find the queries here.
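
To give a flavor of the workload, here is TPC-DS query 3, a typical example (shown in its Spark SQL form; the T-SQL variant uses TOP 100 instead of LIMIT):

-- TPC-DS query 3: brand revenue per year for one manufacturer in November
SELECT dt.d_year,
       item.i_brand_id AS brand_id,
       item.i_brand AS brand,
       SUM(ss_ext_sales_price) AS sum_agg
FROM date_dim dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 128
  AND dt.d_moy = 11
GROUP BY dt.d_year, item.i_brand_id, item.i_brand
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100;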

Configure testing environment with JMeter

JMeter is often used in such testing scenarios. I followed the steps here to set it up on my machine.

[Figure: JMeter test plan]

Define cluster sizes and pricing

Databricks SQL Analytics endpoints

[Figure: Databricks SQL endpoint configuration]

I don’t test concurrent queries, so I disable Multi-cluster Load Balancing.

Azure Databricks bills you for virtual machines (VMs) provisioned in clusters and Databricks Units (DBUs) based on the VM instance selected.

  • SQL Endpoint (compute) price - $0.22/DBU-hour

[Figure: Databricks SQL endpoint sizes]

  • (To be verified) SQL Endpoints use Ev3-series virtual machines

Total hourly price for SQL Endpoints:

  • 2X-Small - 4 x $0.22/DBU-hour + 2 x $0.64/VM-hour (E8 v3)
  • X-Small - 6 x $0.22/DBU-hour + 3 x $0.64/VM-hour (E8 v3)
  • Small - 12 x $0.22/DBU-hour + 4 x $0.64/VM-hour (E8 v3) + 1 x $1.28/VM-hour (E16 v3)
  • Medium - 24 x $0.22/DBU-hour + 8 x $0.64/VM-hour (E8 v3) + 1 x $2.56/VM-hour (E32 v3)
  • Large - 40 x $0.22/DBU-hour + 12 x $0.64/VM-hour (E8 v3) + 1 x $4.376/VM-hour (E64 v3)
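
For example, the 2X-Small endpoint works out to 4 x $0.22 + 2 x $0.64 = $2.16 per hour of uptime, assuming the Ev3 prices above.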

Azure Synapse Serverless

You only pay for executed queries and the pricing is based on the amount of data processed by each query. Metadata-only queries (DDL statements) do not incur a cost. Queries will incur a minimum charge of 10 MB and each query will be rounded up to the nearest 1 MB of data processed.

Price: $5 per TB of data processed
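
As an example, a query that scans 2 GB of data is billed 2/1,024 TB x $5 ≈ $0.01, while a query touching only 1 MB is still billed for the 10 MB minimum.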

All executed queries are visible in the monitoring tab. To separate the test runs, I used a different SQL user (Submitter) per run, so I could tell which run processed the data. It was not possible to filter by serverless pool name.
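
As a cross-check, the serverless pool also exposes aggregate usage through a system view; to the best of my knowledge, this can be queried as follows:

-- Sketch: aggregate data processed by serverless queries (daily/weekly/monthly)
SELECT type, data_processed_mb
FROM sys.dm_external_data_processed;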

[Figure: Synapse serverless query monitoring]

Round 1 - 1 GB

[Figure: Round 1 results - 1 GB]

Explanation:

  • One run consists of 90 analytical queries
  • Three runs one after another were triggered

I am disappointed with Azure Synapse Delta support: it’s much slower than everything else. On the other hand, Synapse Serverless external tables are around 7% slower than the Databricks 2X-Small cluster, but almost 8 times cheaper.

Winner - Azure Synapse Serverless with external tables on Parquet files, if you are not after top-tier performance and cost is a factor. Synapse Serverless also comes with access management and APIs familiar from the SQL Server world.

Round 2 - 10 GB

[Figure: Round 2 results - 10 GB]

Explanation:

  • One run consists of 90 analytical queries
  • Three runs one after another were triggered
  • Databricks sped up considerably with caching in runs 2 and 3; I took the average of all three runs.

Even the least powerful Databricks endpoint leaves Synapse Serverless behind. Serverless Delta is a positive surprise, running faster than external tables on Parquet; I assume Delta handles partitions better.

Winner - Databricks SQL Analytics is a faster and cheaper alternative.

Round 3 - 1 TB

[Figure: Round 3 results - 1 TB]

Explanation:

  • One run consists of 90 analytical queries
  • Three runs one after another were triggered
  • Databricks sped up considerably with caching in runs 2 and 3; I took the average of all three runs.

Unfortunately, Synapse Serverless, with both OPENROWSET and external tables, was not able to run the queries against the 1 TB tables. Various timeouts and exceptions kept popping up, such as “There is insufficient system memory in resource pool ‘VDWFrontendPool’ to run this query”.

Winner - Databricks SQL Analytics

Verdict

Databricks results

  • The first run with each cluster was up to 7 times slower; subsequent runs obviously benefited from caching. The included results take the average of all runs.
  • Enterprise-ready solution for both smaller and very large data
  • Developer-friendly way of working with Delta tables from the SQL Analytics portal
  • Requires endpoint (cluster) startups; doesn’t feel cloud-native in 2021 :)

Use Databricks SQL Analytics when you are looking for a scalable and stable solution to query any data in Azure Data Lake.

Synapse results

  • Modern approach that doesn’t require any cluster startups. It’s just there waiting for user queries.
  • Almost no difference between runs 1, 2, and 3. Does any caching exist there?
  • Ambiguous access setup; “Content of directory on path cannot be listed”, “Cannot bulk load. The file does not exist or you don’t have file access rights”
  • Requires creating credentials and data formats, and specifying the structure
  • Limited Delta support
  • Issues with large datasets

Use Azure Synapse Serverless when your data is in the GB range, performance is not key, and integration with the rest of the Azure/Microsoft ecosystem matters to you.


Tags

#databricks
