Intermittent Azure Data Factory Pipeline Errors

Debugging Pipeline Failures

We recently saw an increase in Azure Data Factory (ADF) pipeline failures. The errors started appearing after an incident in which a table's data was truncated. Out of roughly 100 pipelines, only one was directly impacted by the truncation, yet all pipelines saw an increase in intermittent errors. As part of the incident response, the most recent backup taken before the truncation was restored, and a restore process was run through Azure Data Factory, with the Copy Activity's Source dataset being the restored Azure SQL Database Hyperscale database and the Sink being the target database.

Operation on target Copy Activity failed: Failure happened on 'Sink' side.

ErrorCode=UserErrorFailedToConnectOdbcSource,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ERROR [08001] [Microsoft][ODBC Driver 17 for SQL Server]SSL Provider: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

ERROR [08001] [Microsoft][ODBC Driver 17 for SQL Server]Client unable to establish connection ERROR [01S00] [Microsoft][ODBC Driver 17 for SQL Server]Invalid connection string attribute,Source=Microsoft.DataTransfer.Runtime.GenericOdbcConnectors,''Type=System.Data.Odbc.OdbcException,Message=ERROR [08001] [Microsoft][ODBC Driver 17 for SQL Server]SSL Provider: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

ERROR [08001] [Microsoft][ODBC Driver 17 for SQL Server]Client unable to establish connection ERROR [01S00] [Microsoft][ODBC Driver 17 for SQL Server]Invalid connection string attribute,Source=,'

Operation on target Copy First data events To Staging failed: Failure happened on 'Sink' side. ErrorCode=SqlFailedToConnect,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot connect to SQL Database. Please contact SQL server team for further support. Server: 'tcp:some.database.windows.net,1433', Database: 'some_database', User: 'x'. Check the linked service configuration is correct, and make sure the SQL Database firewall allows the integration runtime to access.,Source=Microsoft.DataTransfer.ClientLibrary,''Type=System.Data.SqlClient.SqlException,Message=Cannot open database "some_database" requested by the login. The login failed. Login failed for user 'x'.,Source=.Net SqlClient Data Provider,SqlErrorNumber=4060,Class=11,ErrorCode=-2146232060,State=2,Errors=[{Class=11,Number=4060,State=2,Message=Cannot open database "some_database" requested by the login. The login failed.,},{Class=14,Number=18456,State=1,Message=Login failed for user 'xxxxxxxxxxx'.,},],'

A ticket was opened with Microsoft Support.

CPU

The initial suspicion was that CPU was spiking on the target database, where both the analytics workloads and the restore were running. However, the spikes had already been occurring prior to the restore and the incident.
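
That history can be sanity-checked from the database side as well. A minimal sketch, reusing the placeholder database name from the errors above and an illustrative 80% threshold, using sys.resource_stats (queried from the logical server's master database, which retains roughly two weeks of 5-minute samples):

-- Roughly two weeks of 5-minute CPU samples for the database; the 80% filter
-- is just illustrative.
SELECT start_time, end_time, avg_cpu_percent
FROM sys.resource_stats
WHERE database_name = 'some_database'
  AND avg_cpu_percent > 80
ORDER BY start_time DESC;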

The other database, where the source dataset for the Copy Activity was defined, was not spiking. To see if the errors were reduced, we also increased the number of vCores on the target from 24 to 32. Note that the compute slider is only linear up to 20 vCores (2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 32, 40, 80).
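
For reference, the same scale-up can be issued in T-SQL instead of the portal slider. This is a sketch, assuming the standard Hyperscale Gen5 service objective naming (HS_Gen5_32 for 32 vCores) and the placeholder database name:

-- Scale the target database to 32 vCores; the statement returns immediately
-- and the scaling operation completes asynchronously.
ALTER DATABASE [some_database] MODIFY (SERVICE_OBJECTIVE = 'HS_Gen5_32');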

Integration Runtime

In our setup, Azure Data Factory uses self-hosted integration runtime (IR) agents that run on Windows virtual machines. Diagnostics were run on these machines and the reports were submitted to Microsoft Support.

The CPU and network utilization on these machines were not excessive, and the agents were already configured with a concurrent job limit higher than 5.

Network

I suspected that the issue might be due to the SQL connection pool being exhausted by excessive connections from the new restore job, so I checked the active requests and waits:

SELECT * FROM sys.dm_exec_requests 
SELECT * FROM sys.dm_exec_session_wait_stats

These queries didn't show connection counts exceeding 200 during the peak periods or the time ranges where the errors occurred.
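
For an actual count instead of scanning raw rows, a grouped query over the session DMV works too; this is just a sketch, and the grouping columns are whatever is useful in your environment:

-- Count user sessions per login and originating host to spot anything
-- monopolizing connections.
SELECT login_name, host_name, COUNT(*) AS session_count
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
GROUP BY login_name, host_name
ORDER BY session_count DESC;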

ODBC Datasets

Diving deeper into the failures, I noticed that the pipelines that failed most often were using an ODBC dataset. I tried extending the connect timeout and connection timeout properties in the connection strings associated with these datasets, but it had no effect on the connection issues.

I had initially suspected that these ODBC datasets were set up because of column encryption on the pipeline target tables. Running the query below, I found that, among the tables associated with the ADF pipelines using ODBC datasets, some had column encryption while others did not.

SELECT 
    t.name as [Table],
    c.name as [Column], 
    c.encryption_type_desc
FROM   
    sys.all_columns c INNER JOIN
    sys.tables t on c.object_id = t.object_id
WHERE  
    c.encryption_type IS NOT NULL AND t.name LIKE 'YOUR_TABLE%'
ORDER BY
    t.name,
    c.name

I decided to clone one of the failing pipelines that used the ODBC datasets and update the clone to use Azure SQL Database datasets instead, porting over the queries. The other pipelines were already using Azure SQL Database datasets in their Copy Activities.

This seemed to resolve the failures.

I suspect these pipelines were never updated to use the Azure SQL Database datasets because they had been working before. While reviewing them, we also determined that some of these pipelines should not even be triggered anymore.

Summary

Typically, retries on the individual activities can help resolve failures, but these pipelines run on hourly schedules and eventually succeed anyway. Adding retries on the individual Copy Activities is also tricky, depending on whether the operation is a direct insert or an upsert. There are also a lot of pipeline activities to update if this mitigation strategy were chosen.
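
For context on why the insert-versus-upsert distinction matters for retries: re-running a plain insert after a partial failure can duplicate rows, while a keyed upsert is safe to repeat. A minimal sketch with made-up table and column names:

-- Idempotent upsert keyed on EventId: retrying this statement will not create
-- duplicate rows, unlike a plain INSERT.
DECLARE @EventId INT = 1, @Payload NVARCHAR(MAX) = N'example';

MERGE dbo.StagingEvents AS target
USING (SELECT @EventId AS EventId, @Payload AS Payload) AS source
    ON target.EventId = source.EventId
WHEN MATCHED THEN
    UPDATE SET Payload = source.Payload
WHEN NOT MATCHED THEN
    INSERT (EventId, Payload) VALUES (source.EventId, source.Payload);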

A ticket is still open with Microsoft Support, who are looping in a network team, to investigate the less frequent intermittent failures.

We have some other challenges around restoring the data on the target, but that is a story for another day.
