Most of the projects that I am involved with, utilizes Azure Data Factory v2 in some shape or form, and some of those even uses the SQL Server Integration Services Integration Runtime (Try saying that 5 times fast..)
One of the bigger parts of using this, is shutting down the runtime when it’s not used, to optimise the cloud costs of the platform, as to not pay for something you’re not utilising. As per Microsoft Documentation, this is done through two possible solutions.
- Azure Automation
- Recently released Management REST APIs
The old way, the Automation approach relays on custom powershell code, which is then called in Data Factory through a Web Activity. This removes some logic from the data factory itself, and often getting the error code back to the ADF was a pain. Also most importantly, these were asynchronous calls, so you’d have to implement a static wait in the Data Factory, and hope that it was available by the time the wait activity was done.
The new approach however, is a synchronous call to the API, so the web activity will wait until the IR is ready, before continuing. This means your flow is a lot more simplified.
However, there are two major faults with this new approach, which I will hope to cover in this article.
The problem – Part 1
The base at the issue, lays in the approach itself. Because we use an API to execute the activity with, there’s no underlying checks going on under the hood, for what to do in certain situations. Lets take the template “Schedule ADF pipeline to start and stop Azure-SSIS IR just in time before and after running SSIS package” from the template gallery inside Data Factory, provided by Microsoft, as an example.
Now, this is pretty straight forward, start the IR, do something and finally stop it again. However, because the Start activity doesn’t do any validation before you POST to the REST API, you’ll be met by an error as shown below, if the runtime is already running.
That’s problem #1 with this approach, because this could potentially mean that if for some reason the integration runtime is turned on when your scheduled data load is triggered; it will fail. Of course, someone might just just say ‘oh but you can just continue your flow, by allowing the pipeline to continue the flow, even if the start fails.’ And that’s right, we COULD, but what if it actually fails for real in start up, and your integration runtime really isn’t ready? In a world where data is considered mission critical, I would not recommend living by chance, nor designing for it.
Fixing the issue – Part 1
Now that we’ve established an issue, lets get around to fixing it.
In its most basic form, what we need to do before we start the IR, is to check if it’s already running. We can do that in ADF through a lookup activity.
For those that didn’t know, the SSIS-IR in ADF utilises SQL Server Integration Services Scale Out, which was introduced with SQL Server 2017. This allows the SSIS to operate using a master-slave topology, where the workload is spread out. This means that we can use the system tables in SSISDB, to query the status of our IR. In order to query the SSISDB, you will need to first create a Linked Service reference to the database. This can be done with the AutoResolveIntegrationRuntime that is default in the ADF, unless your SSISDB is located on an Azure SQL Managed Instance with no public endpoint, in which case you’d need a Selfhosted Integration Runtime to query the database. You will also need a dataset on top of that linked service, which can be pointed to “non” table, as we only use queries throughout this. We query SSISDB by using the Lookup Activity, with the following code as our query.
SELECT COUNT(*) as ActiveWorkers FROM catalog.worker_agents WHERE LastOnlineTime >= DATEADD(MINUTE, -1, GETDATE())
The code itself is fairly simple, we look at the catalog.worker_agents table, and count the number of active workers, filtered on the LastOnlineTime column. This has records for every single worker that has existed, so the filter only finds if a worker has been online within the past minute. The LastOnlineTime is only updated once per minute, so we are not able to decrease it any further. This query returns a number of active workers for us, which we will then use in an If Activity.
The If Condition itself is a equals() based on the output of our previous activity, which then directs us down the right path. If it returns 0, thus no IR’s are currently running, then it will evaluate to true, and start our True Activities, which is the REST API call, to start the runtime. If it is different from 0, meaning an IR is currently running, then it was evaluate to false, and start the False Activities. In my test scenario, I created a 1 second wait, just to better show the flow in the Monitor logs for the sake of overview. This can be left blank if you so decide. The API is as described in the above mentioned msdocs, just the sample url, adjusting with the required alterations.
With that, problem #1 is now resolved, our pipeline can now be executed, and even if the IR is running, it will shown green all the way, and not cause issues with our flow.
The problem – Part 2
With the starting of the IR fixed, lets dive into issue # 2, which is the stopping part.
In a world more than one pipeline exists, stopping the IR could cause other pipelines to fail, because we simply pull the plug with the REST API. This one is a bit more complex than the starting, as it’s not just a simple “yes / no” check every time. We have cases where we want the flow to terminate whatever is running, if we so desire. So, this means another variable has to be taken into the equation, in order for us to handle this properly.
Fixing the issue – Part 2
Once again, my proposed solution contains 2 activities; a Lookup Activity, and an If Activity, with the sub activities. In this pipeline however, we will implement a parameter to it as well, which will allow us to overrule the shutdown, and forcibly terminate the IR, if we so choose.
Once again, the lookup queries the SSISDB. As an addition to the script we used in the first step, we are now also interested in if there are any packages running. It’s worth nothing that the table catalog.executions now has an undocumented column called worker_agent_id, which correlates to the RunId of the Activity in Data Factory Monitor, which started the SSIS package. But sadly, there are no links between the worker_agents table, and the executions table. So right now, we are forced to handle them seperately, and assume a link between them, but this might change in the future.
;WITH ActiveWorkers as ( SELECT WorkerAgentId FROM catalog.worker_agents WHERE LastOnlineTime >= DATEADD(MINUTE, -1, GETDATE()) ) SELECT COUNT(*) as ActivePackages, (Select COUNT(*) FROM ActiveWorkers) as ActiveWorkersCount FROM catalog.executions WHERE status IN (1,2,5);
In our If Condition, we now need to handle the parameter, and the output of the activity. Thankfully ADF has quite extensive expression support, so that’s managable. The general flow in the expression itself goes something like this. If it evaluates to true, it will try to shut down the SSIS-IR, if false then it will do nothing.
- ActiveWorkersCount from our lookup needs to be greater than 0.
- If it is 0, which means the runtime is already stopped, and no actions needs to be made.
- If ForceStop is true then that takes precedence.
- If Active Packages is 0 then true, otherwise false.
In summary, this means there are 2 reasons for it to evaluate to true; the SSIS-IR is running, but no packages are active, or it is running and ForceStop is set to true.
With this implemented, we’ve enhanced the REST APIs made by Microsoft, in order to create a more robust data flow platform. There are still a few holes in this, such as the fact we can only check the status of the workers with a minutes delay, but it’s a hell of a lot better than failing to begin with.
One could hope that eventually there will be a built in activity to manage this integration runtime, however my dream is that at some point we never have to manage this at all, it should be possible to create a fully PaaS solution, where the IR spins up if an Execute SSIS Package activity is executed, and then shut down once we’re done, similar to how Databricks functions.
The JSON definitions of these two pipelines can be found on my github, which you can find here: https://github.com/DanniJuhl/BlogContent
I hope this has helped you manage the SSIS-IR better in Azure Data Factory v2, and allow you to better manage the cost of your data platform in Azure!
Please do leave a comment down below, or contact me on Twitter if you found this to be useful, or general feedback!