#AzureAdventCalendar entry – Azure Data Factory

Azure Advent Calendar is a community event consisting of 75 videos over a span of 25 days, made by speakers from all over the world, all super excited about #Azure! It is orchestrated by Gregor Suttie (Blog | Twitter) and Richard Hooper (Blog | Twitter), so the biggest of shout-outs to them, and once again thanks for allowing me to participate!

Rather than sitting here reading whatever boring text I have to give you, why don’t you head over to YouTube and check out all the amazing content from my fellow contributors (and perhaps even mine)? Once you’re done there, make sure to visit the event website to see what’s upcoming and to find links to blog posts from the sessions that are already available!

If you don’t have time to view them all right now, let me give you my personal top three submissions so far, which are absolutely must-watch:

My submission is on Azure Data Factory, with a high-level structural walkthrough of the service, followed by some examples of design patterns when using ADF to move data, be it between cloud data stores or from on-premises to the cloud. You can find the video here:

Schedule SSIS-IR in Azure Data Factory v2 – The right way

Most of the projects that I am involved with utilise Azure Data Factory v2 in some shape or form, and some of those even use the SQL Server Integration Services Integration Runtime (try saying that five times fast…)

One of the bigger parts of using this is shutting down the runtime when it’s not in use, to optimise the cloud costs of the platform and not pay for something you’re not utilising. As per the Microsoft documentation, this can be done through two possible solutions.

  • Azure Automation
  • Recently released Management REST APIs

The old way, the Automation approach, relies on custom PowerShell code, which is then called from Data Factory through a Web Activity. This moves some of the logic outside the data factory itself, and getting the error code back to ADF was often a pain. Most importantly, these were asynchronous calls, so you’d have to implement a static wait in the Data Factory and hope the runtime was available by the time the Wait activity was done.

The new approach, however, is a synchronous call to the API, so the Web Activity will wait until the IR is ready before continuing. This makes your flow a lot simpler.

However, there are two major faults with this new approach, which I hope to cover in this article.

The problem – Part 1

The base of the issue lies in the approach itself. Because we execute the activity through a raw API call, there are no checks going on under the hood for what to do in certain situations. Let’s take the template “Schedule ADF pipeline to start and stop Azure-SSIS IR just in time before and after running SSIS package”, provided by Microsoft in the template gallery inside Data Factory, as an example.

Now, this is pretty straightforward: start the IR, do something, and finally stop it again. However, because the Start activity doesn’t do any validation before POSTing to the REST API, you’ll be met by an error, as shown below, if the runtime is already running.

That’s problem #1 with this approach: if for some reason the integration runtime is already turned on when your scheduled data load is triggered, it will fail. Of course, someone might say ‘oh, but you can just allow the pipeline to continue the flow even if the start fails.’ And that’s right, we COULD, but what if it actually fails for real at start-up, and your integration runtime really isn’t ready? In a world where data is considered mission critical, I would not recommend living by chance, nor designing for it.

Fixing the issue – Part 1

Now that we’ve established an issue, let’s get around to fixing it.

In its most basic form, what we need to do before we start the IR is to check whether it’s already running. We can do that in ADF through a Lookup Activity.

Overview of the pipeline that handles starting the IR.

For those that didn’t know, the SSIS-IR in ADF utilises SQL Server Integration Services Scale Out, which was introduced with SQL Server 2017. This allows SSIS to operate in a master/worker topology, where the workload is spread out across workers, and it means we can use the system tables in SSISDB to query the status of our IR.

In order to query SSISDB, you will first need to create a Linked Service reference to the database. This can be done with the AutoResolveIntegrationRuntime that comes by default in ADF, unless your SSISDB is located on an Azure SQL Managed Instance with no public endpoint, in which case you’d need a Self-hosted Integration Runtime to query the database. You will also need a dataset on top of that linked service, which does not need to point to any specific table, as we only use queries throughout this. We query SSISDB using a Lookup Activity, with the following code as our query.

SELECT 
    COUNT(*) as ActiveWorkers
FROM catalog.worker_agents
WHERE LastOnlineTime >= DATEADD(MINUTE, -1, GETDATE())

The code itself is fairly simple: we look at the catalog.worker_agents table and count the number of active workers, filtered on the LastOnlineTime column. The table has records for every worker that has ever existed, so the filter only counts workers that have been online within the past minute. LastOnlineTime is only updated once per minute, so we cannot tighten the window any further. The query returns the number of active workers, which we will then use in an If Condition.
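As a rough sketch, the Lookup Activity behind this could be defined along these lines in the pipeline JSON. The activity name, the dataset name and the source type are placeholders of mine, so adjust them to match your own linked service and dataset.

{
    "name": "Get Active Workers",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT COUNT(*) AS ActiveWorkers FROM catalog.worker_agents WHERE LastOnlineTime >= DATEADD(MINUTE, -1, GETDATE())"
        },
        "dataset": {
            "referenceName": "SSISDB_Query",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}

With firstRowOnly set to true, the result is exposed to subsequent activities as output.firstRow.ActiveWorkers.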

The If Condition itself is an equals() based on the output of our previous activity.

The If Condition directs us down the right path based on that output. If the lookup returns 0, meaning no IR is currently running, the condition evaluates to true and runs the True activities, which contain the REST API call that starts the runtime. If it returns anything other than 0, meaning the IR is already running, it evaluates to false and runs the False activities. In my test scenario, I added a 1-second Wait there, just to make the flow easier to follow in the Monitor logs; this can be left empty if you prefer. The API call is as described in the Microsoft documentation mentioned above, using the sample URL with the required values filled in:

https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/start?api-version=2018-06-01
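Putting that together, the condition expression on the If Condition could look something like the sketch below. I’m assuming the Lookup Activity is named 'Get Active Workers', so swap in whatever your activity is actually called.

@equals(activity('Get Active Workers').output.firstRow.ActiveWorkers, 0)

For the Web Activity that performs the call, MSI authentication against the https://management.azure.com/ resource is typically the easiest route, provided the data factory’s managed identity has been granted permission (for example the Contributor role on the factory itself) to start and stop the integration runtime.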

With that, problem #1 is resolved: our pipeline can now be executed, and even if the IR is already running, it will show green all the way and not cause issues with our flow.

The problem – Part 2

With the starting of the IR fixed, let’s dive into issue #2, which is the stopping part.

In a world where more than one pipeline exists, stopping the IR could cause other pipelines to fail, because we simply pull the plug with the REST API. This one is a bit more complex than the starting, as it’s not just a simple “yes / no” check every time. There are also cases where we want the flow to terminate whatever is running, if we so desire. This means another variable has to be taken into the equation in order for us to handle this properly.

Fixing the issue – Part 2

Once again, my proposed solution contains two activities: a Lookup Activity and an If Condition with its sub-activities. In this pipeline, however, we will also implement a parameter, which allows us to overrule the check and forcibly terminate the IR if we so choose.

The same basic setup, but with a parameter that gets set at the time of execution.
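As a small sketch, the parameter could be declared on the pipeline like this; I’m assuming a Bool type named ForceStop with a default of false, so that a normal scheduled run never kills running packages unless explicitly told to.

"parameters": {
    "ForceStop": {
        "type": "Bool",
        "defaultValue": false
    }
}

The value can then be supplied by the trigger or a manual run, and referenced in expressions as pipeline().parameters.ForceStop.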

Once again, the lookup queries SSISDB. In addition to the script we used in the first step, we are now also interested in whether any packages are running. It’s worth noting that the catalog.executions table now has an undocumented column called worker_agent_id, which correlates to the RunId of the activity in the Data Factory Monitor that started the SSIS package. Sadly, there is no link between the worker_agents table and the executions table, so for now we are forced to handle them separately and assume a connection between them, though this might change in the future.

;WITH ActiveWorkers AS (
    SELECT
        WorkerAgentId
    FROM catalog.worker_agents
    WHERE LastOnlineTime >= DATEADD(MINUTE, -1, GETDATE())
)
SELECT
    COUNT(*) AS ActivePackages,
    (SELECT COUNT(*) FROM ActiveWorkers) AS ActiveWorkersCount
FROM catalog.executions
WHERE status IN (1, 2, 5); -- 1 = created, 2 = running, 5 = pending

In our If Condition, we now need to handle both the parameter and the output of the lookup. Thankfully, ADF has quite extensive expression support, so that’s manageable. The general flow of the expression goes something like this (if it evaluates to true, it will try to shut down the SSIS-IR; if false, it will do nothing):

  1. ActiveWorkersCount from our lookup needs to be greater than 0.
    • If it is 0, the runtime is already stopped and no action needs to be taken.
  2. If ForceStop is true, that takes precedence.
  3. Otherwise, if ActivePackages is 0 then true, else false.

In summary, there are two ways for it to evaluate to true: the SSIS-IR is running but no packages are active, or it is running and ForceStop is set to true.
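Translated into an ADF expression, the condition could look something like the sketch below. I’m assuming the Lookup Activity is named 'Get IR Status' and that ForceStop is the Bool parameter mentioned above, so adjust the names to match your own pipeline.

@and(
    greater(activity('Get IR Status').output.firstRow.ActiveWorkersCount, 0),
    or(
        pipeline().parameters.ForceStop,
        equals(activity('Get IR Status').output.firstRow.ActivePackages, 0)
    )
)

The outer and() ensures we only attempt a stop when at least one worker is online, and the inner or() lets ForceStop take precedence over the check for active packages.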

Conclusion

With this implemented, we’ve built on top of the REST APIs from Microsoft to create a more robust data platform. There are still a few holes in this, such as the fact that we can only check the status of the workers with up to a minute’s delay, but it’s a hell of a lot better than failing to begin with.

One could hope that eventually there will be a built-in activity to manage this integration runtime. My dream, however, is that at some point we never have to manage this at all; it should be possible to create a fully PaaS solution, where the IR spins up when an Execute SSIS Package activity runs and shuts down once we’re done, similar to how Databricks functions.

The JSON definitions of these two pipelines can be found on my GitHub: https://github.com/DanniJuhl/BlogContent

I hope this helps you manage the SSIS-IR better in Azure Data Factory v2, and allows you to better control the cost of your data platform in Azure!

Please leave a comment below or contact me on Twitter if you found this useful, or if you have any general feedback!

Beginning of a new chapter

Before I start on anything else, thank you for visiting my blog, and welcome! This is my very first post on this blog, and hopefully the first of many to come. I’m a big believer in aligning expectations from the very start, so the executive summary goes something like this: I work with Azure, so expect this blog to revolve primarily around Azure, and more specifically the data platform.

Back in 2018 I had a lovely tweeticussion with Randolph West (Blog | Twitter) about how best to approach getting into writing. They responded with this: “Writing takes two things: reading a lot and writing a lot!”

Looking back, that advice seems blatantly obvious, but for me at the time it wasn’t. I followed it, however, and started allocating time in my schedule to catch up on blogs and to be more outgoing on social media, to build a network within the existing blogging community. It went hand in hand with my desire to start blogging, but I still felt there was some uncertainty about how best to get started.

Then, when Andy Mallon (Blog | Twitter) wrote about his newest project, DataBlogger.info, I contacted him the same night, asking for advice on how to actually get started. He gave me some great advice, and even dug up some of the resources previously given to him, for which I’m eternally grateful.

The end result of that conversation is this: Dannijuhl.com.

Now, why am I starting a blog?

Mostly because I believe my work as a consultant sometimes puts me in a unique position, where I run into things that aren’t easy or straightforward to fix, and I want to document those things to help better the community, and myself. And of course, as Brent Ozar phrased it in his “How to start a blog” post: “Blogging for career success.” It’s not a direct reason for me, but I don’t see a scenario where it wouldn’t help out at some point.

“Beware of he who would deny you access to information, for in his heart he dreams himself your master.”

– Pravin Lal

My short-term goal is to post weekly, on Tuesdays, and then, based on traction and whether I can get myself into the rhythm, transition to bi-weekly within the next few months.

About me

Note: This bottom part will also find its home somewhere else on this site, but I’ve always been taught that it’s well-mannered to introduce yourself.

My name is Danni Juhl, and at the time of writing I am 28 years old, living on the outskirts of Aarhus, Jutland, Denmark. I work as a specialist at Inspari, a Danish consultancy specialising in business intelligence on the Microsoft stack. When I started nearly two years ago, it was primarily to join the SQL Server team, focusing on everything that comes with that: performance tuning, installations, administration, high availability and such. However, I quickly realised that another role was available in the company, and that was Azure. As with a lot of other companies, including the clients I was working with, the cloud was finally starting to look a bit more interesting. Companies were adopting cloud-first strategies, cloud was a bullet point in every implementation meeting, and most importantly of all, most people were all talk; when it came to actual execution, it was shaky at best. So that’s where I’ve been for the past two years: implementing Azure data platforms.