External DNS queries dropped, causing timeouts and unresponsive services: web browsing, Office 365, Windows Virtual Desktop (WVD), etc.
(Resolution at the end…)
Azure based environment.
Classic AD with a few VM based Domain controllers in Azure.
The DCs act as the primary DNS service for all servers, services and clients in the environment.
Azure firewall is configured as the primary firewall solution for the Azure based environments.
The subnet(s) where the DCs are located are routed (UDR) to use the Azure Firewall for outgoing internet traffic.
The DCs' IPs are then added as the primary and secondary DNS for all vNets in Azure, for office subnets, VPN, etc.
The DNS service on the DCs is configured, as is the default, to resolve external DNS queries through either a custom forwarder or the root hints.
As the workload increases, you may notice congestion in the Azure Firewall: at a certain volume of forwarded external DNS queries, its UDP SNAT utilization reaches 100% and it chokes.
As a user, you may notice this when simple internet pages no longer open in a browser. If you use WVD, the environment may stop responding altogether due to the lack of DNS service in the WVD host pool.
As an admin, you may see that repeated nslookup queries for an external address intermittently time out.
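The repeated lookups can also be scripted. Below is a minimal Python sketch, standard library only, that sends the same A-record query a number of times and counts how many go unanswered. The function names, the attempt count and the timeout are our own choices for illustration and not part of any Microsoft tooling. Keep in mind that a name already cached on the DNS server is answered from the cache and never reaches the firewall.

```python
import socket
import struct


def build_query(name: str, txid: int) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def count_unanswered(server: str, name: str,
                     attempts: int = 50, timeout: float = 2.0) -> int:
    """Send `attempts` queries to `server` and count those with no reply."""
    unanswered = 0
    for i in range(attempts):
        # A fresh socket per query means a fresh source port each time.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_query(name, i & 0xFFFF), (server, 53))
                sock.recvfrom(512)
            except OSError:
                # Timeout or another socket error: treat as unanswered.
                unanswered += 1
    return unanswered
```

Calling `count_unanswered("<IP of a DC>", "example.com")` during office hours and again in the evening gives a rough comparison. A non-zero count only during busy hours would be consistent with the symptom described here, though it does not prove the cause on its own.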
DNS uses UDP, and Azure Firewall uses SNAT to translate the address of every internal source. As a result, every UDP request from one IP to an external provider (22.214.171.124, 126.96.36.199, 188.8.131.52 etc.) uses one of the roughly 65,000 UDP ports available for that unique destination. When DNS requests to the same address (say google.com) are repeated, every request gets its own port, and after a while all 65,000 are taken. This congests the firewall, and it starts to drop requests. Two requests to google.com and two to microsoft.com will thus use 4 ports, while single requests to 1000 different addresses will use only one port.
In our case, once enough users and systems had been migrated, the firewall reached 100% UDP SNAT utilization during office hours. This caused timeouts on our DNS servers (the DCs), which in turn caused services and systems to fail.
The issue can clearly be seen starting at the beginning of the day on Thursday July 2nd, again on Friday July 3rd, stopping over the weekend, then returning on Monday July 6th. The graph also shows that we have users in Europe as well as in America, which prolongs the period a bit.
After some troubleshooting, a severity A case with Microsoft Premier Support, and help from Microsoft FastTrack, we had two ways to quickly mitigate the problem and get the services operational again.
- For us, the fastest fix to implement was to add a second public IP to the Azure Firewall, doubling the number of ports available for SNAT to 130,000 (65,000 x 2). REMEMBER: you cannot control which public IP a given connection goes out on, because Azure Firewall randomly selects the public IP to use for outgoing traffic. So if any of your internal servers or systems connect to an external service where the Azure Firewall's public IP is whitelisted, you have to add the second IP to that whitelist as well, in every place that has it whitelisted.
This immediately reduced the UDP SNAT utilization from 100%+ to about 60-70%.
- The best solution, which we implemented a few minutes later, is to add Microsoft's Azure ‘public’ DNS IP as the forwarder on our DNS servers (the DCs). This is an Azure datacenter service, so it is reached over the Azure backbone and the traffic never goes through the Azure Firewall at all. This reduced the UDP SNAT load from 60-70% to 0%!
The IP to use is: 168.63.129.16 (This is only accessible from Azure environments, do NOT use it on on-premises DNS servers!)
The utilization here dropped all the way to 0%, since all DNS queries now go to the Azure ‘internal’ DNS.
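The numbers from the first mitigation can be checked with some back-of-the-envelope arithmetic. The sketch below is only that: it takes the 65,000 ports per public IP used in this post at face value (the real allocation per public IP may differ, so check the current Azure documentation) and assumes the utilization metric is a plain share of all available ports. The function names are ours, chosen for illustration.

```python
# Figure used in this post; an assumption, not an official Azure limit.
PORTS_PER_PUBLIC_IP = 65_000


def snat_capacity(public_ips: int) -> int:
    """Total SNAT ports available across all public IPs on the firewall."""
    return public_ips * PORTS_PER_PUBLIC_IP


def implied_demand(utilization_percent: int, public_ips: int) -> int:
    """Ports in use, given a utilization percentage and a public IP count."""
    return utilization_percent * snat_capacity(public_ips) // 100


# With two public IPs, 60-70% utilization corresponds to:
low = implied_demand(60, 2)
high = implied_demand(70, 2)
print(snat_capacity(2), low, high)
```

Under these assumptions, 60-70% of 130,000 ports is 78,000 to 91,000 ports in use, which is more than the 65,000 a single public IP offers. That would fit with the firewall sitting at "100%+" before the second IP was added, and it also suggests the second IP only bought headroom, which is why moving the forwarder off the firewall path was the better fix.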
Microsoft Azure DNS IP: 168.63.129.16
Azure Firewall SNAT private IP address ranges
Deploy an Azure Firewall with multiple public IP addresses using Azure PowerShell
Thomas Vuylsteke – Microsoft Azure Fasttrack team
Microsoft Premier Support
Akelius Residential Property AB (Martin Supan, Mattias Segerström)
Thomas Odell Balkeståhl on LinkedIn