MEMCM Bug! Enhanced HTTP Causes Degraded BranchCache Performance

2Pint Software

This is a very long and complicated blog post. We haven’t gotten to the bottom of everything yet, but we have been working on this for a few months now, and we finally have a pretty good idea of what is going on! Read on to see if this could affect your environment and what you can do about it.

TL;DR – Will This Affect Me?

ISSUE: Windows clients with BranchCache enabled experience random crashes and ‘behave oddly’ when Enhanced HTTP is enabled on your MEMCM site. This then spreads to corrupt the local BranchCache caches until there is no peer-to-peer goodness left at all. So if you ARE using Enhanced HTTP and BranchCache, read on. If not, exit here, or read on anyway, it’s kinda interesting and you might learn something 🙂

So what on earth is going on?

When an MEMCM Task Sequence is in play, it uses WINHTTP for the downloads. Unfortunately, with Enhanced HTTP support in place, the interaction with BranchCache causes really odd results. This can lead to Task Sequences that stop working, etc. So, if Enhanced HTTP is the only thing you have (AND you have no Network Access Account) you can even get to the point where NO data can be downloaded from your DPs, no patches, no nothing. Of course it can be a little random too, so that really helps… NOT.

But this is not the whole story: it can also severely affect the way you get content down from MPs and create all sorts of odd communication errors in ANY place where an HTTP GET call goes between clients and your site servers with BranchCache enabled. The error is very hard to troubleshoot and can affect things in very strange ways. This is one of the scenarios where having split MP and DP roles on different machines REALLY helps. So please people, split the MP and DP roles for the sake of your own sanity.

How about an example? Look at the following piece of log, where the Task Sequence engine tries to download a sample package, and the underlying WinHTTP call returns error 80072efd. An error like 8007 2efd breaks down as 8007 (generic Windows networking) plus the sub error code 2efd in hex, which is 12029 in decimal. From experience we know that most errors in the 12000 range are WinSock/WinHTTP or “Internet Error” codes. 12029 indicates “A connection with the server could not be established”, which is odd, as the error can happen mid-download (when we need to instruct the BranchCache server of the issue).
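If you want to double-check that kind of breakdown yourself, a couple of lines of PowerShell will do the hex maths (just a sketch, nothing ConfigMgr-specific about it):

    # Split an error like 80072efd into its facility (8007) and Win32 sub-code (2efd)
    $hresult   = 0x80072EFD
    $win32Code = $hresult -band 0xFFFF            # keep the lower 16 bits: 0x2EFD
    '0x{0:X} = {0} decimal' -f $win32Code         # prints: 0x2EFD = 12029 decimal

12029 maps to ERROR_WINHTTP_CANNOT_CONNECT, i.e. the “A connection with the server could not be established” text quoted above.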

This means the issue is not really a ConfigMgr issue; it’s the underlying infrastructure creating the problems. (Although setting very, very long HTTP headers is not really “best practice”, the RFC does not limit their length.) Then again, one can say the issue is owned by ConfigMgr, as they broke it.

Things in play here:

This is a somewhat complicated issue, with many moving parts, so let’s try to break it down:

  • When using Enhanced HTTP in ConfigMgr, HTTPS is used with a sort of “self-signed” certificate; it’s what’s used to talk to the CCMTOKEN_ based virtual directories on the Distribution Point.
  • CCMTOKEN paths are often also used when PKI is in use; however, the PKI infrastructure in CM seems to slap the PKI-based client cert onto the calls OK (at least for BITS). So PKI environments seem unaffected.
  • In regular Enhanced HTTP mode, the authentication is not handled by regular client certificate communication, but via HTTP headers transferring the authentication data.
  • These headers cause issues in conjunction with BranchCache requesting hash creation on the server side, or notifying the server of missed P2P bytes.
  • The issue seems to confuse WINHTTP.DLL which then corrupts the BranchCache stack on the client and then things get ugly fast…
  • BITS seems to be unaffected by this, as long as the BranchCache stack is not corrupted via regular WINHTTP calls. More on this later.

Ok, give me a one-liner of the situation here, 2Pint people, I am soo confused I can hardly think!

OK, here it goes: the way that Enhanced HTTP interacts with BranchCache, caches get corrupted and downloads break down, unless something is changed.

Am I affected?

If you have enabled Enhanced HTTP on your site and have any Windows machines using BranchCache you should take action!
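A quick way to see where a client stands is the built-in BranchCache PowerShell module (a rough sketch, run elevated; treat the property names as a pointer rather than gospel, they can vary slightly between Windows builds):

    # Is BranchCache enabled on this client at all, and in which mode?
    (Get-BCStatus).BranchCacheIsEnabled
    Get-BCClientConfiguration | Select-Object CurrentClientMode

If that comes back enabled and in Distributed mode, and your site has Enhanced HTTP turned on, the rest of this post applies to you.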

I am affected, what can I do to fix this?

Choose your favorite approach for the fix. One is enough, you only have to swallow one of the following bitter pills:

  • Stop using Enhanced HTTP and get that PKI of yours set up – hihi. (Joking aside, this was actually what some people were told to do.)
  • Stop using BranchCache – then wait for the call from your network team and start to update the resume 🙂
  • Stop using Windows and migrate to Linux – harr harr!
  • Go all modern and migrate all to Intune – *giggles*.
  • Disable BranchCache’s WINHTTP integration, run the 2Pint BITSACP (from the OSDToolkit) in every TS so that BITS downloads the content, then complain to your Microsoft rep to get a ‘proper fix’ for this issue. Read more on this below.

Note: Most browsers still use the WININET integration, so you are OK to continue using that.

How can I tell if I am affected?

We used two different ways of detecting this: one is to look at the corrupted performance counter values, the other is to query content in the client caches using our BCMon tool. Interesting, huh? So let’s see what this looks like in real life. Let’s look at the perf counter first:

Maybe hard to spot for a layman, but the yellow marking makes it easy to see what we are looking at. But what does it mean? Ok, so the “Cache partial file segments” counter indicates segments (hashed chunks) that have not been processed yet, or where something went wrong. It typically shows “some” data during an active download, and in a perfect world it then dies down to zero. As an Install.wim of about 3.5GB consists of about 35000 hashes, that is a hash-to-kilobyte ratio of one hash per ~100KB. So a value of 4292967286 missed hashes indicates that BranchCache thinks it missed data amounting to about 4292967286 * 100KB = a whopping 419235086MB, which is just a little shy of 400TB. Something is not right.
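If you want to replay that back-of-the-envelope maths, it is a one-liner (same rough assumption of ~100KB per hash/segment as above):

    # Rough data volume implied by the bogus counter value, at ~100KB per segment
    $missedSegments = 4292967286
    '{0:N0} TB of "missed" data - clearly nonsense' -f ($missedSegments * 100KB / 1TB)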

On an OK system, with the example WinPE image downloaded, this is what it should look like:

Next test. We also saw that BranchCache downloaded things into the caches but the clients never peered with each other, so what on earth is going on there? What led us onto this test was that the BranchCache “CurrentActiveCacheSize” was marginally smaller than on OK systems, which didn’t make any sense. Let’s have a look. In this case, we downloaded the same CI (Content Identifier, remember, a bunch of hashes, right?) using BCMon (from the NON-CCMTOKEN path), and then queried the cache to see if it was OK. The result surprised us greatly: on machines failing to BranchCache, the cache was filled with crap. Nothing could be put together correctly.

Ok, let’s fire up BCMon and query a local cache:

In the picture above, we see that we have 254901271 bytes in cache, but nothing can be reused, it’s all just gibberish. This is visualized by the gray blob, indicating low (almost non-existent) reuse from the cache. The small red pieces just indicate that the BranchCache file itself would de-dupe (you would get the same result on an empty cache).

A proper result should look like this:

Another good way to tell is an army of RED entries when accessing content from the CCMTOKEN paths in logs showing access to the servers, in this case taken from the SMSTS.LOG for a TS:
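One last quick check: the “CurrentActiveCacheSize” mentioned earlier can be read without any extra tools, if you want to compare a suspect machine against a healthy one (a sketch using the in-box cmdlets; the exact property layout can differ slightly between Windows versions):

    # Current amount of data BranchCache thinks it holds in the local cache
    (Get-BCDataCache).CurrentActiveCacheSize

    # The classic netsh view shows the same cache figures
    netsh branchcache show status all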

Where can I get more information about this?

Please contact your dedicated Microsoft PFE. If you fail to find one that speaks BranchCache, there is always the option of breathing into a brown paper bag while hyperventilating and then reaching out to the ConfigMgr community. Or ping the 2Pinters! You are not alone…

When will this be fixed? We heard Jan/Feb timeframe?

We can neither confirm nor deny that statement.

Why hasn’t anyone else reported this?

People have, just ask Sune. But also, this was a tricky one. Most customers won’t notice until way later. When the majority of downloads fail against the CCMTOKEN paths, clients are saved by the other virtual directories and the fact that they have a Network Access Account. Customers on PKI typically should never run into this. As the failure rate greatly affects speed, most people just see that downloads take forever. Which leaves the poor sods that enabled Enhanced HTTP and got rid of their Network Access Account: they are pretty much dead in the water.

Ok, enough whining people, what can we do?

So, BITS seems to be immune to this as long as WINHTTP doesn’t corrupt the BranchCache stack first. In our testing, we disabled BranchCache for WINHTTP only and then did nothing else. This seems to have fixed the issue. But as the issue doesn’t always happen, we can’t guarantee that. Anyone that wants to try this out, give it a test (after testing in the LAB, then QA, then after raising it in the change board):

  1. Disable the BranchCache integration for WINHTTP – a low-risk operation, no need to reboot. Set the following registry key (don’t forget the 32-bit WOW6432Node space if you have 32-bit apps); a scripted example follows after this list:
    KEY: Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\WinHttp
    Value: DisableBranchCache
    Data: 1
  2. Inventory the BranchCache performance counters and make sure they show nothing odd. You can’t do this by checking the registry, so you have to query the counters; the following PowerShell will do:
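For step 1, this is roughly what it looks like scripted (a sketch only, test before deploying; the WOW6432Node part only matters for 32-bit processes on 64-bit Windows):

    # Disable the BranchCache integration in WinHTTP (64-bit registry view)
    $key = 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\WinHttp'
    if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
    New-ItemProperty -Path $key -Name 'DisableBranchCache' -Value 1 -PropertyType DWord -Force | Out-Null

    # And the 32-bit view, for 32-bit apps on a 64-bit OS
    if ([Environment]::Is64BitOperatingSystem) {
        $key32 = 'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Internet Settings\WinHttp'
        if (-not (Test-Path $key32)) { New-Item -Path $key32 -Force | Out-Null }
        New-ItemProperty -Path $key32 -Name 'DisableBranchCache' -Value 1 -PropertyType DWord -Force | Out-Null
    }

And for step 2, querying the counters from PowerShell could look something like this (again a sketch; the one to watch is the “Cache partial file segments” counter discussed earlier, which should be a small number that drops back to zero):

    # Dump all BranchCache performance counters, largest values first
    Get-Counter -Counter '\BranchCache\*' |
        Select-Object -ExpandProperty CounterSamples |
        Sort-Object CookedValue -Descending |
        Select-Object Path, CookedValue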

As the above value is showing 4294967244, this indicates that this machine surely had some issues and needs some TLC. As new downloads start, the value will be reset, so keep querying to make sure you have “sensible” data. If you keep seeing high values (but not that high), your BranchCache infrastructure is toast regardless, and we can help you fix that.

Summary

  1. It can be bad – but not terminal! If the above made no sense whatsoever, please feel free to ping us and we will help if we can.
  2. This is not over – we will be publishing more info as we uncover it!

 

//A


