Monday, July 11, 2011

Sluggish site, multiple authentication prompts - caused by stack overflow in IIS application pool worker process

On one of my servers, users were reporting:
  • Site was sluggish
  • Sometimes several authentication prompts

I checked the server, and here is what I found:
In the ULS log, messages logged by mssdmn.exe, including:
CSTS3Accessor::Init: InitRequest failed for URL http://my-server/clientxyz/Pages/Home.aspx Return error to caller, hr=80041204  [sts3acc.cxx:546]  d:\office\source\search\native\gather\protocols\sts3\sts3acc.cxx

In the System log, regular Warning events logged by WAS, ID 5011 with the message:
A process serving application pool 'SharePoint - [my-server]80' suffered a fatal communication error with the Windows Process Activation Service. The process id was '12028'. The data field contains the error number.

In the Application log, regular Information events logged by Windows Error Reporting, ID 1001 with the message:
Fault bucket , type 0
Event Name: CLR20r3
Response: Not available
Cab Id: 0

Problem signature:
P1: w3wp.exe
P2: 7.5.7601.17514
P3: 4ce7afa2
P4: Microsoft.SharePoint
P5: 14.0.0.0
P6: 4bad8a7a
P7: 5cc9
P8: 0
P9: System.StackOverflowException
P10: 

Attached files:

These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_w3wp.exe_69fc9436e8b5896ded626b3ecf45dce746b517_3ed230ba

Analysis symbol: 
Rechecking for solution: 0
Report Id: 245f9c35-aaa0-11e0-96d0-f4ef7acc1a53
Report Status: 4

All of these occurred every 15 minutes, which happens to be the frequency of my incremental crawls.
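The "every 15 minutes" observation can be checked rather than eyeballed. Below is a minimal Python sketch that takes event timestamps (for example, copied from the Event Viewer) and tests whether the typical gap between them matches a schedule. The function names and the sample timestamps are made up for illustration; they are not taken from my logs.

```python
from datetime import datetime, timedelta
from statistics import median


def median_gap_minutes(timestamps):
    """Median gap, in minutes, between consecutive events."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]
    return median(gaps)


def matches_schedule(timestamps, interval_minutes, tolerance_minutes=1.0):
    """True if the median gap is within the tolerance of the given interval."""
    gap = median_gap_minutes(timestamps)
    return abs(gap - interval_minutes) <= tolerance_minutes


# Illustrative timestamps only: roughly 15 minutes apart with a few seconds of jitter.
start = datetime(2011, 7, 11, 9, 0, 12)
crash_times = [
    start + timedelta(minutes=15 * i, seconds=jitter)
    for i, jitter in enumerate([0, 20, 5, 31, 12])
]

print(matches_schedule(crash_times, interval_minutes=15))
```

Using the median rather than the mean keeps one missed or delayed crawl from skewing the result.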

What can I conclude from this evidence? The w3wp.exe worker process is crashing. This would explain the users' various reports of sluggish behaviour and repeated authentication prompts. Now to figure out why the application pool process is crashing.
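The conclusion rests mostly on the problem signature in the Windows Error Reporting event. Here is a small Python sketch that pulls the P1 to P10 fields out of the event text so the relevant ones stand out. The field labels reflect my understanding of how CLR20r3 buckets are usually laid out, so treat them as an assumption, and the function name is my own.

```python
import re

# Assumed meaning of the CLR20r3 signature fields; unlisted fields keep their "Pn" name.
CLR20R3_FIELDS = {
    "P1": "application",
    "P2": "application_version",
    "P3": "application_timestamp",
    "P4": "faulting_assembly",
    "P5": "assembly_version",
    "P6": "assembly_timestamp",
    "P7": "method_def",
    "P8": "il_offset",
    "P9": "exception_type",
}

SIGNATURE_LINE = re.compile(r"^[ \t]*(P\d+):[ \t]*(.*)$", re.MULTILINE)


def parse_problem_signature(report_text):
    """Return a dict of the 'Pn: value' lines found in a WER event message."""
    signature = {}
    for key, value in SIGNATURE_LINE.findall(report_text):
        signature[CLR20R3_FIELDS.get(key, key)] = value.strip()
    return signature


# The signature from the Application log entry above.
sample = """Problem signature:
P1: w3wp.exe
P2: 7.5.7601.17514
P3: 4ce7afa2
P4: Microsoft.SharePoint
P5: 14.0.0.0
P6: 4bad8a7a
P7: 5cc9
P8: 0
P9: System.StackOverflowException
P10: 
"""

sig = parse_problem_signature(sample)
print(sig["application"], sig["faulting_assembly"], sig["exception_type"])
```

Read this way, the event says that w3wp.exe died with a System.StackOverflowException raised inside Microsoft.SharePoint, which is the stack overflow in the title of this post.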

Some searching turned up the TechNet article "Event ID 5011 - IIS Application Pool Availability" under "Troubleshoot Windows Server 2008 R2". Following it, I downloaded and installed Debug Diagnostics x64. So far so good, but for some reason it ran only in Analyze Only mode, and the documentation and help screens didn't match the application. So I tried the 32-bit version. That one matched the documentation, but it was unable to attach to my w3wp.exe processes. Searching some more, I found that Debug Diagnostics is not supported on Windows 2008 R2 (major WTF moment!) and that I should follow "How To: Collect a Crash dump of an IIS worker process on IIS 7.0 (and above)" instead. That sent me looking for WERCON. Well, of course this didn't exist on my server either, but I could read the .wer file with Notepad. It didn't give me much more information than the Application log.

Frustrated, I gave up on the diagnostics and tried a little experimenting instead. Was my FAST search crawler overloading the application pool? I created a Crawler Impact Rule to limit the crawler to one document request at a time. This did not help.

Even more frustrated, I decided to see how much of an impact the crawl was having. I went back to the ULS log, looked at the URLs that the crawler had failed on, and tried them out in a browser. Jackpot! The very first address I tried crashed the worker process. I tried it a few more times, and each time I got the same crash.
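Picking the candidate URLs out of a large ULS log by hand is tedious. Here is a minimal Python sketch that collects every URL from "InitRequest failed for URL" messages and counts how often each one appears. The pattern is based only on the single message format shown earlier in this post, and the function name is my own, so adjust both to whatever your log actually contains.

```python
import re
from collections import Counter

# Matches the URL in messages such as:
#   CSTS3Accessor::Init: InitRequest failed for URL http://... Return error to caller
FAILED_URL = re.compile(r"InitRequest failed for URL (\S+)")


def failed_crawl_urls(uls_text):
    """Count how often each URL appears in 'InitRequest failed' messages."""
    return Counter(FAILED_URL.findall(uls_text))


# The ULS message quoted earlier in this post.
sample = (
    "CSTS3Accessor::Init: InitRequest failed for URL "
    "http://my-server/clientxyz/Pages/Home.aspx Return error to caller, "
    "hr=80041204  [sts3acc.cxx:546]"
)

for url, count in failed_crawl_urls(sample).most_common():
    print(count, url)
```

The URLs at the top of the list are the ones worth trying in a browser first.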

Curious, I did more digging. The site was created at roughly the same time the WAS warning messages started showing up in the System log.

So, it turns out that one of my sites was crashing the w3wp.exe worker process, and every time the search crawler ran, it tried to crawl that site and triggered the crash. To work around the problem, I added a new Crawl Rule to exclude the culprit site. This appears to have worked.
