Betty
Betty

Reputation: 9189

Diagnosing TFS Build Hanging after 'Copy Files to Drop Location' step

I need some advice on how to diagnose a hanging build. It’s only been happening in the last week or two and I have good reason to suspect it’s something that I’ve done recently and not just a coincidence

Setup

Symptoms

Recent Changes

What I’ve Tried

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 3/02/2014 12:54:06 a.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130357641583538280 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: 0dc282b5-59a8-4941-b541-a4f7d314cd0f Process Details: Process Name: w3wp Process Id: 2508 Thread Id: 2504

Detailed Message: A request for service host XXXX has been executing for 37 seconds, exceeding the warning threshold of 30. Request details: Request Context Details Url: /tfs/XXXX/XXXX/_api/_build/stop?__v=4 Method: ApiBuild.stop Parameters: uri = vstfs:///Build/Build/34064 User Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 Unique Id: 00000000-0000-0000-0000-000000000000

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 30/01/2014 11:10:01 p.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130355232548668648 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: 0dc282b5-59a8-4941-b541-a4f7d314cd0f Process Details: Process Name: w3wp Process Id: 70320 Thread Id: 14540

Detailed Message: A request for service host XXXX has been executing for 37 seconds, exceeding the warning threshold of 30. Request details: Request Context Details Url: /tfs/XXXX/Build/v4.0/BuildService.asmx Method: StopBuilds Parameters: uris[0] = vstfs:///Build/Build/34051 uris = Count = 1 User Agent: Team Foundation (devenv.exe, 12.0.21005.1, Premium, SKU:16) Unique Id: 4d2d3213-fd41-4c4d-8ab0-b87619c96a42

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 31/01/2014 3:14:17 a.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130355232548668648 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: Process Details: Process Name: w3wp Process Id: 70320
Thread Id: 14540

Detailed Message: There are no active requests for service host XXXX that exceed the warning threshold of 30.

A quick google suggests upping the timeout in the tfs registry (http://xavierdilipkumar.com/post/2013/07/04/TFS-event-7005-and-7006-warning.aspx) I've tried that and it doesn't appear to change anything.

Upvotes: 9

Views: 5030

Answers (7)

Betty
Betty

Reputation: 9189

Also appears to happen if the build agent cannot connect to the build controller server on port 9191.

Easily testable with a telnet client.

Appears that my server decided it was on an unknown network and kicked the firewall into overdrive. (The second time I got this issue, not sure if this was the reason I got it the first time but it seems reasonable).

Upvotes: 0

Johann Blais
Johann Blais

Reputation: 9469

I had this issue when an activity tried to log a huge message in the build log (namely the FxCopCmd activity from the CodePlex TFS Build Extensions project).

The build agent would successfully finish the build but the controller had to chew the huge message into the build log, and it was silently crashing/hanging.

I was able to track the issue down by navigating to C:\Users\[TfsServiceAccount]\AppData\Local\Temp\BuildAgent\[AgentNumber]\Logs\[BuildNumber]\ActivityLog.xml.

The last build message was truncated and by looking at the content, I recognized the FxCop output. In my case, I just set the LogToConsole parameter to False for the FxCop activity in the build process template, and the build completed successfully.

Upvotes: 0

julealgon
julealgon

Reputation: 8162

We were having the same problem here, the builds were kept open even after successfully passing all workflow stages.

I logged into the build machine and noticed the build controller was "running 6 builds" for some reason, even though there were no builds at all showing in the queue in Visual Studio.

After restarting the controller, the next build worked the first time.

Just wanted to let this one here as a possible answer. I'm not sure yet why the controller had those stuck builds though.

Upvotes: 1

Betty
Betty

Reputation: 9189

I'm reluctant to mark this as an answer because I'm not entirely sure why it worked.

Suspecting something was wrong with the build machine I created a new Build Agent on a fresh install - the hanging issue still occurred.

I then added a Build Controller to that machine and noticed that new builds using that controller would complete. This suggested that there was a communication issue between either the BA and the BC, or the BA and the primary AT.

Given that our primary AT had other issues we decided to remove it from the picture, we switched the DNS to point at the second AT and disabled all services on the old primary. Instantly builds started to complete (including the ones that had been stuck for a number of days).

I still don't know which component was broken or why, especially since it worked fine in this configuration for a month prior. I can only assume there was either another change that I am not aware of, or the corruption of the primary AT was causing bigger issues.

Upvotes: 1

Mark Leighton Fisher
Mark Leighton Fisher

Reputation: 5693

You might fire up Sysinternals Process Monitor to see when the processes actually exit and what they were doing (Process Monitor monitors "real-time file system, Registry and process/thread activity").

Upvotes: 2

Bozman
Bozman

Reputation: 477

can you look in the tfs bs logs at

Event Viewer -> Applications and Services Logs -> Microsoft -> Team Foundation Server -> Build-Services -> Operational

these timeouts generally relate to permissions. you should look for TF215106 access denied events. Although the files appear to be there, are they all the current date or are there some with different (older) dates? Also are they any alerts/steps happening when the file drop occurs?

Other than that it could be timing out because one of the dependencies is being used by another service.

Upvotes: 3

Giulio Vian
Giulio Vian

Reputation: 8343

The best course of action is to call Microsoft Support and open a Service Request. Make sure it gets priority A - your TFS production environment is not working - and be prepared to give them support and access.

The only hint from the log is that call to ApiBuild.stop. It suggests that the build workflow completed, so the code hosting it is calling back to the AT to mark the build completed. As you have no warnings from previous calls, it could be some problem at the database level. You may try activating SQL Tracing but it's not a trivial task, as you should be able to compare the trace with a working one.

Good luck

Upvotes: 1

Related Questions