This is a post so I don’t forget how I fixed access to one of our environments yesterday, and hopefully it will be useful to some of you.
We have a good many fairly complex environments deployed to our lab Hyper-V servers, controlled by Lab Manager. Operations such as starting, stopping or repairing those environments can take a very long time, but this time we had one that was definitely stuck. The lab view showed the many servers in the lab with green progress bars about halfway across, but after many hours we saw no progress. The trouble is that at this point you can’t issue any other commands to the environment from within the Lab Manager console: it’s impossible to cancel the operation and regain access to the environment.
Normally in these situations, stepping from Lab Manager to the SCVMM console can help. Stopping and restarting the VMs through SCVMM can often give Lab Manager the kick it needs to wake up. However, this time that had no effect. We then tried restarting the TFS servers to see if they’d got stuck, but that didn’t help either.
At this point we had no choice but to roll up our sleeves and look in the TFS database. You’d be surprised (or perhaps not) at how often we need to do that…
First of all we looked in the LabEnvironment table. That showed us our environment, and the State column contained a value of Repairing.
Next up, we looked in the LabOperation table. Searching for rows where the DataspaceId column value matched that of our environment in the LabEnvironment table showed a RepairVirtualEnvironment operation.
In the tbl_JobSchedule table we found an entry where the JobId column matched the JobGuid column from the LabOperation table. The interval on that was set to 15, from which we inferred that the repair job was being retried every fifteen minutes by the system. We found another entry for the same JobId in the tbl_JobDefinition table.
Starting to join the dots up, we finally looked in the LabObject table. Searching for all the rows with the same DataspaceId as earlier returned all the lab hosts, environments and machines that were associated with the Team Project containing the lab. In this table, our environment row had a PendingOperationId which matched that of the row in the LabOperation table we found earlier.
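To make the chain of lookups above easier to follow, here is a rough sketch of it as a single query. It runs against an in-memory SQLite stand-in rather than the real TFS database (which lives on SQL Server), and only models the columns mentioned in this post. The OperationId, OperationType and Name columns, and all the sample values, are my own assumptions for illustration; check the actual schema before relying on any of this.

```python
import sqlite3

# In-memory stand-in for the TFS tables described in the post.
# OperationId, OperationType and Name are assumed column names, and
# every value inserted below is made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LabEnvironment (DataspaceId INTEGER, State TEXT);
CREATE TABLE LabOperation (OperationId INTEGER, DataspaceId INTEGER,
                           OperationType TEXT, JobGuid TEXT);
CREATE TABLE tbl_JobSchedule (JobId TEXT, "Interval" INTEGER);
CREATE TABLE LabObject (DataspaceId INTEGER, Name TEXT,
                        PendingOperationId INTEGER);

INSERT INTO LabEnvironment VALUES (7, 'Repairing');
INSERT INTO LabOperation VALUES (42, 7, 'RepairVirtualEnvironment', 'job-guid-1');
INSERT INTO tbl_JobSchedule VALUES ('job-guid-1', 15);
INSERT INTO LabObject VALUES (7, 'LabHost1', NULL);
INSERT INTO LabObject VALUES (7, 'StuckEnvironment', 42);
""")


def trace_stuck_operation(conn, state="Repairing"):
    """Follow environment -> operation -> job schedule -> lab object."""
    return conn.execute(
        """
        SELECT obj.Name, e.State, o.OperationType, s."Interval"
        FROM LabEnvironment AS e
        JOIN LabOperation AS o ON o.DataspaceId = e.DataspaceId
        JOIN tbl_JobSchedule AS s ON s.JobId = o.JobGuid
        JOIN LabObject AS obj ON obj.PendingOperationId = o.OperationId
        WHERE e.State = ?
        """,
        (state,),
    ).fetchone()


result = trace_stuck_operation(conn)
print(result)
```

Each JOIN corresponds to one of the matches we made by eye: DataspaceId between LabEnvironment and LabOperation, JobGuid to JobId, and PendingOperationId back to the operation row.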
We took the decision to attempt to revive our stuck environment by removing the stuck job. That would mean carefully working through all the tables we’d explored and deleting the rows, hopefully in the correct order. As the first part of that, we decided to change the value of the State column in the LabEnvironment table to Started, hoping to avoid crashing TFS should it try to parse all the information about the repair job we were about to slowly remove.
Imagine our surprise, then, when having made that one change, TFS itself cleaned up the database, removed all the table entries referring to the repair environment job and we were immediately able to issue commands to the environment again!
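For reference, the single change we made amounts to one UPDATE on the LabEnvironment table. The sketch below again uses an in-memory SQLite stand-in with made-up rows, and the Id column used to pick out the environment is an assumption on my part. The function name is hypothetical too. It only shows the shape of the change; the cleanup that followed was done by TFS itself and is not reproduced here.

```python
import sqlite3

# In-memory stand-in for LabEnvironment; the Id column is assumed
# and the rows are sample data, not values from our real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LabEnvironment "
    "(Id INTEGER PRIMARY KEY, DataspaceId INTEGER, State TEXT)"
)
conn.executemany(
    "INSERT INTO LabEnvironment VALUES (?, ?, ?)",
    [(1, 7, "Repairing"), (2, 7, "Started")],
)


def mark_environment_started(conn, environment_id):
    """Set one stuck environment's State from Repairing to Started."""
    with conn:  # commits on success, rolls back if we raise
        cur = conn.execute(
            "UPDATE LabEnvironment SET State = 'Started' "
            "WHERE Id = ? AND State = 'Repairing'",
            (environment_id,),
        )
        if cur.rowcount != 1:
            raise RuntimeError(
                f"expected to change 1 row, changed {cur.rowcount}"
            )
    return cur.rowcount


changed = mark_environment_started(conn, 1)
states = [row[0] for row in conn.execute(
    "SELECT State FROM LabEnvironment ORDER BY Id"
)]
print(changed, states)
```

Guarding on the current State and checking the row count means the update does nothing, and rolls back, if it would touch anything other than the one stuck environment.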