I think I’ve mentioned before how I’ve been updating our IT infrastructure. Company growth has meant a need for expanded services. Add to that new versions of SharePoint and Exchange, mix in a need to run virtual servers for development and you have a need for more tin.
Over the past six months I’ve expanded our domain to keep pace with our growing needs. The number of physical servers we have has increased, with a few more virtual servers for specific roles that I prefer to keep separate but which don’t really merit their own box.
As part of this growth, I added a second domain controller. Our existing DC was also running Exchange 2003, and this situation has caused me the most headaches in the sliding block puzzle of service upgrade and migration: We couldn’t demote the DC on our old server because of Exchange 2003, but I was reluctant to put in Exchange 2007 until I had redundancy of critical services (DC, DNS, etc).
Updating Domains, getting ready for Exchange
I will admit at this point that my knowledge of AD is not as deep as I would like, although it is increasing daily. That does mean, however, that I check before I leap - find articles on MSDN, TechNet and the wider blogosphere to find the pitfalls so I avoid pratfalls.
So, I read carefully about raising the functional level of the Forest and Domain when installing a 2003 R2 domain, made sure everything was patched and service packed before starting, read and re-read the instructions. When confident I had run through all the prerequisites I ran dcpromo to add my domain controller.
I was then left with two servers, both of which had the necessary tools to manage AD, both of which were registered in DNS as DCs, both of which appeared to be fine.
Nothing I read suggested that I needed to check anything else to make sure the process had completed… (You can see where this is going, can’t you…?)
Exchange 2007 - the big transition
Over the first weekend in April we transitioned from Exchange 2003 to Exchange 2007. Once again, I did my reading. I ran the Exchange Best Practices Analyzer and made sure that our Exchange 2003 installation was in tip-top condition. I compared two or three different sets of instructions on how to run through the process, settling on one from an Exchange community site because of some extra little nuggets of insight it contained.
The transition went relatively smoothly. The new server went in, was configured correctly and the Exchange 2007 site was connected to the Exchange 2003 site. Mailboxes were transferred (we had a problem with one, but we fixed it) and clients were checked to have connected to the new server.
Once happy, we uninstalled the old Exchange, as per instructions.
It took a full day, but we were being careful and thorough. We thought it had gone fine.
The next step would be to remove our old DC from the AD and decommission the server. Being cautious, we wanted to test that things wouldn’t stop if we removed the old DC, so we unplugged the network cable…
Chaos!
Everything stopped - Exchange clients disconnected, logons stopped, everything!
Is there a doctor in the house?
Stage one when hit with a problem - gather as much information as possible.
We looked at our systems, we checked logs, we watched the Outlook clients connecting to Exchange. When we disconnected our old DC, nothing seemed to want to talk to the new DC. I checked the Exchange server settings and made sure the server was set to use the new DC for its configuration and all seemed fine.
We noticed an error that the clients couldn’t connect to a Global Catalog server, so I did some more reading, realised that the old DC was our global catalog server and so followed the steps to change the role over to the new DC. Everything said it had worked, but nothing changed.
I did some more reading about role masters and set the new DC to be the master for each role - at least I thought I did - through the Active Directory Users and Computers tool. Still nothing.
At this point I decided that either I could spend days or weeks researching and prodding, or I could call in the cavalry. The support team we have access to as a Gold Partner are fantastic - I can never praise them enough - and sure enough I had people on the problem within an hour of logging the call.
Because we initially thought the problem was with our Exchange config, we dealt with a very efficient Exchange support guy. He worked methodically through the problem, and started to look deeper into our domain and DCs as he zeroed in on it being a domain issue.
At this point, I encountered the AD support tools being used in anger for the first time. I passed the support guys dozens of log files. We also discovered what appeared to be the problem - my new DC wasn’t really a DC!
That last statement is a bit too simplistic. Our new DC was happily replicating the AD. It reported everything being fine when examined with replmon. Both DCs agreed on their view of the world.
What I didn’t know was that in addition to the AD replicas, a NETLOGON share is created on the new DC by dcpromo. I also did not know that this process had failed - at no point did anything tell me. Because there was no share, the server was not dealing with client requests correctly, which is why our systems had a fit when I unplugged the old DC.
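With hindsight, a check along these lines would have shown the gap straight after dcpromo finished. This is only a minimal sketch: it assumes it runs on a Windows machine that can reach the DC, and the function names and the server name in the usage note are made up for illustration. It simply tests whether the SYSVOL and NETLOGON shares can be opened as UNC paths. A False result means "not reachable from here", which could also be a permissions or network problem rather than a missing share.

```python
import os
from typing import Callable, Dict, Iterable, List

# Shares a working domain controller is expected to publish.
EXPECTED_SHARES = ("SYSVOL", "NETLOGON")


def check_dc_shares(
    server: str,
    shares: Iterable[str] = EXPECTED_SHARES,
    probe: Callable[[str], bool] = os.path.isdir,
) -> Dict[str, bool]:
    """Return {share name: reachable?} for each expected share on `server`.

    `probe` is injectable so the logic can be exercised without a real DC;
    by default it asks the OS whether the UNC path is a directory.
    """
    return {share: probe(f"\\\\{server}\\{share}") for share in shares}


def missing_shares(results: Dict[str, bool]) -> List[str]:
    """List the shares that could not be reached."""
    return [name for name, reachable in results.items() if not reachable]
```

Calling `missing_shares(check_dc_shares("NEWDC"))`, where `NEWDC` is a placeholder for the new controller's name, should come back empty on a healthy DC.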
Peering into a deep, dark well
Having identified the fault, my Exchange guy called in an AD specialist to assist. He ably worked through the fault. There is a sequence of steps to follow which will trigger a rebuild of the NETLOGON share. We worked through them. They didn’t work. We knew they didn’t work because the share wasn’t created. Apart from a couple of event log messages which I didn’t consider to be helpful, nothing told us what was wrong.
Having failed to rebuild the share on the new DC, my AD ninja looked at the old DC. He decided to rebuild the same share on the existing DC, the thinking being that the replication was failing because of a fault on the source, rather than the destination. In order to do this, the domain group policies would be destroyed and rebuilt as defaults.
This process took some time, but to cut a very long story short, it appears that our default group policy objects were corrupted, which was blocking the replication. By deleting them and rebuilding the sysvol directory structure on our original DC, then forcing a rebuild on the new DC, the AD was fixed.
My eternal gratitude to the Microsoft support guys. My point, long and meandering though the journey has been, is this: At no point did I see anything which suggested corruption of those objects. At no point did I see anything which suggested they were the cause of the replication fault.
My toolbox is missing!
In order to get the information the support guys needed, I had to install first the Support Tools from the installation media and then the resource kit tools downloaded via the web. Those tools should have been installed by default, or at least should have been added when I created my new DC.
Even when I’d installed the tools, they didn’t really give me much information. Now, I will readily admit here that I am new to the tools, and continued reading will doubtless help me in this regard, but the key point is a simple one:
I can’t see what’s going on!
Shhh… say it quietly… NDS
I supported IT solutions including Novell servers for fifteen years before joining Black Marble. In my previous role we had some thirty servers with a fairly complex, but well structured NDS directory. Over those years, we had some problems with replication and corruption, and every time we did, we started with the same procedure: We watched.
What Active Directory is lacking, in my humble opinion, is an equivalent of the Novell DSTrace tool. DSTrace allows you to watch the activity of your directory replicas. By careful use of the various options you can configure your servers to show you replication traffic, requests and responses and more. Colour coding allows you to spot errors and warnings and after a while you start to see patterns in the mass of text. If we had an NDS problem we could use DSTrace to get a feel for the cause - you could see if there were corrupt objects which weren’t replicating between servers. You could even figure out which servers were right and wrong.
Once you’d seen the fault, the DSRepair tool allowed you to tackle it either with surgical precision or with heavy artillery. You could force a replication of an individual object, overwriting the corrupted copy, or use drastic measures like deleting a replica of the directory or a partition.
Where are those tools for Active Directory? If they exist, please tell me, because I’d like to get my hands on them. I can’t imagine dealing with huge installations of AD without that kind of toolset.
A wishlist…
What would I like to see then? I’m writing this post before I start rummaging around the web, and if I find examples of these tools I’ll post about them.
- A tool which checks the integrity of the directory and its objects, and identifies where replicas on different servers disagree.
- A tool that allows me to see all the AD traffic in real time - logging to a database might be useful, but just seeing the messages on screen would be a start. I want to be able to toggle different messages - errors, warnings, replication traffic, client requests and responses etc to get a feel for what works and what doesn’t.
- A tool to allow me to fix individual objects - to replace them from backup or to overwrite them with a copy from another replica (by far my preferred method).
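To make the first item a little more concrete, here is a rough sketch of the comparison I have in mind. It is purely illustrative: it assumes each replica has already been dumped into a plain dictionary of object name to attributes, and it says nothing about how you would get that dump out of AD. The names `Snapshot` and `compare_replicas` are my own invention, not part of any real tool.

```python
from typing import Any, Dict, List, Mapping

# Hypothetical snapshot format: object name -> {attribute: value}.
Snapshot = Mapping[str, Mapping[str, Any]]


def compare_replicas(a: Snapshot, b: Snapshot) -> Dict[str, List[str]]:
    """Report where two replica snapshots disagree.

    Returns the objects present on only one side, plus the objects
    present on both sides whose attributes differ.
    """
    names_a, names_b = set(a), set(b)
    return {
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
        "differing": sorted(
            name for name in names_a & names_b if a[name] != b[name]
        ),
    }
```

Even output this crude would have been a start: a list of objects that exist on one DC but not the other is exactly the kind of thing I never got to see.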
If this lot already exists then tell me. If there are good books on the subject then point me at them. I’ve found some support articles which are helpful, but not as much as I’d like. I’m not precious - if this all stems from a fundamental misunderstanding or lack of knowledge on my part I’m happy to admit my mistake. However, at this point I’m leaning more to it being an indication that AD still hasn’t matured to the level of NDS in terms of management and control.