Building environments for Lab Manager: Why bare metal scripting fails

In the world of DevOps it’s all about the scripts: I’ve seen some great work done by some clever people to create complex environments with multiple VMs all from scratch using PowerShell. That’s great, but unfortunately in the world of Lab Manager it just doesn’t work well at all.

We’ve begun the pretty mammoth task of generating a new suite of VMs for our Lab Manager deployment to allow the developers and testers to create multi-machine environments. I had hoped to follow the scripting path and create these things much more on the fly, but it wasn’t to be.

I hope to document our progress over the next few weeks. This post is all about the aims and the big issues we have that make us take the path we are following.

Needs and Wants

Let’s start with our requirements:

A flexible, multi-server environment with a range of Microsoft software platforms to allow devs to work on complex projects.
All servers must be part of the same domain.
All products must be installed according to best practice – no running SharePoint as local service here!
Multiple versions of products are needed: SharePoint 2010 and 2013; CRM 4 and 2011; SQL 2008 R2 and 2012; Biztalk 2010 and 2013.
‘Flexible’ VMs running IIS or bare server 2008 R2/2012 are needed.
No multi-role servers. We learned from past mistakes: Don’t put SQL on the DC because Lab Manager Network Isolation causes trouble.
Environments must only consist of the VMs that are needed.
Lab Manager should present these as available VMs that can be composed into an environment: No saving complete environments until they have been composed for a project.
Developers want to be able to run the same VMs locally on their own workstations for development; Lab Environments are for testing and UAT but we need consistency across them all.

That’s quite a complex set of needs to meet. What we decided to build was the following suite of VMs:

Domain Controller. Server 2012.
SQL 2012 DB server. Server 2012.
SharePoint 2013 (WFE+APP on one box). Server 2012. Uses SQL 2012 for DB.
Office Web Apps 2013. Server 2012.
Azure Workflow Server (for SharePoint 2013 workflows). Server 2012.
CRM 2011 Server. Server 2012. Users SQL 2012 for DB.
Biztalk 2013 Server. Server 2012. Users SQL 2012 for DB.
IIS 8 server. Server 2012.
‘Flexible’ Server 2012. For when you just want a random server for something.
SQL 2008 R2 server. Server 2008 R2.
SharePoint 2010 (WFE+APP+OWA on one box). Server 2008 R2. Uses SQL 2008 R2 for DB.
CRM 4. Server 2008 R2. Uses SQL 2008 R2 for DB.
Biztalk 2010. Server 2008 R2. Uses SQL 2008 R2 for DB.
IIS 7.5 server. Server 2008 R2.
‘Flexible’ Server 2008 R2.

In infrastructure terms we end up with a number of important elements:

Our AD domain and DNS domain: .local.
For SharePoint 2013 Apps we need a different domain. For ease this is apps..local.
To ensure we can use SSL for our web sites we need a CA. This is used to issue device certs and web certs. For simplicity a wildcard cert (*..local) is issued.
Services such as SharePoint web applications all get DNS registrations. Each service gets an IP address and these addresses are bound to servers in addition to their primary IPs.
All services are configured to work on our private network. If we want them to work on the public network (Lab machines, excluding the DNS, can have multiple NICs and multiple networks) then we’ll deal with that once the environment is composed and deployed through Lab Manager.

Problems and constraints

The biggest problem with Lab Manager is the way Network Isolation works. Lab asks SCVMM to deploy a new environment. If network isolation is required (because you are deploying a DC and member servers more than once through many copies of the same servers) then Lab creates a new Hyper-V virtual network (named with GUID) and connects the VMs to that. It then configures static addresses on that network for the VMS. It starts with the DC and counts up.

My experience is that trying to be clever with servers that are sysprepped and then run scripts simply confuse the life out of Lab. You really need your VMs to be fully working and finished right out of the gate. Unfortunately, that mans building the full environment by hand, completely, and then storing them all with SCVMM before importing each VM into Lab.

There are still a few wrinkles that I know we have to iron out,even with this approach:

In the wonderful world of the SharePoint 2013 app model we need subdomains and wildcard DNS entries. We also need multiple IP addresses on the server. Right now we haven’t tested this with lab. What we are hoping is that we can build our environment on the correct address space as Lab uses. It counts up from 1, so our additional IPs will count down from 254. What we don’t know is whether Lab will remove all the IP addresses from the NICs when it configures the machines. If it does, then we will need to have some powershell that runs to configure the VMs correctly.
Since we don’t have to use all the VMs we have built in any given environment, DNS registration becomes important. Servers should register themselves with the DNS running on our DC, but we need to make sure we don’t have incorrect registrations hanging around.

Both of these areas will have to be addressed during this week, so I’ll post an update on how we get on.

Consistency is still key

Even though we can’t use scripts to create our environment from bare metal, consistency is still really important. We are, therefore, using scripts to ensure that we are following a set of fixed, replicable steps for each VM build. By leaving the scripts on the VM when we finished, we also have some documentation as to what is configured. We are also trying, where possible, to ensure that if a script is re-run it won’t cause havoc by creating duplicate configurations or corrupting existing ones.

Each of our VMs has been generated in SCVMM using templates we built for the two base operating systems. That avoids differences in OS install and allows me to get a new VM running in minutes with very little involvement. By scripting our steps, should things go badly wrong we can throw away a VM and run through those steps again. It’s getting trickier as we move forward, though; rebuilding our SQL boxes once we have SharePoint installed would be a pain.

Frustration begets good practice

In fact, one of the most useful things that has come out of this project so far is a growing set of robust powershell modules that perform key functions for us. They are things that we have scripted already, but in the past these have been scripts we have edited and run manually as part of install procedures. Human intervention meant we created simpler scripts. This week I have been shifting some of the things those scripts do into functions. The functions are much more complex, as they carefully check for success and failure at every step. However, the end result is a separation of the function that does the work and the parameters that change from job to job. The scripts will will create for any given installation now are simpler and are paired with an appropriate module of functions.

Deciding where to spend time

Many will read this blog and raise their hands in despair at the time we are spending to build this environment. Surely the scripted approach is better? Interestingly, our developers would disagree. They want to be able to get a new environment up and running quickly. The truth is that our Lab/SCVMM solution can push a new multi-server rig out and have it live and usable far quicker than we could do with bare metal scripts. The potential for failure using scripts if things don’t happen in exactly the right order is quite high. More importantly, if devs are sat on their hands then billable time is being wasted. Better to spend the time up front to give us an environment with a long life.

Dev isn’t production, except it is, sort of…

The crux of this is that we need to build a production-grade environment. Multiple times. With a fair degree of variation each time. If I was deploying new servers to production my AD and network infrastructure would already be there. I could script individual roles for new servers. If I was building a throwaway test or training rig where adherence to best practice wasn’t critical then I could use the shortcuts and tricks that allow scripted builds.

Development projects run into difficulties when the dev environment doesn’t match the rigor of production. We’ve had issues with products like SharePoint, where development machines have run a simple next-next-finish wizard approach to installation. Things work in that kind of installation that fail in a production best practice installation. I’m not jumping for joy over the time it’s taking to build our new rigs, but right now I think it’s the best way.