A Few Gotchas About Going Multi-Cloud with AWS, Microsoft Azure and HashiCorp tools.

One of the more interesting types of work we do at Contino is helping our clients make sense of the differences between AWS and Microsoft Azure. While the HashiCorp toolchain (Packer, Terraform, Vault, Vagrant, Consul and Nomad) has made provisioning infrastructure a breeze compared to writing hundreds of lines of Python, it almost makes achieving a multi-cloud infrastructure deployment seem too easy.

This post will outline some of the differences I’ve observed with using these tools against both cloud platforms. As well, since I used the word “multi-cloud” in my first paragraph, I’ll briefly discuss some general talking points on “things to consider” before embarking on a multi-cloud journey at the end.

Azure and ARM Are Inseparable

Providers and builders are among the core features that make Terraform and Packer tick. These allow third parties to write their own “glue” code that tells Terraform how to create VMs or Packer how to create machine images. This way, Terraform and Packer simply become “thin clients” for your desired platform. HashiCorp’s decision to move provider code out of the Terraform binary in version 0.10 emphasizes this.

So when you create VMs with Terraform or machine images with Packer on AWS, you’re really asking the AWS SDK for Go to do those things. This is mostly the case with Azure as well, with one big exception: the Azure Resource Manager, or ARM.

ARM is more-or-less like AWS CloudFormation. You create a JSON template of the resources that you’d like to deploy into a single resource group along with the relationships that should exist between those resources and submit that into ARM as a deployment. It’s pretty nifty stuff.

However, instead of Terraform or Packer using the Azure Go SDK directly to create these resources, they both rely on ARM through the Azure Go SDK to do that job for them. I’m guessing that HashiCorp chose to do it this way to avoid rework (i.e. “why create a resource object in our provider or builder when ARM already does most of that work?”). While this doesn’t have too many implications for how you actually use these tools against Azure, there are some notable differences in what happens at runtime.

Azure Deployments Are Slower

My experience has shown me that the Azure ARM Terraform provider and Packer builder take noticeably more time to “get going” than their AWS counterparts do, especially when using Standard_A class VMs. This can make testing code changes quite tedious.

Consider the template below. This uses a t2.micro instance to provision a Red Hat image with no customizations.

{
  "description": "Basic RHEL image.",
  "variables": {
    "access_key": null,
    "secret_key": null
  },
  "builders": [
    {
      "type": "amazon-ebs",
      "access_key": "{{ user `access_key` }}",
      "secret_key": "{{ user `secret_key` }}",
      "region": "us-east-1",
      "instance_type": "t2.micro",
      "source_ami": "ami-c998b6b2",
      "ami_name": "test_ami",
      "ssh_username": "ec2-user",
      "vpc_id": "vpc-8a2dbbf2",
      "subnet_id": "subnet-306b673c"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": [
        "#This is required to allow us to use `sudo` from our Packer provisioner.",
        "#This is enabled by default on all RHEL images for \"security.\"",
        "sudo sed -i.bak -e '/Defaults.*requiretty/s/^/#/' /etc/sudoers"
      ]
    },
    {
      "type": "shell",
      "inline": ["echo Hey there"]
    }
  ]
}

Assuming a fast internet connection (I did this test with a ~6 Mbit connection), it doesn’t take too much time for Packer to generate an AMI for us.

$ time packer build -var 'access_key=REDACTED' -var 'secret_key=REDACTED' aws.json
==> amazon-ebs: Creating temporary security group for this instance: packer_5a136414-1ba5-7c7d-890c-697a8563d4be
==> amazon-ebs: Authorizing access to port 22 from 0.0.0.0/0 in the temporary security group...
==> amazon-ebs: Launching a source AWS instance...
==> amazon-ebs: Adding tags to source instance
amazon-ebs: Adding tag: "Name": "Packer Builder"
...
amazon-ebs: Hey there
==> amazon-ebs: Stopping the source instance...
amazon-ebs: Stopping instance, attempt 1
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating the AMI: test_ami
amazon-ebs: AMI: ami-20ff765a
...
Build 'amazon-ebs' finished.

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:
us-east-1: ami-20ff765a

real 1m50.900s
user 0m0.020s
sys 0m0.008s

Let’s repeat this exercise with Azure. Here’s that template again, but Azure-ified:

{
  "description": "Basic RHEL image.",
  "variables": {
    "client_id": null,
    "client_secret": null,
    "subscription_id": null,
    "azure_location": null,
    "azure_resource_group_name": null
  },
  "builders": [
    {
      "type": "azure-arm",
      "communicator": "ssh",
      "ssh_pty": true,
      "managed_image_name": "rhel-{{ user `base_rhel_version` }}-rabbitmq-x86_64",
      "managed_image_resource_group_name": "{{ user `azure_resource_group_name` }}",
      "os_type": "Linux",
      "vm_size": "Standard_B1",
      "client_id": "{{ user `client_id` }}",
      "client_secret": "{{ user `client_secret` }}",
      "subscription_id": "{{ user `subscription_id` }}",
      "location": "{{ user `azure_location` }}",
      "image_publisher": "RedHat",
      "image_offer": "RHEL",
      "image_sku": "7.3",
      "image_version": "latest"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": [
        "#This is required to allow us to use `sudo` from our Packer provisioner.",
        "#This is enabled by default on all RHEL images for \"security.\"",
        "sudo sed -i.bak -e '/Defaults.*requiretty/s/^/#/' /etc/sudoers"
      ]
    },
    {
      "type": "shell",
      "inline": ["echo Hey there"]
    }
  ]
}

And here’s us running this Packer build. I decided to use a Basic_A0 instance size, as that is the closest thing that Azure has to a t2.micro instance that was available for my subscription. (The Standard_B series is what I originally intended to use, as, like the t2 line, those are burstable.)

Notice that it takes more than five times as long (10m27s versus 1m50s) with the same Linux distribution and similar instance sizes!

$ packer build -var 'client_id=REDACTED' -var 'client_secret=REDACTED' -var 'subscription_id=REDACTED' -var 'tenant_id=REDACTED' -var 'resource_group=REDACTED' -var 'location=East US' azure.json
azure-arm output will be in this color.

==> azure-arm: Running builder ...
azure-arm: Creating Azure Resource Manager (ARM) client ...
==> azure-arm: Creating resource group ...
==> azure-arm: -> ResourceGroupName : 'packer-Resource-Group-s6sj74tdvk'
==> azure-arm: -> Location : 'East US'
...
azure-arm: Hey there
==> azure-arm: Querying the machine's properties ...
==> azure-arm: -> ResourceGroupName : 'packer-Resource-Group-s6sj74tdvk'
==> azure-arm: -> ComputeName : 'pkrvms6sj74tdvk'
==> azure-arm: -> Managed OS Disk : '/subscriptions/8bbbc92b-6d16-4eb2-8f95-7a0769748c8d/resourceGroups/packer-Resource-Group-s6sj74tdvk/providers/Microsoft.Compute/disks/osdisk'
==> azure-arm: Powering off machine ...
==> azure-arm: -> ResourceGroupName : 'packer-Resource-Group-s6sj74tdvk'
==> azure-arm: -> ComputeName : 'pkrvms6sj74tdvk'
==> azure-arm: Capturing image ...
==> azure-arm: -> Compute ResourceGroupName : 'packer-Resource-Group-s6sj74tdvk'
==> azure-arm: -> Compute Name : 'pkrvms6sj74tdvk'
==> azure-arm: -> Compute Location : 'East US'
==> azure-arm: -> Image ResourceGroupName : 'REDACTED'
==> azure-arm: -> Image Name : 'IMAGE_NAME'
==> azure-arm: -> Image Location : 'eastus'
==> azure-arm: Deleting resource group ...
==> azure-arm: -> ResourceGroupName : 'packer-Resource-Group-s6sj74tdvk'
==> azure-arm: Deleting the temporary OS disk ...
==> azure-arm: -> OS Disk : skipping, managed disk was used...
Build 'azure-arm' finished.

==> Builds finished. The artifacts of successful builds are:
--> azure-arm: Azure.ResourceManagement.VMImage:

ManagedImageResourceGroupName: REDACTED
ManagedImageName: IMAGE_NAME
ManagedImageLocation: eastus

real 10m27.036s
user 0m0.056s
sys 0m0.020s

The worst part about this is that it takes this long even when it fails!

Notice the “Deleting resource group…” line near the end of the output. You’ll likely spend a lot of time looking at that line. For some reason, cleanup after an ARM deployment can take a while. I’m guessing that this is due to three things:

  1. Azure creating intermediate resources, such as virtual networks (VNets), subnets and compute, all of which can take time,
  2. ARM waiting for downstream SDKs to finish deleting resources and/or any associated metadata, and
  3. Packer issuing asynchronous operations to the Azure ARM service, which requires polling the operationResult endpoint every so often to see how things played out.
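To make the third point concrete, here is a minimal sketch of what that kind of polling looks like in Bash. This is not how Packer implements it; poll_until_done is a hypothetical helper, and the status words (InProgress, Succeeded, Failed) are assumptions about what the status command prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a status command repeatedly until it reports a
# terminal state, sleeping between attempts.
# Usage: poll_until_done <interval_seconds> <max_attempts> <status command...>
# Assumption: the status command prints a single word such as
# InProgress, Succeeded or Failed.
poll_until_done() {
  local interval=$1 max_attempts=$2
  shift 2
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$("$@")
    case "$status" in
      Succeeded) echo "done after $attempt attempt(s)"; return 0 ;;
      Failed)    echo "operation failed" >&2; return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "timed out after $max_attempts attempt(s)" >&2
  return 2
}
```

Every extra round trip adds at least one polling interval to the wall-clock time, which is one plausible reason the cleanup phase feels so slow.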

Pro-Tip: Use the az Python CLI before running things!

Because recovering from Packer failures can be quite time-consuming, you might want to use the Azure command-line client to check that the inputs to your Packer templates are correct before building. Here’s a quick example: if you want to confirm that the service principal client_id and client_secret are correct, you might add something like this to your pipeline:

#!/usr/bin/env bash
client_id=$1
client_secret=$2
tenant_id=$3

if ! az login --service-principal -u "$client_id" -p "$client_secret" --tenant "$tenant_id"
then
  echo "ERROR: Invalid credentials." >&2
  exit 1
fi

This will save you at least three minutes during execution…as well as something else that’s a little more frustrating.
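The same idea extends to other template inputs. As a rough sketch, assuming az login has already succeeded, a pre-flight check for the image resource group might look like this. require_resource_group is a hypothetical name; az group exists is the Azure CLI command that prints true or false.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: fail fast if the resource group that the
# Packer template points at does not exist.
# Assumption: `az login` has already succeeded in this shell.
require_resource_group() {
  local name=$1
  # `az group exists` prints "true" or "false".
  if [ "$(az group exists --name "$name")" != "true" ]; then
    echo "ERROR: Resource group '$name' not found." >&2
    return 1
  fi
}
```

Calling require_resource_group with the group name right before packer build turns a ten-minute failure into a few seconds.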

The AWS provider and builder are more actively consumed

Both the AWS and Azure Terraform providers and Packer builders are mostly maintained internally by HashiCorp. However, what you’ll find out after using the Azure ARM provider for a short while is that its usage within the community pales in comparison to that of the AWS provider.

I ran into an issue with the azure-arm builder whereby it failed to find a resource group that I created for an image I was trying to build. Locating that resource group with az group list and the same client_id and secret worked fine, and I was able to find the resource group in the console. As well, I gave the service principal “Owner” permission, so there were no access limitations preventing it from finding this resource group.

After spending some time going into the builder source code and firing up Charles Web Proxy, it turned out that my error had nothing to do with resource groups! It turns out that the credentials I was passing into Packer from my Makefile were incorrect.

What was more frustrating is that I couldn’t find anything on the web about this problem. One would think that someone else using this builder would have discovered this before I did, especially since the builder had been available for at least six months at the time of writing.

It also seems that there are, by far, more internal commits and contributors to the Amazon builders than those for Azure, which seem to largely be maintained by Microsoft folks. Despite this disparity, the Azure contributors are quite fast and are very responsive (or at least they were to me!).

Getting Started Is Slightly More Involved on Azure

In the early days of cloud computing, Amazon’s EC2 service focused entirely on VMs. Their MVP at the time was: we’ll make creating, maintaining and destroying VMs fast, easy and painless. Aside from subnets and some routing details, much of the networking overhead was abstracted away. Most of the self-service offerings that Amazon currently has weren’t around, or at least not yet. Deploying an app onto AWS still required knowledge of how to set up EC2 instances and deploy onto them, which allowed companies like DigitalOcean and Heroku to rise to prominence. Over time, this premise seems to have held up, as most of AWS’s other offerings heavily revolve around EC2 in various forms.

Microsoft took the opposite direction with Azure. Azure’s mission statement was to deploy apps onto the cloud as quickly as possible without users having to worry about the details. This is still largely the case, especially if one is deploying an application from Visual Studio. Infrastructure-as-a-Service was more-or-less an afterthought, which led to some industry confusion over where Azure “fit” in the cloud computing spectrum. Consequently, while Microsoft added and expanded their infrastructure offerings over time, the abstractions that were long taken for granted in AWS haven’t been “ported over” as quickly.

This is most evident when one is just getting started with AWS and the HashiCorp suite for the first time versus starting up on Azure. These are the steps that one needs to take in order to get a working Packer image into AWS:

  1. Sign up for AWS.
  2. Log into AWS.
  3. Go to IAM and create a new user.
  4. Download the access and secret keys that Amazon gives you.
  5. Assign that user Admin privileges over all AWS services.
  6. Download the AWS CLI (or install Docker and use the anigeo/awscli image)
  7. Configure your client: aws configure
  8. Create a VPC: aws ec2 create-vpc --cidr-block 10.0.0.0/16
  9. Create an Internet Gateway: aws ec2 create-internet-gateway
  10. Attach the gateway to your VPC so that your machines can reach the Internet: aws ec2 attach-internet-gateway --internet-gateway-id $id_from_step_9 --vpc-id $vpc_id_from_step_8
  11. Create a subnet: aws ec2 create-subnet --vpc-id $vpc_id_from_step_8 --cidr-block 10.0.1.0/24
  12. Update that subnet so that it can issue publicly accessible IP addresses to VMs created within it: aws ec2 modify-subnet-attribute --subnet-id $subnet_id_from_step_11 --map-public-ip-on-launch
  13. Download Packer (or use the hashicorp/packer Docker image)
  14. Create a Packer template for Amazon EBS.
  15. Deploy: packer build -var "access_key=$access_key" -var "secret_key=$secret_key" your_template.json
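Steps 8 through 12 can be strung together so that the IDs from one command feed the next. This is a sketch rather than a hardened script: bootstrap_network is a hypothetical function name, and the --query and --output text options are only there to pull the IDs out of the CLI's JSON responses.

```shell
#!/usr/bin/env bash
# Sketch of steps 8-12: create a VPC, an Internet gateway and a public subnet,
# then print the IDs that the Packer template needs.
# The function name and the default CIDR blocks are illustrative only.
bootstrap_network() {
  local vpc_cidr=${1:-10.0.0.0/16} subnet_cidr=${2:-10.0.1.0/24}
  local vpc_id igw_id subnet_id

  vpc_id=$(aws ec2 create-vpc --cidr-block "$vpc_cidr" \
    --query 'Vpc.VpcId' --output text) || return 1
  igw_id=$(aws ec2 create-internet-gateway \
    --query 'InternetGateway.InternetGatewayId' --output text) || return 1
  aws ec2 attach-internet-gateway \
    --internet-gateway-id "$igw_id" --vpc-id "$vpc_id" >/dev/null || return 1
  subnet_id=$(aws ec2 create-subnet --vpc-id "$vpc_id" --cidr-block "$subnet_cidr" \
    --query 'Subnet.SubnetId' --output text) || return 1
  aws ec2 modify-subnet-attribute \
    --subnet-id "$subnet_id" --map-public-ip-on-launch >/dev/null || return 1

  echo "vpc_id=$vpc_id subnet_id=$subnet_id"
}
```

The two IDs it prints are what go into the vpc_id and subnet_id fields of the amazon-ebs builder. Depending on how your VPC's route table is set up, you may also need a default route pointing at the gateway; that step isn't in the list above, so I've left it out here.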

If you want to understand why an AWS VPC requires an internet gateway or how IAM works, finding whitepapers on these topics is a fairly straightforward Google search.

Getting started on Azure, on the other hand, is slightly more laborious as documented here. Finding in-depth answers about Azure primitives has also been slightly more difficult, in my experience. Most of what’s available are Microsoft Docs entries about how to do certain things and non-technical whitepapers. Finding a Developer Guide like those available in AWS was difficult.

In Conclusion

Using multiple cloud providers is a smart way of taking advantage of the different pricing schemes that providers offer. It is also an interesting way of adding more DR than a single cloud provider can provide alone (which is kind of a farce, as AWS spans dozens of datacenters across the world, many of which are in the US, though region-wide outages have happened before, albeit rarely).

HashiCorp tools like Terraform and Packer make managing this sort of infrastructure much easier to do. However, the two providers aren’t created equal, and the AWS support that exists is, at the time of writing, significantly more extensive. While this certainly doesn’t make using Azure with Terraform or Packer impossible, you might find yourself doing more homework than initially expected!

About Me

I’m a Technical Principal for Contino. We specialize in helping large and heavily-regulated enterprises make cloud adoption and DevOps culture a reality. I’m passionate about bringing DevOps to the enterprise. I’m also passionate about bikes, brews and travel!

Provisioning VMware Workstation Machines from Artifactory with Vagrant

I wrote a small Vagrantfile and helper library for provisioning VMware VMs from boxes hosted on Artifactory. I put this together with the intent of helping us easily provision our Rancher/Cattle/Docker-based platform wholesale on our machines to test changes before pushing them up.

Here it is: https://github.com/carlosonunez/vagrant_vmware_artifactory_example

Tests are to be added soon! I’m thinking Cucumber integration tests with unit tests on the helper methods and Vagrantfile correctness.

I also tried to emphasize small, isolated and easily readable methods with short call chains and zero side effects.

The pipeline would look roughly like this:

  • Clone repo containing our Terraform configurations, cookbooks and this Vagrantfile
  • Make changes
  • Do unit tests (syntax, linting, coverage, etc)
  • Integrate by spinning up a mock Rancher/Cattle/whatever environment with Vagrant
  • Run integration tests (do load balancers work, are services reachable, etc)
  • Vagrant destroy for teardown
  • Terraform apply to push changes to production

We haven’t gotten this far yet, but this Vagrantfile is a good starting point.
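As a rough sketch of how those steps could hang together once the repo is cloned and changed: make unit and make integration below are hypothetical targets standing in for whatever test runners we end up with, while vagrant up, vagrant destroy -f and terraform apply are the real commands.

```shell
#!/usr/bin/env bash
# Sketch of the pipeline. `make unit` and `make integration` are hypothetical
# targets; swap in the real test commands once they exist.
run_pipeline() {
  local rc=0
  make unit || return 1                              # syntax, linting, coverage
  vagrant up || { vagrant destroy -f; return 1; }    # mock environment
  make integration || rc=$?                          # load balancers, services
  vagrant destroy -f                                 # always tear down
  [ "$rc" -eq 0 ] || return "$rc"
  terraform apply                                    # only if everything passed
}
```

The one design choice worth calling out is that teardown runs whether or not the integration tests pass, so a failed run doesn't leave VMs lying around.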

Making sense of this ChatOps thing

So I’m still not entirely sold on the urgency or importance of “chatops.”

I’m a huge fan of Google Assistant née Now. I wish every day that I could replace Siri with it. It can answer nearly any question you throw at it, and it is smart enough to do contextual things that resemble conversations. For fun, I just asked Siri to navigate me from Lewisville, TX to my favorite winery, Messina Hof in Grapevine, TX, while I was away. Here’s what it came back with:

[Image: siri-fail]

Not very useful. What’s a Messina?

Google Assistant, on the other hand, knows what’s up…kind of:

[Image: google-win]

It didn’t get me to the Grapevine location my fiancée and I always go to, but it (a) knew I was talking about Messina Hof, and (b) navigated me to their biggest vineyard in Bryan, TX (a.k.a Aggieland, opinions notwithstanding).

Here’s the thing, though: in almost every case, I will probably open Google Maps and search for the location there. I’m sure that, in the near future, Assistant will be knowledgeable enough to know the exact location I want and whether I should stop for gas and a coffee on the way there (Google’s awesome new phone will probably help accelerate that). In the present, however, it’s a lot faster to do all of that from the app.

Which kind of explains my issue with chatops.

What’s ChatOps?

PagerDuty (awesome on-call management app, highly recommend) explains that, holistically, chatops:

…is all about conversation-driven development. By bringing your tools into your conversations and using a chat bot modified to work with key plugins and scripts, teams can automate tasks and collaborate, working better, cheaper and faster.

Since this is DevOps and that definition wouldn’t be complete without referring to tooling of some sort, remember this?

[Image: aol_bots]

Think that, but with your infrastructure, more Slack, more modern Web and less early-2000s nostalgia:

[Image: original]

The overall goal of chatops is to use communication mediums that we take advantage of on a daily basis to manage workflows and infrastructure more seamlessly. (To me, email automation would not only squarely fit in with this design philosophy, but, as discussed later, would also probably be the most compatible and far-reaching solution for people.)

I’m not saying ChatOps isn’t awesome.

There are several frameworks out there that enable companies and teams to start playing around. Hubot, by GitHub, is the most well-known one. It works with just about every messaging platform out there, including Lync if you have an XMPP gateway set up. Slack integrations and webhooks are also very popular for companies using that product. When implemented correctly, chatops can be quite powerful.

Being able to say phrases like /deploybot deploy master of <project> to preprod or /beachbot create a sandbox environment for myawesometool from carlosnunez’s fork on Slack or Jabber and action on them would be incredibly neat, not to mention incredibly fast. This can be immensely valuable in several high-touch situations such as troubleshooting unexpected issues with infrastructure or automating product releases from a common tool.
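Under the hood, a bot has to turn a phrase like that into structured arguments before it can do anything. Here is a minimal sketch of that step in Bash; parse_deploy_command is a hypothetical function, not part of Hubot or Slack, and it only understands the one deploybot phrasing above.

```shell
#!/usr/bin/env bash
# Hypothetical parser for a "/deploybot deploy <ref> of <project> to <env>"
# chat message. Prints the extracted fields, or fails on anything else.
parse_deploy_command() {
  local message=$1
  local pattern='^/deploybot deploy ([^ ]+) of ([^ ]+) to ([^ ]+)$'
  if [[ $message =~ $pattern ]]; then
    echo "ref=${BASH_REMATCH[1]} project=${BASH_REMATCH[2]} env=${BASH_REMATCH[3]}"
  else
    echo "unrecognized command" >&2
    return 1
  fi
}
```

A real bot would hand those three fields to whatever deployment tooling already exists, which is exactly why that tooling has to be solid first.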

More mature implementations can go much, much deeper than that.

I listened to an extremely interesting episode of Planet Money recently that explained an interesting period of growth for Subaru in the late 1990s to early 2000s. Subaru was struggling to compete with booming Japanese automakers at the time. Its rivals were producing cheaper cars faster and were aggressively targeting the mid-market that Subaru classically did well in. Growth eventually went negative, and morale plummeted with it.

In the late 1990s, they made a discovery while trying to find a modicum of success with what they currently had. They discovered that out of their entire lineup of products, only one was selling consistently: the Impreza. They sought to find out why.

What they found was surprising. They saw that this car, and only this car, had a strong positive correlation with female buyers, specifically women who lived together. So they, with the help of Mulryan/Nash, their ad agency, tried something rash: they aimed to exclusively target same-sex couples in almost all of their ad campaigns.

Their sales soared. In fact, they were one of the few auto manufacturers whose sales grew during the 2008 Global Financial Crisis.

(Check out the full story here if you’re interested in learning more!)

Wouldn’t it have been awesome if they had bots that scoured sales demographics data from their network of dealerships and turned the trends they identified into emails or chats that marketing or sales managers could parse and base these same decisions on? How much faster do you think they would have been able to identify this and act on it? How many other trends could they have uncovered and turned into potential sales?

That’s what I think when I hear about ChatOps. But let’s get back to reality.

I’m saying that it’s just not that crucial.

There are a lot of things that have to be done “right” before chatops can work. Monitoring and alerting have to be on point, especially for implementing things like automated alert or alarm bots. Creating new development environments has to be automated or at least have a consistent process from which automation can occur. Configuration management has to exist and has to be consistent for deployment bots to work. The list goes on.

Herein lies the rub: for engineers, accomplishing these things from a command-line tool is just as simple, and developers and engineers tend to spend just as much time with their tools as with their IM client. Furthermore, implementing new systems introduces complexity, so introducing chatops to an organization when their tooling needs improvement will usually lead to my Messina-that-isn’t-Messina Hof situation from before, where the quality of both toolsets ultimately suffers. So if the goal of implementing chatops is to make engineering’s life easier (or to make it easier for non-technical people to gain more understandable views into their tech), there might be easier and more important wins to be had first.

It’s not the end-all-be-all…yet.

Financial companies, tech-friendly law firms and news organizations use chatops to help model the state of markets, find trends in big law to identify new opportunities and uncover breaking news to broadcast around the world. The intrinsic value of ChatOps is definitely apparent.

That said, the foundation of the house comes first. Infrastructure, process and culture have to be solid and at least somewhat automated before chatops can make sense.

About Me

I’m a DevOps consultant for ThoughtWorks, a software company striving for engineering excellence and a better world for our next generation of thinkers and leaders. I love everything DevOps, Windows, and Powershell, along with a bit of burgers, beer and plenty of travel. I’m on twitter @easiestnameever and LinkedIn at @carlosindfw.

Driving technical change isn’t always technical

Locked rooms full of potential secrets were nothing new for a multinational enterprise that a colleague of mine consulted for a few years ago. A new employee stumbling upon one of these rooms, however, was.

What that employee found in his accidental discovery was a bit unusual: a room full of boxes, all of which were full of neatly-filed printouts of what seemed like meeting minutes. Curious about his new find, he asked his coworkers if they knew anything about this room.

None did.

It took him weeks to find the one person that had a clue about this mysterious room. According to her, one team was asked to summarize their updates every week, and every week, someone printed them out, shipped them to the papers-to-the-metaphoric-ceiling room and categorized them.

Seems strange? This fresh employee thought so. He sought to find out why.

After a few weeks of semi-serious digging, he excavated the history behind this process. Many, many years ago (I’m talking about bring-your-family-into-security-at-the-airport days), an executive was on his way to a far-away meeting and remembered along the way that he forgot to bring a summary of updates for an important team that was to come up in discussion. Panicked, he asked his executive assistant to print it out and bring it to him post haste. She did.

To prevent this from happening again, she printed out and filed this update every week in the room that eventually became the paper jungle gym. She trained her replacement to do this, her replacement trained her replacement; I think you see where this is headed. The convenience eventually became a “rule,” and because we tend to conform in social situations, this rule was never contested.

None of those printed updates in that room were ever used.


This has nothing to do with DevOps.

Keep reading.

I’m not sure what became of that rule (and neither is my colleague). There is one thing I’m sure of, though: tens of thousands of long-lived companies of all sizes have processes like these. Perhaps your company’s deployments to production depend on an approval from some business unit that’s no longer involved with the frontend. Perhaps your company requires a thorough and tedious approval process for new software regardless of its triviality or use. Perhaps your team’s laptops and workstations are locked down as much as those of business analysts who only use their computers for Excel, Word and PowerPoint. (It’s incredible what they can do. Excel itself is a damn operating system; it even includes its own memory manager.)

Some of the simplest changes you can make to help your company get to market faster don’t involve technology at all. If you notice a rule or process that doesn’t make sense, it might be worth your while to do your own digging and question it. More people might agree with you than you think.

Config management and cloud provisioning: There be dragons

So I’ve tried using configuration management to deploy infrastructure to two different clouds and learned this: whenever you think “it would be great if we could deploy to EC2 with Chef,” use CloudFormation or Terraform instead.

Why? Here are a few reasons that come to mind:

  • CloudFormation/Terraform is easier. Terraform HCL is nicer than CloudFormation JSON, but both are *way* easier than trying to shoehorn Jinja2 (Ansible) or chef-provisioning Ruby to do what you want. Like, hundreds of lines easier.

    I once tried to use Ansible to automate provisioning of Active Directory forests onto EC2. I had to create my own roles for handling AMI selection, security group CRUD operations, EBS provisioning, etc. The 2000+ lines of YAML I wrote to uphold all that bass ultimately became about 200 lines of ugly, yet functional, CloudFormation JSON.

    Yeah.
  • Built-in rollback is awesome. CloudFormation and Terraform both support some kind of rollback. Chef provisioning does as well with the :rollback action (I don’t think Ansible does; at least it didn’t when I used the EC2 plugin), but it’s not guaranteed.
  • I really liked the CloudFormation API. I haven’t tried Terraform’s CLI yet, but I would imagine that it’s just as awesome. aws cloudformation provides a lot of useful information that’s easy to action upon in a Chef recipe or Ansible play, especially given that both platforms have support for CloudFormation “built-in.” What’s better, the AWS SDKs have full support for CloudFormation as well, which means…
  • You’re not locked into anything. This was the biggest takeaway from my experiences using chef-provisioning or ansible-ec2. If you ever decide to move away from Chef or Ansible, you’ll need to port over your deployment code with it. Depending on the platform, this could take anywhere from hours to weeks.

    Not a problem with CloudFormation or Terraform. Perhaps you’ll need to change how your Chef shell resource behaves, but that’s a lot easier to deal with, in my opinion.
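As an example of the "easy to action upon" point, here is a small sketch that pulls a single output value from a stack so that a recipe, play or plain shell script can use it. stack_output is a hypothetical wrapper; aws cloudformation describe-stacks and its --query and --output options are the real CLI pieces, and the stack and output names you pass in are up to you.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print one output value from a CloudFormation stack.
# Usage: stack_output <stack-name> <output-key>
stack_output() {
  local stack=$1 key=$2
  # The JMESPath query filters the stack's Outputs list by OutputKey.
  aws cloudformation describe-stacks --stack-name "$stack" \
    --query "Stacks[0].Outputs[?OutputKey=='$key'].OutputValue" \
    --output text
}
```

Because this is just the AWS CLI, it keeps working no matter which config management tool ends up calling it.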

Using your config management solution to do it all is really attractive. It’s usually not a bad idea either. However, when it comes to cloud, tread carefully!

About Me

Carlos Nunez is a DevOps consultant for ThoughtWorks, a software company striving for engineering excellence and a better world for our next generation of thinkers and leaders. He loves everything DevOps, Windows, and Powershell, along with a bit of burgers, beer and plenty of travel.

Follow him on Twitter! @easiestnameever.

Start small; move fast

Seinfeld wasn’t always the heavily-syndicated network cash cow it is today. The hit show started as an experiment for Jerry Seinfeld and Larry David. They wanted to write a show to describe the life of a comedian in New York, namely, Jerry’s. Despite Jerry’s limited acting and writing experience, they wrote their pilot in the late 1980s and sold it as “The Seinfeld Chronicles,” which made its first national appearance on NBC in July 1989.

I’ll spare you the details, but eventually the crew found their beat and, shortly afterwards, historic levels of success. I will say this, though: every episode of Seinfeld was based on a personal story from someone on its writing staff, and was written by that person. Compared to the sitcom-by-committee shows that prevailed during the time, this was a small, but drastic, change that eventually made its way into the mainstream. (For example, every cast member on The Office, a favorite of mine, wrote their own episode; some more than once.)

Moving fast; not as fast as you might think

I don’t know much else about sitcoms, but I do know this: DevOps is chock-full of hype that’s very easy to get lost in. Super-fast 15-minute standups across teams that magically get things done. Lightweight Python or Ruby apps that somehow manage to converge thousands of servers to relentless uniformity. Everything about the cloud. Immutable infrastructure that wipes instead of updates. It’s very tempting to want to go fast in a world full of slow, but doing so without really thinking about it can lead to fracturing, confusion and, ironically, even more slowness.

Configuration management is a pertinent example of this. Before the days of Chef, Puppet or even CFEngine, most enterprises depended on huge, complex configuration management databases (CMDBs), ad-hoc scripts and mountains of paperwork, documentation and physical run-books to manage their “estate” or “fleet.” It was very easy for CFOs to justify the installation and maintenance of these systems: audits were expensive, violating the rules that audits usually exposed was even more expensive, and the insanely-complex CMDBs that required legions of consultants to provision were cheap in comparison.

Many of these money-rich companies are still using these systems to manage their many thousands of servers and devices. Additionally, many of them also have intricate and possibly stifling processes for introducing new software (think: six months, at minimum, to install something like Sublime Text). Introducing Chef to the organization without a plan sounds awesome in theory but can easily lead to non-trivial amounts of sadness in reality.

The anatomy of the status quo

There are many reasons why I think this is, at least from what I’ve noticed during my time at large orgs. Here are the top two that I’ve observed most frequently:

  • People fear/avoid things that they don’t understand. HuffPo ran an article about this in 2011. They found that most people feel more comfortable with things that have been around longer than those that haven’t. The same goes for much of what goes on at work. New things mean new processes, new training, and new complexities.
  • Some things actually exist for a reason. Many people using change management tools for the first time deride them as useless formalities of yesteryear, when systems were mainframes and engineers required slide rules. However, much of their value actually stems from complying with, and being flexible with, the similarly complicated regulations to which those companies are beholden. Consequently, trying to replace all of that with JIRA, while not impossible, will be an incredibly epic uphill battle.

Slow is smooth; smooth is fast.

Now, I’m not saying all of this to say that imposing change in the enterprise is impossible. Nordstrom, for instance, went from a stolid retail corporation to a purveyor of open source tech. NCR, GE and other corporate Goliaths that you might recognize are doing the same.

What I am saying, however, is to do something like what Jerry Seinfeld did: start small, and start lean. If you’ve been itching to bring Ansible to your company in a big way, perhaps it might be worthwhile to tap into the company’s next wonder-child investment and use it for a small section of the project. Passionate about replacing scp scripts with GitHub? It might be worthwhile to find a prominent project that’s using this approach and implement it for them. (Concessions are actually a very powerful way of introducing change when done right. In fact, doing favors for people is an old sales trick, as experiments have shown that people feel beholden to other people that do favors for them.)

Finding a pain point, acting on it in a smart way and failing fast are the principal tenets of doing things the “lean” way, and you don’t even need to create your own LLC to do it! In fact, to me, this is what DevOps is really about: using technology in smart ways to get business done by getting everyone on the same page.
