Enable Linux on Windows the fast way.

Do you have a Windows machine running the Windows 10 Anniversary Update? Do you want to install Ubuntu on that machine so you can have a real terminal and do real Linux things (something something DOCKER DOCKER DOCKER something something)? Do you want to do this all through PowerShell?

Say no more. I got you.

Start an elevated PowerShell session. (Click on the Start button, type “powershell” into the search bar, press Ctrl+Shift+Enter, then click “Yes” at the UAC prompt.) Copy and paste the script below into it. Restart your machine. Enjoy Linux on Windows. What a time to be alive.

# Create AppModelUnlock if it doesn't exist, required for enabling Developer Mode
 $RegistryKeyPath = "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\AppModelUnlock"
 if (-not(Test-Path -Path $RegistryKeyPath)) {
 New-Item -Path $RegistryKeyPath -ItemType Directory -Force
 }

# Add registry value to enable Developer Mode
 New-ItemProperty -Path $RegistryKeyPath -Name AllowDevelopmentWithoutDevLicense -PropertyType DWORD -Value 1

# Enable the Linux subsystem
 Get-WindowsOptionalFeature -Online | ?{$_.FeatureName -match "Linux"} | %{ Enable-WindowsOptionalFeature -Online -FeatureName $_.FeatureName}
 Restart-Computer -Force

# Install Ubuntu
 # Start an elevated Powershell session first
 lxrun /install /y
 lxrun /setdefaultuser <username that you want>

# Start it!
 bash
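If you want to double-check that the subsystem actually got enabled before running lxrun (a quick sanity check, not part of the original steps; Microsoft-Windows-Subsystem-Linux should be the feature name on the Anniversary Update), something like this will tell you:

# Should report State : Enabled after the restart.
 Get-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux | Select-Object FeatureName, State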

Suggestions

  • Install Chocolatey. It’s a package manager for Windows. It’s damn good. You can write your own packages too.
  • Install ConsoleZ: choco install consolez. It’s the best.
  • Install gvim: choco install gvim.
  • Install vcxsrv (the new xming, now with an even more abstract name!): choco install vcxsrv
  • Put Set-PSReadLineOption -EditMode Emacs into your profile (vim $PROFILE). Enjoy Emacs keybindings in your PowerShell session.
  • You can forward X11 applications to Windows! After installing and starting vcxsrv, prefix your application with DISPLAY=:0 (see the sketch after this list). Speed is fine; it’s a lot faster than doing it over SSH (as expected, since Ubuntu is running under a Windows subsystem and these syscalls are translated into Windows syscalls).
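Here’s a rough sketch of that X11 forwarding trick, assuming vcxsrv is already running and listening on display :0 (gvim is just an example client; substitute any GUI app you’ve installed inside Ubuntu):

# From PowerShell, hop into bash and point the app at the vcxsrv display.
 bash -c "DISPLAY=:0 gvim"

You can also add export DISPLAY=:0 to your ~/.bashrc inside Ubuntu so every session is ready to go.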

About Me

I’m a DevOps consultant for ThoughtWorks, a software company striving for engineering excellence and a better world for our next generation of thinkers and leaders. I love everything DevOps, Windows, and Powershell, along with a bit of burgers, beer and plenty of travel. I’m on twitter @easiestnameever and LinkedIn at @carlosindfw.

Winning at Ansible: How to manipulate items in a list!

The Problem

Ansible is a great configuration management platform with a very, very extensible language for expressing your infrastructure as code. It works really well for common workflows (deploying files, adding authorized_keys, creating new EC2 instances, etc.), but its limitations become readily apparent as you begin embarking on more custom and complex plays.

Here’s a quick example. Let’s say you have a playbook that uses a variable (or var in Ansible-speak) that contains a list of hashes, like this:

important_files:
  - file_name: ssh_config
    file_path: /usr/shared/ssh_keys
    file_purpose: Shared SSH config for all mapped users.
  - file_name: bash_profile
    file_path: /usr/shared/bash_profile
    file_purpose: Shared .bash_profile for all mapped users.

(You probably wouldn’t manage files in Ansible this way, as it already comes with a fleshed-out module for doing things with files; I just wanted to pick something that was easy to work with for this post.)

If you want to get a list of file_names from this var, you can do so pretty easily with set_fact and map:

- name: "Get file_names."
set_fact:
file_names: "{{ important_files | map(attribute='file_name') }}"

This should return:

[ u'ssh_config', u'bash_profile' ]

However, what if you wanted to modify every file name to add some sort of identifier, like this:

[ u'ssh_config_12345', u'bash_profile_12345' ]

The answer isn’t as clear. One of the top answers I found suggested extending the map Jinja2 filter with custom code to make this happen, but (a) I’m too lazy for that, and (b) I don’t want to depend on code that might not be present on an actual production Ansible management host.

The Solution

It turns out that the solution for this is more straightforward than it seems:

- name: "Set file suffix"
set_fact:
file_suffix: "12345"

- name: &quot;Get and modify file_names.&quot;
set_fact:
file_names: "{{ important_files | map(attribute='file_name') | list | map('regex_replace','(.*)','\\1_{{ file_suffix }}') | list }}"

Let’s break this down and explain why (I think) this works:

  • map(attribute='file_name') pulls the value of the file_name key out of every item in the list.
  • list casts the generated data structure back into a list (I’ll explain why this matters below).
  • map('regex_replace','(.*)','\\1_{{ file_suffix }}') runs the regex replacement against every string in the list and appends the suffix. This is what actually does what you want.
  • list casts the result back down to a list again.

The thing that’s important to note about this (and the thing that had me hung up on it for a while) is that every call to map (and most other Jinja2 filters) returns a raw Python generator object, NOT the list of items that the generator would produce!

What this means is that if you did this:

- name: "Set file suffix"
set_fact:
file_suffix: "12345"

- name: "Get and modify file_names."
set_fact:
file_names: "{{ important_files | map(attribute='file_name') | map('regex_replace','(.*)','\\1_{{ file_suffix }}') }}"

You might not get what you were expecting:

ok: [localhost] => {
    "msg": "Test - <generator object do_map at 0x7f9c15982e10>."
}

This is sort-of, kind-of explained in this bug post, but it’s not very well documented.

Conclusion

This is the first of a few blog posts on my experiences of using and failing at Ansible in real life. I hope that these save someone a few hours!

About Me

Carlos Nunez is a site reliability engineer for Namely, a modern take on human capital management, benefits and payroll. He loves bikes, brews and all things Windows DevOps and occasionally helps companies plan and execute their technology strategies.

Technical Thursdays: DNS, or why using the Internet is kind of like going to Starbucks

This Thursday, we’ll talk about a system that has been extremely critical (and extremely taken for granted) for shaping the Internet as we know it: the domain name system, or DNS for short.

Before I explain what DNS is, I’ll talk about something I try really hard to hate but ultimately can’t: Starbucks.

I go to Starbucks at least once a day. Given that Google has more coffee machines (and baristas!) sitting idle than my handy downstairs Starbucks does on even their busiest days, this is slightly embarrassing to admit. I love their drinks, but as a recovering coffee snob, I passive-aggressively hate that I love their drinks. My relationship with that Seattle staple is kind of like how a lot of people feel about Taylor Swift: they’ll hate on her forever but will never admit to playing 1989 on repeat.

Wait, that’s just me?

Okay. I can live with that.

Anyway, what I find fascinating about Starbucks aside from their many variants of non-coffee coffee drinks (that are so good but so bad) is how baristas communicate drinks to each other. Somehow, someway, your order for a tall caramel-flavored latte with soy milk, whipped cream and a double-shot of espresso is always a tall caramel whip redeye latte to every Starbucks barista on the planet, but trying that on a barista at Cafe Grumpy will usually get you banned for life.

What’s even more fascinating about this is that DNS works “exactly” the same way when you go to BuzzFeed.com on your phone or computer to endlessly browse lists of cat pictures and gifs of people doing funny things.

(Don’t pretend like you don’t.)

You probably know that underneath the lists and relationship videos, BuzzFeed is really a ton of servers doing lots of hard work to deliver this quality content, and buzzfeed.com is just one of the servers that shows them to you.

What you might not know is that the name of that server isn’t really buzzfeed.com; it’s actually 54.241.35.79. That’s its IP address.

If you type those four (or eight) numbers into Chrome (or whatever your browser of choice is; I use Safari for reasons that won’t be discussed here to avoid an intense holy war), it’ll take you right to BuzzFeed.

How does your computer know that these two things go to the same place? The answer is DNS.

What Is This DNS Magic That You Speak Of?

DNS is a system that maps names like buzzfeed.com or Wikipedia.org to IP addresses. It was created in the early 1980s when the Internet was much much MUCH smaller and has been iterated and improved upon significantly since then. Here’s the original RFC that describes how it works, and surprisingly, a lot of it has held up over time!

These mappings are stored in records, and there are several kinds of them. The name-to-IP mapping that I described earlier is stored in an A record, but a DNS server can also have records for other mappings, like aliases that point at other names (CNAME records), the mail servers for a domain (MX records) or arbitrary data (TXT records).

When your computer attempts to find the IP address for a web site, its DNS client (also called a resolver) performs a DNS query. The response it gets back is the DNS response.

So original, I know.
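If you want to poke at these records and responses yourself, Windows 8 and later ship a resolver cmdlet in PowerShell (dig or nslookup will show you the same things elsewhere); buzzfeed.com here is just the example from above:

Resolve-DnsName buzzfeed.com -Type A     # the name-to-IP mapping
Resolve-DnsName buzzfeed.com -Type MX    # the mail servers for the domain
Resolve-DnsName buzzfeed.com -Type TXT   # arbitrary text published under the name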

Dots and zones

The dots in a website URL are very important. Every word behind each dot is called a DNS domain, and every one of those words maps to something.

The last word in the URL, i.e. the .com, .org or .football, is called a top-level domain, or TLD. The list of TLDs is maintained by the Internet Assigned Numbers Authority, or the IANA. In the early days of the simple Internet, the TLD used to give you an idea of what the website was for: .coms were for commercial use or companies, .orgs were for non-profits and foundations, .nets were for network providers and country-specific TLDs like .us or .it were for websites associated with those countries.

However, like most things from that time period, that’s gone completely out the window (do you think bit.ly is in Libya?).

Records within the DNS are broken up into zones, and servers within the DNS are responsible for serving their zones. These zones are usually HUGE text files that get loaded completely into that server’s memory for really fast access. When your computer sends a DNS query, the DNS server you’re configured to use will ask the server responsible for that zone if it doesn’t have the record it’s looking for stored anywhere. It does this by asking for a special record called the Start of Authority, or SOA, which tells it where to go next in its search.
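You can look at a zone’s SOA record yourself, and even point the question at a specific DNS server (8.8.8.8 is one of Google’s public resolvers; the domain is just an example):

Resolve-DnsName wikipedia.org -Type SOA -Server 8.8.8.8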

DNS is so hot right now

Almost every single web site you’ve visited within the last 20 years or so has likely taken advantage of DNS. If you’re like me, that’s probably a lot of websites! Furthermore, many of the assets on those web sites (think: images and code for all of those fancy site effects) are referred to by name and resolved by DNS.

The Internet as we know it would not function without DNS. As of yesterday, the entire Internet comprised just over 1 BILLION unique web sites (and growing! exponentially!) and was used by over 3 BILLION people.

Now imagine all of that traffic being handled by a single Dell server somewhere in this vast sea of Internet.

You can’t? Good. Me neither.

DNS at WEB SCALE

So how does DNS manage to work for all of these people for all of these web sites? When it comes to matters of scale, the answer is usually: throw a metric crap ton of servers at it.

DNS is no exception.

The Root

There are a few layers of servers involved in your typical DNS query. The first and top-most layer starts at the DNS root servers. These servers are run by a small set of organizations coordinated by the IANA, and they tell you which servers are responsible for which TLDs (see below).

There are 13 root servers throughout the world, {A through M}.root-servers.net. As you can imagine, they are very, very, very powerful clusters of servers.

The TLD companies

Every TLD is managed by a company or organization called a registry. The DNS servers run by these registries contain the records for every website that uses those TLDs. In the case of bit.ly, for example, the records for bit.ly live on DNS servers managed by the .ly registry in Libya, whereas the records for stupidsiteabout.football are managed by Donuts.

Whenever you buy a domain with GoDaddy, (a) you are doing yourself a disservice and need to get on Gandi or Hover right now, and (b) your payment gives you the ability to create records that eventually end up on these servers.

The Public Servers

The next layer of servers in the query are the public DNS servers. These are usually hosted by either your ISP, Google or DNS companies like Dyn or OpenDNS, but there are MANY DNS servers available out there. These are almost always the DNS servers that you use on a daily basis.

They cache the records for the sites people ask about most, and they’ll refer to the root and TLD servers above if they’re missing anything. Also, because they are used much more heavily than the root servers, they are bigger targets for people doing bad things, so the good DNS services implement lots of security enhancements to prevent those things from happening. Finally, the really big DNS services usually have MANY more servers available than the root servers, so your query will almost always be responded to quickly.

Your Dinky Linksys

The third layer of servers involved in the queries most people make isn’t really made of servers at all! Your home router most likely runs a small DNS server to help make responses to queries a lot faster. These don’t store a lot of records, and they are typically written pretty badly, so I often reconfigure these routers for my clients so that they use Google or OpenDNS instead.
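If you’d rather make that change on a single Windows machine instead of at the router, something like this should do it from an elevated PowerShell session (the interface alias varies from machine to machine, and Google’s resolvers are just an example):

Set-DnsClientServerAddress -InterfaceAlias "Wi-Fi" -ServerAddresses 8.8.8.8, 8.8.4.4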

Your job probably has DNS servers of its own to improve performance and to keep internal and private records.

Your iPhone

The final layer of a query ends (well, starts) right at your phone or computer. Your computer’s DNS resolver will often store responses to common queries for a short period of time so it doesn’t have to hit DNS servers more often than it needs to.

While this is usually a very good thing, it can cause problems when records change. If you’ve ever tried to go to a website and were unable to, this is often one reason why. Fortunately, fixing it is as simple as clearing your DNS cache. In Windows, you can do this by clicking Start, then typing cmd /c ipconfig /flushdns into your search bar. Use these instructions to do this on your Mac or these instructions to do this on your iPhone or iPad.
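If you’d rather stay in PowerShell, Windows 8 and later expose the same thing as cmdlets (plus one for peeking at what’s currently cached):

Clear-DnsClientCache   # same effect as ipconfig /flushdns
Get-DnsClientCache     # shows what your resolver is currently holding on to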

This is starting to get long and I’m in the mood for a caramel frap now, so I’m going to stop while I’m ahead here!

Did you learn something today? Did I miss something? Let me know in the comments!

Technical Thursdays: Calculate Directory Sizes Stupidly Fast With PowerShell.

Scenario

A file share that a group in your business is dependent on is running out of space. As usual, they have no idea why they’re running out of space, but they need you, the sysadmin, to fix it, and they need it done yesterday.

This has been really easy for Linux admins for a long time now: Do this

du -h / | sort -hr

and delete folders or files from folders at the top that look like they want to be deleted.

Windows admins haven’t been so lucky…at least those who wanted to do it from the command line (which is becoming increasingly important as Microsoft focuses more on promoting Windows Server Core and PowerShell).

dir sort-of works, but it only prints sizes for files, not directories. This gets tiring really fast, since many big files are system files, and you don’t want to be that guy who deletes everything in C:\windows\system32\winsxs again.

Doing it in PowerShell is a lot better in this regard. Here’s one way to do it, as written by Ed Wilson of The Scripting Guys:

function Get-DirectorySize ($directory) {
Get-ChildItem $directory -Recurse | Measure-Object -Sum Length | Select-Object `
    @{Name="Path"; Expression={$directory.FullName}},
    @{Name="Files"; Expression={$_.Count}},
    @{Name="Size"; Expression={$_.Sum}}
}
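Note that $directory.FullName only resolves if you hand the function a DirectoryInfo object (i.e. something from Get-Item) rather than a plain string, so usage looks something like this (the path is just an example):

Get-DirectorySize (Get-Item 'C:\Windows\System32')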

This code works really well at getting you a folder report…until you try it on a folder like, say, C:\Windows\System32, where you have lots and lots of little files that PowerShell needs to (a) measure, (b) wait for .NET to marshal the Win32 file system object into a System.IO.File object, then (c) wrap into the fancy PSObject we know and love.

This is exacerbated further when you run this against a remote SMB or CIFS file share, which is the more likely scenario these days. In this case, Windows needs to make an SMB call to tell the endpoint hosting the file share to measure the size of the directories you’re looking to report on. With CMD, once Windows gets this information back, CMD pretty much dumps the result onto the console and goes away. .NET, unfortunately, has to create System.IO.File objects for every single file in that remote directory, and in order to do that, it needs to retrieve extended file information.

By default, it does this for every single file. This isn’t a huge overhead when the share is on the same network or a network with a low-latency/high-bandwidth path. This is a huge problem when this is not the case. (I discovered this early in my career when I needed to calculate folder sizes on shares in Sydney from New York. Australia’s internet is slow and generally awful. I was not a happy man that day.)

Lee Holmes, a founding father of PowerShell, wrote about this here. It looks like this is still an issue in PowerShell v5 and, based on his blog post, will remain one for some time.

This post will show you some optimizations that you can try that might improve the performance of your directory sizing scripts. All of this code will be available on my GitHub repo.

Our First Trick: Use CMD

One common way of sidestepping this issue is by running dir /s in a hidden cmd window and doing some light string parsing on the output, like this:

function Get-DirectorySizeWithCmd {
    param (
        [Parameter(Mandatory=$true)]
        [string]$folder
    )

    $lines = & cmd /c dir /s $folder /a:-d # Run dir in a hidden cmd.exe prompt and return stdout.

    $key = "" # We’ll use this to hold the subdirectory we’re currently processing.
    $fileCount = 0
    $dict = @{} # We’ll use this hashtable to map each directory to its size.
    $lines | ?{$_} | %{
        # These lines have the directory names we’re looking for. When we see them,
        # remove the "Directory of" part and save the directory name.
        if ( $_ -match " Directory of.*" ) {
            $key = $_ -replace " Directory of ",""
            $dict[$key.Trim()] = 0
        }
        # ...until we encounter a line with the size of the folder, which always looks like "0+ File(s), 0+ bytes".
        # In that case, take the byte count and set it as the size of the directory we found before, then clear
        # the key to avoid overwriting this value later on. (The $key check skips the "Total Files Listed"
        # summary at the end of dir’s output so the grand total doesn’t get counted twice.)
        elseif ( $key -and $_ -match "\d{1,} File\(s\).*\d{1,} bytes" ) {
            $val = ($_ -replace ".* ([0-9,]{1,}) bytes.*",'$1') -replace ',','' # Strip thousands separators so we can sum it.
            $dict[$key.Trim()] = $val
            $key = ""
        }
        # Every other line is a file entry, so we’ll add it to our count.
        else {
            $fileCount++
        }
    }

    $sum = 0
    foreach ( $val in $dict.Values ) {
        $sum += $val
    }
    New-Object -Type PSObject -Property @{
        Path = $folder;
        Files = $fileCount;
        Size = $sum
    }
}

It’s not true PowerShell, but it might save you a lot of time over high-latency connections. (It is usually slower on local or nearby storage.)

Our Second Trick: Use Robocopy

Most Windows sysadmins know about the usefulness of robocopy during file migrations. What you might not know is how good it is at sizing directories. robocopy /l /nfl /ndl has two advantages over dir:

  1. It won’t list every file or directory it finds in its path, and
  2. It provides a little more control over the output, which makes it easier for you to parse when the output makes its way to your PowerShell session.

Here’s some sample code that demonstrates this approach:

function Get-DirectorySizeWithRobocopy {
    param (
        [Parameter(Mandatory=$true)]
        [string]$folder
    )

    $fileCount = 0
    $totalBytes = 0
    # /l lists without copying; the UNC "destination" is a throwaway since nothing is ever written.
    robocopy /l /nfl /ndl $folder \\localhost\C$\nul /e /bytes | ?{
        $_ -match "^[ \t]+(Files|Bytes) :[ ]+\d"
    } | %{
        $line = $_.Trim() -replace '[ ]{2,}',',' -replace ' :',':'
        $value = $line.split(',')[1]
        if ( $line -match "Files:" ) {
            $fileCount = $value
        } else {
            $totalBytes = $value
        }
    }
    [pscustomobject]@{Path=$folder;Files=$fileCount;Bytes=$totalBytes}
}
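Usage mirrors the cmd version; it emits a single object with the path, file count and byte count (again, the path is just an example):

Get-DirectorySizeWithRobocopy -folder 'C:\Temp' | Format-Table Path, Files, Bytes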

The Target

For this post, we’ll be using a local directory with 1,000 to 10,000 files that were about 1 to 10k in length (the cluster size on the server I used is ~8k, so they’re really about 8-80k in size) and spread out across 20 directories. The code below will generate this for you:

$maxNumberOfDirectories = 20

$maxNumberOfFiles = 1000 # I used 1000 and 10000 for the tests below.
$minFileSizeInBytes = 1024
$maxFileSizeInBytes = 1024*10
$maxNumberOfFilesPerDirectory = [Math]::Round($maxNumberOfFiles/$maxNumberOfDirectories)

for ($i=0; $i -lt $maxNumberOfDirectories; $i++) {
    mkdir "./dir-$i" -force

    for ($j=0; $j -lt $maxNumberOfFilesPerDirectory; $j++) {
        $fileSize = Get-Random -Min $minFileSizeInBytes -Max $maxFileSizeInBytes
        $str = 'a'*$fileSize
        echo $str | out-file "./file-$j" -encoding ascii
        mv "./file-$j" "./dir-$i"
    }
}

I used values of 1000 and 10000 for $maxNumberOfFiles while keeping the number of directories at 20.
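Timings like the ones below are easy to gather by wrapping each function in Measure-Command (a sketch; your numbers will vary with hardware and how warm the filesystem cache is):

Measure-Command { Get-DirectorySize (Get-Item '.') } | Select-Object TotalMilliseconds
Measure-Command { Get-DirectorySizeWithCmd -folder '.' } | Select-Object TotalMilliseconds
Measure-Command { Get-DirectorySizeWithRobocopy -folder '.' } | Select-Object TotalMilliseconds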

Here’s how we did:

                                  1k files    10k files
Get-DirectorySize                 ~60ms       ~2500ms
Get-DirectorySizeWithCmd          ~110ms      ~3600ms
Get-DirectorySizeWithRobocopy     ~45ms       ~85ms

I was actually really surprised to see how performant robocopy was. I believe that cmd would be just as performant, if not more so, if it didn’t have to print as much to the console as it does.

/MT isn’t a panacea

The /MT switch tells robocopy to split the copy job it’s given amongst several child robocopy processes. One would think that this would speed things up, since the only thing faster than robocopy is more robocopy. It turns out that this was actually NOT the case, as its times ballooned up to around what we saw with cmd. I presume that this has something to do with the way those jobs are pooled, or with each process logging to its own stdout buffer.

TL;DR: Don’t use it.

A note about Jobs

PowerShell Jobs seem like a tempting option. Jobs make it very easy to run several pieces of code concurrently, and for long-running scriptblocks, they are actually an awesome approach.

Unfortunately, Jobs will work against you for a problem like this. Every PowerShell Job spawns a new PowerShell session with its own PowerShell process. Each runspace within that session will use at least 20MB of memory, and that’s without modules! Additionally, you’ll need to invoke every Job serially, which means that the time spent just starting each job could very well exceed the amount of time it takes robocopy to compute your directory sizes. Finally, if you use cmd or robocopy to compute your directory sizes, every job will invoke its own copy of cmd or robocopy, which will further increase your memory usage for, potentially, very little benefit.

TL;DR: Don’t use Jobs either.

That’s all I’ve got! I hope this helps!

Do you have another solution that works? Has this helped you size directories a lot faster than before? Let’s talk about it in the comments!

About Me

I’m the founder of caranna.works, an IT engineering firm in Brooklyn that builds smarter and cost-effective IT solutions that help new and growing companies grow fast. Sign up for your free consultation to find out how. http://caranna.works.