Winning at Ansible: How to manipulate items in a list!

The Problem

Ansible is a great configuration management platform with a very, very extensible language for expressing your infrastructure as code. It works really well for common workflows (deploying files, adding authorized_keys, creating new EC2 instances, etc.), but its limitations become readily apparent as you begin embarking on more custom and complex plays.

Here’s a quick example. Let’s say you have a playbook that uses a variable (or var in Ansible-speak) that contains a list of hashes, like this:

important_files:
  - file_name: ssh_config
    file_path: /usr/shared/ssh_keys
    file_purpose: Shared SSH config for all mapped users.
  - file_name: bash_profile
    file_path: /usr/shared/bash_profile
    file_purpose: Shared .bash_profile for all mapped users.

(You probably wouldn’t manage files in Ansible this way, as it already comes with a fleshed-out module for doing things with files; I just wanted to pick something that was easy to work with for this post.)

If you want to get a list of file_names from this var, you can do so pretty easily with set_fact and map:

- name: "Get file_names."
  set_fact:
    file_names: "{{ important_files | map(attribute='file_name') }}"

This should return:

[ u'ssh_config', u'bash_profile' ]

However, what if you wanted to modify every file name to add some sort of identifier, like this:

[ u'ssh_config_12345', u'bash_profile_12345' ]

The answer isn’t as clear. One of the top answers I found suggested extending the map Jinja2 filter to make this happen, but (a) I’m too lazy for that, and (b) I don’t want to depend on code that might not be on an actual production Ansible management host.

The Solution

It turns out that the solution for this is more straightforward than it seems:

- name: "Set file suffix"
  set_fact:
    file_suffix: "12345"

- name: "Get and modify file_names."
  set_fact:
    file_names: "{{ important_files | map(attribute='file_name') | list | map('regex_replace','(.*)','\\1_{{ file_suffix }}') | list }}"

Let’s break this down and explain why (I think) this works:

  • map(attribute='file_name') pulls the value of the given attribute out of every item in the list.
  • list casts the generated data structure back into a list (I’ll explain this below).
  • map('regex_replace', '<pattern>', '<replacement>') runs the regular expression replacement against every string in the list. This is what actually does what you want.
  • list casts the results back down to a list again.
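
To sanity-check the result, a quick debug task like this (just a sketch; it assumes the tasks above have already run and set file_names) should print the suffixed list:

- name: "Show the modified file_names."
  debug:
    var: file_names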

The thing that’s important to note about this (and the thing that had me hung up on this for a while) is that every call to map (and most other Jinja2 filters) returns a lazy generator object, NOT the list of items you’re expecting!

What this means is that if you did this:

- name: "Set file suffix"
  set_fact:
    file_suffix: "12345"

- name: "Get and modify file_names."
  set_fact:
    file_names: "{{ important_files | map(attribute='file_name') | map('regex_replace','(.*)','\\1_{{ file_suffix }}') }}"

You might not get what you were expecting:

ok: [localhost] => {
    "msg": "Test - <generator object do_map at 0x7f9c15982e10>."
}

This is sort-of, kind-of explained in this bug post, but it’s not very well documented.

Conclusion

This is the first of a few blog posts on my experiences of using and failing at Ansible in real life. I hope that these save someone a few hours!

About Me

Carlos Nunez is a site reliability engineer for Namely, a modern take on human capital management, benefits and payroll. He loves bikes, brews and all things Windows DevOps and occasionally helps companies plan and execute their technology strategies.

for vs foreach vs “foreach”

Many developers and sysadmins starting out with Powershell will assume that this:

$arr = 1..10
$arr2 = @()
foreach ($num in $arr) { $arr2 += $num + 1 }
write-output $arr2

is the same as this:

$arr = 1..10
$arr2 = @()
for ($i = 0; $i -lt $arr.length; $i++) { $arr2 += $arr[$i] + 1 }
write-output $arr2

or this:

$arr = 1..10
$arr2 = @()
$arr | foreach { $arr2 += $_ + 1 }
write-output $arr2

Just like those Farmers Insurance commercials demonstrate, they are not the same. It’s not as critical an error as, say, mixing up Write-Output with Write-Host (which I’ll explain in another post), but knowing the difference between them might help your scripts perform better and give you more flexibility in how you do certain things within them.

You’ll also get some neat street cred. You can never get enough street cred.

for is a keyword. foreach is an alias…until it’s not.

Developers coming from other languages might assume that foreach is native to the interpreter. Unfortunately, this is not the case when it’s used in the pipeline. There, foreach is an alias to the ForEach-Object cmdlet, a cmdlet that iterates over a collection passed into the pipeline while keeping an enumerator internally (much like how foreach works in other languages). Every PSCmdlet incurs a small performance penalty relative to interpreter keywords, as does reading from the pipeline, so if script performance is critical, you might be better off with a traditional for loop.
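
You don’t have to take my word for the aliasing; Powershell will tell you itself:

# "foreach" (and "%") resolve to the ForEach-Object cmdlet when used in a pipeline.
Get-Alias -Name foreach
# Lists every alias that points at ForEach-Object.
Get-Alias -Definition ForEach-Object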

To see what I mean, consider the amount of time it takes foreach and for to perform 100k loops (in milliseconds):

PS C:\> $st = get-date ; 1..100000 | foreach { } ; $et = get-date ; ($et-$st).TotalMilliseconds
2761.4339

PS C:\> $st = get-date ; for ($i = 0 ; $i -lt 100000; $i++) {} ; $et = get-date ; ($et-$st).TotalMilliseconds
279.2439

PS C:\> $st = get-date ; foreach ($i in (1..100000)) { } ; $et = get-date ; ($et-$st).TotalMilliseconds
128.1159

for was almost 10x faster than the pipelined foreach, and the foreach keyword was over 2x as fast as for! Words do matter!
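
If you want to repeat these tests on your own machine, Measure-Command does the same timing with a little less ceremony (your numbers will vary with hardware and Powershell version):

(Measure-Command { 1..100000 | foreach { } }).TotalMilliseconds              # ForEach-Object via the pipeline
(Measure-Command { for ($i = 0; $i -lt 100000; $i++) {} }).TotalMilliseconds # the for keyword
(Measure-Command { foreach ($i in (1..100000)) { } }).TotalMilliseconds      # the foreach keyword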

foreach (the alias) supports BEGIN, PROCESS, and END

If you look at the help documentation for ForEach-Object, you’ll see that it accepts -Begin, -Process and -End script blocks as parameters. These give you the ability to run code at the beginning and end of pipeline input, so instead of having to manually check your start condition at the beginning of every iteration, you can run it once and be done with it.

For example, let’s say you wanted to write something to the console at the beginning and end of your loop. With a for statement, you would do it like this:

$maxNumber = 100
for ($i=0; $i -lt $maxNumber; $i++) {
    if ($i -eq 0) {
        write-host "We're starting!"
    }
    elseif ($i -eq $maxNumber-1) {
        write-host "We're ending!"
    }
    # do stuff here
}

This will have the interpreter compare the value of $i against 0 and $maxNumber-1 on every iteration before doing anything. This isn’t wrong per se, but it does make your code a little less readable and is subject to bugs if the value of $i is messed with within the loop somewhere.

Now, compare that to this:

1..100 | foreach `
    -Begin { write-host "We're starting now" } `
    -Process { <# do stuff here #> } `
    -End { write-host "We're ending!" }

Not only is this much cleaner and easier to read (in my opinion), it also removes the risk of the initialization and termination code running prematurely since BEGIN and END always execute at the beginning or end of the pipeline.
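
Begin/Process/End is also handy for running aggregations over pipeline input without any bookkeeping outside the pipeline. A small sketch:

1..100 | foreach `
    -Begin { $sum = 0 } `
    -Process { $sum += $_ } `
    -End { write-host "The sum is $sum" }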

Notice how you can’t do this with the foreach keyword:

PS C:\> foreach ($i in 1..10) -Begin {} -Process {echo $_} -End {}
At line:1 char:22
+ foreach ($i in 1..10) -Begin {} -Process {echo $_} -End {}
+ ~
Missing statement body in foreach loop.
+ CategoryInfo : ParserError: (:) [], ParentContainsErrorRecordException
+ FullyQualifiedErrorId : MissingForeachStatement

In this case, foreach has no concept of BEGIN, PROCESS or END; it’s just like the foreach you’re used to using with other languages.

About Me

I’m the founder of caranna.works, an IT engineering firm in Brooklyn that builds smarter and cost-effective IT solutions that help new and growing companies grow fast. Sign up for your free consultation to find out how. http://caranna.works.

Technical Thursdays: Calculate Directory Sizes Stupidly Fast With PowerShell.

Scenario

A file share that a group in your business is dependent on is running out of space. As usual, they have no idea why they’re running out of space, but they need you, the sysadmin, to fix it, and they need it done yesterday.

This has been really easy for Linux admins for a long time now: Do this

du -h / | sort -hr

and delete folders or files from folders at the top that look like they want to be deleted.

Windows admins haven’t been so lucky…at least those that wanted to do it on the command-line (which is becoming increasingly important as Microsoft focuses more on promoting Windows Server Core and PowerShell).

dir sort-of works, but it only prints sizes on files, not directories. This gets tiring really fast, since many big files are system files, and you don’t want to be that guy that deletes everything in C:\windows\system32\winsxs again.

Doing it in PowerShell is a lot better in this regard (as written by Ed Wilson from The Scripting Guys)

function Get-DirectorySize ($directory) {
Get-ChildItem $directory -Recurse | Measure-Object -Sum Length | Select-Object `
    @{Name="Path"; Expression={$directory.FullName}},
    @{Name="Files"; Expression={$_.Count}},
    @{Name="Size"; Expression={$_.Sum}}
}

This code works really well at getting you a folder report…until you try it on a folder like, say, C:\Windows\System32, where you have lots and lots of little files that PowerShell needs to (a) measure, (b) wait for .NET to marshal the Win32 file information into a System.IO.File object, then (c) wrap into the fancy PSObject we know and love.

This is exacerbated further upon running this against a remote SMB or CIFS file share, which is the more likely scenario these days. In this case, Windows needs to make an SMB call to tell the endpoint on which the file share is hosted to measure the size of the directories you’re looking to report on. With CMD, once Windows gets this information back, CMD pretty much dumps the result onto the console and goes away. .NET, unfortunately, has to create System.IO.File objects for every single file in that remote directory, and in order to do that, it needs to retrieve extended file information.

By default, it does this for every single file. This isn’t a huge overhead when the share is on the same network or a network with a low-latency/high-bandwidth path. This is a huge problem when this is not the case. (I discovered this early in my career when I needed to calculate folder sizes on shares in Sydney from New York. Australia’s internet is slow and generally awful. I was not a happy man that day.)

Lee Holmes, a founding father of Powershell, wrote about this here. It looks like this is still an issue in Powershell v5 and, based on his blog post, will continue to remain an issue for some time.

This post will show you some optimizations that you can try that might improve the performance of your directory sizing scripts. All of this code will be available on my GitHub repo.

Our First Trick: Use CMD

One common way of sidestepping this issue is by using a hidden cmd window running dir /s and doing some light string parsing like this:

function Get-DirectorySizeWithCmd {
    param (
        [Parameter(Mandatory=$true)]
        [string]$folder
    )

    $lines = & cmd /c dir /s $folder /a:-d # Run dir in a hidden cmd.exe prompt and return stdout.

    $key = "" ; # We’ll use this to store our subdirectories.
    $fileCount = 0
    $dict = @{} ; # We’ll use this hashtable to hold our directory to size values.
    $lines | ?{$_} | %{
        # These lines have the directory names we’re looking for. When we see them,
        # remove the "Directory of" part and save the directory name.
        if ( $_ -match " Directory of.*" ) {
            $key = $_ -replace " Directory of ",""
            $dict[$key.Trim()] = 0
        }
        # Unless we encounter lines with the size of the folder, which always look like "0+ File(s), 0+ bytes".
        # In this case, take this and set it as the size of the directory we found before, then clear the key to
        # avoid overwriting this value later on (the trailing "Total Files Listed" summary matches this pattern too).
        elseif ( $_ -match "\d{1,} File\(s\).*\d{1,} bytes" ) {
            if ( $key ) {
                # Strip the thousands separators so the sum below treats this as a number.
                $val = ($_ -replace ".* ([0-9,]{1,}) bytes.*",'$1') -replace ','
                $dict[$key.Trim()] = $val
                $key = ""
            }
        }
        # Every other line is a file entry, so we’ll add it to our count.
        else {
            $fileCount++
        }
    }
    $sum = 0
    foreach ( $val in $dict.Values ) {
        $sum += $val
    }
    New-Object -Type PSObject -Property @{
        Path = $folder;
        Files = $fileCount;
        Size = $sum
    }

}

It’s not true Powershell, but it might save you a lot of time over high-latency connections. (It is usually slower on local or nearby storage.)
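
Usage is the same as the pure-Powershell version; just point it at the directory or UNC path you care about (the share name below is made up):

# Size a remote share without spinning up a .NET file object for every file on it.
Get-DirectorySizeWithCmd -folder '\\fileserver01\shared\finance'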

Our Second Trick: Use Robocopy

Most Windows sysadmins know about the usefulness of robocopy during file migrations. What you might not know is how good it is at sizing directories. Unlike dir, robocopy run with the /l, /nfl and /ndl switches:

  1. won’t list every file or directory it finds in its path, and
  2. gives you a little more control over the output, which makes it easier to parse when it makes its way to your Powershell session.

Here’s some sample code that demonstrates this approach:

function Get-DirectorySizeWithRobocopy {
    param (
        [Parameter(Mandatory=$true)]
        [string]$folder
    )

    $fileCount = 0
    $totalBytes = 0
    robocopy /l /nfl /ndl $folder \\localhost\C$\nul /e /bytes | ?{
        $_ -match "^[ \t]+(Files|Bytes) :[ ]+\d"
    } | %{
        $line = $_.Trim() -replace '[ ]{2,}',',' -replace ' :',':'
        $value = $line.split(',')[1]
        if ( $line -match "Files:" ) {
            $fileCount = $value
        } else {
            $totalBytes = $value
        }
    }
    [pscustomobject]@{Path=$folder;Files=$fileCount;Bytes=$totalBytes}
}
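
To see which approach wins for a directory you actually care about, timing all three functions from above against the same path is straightforward (the path below is just an example):

$target = 'C:\Users\carlos\Documents'   # swap in your own directory or share
foreach ($fn in 'Get-DirectorySize','Get-DirectorySizeWithCmd','Get-DirectorySizeWithRobocopy') {
    # Call each sizing function by name and report how long it took; lower is better.
    '{0}: {1:n0} ms' -f $fn, (Measure-Command { & $fn $target }).TotalMilliseconds
}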

The Target

For this post, we’ll be using a local directory with ~10,000 files that are about 1k to 10k in length (the cluster size on the server I used is ~8k, so each file really takes up 8k or 16k on disk) and spread out across 20 directories. The code below will generate this for you:

$maxNumberOfDirectories = 20

$maxNumberOfFiles = 10
$minFileSizeInBytes = 1024
$maxFileSizeInBytes = 1024*10
$maxNumberOfFilesPerDirectory = [Math]::Round($maxNumberOfFiles/$maxNumberOfDirectories)

for ($i=0; $i -lt $maxNumberOfDirectories; $i++) {
    mkdir "./dir-$i" -force

    for ($j=0; $j -lt $maxNumberOfFilesPerDirectory; $j++) {
        $fileSize = Get-Random -Min $minFileSizeInBytes -Max $maxFileSizeInBytes
        $str = 'a'*$fileSize
        echo $str | out-file "./file-$j" -encoding ascii
        mv "./file-$j" "./dir-$i"
    }
}

I used values of 1000 and 10000 for $maxNumberOfFiles while keeping the number of directories at 20.

Here’s how we did:

                                 1k files    10k files
Get-DirectorySize                ~60ms       ~2500ms
Get-DirectorySizeWithCmd         ~110ms      ~3600ms
Get-DirectorySizeWithRobocopy    ~45ms       ~85ms

I was actually really surprised to see how performant robocopy was. I believe that cmd would be just as performant if not more so if it didn’t have to do as much printing to the console as it does.

/MT isn’t a panacea

The /MT switch tells robocopy to split the job it’s given amongst several child robocopy instances. One would think that this would speed things up, since the only thing faster than robocopy is more robocopy. It turns out that this was actually NOT the case, as its times ballooned up to around what we saw with cmd. I presume that this has something to do with the way that those jobs are being pooled, or that each process logs to its own stdout buffer.

TL;DR: Don’t use it.

A note about Jobs

PowerShell Jobs seem like a tempting option. Jobs make it very easy to run several pieces of code concurrently. For long-running scriptblocks, Jobs are actually an awesome approach.

Unfortunately, Jobs will work against you for a problem like this. Every Powershell Job spawns a new Powershell session with its own Powershell process. Each runspace within that session will use at least 20MB of memory, and that’s without modules! Additionally, you’ll need to invoke every Job serially, which means that the time spent just starting each job could very well exceed the amount of time it takes robocopy to compute your directory sizes. Finally, if you use cmd or robocopy to compute your directory sizes, every job will invoke its own copies of cmd and robocopy, which will further increase your memory usage for, potentially, very little benefit.
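
For reference, the Job-per-directory pattern I’m advising against looks roughly like this (the share path is hypothetical):

# One Job (and one new Powershell process) per directory: slow to start and memory-hungry.
$jobs = Get-ChildItem '\\fileserver01\shared' -Directory | ForEach-Object {
    Start-Job -ArgumentList $_.FullName -ScriptBlock {
        param($path)
        # Each Job is a brand-new session, so any helper functions (and any cmd or
        # robocopy child processes they spawn) have to be loaded all over again in here.
        (Get-ChildItem $path -Recurse | Measure-Object -Sum Length).Sum
    }
}
$jobs | Wait-Job | Receive-Job
$jobs | Remove-Job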

TL;DR: Don’t use Jobs either.

That’s all I’ve got! I hope this helps!

Do you have another solution that works? Has this helped you size directories a lot faster than before? Let’s talk about it in the comments!

About Me

I’m the founder of caranna.works, an IT engineering firm in Brooklyn that builds smarter and cost-effective IT solutions that help new and growing companies grow fast. Sign up for your free consultation to find out how. http://caranna.works.

Technical Tuesdays: Powershell Pipelines vs Socks on Amazon

In Powershell, a typical, run-of-the-mill pipeline looks something like this:

Get-ChildItem ~ | ?{$_.LastWriteTime -lt $(Get-Date 1/1/2015)} | Format-Table -Auto

but really looks like this when written in .NET (C# in this example):

// This fragment needs a reference to System.Management.Automation (the PowerShell SDK).
using System;
using System.Collections.ObjectModel;
using System.Management.Automation;
using System.Management.Automation.Runspaces;

Runspace runspace = RunspaceFactory.CreateRunspace();
try {
    runspace.Open();

    // Get-ChildItem ~
    Command getChildItem = new Command("Get-ChildItem");
    getChildItem.Parameters.Add("Path", "~");

    // ?{$_.LastWriteTime -lt $(Get-Date 1/1/2015)}
    Command whereObjectWithFilter = new Command("Where-Object");
    ScriptBlock whereObjectFilterScript =
        ScriptBlock.Create("$_.LastWriteTime -lt $(Get-Date 1/1/2015)");
    whereObjectWithFilter.Parameters.Add("FilterScript", whereObjectFilterScript);

    // Format-Table -Auto
    Command formatTable = new Command("Format-Table");
    formatTable.Parameters.Add("AutoSize", true);

    // Format-* cmdlets emit formatting records meant for a Powershell host, so we tack
    // Out-String onto the end of the pipeline to turn them into printable text.
    Command outString = new Command("Out-String");

    Pipeline pipeline = runspace.CreatePipeline();
    pipeline.Commands.Add(getChildItem);
    pipeline.Commands.Add(whereObjectWithFilter);
    pipeline.Commands.Add(formatTable);
    pipeline.Commands.Add(outString);

    Collection<PSObject> results = pipeline.Invoke();
    foreach (PSObject result in results) {
        Console.WriteLine(result.ToString());
    }

    // Non-terminating errors land in the pipeline's error stream as ErrorRecords.
    foreach (object error in pipeline.Error.ReadToEnd()) {
        PSObject perror = error as PSObject;
        if (perror != null) {
            ErrorRecord record = perror.BaseObject as ErrorRecord;
            if (record != null) {
                Console.WriteLine(record.Exception.Message);
                Console.WriteLine(record.FullyQualifiedErrorId);
            }
        }
    }
} catch (RuntimeException error) {
    Console.WriteLine(error.Message);
} finally {
    runspace.Close();
}

Was your reaction something like:

WUT

Yeah, mine was too.

Let’s try to break down what’s happening here in a few tweets.

Running commands in Powershell is very much like buying stuff from Amazon. At a really high level, you can think of the life of a command in Powershell like this:

  • You’re in the mood for fancy socks and go to Amazon.com. (This would be equivalent to the runspace in which Powershell commands are run.)

  • You find a few pairs that you like (most of them fuzzy and warm) and order them. (This would be the cmdlet that you type into your Powershell host (command prompt).)

  • Amazon finds those socks in their massive warehouse and begins packaging them. (This is akin to finding the definition of the cmdlet you typed (Get-ChildItem in our example) in a .NET library loaded into your runspace and, when found, wrapping it into a Command object, with the fuzziness and color of those socks being its Parameter properties.)

  • Amazon then puts that package into a queue in preparation for shipment. (In Powershell, this would be like adding the Command into a Pipeline.)

  • Amazon ships your super fuzzy socks when ready. (Pipeline.Invoke()).

  • You open the box the next day (you DO have Prime, right?!) and enjoy your snazzy feet gloves. (The results of the Pipeline get written to the host attached to its runspace, which in this case would be the Powershell host/command prompt.)

  • If Amazon had issues getting the socks to you, you would have gotten an email of some sort with a refund + free money and an explanation of what happened (In Powershell, this is known as an ErrorRecord.)

And that’s how Microsoft put the power of Amazon on your desktop!
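
If you ever want to poke at this machinery without writing any C#, the same classes are reachable from a Powershell prompt. A sketch that builds roughly the same pipeline as the one-liner above:

$ps = [powershell]::Create()
[void]$ps.AddCommand('Get-ChildItem').AddParameter('Path', '~')
[void]$ps.AddCommand('Where-Object').AddParameter('FilterScript', { $_.LastWriteTime -lt (Get-Date 1/1/2015) })
$results = $ps.Invoke()    # runs the pipeline in its own runspace
$ps.Streams.Error          # any ErrorRecords produced along the way land here
$ps.Dispose()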

Has the Powershell pipeline ever saved your life? Have you ever had to roll your own runspaces and lived to talk about it? (Did you know you can use runspaces to make multithreaded Powershell scripts? Not saying that you would…) Let’s talk about it in the comments below!

About Me

I’m the founder of caranna.works, an IT engineering firm in Brooklyn, NY that employs time-tested and proven solutions that help companies save lots of money on their IT costs. Sign up for your free consultation to find out how. http://caranna.works.