Provisioning VMware Workstation Machines from Artifactory with Vagrant

I wrote a small Vagrantfile and helper library for provisioning VMware VMs from boxes hosted on Artifactory. I put this together with the intent of helping us easily provision our Rancher/Cattle/Docker-based platform wholesale on our machines to test changes before pushing them up.

Here it is: https://github.com/carlosonunez/vagrant_vmware_artifactory_example

Tests are to be added soon! I’m thinking Cucumber integration tests with unit tests on the helper methods and Vagrantfile correctness.

I also tried to emphasize small, isolated and easily readable methods with short call chains and zero side effects.

The pipeline would look roughly like this (a rough shell sketch follows the list):

  • Clone repo containing our Terraform configurations, cookbooks and this Vagrantfile
  • Make changes
  • Do unit tests (syntax, linting, coverage, etc)
  • Integrate by spinning up a mock Rancher/Cattle/whatever environment with Vagrant
  • Run integration tests (do lb’s work, are services reachable, etc)
  • Vagrant destroy for teardown
  • Terraform apply to push changes to production
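
As shell commands, those steps might look something like the sketch below. This is only a sketch; the repository and script names are placeholders, not anything the repo actually ships with:

$> git clone <your-infrastructure-repo> && cd <your-infrastructure-repo>
$> # ...make changes...
$> rake lint unit          # syntax, linting, coverage
$> vagrant up              # spin up the mock Rancher/Cattle environment
$> rake integration        # do lb's work, are services reachable, etc.
$> vagrant destroy -f      # teardown
$> terraform apply         # push changes to production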

We haven’t gotten this far yet, but this Vagrantfile is a good starting point.

Some Terraform gotchas.

So you’ve got a bacon delivery service repository with Terraform configuration files at the ready, and it looks something like this:


$> tree
.
├── main.tf
├── providers.tf
└── variables.tf

0 directories, 3 files

terraform is applying your configurations and saving them in tfstate like you’d expect. Awesome.
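
In other words, something like this, with state landing in terraform.tfstate:

$> terraform plan    # preview the changes
$> terraform apply   # apply them and record the results in terraform.tfstate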

Eventually, your infrastructure scales just large enough to necessitate a directory structure. You want to express your Terraform configurations in a way that (a) makes it easy to see what’s in which environment, (b) makes it easy to modify each environment without affecting the others, and (c) keeps your HCL from devolving into the kind of mess you’d get doing the same thing with Puppet or Chef.

Fortunately, Terraform makes this pretty easy to do…but not without some gotchas.

One suggestion: Use modules!

Modules give you the ability to reuse Terraform resources throughout your codebase. This way, instead of having a bunch of aws_instance resources lying around in your main.tf, you can neatly express them in ways that make more sense:


module "sandbox-web-servers" {
  source = "../modules/aws/sandbox"
  provider = "aws.us-west-1"
  environment = "sandbox"
  tier = "web"
  count = 10
}

When you do this, you need to populate Terraform’s module cache by running terraform get from the directory containing the configuration that calls the module.
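
For example (a minimal sketch; the directory is wherever the module block above lives):

$> cd path/to/your/configuration
$> terraform get    # copies the module sources into .terraform/modules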

Gotcha #1: Self variable interpolation isn’t a thing yet.

If you noticed, the example above references “sandbox” quite a lot. This is because, unfortunately, Terraform modules (and resources, I believe) do not yet support self-referencing variables. What I mean is this:


module "sandbox-web-server" {
  environment = "sandbox"
  source = "../modules/${var.self.environment}"
  ...
}

Given that everything in Terraform is a directed graph, the complexity in doing this makes sense. How do you resolve a reference to a variable that hasn’t been defined yet?

This was tracked here, but it looks like a blue-sky feature right now.

Gotcha #2: Module source paths are relative to the module.

Let’s say you had a module definition that looked like this:


module "sandbox-web-servers" {
  source = "modules/aws/sandbox"
}

and a directory structure that looked like this:


$> tree
.
├── infrastructure
│   └── sandbox
│       └── web_servers.tf
└── modules
    └── aws
        └── sandbox
            └── main.tf

5 directories, 2 files

Upon running terraform apply, you’d get an awesome error saying that modules/aws/sandbox couldn’t be located, even if you ran it at the root. You’d wonder why this is given that Terraform is supposed to reference everything from the location from which the application was executed.

It turns out that modules don’t work that way. When modules are loaded with terraform get, their dependencies are sourced from the location of the module. I haven’t looked too deeply into this, but this is likely due to the way in which Terraform populates its graphs.

To fix this, you’ll need to either (a) create symlinks in all of your modules pointing to your module source, or (b) write your source paths relative to the location of the file declaring the module, like this:


module "sandbox-web-servers" {
  source "../../modules/aws/sandbox"
  ...
}

Gotcha #3: Providers must co-exist with your infrastructure!

This one took me a few hours to reason about. Let’s go back to the directory structure referenced above (which I’ve included again below for your convenience):


$> tree
.
├── infrastructure
│   └── sandbox
│       └── web_servers.tf
└── modules
    └── aws
        └── sandbox
            └── main.tf

5 directories, 2 files

Since you deploy to multiple providers (nitpick: nearly every example I’ve seen on Terraform assumes you’re using AWS!), you want to create a providers folder to express this. Additionally, since your infrastructure might be defined differently by environment and you want the thing that’s actually calling terraform to assume as little about your infrastructure as possible, you want to break it down by environment. When I tried this, it looked like this:


.
├── infrastructure
│   └── sandbox
│       └── web_servers.tf
├── modules
│   └── aws
│       └── sandbox
│           └── main.tf
└── providers
    ├── openstack
    ├── colos
    ├── gce
    └── aws
        ├── dev
        │   ├── main.tf
        │   └── variables.tf
        ├── pre-prod
        │   ├── main.tf
        │   └── variables.tf
        ├── prod
        │   ├── main.tf
        │   └── variables.tf
        └── sandbox
            ├── main.tf
            └── variables.tf

14 directories, 10 files

You now want to reference this in your modules:


# infrastructure/sandbox/aws_web_servers.tf
module "sandbox-web-servers" {
  source = "../../modules/aws/sandbox"
  provider = "aws.sandbox.us-west-1" # using a provider alias
  ...
}

and are in for a pleasant surprise when you discover that Terraform fails because it can’t locate the “aws.sandbox.us-west-1” provider.

I initially assumed that when Terraform looked for the nearest provider, it would search the entire directory tree for a suitable one; in other words, that it would follow a search path like this:


- ./infrastructure/sandbox
- ./infrastructure
- .
- ./modules
- ./modules/aws
- ./modules/aws/sandbox
- .
- ./providers
- ./providers/aws
- ./providers/aws/sandbox <-- here

But that’s not what happens. Instead, it looks for providers in the same location as the configuration that references the module. This meant that I had to put providers.tf in the same place as aws_web_servers.tf.

I couldn’t even get away with putting it in the directory for its requisite environment above it (i.e. ./infrastructure/sandbox) because Terraform doesn’t currently support object inheritance.
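
For reference, a providers.tf defining an aliased provider like the aws.us-west-1 used earlier would look roughly like this (a sketch; the alias and region are illustrative, and credential settings are omitted):

# providers.tf (illustrative)
provider "aws" {
  alias  = "us-west-1"
  region = "us-west-1"
}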

Instead of re-defining my providers in every directory, I created my providers.tf in every infrastructure environment folder I had (which is just sandbox at the moment) and symlinked it in every folder underneath it. In other words:


carlosonunez@DESKTOP-DSKP2VT:/tmp/terraform$ ln -s ../providers.tf infrastructure/sandbox/aws/providers.tf
carlosonunez@DESKTOP-DSKP2VT:/tmp/terraform$ ls -lart infrastructure/sandbox/aws/
total 0
-rw-rw-rw- 1 carlosonunez carlosonunez  0 Dec  6 23:52 web_servers.tf
drwxrwxrwx 2 carlosonunez carlosonunez  0 Dec  7 00:14 ..
drwxrwxrwx 2 carlosonunez carlosonunez  0 Dec  7 00:14 .
lrwxrwxrwx 1 carlosonunez carlosonunez 15 Dec  7 00:14 providers.tf -> ../providers.tf
carlosonunez@DESKTOP-DSKP2VT:/tmp/terraform$ tree
.
├── infrastructure
│   └── sandbox
│       ├── aws
│       │   ├── providers.tf -> ../providers.tf
│       │   └── web_servers.tf
│       └── providers.tf
├── modules
│   └── aws
│       └── sandbox
│           └── main.tf
└── providers
    ├── aws
    ├── colos
    ├── gce
    └── openstack
        ├── dev
        │   ├── main.tf
        │   └── variables.tf
        ├── pre-prod
        │   ├── main.tf
        │   └── variables.tf
        ├── prod
        │   ├── main.tf
        │   └── variables.tf
        └── sandbox
            ├── main.tf
            └── variables.tf

15 directories, 12 files

It’s not great, but it’s a lot better than re-defining my providers everywhere.

Gotcha #4: Unset your provider env vars!

So the thing in Gotcha #3 never happened to you. It seemed to deploy just fine. That is until you realized you were deploying to the production account instead of the dev, which you were abruptly informed of by Finance when they were wondering why you spun up $15,000 worth of compute. Oops.

This is because of a thoughtful-yet-conveniently-unfortunate side effect of providers whereby (a) most of them support using environment variables to define their behavior, and (b) Terraform has no way of turning this off (an issue I recently raised).

For now, unset any environment variables used by boto, the OpenStack clients, gcloud or whatever provider tooling you might be using before running terraform commands. That, or run them in a clean shell using /bin/sh.
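
For AWS, for example, that amounts to something like this (a sketch; these are the standard AWS credential variables, and the config path is a placeholder):

$> unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN AWS_PROFILE
$> terraform plan
$> # or, run Terraform with a scrubbed environment entirely:
$> env -i PATH=$PATH /bin/sh -c 'cd path/to/your/configuration && terraform plan'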

That’s it!

I’m really enjoying Terraform. I hope you are too! Do you have any other gotchas? Want to leave some feedback? Throw in a comment below!

About Me

I’m a DevOps consultant for ThoughtWorks, a software company striving for engineering excellence and a better world for our next generation of thinkers and leaders. I love everything DevOps, Windows, and Powershell, along with a bit of burgers, beer and plenty of travel. I’m on twitter @easiestnameever and LinkedIn at @carlosindfw.

Driving technical change isn’t always technical

Paperful office

Locked rooms full of potential secrets was nothing new for a multinational enterprise that a colleague of mine consulted for a few years ago. A new employee stumbling upon one of these rooms, however, was.

What that employee found in his accidental discovery was a bit unusual: a room full of boxes, all of which were full of neatly-filed printouts of what seemed like meeting minutes. Curious about his new find, he asked his coworkers if they knew anything about this room.

None did.

It took him weeks to find the one person that had a clue about this mysterious room. According to her, one team was asked to summarize their updates every week, and every week, someone printed those updates out, shipped them to the papers-to-the-metaphoric-ceiling room and categorized them.

Seems strange? This fresh employee thought so. He sought to find out why.

After a few weeks of semi-serious digging, he excavated the history behind this process. Many, many years ago (I’m talking about bring-your-family-into-security-at-the-airport days), an executive was on his way to a far-away meeting and remembered along the way that he forgot to bring a summary of updates for an important team that was to come up in discussion. Panicked, he asked his executive assistant to print it out and bring it to him post haste. She did.

To prevent this from happening again, she printed out and filed that update every week in the room that eventually became the paper jungle gym. She trained her replacement to do this, her replacement trained her replacement; I think you see where this is headed. The convenience eventually became a “rule,” and because we tend to be conformant in social situations, this rule was never contested.

None of those printed updates in that room were ever used.


This has nothing to do with DevOps.

Keep reading.

I’m not sure of what became of that rule (and neither is my colleague). There is one thing I’m sure of, though: tens of thousands of long-lived companies of all sizes have processes like these. Perhaps your company’s deployments to production depend on an approval from some business unit that’s no longer involved with the frontend. Perhaps your company requires a thorough and tedious approval process for new software regardless of its triviality or use. Perhaps your team’s laptops and workstations are locked down as tightly as those of a business analyst who only uses their computer for Excel, Word and PowerPoint. (It’s incredible what they can do. Excel itself is a damn operating system; it even includes its own memory manager.)

Some of the simplest technology changes you can make to help your company go faster to market don’t involve technology at all. If you notice a rule or process that doesn’t make sense, it might be worth your while to do your own digging and question it. More people might agree with you than you think.

About Me

I’m a DevOps consultant for ThoughtWorks, a software company striving for engineering excellence and a better world for our next generation of thinkers and leaders. I love everything DevOps, Windows, and Powershell, along with a bit of burgers, beer and plenty of travel. I’m on twitter @easiestnameever and LinkedIn at @carlosindfw.

Winning at Ansible: How to manipulate items in a list!

The Problem

Ansible is a great configuration management platform with a very, very extensible language for expressing your infrastructure as code. It works really well for common workflows (deploying files, adding authorized_keys, creating new EC2 instances, etc.), but its limitations become readily apparent as you begin embarking on more custom and complex plays.

Here’s a quick example. Let’s say you have a playbook that uses a variable (or var in Ansible-speak) that contains a list of dictionaries, like this:

important_files:
  - file_name: ssh_config
    file_path: /usr/shared/ssh_keys
    file_purpose: Shared SSH config for all mapped users.
  - file_name: bash_profile
    file_path: /usr/shared/bash_profile
    file_purpose: Shared .bash_profile for all mapped users.

(You probably wouldn’t manage files in Ansible this way, as it already comes with a fleshed-out module for doing things with files; I just wanted to pick something that was easy to work with for this post.)
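
For completeness, managing the first of those files with Ansible’s built-in copy module would look something like this (a quick sketch; the src path is made up):

- name: "Deploy shared SSH config."
  copy:
    src: files/ssh_config
    dest: /usr/shared/ssh_keys/ssh_config
    mode: '0644'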

If you wanted to get a list of file_names from this var, you can do so pretty easily with set_fact and map:

- name: "Get file_names."
set_fact:
file_names: "{{ important_files | map(attribute='file_name') }}"

This should return:

[ u'ssh_config', u'bash_profile' ]

However, what if you wanted to modify every file name to add some sort of identifier, like this:

[ u'ssh_config_12345', u'bash_profile_12345' ]

The answer isn’t as clear. One of the top answers for this problem suggested extending the map Jinja2 filter to make this happen, but (a) I’m too lazy for that, and (b) I don’t want to depend on custom code that might not be on an actual production Ansible management host.

The solution

It turns out that the solution for this is more straightforward than it seems:

- name: "Set file suffix"
set_fact:
file_suffix: "12345"

- name: &quot;Get and modify file_names.&quot;
set_fact:
file_names: "{{ important_files | map(attribute='file_name') | list | map('regex_replace','(.*)','\\1_{{ file_suffix }}') | list }}"

Let’s break this down and explain why (I think) this works:

  • map(attribute='file_name') extracts the value of the given attribute from every item in the list.
  • list casts the generated data structure back into a list (I’ll explain why this matters below).
  • map('regex_replace', pattern, replacement) runs the regex replacement against every string in the list. This is what actually does what you want.
  • list casts the results back down to a list again.

The thing that’s important to note about this (and the thing that had me hung up for a while) is that every call to map (or most other Jinja2 filters) returns a lazy generator object, NOT the list of values you were expecting!

What this means is that if you did this:

- name: "Set file suffix"
set_fact:
file_suffix: "12345"

- name: "Get and modify file_names."
set_fact:
file_names: "{{ important_files | map(attribute='file_name') | map('regex_replace','(.*)','\\1_{{ file_suffix }}') }}"

You might not get what you were expecting:

ok: [localhost] => {
    "msg": "Test - <generator object do_map at 0x7f9c15982e10>."
}

This is sort-of, kind-of explained in this bug post, but it’s not very well documented.

Conclusion

This is the first of a few blog posts on my experiences of using and failing at Ansible in real life. I hope that these save someone a few hours!

About Me

Carlos Nunez is a site reliability engineer for Namely, a modern take on human capital management, benefits and payroll. He loves bikes, brews and all things Windows DevOps and occasionally helps companies plan and execute their technology strategies.

Concurrency is a terrible name.

I was discussing the power of Goroutines a few days ago with a fellow co-worker. Naturally, the topic of “doing things at the same time in fancy ways” came up. In code, this is usually expressed by the async or await keywords depending on your language of choice. I told him that I really liked how Goroutines abstract away much of the grunt work in sharing state across multiple threads. As nicely as he possibly could, he responded with:

You know nothing! Goroutines don’t fork threads!

This sounded ludicrous to me. I (mistakenly) thought that concurrency == parallelism, because doing things “concurrently” usually means doing them simultaneously, i.e. what is typically described as being run in parallel. Nobody ever says “I made a grilled cheese sandwich in parallel to waiting for x.” So I argued that concurrency is all about multithreading while he argued that concurrency is all about context switching. This small, but friendly, argument drew in a few co-workers around us, and much ado about event pumps was made.

After a few minutes of me being proven deeply wrong, one of our nearby coworkers mentioned this tidbit of knowledge:

Concurrency is a terrible name for this.

I couldn’t agree more, and my small post will talk about why.

In computer science, concurrency is the term used to describe multiple things being in progress at the same time within the same “thread” of execution. In contrast, parallelism describes multiple things being done at the same time across multiple “threads” of execution. The biggest difference between the two is the ability to do multiple units of work simultaneously across multiple processors.

“What about multithreading,” you might ask. “I thought that the whole point of doing things across multiple threads was to do multiple things at once!”

Here’s the thing: each processor core can only do things one instruction at a time. The massive amount of engineering, silicon and transistors it has is built to execute one instruction at a time really, really quickly and accurately. What gets executed and when is up to the operating system queueing up work for the processor to do. Operating systems deal with this by giving every process (and its threads) a pre-defined amount of time with the processor called a time slice or quantum.

The processor is even processing instructions when the operating system has nothing for it to do; these instructions are called NOOPs in x86 assembly. (Fun fact: whenever you open up Task Manager or Activity Monitor and see the % of CPU being used, what you’re actually looking at is the ratio of instructions being executed to NOOPs.) Process scheduling is quite the loaded topic that I’m almost certain that I’m not doing justice to; if you’re interested in learning more about it, these slides from an operating systems course from UC Davis describe this really well.

Even though operating systems typically schedule work from processes to be done serially on one processor, the programmer can tell it to divide the work amongst multiple or all processors on the system. So instead of work from this process being done one instruction at a time, it can be done n instructions at a time, where n is the number of processors installed on the system. What’s more, since most operating systems typically slam the first processor for everything, processes that take advantage of this can typically get more done faster since they are not competing for as much time on the main processor. This approach is called symmetric multiprocessing, or SMP, and Windows has supported it since Windows NT and Linux since 2.4. In other words, this is nothing new.

To make matters more confusing, these days operating systems will often schedule threads across multiple processors automatically if the application uses multiple threads, so for practicality’s sake, concurrent programming == parallel programming.
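
A tiny Go sketch of the distinction (my own example, not from the discussion above): with GOMAXPROCS pinned to 1, the goroutines below are concurrent but not parallel; raise it to the number of CPUs and the very same code runs in parallel.

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    // Pin everything to one processor: the goroutines interleave (concurrency)
    // but no two of them ever run at the same instant (no parallelism).
    runtime.GOMAXPROCS(1)

    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for step := 0; step < 3; step++ {
                fmt.Printf("goroutine %d, step %d\n", id, step)
                runtime.Gosched() // yield so another goroutine can be scheduled
            }
        }(i)
    }
    wg.Wait()
    // Setting GOMAXPROCS to runtime.NumCPU() lets these same goroutines run in
    // parallel across cores: same code, different scheduling.
}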

TL;DR

Concurrency and parallelism aren’t the same, except when they are. Sort of.

About Me

Carlos Nunez is a site reliability engineer for Namely, a human capital management and payroll solution made for humans. He loves bikes, brews and all things Windows DevOps and occasionally helps companies plan and execute their technology strategies.

Technical Thursdays: Calculate Directory Sizes Stupidly Fast With PowerShell.

Scenario

A file share that a group in your business is dependent on is running out of space. As usual, they have no idea why they’re running out of space, but they need you, the sysadmin, to fix it, and they need it done yesterday.

This has been really easy for Linux admins for a long time now: Do this

du -h / | sort -hr

and delete folders or files from folders at the top that look like they want to be deleted.

Windows admins haven’t been so lucky…at least those that wanted to do it on the command line (which is becoming increasingly important as Microsoft focuses more on promoting Windows Server Core and PowerShell).

dir sort-of works, but it only prints sizes on files, not directories. This gets tiring really fast, since many big files are system files, and you don’t want to be that guy that deletes everything in C:\windows\system32\winsxs again.

Doing it in PowerShell is a lot better in this regard (as written by Ed Wilson from The Scripting Guys)

function Get-DirectorySize ($directory) {
Get-ChildItem $directory -Recurse | Measure-Object -Sum Length | Select-Object `
    @{Name="Path"; Expression={$directory.FullName}},
    @{Name="Files"; Expression={$_.Count}},
    @{Name="Size"; Expression={$_.Sum}}
}

This code works really well for getting a folder report…until you try it on a folder like, say, C:\Windows\System32, where you have lots and lots of little files that PowerShell needs to (a) measure, (b) wait for .NET to marshal the Win32 file system object into a System.IO.File object, then (c) wrap into the fancy PSObject we know and love.

This is exacerbated further upon running this against a remote SMB or CIFS file share, which is the more likely scenario these days. In this case, Windows needs to make an SMB call to tell the endpoint on which the file share is hosted to measure the size of the directories you’re looking to report on. With CMD, once Windows gets this information back, CMD pretty much dumps the result onto the console and goes away. .NET, unfortunately, has to create System.IO.File objects for every single file in that remote directory, and in order to do that, it needs to retrieve extended file information.

By default, it does this for every single file. This isn’t a huge overhead when the share is on the same network or a network with a low-latency/high-bandwidth path. This is a huge problem when this is not the case. (I discovered this early in my career when I needed to calculate folder sizes on shares in Sydney from New York. Australia’s internet is slow and generally awful. I was not a happy man that day.)

Lee Holmes, a founding father of Powershell, wrote about this here. It looks like this is still an issue in Powershell v5 and, based on his blog post, will continue to remain an issue for some time.

This post will show you some optimizations that you can try that might improve the performance of your directory sizing scripts. All of this code will be available on my GitHub repo.

Our First Trick: Use CMD

One common way of sidestepping this issue is by running dir /s in a hidden cmd window and doing some light string parsing, like this:

function Get-DirectorySizeWithCmd {
    param (
        [Parameter(Mandatory=$true)]
        [string]$folder
    )

    $lines = & cmd /c dir /s $folder /a:-d # Run dir in a hidden cmd.exe prompt and return stdout.

    $key = "" # We'll use this to hold the subdirectory currently being processed.
    $fileCount = 0
    $dict = @{} # We'll use this hashtable to map each directory to its size.
    $lines | ?{$_} | %{
        # These lines have the directory names we're looking for. When we see one,
        # remove the "Directory of" part and save the directory name.
        if ( $_ -match " Directory of.*" ) {
            $key = $_ -replace " Directory of ",""
            $dict[$key.Trim()] = 0
        }
        # ...until we encounter the line with the size of that folder, which always looks like "n File(s) n bytes".
        # Take the byte count, set it as the size of the directory we found before, then clear the key to avoid
        # overwriting this value with the "Total Files Listed" summary at the end.
        elseif ( $key -and ($_ -match "\d{1,} File\(s\).*\d{1,} bytes") ) {
            $val = ($_ -replace ".* ([0-9,]{1,}) bytes.*","`$1") -replace ",","" # strip thousands separators so we can sum
            $dict[$key.Trim()] = $val
            $key = ""
        }
        # Every other line is a file entry, so add it to our file count.
        else {
            $fileCount++
        }
    }
    $sum = 0
    foreach ( $val in $dict.Values ) {
        $sum += [long]$val
    }
    New-Object -Type PSObject -Property @{
        Path = $folder;
        Files = $fileCount;
        Size = $sum
    }
}

It’s not pure PowerShell, but it might save you a lot of time over high-latency connections. (It is usually slower on local or nearby storage.)

Our Second Trick: Use Robocopy

Most Windows sysadmins know about the usefulness of robocopy during file migrations. What you might not know is how good it is at sizing directories. Unlike dir, robocopy /l /nfl /ndl has two advantages:

  1. It won’t list every file or directory it finds in its path, and
  2. It provides a little more control over the output, which makes it easier for you to parse once the output makes its way to your PowerShell session.

Here’s some sample code that demonstrates this approach:

function Get-DirectorySizeWithRobocopy {
    param (
        [Parameter(Mandatory=$true)]
        [string]$folder
    )

    $fileCount = 0
    $totalBytes = 0
    robocopy /l /nfl /ndl $folder \\localhost\C$\nul /e /bytes | ?{
        $_ -match "^[ \t]+(Files|Bytes) :[ ]+\d"
    } | %{
        $line = $_.Trim() -replace '[ ]{2,}',',' -replace ' :',':'
        $value = $line.split(',')[1]
        if ( $line -match "Files:" ) {
            $fileCount = $value
        } else {
            $totalBytes = $value
        }
    }
    [pscustomobject]@{Path=$folder;Files=$fileCount;Bytes=$totalBytes}
}

The Target

For this post, we’ll be using a local directory with ~10,000 files that were about 1 to 10k in length (the cluster size on the server I used is ~8k, so they’re really about 8-80k in size) and spread out across 20 directories. The code below will generate this for you:

$maxNumberOfDirectories = 20

$maxNumberOfFiles = 10000
$minFileSizeInBytes = 1024
$maxFileSizeInBytes = 1024*10
$maxNumberOfFilesPerDirectory = [Math]::Round($maxNumberOfFiles/$maxNumberOfDirectories)

for ($i=0; $i -lt $maxNumberOfDirectories; $i++) {
    mkdir "./dir-$i" -force

    for ($j=0; $j -lt $maxNumberOfFilesPerDirectory; $j++) {
        $fileSize = Get-Random -Min $minFileSizeInBytes -Max $maxFileSizeInBytes
        $str = 'a'*$fileSize
        echo $str | out-file "./file-$j" -encoding ascii
        mv "./file-$j" "./dir-$i"
    }
}

I used values of 1000 and 10000 for $maxNumberOfFiles while keeping the number of directories at 20.

Here’s how we did:

                                1k files    10k files
Get-DirectorySize               ~60ms       ~2500ms
Get-DirectorySizeWithCmd        ~110ms      ~3600ms
Get-DirectorySizeWithRobocopy   ~45ms       ~85ms
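
If you want to reproduce timings like these yourself, Measure-Command makes it easy. A quick sketch, calling the functions above positionally so their differing parameter names don’t matter:

'Get-DirectorySize','Get-DirectorySizeWithCmd','Get-DirectorySizeWithRobocopy' | ForEach-Object {
    $fn = $_
    $elapsed = Measure-Command { & $fn . | Out-Null }   # time each function against the current directory
    "{0}: ~{1:N0}ms" -f $fn, $elapsed.TotalMilliseconds
}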

I was actually really surprised to see how performant robocopy was. I believe that cmd would be just as performant if not more so if it didn’t have to do as much printing to the console as it does.

/MT isn’t a panacea

The /MT switch tells robocopy to split the job it’s given amongst several child robocopy instances. One would think that this would speed things up, since the only thing faster than robocopy is more robocopy. It turns out that this was actually NOT the case, as its times ballooned to around what we saw with cmd. I presume that this has something to do with the way that those jobs are being pooled, or that each process is logging to its own stdout buffer.

TL;DR: Don’t use it.

A note about Jobs

PowerShell Jobs seem like an attractive option. Jobs make it very easy to run several pieces of code concurrently. For long-running scriptblocks, Jobs are actually an awesome approach.

Unfortunately, Jobs will work against you for a problem like this. Every PowerShell Job spawns a new PowerShell session with its own PowerShell process. Each runspace within that session will use at least 20MB of memory, and that’s without modules! Additionally, you’ll need to start every Job serially, which means that the time spent just starting each job could very well exceed the amount of time it takes robocopy to compute your directory sizes. Finally, if you use cmd or robocopy to compute your directory sizes, every job will invoke its own copies of cmd and robocopy, which will further increase your memory usage for, potentially, very little benefit.
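
For reference, the Jobs-based approach I’m describing would look roughly like this (a sketch that reuses the robocopy function above; I’m not recommending it):

# One job, and therefore one new powershell.exe process, per directory. Heavy!
$jobs = Get-ChildItem -Directory | ForEach-Object {
    Start-Job -ScriptBlock ${function:Get-DirectorySizeWithRobocopy} -ArgumentList $_.FullName
}
$jobs | Wait-Job | Receive-Job
$jobs | Remove-Job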

TL;DR: Don’t use Jobs either.

That’s all I’ve got! I hope this helps!

Do you have another solution that works? Has this helped you size directories a lot faster than before? Let’s talk about it in the comments!

About Me

I’m the founder of caranna.works, an IT engineering firm in Brooklyn that builds smarter and cost-effective IT solutions that help new and growing companies grow fast. Sign up for your free consultation to find out how. http://caranna.works.