Apologies in advance if this article gets a tad scary, not because of the threat of flesh-eating zombies appearing but because of some of the tools I’ll be using, although I’ll show you how easy they actually are to use.
So What is a Zombie Process?
There seem to be a few different views out there as to what constitutes a zombie process on Windows operating systems.
There’s the school of thought here that they are processes which have open handles to non-existent items, such as processes which have exited. The article links to a handy tool on GitHub which can identify these processes. This led me to write a PowerShell script which opens handles to a given process via the OpenProcess() API and then waits for user input, or for a specified amount of time, before closing them. If, whilst it is waiting, the process(es) to which it holds handles are terminated, the tool deems the PowerShell process running my script to be a Zombie.
It’s definitely a useful tool, but my interpretation of a Zombie process is one that should be dead, and thus gone completely, but isn’t. So, in my opinion, this doesn’t fit the case here: my PowerShell process is alive and well, and living in West Yorkshire; it merely has handles to processes that are dead, which makes those almost Zombies.
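For reference, the handle-holding script described above can be sketched roughly as follows. This is not my original script but a minimal sketch assuming Windows and P/Invoke through Add-Type; the Win32.Native type and the Wait-WithProcessHandle function are hypothetical names. It asks only for PROCESS_QUERY_LIMITED_INFORMATION (0x1000) access, which is enough for an open handle to keep the target’s process object referenced after the process exits.

```powershell
# Sketch only: Win32.Native and Wait-WithProcessHandle are hypothetical names.
Add-Type -Namespace Win32 -Name Native -MemberDefinition @'
[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr OpenProcess(uint dwDesiredAccess, bool bInheritHandle, int dwProcessId);

[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool CloseHandle(IntPtr hObject);
'@

function Wait-WithProcessHandle {
    param(
        [Parameter(Mandatory)] [int[]] $ProcessId,
        [int] $TimeoutSeconds = 0   # 0 means wait for the user to press Enter instead
    )
    $PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
    $handles = [System.Collections.Generic.List[IntPtr]]::new()
    foreach ($id in $ProcessId) {
        $h = [Win32.Native]::OpenProcess($PROCESS_QUERY_LIMITED_INFORMATION, $false, $id)
        if ($h -eq [IntPtr]::Zero) { Write-Warning "OpenProcess failed for pid $id" }
        else { $handles.Add($h) }
    }
    try {
        # If the target process(es) exit while we wait, we are left holding
        # handles to dead processes
        if ($TimeoutSeconds -gt 0) { Start-Sleep -Seconds $TimeoutSeconds }
        else { $null = Read-Host 'Handles open; press Enter to close them' }
    }
    finally {
        foreach ($h in $handles) { [void][Win32.Native]::CloseHandle($h) }
    }
    $handles.Count   # number of handles that were opened (and have now been closed)
}
```

For example, `Wait-WithProcessHandle -ProcessId (Get-Process notepad).Id -TimeoutSeconds 120` gives you two minutes to close Notepad and then run the GitHub tool against the PowerShell process.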
Helge Klein wrote a great post on finding the causes of a lack of session reuse on RDS/Citrix servers, that is, where the session ids returned by the quser.exe command are relatively high numbers even though the server hasn’t just been rebooted. This is a subject close to my heart, as I see it frequently in my consultancy work, and it is what initially led to my Zombie hunt. I’ve had some success using his approach to find Zombie sessions, mainly where processes show in task manager with a status of “Suspended”. I haven’t had any joy, though, in using the SysInternals handle.exe utility to close all open handles and thus free up the session (which is also potentially dangerous). The “suspended” processes are closer to Zombies, though, in my view of the world: they should be dead but aren’t, since they are generally present in sessions which have been logged off, where all processes should have been terminated.
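To keep an eye on how high session ids are climbing, they can be pulled out of quser.exe output. The helper below, Get-QuserSessionId, is hypothetical and sketched under assumptions: quser prints a header line, then USERNAME, an optional SESSIONNAME (blank for disconnected sessions) and then the ID. Since the headings are localised, it skips the first line rather than matching on them.

```powershell
# Hypothetical helper: return the session ids listed by quser.exe.
# Assumes the ID is the first whitespace-delimited number after USERNAME
# and the optional SESSIONNAME column.
function Get-QuserSessionId {
    param([string[]] $Line)
    if (-not $PSBoundParameters.ContainsKey('Line')) {
        $Line = quser.exe 2>$null
    }
    $Line | Select-Object -Skip 1 | ForEach-Object {
        # '>' marks the current user's session
        if ($_ -match '^>?\s*\S+\s+(?:\S+\s+)?(\d+)\s') { [int]$Matches[1] }
    }
}
```

For example, `Get-QuserSessionId | Sort-Object | Select-Object -Last 1` returns the highest session id currently listed.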
I’ve yet to find a way in PowerShell to identify processes that task manager shows as suspended, as that status doesn’t seem to be exposed by any property returned by Get-Process or by a WMI/CIM query on the win32_process class. The latter has an “ExecutionState” property which ought to be what I need, but it is always empty, so it appears not to be implemented. However, I have had some success looking for processes whose session id belongs to a session that no longer exists, or which have no threads or open handles, as these seem to be symptoms of some of these “Suspended” processes. I thought it would be a case of finding that all threads for a given process were suspended, but how does that relate to processes flagged as suspended in task manager that have no threads at all, unless task manager assumes that any process with no threads must be suspended, since it has nothing that could be scheduled to run on a CPU?
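Those symptoms can be expressed as a filter over Get-Process output. This is a sketch, and Find-SuspectProcess is a hypothetical name: it assumes you supply the ids of the sessions that currently exist (for example from quser.exe, plus 0 for the services session), and it flags candidates rather than proving that anything is a Zombie. It also includes the thread-level test, treating a thread as suspended when its ThreadState is Wait and its WaitReason is Suspended.

```powershell
# Hypothetical helper: flag processes showing the "Suspended" symptoms described above.
function Find-SuspectProcess {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $Process,
        [Parameter(Mandatory)] [int[]] $LiveSessionId
    )
    process {
        $reasons = [System.Collections.Generic.List[string]]::new()
        # @() plus the null filter copes with objects that have no threads at all
        $threads = @($Process.Threads | Where-Object { $null -ne $_ })
        if ($threads.Count -eq 0) {
            $reasons.Add('NoThreads')
        }
        else {
            # Check ThreadState first: reading WaitReason on a thread that isn't waiting throws
            $suspended = @($threads | Where-Object { $_.ThreadState -eq 'Wait' -and $_.WaitReason -eq 'Suspended' })
            if ($suspended.Count -eq $threads.Count) { $reasons.Add('AllThreadsSuspended') }
        }
        if ($Process.HandleCount -eq 0) { $reasons.Add('NoHandles') }
        if ($LiveSessionId -notcontains $Process.SessionId) { $reasons.Add('DeadSession') }
        if ($reasons.Count -gt 0) {
            [pscustomobject]@{
                Name      = $Process.Name
                Id        = $Process.Id
                SessionId = $Process.SessionId
                Reason    = $reasons -join ','
            }
        }
    }
}
```

For example, `Get-Process | Find-SuspectProcess -LiveSessionId 0,1,2`, where 0,1,2 stand in for the sessions that currently exist.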
Microsoft have a debugger extension command, “!zombies”, but I’ve yet to get it to show any Zombie processes.
My own definition of a Zombie process is one that is not visible in any user mode tool, such as task manager or Process Explorer, but still has an EPROCESS block in the kernel address space. Given their inherent lack of visibility, these are tricky to find, as you have to look for symptoms of their existence, such as Zombie sessions causing a lack of session reuse.
Finding Zombie Processes
So how do we see these EPROCESS blocks, which reside in the kernel address space? Why, with a debugger, of course. However, we would normally look at these in kernel dumps, which we usually get after a BSoD (although there are some neat ways, which I’ve used, of converting VMware and Hyper-V snapshot memory files and saved state files into dump files).
We can point a debugger at a running machine, but only if it has been booted in debug mode, which isn’t generally how machines should be booted in BAU (business as usual) situations.
Fortunately, SysInternals comes to the rescue again with a tool called livekd which, in conjunction with Windows debuggers like windbg (a debugger with a GUI) or kd (a command line debugger), can debug a live system. It’s easy to use: we just need to tell it where windbg or kd is (they are installed in the same folder) and decide what we are going to do about symbols, which are required to convert raw memory addresses into meaningful function names. If you don’t do anything about symbols, livekd will prompt with some defaults, but I prefer to set the _NT_SYMBOL_PATH environment variable first, to “srv*c:\Symbols*http://msdl.microsoft.com/download/symbols”, which uses Microsoft’s online symbol server and stores the downloaded symbols in the c:\Symbols folder.
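In a PowerShell session, setting that symbol path looks like this; livekd and the debugger it launches from the same console inherit it. Persisting it for the user via SetEnvironmentVariable is optional, so it is left commented out here.

```powershell
# Use Microsoft's symbol server, storing downloaded symbols in c:\Symbols.
# Applies to this PowerShell session and any process started from it.
$env:_NT_SYMBOL_PATH = 'srv*c:\Symbols*http://msdl.microsoft.com/download/symbols'

# Optional: persist it for the current user (affects newly started processes only)
# [Environment]::SetEnvironmentVariable('_NT_SYMBOL_PATH', $env:_NT_SYMBOL_PATH, 'User')
```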
So, to start windbg to debug the local system, we run the following from an elevated PowerShell session (not a legacy cmd prompt), obviously changing the folders depending on where livekd.exe and windbg.exe live:
& 'C:\Program Files\sysinternals\livekd.exe' -k 'C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\windbg.exe' -w -b
This will then start windbg which should look something like this:
Don’t worry about the symbols error for livekd.sys; it can safely be ignored.
We’re now ready to view the process entries contained in EPROCESS blocks, which we do by running the “!process 0 0” debugger extension command. The first zero instructs it to dump all processes rather than the current process or a specific pid (which we would specify in hex, not decimal), and the second is a flags value which, here, tells it to dump just basic information. We type this in at the bottom, where the “0: kd>” prompt is shown.
This should then give something similar to the following, although it may take some time to complete if there are a lot of processes (if you’re on a remote session, minimising the windbg window can make it complete faster, since it has less screen updating to do).
To quit the debugger, enter “q”.
So there we have it: all of the processes as the kernel sees them. But hang on, that’s not exactly easy to digest, is it now? And what does all the data mean? The Cid is actually the process id, albeit in hex, and thus the ParentCid is the id of the process’s parent.
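Converting between the two radixes is a one-liner each way in PowerShell: hex to decimal for reading Cids, and decimal to hex when you want “!process” for one specific pid. The Cid value 0e7c below is just a made-up example.

```powershell
# Cid/ParentCid values in !process output are hex; task manager and Get-Process use decimal
$cid = '0e7c'                               # example Cid, as printed by !process
$processId = [Convert]::ToInt32($cid, 16)   # hex -> decimal: 3708

# Is that pid visible to user mode right now? (no output if not)
Get-Process -Id $processId -ErrorAction SilentlyContinue

# The reverse, for "!process <pid> <flags>": decimal -> hex
$command = '!process {0:x} 0' -f $processId   # '!process e7c 0'
```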
Rather than manually wading through this data, I wrote a quick PowerShell script that uses regular expressions to parse it (the output can be copied and pasted into a text file, or saved directly to a file via the “Write Window Text to File” option on the windbg Edit menu), correlates these processes against the currently running processes via the Get-Process cmdlet and then outputs the data as below, where I’ve filtered the grid view to show only entries where “Running” is false:
Ignore the fact that it flags the System process as a Zombie, as it isn’t one; I’ll filter that out presently.
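The following is a minimal sketch of the parse-and-correlate approach, not the script itself. It assumes the usual “!process 0 0” layout (a PROCESS line, then an indented line holding SessionId, Cid and ParentCid, then an Image line), treats SessionId as decimal and the Cids as hex, and correlates on pid only. ConvertFrom-ProcessDump and Add-RunningState are hypothetical names.

```powershell
# Sketch: parse "!process 0 0" text into objects (hypothetical helper)
function ConvertFrom-ProcessDump {
    param([string[]] $Line)
    $current = $null
    foreach ($l in $Line) {
        if ($l -match '^\s*PROCESS\s+([0-9a-f`]+)') {
            # A new EPROCESS entry starts: emit the previous one, if any
            if ($null -ne $current) { New-Object -TypeName psobject -Property $current }
            $current = [ordered]@{
                Address = $Matches[1]; SessionId = $null; ProcessId = $null
                ParentProcessId = $null; Image = $null
            }
        }
        elseif ($null -ne $current -and
                $l -match 'SessionId:\s+(\S+)\s+Cid:\s+([0-9a-f]+).*ParentCid:\s+([0-9a-f]+)') {
            # Assumption: SessionId is printed in decimal ("none" for System); Cids are hex
            $current.SessionId       = if ($Matches[1] -eq 'none') { $null } else { [int]$Matches[1] }
            $current.ProcessId       = [Convert]::ToInt32($Matches[2], 16)
            $current.ParentProcessId = [Convert]::ToInt32($Matches[3], 16)
        }
        elseif ($null -ne $current -and $l -match 'Image:\s+(.+?)\s*$') {
            $current.Image = $Matches[1]
        }
    }
    if ($null -ne $current) { New-Object -TypeName psobject -Property $current }
}

# Add a Running property: is the entry's pid among those Get-Process can see?
function Add-RunningState {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $Entry,
        [int[]] $RunningId   # defaults to the pids Get-Process returns right now
    )
    begin {
        if (-not $PSBoundParameters.ContainsKey('RunningId')) { $RunningId = (Get-Process).Id }
        $live = [System.Collections.Generic.HashSet[int]]::new()
        foreach ($id in $RunningId) { [void]$live.Add($id) }
    }
    process {
        # Running = $false: the kernel lists it, but user mode can't see it
        $Entry | Add-Member -NotePropertyName Running -NotePropertyValue ($live.Contains([int]$Entry.ProcessId)) -PassThru
    }
}
```

With the windbg text saved to a file (the name here is just an example), usage would be along the lines of `ConvertFrom-ProcessDump -Line (Get-Content .\process.txt) | Add-RunningState | Where-Object { -not $_.Running } | Out-GridView`.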
So here we can see lots of Citrix SelfService.exe Zombie processes: none are shown in task manager, and yet the processed !process output has over 25 instances in the EPROCESS table. We don’t have any Zombie sessions here, though, since the Zombie processes all exist in active or disconnected sessions. I have also discovered Zombie sessions using this technique, where processes (typically the same executables) are shown for sessions that no longer exist and aren’t shown in task manager, etc.
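To count instances per executable and spot Zombie sessions, the non-running entries can be grouped by session and image. This is a sketch assuming objects with SessionId, Image and Running properties (such as parsed !process output correlated with Get-Process) and a caller-supplied list of the sessions that currently exist; Get-ZombieSummary is a hypothetical name.

```powershell
# Hypothetical helper: summarise non-running entries per session and image
function Get-ZombieSummary {
    param(
        [Parameter(Mandatory)] [object[]] $Entry,
        [Parameter(Mandatory)] [int[]] $LiveSessionId
    )
    $Entry |
        Where-Object { -not $_.Running -and $null -ne $_.SessionId } |
        Group-Object -Property SessionId, Image |
        ForEach-Object {
            $first = $_.Group[0]
            [pscustomobject]@{
                SessionId     = $first.SessionId
                Image         = $first.Image
                Instances     = $_.Count
                # True when the session itself no longer exists: a Zombie session
                ZombieSession = $LiveSessionId -notcontains $first.SessionId
            }
        }
}
```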
You may get a few false positives for processes that exited between the !process command being executed and the script being run to process its output. However, I’m working on a script to automate the whole process (weak pun intended) by using the command line debugger kd.exe, rather than windbg, to run the !process command and pipe its output straight into a PowerShell script so that it is processed immediately.
The script will be made available in the publicly accessible ControlUp Script Based Actions (SBA) library, although all of the SBAs there can be used standalone, without ControlUp software.