Ghost Hunting in XenApp 7.x

The easily scared amongst you needn’t worry: what I am referring to here are disconnected XenApp sessions where the session that Citrix believes is still alive on a specific server has actually already ended, as in been logged off. “Does this cause problems though or is it just cosmetic?” I can hear you ask. Well, if a user tries to launch another application which is available on the same worker then it will cause a problem, because XenApp will try to use session sharing (unless it has been disabled) but there is no longer a session to share, so the application launch fails. These failures show as “machine failures” in Director. Trying to log off the actually non-existent session, such as via Director, won’t fix it because there is no session to log off. Restarting the VDA on the affected machine doesn’t cause the ghost session to be removed either.

So how does one reduce the impact of these “ghost” sessions? In researching this, I came across this article from @jgspiers detailing the “hidden” flag which can be set for a session, albeit not via Studio or Director, such that session sharing is disabled for that one session.

I therefore set about writing a script that queries Citrix for disconnected sessions, via Get-BrokerSession, cross-references each of these against the XenApp server it is flagged as running on by running quser.exe there, and reports those which don’t actually have a session on that server. In addition, the script also tries to get the actual logoff time from the User Profile Service event log on that server and checks whether the user has any other XenApp sessions, since that is a partial indication that they are not being hampered by the ghost session.
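
For illustration, the core of that cross-referencing logic boils down to something like the following. This is a simplified sketch rather than the script itself, and it assumes the Citrix broker snap-in is loaded and that quser.exe can reach the servers:

# Simplified sketch of the ghost detection idea, not the full script
Get-BrokerSession -SessionState Disconnected | ForEach-Object {
    $session = $_
    $machine = ($session.MachineName -split '\\')[-1]                      # strip the domain prefix
    $sessionsOnServer = quser.exe /server:$machine 2>$null                 # what is really on that server
    $user = if( $session.UserName ) { $session.UserName } else { $session.UntrustedUserName }
    $user = $user -replace '^.*\\'                                         # strip any domain prefix
    if( $user -and -not ( $sessionsOnServer | Select-String -SimpleMatch $user ) )
    {
        [pscustomobject]@{ Machine = $machine ; User = $user ; Uid = $session.Uid }    # probable ghost
    }
}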

If the -hide flag is passed to the script then the “hidden” flag will be set for ghost sessions found.
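
Based on the technique in the article referenced above, I believe the flag is simply a property on the broker session, so setting it for a ghost session found by the script presumably amounts to something like this (where $ghostSession stands in for a BrokerSession object identified as a ghost):

$ghostSession | Set-BrokerSession -Hidden $true    # stop session sharing targeting this session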

The script can email a list of the ghost sessions if desired, by specifying the -recipients and -mailserver options (and -proxymailserver if the SMTP mail server does not allow relaying from where you run the script). If a history file is specified, via the -historyFile option, then it will only email when a new ghost session has been discovered.

ghosted sessions example

I have noticed that the “UserName” field returned by Get-BrokerSession is quite often blank for these ghost sessions and the user name is actually in the “UntrustedUserName” field, about which the documentation states: “This may be useful where the user is logged in to a non-domain account, however the name cannot be verified and must therefore be considered untrusted”. It doesn’t explain why the UserName field is blank though, since all logons here are domain ones via StoreFront non-anonymous applications.

If running the script via a scheduled task, which I do at a frequency of every thirty minutes, with -hide, also specify the -forceIt flag, otherwise the script will hang when it prompts for confirmation that you want to set any new ghost sessions to hidden.

The script is available on GitHub here and you use it at your own risk, although I’ve been running it for one of my larger customers for months without issue. In fact, we no longer have reports of users failing to launch applications, which we had previously tracked down to the farm being haunted by these ghosts, although the problem rarely affects more than 1% of disconnected sessions. This is on XenApp 7.13.


Showing Current & Historical User Sessions

One of my pet hates, other than hamsters, is when people log on to infrastructure servers, which provide a service to users either directly or indirectly, to run a console or command when that item is available on another server which isn’t providing user services. For instance, I find people log on to Citrix XenApp Delivery Controllers to run the Studio console when, in my implementations, there will always be a number of management servers where all of the required consoles and PowerShell cmdlets are installed. They compound the issue by then logging on to other infrastructure servers to run additional consoles, which is actually more effort for them than just launching the required console instance(s) on the aforementioned management server(s). To make matters even worse, I find they quite often disconnect these sessions rather than logging off and have the temerity to leave consoles running in these disconnected sessions! How not to be in my good books!

Even if I have to troubleshoot an issue on one of these infrastructure servers, I will typically access their event logs, services, etc. remotely via the Computer Management MMC snap-in, and if I need to run non-GUI commands then I’ll use PowerShell’s Enter-PSSession cmdlet to remote to the server, which is much less of an impact than getting a full blown interactive session via mstsc or similar.

To find these offenders, I used to run quser.exe, which is what the command “query user” calls, with the /server argument against various servers to check whether people were logged on when they shouldn’t have been, but I thought that I really ought to script it to make it easier and quicker to run. I then also added the ability to select one or more of these sessions and log them off.

It also pulls in details of the “offending” user’s profile in case it is too big and needs trimming or deleting. I have written a separate script for user profile analysis and optional deletion which is also available in my GitHub repository.

For instance, running the following command:

 & '.\Show users.ps1' -name '^ctx2[05]\d\d' -current

will result in a grid view similar to the one below:

show users ordered

 

It works by querying Active Directory via the Get-ADComputer cmdlet, running quser.exe against all machines named CTX20xx or CTX25yy, where xx and yy are numerical, and displaying the results in a grid view. Sessions selected in this grid view when the “OK” button is pressed will be logged off, although PowerShell’s built-in confirmation mechanism is used so if “OK” is pressed accidentally, the world probably won’t end because of it.
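
Stripped right down, and making some assumptions about the internals, that amounts to something along these lines:

# Bare-bones sketch of the AD lookup plus quser query, not the script itself
Import-Module ActiveDirectory
$nameRegex = '^ctx2[05]\d\d'                                   # the -name argument from the example above
Get-ADComputer -Filter * |
    Where-Object { $_.Name -match $nameRegex } |
    ForEach-Object { quser.exe /server:$($_.Name) 2>$null }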

The script can also be used to show historical logons on a range of servers where the range can be specified in one of three ways:

  1. -last x[smhdwy] where x is a number and s=seconds, m=minutes, h=hours, d=days, w=weeks and y=years. For example, ‘-last 7d’ will show sessions logged on in the preceding 7 days
  2. -sinceboot
  3. -start “hh:mm:ss dd/MM/yyyy” -end “hh:mm:ss dd/MM/yyyy” (if the date is omitted then the current date is used)
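
As an aside, the “x[smhdwy]” shorthand from option 1 can be turned into a start time in just a few lines; this sketch is only to show the idea rather than being lifted from the script:

# Convert e.g. '7d' into the DateTime at the start of the window (illustrative only)
function ConvertTo-StartTime( [string]$last )
{
    if( $last -match '^(\d+)([smhdwy])$' )
    {
        $number = [int]$Matches[1]
        switch( $Matches[2] )
        {
            's' { (Get-Date).AddSeconds( -$number ) }
            'm' { (Get-Date).AddMinutes( -$number ) }
            'h' { (Get-Date).AddHours(   -$number ) }
            'd' { (Get-Date).AddDays(    -$number ) }
            'w' { (Get-Date).AddDays(    -$number * 7 ) }
            'y' { (Get-Date).AddYears(   -$number ) }
        }
    }
    else
    {
        Throw "Unrecognised -last value `"$last`""
    }
}

ConvertTo-StartTime '7d'    # start of a 7 day window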

For example, running the following:

& '.\Show users.ps1' -ou 'contoso.com/Servers/Citrix XenApp/Production/Infrastructure Servers' -last 7d

gives something not totally unlike the output below where the columns can be sorted by clicking on the headings and filters added by clicking “Add criteria”:

show users aged

Note that the OU in this example is specified as a canonical name, so it can be copied and pasted out of the properties tab for an OU in AD Users and Computers rather than you having to write it in distinguished name form, although the script will accept that format too. It can take a -group option instead of -ou, in which case it will recursively enumerate the given group to find all computers, and the -name option can be used with both -ou and -group to further restrict which machines are interrogated.

The results are obtained from the User Profile Service operational event log and can be written to file, rather than being displayed in a grid view, by using the -csv option.
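
If you wanted to query that event log directly yourself, it would look roughly like the snippet below, although note that the event id used and the fields selected here are assumptions on my part rather than what the script necessarily does ($server being a placeholder for one of your servers):

# Rough equivalent query (event id 1 assumed to be the user logon notification event)
Get-WinEvent -ComputerName $server -FilterHashtable @{
    'LogName'   = 'Microsoft-Windows-User Profile Service/Operational'
    'Id'        = 1
    'StartTime' = (Get-Date).AddDays( -7 )
} | Select-Object TimeCreated , @{ n = 'UserSid' ; e = { $_.UserId } }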

Sessions selected when “OK” is pressed will again be logged off although a warning will be produced instead if a session has already been logged off.

If you are looking for a specific user, then this can be specified via the -user option which takes a regular expression as the argument. For instance adding the following to the command line:

-user '(fredbloggs|johndoe)'

will return only sessions for usernames containing “fredbloggs” or “johndoe”

Although I wrote it for querying non-XenApp/RDS servers, you can also point it at XenApp/RDS servers, as long as the account you use has sufficient privileges, rather than using tools like Citrix Director or EdgeSight.

The script is available on GitHub here and use of it is entirely at your own risk, although if you run it with the -noprofile option it will not show the OK and Cancel buttons, so logoff cannot be initiated from the script. It requires a minimum of PowerShell version 3.0, access to the Active Directory PowerShell module, and pulls data from servers running Windows Server 2008 R2 upwards.

If you are querying non-English operating systems, there may be an issue since the script parses the fixed width output of the quser command using the column headers, namely ‘USERNAME’, ‘SESSIONNAME’, ‘ID’, ‘STATE’, ‘IDLE TIME’ and ‘LOGON TIME’ on an English OS. You may need to either edit the script or specify the column names via the -fieldNames option.
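
For reference, parsing fixed width output by header position can be done along these lines; this is a simplified illustration of the approach rather than the script’s exact code:

# Simplified fixed-width parser driven by the header line; $fieldNames is what the column name option would override
$fieldNames = 'USERNAME','SESSIONNAME','ID','STATE','IDLE TIME','LOGON TIME'
$lines  = quser.exe 2>$null
$header = $lines[0]
$starts = $fieldNames | ForEach-Object { $header.IndexOf( $_ ) }           # column start offsets
$lines | Select-Object -Skip 1 | ForEach-Object {
    $line = $_
    $session = [ordered]@{}
    for( $i = 0 ; $i -lt $starts.Count ; $i++ )
    {
        $start = [math]::Min( $starts[ $i ] , $line.Length )
        $end   = if( $i -lt $starts.Count - 1 ) { [math]::Min( $starts[ $i + 1 ] , $line.Length ) } else { $line.Length }
        $session[ $fieldNames[ $i ] ] = $line.Substring( $start , $end - $start ).Trim()
    }
    [pscustomobject]$session
}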

Memory Control Script – Capping Leaky Processes

In the third part of the series covering the features of a script I’ve written to control process working sets (aka “memory”), I will show how it can be used to prevent leaky processes from consuming more memory than you believe they should.

First off, what is a memory leak? For me, it’s trying to remember why I’ve gone into a room but, in computing terms, it is when a developer dynamically allocates memory in their programme but does not subsequently inform the operating system that they have finished with that memory. Older programming languages, like C and C++, do not have built-in garbage collection, so nothing automatically releases memory which is no longer required. Note that just because a process’s memory increases but never decreases doesn’t actually mean that it is leaking – it could be holding on to the memory for reasons that only the developer knows.

So how do we stop a process from leaking? Well, short of terminating it, we can’t as such, but we can limit the impact by forcing it to relinquish other parts of its allocated memory (working set) in order to fulfil memory allocation requests. What we shouldn’t do is deny the memory allocations themselves, which we could actually do with hooking methods like Microsoft’s Detours library. This is because if a memory allocation fails, the developer probably can’t do a lot about it other than output an error to that effect and exit. That is assuming they even bother checking the return status of the allocation before using it; if they don’t, the result is the infamous error “the memory referenced at 0x00000000 could not be read/written” (aka a null pointer dereference).

What we can do, or rather the OS can do, is to apply a hard maximum working set limit to the process. What this means is that the working set cannot increase above the limit so if more memory is required, part of the existing working set must be paged out. The memory paged out is the least recently used so is very likely to be the memory the developer forgot to release so they won’t be using it again and it can sit in the page file until the process exits. Thus increased page file usage but decreased RAM usage which should help performance and scalability and reduce the need for reboots or manual intervention.
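
On Windows, a hard maximum is applied with the SetProcessWorkingSetSizeEx API using the QUOTA_LIMITS_HARDWS_MAX_ENABLE flag. The script wraps that sort of call with its waiting, looping and reporting logic, but a bare-bones standalone equivalent would look something like this (the process name is just an example):

# Minimal standalone example of setting a hard maximum working set limit
Add-Type -TypeDefinition @'
using System;
using System.Runtime.InteropServices;
public static class WorkingSet
{
    public const uint QUOTA_LIMITS_HARDWS_MIN_DISABLE = 0x2;
    public const uint QUOTA_LIMITS_HARDWS_MAX_ENABLE  = 0x4;

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool SetProcessWorkingSetSizeEx(
        IntPtr hProcess, IntPtr dwMinimumWorkingSetSize, IntPtr dwMaximumWorkingSetSize, uint Flags);
}
'@

$process = Get-Process -Name 'leakyprocess' -ErrorAction Stop | Select-Object -First 1   # example process name
$flags = [WorkingSet]::QUOTA_LIMITS_HARDWS_MAX_ENABLE -bor [WorkingSet]::QUOTA_LIMITS_HARDWS_MIN_DISABLE
if( -not [WorkingSet]::SetProcessWorkingSetSizeEx( $process.Handle , [IntPtr]1MB , [IntPtr]100MB , $flags ) )
{
    Write-Warning "Failed to set hard working set limit - check rights and that the process is still running"
}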

Applying a hard working set limit is easy with the script, the tricky part is knowing what value to set as the limit – too low and it might not just be leaked memory that is paged out so performance could be negatively affected due to hard page faults. Too high a limit and the memory savings, if the limit is ever hit, may not be worth the effort.

To set a hard working set limit on a process we run the script thus:

.\trimmer.ps1 -processes leakyprocess -hardMax -maxWorkingSet 100MB

or, if the process has yet to start, we can use the waiting feature of the script, along with the -alreadyStarted option in case the process has actually already started by the time the script runs:

.\trimmer.ps1 -processes leakyprocess -hardMax -maxWorkingSet 100MB -waitFor leakyprocess -alreadyStarted

You will then observe in task manager that its working set never exceeds 100MB.

To check that hard limits are in place, you can use the reporting option of the script since tools like task manager and SysInternals Process Explorer won’t show whether any limits are hard ones. Run the following:

.\trimmer.ps1 -report -above 0

which will give a report similar to this where you can filter where there is a hard working set limit in place:

hard working set limit

There is a video here which demonstrates the script in action and uses task manager to prove that the working set limit is adhered to.

One way to implement this for a user would be to have a logon script that uses the -waitFor option as above, together with -loop so that the script keeps running and picks up further new instances of the process to be controlled. To implement it for system processes, such as a leaky third party service or agent, use the same approach but in a computer start-up script.
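
For example, the logon or start-up script could be little more than the command used earlier with -loop added (adjust the process name and limit to suit):

.\trimmer.ps1 -processes leakyprocess -hardMax -maxWorkingSet 100MB -waitFor leakyprocess -alreadyStarted -loop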

Once implemented, check that hard page fault rates are not impacting performance because the limit you have imposed is too low.

The script is available here and use of it is entirely at your own risk.

Changing/Checking Citrix StoreFront Logging Settings

Enabling, capturing and diagnosing StoreFront logs is not something I have to do often but when I do, I find it time consuming to enable, and disable, logging across multiple StoreFront servers and also to check on the status of logging, since Citrix provide cmdlets to change tracing levels but not, as far as I can tell, to query them.

After looking at reported poor performance of several StoreFront servers at one of my customers, I found that two of them were set for verbose logging which wouldn’t have been helping. I therefore set about writing a script that would allow the logging (trace) level to be changed across multiple servers and also to report on the current logging levels. I use the plural as there are many discrete modules within StoreFront and each can have its own log level and log file.

So which module needs logging enabled? The quickest way, which is all the script currently supports, is to enable logging for all modules. The Citrix cmdlet that changes trace levels, namely Set-DSTraceLevel, can seemingly be used more granularly, but I have found insufficient detail to be able to implement that in my script.

The script works with clustered StoreFront servers in that you can specify just one of the servers in the cluster via the -servers option, together with the -cluster option. The script will then (remotely) read the registry on that server to find where StoreFront is installed, so that it can load the required cmdlets to retrieve the list of all servers in the cluster.
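
Conceptually, the discovery step is just a remote registry read followed by loading the StoreFront cmdlets from the discovered location. The sketch below shows the registry part only, and note that the exact key and value name are an assumption on my part rather than being lifted from the script:

# Where is StoreFront installed on the named server? (key/value name assumed)
$storeFrontPath = Invoke-Command -ComputerName storefront01 -ScriptBlock {
    (Get-ItemProperty -Path 'HKLM:\SOFTWARE\Citrix\DeliveryServices' -ErrorAction SilentlyContinue).InstallDir
}
"StoreFront appears to be installed in `"$storeFrontPath`" on storefront01"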

To set the trace level on all servers in a StoreFront cluster run the following:

& '.\StoreFront Log Levels.ps1' -servers storefront01 -cluster -traceLevel Verbose

The available trace levels are:

  • Off
  • Error
  • Warning
  • Info
  • Verbose

To show the trace levels, without changing them, on these servers and check that they are consistent on each server and across them, run the following:

& '.\StoreFront Log Levels.ps1' -servers storefront01 -cluster -grid

Which will give a grid view similar to this:

storefront log settings

It will also report the version of StoreFront installed although the -cluster option must be used and all servers in the cluster specified via -servers if you want to display the version for all servers.

The script is available here and you use it entirely at your own risk although I do use it myself on production StoreFront servers. Note that it doesn’t need to run on a StoreFront server as it will remote commands to them via the Invoke-Command cmdlet. It has so far been tested on StoreFront versions 3.0 and 3.5 and requires a minimum of PowerShell version 3.0.

Once you have the log files, there’s a script introduced here that stitches the many log files together and then displays them in a grid view, or csv, for easy filtering to hopefully quickly find anything relevant to the issue being investigated.

For those of an inquisitive nature, the retrieval side of the script works by calling the Get-DSWebSite cmdlet to get the StoreFront web site configuration which includes the applications and for each of these it finds the settings by examining the XML in each web.config file.
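
As an illustration of the web.config side of that, pulling the levels out of a single file can be done as below, assuming the trace settings follow the standard .NET <system.diagnostics> schema; the path is just an example and both it and the XPath are assumptions rather than code from the script:

# Read the trace switch values from one StoreFront web.config (example path)
[xml]$config = Get-Content -Path 'C:\inetpub\wwwroot\Citrix\StoreWeb\web.config'
$config.SelectNodes( '//system.diagnostics//source' ) |
    Select-Object -Property name , switchValue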

Don’t forget to return logging levels to what they were prior to your troubleshooting although I would recommend leaving them set as “Error” as opposed to “Off”.

Citrix PVS device detail viewer with action pane


I find myself frequently using the script I wrote (see here) to check the status of PVS devices, and then sometimes I need to perform power actions on them, turn maintenance mode on or off, or message users on them before performing power actions (I am a nice person after all). Whilst we can perform most of those actions in the PVS console, if you are dealing with devices across multiple collections, sites or PVS instances then that can involve a lot of jumping around in the console. Plus, if you want to change maintenance mode settings or message logged on users then you need to do this from Citrix Studio, so you’ll need to launch that and go and find the PVS devices in there.

So I decided to put my WPF knowledge to work and built a very simple user interface in Visual Studio, which I then inserted into the PVS device detail viewer script. Once you’ve run the script and got a list of the PVS devices, sorted and/or filtered as you desire, select the devices of interest and then click the “OK” button in the bottom right hand corner of the grid view. Ctrl-A will select all devices, which can be useful if you’ve filtered on something like “Booted off latest” so that only devices which aren’t booting off the latest production vdisk are displayed. This will then fire up a user interface that looks like this, unless you’ve run the script with the -noMenu option or hit “Cancel” in the grid view.

pvs device actioner gui

All the devices you selected in the grid view will be selected automatically for you but you can deselect any before clicking on the button for an action. It will ask you to confirm the action before undertaking it.

pvs device viewer confirm

If you select the “Message Users” option then an additional dialog will be shown asking you for the text, caption and level of the message although you can pass these on the command line via -messageText and -MessageCaption options.

pvs device viewer message box

The “Boot” and “Power Off” options use PVS cmdlets rather than Delivery Controller ones since the devices may not be known to the DDC. “Shutdown” and “Restart” use the “Stop-Computer” and “Restart-Computer” cmdlets respectively and I have deliberately not used the -force parameter with them so if users are logged on, the commands will fail. Look in the window you invoked the script from for errors.
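
For instance, a non-forced restart of the selected devices amounts to no more than the following, where $devices here is a stand-in for the selected device names; it will fail for any device that still has users logged on:

Restart-Computer -ComputerName $devices -ErrorAction Continue    # deliberately no -Force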

You can keep clicking the action buttons until you exit the user interface so, for instance, it can be used to enable maintenance mode, message users asking them to log off, reboot the devices once they have logged off (or you have had enough of waiting for them to do so) and then turn maintenance mode off again if you want to put the devices back into service after the reboot.

I hope you find it as useful as I do but note that you use the script entirely at your own risk. It is available here and requires version 7.7 or higher of PVS, XenApp 7.x and PowerShell 3.0 or later, with the PVS and Studio consoles installed on the machine you will run the script from (so that their PowerShell cmdlets are available too).

 

Regrecent comes to PowerShell

About 20 years ago, after I found out that registry keys have last modified timestamps, I wrote a tool in C++ called regrecent which showed keys that had been modified in a given time window. If you can still find it, the tool still works today although, being 32 bit, it will only show changes in Wow6432Node on 64 bit systems.

Whilst you might like to use Process Monitor (and before that regmon) or similar to look for registry changes, that approach requires you to know in advance that you need to monitor the registry. So what do you do if you need to look back at what changed in the registry yesterday, when you weren’t running these great tools, because your system or an application has started to misbehave since then? Hence the need for a tool that can show you the timestamps, although you can actually do this from regedit by exporting a key as a .txt file, which will include the modification time for each key in the output.
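
The timestamp itself comes from the RegQueryInfoKey API, which the built-in registry provider doesn’t expose, so it has to be fetched with a little P/Invoke; the snippet below is a minimal standalone illustration rather than code taken from the script:

# Get the last modified time of a registry key via RegQueryInfoKey
Add-Type -TypeDefinition @'
using System;
using System.Runtime.InteropServices;
public static class RegTime
{
    [DllImport("advapi32.dll", EntryPoint = "RegQueryInfoKeyW")]
    public static extern int RegQueryInfoKey(
        IntPtr hKey, IntPtr lpClass, IntPtr lpcbClass, IntPtr lpReserved,
        IntPtr lpcSubKeys, IntPtr lpcbMaxSubKeyLen, IntPtr lpcbMaxClassLen,
        IntPtr lpcValues, IntPtr lpcbMaxValueNameLen, IntPtr lpcbMaxValueLen,
        IntPtr lpcbSecurityDescriptor, out long lpftLastWriteTime);
}
'@

$key = [Microsoft.Win32.Registry]::LocalMachine.OpenSubKey( 'SYSTEM\CurrentControlSet\Services' )
[long]$lastWrite = 0
if( [RegTime]::RegQueryInfoKey( $key.Handle.DangerousGetHandle() , [IntPtr]::Zero , [IntPtr]::Zero , [IntPtr]::Zero ,
        [IntPtr]::Zero , [IntPtr]::Zero , [IntPtr]::Zero , [IntPtr]::Zero , [IntPtr]::Zero , [IntPtr]::Zero ,
        [IntPtr]::Zero , [ref]$lastWrite ) -eq 0 )
{
    [datetime]::FromFileTime( $lastWrite )      # the key's last modified time
}
$key.Close()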

The PowerShell script I wrote to replace the venerable regrecent.exe, available here, can be used in a number of different ways:

The simplest form is to show keys changed in the last n seconds/minutes/hours/days/weeks/years by specifying the number followed by the first letter of the unit. For example, the following shows all keys modified in the last two hours:

.\Regrecent.ps1 -key HKLM:\System\CurrentControlSet -last 2h

We can specify the time range with a start date/time and an optional end date/time where the current time is used if no end is specified.

.\Regrecent.ps1 -key HKLM:\Software -start "25/01/17 04:05:00"

If just a date is specified then midnight is assumed and if no date is given then the current date is used.

We can also exclude (and include) a list of keys based on matching a regular expression to filter out (or in) keys that we are not interested in:

.\regrecent.ps1 -key HKLM:\System\CurrentControlSet -last 36h -exclude '\\Enum','\\Linkage'

Notice that we have to escape the backslashes in the key names above because they have special meaning in regular expressions. You don’t need to specify backslashes but then the search strings would match text anywhere in the key name rather than at the start (which may of course be what you actually want).

To specify a negation on the time range, so show keys not changed in the given time window, use the -notin argument.

If you want to capture the output to a csv file for further manipulation, use the following options:

.\regrecent.ps1 -key HKCU:\Software -last 1d -delimiter ',' -escape | Out-File registry.changes.csv -Encoding ASCII

I hope you find this as useful as I do for troubleshooting.

When even Process Monitor isn’t enough

I was recently tasked to investigate why an App-V 5.1 application was giving a licence error at launch on XenApp 7.8 (on Server 2008R2) when the same application installed locally worked fine. I therefore ran up the trusty Process Monitor (procmon) tool to get traces on the working and non-working systems so I could look for differences. As I knew what the licence file was called, I homed in on it quickly in the traces. In the working trace, you could see the application open the licence file, via a CreateFile operation, and then read from it. In the App-V version, however, there was no ReadFile operation against the file, yet no CreateFile operation was failing either, so I couldn’t understand why it wasn’t even attempting to read from a file it apparently had no problem accessing. The same happened when running as an administrator, so it didn’t look like a file permission issue.

Now whilst procmon is a simply awesome tool, such that life without it would be unimaginably difficult, it does unfortunately only tell you about a few of the myriad Microsoft API calls available. In order to understand even more of what a process is doing under the hood, you need to use an API monitor program that has the ability to hook any available API call. To this end I used WinAPIOverride (available here). What I wanted was to find the calls to CreateFile for the licence file and then see what happened after that, again comparing good and bad traces.

WinAPIOverride can launch a process but it needs to be inside the App-V bubble for the app in order for it to be able to function correctly. We therefore run the following PowerShell to get a PowerShell prompt inside the bubble for our application which is called “Medallion”:

$app = Get-AppvClientPackage | ?{ $_.Name -eq 'Medallion' };
Start-AppvVirtualProcess -AppvClientObject $app powershell.exe

We can then launch WinAPIOverride64.exe in this new PowerShell prompt, tell it what executable to run and then run it:

winapioverride-launch

Note that you may not be able to browse to the executable name so you may have to type it in manually.

Once we tell it to run, it will allow us to specify what APIs we want to get details on by clicking on the “Monitoring Files Library” button before we click “Resume”.

api-monitor-hook

You need to know the module (dll) which contains the API that you want to monitor. In this case it is kernel32.dll which we can glean from the MSDN manual page for the CreateFile API call (see here).

api-monitor-kernel32

Whilst you can use the search facility to find the specific APIs that you want to monitor and just tick those, I decided initially to monitor everything in kernel32.dll, knowing that it would generate a lot of data but we can search for what we want if necessary.

So I resumed the process, saw the usual error about the licence file being corrupt, stopped the API monitor trace and set about finding the CreateFile API call for the licence file to see what it revealed. What I actually found was that CreateFile was not being called for the licence file at all; searching the trace for the licence file showed that it was being opened by a legacy API called OpenFile instead. Looking at the details for this API (here), it says the following:

you cannot use the OpenFile function to open a file with a path length that exceeds 128 characters

Guess how long the full path to our licence file is? 130 characters! So it would seem we are doomed with this API call, which we could see failing in API Monitor anyway:

medallion-open-file

I suspect that we don’t see this in procmon as the OpenFile call fails before it gets converted to a CreateFile call and thence hits the procmon filter driver.

The workaround, since we found that the installation wouldn’t work in any folder other than C:\Medallion (so we couldn’t install it to, say, C:\M), was to shorten the package installation root by running the following as an admin:

Set-AppvClientConfiguration -PackageInstallationRoot '%SystemDrive%\A'

This changes the folder where App-V packages are cached from “C:\ProgramData\App-V” to “C:\A”, which saves us 16 characters. The C:\A folder needed to be created and given the same permissions and owner (SYSTEM) as the original folder. I then unloaded and reloaded the App-V package so that it got cached to the \A folder, whereupon it all worked properly.