-
Notifications
You must be signed in to change notification settings - Fork 2
Description
Encountered in a failed TeamCity build
Failed Tests.ShellCommandFixtureStdInput.ShouldBeCancellable(behaviour: Async) [8 ms]
Error Message:
System.InvalidOperationException : Process must exit before requested information can be determined.
Stack Trace:
at System.Diagnostics.Process.EnsureState(State state)
at System.Diagnostics.Process.get_ExitCode()
at Octopus.Shellfish.ShellCommandExecutorHelpers.SafelyGetExitCode(Process process) in /opt/buildagent/work/c4d67eb417aa367f/source/Shellfish/ShellCommandExecutorHelpers.cs:line 21
at Octopus.Shellfish.ShellCommand.ExecuteAsync(CancellationToken cancellationToken) in /opt/buildagent/work/c4d67eb417aa367f/source/Shellfish/ShellCommand.cs:line 243
at Tests.ShellCommandFixtureStdInput.ShouldBeCancellable(SyncBehaviour behaviour) in /opt/buildagent/work/c4d67eb417aa367f/source/Tests/ShellCommandFixture.StdInput.cs:line 228
Analysis
The relevant code is (edited for brevity) doing this:
try {
await exitedTask.ConfigureAwait(true);
}
catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) {
TryKillProcessAndChildrenRecursively(process);
}
// Exception thrown from SafelyGetExitCode
return new ShellCommandResult(SafelyGetExitCode(process));What is most likely happening.
- On linux,
TryKillProcessAndChildrenRecursivelyis implemented by sending a SIGTERM to the process. - SIGTERM is a message asking the process to exit itself. It may not exit immediately, or even if it is trying to, it may not be fast enough before we try and get the exit code
Discussion
The old ShellExecutor API does not have this bug.
The old code goes like this:
- In background, register cancellation
- Wait for the process to exit (this cannot be cancelled)
- When the cancellation token fires, the background registration kills the process
- The wait does not return until after the process has actually exited.
Whereas the new code goes like this:
- Wait for the process to exit (with cancellation)
- Cancellation token fires, cancelling the wait
- Proceed to trying to kill the process, then SafelyGetExitCode; we have not waited for the "kill" to actually kill.
- (Note added Feb 2025; this may not be the case anymore, when attempting to fix this please figure out what it's doing right now)
Internally in OctopusDeploy we use a forked copy of the old code, called SilentProcessRunner. It also looks like it will have this bug, as the cancellation will cancel the wait operation before the process has actually exited.
Possible Solutions
-
We could simply handle the "not yet exited" code path and return -1 from SafelyGetExitCode
-
We could insert a second wait-for-exit on the cancellation path to ensure the process has actually exited before we proceed to
SafelyGetExitCode -
We could change the linux codepath to use
SIGKILLrather thanSIGTERMso it should be dead immediately. On the windows codepath we callProcess.Kill(which is the Win32TerminateProcessAPI) which is equivalent to SIGKILL -
Or some combination of the above
My ideal solution:
-
Change windows to send a "soft shutdown" message to the process to match the linux
SIGTERM.
We could post a WM_QUIT or GenerateConsoleCtrlEvent or as this person on StackOverflow found, writing the Ctrl+C escape character to the child process stdinput stream -
On both windows/linux, after sending a soft shutdown message, install a short additional WaitForExit with a timer. 5 seconds seems to be a common value and seems fair.
-
If the process does not exit after this time, hard kill it with TerminateProcess / SIGKILL
-
Proceed to SafelyGetExitCode, and install the fallback code so that it returns -1 rather than throwing exceptions if the process is still running for some reason.
That's quite a lot of work though, in practice I'm thinking probably just handling the "not yet exited" code path and return -1 from SafelyGetExitCode