I am a programmer and architect (the kind that writes code) with a focus on testing and open source; I maintain the PHPUnit_Selenium project. I believe programming is one of the hardest and most beautiful jobs in the world.

External processes and PHP

02.13.2013

I've come to know a bit about spawning and monitoring new processes from PHP code while working at Onebip and contributing to Paratest. Here's what you need to know if you think exec(), or executing everything in a single .php script, is always enough.

Disclaimer

You should keep strict control over how many processes you create in response to HTTP requests, and over their lifetime: Unix processes carry a larger overhead than the pre-forked Apache workers that commonly handle requests in a fixed-size pool.

Moreover, there are technologies that create, monitor and terminate these processes for you, such as Gearman. What I describe here are the facilities inside PHP that let you request new processes from the operating system (Linux, most likely).

Black boxes

exec() is one of the most popular tools for executing external commands in their own processes; it has effectively emulated your console for decades. Within the first string argument of exec() you can:

  • set environment variables
  • redirect output to files with > and 2>, or to other commands with |
  • pass arguments

exec() will wait for the process to finish and populate its second argument with the output lines, while its third argument receives the exit code (the return value is the last line of output). If you want to execute the process in the background, append a '&' to the command line.

Here's a small example:

exec("VARIABLE=value /bin/command --option=value > log.txt 2> errors.txt", $output, $exitCode);

Make sure to use escapeshellarg() if you are building these kinds of command lines dynamically.
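For instance, here is a sketch of how escapeshellarg() neutralizes a malicious value (the input string is a made-up example):

```php
// An attacker-controlled value that tries to chain a second command
$userInput = "file.txt; rm -rf /";

// escapeshellarg() wraps the value in single quotes (and escapes any
// embedded quotes), so the shell sees it as one harmless argument
$safe = escapeshellarg($userInput);

// ls merely fails to find the oddly-named file; no second command runs
exec("ls -l $safe 2> /dev/null", $output, $exitCode);
```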

Control of streams

When communication between processes does not stop after spawning, but is needed throughout the processes' lives, exec() and its cousins do not cut it.

proc_open() is the tool you need in these cases. This function takes a command and a configuration for the three streams (stdin, stdout and stderr) used in Unix as the standard interface for communication. It populates an array of three pipes that can be treated as file descriptors: the parent process can read from or write to them just as it does with files, through the PHP streams extension. This standard interface lets you access files, sockets and other processes via the same polymorphic functions: the syntax is procedural and primitive, but the potential is OO-like polymorphism.

$descriptors = array(
    0 => array('pipe', 'r'), // child's stdin
    1 => array('pipe', 'w'), // child's stdout
    2 => array('pipe', 'w'), // child's stderr
);
$child = proc_open('/usr/bin/grep proc', $descriptors, $pipes);
fwrite($pipes[0], "this is a test\n");
fwrite($pipes[0], "of proc_*() functions\n");
fclose($pipes[0]);
echo stream_get_contents($pipes[1]); // only the line containing 'proc' comes back
proc_close($child);

Complex or large inputs call for the use of standard input and output over command-line arguments (no escaping or quoting is required). Moreover, when bidirectional communication is necessary, exec() would just return the output once the process ends, while proc_open() allows ping-pong exchanges of messages.
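As a sketch of such a ping-pong exchange, here `cat` stands in for a real worker, since it echoes each input line straight back:

```php
$descriptors = array(
    0 => array('pipe', 'r'),  // we write to the child's stdin
    1 => array('pipe', 'w'),  // we read from the child's stdout
);
$child = proc_open('cat', $descriptors, $pipes);

fwrite($pipes[0], "ping\n");
$first = fgets($pipes[1]);   // blocks until cat echoes "ping\n"

fwrite($pipes[0], "pong\n"); // a second round trip on the same process
$second = fgets($pipes[1]);

fclose($pipes[0]);
fclose($pipes[1]);
proc_close($child);
```

Each fgets() waits for the child's answer to the previous write, which is exactly the request-response pattern exec() cannot provide.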

Being nice

You can even create and kill a population of children with proc_open(); they will inherit the priority (nice value) of the parent PHP process. However, monitoring the status of the children to know when they can take more input is not trivial.

The classic spinning loop:

while (1) {
   foreach ($this->processes as $p) {
     stream_set_blocking($p, false);
     if ($line = fgets($p)) {
       // deal with it
     }
   }
}

will share the CPU equally between the children and the parent, stealing resources from the former to give to the latter just to run a loop forever. Every time the foreach is finished, the parent begins a new series of checks, and with just 2 children the parent can easily reach 33% CPU (66% on 2 cores, and so on).

Sleeping is a bit better:

while (1) {
   foreach ($this->processes as $p) {
     stream_set_blocking($p, false);
     if ($line = fgets($p)) {
       // deal with it
     }
   }
   usleep(100000); // microseconds
}

because the parent process actually returns control to the OS for the interval of time in which it sleeps, and does not use the CPU. However, it does not scale: the right amount of sleeping time is not deterministic and cannot easily be decided by the parent.

We shouldn't reinvent OS mechanisms:

while (1) {
   foreach ($this->processes as $p) {
     stream_set_blocking($p, true); // already the default
     if ($line = fgets($p)) {
       // deal with it
     }
   }
}

fgets() and other descriptor-based functions will block until there is data available on the stream.

However, if you have a population of children, this solution has you wait on the first, then the second, then the third, and so on: if the first process blocks, you do not get to the next ones even if they have already finished.

To deal with multiple streams, the correct solution is stream_select(), which will block until one of a collection of streams is ready.

$read = array($stream1, $stream2);
$write = array();
$except = array();
// a null timeout blocks indefinitely until at least one stream is ready
if (($num_changed_streams = stream_select($read, $write, $except, null)) > 0) {
   // $read has been modified to contain only the streams a read won't block on
}

For reading streams, this means there is data available on them so that fread() calls will not block.

Data here means even a single byte; but if you write to streams in discrete chunks such as lines, they will come out as lines at the other end, so using fgets() in conjunction with stream_select() to read the output of line-printing processes will effectively never block.
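Putting the pieces together, here is a sketch of a parent draining lines from several children as each becomes ready (the echo commands are stand-ins for real workers):

```php
$descriptors = array(1 => array('pipe', 'w')); // we only need each child's stdout
$children = array();
$streams = array();
foreach (array('echo one', 'echo two') as $command) {
    $children[] = proc_open($command, $descriptors, $pipes);
    $streams[] = $pipes[1];
}

$lines = array();
while ($streams) {
    $read = $streams;              // stream_select() overwrites $read
    $write = $except = array();
    // a null timeout blocks until at least one child has output (or has exited)
    if (stream_select($read, $write, $except, null) > 0) {
        foreach ($read as $stream) {
            $line = fgets($stream);
            if ($line === false) { // EOF: this child is done
                $key = array_search($stream, $streams, true);
                unset($streams[$key]);
                fclose($stream);
            } else {
                $lines[] = $line;  // deal with the line
            }
        }
    }
}
foreach ($children as $child) {
    proc_close($child);
}
```

The parent consumes output in whatever order the children produce it, without spinning and without being stuck behind a slow first child.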

Write streams will be considered ready when writing to them will not block, for example because the process at the other end is waiting for input or the pipe buffer has room.

Remember, blocking is not a problem: it is the means of synchronization for Unix processes. Processes that do not block and attempt to poll their inputs by themselves usually waste resources doing what the OS already implements for free; non-blocking architectures such as Node.js are not based on multiple processes that need to wait on each other.

Published at DZone with permission of Giorgio Sironi, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)