Martins blog
Friday, 23 January 2009
Boost performance with parallel processing
Comments
For parallel processing of a list of documents (in my case documents in the file system), I've always preferred to write just a simple PHP script that deals with only one document at a time, and then write a bash script that iterates through all documents in the filesystem and starts the PHP script as a background process.
I usually maintain document queues that way, one queue for each available processor core. Within each queue the documents are processed sequentially, but since each one runs on a separate core, the processing load gets evenly spread. The upside of using a bash script as a wrapper is that those are much less prone to memory leaks and/or crashes when they're running for a long time than if I was using a PHP script for the controller part.
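A minimal sketch of the queue-per-core wrapper described above, written as a POSIX shell function so that it also runs under bash. The function name run_queues and the round-robin assignment of documents to queues are assumptions for illustration, not the commenter's actual script. The worker command is whatever you pass in after the first two arguments.

```shell
# run_queues DOC_DIR CORES WORKER_CMD...
# Starts CORES background queues. Each queue walks the files in DOC_DIR
# and hands every CORES-th file to the worker command, one at a time.
run_queues() {
    doc_dir=$1
    cores=$2
    shift 2                      # what remains is the worker command
    q=0
    while [ "$q" -lt "$cores" ]; do
        (
            # This subshell is one queue: sequential inside,
            # parallel with respect to the other queues.
            i=0
            for doc in "$doc_dir"/*; do
                [ -f "$doc" ] || continue
                if [ $((i % cores)) -eq "$q" ]; then
                    "$@" "$doc"
                fi
                i=$((i + 1))
            done
        ) &
        q=$((q + 1))
    done
    wait                         # block until every queue has finished
}
```

Something like `run_queues /path/to/docs 4 php process_document.php` would then start four queues, where process_document.php stands in for a hypothetical script that handles a single document.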
"...We used it once to work around a memory issue in the SimpleXML extension..."
Did you by chance use SimpleXML in a foreach loop? If you did, then that is likely where your memory leak came from. For whatever reason, SimpleXML objects leak chronically in foreach loops. You can "fix" it by using a for loop and accessing the elements by index, which seems to be OK. I came across this during some long-running CLI processes that had to sort through tens of thousands of XML files that all had to be parsed and handled. The script would use all available memory and swap and then die. Replacing the foreach loops with for loops stabilised it almost completely (barring a few other minor leaks that came later from some other updates).
Why not use sockets and get around the problems of pcntl_fork?
Whenever I've come across this sort of requirement, I've mostly just relied on having two or more instances of the script running; just like Markus does.
This makes the code much, much simpler (junior developers can understand it, for instance), and you can control the number of processes you want to run, anywhere from one to a hundred million trillion. Most of the time, the requirements are pretty simple, and you can use very simple tricks to distribute work. You can do things like:

    php foo.php 0 5
    php foo.php 1 5
    php foo.php 2 5
    php foo.php 3 5
    php foo.php 4 5

and foo.php is just:

    $my_process = (int) $argv[1];       // process number, 0 to total - 1
    $total_processes = (int) $argv[2];  // number of processes started
    $ids = $db->getCol("SELECT id FROM queue");
    foreach ($ids as $id) {
        // Modulus rocks!
        if ($id % $total_processes != $my_process) {
            continue;
        }
        // Do work!
    }

Simple but effective; easier to write unit tests for.
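If you don't want to type the commands by hand, a small launcher can start them all in the background and wait for them. This is only a sketch: the function name launch_workers is made up, and it assumes the process numbers run from 0 to total - 1 so that they line up with the remainders of the modulus.

```shell
# launch_workers TOTAL WORKER_CMD...
# Starts TOTAL copies of the worker command in the background, passing
# each one its own process number and the total as the last two arguments.
launch_workers() {
    total=$1
    shift                        # what remains is the worker command
    n=0
    while [ "$n" -lt "$total" ]; do
        "$@" "$n" "$total" &
        n=$((n + 1))
    done
    wait                         # block until every worker has exited
}
```

With the foo.php from the comment above, `launch_workers 5 php foo.php` would start five processes numbered 0 to 4.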
That is indeed a better solution, and it also works cross-platform.
Forking in PHP only works on POSIX systems. Another problem with POSIX forking in PHP is that it is sometimes disabled. In my web hosting environment, for example, it is disabled for security reasons.
Launching the same PHP script several times in parallel will eat up your server's memory and CPU.
However, the same script with forking capabilities, properly spawning 5 children, will work just fine. From the system's point of view, there is a clear difference between handling 5 parents and handling 1 parent with 5 children. Forking allowed me to perform many more operations at the same time, while keeping my server usage relatively low.
@dave, yes it was with a loop. The bug is described here: http://bugs.php.net/bug.php?id=38604.
@markus, @daniel Great ideas, thanks for sharing!
I have given it a try: read 2500 documents and parse them into a database. The result: forking takes 4 times longer than a regular script. I'm wondering when a positive effect comes in?
It's hard to tell why you didn't get the expected results. Maybe you're running out of resources quickly, which will have a negative impact on performance. Maybe you can paste some code or give more info on the subject.
I'll dive into it and let you know. Maybe it will make a nice article for phpgg.nl at the same time.
    if (count($children) >= 10) {
        // get the oldest worker child
        $pid = array_shift($children);
        // this will wait for the worker child to finish;
        // if the child is already finished the function will return immediately
        pcntl_waitpid($pid, $status);
    }

When using this code, you wait for the oldest child to end. But you have no guarantee that the oldest process will be the first to finish. So I think this code should be replaced by:

    if (count($children) >= 10) {
        // -1 waits for any worker child to finish;
        // if one has already finished the function will return immediately
        $pid = pcntl_waitpid(-1, $status);
        // drop the finished child from the list so the count stays correct
        $children = array_diff($children, array($pid));
    }

This way we wait for the end of any child.
Hey, but pcntl only works on Linux. What about the Windows operating system?
Sorry for reviving this after a year, but, in case you haven't found a solution for parallel processing for both Windows and *nix, I suggest you take a look at popen:
http://www.php.net/popen