We just moved to a new server: 2 x dedicated, 48 GB RAM, php-fpm, nginx, memcached, APC. Each php-fpm process that spawns keeps growing. After a fresh restart of php-fpm, each process takes 30-100 MB. After a few hours they are over 250 MB, and after 8 hours each php-fpm process is at 1.1 GB or more, bringing the server to its knees. I had to restart php-fpm every hour. As a stopgap, we reduced pm.max_requests from 10,000 to 1,000. That seems to have stopped each process from growing, but we have other issues.
1. Any time we save a product in admin, we get a 500 server error. The product saves, but it's quite annoying.
2. Our Magento import script from Stone Edge won't import orders and gives a 503 Bad Gateway error, so we can't import orders. nginx logs this error for the import script:
2013/01/31 07:45:30 [error] 15417#0: *435945 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 173.14.230.102, server: www.campsaver.com, request: "POST /magento-import.php HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.campsaver.com"
3. This error is all over the nginx error logs too, every few minutes:
2013/01/31 23:53:06 [error] 15430#0: *1176895 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 209.85.238.209, server: www.campsaver.com, request: "GET /mens-clothing/men-s-shirts?brand=254 HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.campsaver.com"
4. These errors are all over my php-fpm error logs:
Jan 31 23:56:40.551917 [WARNING] [pool www] child 32011 exited on signal 7 SIGBUS after 8332.830655 seconds from start
Jan 31 23:56:40.552514 [NOTICE] [pool www] child 935 started
Jan 31 23:56:51.018778 [WARNING] [pool www] child 675 exited on signal 7 SIGBUS after 1080.377420 seconds from start
Jan 31 23:56:51.019400 [NOTICE] [pool www] child 936 started
Jan 31 23:57:07.588714 [WARNING] [pool www] child 601 exited on signal 7 SIGBUS after 1456.255594 seconds from start
Jan 31 23:57:07.589324 [NOTICE] [pool www] child 940 started
Jan 31 23:57:51.147662 [WARNING] [pool www] child 32037 exited on signal 7 SIGBUS after 8302.292151 seconds from start
Jan 31 23:57:51.148279 [NOTICE] [pool www] child 942 started
Jan 31 23:58:33.067957 [WARNING] [pool www] child 843 exited on signal 7 SIGBUS after 430.257647 seconds from start
Jan 31 23:58:33.068582 [NOTICE] [pool www] child 944 started
Any ideas what is wrong with my server setup here?
1. Any time we save a product in admin, we get a 500 server error. The product saves, but it's quite annoying.
Read the server error log and the Magento logs. Just tail -f them and see what is going on during that call.
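For a single request such as the product save, it can be easier to capture only what a log gained during that one call than to watch a live tail -f. A minimal sketch, assuming a POSIX shell; the helper names log_mark and log_since are hypothetical, and so is the log path in the usage note:

```shell
# Hypothetical helpers: capture what a log file gains during one request.

# Print the current size of a log file in bytes.
log_mark() {
    wc -c < "$1" | tr -d ' '
}

# Print everything appended to the file since the recorded size.
# $1 = log file, $2 = size returned earlier by log_mark
log_since() {
    tail -c +"$(( $2 + 1 ))" "$1"
}

# Usage (the path is an assumption, adjust it to your install):
#   mark=$(log_mark /var/log/nginx/error.log)
#   ... save the product in admin ...
#   log_since /var/log/nginx/error.log "$mark"
```

This only works if the log is not rotated between the two calls, which should hold for a test that takes a few seconds.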
2. Our Magento import script from Stone Edge won't import orders and gives a 503 Bad Gateway error, so we can't import orders. nginx logs this error for the import script:
2013/01/31 07:45:30 [error] 15417#0: *435945 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 173.14.230.102, server: www.campsaver.com, request: "POST /magento-import.php HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.campsaver.com"
The PHP process running the script died, so nginx got a connection reset instead of a response.
3. This error is all over the nginx error logs too, every few minutes:
2013/01/31 23:53:06 [error] 15430#0: *1176895 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 209.85.238.209, server: www.campsaver.com, request: "GET /mens-clothing/men-s-shirts?brand=254 HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "www.campsaver.com"
4. These errors are all over my php-fpm error logs:
Jan 31 23:56:40.551917 [WARNING] [pool www] child 32011 exited on signal 7 SIGBUS after 8332.830655 seconds from start
Jan 31 23:56:40.552514 [NOTICE] [pool www] child 935 started
Jan 31 23:56:51.018778 [WARNING] [pool www] child 675 exited on signal 7 SIGBUS after 1080.377420 seconds from start
Jan 31 23:56:51.019400 [NOTICE] [pool www] child 936 started
Jan 31 23:57:07.588714 [WARNING] [pool www] child 601 exited on signal 7 SIGBUS after 1456.255594 seconds from start
Jan 31 23:57:07.589324 [NOTICE] [pool www] child 940 started
Jan 31 23:57:51.147662 [WARNING] [pool www] child 32037 exited on signal 7 SIGBUS after 8302.292151 seconds from start
Jan 31 23:57:51.148279 [NOTICE] [pool www] child 942 started
Jan 31 23:58:33.067957 [WARNING] [pool www] child 843 exited on signal 7 SIGBUS after 430.257647 seconds from start
Jan 31 23:58:33.068582 [NOTICE] [pool www] child 944 started
This is a bug, and it is also specific to your configuration.
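To see how often children are being killed, and by which signal, the php-fpm log lines quoted above can be tallied. A rough sketch, assuming the log format shown in the question ("exited on signal 7 SIGBUS"); count_signal_exits is a hypothetical helper name, and the log path in the usage comment is an assumption:

```shell
# Hypothetical helper: count php-fpm child exits per signal name.
# $1 = path to the php-fpm error log
count_signal_exits() {
    awk '/exited on signal/ {
        # The signal name is two fields after the word "signal":
        # "... exited on signal 7 SIGBUS after ..."
        for (i = 1; i <= NF; i++)
            if ($i == "signal") { counts[$(i + 2)]++; break }
    }
    END { for (s in counts) print s, counts[s] }' "$1" | sort
}

# Usage (path is an assumption):
#   count_signal_exits /var/log/php-fpm/error.log
```

Running it before and after each change (the upgrade, disabling APC) gives a simple way to tell whether the SIGBUS exits have actually stopped.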
The easiest solution to start with: upgrade PHP, overwriting the previous configs, and disable APC.
Then read the logs and check that everything is working. As I said, there is no point in debugging anything if you are on the first php-fpm release in the 5.3 branch.
Upgrade to 5.3.21.
As long as you have RPMs and up-to-date repositories, you can upgrade PHP,
then configure and restart. There is no downtime, and you should do this at off-peak times.
You can even check the dependencies before you run the upgrade.
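Before and after the upgrade it helps to confirm which PHP version is actually running. A small sketch, assuming GNU sort (for its -V version ordering) is available; version_ge is a hypothetical helper name:

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: compare the running PHP against the target release.
#   version_ge "$(php -r 'echo PHP_VERSION;')" 5.3.21 && echo "PHP is new enough"
```

A plain string comparison would get this wrong, since "5.3.3" sorts after "5.3.21" alphabetically, which is why version ordering is used here.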
APC needs to stay disabled for a few hours; then you can enable it again and read the logs again.
There is eAccelerator as well, which sometimes works even better.
Disable full page cache and APC, and run:
ab -c 1 -n 1 http://www.campsaver.com/
Then restart with APC enabled and run it again.
Compare the speed, read the logs, and enable full page cache if everything looks good.
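To compare the runs without reading the whole ab report each time, the mean request time can be pulled out of its output. A minimal sketch, assuming ab's usual "Time per request: ... [ms] (mean)" report line; mean_ms is a hypothetical helper name:

```shell
# Hypothetical helper: extract the mean "Time per request" (in ms)
# from an ab report read on standard input.
mean_ms() {
    # Match only the "(mean)" line, not the
    # "(mean, across all concurrent requests)" line.
    awk '/^Time per request:/ && /\(mean\)/ { print $4 }'
}

# Usage (same command as above):
#   ab -c 1 -n 1 http://www.campsaver.com/ | mean_ms
```

With -n 1 this is a single sample, so repeating the run a few times for each configuration may give a steadier comparison.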
We do find this odd: you have highly configured servers with external order management, and we would have thought you were hosting somewhere this would not happen. On that note, MySQL and memcached are best kept on their own servers, because Magento is CPU-intensive in bursts. Configuring all the parts is more art than science; this combination works perfectly for us. Our main suggestion is to start from the beginning with the most minimal configuration and add to it incrementally until things start failing.