Servers crashing, out of memory, kills process

Post by **gregom** » Thu May 10, 2012 5:23 pm

I already posted this in the Previous Versions forum, but that place is a ghost town, and I don't think this problem is specific to the version of ZM, so I'm going to post it here as well.

I have five servers running Ubuntu 8.04 LTS Server x64 and ZM 1.23.3. Each server records 16 cameras 24/7. A few of them have been running over one year (with weekly reboots) with very few problems, and the others probably almost a full year now. I don't know exactly which ones as I didn't keep track of when I built them.

Anyway, two of my servers are crashing fairly often, 3-5 times a week, due to out-of-memory errors. The process that gets killed is typically mysqld (or something), but sometimes it's a ZM process. When the process gets killed, the server stops recording video and stops hosting the website for active monitoring. The weird thing is that these two servers have both been running for over one year without frequent problems with a scheduled weekly reboot, and it is only recently that they started misbehaving.

When they crash, I just reboot the server. Usually they come back up and start recording fine, but sometimes when I try to go to the ZM webpage, I get "An error has occurred and this operation cannot continue. For full details check your web logs for the code 'B83AB6'". The code is different every time I try, meaning if I refresh the code will be different. To fix this, I run a repair on the frames and events tables using the phpmyadmin web tool and all seems well.

So how do I fix this crashing issue? I've tried a combination of throwing more physical memory at the servers and adjusting SHMALL and SHMMAX. I've been running the same versions of MySQL, ZM, and Apache since I built the systems, and they never get software updates. So I don't think anything has changed that could be causing this.

I've started rebooting the servers daily now and that has prevented them from running out of memory (for now...), but that certainly is not a fix. Is it possible that the database just becomes too big of a mess and consumes more and more memory over time, so that eventually I start having problems like this? Do I need to simply reformat and rebuild these servers once a year? I'm not sure what else to try.

I'm not a Linux guru by any means; I barely scrape by with my general tech knowledge... These systems were experimentally built by my boss, who knows more about Linux than I do but is certainly not an expert either. We're both scratching our heads on this one and don't know what to do. Neither of us is really sure what to set SHMALL and SHMMAX to; several sources claim a specific way to do it, while other sources contradict them. So it seems to be a matter of opinion how to set them up, and the optimal settings may vary from system to system.

So does anyone have any ideas here?

Post by **badger_fruit** » Thu May 10, 2012 8:40 pm

hi gregom
I wouldn't go as far as rebuilding the machines once a year. Perhaps create a simple cron job that restarts the MySQL and ZM services daily, say during the day, or whenever the cameras are needed the least?

Perhaps the machines may have developed a hardware memory problem; try booting with the Ultimate Boot Disk (ask google!) and run the memtests included there to check for any hardware errors.

Do you have any monitoring software such as Nagios installed? This can be set to alert if the memory free drops below a user-defined level so you get an alert before the server crashes and can take necessary action.

Regarding updates, yeah, MySQL, ZM and Apache are probably the most stable bits of software I have ever used ... however, the OS itself has had many updates (perhaps I would hold off changing to a newer version just yet, I hear there are many issues with the newest Ubuntu - just see this forum for the evidence) ... besides, I have an old P4 machine with 512 MB RAM which has OpenSuSE 10.3 on (it's now on 12.1) and I've stuck with it because of the old saying "if it ain't broke" ...

Perhaps install some monitoring tool like Nagios and see if you can analyse any trends when the server falls over; good luck!

Post by **Flasheart** » Fri May 11, 2012 6:09 am

Hi,

This is a frustrating problem and one that's difficult to trace.

You can pretty much ignore which process the OOM killer culls; the process it kills is chosen according to various heuristics and isn't always the biggest. Read more at http://lwn.net/Articles/317814/

Nagios is good for alerting you to a problem, but not for profiling what the problem is.

Munin is very easy to install and gives a visual overview of overall system problems, but again, it doesn't help nail down exactly what is wrong.

The way I solved this problem on one server was to log the output of "top" every minute to a file, and after a crash, see what program went silly with the ram. I eventually nailed it down to apache2, or specifically, a tiny script apache2 was running. On another server, I spent eight months trying one thing and another and eventually, by doubling the ram on the VM, got it stable - never did find the exact cause, but obviously it stopped being an issue with some more elbow room.

Good luck!
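A minimal sketch of the "log top every minute" idea described above. It assumes Python 3.7+ and the procps `top`, where `-b` is batch mode and `-n 1` takes a single snapshot; the function names and the log path in the usage note are hypothetical, chosen for illustration.

```python
import subprocess
import time

# Assumption: procps `top`; -b = batch (non-interactive) mode, -n 1 = one iteration.
TOP_COMMAND = ["top", "-b", "-n", "1"]


def format_snapshot(raw_output, stamp, max_lines=25):
    """Keep the summary header plus the first few process rows, under a timestamp."""
    lines = raw_output.splitlines()[:max_lines]
    return f"=== {stamp} ===\n" + "\n".join(lines) + "\n"


def append_snapshot(log_path, command=TOP_COMMAND):
    """Run the command once and append the trimmed output to log_path."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write(format_snapshot(result.stdout, stamp))


def log_forever(log_path, interval_seconds=60):
    """Take one snapshot per interval until the script is stopped."""
    while True:
        append_snapshot(log_path)
        time.sleep(interval_seconds)
```

Calling `log_forever("/var/tmp/top-snapshots.log")` would append one trimmed snapshot per minute. By default `top` sorts by CPU, so a memory hog may not appear in the first 25 lines; raising `max_lines` keeps more of the process list at the cost of a bigger file.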

Post by **badger_fruit** » Fri May 11, 2012 6:17 am

Flasheart, I noticed that the OP mentioned he'd already increased the amount of memory. While it is a hard problem to identify, Nagios may provide some insight; of course, it might not, and I'm not suggesting it is the only monitoring tool out there. The reason I suggested it is that if we were having a random issue like this, I'd want some eyes on it 24/7 to see whether there was a memory leak or something similar, which Nagios could monitor and alert me to before the entire machine died. But as you say, it will only show that something is going wrong, not what.

It was only a suggestion though ... kudos for the "top > file.log" idea, but OUCH for how long it took to identify!

Post by **gregom** » Mon May 14, 2012 10:13 pm

Thanks for the responses guys...

I think I may have SHMALL/SHMMAX configured incorrectly. Does anyone have specific recommendations for these settings that work well with ZM on a setup similar to mine? My servers each have two quad-core processors and 4 GB RAM, one 80 GB hard drive for the OS and swap (volume sizes automatically configured during setup), and one 80 GB drive for the MySQL database and the ZM website. All my JPGs go to a hardware RAID-5 array. This is the same setup on all 5 of my servers.

If I still have problems and I know my SHMALL/SHMMAX is set up correctly, then I'll try logging top to a file and go back to weekly reboots to see if I can find the problem. I'd like to avoid trying the memtest for a bit, as these are production servers and recording video is critical.

Although maybe I want to log top anyway... What do you think would be the best way to create a log file for the "top" output? As I said, I'm not a Linux guru... I found http://askubuntu.com/questions/22021/ho ... g-cpu-load but I'm worried it is going to make one HUGE file. Or will it make a file for each 24 hour period and just append .1, .2, etc. like I see in /var/log/ ?
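On the file-size question: plain redirection keeps appending to a single file, and the `.1`, `.2` files in /var/log come from a separate rotation step (logrotate on Ubuntu). One possible way to get the same behaviour from a script is Python's standard `logging.handlers.RotatingFileHandler`, which rolls over by size. The sketch below is illustrative only; the function names, logger name, and size limits are assumptions, not anything from the linked answer.

```python
import logging
import logging.handlers
import os
import tempfile


def build_snapshot_logger(log_path, max_bytes=5_000_000, backups=7):
    """Return a logger that rolls log_path over to log_path.1, .2, ... by size."""
    logger = logging.getLogger("top_snapshots:" + log_path)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep snapshots out of the root logger
    if not logger.handlers:
        handler = logging.handlers.RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backups
        )
        handler.setFormatter(logging.Formatter("%(asctime)s\n%(message)s"))
        logger.addHandler(handler)
    return logger


def demo_rotation():
    """Write oversized records into a temp dir and list the files that result."""
    with tempfile.TemporaryDirectory() as tmp:
        logger = build_snapshot_logger(
            os.path.join(tmp, "top.log"), max_bytes=200, backups=2
        )
        for _ in range(10):
            logger.info("x" * 150)  # stand-in for one top snapshot
        for handler in logger.handlers:
            handler.close()
        return sorted(os.listdir(tmp))
```

With `backups=7` and `max_bytes=5_000_000`, disk use stays bounded at roughly eight files of 5 MB each; the oldest file is discarded on each rollover, so the limits need to be large enough to still cover the minutes before a crash.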

I think I'll try setting up Munin, and maybe Nagios, to see if I can get other angles on this problem.

It still baffles me that these servers have been running great for over a year, and then all of a sudden they start acting up. Nothing has changed that I'm aware of...

Post by **gregom** » Mon May 21, 2012 8:29 pm

I spent over an hour on the net trying to find info on how to configure my SHMALL and SHMMAX and didn't have any luck. I really have a poor understanding of this, so I'm afraid to start changing things without having a reasonable idea of what it will do. Are there any guides out there that someone here has had good success with? Or maybe some of you Linux guys just know what works. I'm not even sure that is the problem; I am also wondering if my issues are with swap space.

I let the OS auto-configure the swap space during the install, but after going back and looking, there is hardly any... I ran top on all of my systems and they all had nearly all physical memory used and only about 100k of swap being used. There are buffers and cached memory, but I'm not exactly sure what this means. Maybe it will be better if I just show you...

DVR 1

top - 13:10:17 up 10:08,  1 user,  load average: 2.18, 2.20, 1.86
Tasks: 224 total,   2 running, 222 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.6%us,  0.5%sy,  0.0%ni, 86.7%id,  1.0%wa,  0.1%hi,  0.1%si,  0.0%st
Mem:   8258640k total,  8219676k used,    38964k free,   345120k buffers
Swap:  3229024k total,      100k used,  3228924k free,  5861244k cached

DVR 2

top - 13:24:19 up 10:22,  1 user,  load average: 0.86, 1.02, 0.93
Tasks: 241 total,   3 running, 238 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.3%us,  0.4%sy,  0.0%ni, 89.7%id,  1.5%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   4121796k total,  4096208k used,    25588k free,   248748k buffers
Swap:  3229024k total,      108k used,  3228916k free,  2228520k cached

DVR 3

top - 13:24:39 up 10:02,  1 user,  load average: 1.35, 0.86, 0.72
Tasks: 220 total,   1 running, 219 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.7%us,  0.4%sy,  0.0%ni, 91.0%id,  0.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   4121796k total,  4091844k used,    29952k free,   236644k buffers
Swap:  6385796k total,       96k used,  6385700k free,  2408516k cached

DVR 4

top - 13:42:58 up 10:11,  1 user,  load average: 0.60, 0.92, 1.02
Tasks: 234 total,   2 running, 232 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.5%us,  0.4%sy,  0.0%ni, 88.7%id,  1.3%wa,  0.1%hi,  0.1%si,  0.0%st
Mem:   4121796k total,  4100284k used,    21512k free,   271616k buffers
Swap:  6385796k total,       92k used,  6385704k free,  2319336k cached

DVR 5

top - 13:43:10 up 10:01,  1 user,  load average: 1.87, 1.63, 1.47
Tasks: 231 total,   1 running, 230 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.9%us,  0.4%sy,  0.0%ni, 84.6%id,  0.9%wa,  0.1%hi,  0.1%si,  0.0%st
Mem:   4121796k total,  4085828k used,    35968k free,   214708k buffers
Swap:  3229024k total,       92k used,  3228932k free,  2421292k cached

It seems servers 3 and 4 have 160 GB OS and SQL database drives instead of 80 GB as I thought, so they got more swap space during the OS install. Anyway, I'm concerned because hardly any swap is in use even though I'm using nearly all my RAM. I'm not sure what to make of this information, but I'm thinking I may have my systems set up completely wrong.

So I'm wondering if I need to make my swap larger, or reconfigure my SHMALL and SHMMAX (which I don't know how to do at this point).
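A rough way to read the numbers above, assuming the older `top` layout shown here, in which "used" includes buffers and page cache: subtracting buffers and cached from used gives an estimate of what programs themselves hold. The helper names below are made up for illustration, and the figures are copied from the five `top` outputs.

```python
# (total, used, buffers, cached) in kB, copied from the top output above.
SNAPSHOTS = {
    "DVR 1": (8258640, 8219676, 345120, 5861244),
    "DVR 2": (4121796, 4096208, 248748, 2228520),
    "DVR 3": (4121796, 4091844, 236644, 2408516),
    "DVR 4": (4121796, 4100284, 271616, 2319336),
    "DVR 5": (4121796, 4085828, 214708, 2421292),
}


def used_by_programs(used_kb, buffers_kb, cached_kb):
    """Estimate memory held by processes, excluding buffers and page cache."""
    return used_kb - buffers_kb - cached_kb


def summarize(snapshots):
    """Map each server to (kB held by programs, rounded percent of total RAM)."""
    rows = {}
    for name, (total, used, buffers, cached) in snapshots.items():
        programs = used_by_programs(used, buffers, cached)
        rows[name] = (programs, round(100 * programs / total))
    return rows


for name, (kb, pct) in summarize(SNAPSHOTS).items():
    print(f"{name}: {kb} kB held by programs ({pct}% of RAM)")
```

By this estimate each server has roughly 1.4 to 2 GB held by programs about ten hours after boot, with the rest being cache the kernel can usually give back, which would also explain why swap is barely touched. One caveat, stated as an assumption: shared memory segments (which ZM relies on) are, as far as I know, counted under "cached" and cannot simply be dropped, so the real headroom is likely smaller than this subtraction suggests. A snapshot taken shortly before a crash would be far more telling than one taken ten hours after a reboot.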

I did try to install Munin, but having almost no Linux skills, I couldn't figure out how to use it. I was able to pipe information from top into a text file easily enough, but I don't know how to make it run constantly and output to a log file so I can review it later.

Anyone have any thoughts on this before I start experimenting?

Post by **christophe_y2k** » Wed May 30, 2012 12:29 pm

Hello, here are my notes on this (under Gentoo Linux):
By default, my Gentoo install has shmall/shmmax settings that are too low for more than one IP camera...
I have 8 GB of physical memory.

# nano -w /etc/sysctl.conf
Add, in our case here: 16 GB of shared memory (shmall) with a maximum single allocation of 8 GB (shmmax)

kernel.shmall = 4194304
kernel.shmmax = 8589934592

# sysctl -p

Explanation:

shmmax = maximum shared memory allocatable at one time; unit: bytes
shmall = total amount of usable shared memory; unit: number of 4 KB pages

http://www.zoneminder.com/wiki/index.ph ... esolutions.

For example: if you want to allow a maximum of 16 GB of shared memory in total, you have to convert that to a number of pages (or segments), with a page size of 4096 bytes.

* kernel.shmall = 16*1024*1024*1024/4096
-> kernel.shmall = 4194304

shmmax is the maximum amount that can be allocated in one request. This is an actual memory size in bytes (as opposed to pages); set to 8 GB, that is 8*1024*1024*1024 = 8589934592.

* kernel.shmmax = 8589934592

The /etc/sysctl.conf would have these lines:

kernel.shmall = 4194304
kernel.shmmax = 8589934592

As above, reload your sysctl.conf with sysctl -p and check that the settings are correct with ipcs -l.
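The arithmetic above can be double-checked with a few lines of Python. This is only a sketch of the unit conversion, assuming a 4096-byte page size as in the example; the function names are made up for illustration, and it says nothing about which sizes are right for a given ZM install.

```python
PAGE_SIZE = 4096   # bytes per page, as assumed in the example above
GIB = 1024 ** 3    # bytes in one gigabyte (binary)


def shmall_pages(total_gib, page_size=PAGE_SIZE):
    """kernel.shmall is counted in pages: total shared memory / page size."""
    return total_gib * GIB // page_size


def shmmax_bytes(segment_gib):
    """kernel.shmmax is counted in bytes: largest single shared segment."""
    return segment_gib * GIB


print(f"kernel.shmall = {shmall_pages(16)}")
print(f"kernel.shmmax = {shmmax_bytes(8)}")
```

The two printed lines match the values quoted for /etc/sysctl.conf. If a system uses a different page size, pass it as `page_size`; on Linux it can be read with `getconf PAGE_SIZE`.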

ZoneMinder Forums
