VMware "best practices" for 2020

anomaly0617
Posts: 3
Joined: Wed Mar 25, 2020 3:32 pm


Post by anomaly0617 »

Hi there,

I'm not certain this is the appropriate forum to post this on, so if not, mods please feel free to move it to the appropriate forum.

For a little background, I'm a Systems Engineer with 20+ years in the IT industry. My background includes small and medium-sized networks all the way up to large-scale data centers with high availability and redundant power, HVAC, and internet connections. My first Linux distro was the original Red Hat v4 in the '90s, and I've stayed current on Windows, Linux (multiple distros), AIX, and (back in the day) NetWare. So to summarize, I'm no spring chicken when it comes to Windows, Linux, firewalling, switching, networking, video management systems, H.264 RTSP streams, PBXes and Voice over IP, etc.

What I'm looking for is a "best practices" guide from someone who is very knowledgeable on ZoneMinder's resource needs when it relates to virtualized environments. In my case, this means VMware ESXi Hypervisors, but this could also apply to any Hypervisor software such as KVM or others.

I've got a few ZoneMinder physical servers set up, and they are generally rock solid. No issues with lockups.

But I've got a small ZoneMinder virtual server set up on an HP ProLiant ML350 G6 with 96GB of RAM. This was set up for research purposes only in my company's R&D facility. I can't really go into the details of what I'm specifically doing, but ZoneMinder is one VMS I'm working with, and there are 2-3 others I work with regularly.

At that location I started out with a dedicated Ubuntu 18.04.3 LTS virtual server with 4 vCPUs, 12GB of RAM, and a 200GB Thick Provisioned, Lazily Zeroed hard drive. It's running the distro's open-vm-tools and is fully updated. I set up 4 cameras on it using default settings. Within 6 hours, the server hard locked: no kernel panic, just no responsiveness at all. VMware indicated that the VM tools had stopped responding. If I hard reset the virtual server, it runs for another 4-6 hours and then hard locks again.

Thinking this was a fluke, I powered this server down. I then took a known stable Ubuntu 18.04.4 LTS server that has been running some of my company's analysis software for months. After making a snapshot backup, I bumped the virtual server resources up to 4 vCPUs, 24 GB of RAM, and 600 GB of Thick Provisioned, Lazily Zeroed hard drive space. I locked all the RAM in place so that nothing else can try to share it. This means that when the VM is powered on, it reserves 24 GB of RAM regardless of what it is actually using, and no other VM on the server can touch that RAM. After installing ZoneMinder from the distro and per the ZoneMinder website, I put one single camera on it, a Bosch MIC IP Starlight 7010.

It locked up within 2 hours.

After a few of these lockups/hard resets, I worked to optimize the virtual machine. I ran the mysqltuner optimization tool and followed its recommendations. My frame analysis is down to 5 FPS, per recommendations. I'm monitoring only, not recording at this point. I am running the camera at 1920x1080 resolution, but that should be expected as this is a $10k camera and people expect to get $10k worth of performance out of it. I've either met or exceeded the recommendations for memory allocation for MySQL and PHP. The system still locks up.

I've run top looking for a smoking gun, and zms is the only thing I see that could be a likely culprit. CPU and memory resources are well within tolerance - in fact I'd go so far as to say they are ridiculously under-utilized.

If I disable ZoneMinder (systemctl disable zoneminder) and reboot the server, it stays up and running for days (so far, I'm at 5+ days). I have no doubt it will run for weeks or months without any human intervention based on past performance of the VM without ZoneMinder loaded on it.

If I enable ZoneMinder, even just to start the service (systemctl start zoneminder), the system locks up within 4 hours.

So, I'm now at the point where I think I need to ask the ZoneMinder gurus and people that are running ZM in virtualized environments -- what am I doing wrong, and what should I be doing to get it right?

Thanks, in advance, for all of your advice and help.
-Anomaly0617 (Paul)
ABigHead
Posts: 1
Joined: Fri Mar 27, 2020 1:27 pm

Re: VMware "best practices" for 2020

Post by ABigHead »

I hope that my reply is more than anecdotal for you, but YMMV.

I'm currently running 1.34.7 on Ubuntu 18.04 LTS on ESXi 6.7U3. It's been running on really old hardware in a VM with 16GB of RAM assigned to it, with no issues. I'm assuming that you're taking a snapshot, firing up the VM and running it for X amount of time, and then it freezes up on its own, correct? If that is the case, it is likely caused by the snapshot. As far as I understand it, snapshots are intended for temporary use: you take your snapshot, update packages, try new configurations, etc., and if you blow something up you roll it back. If your changes don't break anything, you then go back into your ESXi interface and use 'Delete All' to remove and consolidate all of your snapshots. This sounds counter-intuitive, but what deleting them does (the following is my personal understanding) is merge the current state of the VM back into the original base disk. THIS CAN TAKE A WHILE, as it keeps all the newest changes from your latest snapshot and merges them down into the original disk.

When I have forgotten to do what I just described above, my system would lock up after a few hours to a day of running, and I'd have to kill it from the ESXi web interface, as the VM would become almost if not completely unresponsive. My understanding is that when you create a snapshot, it creates a 'delta disk', a separate virtual disk that records all the changes you're making. When you fail to delete all and consolidate your snapshots at the end of your changes, this delta disk keeps growing until it runs out of room and freezes your OS/VM.
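For anyone who wants to check whether a VM is still running on a leftover snapshot, one rough approach is to look at the file names in the VM's datastore folder. The sketch below is only an illustration. It assumes the usual ESXi naming, where snapshot child disks look like "name-000001.vmdk" with "-delta" or "-sesparse" data files next to them. The function name snapshot_disks and the sample file list are made up for the example.

```python
import re

# Assumption: ESXi names snapshot child disks "<vm>-000001.vmdk" and stores
# their data in "<vm>-000001-delta.vmdk" or "<vm>-000001-sesparse.vmdk".
SNAPSHOT_DISK = re.compile(r"-\d{6}(-delta|-sesparse)?\.vmdk$")


def snapshot_disks(filenames):
    """Return the file names that look like snapshot (delta) disks, sorted."""
    return sorted(name for name in filenames if SNAPSHOT_DISK.search(name))


# Hypothetical directory listing for a VM called "zm".
listing = [
    "zm.vmx",
    "zm.vmdk",
    "zm-flat.vmdk",
    "zm-000001.vmdk",
    "zm-000001-delta.vmdk",
    "zm-Snapshot1.vmsn",
]

for name in snapshot_disks(listing):
    print("leftover snapshot disk:", name)
```

An empty result suggests the VM is running directly on its base disk. Anything listed is worth checking in the snapshot manager before blaming the guest.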

Here is the link to the VMware docs explaining how the delete process works. Read the whole thing for a better explanation of the above:
https://docs.vmware.com/en/VMware-vSphe ... 1AE25.html
anomaly0617
Posts: 3
Joined: Wed Mar 25, 2020 3:32 pm

Re: VMware "best practices" for 2020

Post by anomaly0617 »

First, thanks for the response!!

I don't think it's exactly what you mentioned above, but I do believe you are on to something viable here.

I'm doing some testing now. I'll let you know one way or another how it pans out.
anomaly0617
Posts: 3
Joined: Wed Mar 25, 2020 3:32 pm

Re: VMware "best practices" for 2020

Post by anomaly0617 »

OK, so to follow up, the answer is "this happens when one or both of two things are occurring":

1. You have a VMware ESX Server with some bad memory in it. A DIMM module is bad, and it's likely not in the first bank, but in the 2nd-99th bank.

2. You are using backup software that quiesces the VM to create a snapshot that it then backs up. If the quiesce takes too long, the snapshot takes too long to create, and NVRs do not really like that.

As it turns out, my problem resulted from both. The HP ProLiant was not reliable, and the backup software in question is Veeam Backup and Replication. We moved over to a Dell R720 and suddenly Veeam could snapshot the VM in seconds instead of minutes or hours, and the VM no longer locks up. Been running this way for about a week now with no lockups.

Hope this helps someone in the future!
winstontj
Posts: 28
Joined: Tue Aug 06, 2019 7:56 pm

Re: VMware "best practices" for 2020

Post by winstontj »

Can I ask what physical hardware you are using? I thought I'd be fine running ZM on esxi but my experience has been less than ideal. That said, my hardware is not new and I probably haven't put in enough time to learn about how zm works to tune/tweak to my setup.

I'm running esxi ent plus 5.5u3 as a home/test/sandbox lab. I have 6.5u3 in production and 5.5u3 was paid at one time. My half-rack at home is Dell & supermicro with Xeon x5600 cpus and plenty of memory. Operating systems are on hw raid striped arrays on ssd and data storage is hw raid on hdd.

I'm wondering what I'm doing wrong. I have a 6 vCore VM with 64gb vHDD and 1.5tb vHDD, 8gb vRam. Videos are all buggy and glitchy and ZM breaks when I try and add a third camera. This is more Ubuntu/ZM related than ESXI: If I had it my way I'd love to keep daily or weekly files on a local ssd and then run a cron job to move video files and images over to a slower hdd storage space. Cameras are Chinese 5mp, wired, PoE running on a Cisco SG300-10-MPP switch.
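The "keep recent files on SSD, age them out to HDD" idea could be sketched roughly like this. This is a generic file-aging example, not something ZoneMinder-aware: ZoneMinder keeps track of where its events live, so files moved out from under it may no longer play back from the web interface. The function names and the example paths in the comment are hypothetical.

```python
import shutil
import time
from pathlib import Path


def find_old_files(src_root, max_age_days, now=None):
    """Return files under src_root whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(src_root).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def move_old_files(src_root, dst_root, max_age_days, now=None):
    """Move old files to dst_root, keeping the same relative directory layout."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    moved = []
    for src in find_old_files(src_root, max_age_days, now):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # shutil.move copies and deletes when src and dst are on different disks.
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved


# Example call with hypothetical mount points, e.g. from a nightly cron job:
# move_old_files("/mnt/ssd/video", "/mnt/hdd/video", max_age_days=7)
```

Keeping the relative layout on the destination means a file can be found again by the same camera/date path, just under a different root.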

All of my network should be fine (L2 & L3 switching, 1GbE, LACP or LAGG in different places).

I'm at a bit of a loss. I know my cameras aren't dialed in perfectly, but I'm at a loss as to how to get three cameras to work on one virtual machine. I see people running systems on bare metal that are a lot older than the hardware I'm using and they are fine with lots more cameras. Wondering if it's finally time to spend some money on hardware.
bkjaya1952
Posts: 282
Joined: Sat Aug 25, 2018 3:24 pm
Location: Sri Lanka

Re: VMware "best practices" for 2020

Post by bkjaya1952 »

You can install the ZoneMinder Docker image on QEMU and view it remotely on balenaCloud. For details, please refer to the following link.
https://bkjaya.wordpress.com/2020/06/17 ... lenacloud/
jlw52761
Posts: 12
Joined: Thu Dec 03, 2020 9:29 pm

Re: VMware "best practices" for 2020

Post by jlw52761 »

So running a snapshot will easily triple the IO overhead to your storage. Get rid of snapshots; they should never be used for backup purposes, just for quick actions such as rolling back an update or quiescing the filesystem so backup software can ingest the disk.
Also, look at esxtop and see if you are getting IOWait or co-stop (%CSTP). IOWait will point to network/storage issues, and co-stop will pretty much tell you that you have too many vCPUs assigned to the VM. Less is more in a virtual world. With ESXi 6.7 U3, vNUMA is less of an issue, so no need to be concerned about the cores per socket, but having CPU or Memory HotAdd will tear a VM and the host up. Make sure those are disabled; the slight convenience is just not worth the performance hit and the issues it causes the vmkernel scheduler.
I would go through the VMware Performance Best Practices guide and look at all your VMs to make sure they are not oversized and that you don't have a crazy vCPU to pCPU ratio, as that will cause issues.
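As a quick back-of-the-envelope check of the vCPU to pCPU ratio, something like the sketch below works. The VM names and core counts are invented for illustration, and no particular "safe" ratio is implied here; compare the result against whatever the best practices guide recommends for your workload.

```python
def vcpu_to_pcpu_ratio(vcpus_per_vm, physical_cores):
    """Total vCPUs assigned across VMs divided by the host's physical cores."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm.values()) / physical_cores


def largest_vms(vcpus_per_vm, top=3):
    """VM names sorted by vCPU count, largest first, as candidates to shrink."""
    ranked = sorted(vcpus_per_vm.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top]]


# Hypothetical host: 8 physical cores, four powered-on VMs.
vms = {"zm": 6, "db": 4, "web": 2, "lab": 4}
print("vCPU:pCPU ratio:", vcpu_to_pcpu_ratio(vms, physical_cores=8))
print("largest VMs:", largest_vms(vms, top=2))
```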
Also, ZM is IO intensive. What storage do you have? If it's network based (iSCSI or NFS), then jumbo frames are a must. Also, NFSv3 is single session: it cannot use multiple vmkernel interfaces, so you are limited to the bandwidth of the one vmkernel port. It's also file based, so lots of detractors. iSCSI is block based, can use multiple uplinks, and will generally give better performance for IO-intensive workloads (it also runs over TCP, but without the file-protocol overhead).
iconnor
Posts: 2879
Joined: Fri Oct 29, 2010 1:43 am
Location: Toronto

Re: VMware "best practices" for 2020

Post by iconnor »

ZM dev here.

Not an expert in VM land. However, having taken ZM to over 800 cameras over 50 servers and faced just about every bottleneck I thought I would hit...

I'm pretty shocked at these lockups.

You are allocating a ton of resources to not a lot of cameras.

Yes, video surveillance is IO/CPU/RAM/NET intensive. It hits all of them. Yet at the same time it doesn't hit them THAT hard.

I know of two things that will lock up a machine, whether VM or not: OOM and hung disk IO. OOM doesn't tend to cause logging until it recovers, and sometimes it never recovers. Simple time-based RAM usage monitoring should at least show whether that is the case.
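A minimal sketch of that kind of time-based RAM monitoring on a Linux guest might look like the following. It samples /proc/meminfo and appends one line per sample, syncing each line to disk so the last entry survives a hard lock. The log path, interval, and function names are assumptions for the example, not anything ZoneMinder ships.

```python
import os
import time
from datetime import datetime, timezone


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def format_sample(meminfo, now=None):
    """Build one log line: UTC timestamp plus available, total, and free swap in MiB."""
    now = now or datetime.now(timezone.utc)
    avail = meminfo.get("MemAvailable", 0) // 1024
    total = meminfo.get("MemTotal", 0) // 1024
    swap_free = meminfo.get("SwapFree", 0) // 1024
    stamp = now.isoformat(timespec="seconds")
    return f"{stamp} avail={avail}MiB total={total}MiB swap_free={swap_free}MiB"


def monitor(log_path="/var/tmp/zm_mem.log", interval_s=30, samples=None):
    """Append a sample every interval_s seconds; samples=None runs until killed."""
    count = 0
    while samples is None or count < samples:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        with open(log_path, "a") as log:
            log.write(format_sample(info) + "\n")
            log.flush()
            os.fsync(log.fileno())  # make sure the line hits disk before a lockup
        count += 1
        if samples is None or count < samples:
            time.sleep(interval_s)
```

Calling monitor() in the background and reading the tail of the log after a hard reset shows whether available memory was collapsing in the minutes before the freeze.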

Disk IO problems should generate logs.

So I don't know what is going on here. At the end of the day ZM is a user space process and SHOULD not be able to lock up a system. Yet occasionally we get a case where it seems to.

I also have VirtualBox VMs that lock up and crash during kernel boot in about 1 out of 10 boots. So virtualisation is good, but maybe not perfect.
smokinjoe
Posts: 38
Joined: Wed Feb 03, 2021 12:45 am

Re: VMware "best practices" for 2020

Post by smokinjoe »

Hyperthreading IMHO is bad if your load for the box is above 50%

Dual CPU NUMA issue?

VEEAM is bad, no really! LOL

Does your VM have the Open-VM-Tools installed and up to date?

If you're using the proprietary VMware Tools, what version?

For our Windows servers in the past we had to tweak the disk timeout setting, as iSCSI could have timeouts that would make Windows bluescreen. If something similar is happening to your DB or application, then it is an IO issue. I think when we run ZM on enterprise gear we need to tweak some things, as it is not running on a standalone / bare metal box.

VEEAM has a few different backup methods, so you have to tell us what method you're using and what version of VEEAM. I have had VEEAM cause MS SQL databases to hang enough that applications talking to the server are barking about spooling transactions. VEEAM 10 does some backhanded stuff to get a backup; in version 11 it gets more involved, as each VMware host will have a guest running to help coordinate the stuff it needs to do. IMHO VEEAM should just make a VEEAM file system for VMware and be responsible for what they do. In v11 they also cram a filter driver in-line with the disk subsystem to "help" the backup process. Every time I hear filter drivers my head spins, as it usually has to do with MS Windows and AV software.

VMware native snapshots are ugly, knuckle-dragger ugly. I have ZoneMinder running as a jail on TrueNAS Core 12.0-U2.1 and snapshots do not cause any issues at all. Compressing the iocage location and the data location for the ZM data does cause issues for the web interface, though, as it is not as fluid as an install with no compression. I am going to reinstall to fix this mistake I made.