My name is Kfir and programming is a hobby of mine. Seeing how CPU heavy ZM can be, I thought it would be a good challenge to see how far I could go in reducing ZM's CPU usage.
I started working on this around February 2011 and have been working on it slowly since (because of limited free time). I first wanted to get an overall view of the ZM components. I did this by compiling ZoneMinder statically with profiling enabled, to see how many times each function is called and how long it takes to execute. I then began tackling the most used, heaviest functions. I discovered that for some operations it is possible to take advantage of advanced CPU extensions such as SSE2 (Pentium 4 and newer) or SSSE3 (Intel Atom and newer), which are CPU extensions by Intel that add SIMD instructions.
SIMD means "Single Instruction, Multiple Data". The idea is to work on multiple pieces of data using just one instruction. For example, if we want to add a value to 4 integers, instead of loading the first one, adding to it, storing the result, loading the second one, adding to it, storing the result (and so on), we can load all 4 integers into a large register (an XMM register) using a single instruction, add to all of them at once using a single instruction, and store them all back to memory, also using a single instruction. So theoretically we are now working on 4 integers at a time instead of 1. This should result in a 4x speedup, but that is very often not the case, because sometimes we have to rearrange the data into the layout we need in the XMM registers, and because the SSE2\SSSE3 instruction set is very limited, so sometimes we have to write code to emulate an instruction that isn't available. The effective speedup is therefore closer to 2x, which is still very significant.
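To make the load / operate / store idea concrete, here is a minimal sketch that uses addition as the operation (SSE2 has a packed 32bit integer add, `_mm_add_epi32`). This is not code from the patch; the function names `add4_scalar` and `add4_simd` are made up for illustration, and the SSE2 path is only compiled when the compiler defines `__SSE2__`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

using Int4 = std::array<std::int32_t, 4>;

// Scalar version: one load, one add and one store per element.
Int4 add4_scalar(const Int4& a, const Int4& b) {
    Int4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
    return out;
}

// SIMD version: one load per operand, one add, one store for all 4 elements.
// Falls back to the scalar version when SSE2 is not available at compile time.
Int4 add4_simd(const Int4& a, const Int4& b) {
#if defined(__SSE2__)
    // Unaligned loads are used because std::array is not guaranteed
    // to sit on a 16 byte boundary.
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
    const __m128i sum = _mm_add_epi32(va, vb);  // 4 additions in one instruction
    Int4 out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), sum);
    return out;
#else
    return add4_scalar(a, b);
#endif
}
```

Both functions return the same result; the SIMD one simply does the work of the four loop iterations in one step.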
However, SIMD can be very restricting. Besides the very limited instruction set, here are a few drawbacks: 1) It requires the data to be vectorized. 2) Data has to be aligned on a 16 byte boundary, otherwise there is a big performance penalty.
When it comes to ZM, I had to change all image buffer memory allocations to be on a 16 byte boundary, but that's not all. The color format that ZM uses (RGB24) is not suitable because it doesn't fit evenly into an XMM register. Each 24bit pixel is 3 bytes, and 16 is not a multiple of 3, so we can't fit a whole number of pixels into the 16 byte XMM register.
This required me to add 32bit colour formats to ZoneMinder. Along the way, I also added BGR24 support to remove the need for the conversion from BGR24 to RGB24. The 32bit RGB formats I added are: RGBA, BGRA, ARGB, ABGR. In all of these the alpha byte is not used, but is there for padding and alignment. Four RGB32 pixels (4 bytes each) fit exactly into an XMM register (16 bytes). Grayscale is even better: it's possible to fit 16 grayscale pixels into an XMM register and work on all 16 pixels at once, which is quite amazing!
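The alignment and "fits exactly" points above can be sketched as follows. This is only an illustration, not the allocation code from the patch: the helper names are hypothetical, and `posix_memalign` is assumed to be available (POSIX systems such as Linux).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical helper: allocate an image buffer that starts on a 16 byte
// boundary. Returns nullptr on failure. Release the buffer with free().
std::uint8_t* alloc_image_buffer(std::size_t width, std::size_t height,
                                 std::size_t bytes_per_pixel) {
    const std::size_t size = width * height * bytes_per_pixel;
    void* p = nullptr;
    if (size == 0 || posix_memalign(&p, 16, size) != 0) {
        return nullptr;
    }
    return static_cast<std::uint8_t*>(p);
}

// True if the pointer sits on a 16 byte boundary.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// How many whole pixels of the given size fit into a 16 byte XMM register.
constexpr std::size_t pixels_per_xmm(std::size_t bytes_per_pixel) {
    return 16 / bytes_per_pixel;
}

// Whether those pixels fill the register with no bytes left over.
constexpr bool fits_exactly(std::size_t bytes_per_pixel) {
    return 16 % bytes_per_pixel == 0;
}
```

With these helpers, `fits_exactly(3)` is false for RGB24, while `pixels_per_xmm(4)` is 4 for the 32bit formats and `pixels_per_xmm(1)` is 16 for grayscale.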
One problem I faced was dealing with JPEGs. libjpeg can only work with grayscale or RGB24 (RGB order). So if we want to write a JPEG from, or read one into, a BGR24 or 32bit RGB buffer, we would have to convert the image to RGB24 first, not good! Fortunately, there is libjpeg-turbo, whose colorspace extensions add support for BGR24, RGBA, BGRA, ARGB and ABGR (the last four are 32bit RGB formats), so we are good! Also, because I removed the conversion from BGR24 to RGB24 (no need for this performance penalty anymore), storing BGR24 now requires libjpeg-turbo. This makes the ZM_LOCAL_BGR_INVERT option useless; it no longer does anything.
OK, but what if the analog device you are capturing from doesn't support 32bit capturing (BGR32 or RGB32)? Don't worry, you have two options:
1) Use the "Target colorspace" option to tell ZM to convert your capture palette into 32bit RGB. This also works for converting any capture palette into 8bit grayscale or RGB24. Take YUYV for example. What if your camera only supports the YUYV capture palette and you want grayscale? Previously, there was nothing you could do. Now you can just select the 8bit target colorspace and ZM will automatically convert your YUYV capture palette into 8bit grayscale. Let me tell you a secret: YUYV can be converted into grayscale VERY quickly. All that is needed is to extract the Y channel. If SSSE3 is available, this will be even faster!
2) Stick to your current capture palette, e.g. RGB24. I also made improvements to the RGB24 format by using the same algorithms I wrote for the 32bit RGB formats, using some loop unrolling to reduce loop overhead, and some other techniques.
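As a rough illustration of why YUYV to grayscale is so cheap: YUYV stores pixels as the byte sequence Y0 U Y1 V, so the grayscale value of each pixel is simply every second byte. The sketch below is a plain scalar version with a made-up function name, not the code from the patch (which also has an SSSE3 variant).

```cpp
#include <cstddef>
#include <cstdint>

// Extract the Y (luma) channel from a packed YUYV buffer.
// yuyv must hold 2 * pixel_count bytes, gray must hold pixel_count bytes.
void yuyv_to_gray(const std::uint8_t* yuyv, std::uint8_t* gray,
                  std::size_t pixel_count) {
    for (std::size_t i = 0; i < pixel_count; ++i) {
        // Bytes are laid out as Y0 U Y1 V Y2 U Y3 V ..., so the luma
        // samples sit at the even byte offsets.
        gray[i] = yuyv[2 * i];
    }
}
```

No arithmetic is performed on the pixel values at all; the conversion is just a strided copy.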
The "Target colorspace" option is available for all source types. Previously this option existed under a different name for the remote, FFMPEG and file monitor types, but it wasn't always functional when choosing grayscale. I removed that option and added "Target colorspace" instead, which should be equivalent, but this time it's available for all monitor types and is actually completely functional.
2. The major performance changes:
Although there are many changes to almost all components, I will only explain the major ones for now:
The analyze daemon (zma):
After a lot of testing and profiling with different resolutions, settings, durations, etc., I discovered that for zma in motion detection mode, the Delta and Blend functions are together responsible for around 70-80% of zma's total CPU usage.
Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time   seconds   seconds     calls  ms/call  ms/call  name
 43.25     86.56    86.56     43079     2.01     2.01  Image::Delta(Image const&) const
 39.97    166.55    79.99     43105     1.86     1.86  Image::Blend(Image const&, int) const
 14.66    195.90    29.35     43079     0.68     0.75  Zone::CheckAlarms(Image const*)
  1.28    198.47     2.57  96002511     0.00     0.00  Image::Buffer(unsigned int, unsigned int) const
I coded new fast algorithms for Delta and Blend, which should produce output almost exactly identical to the original ones. I wrote all the algorithms externally, tested them using custom test programs I wrote for this task, and then integrated them into ZM.
The blending process: It's possible to do really fast image blending at certain percentages without using multiplication or division (which are expensive on the CPU). Those percentages are 50%, 25%, 12.5%, 6.25%, 3.125% and 1.5% (each is half of the previous one, so the last is 1.5625% to be exact). If you turn on ZM_FAST_IMAGE_BLENDS, ZM will select the fast blending percentage closest to your monitor's blend percent and use it. Even better, if SSE2 is available on the system and ZM_CPU_EXTENSIONS is enabled (along with ZM_FAST_IMAGE_BLENDS), an SSE2 version of this fast blending will be used. If ZM_FAST_IMAGE_BLENDS is disabled, standard blending will be used, which is a lot slower and has no SSE2 version.
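Here is a minimal sketch of how blending at these percentages can be done with shifts only. It is an illustration under assumptions, not the patch's actual code: the function names are made up, the blend percent is assumed to be the weight of the new image, and the selection helper simply picks the closest of the six percentages.

```cpp
#include <cmath>
#include <cstdint>

// Pick the shift (1..6) whose percentage (50, 25, 12.5, 6.25, 3.125, 1.5625)
// is closest to the requested blend percent. This runs once per monitor,
// not once per pixel, so the floating point math here does not matter.
int blend_shift_for_percent(double percent) {
    int best_shift = 1;
    double best_error = std::fabs(percent - 50.0);
    double candidate = 50.0;
    for (int shift = 1; shift <= 6; ++shift) {
        const double error = std::fabs(percent - candidate);
        if (error < best_error) {
            best_error = error;
            best_shift = shift;
        }
        candidate /= 2.0;
    }
    return best_shift;
}

// Blend one 8bit sample: result = ref * (1 - 1/2^shift) + cur * (1/2^shift),
// computed with shifts, one subtraction and one addition.
// The result always stays within 0..255.
std::uint8_t blend_sample(std::uint8_t ref, std::uint8_t cur, int shift) {
    return static_cast<std::uint8_t>(ref - (ref >> shift) + (cur >> shift));
}
```

For example, a blend percent of 12 maps to shift 3 (12.5%), and blending a reference value of 200 with a new value of 100 at shift 2 (25%) gives 175.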
Delta image generation: The algorithm used is pretty much identical to the old one, with some small differences: the lookup and abs tables were removed, some loop unrolling was added, and the BT.709 coefficients are now used instead of the BT.601 coefficients. With BT.709 it's possible to use an optimization to compute the Y channel and get almost the same value as without the optimization. The formula is Y = (R + R + B + G + G + G + G + G) / 8
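The formula above can be sketched in code as follows. The function names are hypothetical, and the way the delta is formed here (per-channel absolute differences fed into the luma formula) is an assumption made for illustration, not necessarily what the patch does.

```cpp
#include <cstdint>

// Approximate BT.709 luma using only additions and one shift:
// Y = (2R + 5G + B) / 8, i.e. weights 0.25, 0.625 and 0.125.
std::uint8_t approx_luma(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint8_t>((r + r + b + g + g + g + g + g) >> 3);
}

// Absolute difference of two 8bit samples without a lookup table.
std::uint8_t abs_diff(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a > b ? a - b : b - a);
}

// Assumption: the delta of two RGB pixels is the approximate luma of
// their per-channel absolute differences. p1 and p2 point at R, G, B bytes.
std::uint8_t rgb_delta(const std::uint8_t* p1, const std::uint8_t* p2) {
    return approx_luma(abs_diff(p1[0], p2[0]),
                       abs_diff(p1[1], p2[1]),
                       abs_diff(p1[2], p2[2]));
}
```

The exact BT.709 weights are 0.2126 (R), 0.7152 (G) and 0.0722 (B), so the shift-friendly weights are an approximation that trades a little accuracy for the removal of all multiplications.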
For grayscale I added an SSE2 version, and for 32bit RGB I added SSSE3 versions. These SSE algorithms will be used as long as ZM_CPU_EXTENSIONS is enabled and the SSE version required by the algorithm is available on the system.
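To give an idea of what an SSE2 grayscale delta can look like, here is a small sketch that processes 16 pixels per iteration. It is not the patch's code: the function name is made up, unaligned loads are used so it works on any buffer (with 16 byte aligned buffers the aligned load and store intrinsics could be used instead), and the SSE2 path is only compiled when `__SSE2__` is defined.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// out[i] = |a[i] - b[i]| for n grayscale pixels.
void gray_delta(const std::uint8_t* a, const std::uint8_t* b,
                std::uint8_t* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // SSE2 has no unsigned byte abs-diff, so emulate it: saturating
        // subtraction gives max(a - b, 0) and max(b - a, 0); one of the two
        // is always zero, so OR-ing them yields |a - b|.
        const __m128i diff = _mm_or_si128(_mm_subs_epu8(va, vb),
                                          _mm_subs_epu8(vb, va));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), diff);
    }
#endif
    // Scalar loop: handles the tail, or everything when SSE2 is unavailable.
    for (; i < n; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    }
}
```

The emulated absolute difference is a small example of the "write code to emulate a missing instruction" situation that comes up with SSE2.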
zma with the patch is expected to be around 20-40% faster.
The capture daemon (zmc):
For zmc there was not as much to be done as for zma. The biggest performance change I made for zmc was reducing the amount of memory copying, which accounted for around 10-30% of zmc's CPU time.
Capture: All source types now capture directly into the shared memory, and if a conversion has to be performed, it will be performed from the source with the output being stored into the shared memory.
zmc with the patch is expected to be around 10-30% faster.
3. Full list of changes
- Added new internal formats: BGR24 and 4 new 32bit RGB formats: RGBA, BGRA, ARGB, ABGR.
Support for these has been added to almost all existing functions, however these require libjpeg-turbo (read above).
- Changed all image buffer memory allocations to be on a 16 byte boundary.
- Rewrote the Blend and Delta functions entirely with SSE2\SSSE3 versions included.
- Replaced the ZM_Y_IMAGE_DELTAS option with ZM_CPU_EXTENSIONS to enable support for extended processor features.
- Removed the ZM_LOCAL_BGR_INVERT option, as it's no longer needed now that BGR24 is natively supported.
- Added "Target colorspace" option available to all source types. This is like the previous option, but also exists for local cameras.
- Direct memory capture (Less memory copying).
- Added SSE2 memory copy for image buffer copying.
- Added code to detect SSE presence and version.
- Event streaming uses sendfile() if available
- Changed code to make compiling with function inlining enabled possible
- Added fast YUYV->Grayscale conversion, including an SSSE3 version of it
- Major changes to zm_local_camera.cpp, including intelligent format selection based on palette, target colorspace, processor endianness, swscale presence and so on.
- Allow ZM to compile cleanly without swscale.
- Minor performance improvements to the AlarmedPixels motion detection.
- Added support for the JPEG and MJPEG capture palettes.
- Added automatic capture palette option.
- Changed shared memory layout to be identical on 32bit and 64bit platforms to improve compatibility and to make the size 16 byte aligned.
- Added grayscale support to the signal checking.
Uses the lowest byte only (right most byte). #000023 works nicely for me using the bttv driver.
- Fixed multiple monitors capturing from the same device and channel.
Current code allows for multiple monitors sharing the same device, each on a different channel
Or, multiple monitors sharing the same device, all on the same channel.
In both cases, capture method, width, height and palette must be identical on all monitors.
However, target colorspace can be different because each monitor handles the format conversion separately.
- Changed monitor probe's preferred capture settings.
- All capture palettes that require a format conversion are now marked with an asterisk in front.
- Modified default monitor options to simplify new monitor creation.
- Improved crop detection error handling.
- Allow the blend percent to be zero to disable blending.
- Added swscale clean up code for local cameras.
- Added image size requirements to ensure proper alignment.
- Modified the database Monitors table to add a field for the target colorspace.
- Changed ZM static colours to RGB order (in memory) instead of HTML(BGR) order.
- Relocated ZM's own format conversion algorithms to zm_image.cpp.
- Fixed bug: mmap unexpected memory size when changing capture options.
- Fixed bug: Undefined constant ZM_V4L2 in monitor probe.
- Fixed bug: Error in offset X in monitor probe.
- Fixed bug: Wrong events path for absolute paths with deep storage enabled.
- Fixed bug: Disabled linking a monitor to itself, which can cause an alarm lasting forever once triggered.
- Fixed bug: zma crashes when the monitor is linked to a disabled monitor.
- Fixed bug: zma crashes when the signal for an analog camera is unstable.
- Fixed bug: Mocord not entering an alarm state after certain conditions such as signal returned.
- Fixed bug: ZM crashing sometimes if JPEG quality is set to 100.
- Other small changes and fixes.
Grayscale is extremely fast, efficient and uses minimal RAM. Grayscale should be even faster with this patch, especially if SSE2 is available, because it benefits greatly from SSE2. Earlier I wrote that the size of an SSE register (XMM register) is 16 bytes. In those 16 bytes you can fit four 32bit pixels or 16 grayscale pixels, so it's really clear which one is the winner. Also keep in mind that the grayscale delta algorithm is a lot shorter (and thus faster) than the 32bit ones.
So if you don't need colour, or have limited CPU or RAM, consider using grayscale.
That leaves 24bit vs 32bit. The major difference between the two is that the SSE2\SSSE3 delta is only used with 32bit. If you are not using motion detection, or your processor does not have SSSE3, you might want to stick to 24bit because it uses less memory.
My suggestion is to try both and see if you notice any performance increase with 32bit. If not, you can go back to 24bit because it uses less memory.
I made a comparison table to explain the differences better:
The instructions below assume that if you have ZoneMinder installed, it is already stopped.
1) Download the patched source (ZM SVN version with my changes already applied on top of it) from here:
http://github.com/mastertheknife/ZoneMi ... /perfpatch
2) Navigate to the folder you just downloaded it to.
3) Extract the file to a folder by running:
tar xvzf TheFile.tar.gz
5) Rebuild the configure script and the makefiles to be fully compatible with your machine:
Note 1: For GCC versions older than 4.4, the compiler might complain about unknown registers. In that case, also add -msse2 to the CPPFLAGS. e.g. CPPFLAGS="-D__STDC_CONSTANT_MACROS -msse2"
Note 2: If your processor does not have SSE2, the code might fail to compile (unknown registers error). In that case, add -DZM_STRIP_SSE to the CPPFLAGS. This will compile ZM without any SSE code inside.
Optional: I recommend compiling with crashtrace and debugging enabled, see the following example for how to enable these.
Example ./configure line:
./configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --with-libarch=lib --with-mysql=/usr --with-ffmpeg=/usr --with-webdir=/var/www/zoneminder/htdocs --with-cgidir=/var/www/zoneminder/cgi-bin --with-webuser=apache --with-webgroup=apache --enable-debug=yes --enable-crashtrace=yes
* If this is a new installation, then load the db/zm_create.sql file to create a new database.
* If you have ZoneMinder installed, but the version is older than 1.25.0, then run this to upgrade the database:
Run the SQL statements below. This is a harmless change because it doesn't affect stock ZM, but it is required for using the patch.
You can use phpMyAdmin or the mysql program to execute the SQL statements below.
Example using the mysql program:
A) mysql -uUSER -pPASSWORD
B) Now type "use zm" to change to the correct database.
C) Execute the SQL statements below by typing them into the window:
ALTER TABLE `Monitors` ADD `Colours` TINYINT UNSIGNED NOT NULL DEFAULT '1' AFTER `Height`;
ALTER TABLE `Monitors` ADD `Deinterlacing` INT UNSIGNED NOT NULL DEFAULT '0' AFTER `Orientation`;
Note: I recommend looking at the comparison table image above to see the differences between all these. For local cameras, make sure to match the target colorspace and the capture palette. This way no format conversion is involved which should result in best performance. Again, look at the table above to see the direct V4L formats.
13) Start ZoneMinder and look for any errors in the system log.
Enjoy! Report any bugs and anything else you discover so I can fix them.
I recommend reading the performance tips below in order to take full advantage of the patch.
6. Performance Tips
Here are some performance tips on how to get the most performance out of ZM:
- Make sure ZM_CPU_EXTENSIONS is enabled to take advantage of SSE2\SSSE3 processor extensions.
- Make sure ZM_FAST_IMAGE_BLENDS is enabled. This limits the possible blend percents to 50%, 25%, 12.5%, 6.25%, 3.125% and 1.5%. Any other blend percent will be rounded to the nearest one. This type of blending is extremely fast because it involves no multiplication or division, both of which are expensive operations.
- Disable COLOUR_JPEG_FILES unless you really need it. This option converts grayscale images to colour before storing them as jpegs. This impacts performance and uses more space on your hard drive, so you should keep it disabled unless you need it.
- For motion detection, the fewer and smaller the zones, the faster.
- Disable CREATE_ANALYSIS_IMAGES if you use Blob motion detection, but don't need the analysis images.
- Local cameras: Enable ZM_V4L_MULTI_BUFFER if you can.
- Although obvious, make sure EXTRA_DEBUG and RECORD_DIAG_IMAGES are disabled!
- For local cameras: You should try matching the capture palette to the target colorspace, to prevent the need for a format conversion.
- How many frames per second do you really need? Think about it, there's no reason to capture at 20 fps if you don't need it. 5 fps should be enough for most users.
- Use libjpeg-turbo if you are not using it already!
- Experiment! Try different capture palettes \ Target colorspace options and see what works best for you.
7. Known Issues
None at the moment.
Some common questions that might arise:
Q: I'm receiving "Bogus input color" or similar errors in my system log.
A: You are attempting to use a format that requires libjpeg-turbo, but standard libjpeg is being used instead, which doesn't have the colorspace extensions that libjpeg-turbo has.
Q: I'm receiving "libjpeg-turbo is required for reading a JPEG directly into a RGB32 buffer, reading into a RGB24 buffer instead." or similar error messages in the system log.
A: You are attempting to use a format that requires libjpeg-turbo, but ZoneMinder was not compiled against libjpeg-turbo. Make sure that libjpeg-turbo's header files are used, and not standard libjpeg's header files.
9. Future Plans
- MPEG\Streaming performance tweaks (mostly remove some buffer copying) and some clean up in zm_mpeg.cpp
- Built-in deinterlacing (Pretty much complete, already included)
EDIT 26th October:
I removed the attachments. Instead, grab an already patched source from here:
http://github.com/mastertheknife/ZoneMi ... /perfpatch
The instructions were updated accordingly.
It is also possible to view the changes on github: