
ZM 1.25.0 Performance Patch - Completed

Posted: Fri May 20, 2011 4:37 pm
by mastertheknife
Hello everyone,

1. General:
My name is Kfir and programming is a hobby of mine. Seeing how CPU heavy ZM can be, I thought it would be a good challenge to see how far I could go in reducing ZM's CPU usage.
I started working on this around February 2011 and have been working on it slowly (because of limited free time). I first wanted to get an overall view of the ZM components. I did this by compiling ZoneMinder statically with profiling enabled, to be able to see how many times each function is called and how long it takes to execute. I then began tackling the most used, heaviest functions. I discovered that for some operations it is possible to take advantage of advanced CPU extensions such as SSE2 (Pentium 4 and newer) or SSSE3 (Intel Core 2, Atom and newer), which are CPU extensions by Intel that add SIMD instructions.

SIMD means "Single Instruction, Multiple Data". The idea is to work on multiple pieces of data using just one instruction. So for example, if we want to add a value to 4 integers, instead of loading the first one, adding to it, storing the result, loading the second one, adding to it, storing the result (and so on), we can load all 4 integers into a large register (XMM register) using a single instruction, add to them all at once using a single instruction and store them all back to memory, also using a single instruction. So theoretically, we are now working on 4 integers at a time instead of 1. This should result in a 4x speedup, but that is very often not the case, because sometimes we have to rearrange the data into the layout we need in the XMM registers, and because the SSE2\SSSE3 instruction set is very limited (there is no packed integer division, for example) and sometimes we have to write code to emulate an instruction that isn't available. So the effective speedup should be around 2x, which is still very significant.
However, SIMD can be very restricting. Besides the very limited instruction set, here are a few drawbacks: 1) It requires the data to be vectorized. 2) Data has to be aligned on a 16 byte boundary, otherwise there is a big performance penalty.

When it comes to ZM, I had to change all image buffer memory allocations to be on a 16 byte boundary, but that's not all. The colour format that ZM uses (RGB24) is not suitable because it doesn't fit exactly into an XMM register. Each 24bit pixel is 3 bytes, so we can't fit a whole number of pixels into the 16 byte XMM register.
This required me to add 32bit colour formats to ZoneMinder. On the way, I also added BGR24 support to remove the need for the conversion from BGR24 to RGB24. The 32bit RGB formats I added are: RGBA, BGRA, ARGB, ABGR. In all of these the alpha byte is not used, but is there for padding and alignment. Four RGB32 pixels (4 bytes each) fit exactly into an XMM register (16 bytes). Grayscale is even better: it's possible to fit 16 grayscale pixels into an XMM register and work on all 16 pixels at once, which is quite amazing!
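To make the packing arithmetic above concrete, here is a small Python sketch. It is not code from the patch (ZM is C++), and the names `XMM_BYTES`, `pixels_per_register` and `aligned_size` are made up for illustration. It counts how many whole pixels of each format fit into one 16 byte register, and rounds a size up to the next 16 byte boundary.

```python
XMM_BYTES = 16  # size of one SSE (XMM) register in bytes

# Bytes per pixel for the formats discussed above.
BYTES_PER_PIXEL = {"GRAY8": 1, "RGB24": 3, "RGB32": 4}


def pixels_per_register(fmt):
    """Return (whole pixels that fit, leftover bytes) for one XMM register."""
    return divmod(XMM_BYTES, BYTES_PER_PIXEL[fmt])


def aligned_size(nbytes, alignment=XMM_BYTES):
    """Round nbytes up to the next multiple of alignment (a power of two)."""
    return (nbytes + alignment - 1) & ~(alignment - 1)


if __name__ == "__main__":
    for fmt in BYTES_PER_PIXEL:
        whole, leftover = pixels_per_register(fmt)
        print(f"{fmt}: {whole} whole pixels, {leftover} byte(s) left over")
```

RGB24 is the odd one out: five pixels take 15 bytes and leave one byte over, so a register boundary always cuts through a pixel.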

One problem I faced was dealing with JPEGs. libjpeg can only work with grayscale or RGB24 (RGB order). So if we want to write a JPEG from, or read one into, a BGR24 or 32bit RGB buffer, we would have to convert the image to RGB24 first, not good! Fortunately, there is libjpeg-turbo, which has colorspace extensions that add support for BGR24, RGBA, BGRA, ARGB and ABGR (the last four are 32bit RGB formats), so we are good! Also, because I removed the conversion from BGR24 to RGB24 (no need for this performance penalty anymore), libjpeg-turbo is required for storing BGR24. This also makes the ZM_LOCAL_BGR_INVERT option useless; it now does nothing.
OK, but what if the analog device you are capturing from doesn't support 32bit capturing (BGR32 or RGB32)? No worries, you have two options:
1) Use the "Target colorspace" option to tell ZM to convert your capture palette into 32bit RGB. This also works for converting any capture palette into 8bit grayscale or RGB24. Take YUYV for example. What if your camera only supports the YUYV capture palette and you want grayscale? Previously, there was nothing you could do. Now you can just select the 8bit target colorspace and ZM will automatically convert your YUYV capture palette into 8bit grayscale. Let me tell you a secret: YUYV can be converted into grayscale VERY fast. All that is needed is to extract the Y channel. If SSSE3 is available, this will be even faster!
2) Stick to your current capture palette, e.g. RGB24. I also made improvements to the RGB24 format by using the same algorithms I wrote for the 32bit RGB formats, using some loop unrolling to reduce loop overhead and some other techniques.
The "Target colorspace" option is available for all source types. Previously this option existed under a different name for remote, FFMPEG and file monitor types, but it wasn't always functional when choosing grayscale. I removed that option and added "Target colorspace" instead, which should be equivalent, but this time it's available for all monitor types and is actually completely functional.
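The YUYV trick mentioned in option 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not the patch's C++ or SSSE3 code, and the function name `yuyv_to_gray` is hypothetical. Packed YUYV stores two pixels in four bytes (Y0 U Y1 V), so the grayscale image is simply every second byte.

```python
def yuyv_to_gray(yuyv: bytes) -> bytes:
    """Extract the Y (luma) bytes from packed YUYV data.

    Every even-indexed byte is the luma of one pixel; the odd-indexed
    bytes are the shared U and V chroma samples, which are discarded.
    """
    if len(yuyv) % 4 != 0:
        raise ValueError("YUYV data must be a multiple of 4 bytes")
    return bytes(yuyv[0::2])


# Two pixels: Y0=0x10, U=0x80, Y1=0xEB, V=0x80
frame = bytes([0x10, 0x80, 0xEB, 0x80])
gray = yuyv_to_gray(frame)
```

No arithmetic is done on the pixel values at all, which is why this conversion is so cheap compared to any RGB conversion.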

2. The major performance changes:
Although there are many changes to almost all components, I will only explain the major ones for now:

The analyze daemon (zma):
After a lot of testing and profiling with different resolutions, settings, durations, etc., I discovered that for zma in motion detection mode, the Delta and Blend functions together are responsible for around 70-80% of zma's total CPU usage.

Code:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 43.25     86.56    86.56    43079     2.01     2.01  Image::Delta(Image const&) const
 39.97    166.55    79.99    43105     1.86     1.86  Image::Blend(Image const&, int) const
 14.66    195.90    29.35    43079     0.68     0.75  Zone::CheckAlarms(Image const*)
  1.28    198.47     2.57 96002511     0.00     0.00  Image::Buffer(unsigned int, unsigned int) const
(Look here for a complete profiling report for stock ZM: http://pastebin.com/jU1Crv44)

I coded new fast algorithms for Delta and Blend, which should produce almost exactly the same output as the original ones. I wrote all the algorithms outside ZM, tested them using custom test programs I wrote for this task, and then integrated them into ZM.

The blending process: It's possible to do really fast image blending at certain percentages without using multiplication or division (which are expensive on the CPU). Those percentages are 50%, 25%, 12.5%, 6.25%, 3.125% and 1.5625% (each is half of the previous one). If you turn on ZM_FAST_IMAGE_BLENDS, ZM will select the fast blending percentage closest to your monitor's blend percent and use it. Even better, if SSE2 is available on the system and ZM_CPU_EXTENSIONS is enabled (along with ZM_FAST_IMAGE_BLENDS), an SSE2 version of this fast blending will be used. If ZM_FAST_IMAGE_BLENDS is disabled, standard blending will be used, which is a lot slower and doesn't have an SSE2 version.
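Here is a rough Python sketch of how blending by powers of two can avoid multiplication and division. It shows one plausible shift-based formulation and is not necessarily the exact arithmetic used in the patch; the names `nearest_fast_shift` and `fast_blend` are made up for illustration.

```python
# Shift amounts 1..6 correspond to 1/2, 1/4, ... 1/64,
# i.e. 50%, 25%, 12.5%, 6.25%, 3.125% and 1.5625%.
FAST_SHIFTS = range(1, 7)


def nearest_fast_shift(percent):
    """Pick the shift whose blend percentage is closest to the requested one."""
    return min(FAST_SHIFTS, key=lambda s: abs(percent - 100.0 / (1 << s)))


def fast_blend(ref, new, shift):
    """Blend 'new' into 'ref' with weight 1/2**shift using shifts and adds only.

    Each output byte is ref - ref/2**shift + new/2**shift, which always
    stays within 0..255 for 8bit inputs.
    """
    return bytes(r - (r >> shift) + (n >> shift) for r, n in zip(ref, new))


reference = bytes([100, 0, 255])
incoming = bytes([200, 255, 0])
shift = nearest_fast_shift(12)   # a 12% blend is closest to 12.5%
blended = fast_blend(reference, incoming, shift)
```

Because the weight is a power of two, the per-pixel work is two shifts, one subtraction and one addition, which also maps well onto SIMD instructions.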

Delta image generation: The algorithm is pretty much identical to the old one, with some small differences: the lookup and abs tables were removed, some loops were unrolled, and it now uses BT.709 coefficients instead of BT.601 coefficients. With BT.709 it is possible to approximate the Y channel using only additions and a shift, and get almost the same value as the unoptimized calculation. The formula is Y = (R + R + B + G + G + G + G + G) / 8
For grayscale I added an SSE2 version, and for 32bit RGB I added SSSE3 versions. These SSE algorithms will be used as long as ZM_CPU_EXTENSIONS is enabled and the SSE version required by the algorithm is available on the system.
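The Y formula above can be tried out with a short Python sketch. This is an illustration only: it assumes the delta is the weighted sum of the per-channel absolute differences of two RGB24 buffers, which may differ in detail from the patch's C++ code, and the function names are hypothetical.

```python
def luma_approx(r, g, b):
    """Y = (2R + 5G + B) / 8, computed with additions and one shift."""
    return (r + r + b + g + g + g + g + g) >> 3


def luma_bt709(r, g, b):
    """Reference BT.709 luma with floating point coefficients."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def delta_rgb24(img_a, img_b):
    """Return one delta byte per pixel for two packed RGB24 buffers.

    Assumption: the delta is the approximate luma of the per-channel
    absolute differences between the two images.
    """
    out = bytearray()
    for i in range(0, len(img_a), 3):
        r = abs(img_a[i] - img_b[i])
        g = abs(img_a[i + 1] - img_b[i + 1])
        b = abs(img_a[i + 2] - img_b[i + 2])
        out.append(luma_approx(r, g, b))
    return bytes(out)


one_pixel_a = bytes([10, 20, 30])
one_pixel_b = bytes([20, 10, 30])
delta = delta_rgb24(one_pixel_a, one_pixel_b)
```

The shift-based weights work out to 0.25, 0.625 and 0.125 for R, G and B, compared to 0.2126, 0.7152 and 0.0722 in BT.709, so the result is close but not bit-identical to the exact calculation.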

zma with the patch is expected to be around 20-40% faster.

The capture daemon (zmc):

For zmc there was not as much to be done as for zma. The biggest performance change I made for zmc was reducing the amount of memory copying, which accounted for around 10-30% of zmc's CPU time.

Capture: All source types now capture directly into the shared memory. If a conversion has to be performed, it reads from the source and writes its output straight into the shared memory.

zmc with the patch is expected to be around 10-30% faster.

3. Full list of changes
  • Added new internal formats: BGR24 and 4 new 32bit RGB formats: RGBA, BGRA, ARGB, ABGR.
    Support for these has been added to almost all existing functions, however these require libjpeg-turbo (read above).
  • Changed all image buffer memory allocations to be on a 16 byte boundary.
  • Rewrote the Blend and Delta functions entirely with SSE2\SSSE3 versions included.
  • Replaced the ZM_Y_IMAGE_DELTAS option with ZM_CPU_EXTENSIONS to enable support for extended processor features.
  • Removed the ZM_LOCAL_BGR_INVERT option, as it is no longer needed now that BGR24 is natively supported.
  • Added "Target colorspace" option available to all source types. This is like the previous option, but also exists for local cameras.
  • Direct memory capture (Less memory copying).
  • Added SSE2 memory copy for image buffer copying.
  • Added code to detect SSE presence and version.
  • Event streaming uses sendfile() if available
  • Changed code to make compiling with function inlining enabled possible
  • Added fast YUYV->Grayscale conversion, including an SSSE3 version of it
  • Major changes to zm_local_camera.cpp, including intelligent format selection based on palette, target colorspace, processor endianness, swscale presence and so on.
  • Allow ZM to compile cleanly without swscale.
  • Minor performance improvements to the AlarmedPixels motion detection.
  • Added support for the JPEG and MJPEG capture palettes.
  • Added automatic capture palette option.
  • Changed shared memory layout to be identical on 32bit and 64bit platforms to improve compatibility and to make the size 16 byte aligned.
  • Added grayscale support to the signal checking.
    Uses the lowest byte only (right most byte). #000023 works nicely for me using the bttv driver.
  • Fixed multiple monitors capturing from the same device and channel.
    Current code allows for multiple monitors sharing the same device, each on a different channel,
    or multiple monitors sharing the same device, all on the same channel.
    In both cases, capture method, width, height and palette must be identical on all monitors.
    However, target colorspace can be different because each monitor handles the format conversion separately.
  • Changed monitor probe's preferred capture settings.
  • All capture palettes that require a format conversion are now marked with an asterisk in front.
  • Modified default monitor options to simplify new monitor creation.
  • Improved crop detection error handling.
  • Allow the blend percent to be zero to disable blending.
  • Added swscale clean up code for local cameras.
  • Added image size requirements to ensure proper alignment.
  • Modified the database Monitors table to add a field for the target colorspace.
  • Changed ZM static colours to RGB order (in memory) instead of HTML(BGR) order.
  • Relocated ZM's own format conversion algorithms to zm_image.cpp.
  • Fixed bug: mmap unexpected memory size when changing capture options.
  • Fixed bug: Undefined constant ZM_V4L2 in monitor probe.
  • Fixed bug: Error in offset X in monitor probe.
  • Fixed bug: Wrong events path for absolute paths with deep storage enabled.
  • Fixed bug: Disabled linking a monitor to itself, which can cause an alarm lasting forever once triggered.
  • Fixed bug: zma crashes when the monitor is linked to a disabled monitor.
  • Fixed bug: zma crashes when the signal for an analog camera is unstable.
  • Fixed bug: Mocord not entering an alarm state after certain conditions such as signal returned.
  • Fixed bug: enableDisableAlarms javascript error.
  • Fixed bug: ZM crashing sometimes if JPEG quality is set to 100.
  • Other small changes and fixes.

4. 8bit, 24bit or 32bit?
Grayscale is extremely fast, efficient and uses minimal RAM. Grayscale should be even faster with this patch, especially if SSE2 is available, because it benefits greatly from SSE2. Earlier I wrote that the size of an SSE register (XMM register) is 16 bytes. In those 16 bytes you can fit 4 32bit pixels or 16 grayscale pixels. It's really clear who the winner is. Also keep in mind that the grayscale delta algorithm is a lot shorter (and thus faster) than the 32bit ones.
So if you don't need colour, or have limited CPU or RAM, consider using grayscale.
That leaves 24bit vs 32bit. The major difference between the two is that the SSE delta code (SSE2\SSSE3) is only used with 32bit, not with 24bit. If you are not using motion detection or your processor does not have SSSE3, you might want to stick to 24bit because it uses less memory.
My suggestion is to try both and see if you notice any performance increase with 32bit. If not, you can go back to 24bit because it uses less memory.
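To put the memory side of this trade-off into numbers, here is a tiny Python sketch that computes the size of one raw frame in each format. The resolutions are the ones used in the tests later in this thread; the function name `frame_bytes` is made up, and any per-frame overhead ZM adds on top is ignored.

```python
BYTES_PER_PIXEL = {"8bit grayscale": 1, "24bit RGB": 3, "32bit RGB": 4}


def frame_bytes(width, height, fmt):
    """Size in bytes of one raw, unpadded frame in the given format."""
    return width * height * BYTES_PER_PIXEL[fmt]


if __name__ == "__main__":
    for width, height in [(384, 288), (1280, 800)]:
        for fmt in BYTES_PER_PIXEL:
            size = frame_bytes(width, height, fmt)
            print(f"{width}x{height} {fmt}: {size} bytes")
```

A 24bit frame is always three quarters the size of the same frame in 32bit, and a grayscale frame is one quarter.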

I made a comparison table to explain the differences better:
[Image: comparison table]

5. Installation
The instructions below assume that if you have ZoneMinder installed, it's already stopped.

1) Download the patched source (ZM SVN version with my changes already applied on top of it) from here:
http://github.com/mastertheknife/ZoneMi ... /perfpatch
2) Navigate to the folder you just downloaded it to.
3) Extract the file to a folder by running:

Code:

tar xvzf TheFile.tar.gz
4) Navigate to the folder you just extracted the gzipped tarball to.
5) Rebuild the configure script and the makefiles to be fully compatible with your machine:

Code:

autoreconf
6) Run the ./configure script
Note 1: For GCC versions older than 4.4, the compiler might complain about unknown registers. In that case, also add -msse2 to the CPPFLAGS. e.g. CPPFLAGS="-D__STDC_CONSTANT_MACROS -msse2"
Note 2: If your processor does not have SSE2, the code might fail to compile (unknown registers error). In that case, add -DZM_STRIP_SSE to the CPPFLAGS. This will compile ZM without any SSE code inside.
Optional: I recommend compiling with crashtrace and debugging enabled, see the following example for how to enable these.
Example ./configure line:

Code:

./configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --with-libarch=lib --with-mysql=/usr --with-ffmpeg=/usr --with-webdir=/var/www/zoneminder/htdocs --with-cgidir=/var/www/zoneminder/cgi-bin --with-webuser=apache --with-webgroup=apache --enable-debug=yes --enable-crashtrace=yes
7) Run this to compile everything:

Code:

make
8) As superuser, run this to install ZoneMinder into your system:

Code:

make install
9) Edit the zm.conf file (typically in /etc/zm.conf or /etc/local/zm.conf) to set the correct database user and password.
10)
* If this is a new installation, then load the db/zm_create.sql file to create a new database.
* If you have ZoneMinder installed, but the version is older than 1.25.0, then run this to upgrade the database:

Code:

zmupdate.pl
11) If this is not a new installation:
Run this SQL statement. This is a harmless change because it doesn't affect stock ZM, but is required for using the patch.
You can use phpmyadmin or the mysql program to execute the SQL statement below.
Example using the mysql program:
A) mysql -uUSER -pPASSWORD
B) Now type "use zm" to change to the correct database.
C) Execute the SQL statements below by typing them at the mysql prompt:

Code:

ALTER TABLE `Monitors` ADD `Colours` TINYINT UNSIGNED NOT NULL DEFAULT '1' AFTER `Height`;
ALTER TABLE `Monitors` ADD `Deinterlacing` INT UNSIGNED NOT NULL DEFAULT '0' AFTER `Orientation`;
12) Navigate to ZoneMinder's web interface and for each monitor, re-set the capture palette and set the "Target colorspace" option as you desire.
Note: I recommend looking at the comparison table image above to see the differences between all these. For local cameras, make sure to match the target colorspace and the capture palette. This way no format conversion is involved which should result in best performance. Again, look at the table above to see the direct V4L formats.

13) Start ZoneMinder and look for any errors in the system log.

Enjoy! Report any bugs and anything else you discover so I can fix them :D
I recommend reading the performance tips below in order to take full advantage of the patch.

6. Performance Tips
Here are some performance tips on how to get the most performance out of ZM:

- Make sure ZM_CPU_EXTENSIONS is enabled to take advantage of SSE2\SSSE3 processor extensions.
- Make sure ZM_FAST_IMAGE_BLENDS is enabled. This limits the possible blend percents to 50%, 25%, 12.5%, 6.25%, 3.125% and 1.5625%. Any other blend percent will be rounded to the nearest one. This type of blending is extremely fast because it involves no multiplication or division, both of which are expensive.
- Disable COLOUR_JPEG_FILES unless you really need it. This option converts grayscale images to colour before storing them as jpegs. This impacts performance and uses more space on your hard drive, so you should keep it disabled unless you need it.
- For motion detection, the fewer and smaller the zones, the faster.
- Disable CREATE_ANALYSIS_IMAGES if you use Blob motion detection, but don't need the analysis images.
- Local cameras: Enable ZM_V4L_MULTI_BUFFER if you can.
- Although obvious, make sure EXTRA_DEBUG and RECORD_DIAG_IMAGES are disabled!
- For local cameras: You should try matching the capture palette to the target colorspace, to prevent the need for a format conversion.
- How many frames per second do you really need? Think about it, there's no reason to capture at 20 fps if you don't need it. 5 fps should be enough for most users.
- Use libjpeg-turbo if you aren't already!
- Experiment! Try different capture palettes \ Target colorspace options and see what works best for you.

7. Known Issues
None at the moment.

8. FAQ
Some common questions that might arise:

Q: I'm receiving "Bogus input color" or similar errors in my system log.
A: You are attempting to use a format that requires libjpeg-turbo, but standard libjpeg is being used instead, which doesn't have the colorspace extensions that libjpeg-turbo has.

Q: I'm receiving "libjpeg-turbo is required for reading a JPEG directly into a RGB32 buffer, reading into a RGB24 buffer instead." or similar error messages in the system log.
A: You are attempting to use a format that requires libjpeg-turbo, but ZoneMinder was not compiled against libjpeg-turbo. Make sure that libjpeg-turbo's header files are used, and not standard libjpeg's header files.

9. Future Plans
  • MPEG\Streaming performance tweaks (mostly remove some buffer copying) and some clean up in zm_mpeg.cpp
  • Built-in deinterlacing (Pretty much complete, already included)
(Things unlisted are either completed or cancelled.)


EDIT 26th October:
I removed the attachments. Instead, grab an already patched source from here:
http://github.com/mastertheknife/ZoneMi ... /perfpatch
The instructions were updated accordingly.

It is also possible to view the changes on github:
http://github.com/mastertheknife/ZoneMinder-kfir

mastertheknife.

Re: ZM 1.24.3 Performance patch - V1

Posted: Sat May 21, 2011 12:26 am
by petarggh
That sounds really impressive! Let me know when you've got something I can try out. I'd really like to see how this does.

Re: ZM 1.24.3 Performance patch - V1

Posted: Sat May 21, 2011 4:39 pm
by mastertheknife
petarggh wrote:That sounds really impressive! Let me know when you've got something I can try out. I'd really like to see how this does.
Hi there, happy to see some interest :D

The patch is pretty much operational; I've been using it for about a month now.
Some RTSP issues are holding me back, but other than that, it's pretty much all good.
Stay tuned :D

http://i.imgur.com/cEX2V.jpg

mastertheknife

Re: ZM 1.24.3 Performance patch - V1

Posted: Mon May 23, 2011 9:00 am
by mastertheknife
Hello everyone,

I ran some tests between stock ZM 1.24.3 and my almost-complete patch.
I ran each test for ~15 minutes with identical settings in both versions (nothing was changed between versions except the capture palette and target colorspace).
I also included profiling logs generated with gprof so you can see the differences.

V4L monitor (Colour)
384x288 @ 10 fps
Modect
Single zone (using AlarmedPixels only)

Stock:
Capture palette: BGR24

Code:

zmc: 2.0%
zma: 4.6%
zmc profiling: http://pastebin.com/3ZMfhmmA
zma profiling: http://pastebin.com/jU1Crv44

PATCH:
(zma should be even better with an SSSE3 capable CPU; my ZM box doesn't have SSSE3)
Capture palette: BGR32 -> 32bit RGB

Code:

zmc: 0.7%
zma: 2.7%
zmc profiling: http://pastebin.com/nwn30sMn
zma profiling: http://pastebin.com/4SaVFNrL


V4L monitor (Greyscale)
384x288 @ 10 fps
Modect
Single zone (using AlarmedPixels only)

Stock:
Capture palette: Grey

Code:

zmc: 1.7%
zma: 4.6%
zmc profiling: http://pastebin.com/7F65SyRE
zma profiling: http://pastebin.com/stQx9KZp

PATCH:
Capture palette: Grey -> Greyscale

Code:

zmc: 0.3%
zma: 1.0%
zmc profiling: http://pastebin.com/CTEJaVTP
zma profiling: http://pastebin.com/KGP2PDnj


File monitor (Colour)
1280x800 @ 5 fps
Modect
Single zone (using AlarmedPixels only)

Stock:
24bit RGB

Code:

zmc: 32%
zma: 23%
zmc profiling: http://pastebin.com/sUfvz74s
zma profiling: http://pastebin.com/HJBxt9RY

PATCH:
(zma should be even better with an SSSE3 capable CPU; my ZM box doesn't have SSSE3)
32bit RGB

Code:

zmc: 29%
zma: 13%
zmc profiling: http://pastebin.com/piRUdD92
zma profiling: http://pastebin.com/pt7iYJc3
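
For anyone who wants the numbers above as relative savings, here is a small Python sketch that turns each stock/patched pair into a percentage reduction in CPU usage. The figures are copied from the results above; the names `RESULTS` and `reduction_percent` are just for illustration.

```python
# (stock CPU %, patched CPU %) per daemon, taken from the tests above.
RESULTS = {
    "V4L colour 384x288 @ 10 fps": {"zmc": (2.0, 0.7), "zma": (4.6, 2.7)},
    "V4L greyscale 384x288 @ 10 fps": {"zmc": (1.7, 0.3), "zma": (4.6, 1.0)},
    "File colour 1280x800 @ 5 fps": {"zmc": (32.0, 29.0), "zma": (23.0, 13.0)},
}


def reduction_percent(stock, patched):
    """Relative reduction in CPU usage, as a percentage of the stock value."""
    return (stock - patched) / stock * 100.0


if __name__ == "__main__":
    for test, daemons in RESULTS.items():
        for daemon, (stock, patched) in daemons.items():
            saved = reduction_percent(stock, patched)
            print(f"{test} {daemon}: {stock}% -> {patched}% ({saved:.0f}% less CPU)")
```

Keep in mind these are single ~15 minute runs on one machine, so treat the percentages as rough indications only.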


mastertheknife :D

Re: ZM 1.24.3 Performance patch - V1

Posted: Tue May 24, 2011 5:34 am
by Normando
Hi Mastertheknife

Really superb! I read the whole post, and it is brilliant. I hope you can fix the RTSP issues.

I don't know if this can help, but a few years ago I posted about a framework made by NVIDIA for image processing.

Here is the post: http://www.zoneminder.com/forums/viewto ... =8&t=14503

I'll stay tuned

Re: ZM 1.24.3 Performance patch - V1

Posted: Wed May 25, 2011 4:51 pm
by whatboy
Can I haz cake???

Very impressive the numbers... hope it can boost the sheevaplug...

Re: ZM 1.24.3 Performance patch - V1

Posted: Wed May 25, 2011 5:41 pm
by mastertheknife
whatboy wrote:Can I haz cake???

Very impressive the numbers... hope it can boost the sheevaplug...
Probably not by much, because the sheevaplug and the guru do not have SSE at all.

However, Intel Atom processors can definitely benefit from this greatly because they have SSSE3.
So anyone looking for a low power system for ZM, the Atom processor might be a good candidate. Check out the Intel D510MO and D525MW boards.

Also, keep in mind that the patch was aimed to improve motion detection performance. So if your monitors are in "Monitor" or "Record" mode, you probably won't notice much difference.

mastertheknife.

Re: ZM 1.24.3 Performance patch - V1

Posted: Thu May 26, 2011 1:50 am
by graphicw
This is some superb work indeed. I have been hoping for an efficient way of doing bgr24 as the bt878a cards do not support rgb24. The need to not convert anymore will be wonderful once your patch is complete. I also like the work on the motion detection algorithm. Should be useful on my dual Xeon Gallatin 3.2 server once complete.

Re: ZM 1.24.3 Performance Patch V1

Posted: Tue May 31, 2011 12:30 pm
by mastertheknife
Just a small update..

I made some other changes and improvements. So far, all that is left to resolve is compatibility with the V4L2 field option, which broke because the rescaling process was removed.
The problem is that I don't have much time lately because I have to study for some tests.
And now with the ZM 1.24.4 release, I will have to merge the changes from 1.24.4 into the patch and solve any conflicts.
So hopefully I will post it here next week.

mastertheknife

Re: ZM 1.24.3 Performance Patch V1

Posted: Sat Jun 04, 2011 6:19 pm
by bb99
Reads great! Good luck; hearing rumours of ZM V2.

Re: ZM 1.24.3 Performance Patch V1

Posted: Sat Jun 11, 2011 1:26 am
by graphicw
How is this patch coming along?

Re: ZM 1.24.3 Performance Patch V1

Posted: Sat Jun 11, 2011 2:56 pm
by mastertheknife
graphicw wrote:How is this patch coming along?
Not so well. In the last 3 weeks I worked on it very little because I didn't have much time (tests..) or much interest in fixing the last things I wanted to fix, such as the V4L2 field option that broke when I removed the rescaling process.
Today I updated my 2 ZM installs (home CCTV, test setup) to 1.24.4 and rebased the patch against 1.24.4, along with removing two fixes that are now included by default in 1.24.4.
I won't fix the V4L2 field option for now because I haven't decided yet how I should do it (it's tricky, because ZM doesn't let you specify capture size and monitor size separately) and because of lack of time.
There are a few things left to re-code and test, and then I think it should be ready to be uploaded here so other people can test it and look for bugs. After that, hopefully Phil will integrate the code modifications into ZM 1.25.0.

mastertheknife :D

Re: ZM 1.24.4 Performance Patch - Complete

Posted: Sat Jun 11, 2011 5:59 pm
by mastertheknife
Okay, the download link and the instructions are in the first post.

The diffstat if anyone is curious: http://pastebin.com/0BRZyCGx

Report your results, and if anyone needs help, I am also in #zoneminder on freenode.

Enjoy :D

Re: ZM 1.24.4 Performance Patch - Complete

Posted: Sun Jun 12, 2011 6:52 pm
by Flasheart
(Edited Master's post to include the patch file as an attachment - it's only 50k and avoids the adverts and wait on the hosted download site)

Re: ZM 1.24.4 Performance Patch - Complete

Posted: Mon Jun 13, 2011 7:48 am
by mastertheknife
Updated the instructions so they can also be used for a new ZM installation or for an installed ZM version older than 1.24.4.

mastertheknife