Some Comments on PolyMon Performance and Database Size

Several people have asked about, and frankly expressed concern over, PolyMon’s SQL performance and database size requirements over time.

In terms of database growth, PolyMon includes a data retention policy mechanism that lets users decide how much historical data they wish to retain for each monitor. The policy is set for each individual monitor and is fairly granular, allowing historical (aggregated) views of long time periods without requiring an extensive amount of storage space.

Every time a Monitor runs in PolyMon, the event status result and any associated counters are stored in the database. For example, every time a Ping monitor runs, an OK/Warn/Fail status is stored in the database along with counters for RTT and % loss. If you run such a monitor every 5 minutes, you will generate and store 288 events per day, or about 105,000 events per year. Each record is very small (in the case of a Ping, the total data size per event is about 150 bytes, including status and counter information). However, in addition to storing this event level data, PolyMon automatically creates roll-ups of this data into daily, weekly and monthly averages/totals. The retention policy determines, for each of these levels (raw, daily, weekly and monthly), how long data will be retained. Typically you might retain event level data for 1 month, daily data for 6 months, weekly data for 1 year and monthly data for 3 years. What this means is that storing historical data can become quite efficient, depending on how long you want to retain event level data.
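To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted above (a 5 minute interval, about 150 bytes per Ping event, and the example retention periods). The function names and the 30-day month are my own assumptions for illustration and are not part of PolyMon.

```python
# Rough sizing sketch for a single Ping monitor; not PolyMon code.
MINUTES_PER_DAY = 24 * 60


def events_per_day(interval_minutes: int) -> int:
    """Number of events generated per day for a given run interval."""
    return MINUTES_PER_DAY // interval_minutes


def raw_storage_bytes(interval_minutes: int, bytes_per_event: int,
                      retention_days: int) -> int:
    """Approximate size of event level data kept for retention_days."""
    return events_per_day(interval_minutes) * bytes_per_event * retention_days


per_day = events_per_day(5)                    # 288 events per day
per_year = per_day * 365                       # 105,120 events per year
one_month = raw_storage_bytes(5, 150, 30)      # 1,296,000 bytes (about 1.3 MB)
three_years = raw_storage_bytes(5, 150, 3 * 365)

# Rows retained under the example policy (assuming 30-day months):
# 1 month raw + 6 months daily + 1 year weekly + 3 years monthly.
rows_with_rollups = per_day * 30 + 6 * 30 + 52 + 36
rows_raw_only = per_day * 3 * 365

print(per_day, per_year, one_month, three_years)
print(rows_with_rollups, rows_raw_only)
```

With these assumptions, the example policy keeps fewer than 9,000 rows per monitor while still covering three years of history, compared with over 300,000 rows if all event level data were kept for the same period.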

I have been using PolyMon in a production environment for 18 months now with approximately 150 monitors (many running every minute, some just 3 or 4 times a day, depending on the needs). Although our retention policies have not discarded any event level data yet (we’ve got enough disk space!), our database is currently just over 4 GB. By setting retention periods to 1 month at the event level, we could probably reduce this size to less than 300 MB, since our aggregate data barely totals over 5 MB and the majority of the space is used up by event level data.

In terms of reporting performance, we have not seen any problems either. Our current Event table (which holds event level status data for our monitors) contains over 1.2 million rows and our Counters table (which holds event level counter data) contains a little over 1.5 million rows. Reports, database updates, aggregations, etc. have been performing fine. I constantly try to tweak indexes, stored procedures or pre-built statistical info to further enhance performance.

Where there is a potential performance bottleneck, however, is in the monitoring itself. Currently the monitoring is agent-less and is performed by a single Windows service (PolyMon Executive) that runs every monitor in sequence, based on its frequency interval. Basically, the service has a primary timer that fires every n minutes (this is user configurable in minute increments). Each monitor is individually configured to repeat every nth timer cycle, and on each cycle the service evaluates whether it needs to run a specific monitor or not. If it does, it runs the monitor and moves on to the next one; otherwise it skips over it.
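The timer logic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual PolyMon Executive code (which is a Windows service). The Monitor class, its field names and the use of a modulo test to decide whether a monitor is due are all assumptions made for the example.

```python
# Illustrative sketch of a sequential, timer-driven monitor loop.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Monitor:
    name: str
    every_nth_cycle: int          # run on every nth tick of the primary timer
    run: Callable[[], None]       # the actual check (ping, WMI query, ...)


def monitors_due(monitors: List[Monitor], cycle: int) -> List[Monitor]:
    """Return the monitors that should run on this timer cycle."""
    return [m for m in monitors if cycle % m.every_nth_cycle == 0]


def run_cycle(monitors: List[Monitor], cycle: int) -> List[str]:
    """Run every due monitor one after the other and return their names."""
    ran = []
    for monitor in monitors_due(monitors, cycle):
        monitor.run()             # a slow monitor delays everything after it
        ran.append(monitor.name)
    return ran


monitors = [
    Monitor("ping-gateway", 1, lambda: None),   # every timer cycle
    Monitor("disk-space", 5, lambda: None),     # every 5th timer cycle
]

print(run_cycle(monitors, 3))   # only the ping monitor is due
print(run_cycle(monitors, 5))   # both monitors are due
```

Because the due monitors run one after the other inside a single loop, the time taken by one cycle is the sum of the run times of all the monitors in it.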

So far, at 150 monitors, we have not experienced any problems with monitoring. However, this is definitely a bottleneck. The service is single-threaded and therefore can only run one monitor at a time.

I recognized this was a scalability issue when I started coding PolyMon but decided to live with that drawback for a while. I intend to address this problem in two ways.

Firstly, I intend to make the Windows service that runs the monitors multi-threaded (with a user configurable number of threads) to help alleviate “blocking” problems, where one monitor that takes a long time to run essentially blocks all other monitors from running. However, this does not help in cases where the Windows server running the service is no longer able to keep up with running all the monitors, even with a multi-threaded service. In other words, adding threading allows PolyMon to scale up, but not out.
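As a generic illustration of the idea (and not the planned PolyMon implementation), the following Python sketch runs a set of monitor checks on a worker pool with a configurable number of threads. The run_monitors function and the monitor names are hypothetical.

```python
# Generic worker-pool sketch: a slow check no longer blocks the others.
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_monitors(checks: Dict[str, Callable[[], str]],
                 max_workers: int) -> Dict[str, str]:
    """Run all checks on a pool of max_workers threads and collect statuses."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def slow_check() -> str:
    time.sleep(0.2)               # stands in for a monitor that takes a while
    return "Warn"


checks = {
    "slow-wmi-query": slow_check,
    "ping-gateway": lambda: "OK",
    "sql-job": lambda: "OK",
}

results = run_monitors(checks, max_workers=3)
print(results)
```

With three worker threads the two quick checks finish while the slow one is still sleeping. With max_workers=1 the behaviour falls back to the current sequential model.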

Secondly, to address the scale out issue, I intend to allow multiple Windows services (PolyMon Executives) to run on multiple servers. Each monitor definition will then contain not only information on what resource to monitor, but also which service instance the monitor should be run from. The intent is to allow users either to hardwire a monitor to a specific service or to let PolyMon itself determine, dynamically, which service instance should be used to run the monitor.
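To illustrate the two options (hardwired versus dynamic), here is a small hypothetical Python sketch. How PolyMon would actually choose a service instance dynamically has not been decided here; picking the instance with the fewest assigned monitors is purely an assumption for the example, as are the function and instance names.

```python
# Hypothetical sketch: choose which service instance runs a monitor.
from typing import Dict, Optional


def assign_instance(pinned: Optional[str], load: Dict[str, int]) -> str:
    """Return the pinned instance if set, else the least loaded instance.

    load maps instance name -> number of monitors currently assigned.
    Ties are broken alphabetically so the result is deterministic.
    """
    if pinned is not None:
        return pinned
    return min(sorted(load), key=lambda name: load[name])


load = {"exec-a": 80, "exec-b": 70}

print(assign_instance("exec-a", load))   # hardwired by the user
print(assign_instance(None, load))       # chosen dynamically
```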

Now, this will take a little while to implement, but I have already started researching it and laying the groundwork to be able to achieve this easily in a future release.

For now, the bottom line is that database size and performance have not been an issue, nor do I foresee them becoming an issue in the future. In a production environment monitoring over 150 resources (ping, WMI, PerfMon, SQL jobs, etc.) we have not experienced any difficulties. However, I am aware of the current bottlenecks and will address those, as outlined above, in the first quarter of 2009.

As always, I very much welcome any feedback, positive or negative, you may have regarding PolyMon. I hope you find it useful and find that it can perform, in certain circumstances, as well as some of the commercial systems out there that charge an arm and a leg.
