Monday, August 30, 2010

DTrace to the rescue!

We recently migrated data to some older servers and ran into a performance problem during peak hours.

Analysis

On our busiest days, we've seen the load average go up to over 300. Luckily this is a SPARC system, and it was still somewhat responsive. Not bad for an old two-socket system...

Looking at the CPU utilization with vmstat showed a high percentage of time spent in SYS.
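
In case you want to reproduce this, the quick way is (standard Solaris tools, nothing fancy):

# vmstat 5
(watch the "sy" column under "cpu": the percentage of CPU time spent in the kernel)

# mpstat 5
(the same thing per CPU, in the "sys" column)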


You can see this in the two performance graphs (the graphs were made on a Sunday, with a lower load).

After looking at the processes with prstat -Lm (microstate accounting) and seeing the high SYS percentage, we decided to look at what the kernel was doing.
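
For those who haven't used it, the invocation is simply:

# prstat -Lm 5
(microstate accounting per LWP; the USR and SYS columns show where each thread spends its time)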

Solaris has a nifty tool called lockstat. With -I it samples a profiling interrupt instead of tracing lock events, and -i 133 sets the sampling rate to 133 Hz per CPU (hence the 266 events/sec below). This is what it revealed:

# lockstat -i 133 -I sleep 5

Profiling interrupt: 1394 events in 5.241 seconds (266 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Caller                 
-------------------------------------------------------------------------------
  214  15%  15% 0.00     1982 cpu0                   (usermode)             
  174  12%  28% 0.00     3809 cpu[0]                 (usermode)             
  105   8%  35% 0.00     6559 cpu[0]                 dnlc_purge_vp+0xa4     
   53   4%  39% 0.00     3583 cpu0                   dnlc_purge_vp+0xa4     
   35   3%  42% 0.00     2684 cpu0                   copyout_more+0x190     
   31   2%  44% 0.00     7042 cpu[0]                 dnlc_purge_vp+0xa8     
   24   2%  46% 0.00     6714 cpu[0]                 dnlc_purge_vp+0x148    
   19   1%  47% 0.00     3929 cpu0                   dnlc_purge_vp+0xa8     
   15   1%  48% 0.00     5041 cpu[0]                 mutex_enter+0x8        
   13   1%  49% 0.00     3799 cpu0                   mutex_enter+0x8        
   12   1%  50% 0.00     3449 cpu0                   dnlc_purge_vp+0x148    
   10   1%  51% 0.00     2712 cpu[0]                 i_ddi_splx+0x1c        
    9   1%  51% 0.00     2320 cpu0                   hwblkclr+0x94          
    8   1%  52% 0.00     7265 cpu[0]                 dnlc_purge_vp+0xa0     
    8   1%  52% 0.00     3183 cpu0                   hwblkclr+0x9c          
    8   1%  53% 0.00     3397 cpu0                   hwblkclr+0x98          
    7   1%  53% 0.00     2592 cpu0                   copyout_more+0x194     
    5   0%  54% 0.00     2005 cpu0                   segmap_getmapflt+0x124 
    4   0%  54% 0.00     3784 cpu[0]+11              disp_getwork+0x88      
    4   0%  54% 0.00     1441 cpu[0]+11              spl6+0x14              
    4   0%  55% 0.00     3253 cpu[0]                 spl6                   
    4   0%  55% 0.00     5103 cpu[0]                 _resume_from_idle+0x1f8
    4   0%  55% 0.00     2466 cpu0+0x6               pci_pbm_dma_sync+0x8c  
    4   0%  56% 0.00     2735 cpu0                   hwblkclr+0xb0          
    4   0%  56% 0.00     2555 cpu0                   dnlc_purge_vp+0xa0     
    4   0%  56% 0.00     2998 cpu0                   page_lookup_create+0x90
    3   0%  56% 0.00     2449 cpu[0]+11              disp_getwork+0x15c     
    3   0%  57% 0.00     3502 cpu[0]+11              disp_getwork+0xa4      
    3   0%  57% 0.00     3233 cpu[0]+11              i_ddi_splx             
    3   0%  57% 0.00     6187 cpu[0]                 r4find+0x14c           
    3   0%  57% 0.00     5585 cpu[0]                 copyout+0x25c   


There's one strange function stealing almost 20% of the profiling interrupts: dnlc_purge_vp

The dnlc_purge_vp function belongs to the directory name lookup cache (DNLC): it walks the cache and purges the entries that reference a given vnode. You can read more about the DNLC in the Solaris documentation.
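
If you want to see what is actually triggering the purges, a quick DTrace one-liner (just a sketch, we didn't need it here) aggregates the kernel stacks leading into the function:

# dtrace -n 'fbt::dnlc_purge_vp:entry { @[stack()] = count(); }'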

The next step was to find out how much time is spent in this function. A small DTrace script helped us:

purge.d:
/* fires whenever the kernel enters dnlc_purge_vp() */
fbt::dnlc_purge_vp:entry
{
        self->t = timestamp;
}

/* on return, record the elapsed time (in microseconds) in a distribution */
fbt::dnlc_purge_vp:return
/self->t/
{
        @["latency"] = quantize((timestamp - self->t) / 1000);
        self->t = 0;
}

Running it gives us a nice latency distribution table (the values are in microseconds):

#  dtrace -s purge.d                   
dtrace: script 'purge.d' matched 2 probes
^C

  latency                                          
           value  ------------- Distribution ------------- count   
             128 |                                         0       
             256 |                                         1       
             512 |                                         0       
            1024 |                                         2       
            2048 |@                                        6       
            4096 |@                                        10      
            8192 |@@@@@@                                   65      
           16384 |@@@@@@@@@@@@@@@@@@@@@@                   247     
           32768 |@@@@@@@@@@                               112     
           65536 |                                         3       
          131072 |                                         5       
          262144 |                                         3       
          524288 |                                         0       

Looking at the result, we can see that most calls take about 8-32 ms each. Really slow!

Solution

Knowing that the CPU is too slow to purge such a large cache, we decided to shrink the DNLC from its default size to 25% of that value. The responsible kernel parameter, ncsize, is unfortunately not dynamic, so we had to set it in /etc/system and reboot the OS.
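
As a sketch of the change (the value below is only an example, not necessarily what fits your system; check the current size first and take roughly a quarter of it):

# echo "ncsize/D" | mdb -k
(prints the current DNLC size)

/etc/system:
set ncsize=32768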

This was done on an identical system with approximately the same load, and this is what lockstat reported after the reboot:

# lockstat -i 133 -I sleep 5  

Profiling interrupt: 1344 events in 5.054 seconds (266 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Caller                 
-------------------------------------------------------------------------------
   55   4%   4% 0.00     2301 cpu0                   (usermode)             
   47   3%   8% 0.00     2510 cpu[0]                 (usermode)             
   42   3%  11% 0.00     1717 cpu[0]                 splx+0x1c              
   41   3%  14% 0.00     1667 cpu0                   splx+0x1c              
   29   2%  16% 0.00     1718 cpu0                   idle+0x12c             
   26   2%  18% 0.00     1758 cpu[0]+11              disp_getwork+0x98      
   22   2%  19% 0.00     1511 cpu[0]                 disp_getwork+0x174     
   20   1%  21% 0.00     1249 cpu[0]+11              disp_getwork+0x78      
   20   1%  22% 0.00     1406 cpu0+0xb               disp_getwork+0x98      
   20   1%  24% 0.00     1627 cpu0+0xb               disp_getwork+0x88      
   20   1%  25% 0.00     1498 cpu0+0xb               i_ddi_splhigh+0x14     
   18   1%  27% 0.00     1324 cpu[0]+11              splx                   
   18   1%  28% 0.00     1663 cpu0                   disp_getwork+0x174     
   17   1%  29% 0.00     1592 cpu0+0xb               disp_getwork+0x78      
   16   1%  31% 0.00     1284 cpu[0]+11              i_ddi_splhigh+0x14     
   16   1%  32% 0.00     1500 cpu[0]                 disp_getwork+0x1b4     
   16   1%  33% 0.00     1602 cpu[0]                 idle+0x12c             
   16   1%  34% 0.00     1438 cpu[0]                 i_ddi_splhigh+0xc      
   16   1%  35% 0.00     1696 cpu0+0xb               splx                   
   14   1%  36% 0.00     1571 cpu0                   idle+0xf4              
   14   1%  37% 0.00     1411 cpu0                   i_ddi_splhigh          
   13   1%  38% 0.00     1664 cpu[0]                 i_ddi_splhigh          
   12   1%  39% 0.00     1697 cpu[0]                 idle+0xec          
...

There's no dnlc_purge_vp function in the first 40% of the interrupts!

Running the DTrace script also confirms that purge times have dropped roughly in proportion to the cache size, to about 25% of what they were before:

# dtrace -s purge.d
dtrace: script 'purge.d' matched 2 probes
^C

  latency                                          
           value  ------------- Distribution ------------- count   
             256 |                                         0       
             512 |@                                        1       
            1024 |@                                        2       
            2048 |@@@@@@                                   10      
            4096 |@@@@@@@@@@@@@@@@@@@@                     32      
            8192 |@@@@@@@@@@@@                             19      
           16384 |                                         0  

Looking at the system over a day, the load and CPU utilization were much lower:

Conclusion

Solaris' DTrace is not just a marketing gimmick; it provides real value in production environments. We were able to find the bottleneck in no time (about 30 minutes of investigation, plus, in this case, a reboot during off-peak hours).

Most importantly, we didn't have to put production at risk, as DTrace is safe to use on production systems.

Thursday, August 26, 2010

Package Deployment with OpsCenter

There has been some discussion on Constantin's blog about working around IPS' scripting limitations.

I don't want to talk about the reasons why IPS does not have classic pre/post-install scripts, but I do want to demonstrate how this functionality can be achieved with Oracle's Enterprise Manager OpsCenter.

If you don't know OpsCenter, it's _the_ management tool from Oracle for Solaris, Unbreakable Linux and OracleVM.


OS Updates

In OpsCenter terminology, package deployments are called "OS Updates". OS Updates consist of Update Components (packages, patches and configurations), Update Policies and Update Profiles. When an OS is to be updated, so-called "Update Profile" jobs are sent to the target hosts and run there.

Update Profiles

There are two types of Update Profiles: system-defined and user-defined. What we're looking for are user-defined profiles. A user-defined profile contains components; these can be packages, pre/post-install scripts and configuration files.

Profiles can also contain macros that are run on the target system and return information about it. This information can then be used to dynamically create a configuration file.

It is also possible to run probes to check whether prerequisites are met.

Update Profiles can also contain other Update Profiles (e.g. to build a mail-server or web-server profile from smaller ones).

In the picture we can see two Splunk Profiles. Below you can see some components, like the SMF Manifest or a postinstall script.

Update Components

Update components are organised into folders:

Configuration Files carry a version number.

Uploading configuration files (or templates) and scripts is simple:

Running an OS Update Job

It is easy to run update jobs on one or several hosts. There are various options controlling how the jobs are run...

While the job is running, and after it has finished, the job details can be monitored. Of course it makes sense to have good error handling in the pre/post-install scripts.
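
For illustration, a minimal post-install script with the kind of error handling we mean (the manifest path and service FMRI are made up, loosely following the Splunk example above):

#!/bin/sh
# register the SMF manifest delivered by the package and start the service
svccfg import /var/svc/manifest/site/splunk.xml || { echo "manifest import failed" >&2; exit 1; }
svcadm enable -s site/splunk || { echo "service did not come online" >&2; exit 1; }
exit 0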

Summary

The examples shown use SysV packages, as OpsCenter, for obvious reasons, lacks IPS support at the moment. But they show how scripts can be run before and after package installation. As an alternative to OpsCenter, the same can be achieved with e.g. Puppet, Bcfg2 or CFEngine.

The nice thing about using a management tool is that you can compare different systems, find the differences and keep everything standardized.