DTrace to the rescue!

We recently migrated data to some older servers and ran into a performance problem during peak hours.

Analysis

On our busiest days, we've seen the load average climb to over 300. Luckily this is a SPARC system, and it was still somewhat responsive. Not bad for an old 2-socket system...

Looking at the CPU utilization with vmstat showed a high percentage of time spent in SYS (kernel time).
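The check itself is trivial; something like the following (the 5-second interval is just an example) prints the us/sy/id CPU columns at the end of each line, and it was the sy value that stood out:

# vmstat 5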


You can see this in the two performance graphs (the graphs were made on a Sunday, with a lower load).

[Performance graphs: system load and CPU utilization]
After looking at the processes with prstat -Lm (microstate accounting) and seeing the high SYS percentage there as well, we decided to look at what the kernel was doing.
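For reference, the per-thread view is just prstat with microstate accounting and per-LWP reporting enabled; an interval like the following (again just an example) keeps it updating, and the SYS column showed where the time was going:

# prstat -Lm 5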

Solaris has a nifty tool called lockstat. Run with -I it acts as a simple kernel profiler, sampling at the rate given by -i (here 133 Hz). This is what it revealed:

# lockstat -i 133 -I sleep 5

Profiling interrupt: 1394 events in 5.241 seconds (266 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Caller                 
-------------------------------------------------------------------------------
  214  15%  15% 0.00     1982 cpu0                   (usermode)             
  174  12%  28% 0.00     3809 cpu[0]                 (usermode)             
  105   8%  35% 0.00     6559 cpu[0]                 dnlc_purge_vp+0xa4     
   53   4%  39% 0.00     3583 cpu0                   dnlc_purge_vp+0xa4     
   35   3%  42% 0.00     2684 cpu0                   copyout_more+0x190     
   31   2%  44% 0.00     7042 cpu[0]                 dnlc_purge_vp+0xa8     
   24   2%  46% 0.00     6714 cpu[0]                 dnlc_purge_vp+0x148    
   19   1%  47% 0.00     3929 cpu0                   dnlc_purge_vp+0xa8     
   15   1%  48% 0.00     5041 cpu[0]                 mutex_enter+0x8        
   13   1%  49% 0.00     3799 cpu0                   mutex_enter+0x8        
   12   1%  50% 0.00     3449 cpu0                   dnlc_purge_vp+0x148    
   10   1%  51% 0.00     2712 cpu[0]                 i_ddi_splx+0x1c        
    9   1%  51% 0.00     2320 cpu0                   hwblkclr+0x94          
    8   1%  52% 0.00     7265 cpu[0]                 dnlc_purge_vp+0xa0     
    8   1%  52% 0.00     3183 cpu0                   hwblkclr+0x9c          
    8   1%  53% 0.00     3397 cpu0                   hwblkclr+0x98          
    7   1%  53% 0.00     2592 cpu0                   copyout_more+0x194     
    5   0%  54% 0.00     2005 cpu0                   segmap_getmapflt+0x124 
    4   0%  54% 0.00     3784 cpu[0]+11              disp_getwork+0x88      
    4   0%  54% 0.00     1441 cpu[0]+11              spl6+0x14              
    4   0%  55% 0.00     3253 cpu[0]                 spl6                   
    4   0%  55% 0.00     5103 cpu[0]                 _resume_from_idle+0x1f8
    4   0%  55% 0.00     2466 cpu0+0x6               pci_pbm_dma_sync+0x8c  
    4   0%  56% 0.00     2735 cpu0                   hwblkclr+0xb0          
    4   0%  56% 0.00     2555 cpu0                   dnlc_purge_vp+0xa0     
    4   0%  56% 0.00     2998 cpu0                   page_lookup_create+0x90
    3   0%  56% 0.00     2449 cpu[0]+11              disp_getwork+0x15c     
    3   0%  57% 0.00     3502 cpu[0]+11              disp_getwork+0xa4      
    3   0%  57% 0.00     3233 cpu[0]+11              i_ddi_splx             
    3   0%  57% 0.00     6187 cpu[0]                 r4find+0x14c           
    3   0%  57% 0.00     5585 cpu[0]                 copyout+0x25c   


There's a strange function stealing almost 20% of the profiling samples: dnlc_purge_vp.

The dnlc_purge_vp function belongs to the DNLC, Solaris' directory name lookup cache, which caches file-name-to-vnode translations. dnlc_purge_vp removes all cache entries that reference a given vnode, so its cost grows with the size of the cache.
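If you want to see which code paths are triggering those purges, a quick fbt one-liner along these lines aggregates the kernel stacks that end up in dnlc_purge_vp (a sketch; terminate it with Ctrl-C to print the counts):

# dtrace -n 'fbt::dnlc_purge_vp:entry { @[stack()] = count(); }'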

The next step was to find out how much time is spent in this function. A small DTrace script helped us:

purge.d:
fbt::dnlc_purge_vp:entry
{
        self->t = timestamp;
}

fbt::dnlc_purge_vp:return
/self->t/
{
        @["latency"] = quantize((timestamp-self->t)/1000);
        self->t = 0;
}

Running purge.d gives us a nice latency distribution table (the values are in microseconds):

#  dtrace -s purge.d                   
dtrace: script 'purge.d' matched 2 probes
^C

  latency                                          
           value  ------------- Distribution ------------- count   
             128 |                                         0       
             256 |                                         1       
             512 |                                         0       
            1024 |                                         2       
            2048 |@                                        6       
            4096 |@                                        10      
            8192 |@@@@@@                                   65      
           16384 |@@@@@@@@@@@@@@@@@@@@@@                   247     
           32768 |@@@@@@@@@@                               112     
           65536 |                                         3       
          131072 |                                         5       
          262144 |                                         3       
          524288 |                                         0       

Looking at the result, we can see that the vast majority of calls take between 8 and 65 ms, with more than half landing in the 16-32 ms bucket alone. Really slow!
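A small variation of purge.d can also show which processes the purge time is being charged to: the sketch below keeps the same probes but sums the per-call latency by execname (the file name is just our own example):

purge_sum.d:
fbt::dnlc_purge_vp:entry
{
        self->t = timestamp;
}

fbt::dnlc_purge_vp:return
/self->t/
{
        /* total nanoseconds spent in dnlc_purge_vp, per process name */
        @[execname] = sum(timestamp - self->t);
        self->t = 0;
}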

Solution

Knowing that the CPU is too slow to purge such a large cache, we decided to shrink the DNLC to 25% of its default size. The ncsize kernel parameter is unfortunately not dynamic, so we had to set it in /etc/system and reboot the OS.
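For the mechanics, something along these lines works (we leave the concrete number out, since it's simply a quarter of whatever default your system computed): read the current value with mdb -k, add the override to /etc/system, and reboot:

# echo "ncsize/D" | mdb -k

/etc/system:
set ncsize=<25% of the value reported above>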

This was done on an identical system with approximately the same load, and this is what lockstat reported after the reboot:

# lockstat -i 133 -I sleep 5  

Profiling interrupt: 1344 events in 5.054 seconds (266 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Caller                 
-------------------------------------------------------------------------------
   55   4%   4% 0.00     2301 cpu0                   (usermode)             
   47   3%   8% 0.00     2510 cpu[0]                 (usermode)             
   42   3%  11% 0.00     1717 cpu[0]                 splx+0x1c              
   41   3%  14% 0.00     1667 cpu0                   splx+0x1c              
   29   2%  16% 0.00     1718 cpu0                   idle+0x12c             
   26   2%  18% 0.00     1758 cpu[0]+11              disp_getwork+0x98      
   22   2%  19% 0.00     1511 cpu[0]                 disp_getwork+0x174     
   20   1%  21% 0.00     1249 cpu[0]+11              disp_getwork+0x78      
   20   1%  22% 0.00     1406 cpu0+0xb               disp_getwork+0x98      
   20   1%  24% 0.00     1627 cpu0+0xb               disp_getwork+0x88      
   20   1%  25% 0.00     1498 cpu0+0xb               i_ddi_splhigh+0x14     
   18   1%  27% 0.00     1324 cpu[0]+11              splx                   
   18   1%  28% 0.00     1663 cpu0                   disp_getwork+0x174     
   17   1%  29% 0.00     1592 cpu0+0xb               disp_getwork+0x78      
   16   1%  31% 0.00     1284 cpu[0]+11              i_ddi_splhigh+0x14     
   16   1%  32% 0.00     1500 cpu[0]                 disp_getwork+0x1b4     
   16   1%  33% 0.00     1602 cpu[0]                 idle+0x12c             
   16   1%  34% 0.00     1438 cpu[0]                 i_ddi_splhigh+0xc      
   16   1%  35% 0.00     1696 cpu0+0xb               splx                   
   14   1%  36% 0.00     1571 cpu0                   idle+0xf4              
   14   1%  37% 0.00     1411 cpu0                   i_ddi_splhigh          
   13   1%  38% 0.00     1664 cpu[0]                 i_ddi_splhigh          
   12   1%  39% 0.00     1697 cpu[0]                 idle+0xec          
...

There's no dnlc_purge_vp function in the first 40% of the interrupts!

Running the DTrace script also confirms that the purge times have dropped roughly in proportion to the cache size, to about a quarter of what they were:

# dtrace -s purge.d
dtrace: script 'purge.d' matched 2 probes
^C

  latency                                          
           value  ------------- Distribution ------------- count   
             256 |                                         0       
             512 |@                                        1       
            1024 |@                                        2       
            2048 |@@@@@@                                   10      
            4096 |@@@@@@@@@@@@@@@@@@@@                     32      
            8192 |@@@@@@@@@@@@                             19      
           16384 |                                         0  
     

Looking at the system over a day, both the load and the CPU utilization were much lower:

[Performance graphs: load and CPU utilization over a full day after the change]
Conclusion

Solaris' DTrace is not just a marketing gimmick; it delivers real value in production environments. We were able to find the bottleneck in no time (about 30 minutes of investigation, plus, in this case, a reboot during off-peak hours).

Most importantly, we didn't have to put production at risk, since DTrace is safe to use on production systems.
