DTrace to the rescue!

We recently migrated data to some older servers and ran into a performance problem during peak hours.

Analysis

On our busiest days, we've seen the load average go up to over 300. Luckily this is a SPARC system, and it was still somewhat responsive. Not bad for an old two-socket system...

Looking at the CPU utilization with vmstat showed a high percentage of time spent in SYS (kernel time).
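
A simple way to watch this is to sample with vmstat every few seconds; on Solaris the last three columns (us, sy, id) show the percentage of user, system, and idle time:

# vmstat 5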


You can see this in the two performance graphs (the graphs were made on a Sunday, with a lower load).

[Two performance graphs: load average and CPU utilization]

After looking at the processes with prstat -Lm (microstate accounting) and seeing the high SYS percentage, we decided to look at what the kernel was doing.
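
For reference, the microstate view looks like this (the 5-second sampling interval is just an example); the SYS column shows how much time each LWP spends in the kernel:

# prstat -Lm 5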

Solaris has a nifty tool called lockstat. This is what it revealed:

# lockstat -i 133 -I sleep 5

Profiling interrupt: 1394 events in 5.241 seconds (266 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Caller                 
-------------------------------------------------------------------------------
  214  15%  15% 0.00     1982 cpu0                   (usermode)             
  174  12%  28% 0.00     3809 cpu[0]                 (usermode)             
  105   8%  35% 0.00     6559 cpu[0]                 dnlc_purge_vp+0xa4     
   53   4%  39% 0.00     3583 cpu0                   dnlc_purge_vp+0xa4     
   35   3%  42% 0.00     2684 cpu0                   copyout_more+0x190     
   31   2%  44% 0.00     7042 cpu[0]                 dnlc_purge_vp+0xa8     
   24   2%  46% 0.00     6714 cpu[0]                 dnlc_purge_vp+0x148    
   19   1%  47% 0.00     3929 cpu0                   dnlc_purge_vp+0xa8     
   15   1%  48% 0.00     5041 cpu[0]                 mutex_enter+0x8        
   13   1%  49% 0.00     3799 cpu0                   mutex_enter+0x8        
   12   1%  50% 0.00     3449 cpu0                   dnlc_purge_vp+0x148    
   10   1%  51% 0.00     2712 cpu[0]                 i_ddi_splx+0x1c        
    9   1%  51% 0.00     2320 cpu0                   hwblkclr+0x94          
    8   1%  52% 0.00     7265 cpu[0]                 dnlc_purge_vp+0xa0     
    8   1%  52% 0.00     3183 cpu0                   hwblkclr+0x9c          
    8   1%  53% 0.00     3397 cpu0                   hwblkclr+0x98          
    7   1%  53% 0.00     2592 cpu0                   copyout_more+0x194     
    5   0%  54% 0.00     2005 cpu0                   segmap_getmapflt+0x124 
    4   0%  54% 0.00     3784 cpu[0]+11              disp_getwork+0x88      
    4   0%  54% 0.00     1441 cpu[0]+11              spl6+0x14              
    4   0%  55% 0.00     3253 cpu[0]                 spl6                   
    4   0%  55% 0.00     5103 cpu[0]                 _resume_from_idle+0x1f8
    4   0%  55% 0.00     2466 cpu0+0x6               pci_pbm_dma_sync+0x8c  
    4   0%  56% 0.00     2735 cpu0                   hwblkclr+0xb0          
    4   0%  56% 0.00     2555 cpu0                   dnlc_purge_vp+0xa0     
    4   0%  56% 0.00     2998 cpu0                   page_lookup_create+0x90
    3   0%  56% 0.00     2449 cpu[0]+11              disp_getwork+0x15c     
    3   0%  57% 0.00     3502 cpu[0]+11              disp_getwork+0xa4      
    3   0%  57% 0.00     3233 cpu[0]+11              i_ddi_splx             
    3   0%  57% 0.00     6187 cpu[0]                 r4find+0x14c           
    3   0%  57% 0.00     5585 cpu[0]                 copyout+0x25c   


There's a strange function stealing almost 20% of the profiling samples (its entries add up to roughly 256 of the 1394 events): dnlc_purge_vp

The dnlc_purge_vp function belongs to the directory name lookup cache (DNLC). It purges cache entries from time to time. You can read more about it here.
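
Before measuring how long the purges take, a quick one-liner (using the same fbt probe as the script below) can show how often the function fires and which processes trigger it:

# dtrace -n 'fbt::dnlc_purge_vp:entry { @[execname] = count(); }'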

The next step was to find out how much time is spent in this function. A small DTrace script helped us:

purge.d:
/* record the entry timestamp for this thread */
fbt::dnlc_purge_vp:entry
{
        self->t = timestamp;
}

/* on return, aggregate the elapsed time in microseconds */
fbt::dnlc_purge_vp:return
/self->t/
{
        @["latency"] = quantize((timestamp - self->t) / 1000);
        self->t = 0;
}

Running it gives us a nice latency distribution table (the values are in microseconds):

#  dtrace -s purge.d                   
dtrace: script 'purge.d' matched 2 probes
^C

  latency                                          
           value  ------------- Distribution ------------- count   
             128 |                                         0       
             256 |                                         1       
             512 |                                         0       
            1024 |                                         2       
            2048 |@                                        6       
            4096 |@                                        10      
            8192 |@@@@@@                                   65      
           16384 |@@@@@@@@@@@@@@@@@@@@@@                   247     
           32768 |@@@@@@@@@@                               112     
           65536 |                                         3       
          131072 |                                         5       
          262144 |                                         3       
          524288 |                                         0       

Looking at the result, we can see that most calls take about 8-32 ms (the bulk of the samples fall into the 8192-32768 µs buckets). Really slow!

Solution

Knowing that the CPU is too slow to purge such a large cache, we decided to shrink the DNLC from its default size to 25% of the default. Unfortunately, the ncsize kernel parameter is not dynamic, so we had to set it in /etc/system and reboot the OS.
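
For reference, the current size can be read from the running kernel with mdb:

# echo ncsize/D | mdb -k

and the new value goes into /etc/system (the number below is purely illustrative; we simply used about a quarter of the default our system reported):

* shrink the directory name lookup cache to ~25% of the default
set ncsize=32768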

This was done on an identical system with approximately the same load, and this is what lockstat reported after the reboot:

# lockstat -i 133 -I sleep 5  

Profiling interrupt: 1344 events in 5.054 seconds (266 events/sec)

Count indv cuml rcnt     nsec CPU+PIL                Caller                 
-------------------------------------------------------------------------------
   55   4%   4% 0.00     2301 cpu0                   (usermode)             
   47   3%   8% 0.00     2510 cpu[0]                 (usermode)             
   42   3%  11% 0.00     1717 cpu[0]                 splx+0x1c              
   41   3%  14% 0.00     1667 cpu0                   splx+0x1c              
   29   2%  16% 0.00     1718 cpu0                   idle+0x12c             
   26   2%  18% 0.00     1758 cpu[0]+11              disp_getwork+0x98      
   22   2%  19% 0.00     1511 cpu[0]                 disp_getwork+0x174     
   20   1%  21% 0.00     1249 cpu[0]+11              disp_getwork+0x78      
   20   1%  22% 0.00     1406 cpu0+0xb               disp_getwork+0x98      
   20   1%  24% 0.00     1627 cpu0+0xb               disp_getwork+0x88      
   20   1%  25% 0.00     1498 cpu0+0xb               i_ddi_splhigh+0x14     
   18   1%  27% 0.00     1324 cpu[0]+11              splx                   
   18   1%  28% 0.00     1663 cpu0                   disp_getwork+0x174     
   17   1%  29% 0.00     1592 cpu0+0xb               disp_getwork+0x78      
   16   1%  31% 0.00     1284 cpu[0]+11              i_ddi_splhigh+0x14     
   16   1%  32% 0.00     1500 cpu[0]                 disp_getwork+0x1b4     
   16   1%  33% 0.00     1602 cpu[0]                 idle+0x12c             
   16   1%  34% 0.00     1438 cpu[0]                 i_ddi_splhigh+0xc      
   16   1%  35% 0.00     1696 cpu0+0xb               splx                   
   14   1%  36% 0.00     1571 cpu0                   idle+0xf4              
   14   1%  37% 0.00     1411 cpu0                   i_ddi_splhigh          
   13   1%  38% 0.00     1664 cpu[0]                 i_ddi_splhigh          
   12   1%  39% 0.00     1697 cpu[0]                 idle+0xec          
...

There's no dnlc_purge_vp function in the first 40% of the interrupts!

Running the DTrace script also confirms that the purge times have dropped roughly in proportion, to about 25% of their previous values:

# dtrace -s purge.d
dtrace: script 'purge.d' matched 2 probes
^C

  latency                                          
           value  ------------- Distribution ------------- count   
             256 |                                         0       
             512 |@                                        1       
            1024 |@                                        2       
            2048 |@@@@@@                                   10      
            4096 |@@@@@@@@@@@@@@@@@@@@                     32      
            8192 |@@@@@@@@@@@@                             19      
           16384 |                                         0  
     

Looking at the system over a day, the load and CPU utilization were much lower:

[Graphs: load average and CPU utilization after the change]

Conclusion

Solaris' DTrace is not just a marketing gimmick; it provides real value in production environments. We were able to find the bottleneck in no time (about 30 minutes of investigation, plus, in this case, a reboot during off-peak hours).

Most importantly, we didn't have to put production at risk, as DTrace is safe to use on production systems.

Comments

Chris said…
Cool post
