Tuesday, January 22, 2013

More Insights

In my previous post, I looked back at the past, and especially at what I did last year.

While I am not the guy who celebrates New Year's Eve too much, I do like the beginning of a new year: January has a much slower pace than December, which gives me time to think about what I want to achieve in the new year.

My personal goal is to become a better data scientist. I like this picture about data science, found in the Wikipedia article. I could improve in any of these fields.
Attribution: http://commons.wikimedia.org/wiki/User:Calvin.Andrus
Wikipedia: Data Science

In some of the fields I will naturally improve by doing projects with customers (e.g. domain expertise); others may require self-study (I should not have slept through math and statistics classes...).

While I'm practicing data science every day, I think it is good to take a step back sometimes and learn more about the theory behind it, especially when there is a lot going on in areas like machine learning and predictive analytics. Applying these algorithms requires understanding the theory behind them.

Machine learning and predictive analytics will be a hot topic for me this year. More and more customers are looking to get more value out of their unstructured data. So far, customers have mostly been in reactive mode, for example running manual queries on historic data. Some of the more advanced customers have started to write their own algorithms. Giving them the tools to automatically detect anomalies and predict the future is the logical next step. This is an area we are currently investigating, talking to domain experts to gain more insight and come up with a good solution.

Another area we are expanding into is Application Performance Monitoring. Systems are getting more and more complex, and it is really hard to find root causes. Last week, we signed a partnership agreement that I am very happy with.

Readers of my blog know that besides Splunk there is another application I'm a huge fan of. Yes, I'm talking about AppDynamics. I'm very happy that we now have a tool in our portfolio that perfectly complements Splunk. There is almost no overlap in functionality. AppDynamics provides the insights into an application, while Splunk has the view from the outside.

Together with the new AppDynamics Splunk App, we now have the perfect tag-team... Stay tuned!

Attribution: http://www.flickr.com/photos/usagj/6301467928/sizes/m/in/photostream/
Splunk-AppDynamics Tag-Team


Sunday, January 13, 2013

An Unexpected Journey...

2012 was a busy year that passed by too fast and without any blog entries. It was also the year in which I switched sides, from being a customer to being a consultant.

I ran the largest e-mail platform in Switzerland for over three years, together with an outstanding team dedicated to the task, but somewhere deep in my gut, I had this urge to set sail...

It was the Splunk .conf in 2011 that gave me the impulse and sent me on an unexpected journey. I gave a presentation there and got into a lot of discussions with other Splunk users. Something was going on: Splunk had moved from a niche product to something people were enthusiastic about. I felt like I did in 2007, when Splunk 3.0 came out, that something was brewing...

Meno, the second Splunk customer in Switzerland, had already taken the path of working as a consultant at a Splunk partner. I knew that they had an open position, but I had declined the job a couple of months earlier, as I was satisfied with my position at the time.

I mean, where else in Switzerland can you run the full HW/Storage/Software stack in a large environment? I learned so much in this environment, and I was afraid that I would not find something similarly interesting and challenging.

Then the year 2012 came. I started in January, and after only 3 weeks I was already fully booked by customers who wanted to start Splunk projects.

What fascinates me is the diversity of the projects. They span all industries, and can be highly technical one day and very business-oriented the next. I have to talk to all kinds of people: security, auditors, server and storage admins, network engineers, developers... You name it. One thing Splunk projects have in common: you never know what to expect at the customer, but in the end you have a happy customer.

Splunk .conf2012 was even more exciting than the previous year. Again, I gave a presentation, and this time the room was overbooked! I chatted with many Splunk employees, customers and other partners. Everyone was enthusiastic about the products and the things to come.
 


If we look at the earlier years, many customers did not know anything about unstructured / machine data. In 2012, this clearly changed. People learned about Splunk from various Splunk Live! events, roadshows, and word of mouth. Customers really wanted to have Splunk. The only problem they had was how to get started with such a powerful tool. Projects often started as proofs of concept, to give customers more ideas about what to do with the data. After implementing a few use cases, customers grasped the idea behind Splunk, and more and more ideas popped up.

2013 is already here, and I guess it will continue the same way for new customers, who are learning what value Splunk can bring. Customers who have used Splunk for a longer time will try to get even more value out of their machine data / unstructured data. I see growing interest in data visualization, machine learning and predictive analytics. Also, Splunk is moving up the stack more and more, from giving technology answers to giving business answers.

I expect this year to be even more exciting than last year, not only for me, but also for the company I work for. There will be quite a few announcements this year... stay tuned!

Wednesday, October 12, 2011

Splunking Oracle's ZFS Appliance Part II

In the first part, I wrote about storing long-term analytics data in Splunk.

Wouldn't it be nice to also have storage capacity tracked with Splunk?

This is how it's done:

1. Get pool properties


#!/bin/ksh


# list the capacity of all pools in a system to ${outputdir}/${ipname}.pools.log


# Example: listPools.ksh /tmp 10.16.5.14


typeset outputdir=$1
typeset ipname=$2
typeset debug=$3
typeset user=monitor


if [ -z "$1" -o -z "$2" ]; then
  printf "\nUsage: $0 <output Dir> <ZFSSA ipname> [ debug ]\n\n"
  exit 1
fi


mkdir -p ${outputdir}
dat=$(date +'%y-%m-%d %H:%M:%S')


ssh -T ${user}@${ipname} << --EOF-- > ${outputdir}/${ipname}.pools.log
script
run('status');
run('storage');
var poollist=list();


printf("Time,pool,avail,compression,used,space_percentage\\n");


for(var k=0; k<poollist.length; k++) {
  run('select ' + poollist[k]);


  var space_used=get('used')/1024/1024/1024;
  var space_avail=get('avail')/1024/1024/1024;
  var compression=get('compression');
  var space_percentage=space_used/(space_used + space_avail)*100;


  printf("$dat,%s,%0.0f,%0.2f,%0.0f,%0.0f\\n",poollist[k],space_avail,compression,space_used,space_percentage);


  run('done');
}
run('done');


--EOF--


exit $?
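Before wiring the log file into Splunk, it can help to sanity-check what the script writes. The following is a minimal Python sketch, not part of the original setup: it assumes the CSV header printed by the script above, and the pool names, the sample values and the 80% threshold are made up for illustration.

```python
import csv
import io


def pools_over_threshold(csv_text, threshold=80.0):
    """Return (pool, percent) pairs whose usage is at or above threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["pool"], float(row["space_percentage"]))
        for row in reader
        if float(row["space_percentage"]) >= threshold
    ]


# Hypothetical sample in the layout the script prints:
# Time,pool,avail,compression,used,space_percentage (sizes in GB)
sample = (
    "Time,pool,avail,compression,used,space_percentage\n"
    "11-10-12 10:00:00,pool-0,500,1.25,1500,75\n"
    "11-10-12 10:00:00,pool-1,100,1.10,900,90\n"
)

print(pools_over_threshold(sample))
```

The percentages in the sample follow the same formula as the script: used / (used + avail) * 100.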

2. Write Splunk props.conf

As an exercise...


3. Enjoy Splunk Dashboards:

4. Repeat for projects and shares.


PS: If I find the time, I will eventually package this into a Splunk App.

Friday, October 7, 2011

Splunking Oracle's ZFS Appliance

We have a bunch of Oracle ZFS Appliances. What I really like is their integrated DTrace-based analytics feature.

However, some things are missing or causing problems:

-Storing long-term analytics data on the appliances produces a lot of data on the internal disks. This can fill up your appliance and, in the worst case, slow down the appliance software.

-Scaling the timeline out too much makes peaks invisible. This is probably a limitation of the rendering software used on the appliance (JavaScript).

-Comparing all our appliances is not possible, as there is no central analytics console.

As we are heavy Splunk users, I sat down with our friendly storage consultant from Oracle and we brought these two great products closer together.

This is how we did it:

1. Setting up analytics worksheets

First we had to create the analytics worksheets. This is best done using the CLI, as the order of the drilldowns should always be the same. Otherwise the fields in the generated CSV file might be in a different order on every appliance. Doing this in the BUI is possible, but hard...

I would also recommend storing the worksheet under a separate appliance user.

Sample CLI commands:

analytics
worksheets
create Monitor
select worksheet-???


dataset
set name=io.ops[op]
set drilldowns=read,write
set seconds=3600
commit


dataset
set name=nfs4.ops
set seconds=3600
commit
...


2. Fetch Analytics Data

Script Excerpt:

ssh -T ${user}@${ipname} << --EOF-- > ${outputdir}/${wsname}.${ipname}.out
script
run('analytics');
run('worksheets');
var ws=list();
printf("Worksheets:%d\\n",ws.length);
printf("%s\\n",ws);
for(var i=0; i<ws.length; i++) {
  run('select ' + ws[i]);
  var wsname=get('name');
  printf("Worksheet Name:%s\\n",wsname);
  if ( wsname == "$wsname" ) {
    var ds=list();
    for(var j=0; j<ds.length; j++) {
      run('select ' + ds[j]);
      var dsname=get('name');
      printf("zfssa_%s\\n",dsname);
      dump(run('csv'));
      run('done');
    }
  }
  run('done');
}
run('done');


--EOF--
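The output file contains one "zfssa_<dataset name>" marker line per dataset, followed by that dataset's CSV dump. As a rough Python sketch, the file could be split per dataset before indexing. This is only an assumption about how one might post-process it, not what we actually did, and the rows in the sample are placeholders because the real CSV layout is not shown here.

```python
def split_datasets(output_text):
    """Map each 'zfssa_<dataset>' marker line to the lines that follow it."""
    chunks = {}
    current = None
    for line in output_text.splitlines():
        if line.startswith("zfssa_"):
            # A new dataset starts; collect its lines under the marker name.
            current = line.strip()
            chunks[current] = []
        elif line.startswith(("Worksheets:", "Worksheet Name:")):
            # Status lines printed by the script end the current dataset.
            current = None
        elif current is not None and line.strip():
            chunks[current].append(line)
    return chunks


# Placeholder output: only the marker lines mirror the script's printf calls.
sample = """Worksheets:1
worksheet-000
Worksheet Name:Monitor
zfssa_io.ops[op]
row-a
row-b
zfssa_nfs4.ops
row-c
"""

for name, rows in split_datasets(sample).items():
    print(name, len(rows))
```

The marker doubles as a natural candidate for the Splunk sourcetype name, which is why the script prefixes it with "zfssa_".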

3. Configure Splunk Inputs

4. Create Splunk Dashboards

5. Enjoy Analytics Data Under Splunk


Happy Spelunking...

Sunday, May 22, 2011

Splunk: Unscaling units

I'm working on a Splunk Application for Solaris.

One of the commands that is of interest to me is fsstat(1M). Here's its output for two filesystem types (zfs, nfs4):

solaris# fsstat zfs nfs4 1 1
 new  name   name  attr  attr lookup rddir  read read  write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
2.21K   881   521  585K 1.22K  1.71M 9.34K 1.66M 21.3G  765K 10.7G zfs
    0     0     0     0     0      0     0     0     0     0     0 nfs4
    0     0     0    20     0      0     0   279  997K   142  997K zfs
    0     0     0     0     0      0     0     0     0     0     0 nfs4

While Splunk is very flexible in parsing almost any output, for command output it is better to do a little pre-formatting:

-Make headers single line
-Drop the summary line (activity since fs loaded/mounted)
-Find a solution to be able to do stats on the autoscale values (K,M,G,T)

First, I wrote a script to adjust the output. The output looks like this now:

./fsstat.pl zfs nfs4
new_file name_remov name_chng attr_get attr_set lookup_ops rddir_ops read_ops read_bytes write_ops write_bytes fstype
    1     0     1     9     1     27     0   260 1.14M   145 1.18M zfs
    0     0     0     0     0      0     0     0     0     0     0 nfs4


This makes it much easier to parse the data.
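My script is not shown here, but the first two pre-formatting steps can be sketched in a few lines of Python. This is a minimal illustration, not the actual fsstat.pl: it assumes the two header lines split into the same number of columns, and that the first data line per filesystem type is the summary since mount.

```python
def merge_headers(line1, line2):
    """Join the two fsstat header lines column by column with '_'."""
    merged = [f"{top}_{bottom}" for top, bottom in zip(line1.split(), line2.split())]
    return merged + ["fstype"]


def preformat(raw_lines, fstypes):
    """Build a single header line and drop the per-fstype summary lines."""
    header = merge_headers(raw_lines[0], raw_lines[1])
    data = raw_lines[2:]
    interval = data[len(fstypes):]  # skip one summary line per fstype
    return [" ".join(header)] + [" ".join(line.split()) for line in interval]


raw = [
    " new  name   name  attr  attr lookup rddir  read read  write write",
    " file remov  chng   get   set    ops   ops   ops bytes   ops bytes",
    "2.21K   881   521  585K 1.22K  1.71M 9.34K 1.66M 21.3G  765K 10.7G zfs",
    "    0     0     0     0     0      0     0     0     0     0     0 nfs4",
    "    0     0     0    20     0      0     0   279  997K   142  997K zfs",
    "    0     0     0     0     0      0     0     0     0     0     0 nfs4",
]

for line in preformat(raw, ["zfs", "nfs4"]):
    print(line)
```

The autoscaled values (K, M, G, T) are deliberately left untouched in this step.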

A Splunk search with multikv will split this into several fields:

sourcetype="solaris_fsstat" |multikv

We will now have single line events with the fields new_file, name_remov, name_chng etc...

The trouble is that the fsstat command automatically scales values into a human-readable format. This cannot be disabled.

But we are lucky: Splunk is able to solve the problem. To unscale e.g. read_ops, we add a bit of Splunk magic to the search:

| rex field=read_ops "(?<read_ops_amount>[\d\.]+)(?<read_ops_unit>\w+)?"
| eval read_ops_unscaled=case(
    isnull(read_ops_unit) OR read_ops_unit=="",read_ops_amount,
    read_ops_unit=="K",read_ops_amount*1024,
    read_ops_unit=="M",read_ops_amount*1024*1024,
    read_ops_unit=="G",read_ops_amount*1024*1024*1024,
    read_ops_unit=="T",read_ops_amount*1024*1024*1024*1024)

Now we have created a new field called read_ops_unscaled.

Wasn't this cool?
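For readers who want to check the arithmetic outside of Splunk, here is the same unscaling logic as a small Python sketch. It is an illustration only, using the same 1024-based factors as the search above; the function name and the sample values are made up.

```python
import re

# Multiplier per unit suffix, matching the factors used in the search.
FACTORS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
PATTERN = re.compile(r"^([\d.]+)([KMGT])?$")


def unscale(value):
    """Convert an autoscaled fsstat value such as '1.5K' to a plain number."""
    match = PATTERN.match(value)
    if match is None:
        raise ValueError(f"unexpected fsstat value: {value!r}")
    amount, unit = match.groups()
    return float(amount) * FACTORS[unit or ""]


for raw in ("260", "2K", "1.5K", "1.14M"):
    print(raw, unscale(raw))
```

Keep in mind that the scaled values are rounded by fsstat, so the unscaled numbers are approximations.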

As this is quite hard to type, I have created a macro for every field that has to be unscaled.

After that, I created a master macro called `unscale_fsstat`, which calls all the other macros. Now it is trivial to run a search and do some stats on the results.

Happy Splunking!

Sunday, March 13, 2011

Adjusting ZFS resilvering speed

There are two kernel parameters that can be adjusted if ZFS resilvering speed is too slow/fast:

zfs_resilver_delay /* number of ticks to delay resilver */

and

zfs_resilver_min_time_ms /* min millisecs to resilver per txg */


In some cases the values can be too low or too high (e.g. when using mirroring vs. RAIDZ).


A boost could be:


# echo zfs_resilver_delay/W0|mdb -kw
# echo zfs_resilver_min_time_ms/W0t3000|mdb -kw


whereas a handbrake is e.g.:


# echo zfs_resilver_delay/W2|mdb -kw
# echo zfs_resilver_min_time_ms/W0t300|mdb -kw

Disclaimer: Use at your own risk. Do not try on production systems without contacting support first.


Tuesday, March 1, 2011

Useful DTrace One-Liners...

In all of the examples below, replace "execname" inside the double quotes with the name of the process you are interested in.

Finding write operations for a process. Especially useful when it is writing to an NFS share...

# dtrace -n 'fsinfo:::write /execname == "execname"/
  { printf("%s\n", args[0]->fi_pathname) }'


Finding the top userland stacks for a process

# dtrace -n 'syscall:::entry /execname == "execname"/
  { @[ustack()] = count();}'

Finding the same for a certain system call

# dtrace -n 'syscall::mmap:entry /execname == "execname"/
  { @[ustack()] = count();}'