Heating up the Data Pipeline (Part 1)

Pre-Processing Data

Customers often ask me how data can be transformed prior to indexing in Splunk.
Damien from Baboonbones has done a tremendous job creating add-ons that provide custom inputs for Splunk. Most of his custom inputs provide the means to pre-process data by allowing custom event handlers to be written.

Sometimes you still want to pre-process data that gets collected from Splunk's standard input types, like file monitors, Windows EventLogs, scripted inputs etc. Also, not everyone is capable of writing custom event handlers.

A further requirement: these customers have rolled out a large number of Splunk Universal Forwarders and do not want to install another agent.

To summarize, a solution capable of pre-processing data should be easy to use, easy to integrate and built on top of the existing architecture.

How to plumb Splunk Pipelines

Splunk has its own fitting to connect a Universal Forwarder to a Heavy Forwarder or to an Indexer: the so-called SplunkTCP protocol. This proprietary protocol transports raw data together with metadata such as sourcetype, source, host and index. Data in this form is also known as "cooked data".

The cooked format can be divided into two types: unparsed streams and parsed events. A Universal Forwarder typically sends ingested data out as a stream to a Splunk Indexer or Heavy Forwarder, where event parsing is applied (parsing queue). The parsing queue transforms data streams into single events. Further queues are the aggregation queue, which e.g. extracts timestamps, and the typing queue, which allows transformations of the event and its metadata.

More information about Splunk's event processing queues can be found here.

Tee Fitting and Event Heating

To avoid duplicating functionality that Splunk handles very well, the best place to extract data from the Splunk pipeline is after the typing queue and prior to the indexing queue.

Unfortunately, the cooked SplunkTCP protocol is proprietary and cannot be used by other systems. Fortunately, Splunk provides a supported way to forward data to 3rd party systems. The trick is to disable Splunk cooked mode in outputs.conf:

sendCookedData = [true|false]
* Set to false if you are sending to a third-party system.

With this setting, the recipient gets a plain event stream over TCP. The downside of this option is that all metadata is lost.

Wouldn't it be great if the 3rd party system would receive all metadata, the same way as an indexer would receive in cooked mode?

Remember, events are going through the typing queue, and we can change events as we like. This is how it's done:

Turning Up the Heat

With a clever combination of props.conf and transforms.conf, we can "low cook" our events. In other words, we add our own headers to our events. This is the full event, header included, that we would like to have:

###time=<epoch>|meta=<meta>|host=<host>|sourcetype=<sourcetype>|index=<index>|source=<source>###Start-of-Event###<_raw>###End-of-Event###

The transforms.conf looks like this. Note that we go through all important SOURCE_KEYs and prepend the gathered information to the _raw event with a combination of $1 (the matched data) and $0 (the existing _raw data):

[metadata_time]
SOURCE_KEY = _time
REGEX = (.*)
FORMAT = ###time=$1|$0
DEST_KEY = _raw

[metadata_host]
SOURCE_KEY = MetaData:Host
REGEX = ^host::(.*)$
FORMAT = host=$1|$0
DEST_KEY = _raw

[metadata_sourcetype]
SOURCE_KEY = MetaData:Sourcetype
REGEX = ^sourcetype::(.*)$
FORMAT = sourcetype=$1|$0
DEST_KEY = _raw

[metadata_meta]
SOURCE_KEY = _meta
REGEX = (.*)
FORMAT = meta=$1|$0
DEST_KEY = _raw

[metadata_index]
SOURCE_KEY = _MetaData:Index
REGEX = (.*)
FORMAT = index=$1|$0
DEST_KEY = _raw

[metadata_source]
SOURCE_KEY = MetaData:Source
REGEX = ^source::(.*)$
FORMAT = source=$1###Start-of-Event###$0###End-of-Event###
DEST_KEY = _raw

Now we apply the transforms to the sourcetype of choice:

[<your_sourcetype>]
TRANSFORMS-metadata = metadata_source, \
                      metadata_index, \
                      metadata_sourcetype, \
                      metadata_host, \
                      metadata_meta, \
                      metadata_time
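
The order of this list matters, because every transform prepends its field to whatever is already in _raw. The following Python sketch only simulates that behaviour outside of Splunk. The names TRANSFORMS, ORDER, apply_transform and low_cook are made up for this illustration, and the sketch assumes that metadata_time is applied last so that the event starts with ###time=.

```python
import re

# One entry per transforms.conf stanza: (SOURCE_KEY, REGEX, FORMAT).
# DEST_KEY is always _raw, so it is not repeated here.
TRANSFORMS = {
    "metadata_time": ("_time", r"(.*)", "###time=$1|$0"),
    "metadata_host": ("MetaData:Host", r"^host::(.*)$", "host=$1|$0"),
    "metadata_sourcetype": ("MetaData:Sourcetype", r"^sourcetype::(.*)$",
                            "sourcetype=$1|$0"),
    "metadata_meta": ("_meta", r"(.*)", "meta=$1|$0"),
    "metadata_index": ("_MetaData:Index", r"(.*)", "index=$1|$0"),
    "metadata_source": ("MetaData:Source", r"^source::(.*)$",
                        "source=$1###Start-of-Event###$0###End-of-Event###"),
}

# Assumed order: source first (it wraps the raw event), time last.
ORDER = ["metadata_source", "metadata_index", "metadata_sourcetype",
         "metadata_host", "metadata_meta", "metadata_time"]


def apply_transform(event, name):
    """Apply one simulated transform to event["_raw"] in place."""
    source_key, regex, fmt = TRANSFORMS[name]
    match = re.search(regex, event[source_key])
    if match is None:
        return  # no match: _raw stays untouched
    # $1 is the first capture group, $0 the previous content of _raw.
    values = {"0": event["_raw"], "1": match.group(1)}
    event["_raw"] = re.sub(r"\$([01])", lambda m: values[m.group(1)], fmt)


def low_cook(event, order=ORDER):
    """Return the low cooked _raw after applying all transforms in order."""
    cooked = dict(event)  # work on a copy, keep the caller's event intact
    for name in order:
        apply_transform(cooked, name)
    return cooked["_raw"]
```

Calling low_cook with a dictionary that holds the keys _time, MetaData:Host, MetaData:Sourcetype, _meta, _MetaData:Index, MetaData:Source and _raw returns the complete header followed by the wrapped raw event. Passing a different order shows how the header fields end up in the wrong place.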

You can also use the unsupported catch-all sourcetype rule:


Selective Forwarding

Probably not all data needs to be pre-processed. For selective routing, add this to transforms.conf:

SOURCE_KEY = MetaData:Host
REGEX = <regex matching the hosts to route>
DEST_KEY = _TCP_ROUTING
FORMAT = third_party

Create a stanza in your outputs.conf:

[tcpout:third_party]
server = <host>:<port>
sendCookedData = false

Don't forget to append the routing transform to your props.conf:

[<your_sourcetype>]
TRANSFORMS-metadata = metadata_source, \
                      metadata_index, \
                      metadata_sourcetype, \
                      metadata_host, \
                      metadata_meta, \
                      metadata_time, \
                      <routing_transform>


Some 3rd party systems can't receive multi-line events easily. You can replace line feeds and carriage returns with a simple SEDCMD in your props.conf:

SEDCMD-LF = s/(?ims)\n/###LF###/g
SEDCMD-CR = s/(?ims)\r/###CR###/g
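
On the receiving side the placeholders have to be turned back into real line breaks once the event has been read as a single line. A minimal Python sketch of that step could look as follows. The function name restore_line_breaks is made up for this illustration, and the sketch assumes the original event never contained the literal markers ###LF### or ###CR### itself.

```python
def restore_line_breaks(text: str) -> str:
    """Undo the SEDCMD replacements and bring back the original line breaks."""
    # Assumption: the markers only ever appear where SEDCMD inserted them.
    return text.replace("###LF###", "\n").replace("###CR###", "\r")
```

For example, restore_line_breaks("line one###CR######LF###line two") gives back the two lines separated by a Windows style line break.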

Sample Low Cooked Event

This is what a sample event looks like:

###time=1498849507|meta=datetime::"06-30-2017 21:05:07.214 +0200" log_level::INFO component::PerProcess data.pid::17656 data.ppid::564 data.t_count::68 data.mem_used::66.676 data.pct_memory::0.83 data.page_faults::767240 data.pct_cpu::0.00 data.normalized_pct_cpu::0.00 data.elapsed::84006.0001 data.process::splunkd data.args::service data.process_type::splunkd_server _subsecond::.214 date_second::7 date_hour::21 date_minute::5 date_year::2017 date_month::june date_mday::30 date_wday::friday date_zone::120|host=LT-PF0R53KD|sourcetype=splunk_resource_usage|index=_introspection|source=C:\Program Files\Splunk\var\log\introspection\resource_usage.log###Start-of-Event###{"datetime":"06-30-2017 21:05:07.214 +0200","log_level":"INFO","component":"PerProcess","data":{"pid":"17656","ppid":"564","t_count":"68","mem_used":"66.676","pct_memory":"0.83","page_faults":"767240","pct_cpu":"0.00","normalized_pct_cpu":"0.00","elapsed":"84006.0001","process":"splunkd","args":"service","process_type":"splunkd_server"}}###End-of-Event###
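
A 3rd party system has to split such an event back into metadata and raw data. As a rough illustration, independent of any particular tool, this Python sketch does it with one regular expression. The names LOW_COOKED and parse_low_cooked are made up for this example, and the sketch assumes the header fields always appear in exactly the order shown above.

```python
import re

# Header fields in the order the transforms produce them, then the raw event
# wrapped in the Start-of-Event / End-of-Event markers.
LOW_COOKED = re.compile(
    r"^###time=(?P<time>.*?)"
    r"\|meta=(?P<meta>.*?)"
    r"\|host=(?P<host>.*?)"
    r"\|sourcetype=(?P<sourcetype>.*?)"
    r"\|index=(?P<index>.*?)"
    r"\|source=(?P<source>.*?)"
    r"###Start-of-Event###(?P<raw>.*)###End-of-Event###$",
    re.DOTALL,  # let the raw part span several lines if it has to
)


def parse_low_cooked(line):
    """Split one low cooked event into a dict of metadata fields plus 'raw'."""
    match = LOW_COOKED.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a low cooked event")
    return match.groupdict()
```

The returned dictionary has the keys time, meta, host, sourcetype, index, source and raw, all as strings. The meta value is left as one string here; splitting it into its key::value pairs is a separate step.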

Niagara Pipeline

Now that we can output events in our "low cooked" event format, we need a solution that is capable of transforming these events.

My tool of choice is Apache NiFi (short for Niagara Files). NiFi is an enterprise-grade dataflow tool that can collect, route, enrich and process data in a scalable manner.

In Part II, we will look at what Apache NiFi is capable of and how it is configured. But let's first look at where to put Apache NiFi in the architecture.

Pipeline Architecture

This picture shows where NiFi can be placed in a Splunk architecture, where Universal Forwarders send data directly to Splunk Indexers. NiFi will pre-process the data and, for example, send it on to the Indexer.

The next picture shows a Splunk architecture, where Universal Forwarders send their data first to an intermediate heavy forwarder. Selected events will be pre-processed by Splunk and then be sent further to the Indexer.


With the above methods, it's possible to send data to a 3rd party system, without losing crucial metadata. The 3rd party system can then pre-process the data/metadata and send these events to an Indexer.

In Part II we will look at how NiFi can pre-process data, and show you some examples.

