One of the most powerful features of Splunk, the market leader in log aggregation and operational data intelligence, is the ability to extract fields while searching for data. Unfortunately, it can be a daunting task to get this working correctly. In this article, I’ll explain how you can extract fields using Splunk SPL’s rex command. I’ll provide plenty of examples with actual SPL queries. In my experience, rex is one of the most useful commands in the long list of SPL commands. I’ll also reveal one secret command that can make this process super easy. By fully reading this article you will gain a deeper understanding of fields, and learn how to use the rex command to extract fields from your data.
What is a field?
A field is a name-value pair that is searchable. Virtually all searches in Splunk use fields. A field can contain multiple values. Also, a given field need not appear in all of your events. Let’s consider the following SPL.
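index=main sourcetype=access_combined_wcookie action=purchase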
The fields in the above SPL are “index”, “sourcetype” and “action”. The values are “main”, “access_combined_wcookie” and “purchase” respectively.
Fields turbo charge your searches by enabling you to customize and tailor your searches. For example, consider the following SPL
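index=web sourcetype=access_combined status>=500 response_time>6000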
The above SPL searches the index web, which happens to have web access logs, with sourcetype equal to access_combined, status greater than or equal to 500 (indicating a server-side error) and response_time greater than 6 seconds (or 6000 milliseconds). This kind of flexibility in exploring data will never be possible with simple text searching.
How are fields created?
There is some good news here. Splunk automatically creates many fields for you. The process of creating fields from the raw data is called extraction. By default Splunk extracts many fields during index time. The most notable ones are:
index
host
sourcetype
source
_time
_indextime
splunk_server
You can configure Splunk to extract additional fields during index time based on your data and the constraints you specify. This process is also known as adding custom fields during index time. This is achieved by configuring props.conf, transforms.conf and fields.conf. Note that if you are using Splunk in a distributed environment, props.conf and transforms.conf reside on the Indexers (also called Search Peers) while fields.conf resides on the Search Heads. And if you are using a Heavy Forwarder, props.conf and transforms.conf reside there instead of on the Indexers.
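As a rough sketch (the stanza, sourcetype and field names below are illustrative, not taken from this article), an index-time extraction is typically wired together like this:

props.conf (on the Indexers or Heavy Forwarder):
[my_sourcetype]
TRANSFORMS-customfield = extract_username

transforms.conf (on the Indexers or Heavy Forwarder):
[extract_username]
REGEX = user=(\w+)
FORMAT = username::$1
WRITE_META = true

fields.conf (on the Search Heads):
[username]
INDEXED = true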
While index-time extraction seems appealing, you should try to avoid it for the following reasons.
- Indexed extractions use more disk space.
- Indexed extractions are not flexible. i.e. if you change the configuration of any of the indexed extractions, the entire index needs to be rebuilt.
- There is a performance impact as Indexers do more work during index time.
Instead, you should use search-time extractions. Schema-on-Read, in fact, is a superior strength of Splunk that you won’t find in most other log aggregation platforms. Schema-on-Write, which requires you to define the fields ahead of indexing, is what you will find in most log aggregation platforms (including Elasticsearch). With the Schema-on-Read approach that Splunk uses, you slice and dice the data at search time, with no persistent modifications made to the indexes. This also provides the most flexibility, as you define how the fields should be extracted.
Many ways of extracting fields in Splunk during search-time
There are several ways of extracting fields during search-time. These include the following.
- Using the Field Extractor utility in Splunk Web
- Using the Fields menu in Settings in Splunk Web
- Using the configuration files
- Using SPL commands
- rex
- extract
- multikv
- spath
- xmlkv/xpath
- kvform
For Splunk neophytes, using the Field Extractor utility is a great start. However, as you gain more experience with field extractions, you will start to realize that the Field Extractor does not always come up with the most efficient regular expressions. Eventually, you will start to leverage the power of the rex command and regular expressions, which is what we are going to look at in detail now.
What is rex?
rex is an SPL (Search Processing Language) command that extracts fields from the raw data based on the pattern you specify using regular expressions.
The command takes search results as input (i.e., the command is written after a pipe in SPL). It matches a regular expression pattern in each event, and saves the value in a field that you specify. Let’s see a working example to understand the syntax.
Consider the following raw event.
Thu Jan 16 2018 00:15:06 mailsv1 sshd[5801]: Failed password for invalid user desktop from 194.8.74.23 port 2285 ssh2
The above event is from Splunk tutorial data. Let’s say you want to extract the port number as a field. Using the rex command, you would use the following SPL:
index=main sourcetype=secure
| rex "port\s(?<portNumber>\d+)\s"
Once you have port extracted as a field, you can use it just like any other field. For example, the following SPL retrieves events with port numbers between 1000 and 2000.
index=main sourcetype=secure
| rex "port\s(?<portNumber>\d+)\s"
| where portNumber >= 1000 AND portNumber < 2000
Note: rex has two modes of operation. The sed mode, denoted by the option mode=sed, lets you replace characters in an existing field. We will not discuss sed further in this blog.
Note: Do not confuse the SPL command regex with rex. regex filters search results using a regular expression (i.e., it removes events that do not match the regular expression provided with the regex command).
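For illustration, here is a quick sketch using the tutorial data from the examples below; it keeps only the events whose raw text matches the pattern, without creating any field:
index=main sourcetype=secure
| regex _raw="Failed password"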
Syntax of rex
Let’s unpack the syntax of rex.
rex field=<field> <PCRE named capture group>
The PCRE named capture group works the following way:
(?<name>regex)
The above expression captures the text matched by regex into the group name.
Note: You may also see (?P<name>regex) used in named capture groups (notice the character P). In Splunk, you can use either approach.
If you don’t specify the field name, rex applies to _raw (which is the entire event). Specifying a field greatly improves performance, especially if your events are large (typically, I would consider any event over 10-15 lines as large).
There is also an option named max_match, which is set to 1 by default, i.e., rex retains only the first match. If you set this option to 0, there is no limit to the number of matches in an event, and rex creates a multivalued field when there are multiple matches.
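For example, here is a hedged sketch reusing the port extraction from above; with max_match=0, every port number appearing in an event is collected into a single multivalued portNumber field:
index=main sourcetype=secure
| rex max_match=0 "port\s(?<portNumber>\d+)"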
As you can sense by now, mastering rex means getting a good handle on regular expressions. In fact, it is all about regular expressions when it comes to rex. It is best learned through examples. Let’s dive right in.
Learn rex through examples
Extract a value that follows a string
Raw Event:
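Thu Jan 16 2018 00:15:06 mailsv1 sshd[5801]: Failed password for invalid user desktop from 194.8.74.23 port 2285 ssh2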
Extract a field named username whose value follows the string user in the events.
index=main sourcetype=secure
| rex "user\s(?<username>\w+)\s"
Isn’t that beautiful?
Now, let’s dig deep in to the command and break it down.
Extract a value based on a string pattern
This can be super handy. Let’s extract Java exceptions as a field.
Raw Event:
08:24:42 ERROR : Unexpected error while launching program.
java.lang.NullPointerException
at com.xilinx.sdk.debug.core.XilinxAppLaunchConfigurationDelegate.isFpgaConfigured(XilinxAppLaunchConfigurationDelegate.java:293)
Extract any java Exception as a field. Note that java exceptions have the form java.<package hierarchy>.<Exception>. For example:
java.lang.NullPointerException
java.net.ConnectException
javax.net.ssl.SSLHandshakeException
So, the following regex matching will do the trick.
java\..*Exception
Explanation:
java: A literal string java
\. : A backslash followed by a period. In regex, the backslash escapes the following character, meaning that character is interpreted literally. A period (.) by itself stands for any character in regex. In this case we want to match a literal period, so we escape it.
.* : A period followed by a star (*). In regex, * indicates zero or more of the preceding character, so .* simply means anything.
Exception: A literal string Exception.
Our full blown SPL looks like this:
index=main sourcetype=java-logs
| rex "(?<javaException>java\..*Exception)"
Let’s add some complexity to it. Let’s say you have exceptions that look like the following:
javax.net.ssl.SSLHandshakeException
Notice the “x” in javax? How can we account for the x? Ideally, we want rex to extract the Java exception regardless of whether it starts with javax or java. We can, thanks to the character class and the “?” quantifier.
java[x]?\..*Exception
Let us consider new raw events.
Our new SPL looks like this:
index=main sourcetype=java-logs1
| rex "(?<javaException>java[x]?\..*Exception)"
That’s much better. Our extracted field javaException captured the exception from both the events.
Wait a minute. Is something wrong with this extraction?
Apparently, the extraction captured two exceptions. The raw event looks like this:
08:24:43 ERROR : javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Apparently the regex java[x]?\..*Exception is matching all the way up to the second instance of the string “Exception”.
This is called greedy matching in regex. By default, quantifiers such as “*” and “+” try to match as many characters as possible. In order to force ‘lazy’ behaviour, we add “?” after the quantifier (for example, .*?). Our new SPL looks like this:
index=main sourcetype=java-logs1
| rex "(?<javaException>java[x]?\..*?Exception)"
That’s much much better!
Extract credit card numbers as a field
Let’s say you have credit card numbers in your log file (a very bad idea). Let’s say they all have the format XXXX-XXXX-XXXX-XXXX, where X is any digit. You can easily extract the field using the following SPL:
index=main sourcetype=custom-card
| rex "(?<cardNumber>\d{4}-\d{4}-\d{4}-\d{4})"
The {} notation applies a multiplier. For example, \d{4} means 4 digits, and \d{1,4} means between 1 and 4 digits. Note that you can group characters and apply multipliers to the group as well. For example, the above SPL can be written as follows:
index=main sourcetype=custom-card
| rex "(?<cardNumber>(\d{4}-){3}\d{4})"
Extract multiple fields
You can extract multiple fields in the same rex command.
Consider the following raw event
Thu Jan 16 2018 00:15:06 mailsv1 sshd[5276]: Failed password for invalid user appserver from 194.8.74.23 port 3351 ssh2
The above event is from Splunk tutorial data.
You can extract the user name, ip address and port number in one rex command as follows:
index=main sourcetype=secure
| rex "invalid user (?<userName>\w+) from (?<ipAddress>(\d{1,3}\.){3}\d{1,3}) port (?<port>\d+)"
Also note that you can pipe the results of the rex command to further reporting commands. For example, building on the above, if you want to find the top users with login errors, you would use the following SPL:
index=main sourcetype=secure
| rex "invalid user (?<userName>\w+) from (?<ipAddress>(\d{1,3}\.){3}\d{1,3}) port (?<port>\d+)"
| top limit=15 userName
A short-cut
Regex, while powerful, can be hard to grasp in the beginning. Fortunately, Splunk includes a command called erex which will generate the regex for you. All you have to do is provide samples of data and Splunk will figure out a possible regular expression. While I don’t recommend relying fully on erex, it can be a great way to learn regex.
For example, use the following SPL to extract the IP address from the data we used in our previous example:
index=main sourcetype=secure
| erex ipAddress examples="194.8.74.23,109.169.32.135"
Not bad at all. Without writing any regex, we are able to use Splunk to figure out the field extraction for us. Here is the best part: When you click on “Job” (just above the Timeline), you can see the actual regular expression that Splunk has come up with.
Successfully learned regex. Consider using: | rex '(?i) from (?P<ipAddress>[^ ]+)'
I’ll let you analyze the regex that Splunk came up with for this example :-). One hint: the (?i) in the above regex stands for “case insensitive”.
That brings us to the end of this blog. I hope you have become a bit more comfortable using rex to extract fields in Splunk. Like I mentioned, it is one of the most powerful commands in SPL. Feel free to use it as often as you need. Before you know it, you will be helping your peers with regex.
Happy Splunking!
This is part six of the 'Hunting with Splunk: The Basics' series.
If you have spent any time searching in Splunk, you have likely done at least one search using the stats command. I won’t belabor the point, but it's such a crucial capability in the context of threat hunting that it would be a crime not to talk about it in this series.
When focusing on data sets of interest, it's very easy to use the stats command to perform calculations on any of the returned field values to derive additional information. When I say stats, I am not just referring to the stats command; there are two additional commands that are worth mentioning—eventstats and streamstats. Like many Splunk commands, all three are transformational commands, meaning they take a result set and perform functions on the data.
Let’s dive into stats.
Stats
The stats command is a fundamental Splunk command. It will perform any number of statistical functions on a field, which could be as simple as a count or average, or something more advanced like a percentile or standard deviation. Using the keyword by within the stats command can group the statistical calculation based on the field or fields listed.
Here is a good basic example of how to apply the stats command during hunting. I might hypothesize that the source-destination pairs with the largest number of connections originating from a specific netblock are of interest to dig deeper into.
The search is looking at the firewall data originating from the 192.168.225.0/24 netblock and going to destinations that are not internal or DNS. The stats command is generating a count, grouped by source and destination address. Once the count is generated, that output can be manipulated to get rid of single events and then sorted from largest to smallest.
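A minimal sketch of what that search could look like is below (the index name and the src_ip, dest_ip and dest_port field names are assumptions on my part, since the original screenshots aren't reproduced here):
index=firewall src_ip=192.168.225.0/24 NOT dest_ip=192.168.0.0/16 dest_port!=53
| stats count by src_ip, dest_ip
| where count > 1
| sort -count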
Another use for stats is to sum values together. A hypothesis might be to look at firewall traffic to understand who my top talkers to external hosts are, not from a connection perspective, but from a byte perspective. Using the stats command, multiple fields can be calculated, renamed and grouped.
In this example, the same data sets are used but this time, the stats command is used to sum the bytes_in and bytes_out fields. By changing the sort, I can easily pivot to look at the top inbound byte volumes or even the low talkers based on lowest byte count (which might be its own hypothesis). As a side note, if I saw the result set above I might ask why I am seeing many hosts from the same subnet all communicating to the same destination IP, with identical byte counts, both in and out. The point is there are numerous ways to leverage stats.
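A hedged sketch of that kind of search (again, the index and field names are assumptions):
index=firewall src_ip=192.168.225.0/24 NOT dest_ip=192.168.0.0/16 dest_port!=53
| stats sum(bytes_in) AS bytes_in, sum(bytes_out) AS bytes_out by src_ip, dest_ip
| sort -bytes_out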
Eventstats
With these fundamentals in place, let’s apply these concepts to eventstats. I like to think of eventstats as a method to calculate “grand totals” within a result set that can then be used to further manipulate these totals to introspect the data set further.
Another hypothesis I might want to pursue is identifying and investigating the systems with the largest byte counts leaving the network; but to effectively hunt, I want to know all of the external hosts that my system is connecting to and how much data is going to each host.
Using the same basic search criteria as the earlier search, we slightly augmented it to make sure any bytes_out are not zero to keep the result set cleaner. Eventstats is calculating the sum of the bytes_out and renaming it total_bytes_out grouped by source IP address. That output can then be treated as a field value that can be outputted with additional Splunk commands.
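A hedged approximation of that search (index and field names assumed) looks like this:
index=firewall src_ip=192.168.225.0/24 NOT dest_ip=192.168.0.0/16 dest_port!=53 bytes_out>0
| eventstats sum(bytes_out) AS total_bytes_out by src_ip
| table src_ip, dest_ip, bytes_out, total_bytes_out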
The bands highlighted in red show the source IP address with the bytes_out summed to equal the total_bytes_out.
Another hypothesis that I could pursue using eventstats would be to look for systems that have more than 60 percent of their traffic going to a single destination. If a system is talking nearly exclusively to a single external host, that might be cause for concern or at least an opportunity to investigate further.
Going back to the earlier example that looked for large volumes of bytes_out by source and destination IP addresses, we could evolve this and use eventstats to look at the bytes_out by source as a percentage of the total byte volume going to a specific destination.
Building on the previous search criteria, I calculate the eventstats by summing the bytes_out grouped by source IP address to get that “grand total.” Now I can start transforming that data using stats like I did earlier and grouping by source and destination IP. If I stopped there, I would have the sum of the bytes_in, bytes_out, the total_bytes_out and the source and destination IP. That’s great, but I need to filter down on the outliers that I'm hypothesizing about.
Using the eval command, the bytes_out and total_bytes_out can be used to calculate a percentage of the overall traffic. At that point, I'm formatting the data using the table command and then filtering down on the percentages that are greater than 60 and sorting the output.
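Putting those steps together, a hedged sketch might look like this (the index, the field names and the percent_of_total field are illustrative, not the exact original search):
index=firewall src_ip=192.168.225.0/24 NOT dest_ip=192.168.0.0/16 dest_port!=53 bytes_out>0
| eventstats sum(bytes_out) AS total_bytes_out by src_ip
| stats sum(bytes_in) AS bytes_in, sum(bytes_out) AS bytes_out by src_ip, dest_ip, total_bytes_out
| eval percent_of_total=round((bytes_out/total_bytes_out)*100, 2)
| table src_ip, dest_ip, bytes_in, bytes_out, total_bytes_out, percent_of_total
| where percent_of_total > 60
| sort -percent_of_total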
I now have a set of source IP addresses that I can continue to interrogate with the knowledge that a high percentage of the data is going to a single destination. In fact, when I look at my output, I find an interesting outcome which is that my top 14 source addresses are all communicating to the same external IP address. That alone might be something interesting to dig further on, or it might be a destination that should be whitelisted using a lookup. This approach though allows me to further refine my search and reinforce or disprove my hypothesis.
Streamstats
On to streamstats. Streamstats builds upon the basics of the stats command but it provides a way for statistics to be generated as each event is seen. This can be very useful for things like running totals or looking for averages as data is coming into the result set.
If I were to take the results from our earlier hunt, I could further hypothesize that communications outbound from my host occur in bursts. I could then use streamstats to visualize and confirm that hypothesis.
Building off the previous example, the source IP address 192.168.225.80 generated 77% of its traffic to a specific destination. We could investigate further and look at the data volume over time originating from that address.
The search I start with is the same basic search as the other examples with one exception—the source is no longer a range but a specific address. Because I would like the information to aggregate on a daily basis, I'm sorting by date. Streamstats is then used to get the sum of the bytes_out, renamed as total_bytes_out and grouped by source IP address. Finally, we table the output, specifically date, bytes_out and the total_bytes_out.
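As a hedged sketch of that search (the exact date handling and field names are my assumptions about how the daily aggregation was done):
index=firewall src_ip=192.168.225.80 NOT dest_ip=192.168.0.0/16 dest_port!=53 bytes_out>0
| eval date=strftime(_time, "%Y-%m-%d")
| stats sum(bytes_out) AS bytes_out by date, src_ip
| sort date
| streamstats sum(bytes_out) AS total_bytes_out by src_ip
| table date, bytes_out, total_bytes_out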
The output can be viewed in a tabular format or visualized, preferably as a line chart or area chart. As you can see from the output, the daily bytes_out added to the previous day’s total_bytes_out will equal today’s total_bytes_out.
Stats, eventstats and streamstats are all very powerful tools to further refine the result set to identify outliers within the environment. While this blog focused on network traffic and used sums and counts, there is no reason not to use it for host-based analysis as well as leveraging statistics like standard deviations, medians and percentiles.
Happy hunting!