This post will be a part of a multiple posts to cover data parsing in Microsoft Sentinel.
Kusto is a powerfull query language and easy to adopt.
Even if Kusto is very powerfull, working with custom log sources is, sometimes, a mess. Some parsers requires more effort and some are very simple.
In general, when it’s possible to use operators like “parse” (link) function or “parse-kv” (link) it’s very welcome. However, the reality has a different challenge for us.
In this post we want to share a quick pro tip to solve the mystic of hidden whitespaces
The challenge of whitespaces
Whitespaces ” ” exists everywhere, the challenge is how it’s presented in log analytics.
Log analytics does a lot for the user in terms of nicely present data. It actually removes duplicate whitespaces, as well as leading and trailing whitespace.
This could result in problems like failing parsers, regex and string operators like “==”, “startswith”, “endswith” etc will fail. Especially if it’s not consistent.
Marking the string in the output view does not show the extra whitespaces
Copying the text and paste into a text-editor will not show it either like in below example where we copied the output into VS Code (we can only see one dot to show one whitespace between foo and bar)
However, the double whitespaces are interpreted during execution, and it’s only in the presentation view the extra is removed. As in below example, we used split on ” ” to show the existence of the double whitespace.
When working with multiple log sources you don’t want to search and see if they exist (which may change during the log source life cycle), you rather want a way to always make the log to look good in your parser.
To properly address this (if there aren’t any good ways to change the audit settings of the system sending the logs)
To handle the duplicate white spaces we use the replace_regex function (link here) and use the whitespace “\s” with the quantifier “+” which means one or multiple times and replace it with a space ” “.
This will search for spaces (one or more) and replace it with a single, because we don’t want to remove single spaces. And by using the same column name “SyslogMessage” we will actually reuse the same column for our clean output.
Please note that this will not change the message in the database, only during execution.
Doing this gives us the following output.
The next step is that we want to remove the leading and trailing whitespaces. If we for instance expect the first character to be a value, the leading whitespace could make our parser to fail or an analytic rule.
We have seen occasions where this happens from time to time and not all messages in a log source.
To fix the leading and trailing whitespaces we use another regex to look for start of string and end of string. But this time we want to replace with “nothing/null” which is why we can’t use this regex in the first cleaning.
In the second run we use the same column name again to cleanup the SyslogMessage. There is a best practice to always keep the original message, however, this is to solve an error from the log source and not to alter the SyslogMessage.
The regex starts with an anchor “^” to define the start of the string and followed by a whitespace “\s” since we cleaned all double whitespaces we don’t need to use the quantifier. To handle the trailing whitespace we use the OR operand “|” and check for a whitespace “\s” followed by the anchor “$” to determine the end of the string. If we get any hits it will be replaced with null and we have a clean string.
By adding these 2 lines of code to the parser, we will avoid running into strange issues which could take some time to troubleshoot.
//Sample CustomLogSource_CL | extend SyslogMessage = replace_regex(SyslogMessage,@"\s+",@" ") //Remove duplicate whitespaces | extend SyslogMessage = replace_regex(SyslogMessage,@"^\s|\s$",@"") //Remove leading and trailing whitespaces