I’ve begun to use the logstash/elasticsearch/kibana stack for processing many different types of logs, and an issue that has started to crop up is that of MAC addresses, which can appear in many different formats. While this is likely solvable where the devices are all under your own control, in a situation such as running an eduroam wireless network, the RADIUS servers see MAC addresses from all sorts of sources, including our own wireless controllers and many other sites’ NASes as well.

So whereas we can standardize on one format for our own systems, generally lower-case hyphen-delimited octets (e.g. aa-11-bb-cc-22-33), I will also see a mix of log entries in colon-delimited (00:11:22:44:dd:ff) and Cisco (12ab.567c.15c0) formats, in both upper- and lower-case, and occasionally with leading zeros missing, too.

This is a problem when feeding these logs into elasticsearch, as I would like the ‘client_mac’ field to be in the same format for all entries; otherwise searching becomes pretty hard. The whole point of elasticsearch and kibana is that searching logs becomes easy, so this needed fixing.

I came up with the following logstash mess, which I am most certainly not proud of. I share only as an example of what not to do!

# Sanitise MAC address. Yuck.
if [client_mac] {
  # Preserve the original MAC address
  mutate {
    add_field => { "client_mac.original" => "%{client_mac}" }
  }

  # Translate the MAC address to lowercase
  mutate {
    lowercase => [ "client_mac" ]
  }

  # Try and match several MAC address formats, capturing the octets
  grok {
    match => [
      "client_mac", "^(?(?:0?|[1-9a-f])[0-9a-f])[:-](?(?:0?|[1-9a-f])[0-9a-f])[:-](?(?:0?|[1-9a-f])[0-9a-f])[:-](?(?:0?|[1-9a-f])[0-9a-f])[:-](?(?:0?|[1-9a-f])[0-9a-f])[:-](?(?:0?|[1-9a-f])[0-9a-f])$",
      "client_mac", "^(?(?:0?|[1-9a-f])[0-9a-f])(?(?:0?|[1-9a-f])[0-9a-f])\.(?(?:0?|[1-9a-f])[0-9a-f])(?(?:0?|[1-9a-f])[0-9a-f])\.(?(?:0?|[1-9a-f])[0-9a-f])(?(?:0?|[1-9a-f])[0-9a-f])$",
      "client_mac", "^(?(?:0?|[1-9a-f])[0-9a-f])(?(?:0?|[1-9a-f])[0-9a-f])(?(?:0?|[1-9a-f])[0-9a-f])(?(?:0?|[1-9a-f])[0-9a-f])(?(?:0?|[1-9a-f])[0-9a-f])(?(?:0?|[1-9a-f])[0-9a-f])$"
    ]
    add_tag => [ "_found_mac" ]
  }

  if "_found_mac" in [tags] {
    # Fill in any missing leading zeros
    if [_mac_a] =~ /^.$/ { mutate { replace => [ "_mac_a", "0%{_mac_a}" ] } }
    if [_mac_b] =~ /^.$/ { mutate { replace => [ "_mac_b", "0%{_mac_b}" ] } }
    if [_mac_c] =~ /^.$/ { mutate { replace => [ "_mac_c", "0%{_mac_c}" ] } }
    if [_mac_d] =~ /^.$/ { mutate { replace => [ "_mac_d", "0%{_mac_d}" ] } }
    if [_mac_e] =~ /^.$/ { mutate { replace => [ "_mac_e", "0%{_mac_e}" ] } }
    if [_mac_f] =~ /^.$/ { mutate { replace => [ "_mac_f", "0%{_mac_f}" ] } }

    # Copy the new MAC into the client_mac field, with colon delimiters
    mutate { replace => { "client_mac" => "%{_mac_a}:%{_mac_b}:%{_mac_c}:%{_mac_d}:%{_mac_e}:%{_mac_f}" } }
  } else {
    # Couldn't match a MAC address, so copy the original (non-lowercased) data back over
    mutate { replace => { "client_mac" => "%{client_mac.original}" } }
  }

  # Tidy up the temporary fields
  mutate {
    remove_field => [ "_mac_a", "_mac_b", "_mac_c", "_mac_d", "_mac_e", "_mac_f" ]
    remove_tag => [ "_found_mac" ]
  }
}
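
For comparison, the same steps (lowercase, match one of three layouts, pad short octets, rejoin with colons, otherwise keep the original) can be sketched in a few lines of plain Ruby. This is only an illustration: normalize_mac is a hypothetical helper, not part of logstash, and unlike the grok patterns above it assumes the Cisco and undelimited layouts always carry all twelve digits.

```ruby
# Hypothetical helper for illustration only; not part of logstash.
# Returns the MAC as lower-case colon-delimited octets, or the
# original string untouched if it does not look like a MAC address.
def normalize_mac(raw)
  mac = raw.downcase

  octets =
    if mac =~ /\A[0-9a-f]{1,2}([:-][0-9a-f]{1,2}){5}\z/
      # Colon- or hyphen-delimited; octets may lack a leading zero
      mac.split(/[:-]/)
    elsif mac =~ /\A[0-9a-f]{4}(\.[0-9a-f]{4}){2}\z/
      # Cisco layout (assumed complete): three groups of four digits
      mac.delete('.').scan(/../)
    elsif mac =~ /\A[0-9a-f]{12}\z/
      # No delimiters at all (assumed complete)
      mac.scan(/../)
    end

  # No match: hand back the original, non-lowercased data
  return raw if octets.nil?

  # Fill in any missing leading zeros and rejoin with colons
  octets.map { |octet| octet.rjust(2, '0') }.join(':')
end

puts normalize_mac('AA-11-BB-CC-22-33')
puts normalize_mac('12AB.567C.15C0')
puts normalize_mac('0:1:2:a:b:c')
```

Splitting on the delimiter and padding each octet replaces both the six named captures and the six conditional mutate blocks.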

I’ve been wanting to do this in a better way, and that way seemed to be a new logstash filter. Having worked out that logstash is actually written in Ruby (though it runs on the Java VM via JRuby, which was a new one on me), I had to brush up on my very rusty Ruby skills. I like Ruby as a language a lot, but generally end up using Perl for everything (which I also like), and occasionally Python, so I haven’t really programmed anything in Ruby for a number of years.

So, I present a filter, “sanitize_mac”, which does exactly what its name says. Usage to do the above is something like the following:

sanitize_mac {
  match => { "client_mac" => "client_mac_sanitized" }
  fixcase => "lower"
  separator => ":"
}

Which is much easier on the eyes. It can clean up multiple MAC addresses, and put them into the same or a different field. It will set them to upper- or lower-case if required, or leave them alone, and can set the delimiter to colon, hyphen, full-stop or none.
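
To give a feel for what the fixcase and separator options mean, here is a minimal Ruby sketch of that kind of formatting. It is not the plugin’s source: format_mac is a hypothetical function, only the option names are borrowed from the config above, and treating a full-stop separator as Cisco-style groups of four digits is my assumption for the example.

```ruby
# Hypothetical sketch, not the sanitize_mac plugin's code.
# fixcase:   'lower', 'upper', or nil to leave the case alone
# separator: ':', '-', '.' or '' (none)
def format_mac(mac, fixcase: nil, separator: ':')
  # Strip any existing delimiters and check that twelve hex digits remain
  digits = mac.gsub(/[:.-]/, '')
  return mac unless digits =~ /\A[0-9a-fA-F]{12}\z/

  digits = digits.downcase if fixcase == 'lower'
  digits = digits.upcase if fixcase == 'upper'

  # Assumption: '.' means Cisco-style groups of four; otherwise octets
  group = separator == '.' ? 4 : 2
  digits.chars.each_slice(group).map(&:join).join(separator)
end

puts format_mac('AA-11-BB-CC-22-33', fixcase: 'lower', separator: ':')
puts format_mac('12ab.567c.15c0', fixcase: 'upper', separator: '-')
puts format_mac('00:11:22:44:dd:ff', separator: '.')
```

Like the first version of the filter, this sketch does not try to repair addresses with missing leading zeros; anything that is not twelve hex digits once the delimiters are removed is returned unchanged.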

There are definitely some improvements to be made (it doesn’t currently cope with missing leading zeros, for one), but for now the source is available here on GitHub. Just drop it into /opt/logstash/lib/logstash/filters/ to make it available for use.

5 Thoughts on “Cleaning up MAC addresses in logstash”

  • A Google search for ‘logstash mac address’ led me here and it was exactly what I needed. The filter works well for my use case. Thanks!

  • Great, glad it’s useful and thanks for the feedback!

    Make sure you got the latest version from a couple of days ago (7f025fcd) as it now also fixes MAC addresses that are missing leading zeros, and has a bit more input sanity checking.

  • Hey Mathew,

    I really appreciate the code. I am having some trouble implementing it, however.

    I’m dropping it into the folder you specified but I keep getting the same error.

    The given configuration is invalid. Reason: Couldn’t find any filter plugin named ‘sanitize_mac’. Are you sure this is correct? Trying to load the sanitize_mac filter plugin resulted in this error: no such file to load — logstash/filters/sanitize_mac {:level=>:fatal}

    Any thoughts?
