msticpy.transform.iocextract module

Module for IoCExtract class.

Uses a set of builtin regular expressions to look for Indicator of Compromise (IoC) patterns. Input can be a single string or a pandas dataframe with one or more columns specified as input.

The following types are built-in:

  • IPv4 and IPv6

  • URL

  • DNS domain

  • Hashes (MD5, SHA1, SHA256)

  • Windows file paths

  • Linux file paths (these matches tend to be noisy because a legal Linux file path can contain almost any character)

You can modify or add to the regular expressions used at runtime.

class msticpy.transform.iocextract.IoCExtract(defanged: bool = True)

Bases: object

IoC Extractor - looks for common IoC patterns in input strings.

The extract() method takes either a string or a pandas DataFrame as input. With string input, extract() returns a dictionary of results. With DataFrame input, the results are returned as a new DataFrame with the following columns:

  • IoCType: the mnemonic used to distinguish different IoC types

  • Observable: the actual value of the observable

  • SourceIndex: the index of the row in the input DataFrame from which the IoC observable was extracted

The class has a number of built-in IoC regex definitions. These can be retrieved using the ioc_types attribute.

Additional IoC definitions can be added using the add_ioc_type method.

Note: because the regular expression patterns for different types overlap, an observable may be returned under more than one IoC type. E.g. 192.168.0.1 is also a legal file name in both Linux and Windows. Linux file names allow a particularly wide range of characters, so it is quite common to see other IoC observables (or parts of them) returned as a possible Linux path.

Initialize new instance of IoCExtract.

DNS_DF_REGEX = '((?=[a-z0-9-]{1,63}\\[?\\.\\]?)[a-z0-9]+(-[a-z0-9]+)*\\[?\\.\\]?){1,126}[a-z]{2,63}'
DNS_REGEX = '((?=[a-z0-9-]{1,63}\\.)[a-z0-9]+(-[a-z0-9]+)*\\.){1,126}[a-z]{2,63}'
EMAIL_DF_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)(@|AT)(?P<domain>((?=[a-z0-9-]{1,63}\\[?\\.\\]?)[a-z0-9]+(-[a-z0-9]+)*\\[?\\.\\]?){1,126}[a-z]{2,63})"
EMAIL_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)(@|AT)(?P<domain>((?=[a-z0-9-]{1,63}\\.)[a-z0-9]+(-[a-z0-9]+)*\\.){1,126}[a-z]{2,63})"
EMAIL_USER_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)(@|AT)"
IPV4_DF_REGEX = '(?P<ipaddress>(?:[0-9]{1,3}\\[?\\.\\]?){3}[0-9]{1,3})'
IPV4_REGEX = '(?P<ipaddress>(?:[0-9]{1,3}\\.){3}[0-9]{1,3})'
IPV6_REGEX = '(?<![:.\\w])(?:[A-F0-9]{0,4}:){2,7}[A-F0-9]{0,4}(?![:.\\w])'
LXPATH_REGEX = '(?P<root>/+||[.]+)\n            (?P<folder>/(?:[^\\\\/:*?<>|\\r\\n]+/)*)\n            (?P<file>[^/\\0<>|\\r\\n ]+)'
LXSTDPATH_REGEX = '\n            (?P<root>/|/bin|/boot|/dev|/home|/lib|/lost\\\\+found|/misc|/mnt|/net|/opt|/proc|/root|/sbin|/tmp|/usr|/var)\n            (?P<folder>/(?:[^\\\\/:*?<>|\\r\\n]+/)*)\n            (?P<file>[^/\\0<>|\\r\\n ]+)\n    '
MD5_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])'
SHA1_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{40})(?:$|[^A-Fa-f0-9])'
SHA256_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{64})(?:$|[^A-Fa-f0-9])'
URL_DF_REGEX = '\n            (?P<protocol>(https?|hXXps?|s?ftps?|s?fXps?|telnet|ldap|file)://)\n            (?P<userinfo>([a-z0-9-._~!$&\\\'()*+,;=:]|%[0-9A-F]{2})*@)?\n            (?P<host>([a-z0-9-._~!$&\\\'()*+,;=\\[\\]]|%[0-9A-F]{2})*)\n            (:(?P<port>\\d*))?\n            (/(?P<path>([^?\\#"<>\\s]|%[0-9A-F]{2})*/?))?\n            (\\?(?P<query>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?\n            (\\#(?P<fragment>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?'
URL_REGEX = '\n            (?P<protocol>(https?|s?ftps?|telnet|ldap|file)://)\n            (?P<userinfo>([a-z0-9-._~!$&\\\'()*+,;=:]|%[0-9A-F]{2})*@)?\n            (?P<host>([a-z0-9-._~!$&\\\'()*+,;=]|%[0-9A-F]{2})*)\n            (:(?P<port>\\d*))?\n            (/(?P<path>([^?\\#"<>\\s]|%[0-9A-F]{2})*/?))?\n            (\\?(?P<query>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?\n            (\\#(?P<fragment>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?'
WINPATH_REGEX = '\n            (?P<root>[a-z]:|\\\\\\\\[a-z0-9_.$-]+||[.]+)\n            (?P<folder>\\\\(?:[^\\/:*?"\\\'<>|\\r\\n]+\\\\)*)\n            (?P<file>[^\\\\/*?""<>|\\r\\n ]+)'
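To see how a defanged (`_DF_`) pattern differs from its standard counterpart, the following sketch applies the two IPv4 patterns directly with Python's `re` module. The pattern strings are copied from the `IPV4_REGEX` and `IPV4_DF_REGEX` attributes above; the sample text and the `refang` helper are illustrative assumptions and are not part of msticpy.

```python
import re

# Pattern strings copied from the IPV4_REGEX and IPV4_DF_REGEX attributes above.
IPV4_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})"
IPV4_DF_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\[?\.\]?){3}[0-9]{1,3})"

# Made-up sample text: one defanged address and one ordinary address.
text = "beacon to 10.0.0[.]5 then 192.168.0.1"

standard_hits = [m.group("ipaddress") for m in re.finditer(IPV4_REGEX, text)]
defanged_hits = [m.group("ipaddress") for m in re.finditer(IPV4_DF_REGEX, text)]


def refang(value: str) -> str:
    """Hypothetical helper: strip the square brackets around dots."""
    return value.replace("[.]", ".")


print(standard_hits)                           # only the ordinary address
print(defanged_hits)                           # both addresses
print([refang(hit) for hit in defanged_hits])  # brackets removed
```

The standard pattern skips the bracketed address because it requires a bare dot between octets, while the defanged pattern also accepts optional `[` and `]` around each dot.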
add_ioc_type(ioc_type: str, ioc_regex: str, priority: int = 0, group: Optional[str] = None)

Add an IoC type and regular expression to use to the built-in set.

Parameters
  • ioc_type (str) – A unique name for the IoC type

  • ioc_regex (str) – A regular expression used to search for the type

  • priority (int, optional) – Priority of the regex match vs. other ioc_patterns. 0 is the highest priority (the default is 0).

  • group (str, optional) – The regex group to match (the default is None, which will match on the whole expression)

Notes

Pattern priorities.

If two IoCType patterns match the same substring, the matched substring is assigned to the pattern/IoCType with the highest priority. E.g. foo.bar.com matches the types dns, windows_path and linux_path, but since dns has a higher priority, the substring is assigned to the dns matches.
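The priority rule can be illustrated with a small standalone sketch. It is not msticpy's implementation: the two regular expressions are simplified stand-ins, the priority values are assumed for illustration, and `resolve` is a hypothetical helper. Only the tuple field order (ioc_type, comp_regex, priority, group) and the convention that 0 is the highest priority come from this page.

```python
import re
from collections import namedtuple

# Same field order as the IoCPattern tuple documented in this module.
IoCPattern = namedtuple("IoCPattern", ["ioc_type", "comp_regex", "priority", "group"])

# Hypothetical, simplified patterns and priorities (lower number = higher priority).
PATTERNS = [
    IoCPattern("dns", re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,63}"), 1, None),
    IoCPattern("linux_path", re.compile(r"[^\s/]+"), 2, None),
]


def resolve(text):
    """Assign each matched substring to the type with the best priority."""
    best = {}
    for pattern in PATTERNS:
        for match in pattern.comp_regex.finditer(text):
            # group=None means "use the whole match".
            value = match.group() if pattern.group is None else match.group(pattern.group)
            current = best.get(value)
            if current is None or pattern.priority < current.priority:
                best[value] = pattern
    return {value: pattern.ioc_type for value, pattern in best.items()}


result = resolve("foo.bar.com")
print(result)
```

Both stand-in patterns match the whole string foo.bar.com, and the substring ends up under dns because its priority number is lower.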

extract(src: Optional[str] = None, data: Optional[DataFrame] = None, columns: Optional[List[str]] = None, **kwargs) → Union[Dict[str, Set[str]], DataFrame]

Extract IoCs from either a string or pandas DataFrame.

Parameters
  • src (str, optional) – source string in which to look for IoC patterns (the default is None)

  • data (pd.DataFrame, optional) – input DataFrame from which to read source strings (the default is None)

  • columns (list, optional) – The list of columns to use as source strings, if the data parameter is used. (the default is None)

  • ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)

  • include_paths (bool, optional) – Whether to include path matches, which can be noisy (the default is False, which excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified, this parameter is ignored.

  • ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.

  • defanged (bool, optional) – If False, removes any [] from email, DNS and IP entities.

Returns

dict of found observables (if input is a string) or DataFrame of observables

Return type

Any

Notes

extract() takes either a string or a pandas DataFrame as input. With string input, it returns a dictionary of results. With DataFrame input, the results are returned as a new DataFrame with the following columns:

  • IoCType: the mnemonic used to distinguish different IoC types

  • Observable: the actual value of the observable

  • SourceIndex: the index of the row in the input DataFrame from which the IoC observable was extracted

IoCType pattern selection

The default list is: [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.
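The shape of the string-mode result (a dictionary mapping each IoC type name to a set of observables) can be sketched with the standard library alone. This is an illustration, not msticpy's code: `extract_sketch` and `PATTERNS` are hypothetical names, only two of the built-in patterns are included, and the pattern strings are copied from the `IPV4_REGEX` and `MD5_REGEX` attributes listed earlier.

```python
import re
from collections import defaultdict

# (pattern, named group to report) pairs; patterns copied from the
# IPV4_REGEX and MD5_REGEX class attributes.
PATTERNS = {
    "ipv4": (r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})", "ipaddress"),
    "md5_hash": (
        r"(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])",
        "hash",
    ),
}


def extract_sketch(src, ioc_types=None):
    """Return {type name: set of observables} for the selected types."""
    selected = ioc_types or list(PATTERNS)
    results = defaultdict(set)
    for ioc_type in selected:
        regex, group = PATTERNS[ioc_type]
        for match in re.finditer(regex, src):
            results[ioc_type].add(match.group(group))
    return dict(results)


# Made-up sample text containing one IPv4 address and one MD5-length hex string.
found = extract_sketch(
    "host 192.168.0.1 dropped d41d8cd98f00b204e9800998ecf8427e twice"
)
print(found)
```

Passing `ioc_types=["ipv4"]` to the sketch restricts matching to that single type, mirroring the role of the ioc_types parameter described above.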

extract_df(data: DataFrame, columns: Union[str, List[str]], **kwargs) → DataFrame

Extract IoCs from a pandas DataFrame.

Parameters
  • data (pd.DataFrame) – input DataFrame from which to read source strings

  • columns (Union[str, list]) – A single column name as a string, or a list of columns to use as source strings

  • ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)

  • include_paths (bool, optional) – Whether to include path matches, which can be noisy (the default is False, which excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified, this parameter is ignored.

  • ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.

Returns

DataFrame of observables

Return type

pd.DataFrame

Notes

extract_df() takes a pandas DataFrame as input. The results are returned as a new DataFrame with the following columns:

  • IoCType: the mnemonic used to distinguish different IoC types

  • Observable: the actual value of the observable

  • SourceIndex: the index of the row in the input DataFrame from which the IoC observable was extracted

IoCType pattern selection

The default list is: [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.
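The three output columns can be illustrated without pandas by treating a column as a list of (row index, text) pairs and emitting one record per observable. This is only a sketch of the result layout described in the Notes above: the sample rows are made up, plain dictionaries stand in for DataFrame rows, and the de-duplication within a row is an assumption rather than documented behaviour.

```python
import re

# Pattern string copied from the IPV4_REGEX class attribute.
IPV4_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})"

# Stand-in for one DataFrame column: (row index, cell text) pairs. Made-up data.
rows = [
    (0, "login from 10.1.2.3"),
    (1, "no indicators here"),
    (2, "10.1.2.3 contacted 172.16.0.9"),
]

records = []
for index, text in rows:
    # Assumption: report each observable once per row, in sorted order.
    observables = sorted({m.group("ipaddress") for m in re.finditer(IPV4_REGEX, text)})
    for observable in observables:
        records.append(
            {"IoCType": "ipv4", "Observable": observable, "SourceIndex": index}
        )

for record in records:
    print(record)
```

Row 1 contributes no records, and SourceIndex lets each observable be traced back to the row it came from.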

static file_hash_type(file_hash: str) → IoCType

Return specific IoCType based on hash length.

Parameters

file_hash (str) – File hash string

Returns

Specific hash type or unknown.

Return type

IoCType
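A length-based classification like this can be sketched as follows. The lengths (32, 40 and 64 hex characters) follow from the `MD5_REGEX`, `SHA1_REGEX` and `SHA256_REGEX` patterns above; the `HashType` enum and `hash_type_by_length` function are hypothetical stand-ins, not msticpy's own code.

```python
from enum import Enum


class HashType(Enum):
    """Hypothetical stand-in for the hash-related members of IoCType."""

    md5_hash = "md5_hash"
    sha1_hash = "sha1_hash"
    sha256_hash = "sha256_hash"
    unknown = "unknown"


# Hex-digest lengths implied by MD5_REGEX, SHA1_REGEX and SHA256_REGEX.
_LENGTH_TO_TYPE = {
    32: HashType.md5_hash,
    40: HashType.sha1_hash,
    64: HashType.sha256_hash,
}


def hash_type_by_length(file_hash: str) -> HashType:
    """Classify a hash string purely by its length."""
    return _LENGTH_TO_TYPE.get(len(file_hash), HashType.unknown)


print(hash_type_by_length("a" * 40))
print(hash_type_by_length("abc"))
```

Note that this sketch checks only the length, so it would not reject a 32-character string containing non-hex characters.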

get_ioc_type(observable: str) → str

Return first matching type.

Parameters

observable (str) – The IoC Observable to check

Returns

The IoC type enumeration (unknown, if no match)

Return type

str
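One way to picture "first matching type" is an ordered scan over patterns that returns the first name whose pattern matches the whole observable. The sketch below is an assumption about the general idea, not a description of how msticpy implements it: `first_matching_type` and `ORDERED_PATTERNS` are hypothetical names, and only two patterns (copied from `IPV4_REGEX` and `DNS_REGEX`) are included.

```python
import re

# Ordered (name, pattern) pairs; pattern strings copied from the class attributes.
ORDERED_PATTERNS = [
    ("ipv4", r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})"),
    ("dns", r"((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.){1,126}[a-z]{2,63}"),
]


def first_matching_type(observable: str) -> str:
    """Return the first type whose pattern matches the entire string."""
    for name, pattern in ORDERED_PATTERNS:
        if re.fullmatch(pattern, observable):
            return name
    return "unknown"


print(first_matching_type("192.168.0.1"))
print(first_matching_type("example.com"))
print(first_matching_type("hello world"))
```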

property ioc_types: dict

Return the current set of IoC types and regular expressions.

Returns

dict of IoC Type names and regular expressions

Return type

dict

validate(input_str: str, ioc_type: str, ignore_tlds: bool = False) → bool

Check that input_str matches the regex for the specified ioc_type.

Parameters
  • input_str (str) – the string to test

  • ioc_type (str) – the regex pattern to use

  • ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.

Returns

True if match.

Return type

bool
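The interaction between the pattern check and ignore_tlds can be sketched for the dns type. This is a hedged illustration: it assumes that validation requires the whole string to match, `validate_dns` is a hypothetical function, and `KNOWN_TLDS` is a tiny made-up stand-in for the official Top Level Domains list. The pattern string is copied from `DNS_REGEX`.

```python
import re

# Pattern string copied from the DNS_REGEX class attribute.
DNS_REGEX = r"((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.){1,126}[a-z]{2,63}"

# Tiny hypothetical stand-in for the official TLD list.
KNOWN_TLDS = {"com", "org", "net"}


def validate_dns(input_str: str, ignore_tlds: bool = False) -> bool:
    """Check the whole string against the pattern, then optionally the TLD."""
    if re.fullmatch(DNS_REGEX, input_str) is None:
        return False
    if ignore_tlds:
        return True
    # The final label is treated as the top-level domain.
    return input_str.rsplit(".", 1)[-1] in KNOWN_TLDS


print(validate_dns("example.com"))
print(validate_dns("server.internalzone"))
print(validate_dns("server.internalzone", ignore_tlds=True))
print(validate_dns("see example.com now"))
```

In this sketch a name with an unrecognised final label passes only when ignore_tlds is True, and a string that merely contains a domain fails because the whole input must match.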

class msticpy.transform.iocextract.IoCExtractAccessor(pandas_obj)

Bases: object

Pandas api extension for IoC Extractor.

Instantiate pandas extension class.

extract(columns, **kwargs)

Extract IoCs from a pandas DataFrame.

Parameters
  • columns (list) – The list of columns to use as source strings

  • ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)

  • include_paths (bool, optional) – Whether to include path matches, which can be noisy (the default is False, which excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified, this parameter is ignored.

Returns

DataFrame of observables

Return type

pd.DataFrame

Notes

extract() takes a pandas DataFrame as input. The results are returned as a new DataFrame with the following columns:

  • IoCType: the mnemonic used to distinguish different IoC types

  • Observable: the actual value of the observable

  • SourceIndex: the index of the row in the input DataFrame from which the IoC observable was extracted

IoCType pattern selection

The default list is: [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.

class msticpy.transform.iocextract.IoCPattern(ioc_type, comp_regex, priority, group)

Bases: tuple

Create new instance of IoCPattern(ioc_type, comp_regex, priority, group)

comp_regex

Alias for field number 1

count(value, /)

Return number of occurrences of value.

group

Alias for field number 3

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

ioc_type

Alias for field number 0

priority

Alias for field number 2
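Because IoCPattern is a tuple with four named fields, its behaviour can be mirrored with `collections.namedtuple`. The sketch below rebuilds an equivalent tuple locally rather than importing msticpy; the priority value and the sample input are illustrative assumptions, and the pattern string is copied from `MD5_REGEX`.

```python
import re
from collections import namedtuple

# Local equivalent with the documented field order (0..3).
IoCPattern = namedtuple("IoCPattern", ["ioc_type", "comp_regex", "priority", "group"])

md5 = IoCPattern(
    ioc_type="md5_hash",
    comp_regex=re.compile(
        r"(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])"
    ),
    priority=1,  # illustrative value, not taken from msticpy
    group="hash",
)

# Fields are reachable by name or by position.
assert md5[0] == md5.ioc_type
assert md5[1] is md5.comp_regex

# The group field names the part of the match to report.
match = md5.comp_regex.search("hash=" + "0" * 32)
value = match.group(md5.group)
print(value)
```

Reporting the named group rather than the whole match keeps the leading `=` delimiter out of the extracted value.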

class msticpy.transform.iocextract.IoCType(value)

Bases: Enum

Enumeration of IoC Types.

dns = 'dns'
email = 'email'
file_hash = 'file_hash'
hostname = 'hostname'
ipv4 = 'ipv4'
ipv6 = 'ipv6'
linux_path = 'linux_path'
md5_hash = 'md5_hash'
classmethod parse(value: str) → IoCType

Return parsed IoCType of string.

Parameters

value (str) – Enumeration name

Returns

IoCType matching name or unknown if no match

Return type

IoCType

sha1_hash = 'sha1_hash'
sha256_hash = 'sha256_hash'
unknown = 'unknown'
url = 'url'
windows_path = 'windows_path'
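The fallback behaviour of parse (return unknown when no member matches) can be sketched with a small enum. `IoCTypeSketch` is a hypothetical, shortened stand-in for IoCType, and looking members up by exact name without case normalisation is an assumption made for this illustration.

```python
from enum import Enum


class IoCTypeSketch(Enum):
    """Hypothetical, shortened stand-in for IoCType."""

    unknown = "unknown"
    ipv4 = "ipv4"
    dns = "dns"

    @classmethod
    def parse(cls, value: str) -> "IoCTypeSketch":
        """Return the member with this name, or unknown if there is none."""
        try:
            return cls[value]
        except KeyError:
            return cls.unknown


print(IoCTypeSketch.parse("dns"))
print(IoCTypeSketch.parse("not_a_type"))
```

Returning a sentinel member instead of raising lets callers branch on the result without wrapping every lookup in exception handling.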