msticpy.transform.iocextract module

Module for IoCExtract class.

Uses a set of built-in regular expressions to look for Indicator of Compromise (IoC) patterns. Input can be a single string or a pandas DataFrame with one or more columns specified as input.

The following types are built-in:

  • IPv4 and IPv6

  • URL

  • DNS domain

  • Hashes (MD5, SHA1, SHA256)

  • Windows file paths

  • Linux file paths (these are noisy because a legal Linux file path can contain almost any character)

You can modify or add to the regular expressions used at runtime.

class msticpy.transform.iocextract.IoCExtract(defanged: bool = True)

Bases: object

IoC Extractor - looks for common IoC patterns in input strings.

The extract() method takes either a string or a pandas DataFrame as input. With string input, extract() returns a dictionary of results. With DataFrame input, the results are returned as a new DataFrame with the following columns:

  • IoCType: the mnemonic used to distinguish different IoC Types

  • Observable: the actual value of the observable

  • SourceIndex: the index of the row in the input DataFrame from which the source for the IoC observable was extracted.

The class has a number of built-in IoC regex definitions. These can be retrieved using the ioc_types attribute.

Additional IoC definitions can be added using the add_ioc_type method.

Note: due to some ambiguity in the regular expression patterns for different types, an observable may be returned assigned to multiple observable types. E.g. 192.168.0.1 is also a legal file name in both Linux and Windows. Linux file names have a particularly large scope in terms of legal characters, so it will be quite common to see other IoC observables (or parts of them) returned as a possible linux path.

Initialize new instance of IoCExtract.

Parameters:

defanged (bool) – If True, the regex will be used to match defanged IoC patterns

DF_AT = '(@|\\[at\\])'
DNS_DF_REGEX = '((?=[a-z0-9-]{1,63}\\[?\\.\\]?)[a-z0-9]+(-[a-z0-9]+)*\\[?\\.\\]?){1,126}[a-z]{2,63}'
DNS_REGEX = '((?=[a-z0-9-]{1,63}\\.)[a-z0-9]+(-[a-z0-9]+)*\\.){1,126}[a-z]{2,63}'
EMAIL_DF_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)(@|\\[at\\])(?P<domain>((?=[a-z0-9-]{1,63}\\[?\\.\\]?)[a-z0-9]+(-[a-z0-9]+)*\\[?\\.\\]?){1,126}[a-z]{2,63})"
EMAIL_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)@(?P<domain>((?=[a-z0-9-]{1,63}\\.)[a-z0-9]+(-[a-z0-9]+)*\\.){1,126}[a-z]{2,63})"
EMAIL_USER_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)"
IPV4_DF_REGEX = '(?P<ipaddress>(?:[0-9]{1,3}\\[?\\.\\]?){3}[0-9]{1,3})'
IPV4_REGEX = '(?P<ipaddress>(?:[0-9]{1,3}\\.){3}[0-9]{1,3})'
IPV6_REGEX = '(?<![:.\\w])(?:[A-F0-9]{0,4}:){2,7}[A-F0-9]{0,4}(?![:.\\w])'
LXPATH_REGEX = '(?P<root>/+||[.]+)\n            (?P<folder>/(?:[^\\\\/:*?<>|\\r\\n]+/)*)\n            (?P<file>[^/\\0<>|\\r\\n ]+)'
LXSTDPATH_REGEX = '\n            (?P<root>/|/bin|/boot|/dev|/home|/lib|/lost\\\\+found|/misc|/mnt|/net|/opt|/proc|/root|/sbin|/tmp|/usr|/var)\n            (?P<folder>/(?:[^\\\\/:*?<>|\\r\\n]+/)*)\n            (?P<file>[^/\\0<>|\\r\\n ]+)\n    '
MD5_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])'
SHA1_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{40})(?:$|[^A-Fa-f0-9])'
SHA256_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{64})(?:$|[^A-Fa-f0-9])'
URL_DF_REGEX = '\n            (?P<protocol>(https?|hXXps?|s?ftps?|s?fXps?|telnet|ldap|file)://)\n            (?P<userinfo>([a-z0-9-._~!$&\\\'()*+,;=:]|%[0-9A-F]{2})*@)?\n            (?P<host>([a-z0-9-._~!$&\\\'()*+,;=\\[\\]]|%[0-9A-F]{2})*)\n            (:(?P<port>\\d*))?\n            (/(?P<path>([^?\\#"<>\\s]|%[0-9A-F]{2})*/?))?\n            (\\?(?P<query>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?\n            (\\#(?P<fragment>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?'
URL_REGEX = '\n            (?P<protocol>(https?|s?ftps?|telnet|ldap|file)://)\n            (?P<userinfo>([a-z0-9-._~!$&\\\'()*+,;=:]|%[0-9A-F]{2})*@)?\n            (?P<host>([a-z0-9-._~!$&\\\'()*+,;=]|%[0-9A-F]{2})*)\n            (:(?P<port>\\d*))?\n            (/(?P<path>([^?\\#"<>\\s]|%[0-9A-F]{2})*/?))?\n            (\\?(?P<query>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?\n            (\\#(?P<fragment>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?'
WINPATH_REGEX = '\n            (?P<root>[a-z]:|\\\\\\\\[a-z0-9_.$-]+||[.]+)\n            (?P<folder>\\\\(?:[^\\/:*?"\\\'<>|\\r\\n]+\\\\)*)\n            (?P<file>[^\\\\/*?""<>|\\r\\n ]+)'
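The difference between a plain pattern and its defanged (_DF_) counterpart can be seen by running the two IPv4 patterns above through Python's re module. This is a minimal sketch, not the class's own code path: the pattern strings are copied from the attributes listed above, while the sample text and variable names are made up for illustration.

```python
import re

# Patterns copied from the IPV4_REGEX and IPV4_DF_REGEX attributes above.
IPV4_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})"
IPV4_DF_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\[?\.\]?){3}[0-9]{1,3})"

# Hypothetical sample text with one defanged and one plain address.
text = "beacon to 192[.]168[.]0[.]1 and 10.0.0.5"

# The plain pattern only finds the address written with bare dots.
plain_hits = [m.group("ipaddress") for m in re.finditer(IPV4_REGEX, text)]

# The defanged pattern accepts "[.]" as well as ".".
defanged_hits = [m.group("ipaddress") for m in re.finditer(IPV4_DF_REGEX, text)]

print(plain_hits)
print(defanged_hits)
```

Because the brackets are optional in the defanged pattern, it matches both forms of the address.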
add_ioc_type(ioc_type: str, ioc_regex: str, priority: int = 0, group: str | None = None, defang_pattern: bool | None = None) None

Add an IoC type and regular expression to use to the built-in set.

Parameters:
  • ioc_type (str) – A unique name for the IoC type

  • ioc_regex (str) – A regular expression used to search for the type

  • priority (int, optional) – Priority of the regex match vs. other ioc_patterns. 0 is the highest priority (the default is 0).

  • group (str, optional) – The regex group to match (the default is None, which will match on the whole expression)

  • defang_pattern (bool, optional) – If True, the regex will be used to match defanged patterns. If False, the regex will be used to match non-defanged patterns. If None, the regex will be used to match both defanged and non-defanged patterns.
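The purpose of the group parameter is easiest to see with a pattern that consumes delimiter characters around the value of interest. The sketch below uses the MD5_REGEX string from the attribute list with the standard re module; the sample text is hypothetical, and the code only illustrates the difference between the whole match and a named group, not how the class applies it internally.

```python
import re

# Pattern copied from the MD5_REGEX attribute above.
MD5_REGEX = r"(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])"

# Hypothetical sample text.
text = "md5=d41d8cd98f00b204e9800998ecf8427e;"
match = re.search(MD5_REGEX, text)

whole_match = match.group(0)     # includes the surrounding delimiter characters
hash_only = match.group("hash")  # only the 32 hex digits

print(repr(whole_match))
print(repr(hash_only))
```

Naming the group (here "hash") when registering a pattern of this shape would keep the delimiters out of the reported observable.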

Notes

Pattern priorities.

If two IocType patterns match on the same substring, the matched substring is assigned to the pattern/IocType with the highest priority. E.g. foo.bar.com will match types: dns, windows_path and linux_path but since dns has a higher priority, the expression is assigned to the dns matches.
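The priority rule can be sketched in a few lines of plain Python. Everything here is illustrative: SketchPattern is a hypothetical stand-in that mirrors the IoCPattern fields, the two regexes and their priority values are invented for the example, and resolve() is one possible way to apply "lowest number wins", not the library's implementation.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchPattern:
    """Hypothetical stand-in mirroring the IoCPattern fields."""

    ioc_type: str
    comp_regex: re.Pattern
    priority: int
    group: Optional[str] = None


def resolve(text, patterns):
    """Assign each matched substring to the type with the lowest priority number."""
    best = {}
    for pat in patterns:
        for m in pat.comp_regex.finditer(text):
            value = m.group(pat.group) if pat.group else m.group(0)
            current = best.get(value)
            # 0 is the highest priority, so a smaller number replaces a larger one.
            if current is None or pat.priority < current.priority:
                best[value] = pat
    return {value: pat.ioc_type for value, pat in best.items()}


# Invented patterns and priorities, for illustration only.
patterns = [
    SketchPattern("loose_name", re.compile(r"[a-z0-9.]+"), priority=2),
    SketchPattern("dns", re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,63}"), priority=1),
]

assigned = resolve("foo.bar.com", patterns)
print(assigned)
```

Both patterns match the full string, and the one with the smaller priority number keeps it.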

extract(src: str | None = None, data: pd.DataFrame | None = None, columns: list[str] | None = None, **kwargs) dict[str, set[str]] | pd.DataFrame

Extract IoCs from either a string or pandas DataFrame.

Parameters:
  • src (str, optional) – source string in which to look for IoC patterns (the default is None)

  • data (pd.DataFrame, optional) – input DataFrame from which to read source strings (the default is None)

  • columns (list, optional) – The list of columns to use as source strings, if the data parameter is used. (the default is None)

  • ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)

  • include_paths (bool, optional) – Whether to include path matches (which can be noisy) (the default is false - excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified this parameter is ignored.

  • ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.

  • defanged (bool, optional) – If True, will match defanged versions of email, dns, url and ip entities.

Returns:

dict of found observables (if input is a string) or DataFrame of observables

Return type:

Any

Notes

Extract takes either a string or a pandas DataFrame as input. With string input, extract returns a dictionary of results. With DataFrame input, the results are returned as a new DataFrame with the following columns:

  • IoCType: the mnemonic used to distinguish different IoC Types

  • Observable: the actual value of the observable

  • SourceIndex: the index of the row in the input DataFrame from which the source for the IoC observable was extracted.

IoCType Pattern selection

The default list is: [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.
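For string input, the documented return type is dict[str, set[str]]: one key per IoC type, each holding the distinct observables found. The following sketch reproduces that shape with the standard re module. The two pattern strings are copied from the IPV4_REGEX and DNS_REGEX attributes; sketch_extract and the sample text are hypothetical, and the sketch omits steps the real method may perform, such as priority handling and the Top Level Domain check.

```python
import re
from collections import defaultdict

# Patterns copied from the IPV4_REGEX and DNS_REGEX attributes above.
SKETCH_PATTERNS = {
    "ipv4": r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})",
    "dns": r"((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.){1,126}[a-z]{2,63}",
}


def sketch_extract(src):
    """Return {ioc_type: set_of_observables} for a single source string."""
    found = defaultdict(set)
    for ioc_type, pattern in SKETCH_PATTERNS.items():
        for m in re.finditer(pattern, src, re.IGNORECASE):
            found[ioc_type].add(m.group(0))
    return dict(found)


result = sketch_extract("GET evil.example.com from 10.0.0.5, again 10.0.0.5")
print(result)
```

Using a set per type means a value that appears several times in the source is reported once.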

extract_df(data: pd.DataFrame, columns: str | list[str], **kwargs) pd.DataFrame

Extract IoCs from a pandas DataFrame.

Parameters:
  • data (pd.DataFrame) – input DataFrame from which to read source strings

  • columns (Union[str, list]) – A single column name as a string, or a list of columns to use as source strings.

  • ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)

  • include_paths (bool, optional) – Whether to include path matches (which can be noisy) (the default is false - excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified this parameter is ignored.

  • ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.

  • defanged (bool, optional) – If True, will match defanged versions of email, dns, url and ip entities.

Returns:

DataFrame of observables

Return type:

pd.DataFrame

Notes

Extract takes a pandas DataFrame as input. The results are returned as a new DataFrame with the following columns:

  • IoCType: the mnemonic used to distinguish different IoC Types

  • Observable: the actual value of the observable

  • SourceIndex: the index of the row in the input DataFrame from which the source for the IoC observable was extracted.

IoCType Pattern selection

The default list is: [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.
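The three output columns can be pictured as one row per observable, with SourceIndex pointing back at the input row. The sketch below builds such rows as plain dictionaries so that it runs without pandas; the column dictionary is a hypothetical stand-in for a single DataFrame column (index label mapped to cell text), and only the IPV4_REGEX string is taken from the attribute list.

```python
import re

# Pattern copied from the IPV4_REGEX attribute above.
IPV4_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})"

# Hypothetical stand-in for one DataFrame column: index label -> cell text.
column = {
    0: "login from 10.0.0.5",
    1: "no indicators here",
    2: "10.1.1.1 -> 10.2.2.2",
}

# One output row per match, tagged with the index of the row it came from.
rows = [
    {"IoCType": "ipv4", "Observable": m.group("ipaddress"), "SourceIndex": idx}
    for idx, cell in column.items()
    for m in re.finditer(IPV4_REGEX, cell)
]

for row in rows:
    print(row)
```

Input rows with no match contribute nothing, and a row with several observables contributes several output rows sharing the same SourceIndex.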

static file_hash_type(file_hash: str) IoCType

Return specific IoCType based on hash length.

Parameters:

file_hash (str) – File hash string

Returns:

Specific hash type or unknown.

Return type:

IoCType
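Classifying a hash by its length can be sketched as a simple lookup. The lengths 32, 40 and 64 come from the MD5_REGEX, SHA1_REGEX and SHA256_REGEX attributes above; sketch_hash_type is a hypothetical helper that returns type names as strings rather than IoCType members, and it does not check that the input is hexadecimal.

```python
# Hex digit counts taken from MD5_REGEX, SHA1_REGEX and SHA256_REGEX above.
HASH_LENGTHS = {32: "md5_hash", 40: "sha1_hash", 64: "sha256_hash"}


def sketch_hash_type(file_hash: str) -> str:
    """Return the hash type name for a known length, otherwise 'unknown'."""
    return HASH_LENGTHS.get(len(file_hash), "unknown")


print(sketch_hash_type("d41d8cd98f00b204e9800998ecf8427e"))
print(sketch_hash_type("abc"))
```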

get_ioc_type(observable: str) str

Return first matching type.

Parameters:

observable (str) – The IoC Observable to check

Returns:

The IoC type enumeration (unknown, if no match)

Return type:

str

property ioc_df_types: dict

Return current set of IoC types and regular expressions for defanged IoCs.

Returns:

dict of IoC Type names and regular expressions

Return type:

dict

property ioc_types: dict

Return the current set of IoC types and regular expressions.

Returns:

dict of IoC Type names and regular expressions

Return type:

dict

validate(input_str: str, ioc_type: str, ignore_tlds: bool = False, defanged: bool | None = None) bool

Check that input_str matches the regex for the specified ioc_type.

Parameters:
  • input_str (str) – the string to test

  • ioc_type (str) – the regex pattern to use

  • ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.

  • defanged (bool, optional) – If True, the input string will also match defanged versions of the IoC, default is False.

Returns:

True if match.

Return type:

bool
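The effect of the defanged flag on validation can be approximated with the two DNS patterns from the attribute list. This is a sketch under stated assumptions: sketch_validate is a hypothetical helper, the use of re.fullmatch is an assumption about what "matches the regex" means here, and the Top Level Domain check controlled by ignore_tlds is left out.

```python
import re

# Patterns copied from the DNS_REGEX and DNS_DF_REGEX attributes above.
DNS_REGEX = r"((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.){1,126}[a-z]{2,63}"
DNS_DF_REGEX = (
    r"((?=[a-z0-9-]{1,63}\[?\.\]?)[a-z0-9]+(-[a-z0-9]+)*\[?\.\]?){1,126}[a-z]{2,63}"
)


def sketch_validate(input_str: str, defanged: bool = False) -> bool:
    """Return True if the whole string matches the chosen DNS pattern."""
    pattern = DNS_DF_REGEX if defanged else DNS_REGEX
    return re.fullmatch(pattern, input_str, re.IGNORECASE) is not None


print(sketch_validate("evil.example.com"))
print(sketch_validate("evil[.]example[.]com"))
print(sketch_validate("evil[.]example[.]com", defanged=True))
```

The defanged pattern treats the brackets as optional, so it also accepts the plain form of the domain.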

class msticpy.transform.iocextract.IoCExtractAccessor(pandas_obj)

Bases: object

Pandas api extension for IoC Extractor.

Instantiate pandas extension class.

extract(columns, **kwargs)

Extract IoCs from a pandas DataFrame.

Parameters:
  • columns (list) – The list of columns to use as source strings.

  • ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)

  • include_paths (bool, optional) – Whether to include path matches (which can be noisy) (the default is false - excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified this parameter is ignored.

Returns:

DataFrame of observables

Return type:

pd.DataFrame

Notes

Extract takes a pandas DataFrame as input. The results are returned as a new DataFrame with the following columns:

  • IoCType: the mnemonic used to distinguish different IoC Types

  • Observable: the actual value of the observable

  • SourceIndex: the index of the row in the input DataFrame from which the source for the IoC observable was extracted.

IoCType Pattern selection

The default list is: [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.

class msticpy.transform.iocextract.IoCPattern(ioc_type: str, comp_regex: re.Pattern[str], priority: int, group: str | None)

Bases: object

Define patterns for IOC.

Method generated by attrs for class IoCPattern.

comp_regex: re.Pattern[str]
group: str | None
ioc_type: str
priority: int
class msticpy.transform.iocextract.IoCType(value)

Bases: Enum

Enumeration of IoC Types.

dns = 'dns'
email = 'email'
file_hash = 'file_hash'
hostname = 'hostname'
ipv4 = 'ipv4'
ipv6 = 'ipv6'
linux_path = 'linux_path'
md5_hash = 'md5_hash'
classmethod parse(value: str) IoCType

Return parsed IoCType of string.

Parameters:

value (str) – Enumeration name

Returns:

IoCType matching name or unknown if no match

Return type:

IoCType

sha1_hash = 'sha1_hash'
sha256_hash = 'sha256_hash'
unknown = 'unknown'
url = 'url'
windows_path = 'windows_path'
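The parse classmethod described above returns the unknown member when a name does not match. That fallback pattern can be sketched with a small stand-alone Enum. SketchIoCType is hypothetical and carries only a few of the members listed above; the lookup by member name and the exception handling are assumptions about one reasonable way to implement the documented behavior.

```python
from enum import Enum


class SketchIoCType(Enum):
    """Hypothetical cut-down enum with a parse() that falls back to unknown."""

    unknown = "unknown"
    ipv4 = "ipv4"
    dns = "dns"
    md5_hash = "md5_hash"

    @classmethod
    def parse(cls, value: str) -> "SketchIoCType":
        """Return the member named value, or unknown if there is none."""
        try:
            return cls[value]
        except KeyError:
            return cls.unknown


print(SketchIoCType.parse("dns"))
print(SketchIoCType.parse("bogus"))
```

Returning a sentinel member instead of raising lets callers compare against unknown without wrapping every lookup in exception handling.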