msticpy.transform.iocextract module
Module for IoCExtract class.
Uses a set of builtin regular expressions to look for Indicator of Compromise (IoC) patterns. Input can be a single string or a pandas dataframe with one or more columns specified as input.
The following types are built-in:
IPv4 and IPv6
URL
DNS domain
Hashes (MD5, SHA1, SHA256)
Windows file paths
Linux file paths (these are noisy, because a legal Linux file path can contain almost any character)
You can modify or add to the regular expressions used at runtime.
- class msticpy.transform.iocextract.IoCExtract(defanged: bool = True)
Bases:
object
IoC Extractor - looks for common IoC patterns in input strings.
The extract() method takes either a string or a pandas DataFrame as input. Given a string, extract() returns a dictionary of results. Given a DataFrame, the results are returned as a new DataFrame with the following columns:
- IoCType: the mnemonic used to distinguish different IoC types
- Observable: the actual value of the observable
- SourceIndex: the index of the row in the input DataFrame from which the IoC observable was extracted
The class has a number of built-in IoC regex definitions. These can be retrieved using the ioc_types attribute.
Additional IoC definitions can be added using the add_ioc_type method.
Note: due to some ambiguity in the regular expression patterns for different types, an observable may be returned assigned to multiple observable types. E.g. 192.168.0.1 is also a legal file name in both Linux and Windows. Linux file names have a particularly large scope in terms of legal characters, so it is quite common to see other IoC observables (or parts of them) returned as a possible Linux path.
Initialize new instance of IoCExtract.
- Parameters:
defanged (bool) – If True, the regex will be used to match defanged IoC patterns
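As a rough illustration of what matching defanged patterns means, the following standalone sketch applies the IPV4_REGEX and IPV4_DF_REGEX patterns documented in this class directly with Python's re module. It does not call IoCExtract itself, and the sample strings are made up.

```python
import re

# Patterns copied from the IPV4_REGEX and IPV4_DF_REGEX class attributes.
IPV4_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})"
IPV4_DF_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\[?\.\]?){3}[0-9]{1,3})"

plain = "192.168.0.1"
defanged = "192[.]168[.]0[.]1"  # dots wrapped in brackets

# The standard pattern only recognizes the plain form.
plain_hits = [bool(re.search(IPV4_REGEX, s)) for s in (plain, defanged)]
# The defanged pattern treats the brackets as optional, so it accepts both.
df_hits = [bool(re.search(IPV4_DF_REGEX, s)) for s in (plain, defanged)]

print(plain_hits)  # [True, False]
print(df_hits)     # [True, True]
```

Because the brackets are optional in the defanged pattern, it is a superset of the standard one for this type.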
- DF_AT = '(@|\\[at\\])'
- DNS_DF_REGEX = '((?=[a-z0-9-]{1,63}\\[?\\.\\]?)[a-z0-9]+(-[a-z0-9]+)*\\[?\\.\\]?){1,126}[a-z]{2,63}'
- DNS_REGEX = '((?=[a-z0-9-]{1,63}\\.)[a-z0-9]+(-[a-z0-9]+)*\\.){1,126}[a-z]{2,63}'
- EMAIL_DF_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)(@|\\[at\\])(?P<domain>((?=[a-z0-9-]{1,63}\\[?\\.\\]?)[a-z0-9]+(-[a-z0-9]+)*\\[?\\.\\]?){1,126}[a-z]{2,63})"
- EMAIL_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)@(?P<domain>((?=[a-z0-9-]{1,63}\\.)[a-z0-9]+(-[a-z0-9]+)*\\.){1,126}[a-z]{2,63})"
- EMAIL_USER_REGEX = "(?P<user>[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+)"
- IPV4_DF_REGEX = '(?P<ipaddress>(?:[0-9]{1,3}\\[?\\.\\]?){3}[0-9]{1,3})'
- IPV4_REGEX = '(?P<ipaddress>(?:[0-9]{1,3}\\.){3}[0-9]{1,3})'
- IPV6_REGEX = '(?<![:.\\w])(?:[A-F0-9]{0,4}:){2,7}[A-F0-9]{0,4}(?![:.\\w])'
- LXPATH_REGEX = '(?P<root>/+||[.]+)\n (?P<folder>/(?:[^\\\\/:*?<>|\\r\\n]+/)*)\n (?P<file>[^/\\0<>|\\r\\n ]+)'
- LXSTDPATH_REGEX = '\n (?P<root>/|/bin|/boot|/dev|/home|/lib|/lost\\\\+found|/misc|/mnt|/net|/opt|/proc|/root|/sbin|/tmp|/usr|/var)\n (?P<folder>/(?:[^\\\\/:*?<>|\\r\\n]+/)*)\n (?P<file>[^/\\0<>|\\r\\n ]+)\n '
- MD5_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])'
- SHA1_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{40})(?:$|[^A-Fa-f0-9])'
- SHA256_REGEX = '(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{64})(?:$|[^A-Fa-f0-9])'
- URL_DF_REGEX = '\n (?P<protocol>(https?|hXXps?|s?ftps?|s?fXps?|telnet|ldap|file)://)\n (?P<userinfo>([a-z0-9-._~!$&\\\'()*+,;=:]|%[0-9A-F]{2})*@)?\n (?P<host>([a-z0-9-._~!$&\\\'()*+,;=\\[\\]]|%[0-9A-F]{2})*)\n (:(?P<port>\\d*))?\n (/(?P<path>([^?\\#"<>\\s]|%[0-9A-F]{2})*/?))?\n (\\?(?P<query>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?\n (\\#(?P<fragment>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?'
- URL_REGEX = '\n (?P<protocol>(https?|s?ftps?|telnet|ldap|file)://)\n (?P<userinfo>([a-z0-9-._~!$&\\\'()*+,;=:]|%[0-9A-F]{2})*@)?\n (?P<host>([a-z0-9-._~!$&\\\'()*+,;=]|%[0-9A-F]{2})*)\n (:(?P<port>\\d*))?\n (/(?P<path>([^?\\#"<>\\s]|%[0-9A-F]{2})*/?))?\n (\\?(?P<query>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?\n (\\#(?P<fragment>([a-z0-9-._~!$&\'()*+,;=:/?@]|%[0-9A-F]{2})*))?'
- WINPATH_REGEX = '\n (?P<root>[a-z]:|\\\\\\\\[a-z0-9_.$-]+||[.]+)\n (?P<folder>\\\\(?:[^\\/:*?"\\\'<>|\\r\\n]+\\\\)*)\n (?P<file>[^\\\\/*?""<>|\\r\\n ]+)'
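The hash patterns above consume one delimiter character on each side of the digest, so the named group hash is what carries the observable. A minimal sketch using only the re module, with a made-up sample string:

```python
import re

# Pattern copied from the MD5_REGEX attribute above.
MD5_REGEX = r"(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])"

text = "dropped file d41d8cd98f00b204e9800998ecf8427e on host"  # made-up sample
match = re.search(MD5_REGEX, text)

# The full match includes the delimiter characters on either side;
# the named group "hash" holds only the 32 hex digits.
whole = match.group(0)
md5 = match.group("hash")

# A 40-character SHA1 string is not reported as an MD5 hash, because the
# pattern requires a non-hex character (or a string boundary) on both sides.
sha1_text = "da39a3ee5e6b4b0d3255bfef95601890afd80709"
no_match = re.search(MD5_REGEX, sha1_text)

print(repr(whole))  # ' d41d8cd98f00b204e9800998ecf8427e '
print(md5)          # d41d8cd98f00b204e9800998ecf8427e
print(no_match)     # None
```

This is the kind of situation the group parameter of add_ioc_type addresses: a custom pattern that needs surrounding context can still report only the named group.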
- add_ioc_type(ioc_type: str, ioc_regex: str, priority: int = 0, group: str | None = None, defang_pattern: bool | None = None) None
Add an IoC type and regular expression to use to the built-in set.
- Parameters:
ioc_type (str) – A unique name for the IoC type
ioc_regex (str) – A regular expression used to search for the type
priority (int, optional) – Priority of the regex match vs. other ioc_patterns. 0 is the highest priority (the default is 0).
group (str, optional) – The regex group to match (the default is None, which will match on the whole expression)
defang_pattern (bool, optional) – If True, the regex will be used to match defanged patterns. If False, the regex will be used to match non-defanged patterns. If None, the regex will be used to match both defanged and non-defanged patterns.
Notes
- Pattern priorities.
If two IoCType patterns match on the same substring, the matched substring is assigned to the pattern/IoCType with the highest priority. E.g. foo.bar.com will match types dns, windows_path and linux_path, but since dns has a higher priority, the expression is assigned to the dns matches.
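One way to read the priority rule is sketched below. The function name, the tuple layout and the priority numbers are all hypothetical and are not taken from the library; only the convention that 0 is the highest priority comes from the parameter description.

```python
from collections import defaultdict

# Hypothetical (ioc_type, priority, matched_substring) tuples, as if three
# patterns had all matched the same text. The priority numbers are made up
# for illustration; 0 is the highest priority.
raw_matches = [
    ("linux_path", 2, "foo.bar.com"),
    ("dns", 1, "foo.bar.com"),
    ("windows_path", 2, "foo.bar.com"),
]


def resolve_by_priority(matches):
    """Assign each matched substring to the type with the lowest priority number."""
    best = {}
    for ioc_type, priority, value in matches:
        if value not in best or priority < best[value][1]:
            best[value] = (ioc_type, priority)
    results = defaultdict(set)
    for value, (ioc_type, _) in best.items():
        results[ioc_type].add(value)
    return dict(results)


resolved = resolve_by_priority(raw_matches)
print(resolved)  # {'dns': {'foo.bar.com'}}
```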
- extract(src: str | None = None, data: pd.DataFrame | None = None, columns: list[str] | None = None, **kwargs) dict[str, set[str]] | pd.DataFrame
Extract IoCs from either a string or pandas DataFrame.
- Parameters:
src (str, optional) – source string in which to look for IoC patterns (the default is None)
data (pd.DataFrame, optional) – input DataFrame from which to read source strings (the default is None)
columns (list, optional) – The list of columns to use as source strings, if the data parameter is used. (the default is None)
ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)
include_paths (bool, optional) – Whether to include path matches (which can be noisy) (the default is false - excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified this parameter is ignored.
ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.
defanged (bool, optional) – If True, will match defanged versions of email, dns, url and ip entities.
- Returns:
dict of found observables (if input is a string) or DataFrame of observables
- Return type:
Any
Notes
Extract takes either a string or a pandas DataFrame as input. Given a string, extract returns a dictionary of results. Given a DataFrame, the results are returned as a new DataFrame with the following columns:
- IoCType: the mnemonic used to distinguish different IoC types
- Observable: the actual value of the observable
- SourceIndex: the index of the row in the input DataFrame from which the IoC observable was extracted
IoCType pattern selection: the default list is [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.
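To make the shape of the string-mode result concrete (a dictionary mapping IoC type names to sets of observables), here is a dependency-free approximation that runs two of the documented patterns with the re module. The helper name extract_from_string is hypothetical and this is not the library's implementation; it skips priorities, TLD checks and defanging.

```python
import re
from collections import defaultdict

# Two of the built-in patterns, copied from the class attributes.
PATTERNS = {
    "ipv4": r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})",
    "dns": r"((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.){1,126}[a-z]{2,63}",
}


def extract_from_string(src):
    """Collect matches per IoC type name into a dict of sets."""
    found = defaultdict(set)
    for ioc_type, pattern in PATTERNS.items():
        for match in re.finditer(pattern, src):
            found[ioc_type].add(match.group(0))
    return dict(found)


sample = "beacon to evil.example.com from 10.0.0.5"  # made-up input
iocs = extract_from_string(sample)
print(iocs)
```

Using sets means an observable that appears several times in the input is reported once per type.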
- extract_df(data: pd.DataFrame, columns: str | list[str], **kwargs) pd.DataFrame
Extract IoCs from a pandas DataFrame.
- Parameters:
data (pd.DataFrame) – input DataFrame from which to read source strings
columns (Union[str, list]) – A single column name as a string, or a list of columns to use as source strings.
ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)
include_paths (bool, optional) – Whether to include path matches (which can be noisy) (the default is false - excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified this parameter is ignored.
ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.
defanged (bool, optional) – If True, will match defanged versions of email, dns, url and ip entities.
- Returns:
DataFrame of observables
- Return type:
pd.DataFrame
Notes
Extract takes a pandas DataFrame as input. The results are returned as a new DataFrame with the following columns:
- IoCType: the mnemonic used to distinguish different IoC types
- Observable: the actual value of the observable
- SourceIndex: the index of the row in the input DataFrame from which the IoC observable was extracted
IoCType pattern selection: the default list is [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.
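The three output columns can be pictured as one record per match. The sketch below builds such records from a plain dict standing in for a single DataFrame column, so that it runs without pandas; the data is made up and the loop is an illustration of the output layout rather than the library's code.

```python
import re

# Pattern copied from the IPV4_REGEX class attribute.
IPV4_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})"

# Stand-in for one DataFrame column: row index -> cell text (made-up data).
rows = {
    0: "login from 10.0.0.5",
    1: "no indicators here",
    2: "traffic between 172.16.0.9 and 10.0.0.5",
}

records = []
for index, text in rows.items():
    for match in re.finditer(IPV4_REGEX, text):
        records.append(
            {
                "IoCType": "ipv4",
                "Observable": match.group("ipaddress"),
                "SourceIndex": index,
            }
        )

for record in records:
    print(record)
```

Rows without a match contribute no records, and a row with several matches contributes one record per match, each carrying the same SourceIndex.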
- static file_hash_type(file_hash: str) IoCType
Return specific IoCType based on hash length.
- Parameters:
file_hash (str) – File hash string
- Returns:
Specific hash type or unknown.
- Return type:
IoCType
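A length-based lookup of this kind might look like the following sketch. The helper name is hypothetical and it returns type names as plain strings instead of IoCType members; the lengths come from the {32}, {40} and {64} quantifiers in the documented hash patterns. It checks only the length, not that the characters are hexadecimal.

```python
# Digest lengths in hex characters, matching the {32}, {40} and {64}
# quantifiers in MD5_REGEX, SHA1_REGEX and SHA256_REGEX.
HASH_LENGTHS = {32: "md5_hash", 40: "sha1_hash", 64: "sha256_hash"}


def hash_type_by_length(file_hash):
    """Return the IoC type name for a digest string, or 'unknown'."""
    return HASH_LENGTHS.get(len(file_hash.strip()), "unknown")


print(hash_type_by_length("d41d8cd98f00b204e9800998ecf8427e"))  # md5_hash
print(hash_type_by_length("abc123"))                            # unknown
```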
- get_ioc_type(observable: str) str
Return first matching type.
- Parameters:
observable (str) – The IoC Observable to check
- Returns:
The IoC type enumeration (unknown, if no match)
- Return type:
str
- property ioc_df_types: dict
Return current set of IoC types and regular expressions for defanged IoCs.
- Returns:
dict of IoC Type names and regular expressions
- Return type:
dict
- property ioc_types: dict
Return the current set of IoC types and regular expressions.
- Returns:
dict of IoC Type names and regular expressions
- Return type:
dict
- validate(input_str: str, ioc_type: str, ignore_tlds: bool = False, defanged: bool | None = None) bool
Check that input_str matches the regex for the specified ioc_type.
- Parameters:
input_str (str) – the string to test
ioc_type (str) – the regex pattern to use
ignore_tlds (bool, optional) – If True, ignore the official Top Level Domains list when determining whether a domain name is a legal domain.
defanged (bool, optional) – If True, the input string will also match defanged versions of the IoC, default is False.
- Returns:
True if match.
- Return type:
bool
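The idea of checking a single string against one pattern can be sketched with the re module alone. The use of re.fullmatch and the helper name looks_like_ipv4 are assumptions of this sketch; the source does not say how validate anchors the pattern or whether it applies further checks.

```python
import re

# Pattern copied from the IPV4_REGEX class attribute.
IPV4_REGEX = r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})"


def looks_like_ipv4(input_str):
    """Return True if the whole string matches the IPv4 pattern."""
    # fullmatch is an assumption of this sketch; it rejects trailing text.
    return re.fullmatch(IPV4_REGEX, input_str) is not None


checks = {
    "10.0.0.5": looks_like_ipv4("10.0.0.5"),
    "10.0.0.5/24": looks_like_ipv4("10.0.0.5/24"),
    # The pattern checks digit counts only, not the 0-255 octet range.
    "999.999.999.999": looks_like_ipv4("999.999.999.999"),
}
print(checks)
```

Taken on its own, the pattern accepts any four groups of one to three digits, so a regex match is a syntactic check and not proof of a routable address.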
- class msticpy.transform.iocextract.IoCExtractAccessor(pandas_obj)
Bases:
object
Pandas api extension for IoC Extractor.
Instantiate pandas extension class.
- extract(columns, **kwargs)
Extract IoCs from a pandas DataFrame.
- Parameters:
columns (list) – The list of columns to use as source strings.
ioc_types (list, optional) – Restrict matching to just specified types. (default is all types)
include_paths (bool, optional) – Whether to include path matches (which can be noisy) (the default is false - excludes ‘windows_path’ and ‘linux_path’). If ioc_types is specified this parameter is ignored.
- Returns:
DataFrame of observables
- Return type:
pd.DataFrame
Notes
Extract takes a pandas DataFrame as input. The results are returned as a new DataFrame with the following columns:
- IoCType: the mnemonic used to distinguish different IoC types
- Observable: the actual value of the observable
- SourceIndex: the index of the row in the input DataFrame from which the IoC observable was extracted
IoCType pattern selection: the default list is [‘ipv4’, ‘ipv6’, ‘dns’, ‘url’, ‘md5_hash’, ‘sha1_hash’, ‘sha256_hash’] plus any user-defined types. ‘windows_path’ and ‘linux_path’ are excluded unless include_paths is True or they are explicitly included in ioc_types.
- class msticpy.transform.iocextract.IoCPattern(ioc_type: str, comp_regex: re.Pattern[str], priority: int, group: str | None)
Bases:
object
Define patterns for IOC.
Method generated by attrs for class IoCPattern.
- comp_regex: re.Pattern[str]
- group: str | None
- ioc_type: str
- priority: int
- class msticpy.transform.iocextract.IoCType(value)
Bases:
Enum
Enumeration of IoC Types.
- dns = 'dns'
- email = 'email'
- file_hash = 'file_hash'
- hostname = 'hostname'
- ipv4 = 'ipv4'
- ipv6 = 'ipv6'
- linux_path = 'linux_path'
- md5_hash = 'md5_hash'
- classmethod parse(value: str) IoCType
Return parsed IoCType of string.
- Parameters:
value (str) – Enumeration name
- Returns:
IoCType matching name or unknown if no match
- Return type:
IoCType
- sha1_hash = 'sha1_hash'
- sha256_hash = 'sha256_hash'
- unknown = 'unknown'
- url = 'url'
- windows_path = 'windows_path'
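The documented behavior of parse (return the member matching a name, or unknown if there is no match) can be sketched with a cut-down enum. IoCTypeSketch is a hypothetical stand-in with only a few of the members listed above, and the lookup strategy shown is an assumption, not the library's code.

```python
from enum import Enum


class IoCTypeSketch(Enum):
    """Cut-down stand-in for IoCType with a few of the documented members."""

    unknown = "unknown"
    ipv4 = "ipv4"
    dns = "dns"
    md5_hash = "md5_hash"

    @classmethod
    def parse(cls, value):
        """Return the member named value, or unknown if there is none."""
        try:
            return cls[value.strip()]
        except KeyError:
            return cls.unknown


print(IoCTypeSketch.parse("dns"))         # IoCTypeSketch.dns
print(IoCTypeSketch.parse("not_a_type"))  # IoCTypeSketch.unknown
```

Falling back to unknown instead of raising lets callers handle unrecognized type names with an ordinary comparison.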