IoC Extraction

This class allows you to extract IoC patterns from a string or a DataFrame. Several patterns are built in to the class and you can override these or supply new ones.

# Imports
import sys
MIN_REQ_PYTHON = (3,6)
if sys.version_info < MIN_REQ_PYTHON:
    print('Check the Kernel->Change Kernel menu and ensure that Python 3.6')
    print('or later is selected as the active kernel.')
    sys.exit("Python %s.%s or later is required.\n" % MIN_REQ_PYTHON)

from IPython.display import display, HTML
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)
# Load test data
process_tree = pd.read_csv('data/process_tree.csv')
process_tree[['CommandLine']].head()
CommandLine
0 .\ftp -s:C:\RECYCLER\xxppyy.exe
1 .\reg not /domain:everything that /sid:shines is /krbtgt:golden !
2 cmd /c "systeminfo && systeminfo"
3 .\rundll32 /C 42424.exe
4 .\rundll32 /C c:\users\MSTICAdmin\42424.exe

Looking for IoC in a String

Just pass the string as a parameter to the extract() method.

Get a commandline from our data set.

# get a commandline from our data set
cmdline = process_tree['CommandLine'].loc[78]
cmdline
'netsh  start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\\Users\\user\\AppData\\Local\\Temp\\bzzzzzz.txt'

Instantiate an IoCExtract instance and pass the string to the extract() method.

# Instantiate an IoCExtract object
from msticpy.transform import IoCExtract
ioc_extractor = IoCExtract()

# any IoCs in the string?
iocs_found = ioc_extractor.extract(cmdline)

if iocs_found:
    print('\nPotential IoCs found in alert process:')
    display(iocs_found)
Potential IoCs found in alert process:
defaultdict(set,
            {'ipv4': {'1.2.3.4'},
             'windows_path': {'C:\\Users\\user\\AppData\\Local\\Temp\\bzzzzzz.txt'}})
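
For a string input the result is a dictionary of sets keyed by IoC type. The sketch below, which uses only the standard library, flattens such a result into sorted (type, observable) rows, one per observable. The sample dictionary copies the output above; flatten_iocs is an illustrative name and not part of msticpy.

```python
from collections import defaultdict


def flatten_iocs(iocs):
    """Turn {ioc_type: {observable, ...}} into a sorted list of (ioc_type, observable)."""
    return sorted(
        (ioc_type, observable)
        for ioc_type, values in iocs.items()
        for observable in values
    )


# Same shape and values as the result shown above.
iocs_found = defaultdict(set, {
    'ipv4': {'1.2.3.4'},
    'windows_path': {'C:\\Users\\user\\AppData\\Local\\Temp\\bzzzzzz.txt'},
})

rows = flatten_iocs(iocs_found)
for ioc_type, observable in rows:
    print(f'{ioc_type}: {observable}')
```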

The following IoC patterns are searched for:

  • ipv4

  • ipv6

  • dns

  • url

  • windows_path

  • linux_path

  • md5_hash

  • sha1_hash

  • sha256_hash

Using a DataFrame as Input

You can use the data= parameter to IoCExtract.extract() to pass a DataFrame. Use the columns parameter to specify the column or columns that you want to search.

Note

When searching a DataFrame, the windows_path and linux_path types are not included in the search by default because of the likely high volume of results and number of false-positive matches. You can include them by specifying include_paths=True as a parameter to extract().

You can also use the ioc_types parameter to explicitly list the IoC types that you want to search for. This should be a list of strings of valid types. See ioc_types.

ioc_extractor = IoCExtract()
ioc_df = ioc_extractor.extract(data=process_tree, columns=['CommandLine'])
if len(ioc_df):
    display(HTML("<h3>IoC patterns found in process tree.</h3>"))
    display(ioc_df)

IoC patterns found in process tree.

IoCType Observable SourceIndex
48 windows_path .\powershell 36
49 url http://somedomain/best-kitten-names-1.jpg' 37
53 windows_path .\pOWErS^H^ElL^.eX^e^ 37
58 md5_hash 81ed03caf6901e444c72ac67d192fb9c 44
59 url http://badguyserver/pwnme" 46
68 windows_path .\reg query add mscfile\\\\open 59
72 windows_path \system\CurrentControlSet\Control\Terminal 63
92 ipv4 1.2.3.4 78
108 ipv4 127.0.0.1 102
109 url http://127.0.0.1/ 102
110 windows_path \SOFTWARE\Microsoft\Windows NT\CurrentVersion\Svchost\MyNastySvcHostConfig 103

IoCExtractor API

See IoCExtract.

Predefined Regex Patterns

from html import escape
extractor = IoCExtract()

for ioc_type, pattern in extractor.ioc_types.items():
    esc_pattern = escape(pattern.comp_regex.pattern)
    display(HTML(f'<b>{ioc_type}</b>'))
    display(HTML(f'<div style="margin-left:20px"><pre>{esc_pattern}</pre></div>'))
IoCType Regex
ipv4
(?P<ipaddress>(?:[0-9]{1,3}\\.){3}[0-9]{1,3})
ipv6
(?<![:.\\w])(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}(?![:.\\w])
dns
((?=[a-z0-9-]{1,63}\\.)[a-z0-9]+(-[a-z0-9]+)*\\.){2,}[a-z]{2,63}
url
(?P<protocol>(https?|ftp|telnet|ldap|file)://)
(?P<userinfo>([a-z0-9-._~!$&\\'()*+,;=:]|%[0-9A-F]{2})*@)?
(?P<host>([a-z0-9-._~!$&\\'()*+,;=]|%[0-9A-F]{2})*)
windows_path
(?P<root>[a-z]:|\\\\\\\\[a-z0-9_.$-]+||[.]+)
(?P<folder>\\\\(?:[^\\/:*?"\\\'<>|\\r\\n]+\\\\)*)
(?P<file>[^\\\\/*?""<>|\\r\\n ]+)
linux_path
(?P<root>/+||[.]+)
(?P<folder>/(?:[^\\\\/:*?<>|\\r\\n]+/)*)
(?P<file>[^/\\0<>|\\r\\n ]+)
md5_hash
(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])
sha1_hash
(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{40})(?:$|[^A-Fa-f0-9])
sha256_hash
(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{64})(?:$|[^A-Fa-f0-9])
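
As a rough illustration of how patterns like these can be applied, the sketch below compiles two of them with the standard re module and collects the matches per type, similar in shape to the dictionary returned for a string input. It does not use msticpy: PATTERNS and find_iocs are illustrative names, and the ipv4 and md5_hash expressions are written with single backslashes, as they would appear in a Python raw string.

```python
import re
from collections import defaultdict

# Illustrative subset of the patterns listed above (raw strings, single backslashes).
PATTERNS = {
    "ipv4": r"(?P<ipaddress>(?:[0-9]{1,3}\.){3}[0-9]{1,3})",
    "md5_hash": r"(?:^|[^A-Fa-f0-9])(?P<hash>[A-Fa-f0-9]{32})(?:$|[^A-Fa-f0-9])",
}

# Ignore case, verbose and multi-line, as described in this document.
FLAGS = re.I | re.X | re.M
COMPILED = {name: re.compile(regex, FLAGS) for name, regex in PATTERNS.items()}


def find_iocs(text):
    """Return a dict of sets of matches keyed by pattern name (hypothetical helper)."""
    found = defaultdict(set)
    for name, regex in COMPILED.items():
        for match in regex.finditer(text):
            # Use the first named group when the pattern defines one.
            groups = match.groupdict()
            value = next(iter(groups.values())) if groups else match.group(0)
            found[name].add(value)
    return found


sample = "netsh start capture=yes IPv4.Address=1.2.3.4 hash=81ed03caf6901e444c72ac67d192fb9c"
result = find_iocs(sample)
print(dict(result))
```

Taking the named group rather than the whole match matters for the hash patterns, because the full match also includes the non-hex character on either side of the hash.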

Adding your own pattern(s)

See add_ioc_type

Add an IoC type and regular expression to use to the built-in set.

Warning

Adding an ioc_type that exists in the internal set will overwrite that item.

Regular expressions are compiled with re.I | re.X | re.M (Ignore case, Verbose and MultiLine).

add_ioc_type parameters:

  • ioc_type {str} - a unique name for the IoC type

  • ioc_regex {str} - a regular expression used to search for the type

# Define the regular expression once and register it under a new type name
pipe_regex = r'(?P<pipe>\\\\\.\\pipe\\[^\s\\]+)'
extractor.add_ioc_type(ioc_type='win_named_pipe', ioc_regex=pipe_regex)

# Check that it added ok
print(extractor.ioc_types['win_named_pipe'])

# Use it in our data set (search with the same instance that the type was added to)
extractor.extract(data=process_tree, columns=['CommandLine']).query("IoCType == 'win_named_pipe'")
IoCPattern(ioc_type='win_named_pipe', comp_regex=re.compile('(?P<pipe>\\\\\.\\pipe\\[^\s\\]+)', re.IGNORECASE|re.MULTILINE|re.VERBOSE), priority=0)
IoCType Observable SourceIndex
116 win_named_pipe \\.\pipe\blahtest" 107
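
Because patterns are compiled with re.X (verbose), unescaped whitespace inside an ioc_regex is ignored by the regex engine. The sketch below uses only the standard re module to show the effect. The sample strings and the variable names loose and strict are illustrative, and nothing here calls msticpy.

```python
import re

FLAGS = re.I | re.X | re.M  # the flags named in the text above

# The named-pipe pattern from the example, compiled with the same flags.
pipe_regex = re.compile(r'(?P<pipe>\\\\\.\\pipe\\[^\s\\]+)', FLAGS)
match = pipe_regex.search(r'cmd /c echo test > \\.\pipe\blahtest')
pipe_name = match.group('pipe') if match else None

# In verbose mode the literal space is dropped, so this pattern
# effectively means "netshstart" and does not match the text.
loose = re.compile(r'netsh start', FLAGS)
# Using \s keeps the whitespace significant.
strict = re.compile(r'netsh\s+start', FLAGS)

text = 'netsh  start capture=yes'
print(pipe_name, bool(loose.search(text)), bool(strict.search(text)))
```

If a custom pattern needs to match a literal space, writing it as \s, [ ] or an escaped space avoids this pitfall.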

extract_df()

extract_df functions identically to extract with a data parameter. It may be more convenient to use this when you know that your input is a DataFrame.

ioc_extractor.extract_df(process_tree, columns=['NewProcessName', 'CommandLine']).head(10)

Merging output with source data

The SourceIndex column allows you to merge the results with the input DataFrame. Where an input row has multiple IoC matches, the merge produces duplicate rows from the input (one per IoC match). The previous index is preserved in the second column (and in the SourceIndex column).

Note: you will need to set the type of the SourceIndex column. In the example below we are matching on the default numeric index, so we force the type to be numeric. If your index has a different dtype, you will need to convert SourceIndex (dtype=object) to match the type of your index.

input_df = process_tree.head(20)
output_df = ioc_extractor.extract(data=input_df, columns=['NewProcessName', 'CommandLine'])
# set the type of the SourceIndex column. In this case we are matching with the default numeric index.
output_df['SourceIndex'] = pd.to_numeric(output_df['SourceIndex'])
merged_df = pd.merge(left=input_df, right=output_df, how='outer', left_index=True, right_on='SourceIndex')
merged_df.head()

TenantId  Account  EventID  TimeGenerated  Computer  SubjectUserSid  SubjectUserName  SubjectDomainName  SubjectLogonId  NewProcessId  NewProcessName  TokenElevationType  ProcessId  CommandLine  ParentProcessName  TargetLogonId  SourceComputerId  TimeCreatedUtc  NodeRole  Level  ProcessId1  NewProcessId1  IoCType  Observable  SourceIndex
0  802d39e1-9d70-404d-832c-2de5e2478eda  MSTICAlertsWin1\MSTICAdmin  4688  2019-01-15 05:15:15.677  MSTICAlertsWin1  S-1-5-21-996632719-2361334927-4038480536-500  MSTICAdmin  MSTICAlertsWin1  0xfaac27  0x1580  C:\Diagnostics\UserTmp\ftp.exe  %%1936  0xbc8  .\ftp -s:C:\RECYCLER\xxppyy.exe  C:\Windows\System32\cmd.exe  0x0  46fe7078-61bb-4bed-9430-7ac01d91c273  2019-01-15 05:15:15.677  source  0  nan  nan  nan  nan  0
1  802d39e1-9d70-404d-832c-2de5e2478eda  MSTICAlertsWin1\MSTICAdmin  4688  2019-01-15 05:15:16.167  MSTICAlertsWin1  S-1-5-21-996632719-2361334927-4038480536-500  MSTICAdmin  MSTICAlertsWin1  0xfaac27  0x16fc  C:\Diagnostics\UserTmp\reg.exe  %%1936  0xbc8  .\reg not /domain:everything that /sid:shines is /krbtgt:golden !  C:\Windows\System32\cmd.exe  0x0  46fe7078-61bb-4bed-9430-7ac01d91c273  2019-01-15 05:15:16.167  sibling  1  nan  nan  nan  nan  1
2  802d39e1-9d70-404d-832c-2de5e2478eda  MSTICAlertsWin1\MSTICAdmin  4688  2019-01-15 05:15:16.277  MSTICAlertsWin1  S-1-5-21-996632719-2361334927-4038480536-500  MSTICAdmin  MSTICAlertsWin1  0xfaac27  0x1700  C:\Diagnostics\UserTmp\cmd.exe  %%1936  0xbc8  cmd /c "systeminfo && systeminfo"  C:\Windows\System32\cmd.exe  0x0  46fe7078-61bb-4bed-9430-7ac01d91c273  2019-01-15 05:15:16.277  sibling  1  nan  nan  nan  nan  2
3  802d39e1-9d70-404d-832c-2de5e2478eda  MSTICAlertsWin1\MSTICAdmin  4688  2019-01-15 05:15:16.340  MSTICAlertsWin1  S-1-5-21-996632719-2361334927-4038480536-500  MSTICAdmin  MSTICAlertsWin1  0xfaac27  0x1728  C:\Diagnostics\UserTmp\rundll32.exe  %%1936  0xbc8  .\rundll32 /C 42424.exe  C:\Windows\System32\cmd.exe  0x0  46fe7078-61bb-4bed-9430-7ac01d91c273  2019-01-15 05:15:16.340  sibling  1  nan  nan  nan  nan  3
4  802d39e1-9d70-404d-832c-2de5e2478eda  MSTICAlertsWin1\MSTICAdmin  4688  2019-01-15 05:15:16.400  MSTICAlertsWin1  S-1-5-21-996632719-2361334927-4038480536-500  MSTICAdmin  MSTICAlertsWin1  0xfaac27  0x175c  C:\Diagnostics\UserTmp\rundll32.exe  %%1936  0xbc8  .\rundll32 /C c:\users\MSTICAdmin\42424.exe  C:\Windows\System32\cmd.exe  0x0  46fe7078-61bb-4bed-9430-7ac01d91c273  2019-01-15 05:15:16.400  sibling  1  nan  nan  nan  nan  4

IPython magic

You can use the line magic %ioc or cell magic %%ioc to extract IoCs from text pasted directly into a cell.

The ioc magic supports the following options:

--out OUT, -o OUT
    The name of the variable in which to return the results
    Note: the output variable is a dictionary of IoCs grouped by IoC type
--ioc_types IOC_TYPES, -i IOC_TYPES
    The types of IoC to search for (comma-separated string)
%%ioc --out ioc_capture
netsh  start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\Users\user\AppData\Local\Temp\bzzzzzz.txt
hostname    customers-service.ddns.net              Feb 5, 2020, 2:20:35 PM         7
URL \https://two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=password                Feb 5, 2020, 2:20:35 PM         1
hostname    mobile.phonechallenges-submit.site              Feb 5, 2020, 2:20:35 PM         8
hostname    youtube.service-activity-checkup.site           Feb 5, 2020, 2:20:35 PM         8
hostname    www.drive-accounts.com          Feb 5, 2020, 2:20:35 PM         7
hostname    google.drive-accounts.com               Feb 5, 2020, 2:20:35 PM         7
domain      niaconucil.org          Feb 5, 2020, 2:20:35 PM         11
domain      isis-online.net         Feb 5, 2020, 2:20:35 PM         11
domain      bahaius.info            Feb 5, 2020, 2:20:35 PM         11
domain      w3-schools.org          Feb 5, 2020, 2:20:35 PM         12
domain      system-services.site            Feb 5, 2020, 2:20:35 PM         11
domain      accounts-drive.com              Feb 5, 2020, 2:20:35 PM         8
domain      drive-accounts.com              Feb 5, 2020, 2:20:35 PM         10
domain      service-issues.site             Feb 5, 2020, 2:20:35 PM         8
domain      two-step-checkup.site           Feb 5, 2020, 2:20:35 PM         8
domain      customers-activities.site               Feb 5, 2020, 2:20:35 PM         11
domain      seisolarpros.org                Feb 5, 2020, 2:20:35 PM         11
domain      yah00.site              Feb 5, 2020, 2:20:35 PM         4
domain      skynevvs.com            Feb 5, 2020, 2:20:35 PM         11
domain      recovery-options.site           Feb 5, 2020, 2:20:35 PM         4
domain      malcolmrifkind.site             Feb 5, 2020, 2:20:35 PM         8
domain      instagram-com.site              Feb 5, 2020, 2:20:35 PM         8
domain      leslettrespersanes.net          Feb 5, 2020, 2:20:35 PM         11
domain      software-updating-managers.site         Feb 5, 2020, 2:20:35 PM         8
domain      cpanel-services.site            Feb 5, 2020, 2:20:35 PM         8
domain      service-activity-checkup.site           Feb 5, 2020, 2:20:35 PM         7
domain      inztaqram.ga            Feb 5, 2020, 2:20:35 PM         8
domain      unirsd.com              Feb 5, 2020, 2:20:35 PM         8
domain      phonechallenges-submit.site             Feb 5, 2020, 2:20:35 PM         7
domain      acconut-verify.com              Feb 5, 2020, 2:20:35 PM         11
domain      finance-usbnc.info              Feb 5, 2020, 2:20:35 PM         8
FileHash-MD5        542128ab98bda5ea139b169200a50bce                Feb 5, 2020, 2:20:35 PM         3
FileHash-MD5        3d67ce57aab4f7f917cf87c724ed7dab                Feb 5, 2020, 2:20:35 PM         3
hostname    x09live-ix3b.account-profile-users.info         Feb 6, 2020, 2:56:07 PM         0
hostname    www.phonechallenges-submit.site         Feb 6, 2020, 2:56:07 PM
[('ipv4', ['1.2.3.4']),
 ('dns',
  ['malcolmrifkind.site',
   'w3-schools.org',
   'niaconucil.org',
   'software-updating-managers.site',
   'isis-online.net',
   'accounts-drive.com',
   'cpanel-services.site',
   'service-activity-checkup.site',
   'service-issues.site',
   'recovery-options.site',
   'instagram-com.site',
   'mobile.phonechallenges-submit.site',
   'youtube.service-activity-checkup.site',
   'google.drive-accounts.com',
   'phonechallenges-submit.site',
   'drive-accounts.com',
   'www.phonechallenges-submit.site',
   'yah00.site',
   'seisolarpros.org',
   'customers-activities.site',
   'bahaius.info',
   'system-services.site',
   'two-step-checkup.site',
   'x09live-ix3b.account-profile-users.info',
   'customers-service.ddns.net',
   'leslettrespersanes.net',
   'www.drive-accounts.com',
   'acconut-verify.com',
   'finance-usbnc.info',
   'unirsd.com',
   'skynevvs.com',
   'inztaqram.ga']),
 ('url',
  ['https://two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=password']),
 ('windows_path', ['C:\Users\user\AppData\Local\Temp\bzzzzzz.txt']),
 ('linux_path',
  ['//two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=passwordttFeb']),
 ('md5_hash',
  ['3d67ce57aab4f7f917cf87c724ed7dab', '542128ab98bda5ea139b169200a50bce'])]
%%ioc --ioc_types "ipv4, ipv6, linux_path, md5_hash"
netsh  start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\Users\user\AppData\Local\Temp\bzzzzzz.txt
tracefile2=/usr/localbzzzzzz.sh
hostname    customers-service.ddns.net              Feb 5, 2020, 2:20:35 PM         7
URL \https://two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=password                Feb 5, 2020, 2:20:35 PM         1
hostname    mobile.phonechallenges-submit.site              Feb 5, 2020, 2:20:35 PM         8
hostname    youtube.service-activity-checkup.site           Feb 5, 2020, 2:20:35 PM         8
hostname    www.drive-accounts.com          Feb 5, 2020, 2:20:35 PM         7
hostname    google.drive-accounts.com               Feb 5, 2020, 2:20:35 PM         7
domain      niaconucil.org          Feb 5, 2020, 2:20:35 PM         11
domain      isis-online.net         Feb 5, 2020, 2:20:35 PM         11
domain      bahaius.info            Feb 5, 2020, 2:20:35 PM         11
domain      w3-schools.org          Feb 5, 2020, 2:20:35 PM         12
domain      system-services.site            Feb 5, 2020, 2:20:35 PM         11
domain      accounts-drive.com              Feb 5, 2020, 2:20:35 PM         8
domain      drive-accounts.com              Feb 5, 2020, 2:20:35 PM         10
domain      service-issues.site             Feb 5, 2020, 2:20:35 PM         8
domain      two-step-checkup.site           Feb 5, 2020, 2:20:35 PM         8
domain      customers-activities.site               Feb 5, 2020, 2:20:35 PM         11
domain      seisolarpros.org                Feb 5, 2020, 2:20:35 PM         11
domain      yah00.site              Feb 5, 2020, 2:20:35 PM         4
domain      skynevvs.com            Feb 5, 2020, 2:20:35 PM         11
domain      recovery-options.site           Feb 5, 2020, 2:20:35 PM         4
domain      malcolmrifkind.site             Feb 5, 2020, 2:20:35 PM         8
domain      instagram-com.site              Feb 5, 2020, 2:20:35 PM         8
domain      leslettrespersanes.net          Feb 5, 2020, 2:20:35 PM         11
domain      software-updating-managers.site         Feb 5, 2020, 2:20:35 PM         8
domain      cpanel-services.site            Feb 5, 2020, 2:20:35 PM         8
domain      service-activity-checkup.site           Feb 5, 2020, 2:20:35 PM         7
domain      inztaqram.ga            Feb 5, 2020, 2:20:35 PM         8
domain      unirsd.com              Feb 5, 2020, 2:20:35 PM         8
domain      phonechallenges-submit.site             Feb 5, 2020, 2:20:35 PM         7
domain      acconut-verify.com              Feb 5, 2020, 2:20:35 PM         11
domain      finance-usbnc.info              Feb 5, 2020, 2:20:35 PM         8
FileHash-MD5        542128ab98bda5ea139b169200a50bce                Feb 5, 2020, 2:20:35 PM         3
FileHash-MD5        3d67ce57aab4f7f917cf87c724ed7dab                Feb 5, 2020, 2:20:35 PM         3
hostname    x09live-ix3b.account-profile-users.info         Feb 6, 2020, 2:56:07 PM         0
hostname    www.phonechallenges-submit.site         Feb 6, 2020, 2:56:07 PM
[('ipv4', ['1.2.3.4']),
 ('linux_path',
  ['//two-step-checkup.site/securemail/secureLogin/challenge/url?ucode=d50a3eb1-9a6b-45a8-8389-d5203bbddaa1&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;service=mailservice&amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;type=passwordttFeb',
   '/usr/localbzzzzzz.sh']),
 ('md5_hash',
  ['3d67ce57aab4f7f917cf87c724ed7dab', '542128ab98bda5ea139b169200a50bce'])]
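
In the first output above, the linux_path match that begins with // is really the tail of the URL rather than a file path. One possible way to post-process such results is sketched below, assuming they are held as a dictionary of lists keyed by IoC type (the option text above describes the --out variable as a dictionary). The function name drop_url_fragments and the sample values are hypothetical, and the sketch uses only the standard library.

```python
def drop_url_fragments(iocs):
    """Return a copy of iocs without linux_path entries that are URL tails."""
    # Scheme-less remainder of each URL, e.g. "//host/path" for "https://host/path".
    url_tails = [url.split(":", 1)[1] for url in iocs.get("url", []) if ":" in url]
    cleaned = {ioc_type: list(values) for ioc_type, values in iocs.items()}
    if "linux_path" in cleaned:
        cleaned["linux_path"] = [
            path
            for path in cleaned["linux_path"]
            if not any(path.startswith(tail) for tail in url_tails)
        ]
    return cleaned


# Shortened, made-up values in the same shape as the output above.
sample = {
    "url": ["https://example.test/login?type=password"],
    "linux_path": ["//example.test/login?type=password", "/usr/local/bin/tool.sh"],
}
print(drop_url_fragments(sample))
```

This only helps when the url type is part of the search; in the second example, where url is not listed in --ioc_types, there is nothing to compare against.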

Pandas Extension

The IoC extraction functionality is also available in a pandas extension, mp_ioc. This supports a single method, extract(), which takes the same syntax as extract (described earlier).

process_tree.mp_ioc.extract(columns=['CommandLine'])

IoCType Observable SourceIndex
0 dns microsoft.com 24
1 url http://server/file.sct 31
2 dns server 31
3 dns evil.ps 35
4 url http://somedomain/best-kitten-names-1.jpg' 37
5 dns somedomain 37
6 dns blah.ps 40
7 md5_hash aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 40
8 dns blah.ps 41
9 md5_hash aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 41
10 md5_hash 81ed03caf6901e444c72ac67d192fb9c 44
11 url http://badguyserver/pwnme 46
12 dns badguyserver 46
13 url http://badguyserver/pwnme 47
14 dns badguyserver 47
15 dns Invoke-Shellcode.ps 48
16 dns Invoke-ReverseDnsLookup.ps 49
17 dns Wscript.Shell 67
18 url http://system.management.automation.amsiutils').getfield('amsiinitfailed','nonpublic,static').s… 77
19 dns system.management.automation.amsiutils').getfield('amsiinitfailed','nonpublic,static').setvalue(… 77
20 ipv4 1.2.3.4 78
21 dns wscript.shell 81
22 dns abc.com 90
23 ipv4 127.0.0.1 102
24 url http://127.0.0.1/ 102
25 win_named_pipe \\.\pipe\blahtest" 107

Note

The URLs in the previous table have been altered to prevent inadvertent navigation to them.