Data Masking Functions

Sharing data, creating documents and doing public demonstrations often require that data containing PII or other sensitive material be masked.

MSTICPy contains a simple library to obfuscate data using hashing and random mapping of values. You can use these functions on single data items or entire DataFrames.

Warning

These functions are only intended to mask data. No real attempt is made to preserve the syntax and meaning of the output. We recommend not trying to use an obfuscated data set as the input to any analysis. Instead, perform your analysis and mask the results.

Import the module

from msticpy.data import data_obfus

See data_obfus for API details.

Individual Masking Functions

In the examples below we’re importing individual functions from the data_obfus module, but you can also access them as attributes of that module using the single import statement shown above.

data_obfus.hash_string(...)

hash_string

hash_string does a simple hash of the input. If the input is a numeric string, the output is also a numeric string.

Hash a simple string.

Parameters
----------
input_str : str
    The input string

Returns
-------
str
    The masked output string

Examples

> hash_string('sensitive data')
jdiqcnrqmlidkd

> hash_string('42424')
59944
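
As a rough illustration of the behaviour described above, the sketch below hashes a string and renders the digest as lowercase letters, or as digits when the input is numeric. It is not MSTICPy's implementation: the function name sketch_hash_string, the use of SHA-256 and the choice to keep the output the same length as the input are all assumptions made for this example.

```python
import hashlib


def sketch_hash_string(input_str: str) -> str:
    """Return a deterministic masked string (illustrative only)."""
    digest = hashlib.sha256(input_str.encode("utf-8")).digest()
    # Numeric input -> digits; anything else -> lowercase letters.
    alphabet = "0123456789" if input_str.isdigit() else "abcdefghijklmnopqrstuvwxyz"
    # Reuse digest bytes cyclically so long inputs still get one symbol per character.
    return "".join(
        alphabet[digest[i % len(digest)] % len(alphabet)]
        for i in range(len(input_str))
    )
```

Because the output depends only on the input, repeated values mask to the same result, which keeps joins and group-bys on masked columns consistent.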

hash_item

hash_item allows specification of delimiters. This is useful for preserving the look of domains, emails, etc.

Hash a simple string.

Parameters
----------
input_item : str
    The input string
delim : str, optional
    A string of delimiters to use to split the input string
    prior to hashing.

Returns
-------
str
    The masked output string

Examples

> hash_item('sensitive data', delim=' ')
kdneqoiia laoe

> hash_item('most-sensitive-data/here', delim=' /-')
kmea-kdneqoiia-laoe/fcec

hash_ip

hash_ip will output random mappings of input IPv4 and IPv6 addresses. For IPv4 addresses this works by creating a random mapping of each byte of the address, so multiple occurrences of the same IP address will be converted to the same randomized output address. The mapping remains for the Python session.

Some special IP addresses (localhost, 0.0.0.0) and the prefixes of reserved private addresses are preserved.

Warning

No checking is done for collisions with public IPs that get randomly mapped to a 10.x.x.x or other private address spaces.

Note

IPv6 addresses have their individual components hashed to a hex string and do not use this mapping. This should still result in a given input IP address being mapped to the same masked address. The output IPv6 address will usually not be a valid IP address, though.

Hash IP address or list of IP addresses.

Parameters
----------
input_item : Union[List[str], str]
    List of IP addresses or single IP address.

Returns
-------
Union[List[str], str]
    List of hashed addresses or single address.
    (depending on input)

Examples

> hash_ip('192.168.3.1')
160.21.239.194

> hash_ip('2001:0db8:85a3:0000:0000:8a2e:0370:7334')
85d6:7819:9cce:9af1:9af1:24ad:d338:7d03

> hash_ip(['192.168.3.1', '192.168.5.2', '192.168.10.2'])
['160.21.239.194', '160.21.103.84', '160.21.149.84']

> hash_ip("127.0.0.1")
'127.0.0.1'

# private network prefixes preserved
> hash_ip("10.1.23.456")
'10.19.74.1'

> hash_ip("192.168.23.456")
'192.168.80.1'
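
The per-byte mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the names _OCTET_MAP and sketch_hash_ipv4 are hypothetical, only IPv4 is handled, and the set of preserved addresses and private prefixes is a guess at what "special" and "reserved private" cover.

```python
import ipaddress
import random
from typing import Dict

# Built once per Python session: every octet value maps to one shuffled value.
_OCTET_MAP: Dict[int, int] = dict(zip(range(256), random.sample(range(256), 256)))
_PRESERVED = {"127.0.0.1", "0.0.0.0"}  # assumed list of special addresses


def sketch_hash_ipv4(ip_str: str) -> str:
    """Mask an IPv4 address octet by octet (illustrative only)."""
    if ip_str in _PRESERVED:
        return ip_str
    addr = ipaddress.IPv4Address(ip_str)  # raises ValueError on malformed input
    octets = list(addr.packed)
    # Keep the leading octets of common private ranges and mask the rest.
    if octets[0] == 10:
        keep = 1
    elif (octets[0] == 192 and octets[1] == 168) or (
        octets[0] == 172 and 16 <= octets[1] <= 31
    ):
        keep = 2
    else:
        keep = 0
    masked = octets[:keep] + [_OCTET_MAP[o] for o in octets[keep:]]
    return ".".join(str(o) for o in masked)
```

Because the octet map is a permutation built once at import time, the same input always yields the same output within a session, but the mapping changes when the session restarts.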

hash_sid

hash_sid will randomize the domain-specific parts of a Windows SID. It preserves well-known RIDs (e.g. the Admins ‘-500’ RID will be preserved in the masked output). Built-in SIDs (such as LocalSystem and NetworkService) are preserved as-is.

Hash a SID preserving well-known SIDs and the RID.

Parameters
----------
sid : str
    SID string

Returns
-------
str
    Hashed SID

Examples

> hash_sid('S-1-5-21-1180699209-877415012-3182924384-1004')
S-1-5-21-3321821741-636458740-4143214142-1004

> hash_sid('S-1-5-18')
S-1-5-18
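
To make the idea concrete, here is a hedged sketch of masking only the domain identifier of a SID. It assumes the common S-1-5-21-<a>-<b>-<c>-<RID> layout for domain and machine accounts and treats every other SID as built-in; the helper names and the per-session dictionary are hypothetical and are not taken from MSTICPy.

```python
import random
from typing import Dict

_SUBAUTH_MAP: Dict[str, str] = {}  # per-session mapping of domain sub-authorities


def _map_subauthority(value: str) -> str:
    """Return a random 32-bit number, reusing it for repeated inputs."""
    if value not in _SUBAUTH_MAP:
        _SUBAUTH_MAP[value] = str(random.randint(0, 2**32 - 1))
    return _SUBAUTH_MAP[value]


def sketch_hash_sid(sid: str) -> str:
    """Mask the domain part of an S-1-5-21-... SID, keeping the RID (illustrative only)."""
    parts = sid.split("-")
    # Domain/machine account SIDs: S-1-5-21-<a>-<b>-<c>-<RID>
    if len(parts) == 8 and parts[:4] == ["S", "1", "5", "21"]:
        masked = [_map_subauthority(p) for p in parts[4:7]]
        return "-".join(parts[:4] + masked + parts[7:])
    # Anything else (e.g. S-1-5-18, LocalSystem) is returned unchanged.
    return sid
```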

hash_account

hash_account will randomize an account name while preserving the structure and the one-to-one mapping between masked and actual account names. It preserves built-in accounts such as “root”, “SYSTEM”, etc.

Hash an Account to something recognizable.

Parameters
----------
account : str
    Account name (UPN, NT or simple name)

Returns
-------
str
    Hashed Account

Examples

> hash_account("ian@mydomain.com")
'account-#21786@blbbrfbk.pjb'

> hash_account("NT AUTHORITY/SYSTEM")
'NT AUTHORITY/SYSTEM'

> hash_account("sams_linux_user")
'account-#26953'

> hash_account("local service")
'local service'

> hash_account("root")
'root'

hash_list

hash_list will randomize a list of items preserving the list structure but treating each element as a simple string to hash.

Hash list of strings.

Parameters
----------
item_list : List[str]
    Input list

Returns
-------
List[str]
    Hashed list

Examples

> hash_list(['S-1-5-21-1180699209-877415012-3182924384-1004', 'S-1-5-18'])
['elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'nrllmpbd']

hash_dict

hash_dict will randomize a dict of items preserving the structure and the name of the dictionary keys. Only the values of the keys are hashed.

Hash dictionary values.

Parameters
----------
item_dict : Dict[str, Union[Dict[str, Any], List[Any], str]]
    Input item can be a Dict of strings, lists or other
    dictionaries.

Returns
-------
Dict[str, Any]
    Dictionary with hashed values.

Examples

> hash_dict({'SID1': 'S-1-5-21-1180699209-877415012-3182924384-1004', 'SID2': 'S-1-5-18'})
{'SID1': 'elkbjiboklpknokdeflikamojqjflqmicqiorqfbqboqe', 'SID2': 'nrllmpbd'}
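
The structure-preserving behaviour of hash_list and hash_dict can be pictured as a recursive walk over the container. The sketch below is an assumption about the general approach rather than the library's code; _mask_str is a hypothetical stand-in for whichever string-masking function is used.

```python
import hashlib
from typing import Any


def _mask_str(value: str) -> str:
    """Hypothetical stand-in for a string-masking function."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def sketch_hash_values(item: Any) -> Any:
    """Recursively mask string values, keeping keys and container structure."""
    if isinstance(item, dict):
        return {key: sketch_hash_values(val) for key, val in item.items()}
    if isinstance(item, list):
        return [sketch_hash_values(val) for val in item]
    if isinstance(item, str):
        return _mask_str(item)
    return item  # other types are left untouched in this sketch
```

Note that every leaf is treated as a plain string here, which matches the examples above where a SID inside a list or dict loses its SID structure.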

replace_guid

replace_guid will output a random UUID mapped to the input. The same input UUID will be mapped to the same newly generated output UUID for the current Python session.

In the example below you can see that UUID #4 is the same as #1 and mapped to the same output UUID.

Replace GUID/UUID with mapped random UUID.

Parameters
----------
guid : str
    Input UUID.

Returns
-------
str
    Mapped UUID

Examples

> replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9')
9ef6c321-14f3-4681-8c3b-b596de52d8b0

> replace_guid('ed63d29e-6288-4d66-b10d-8847096fc586')
219a5b0c-3985-49cc-9016-7b23a98c3d53

> replace_guid('ac561203-99b2-4067-a525-60d45ea0d7ff')
8e8ec1e1-6df6-4b41-bbff-b73b1614430b

> replace_guid('cf1b0b29-08ae-4528-839a-5f66eca2cce9')
9ef6c321-14f3-4681-8c3b-b596de52d8b0
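
A session-scoped mapping like the one described above needs little more than a dictionary. The following is a minimal sketch, assuming a module-level cache and case-insensitive matching of the input; the names _GUID_MAP and sketch_replace_guid are hypothetical.

```python
import uuid
from typing import Dict

_GUID_MAP: Dict[str, str] = {}  # lives only for the current Python session


def sketch_replace_guid(guid: str) -> str:
    """Map an input UUID to a random UUID, reusing the mapping for repeats."""
    key = guid.lower()  # assumption: treat differently cased inputs as the same UUID
    if key not in _GUID_MAP:
        _GUID_MAP[key] = str(uuid.uuid4())
    return _GUID_MAP[key]
```

Since the replacement is random rather than derived from the input, masked values from two different sessions cannot be correlated with each other.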

Masking DataFrames

We can use the msticpy pandas extension to mask the data in an entire DataFrame.

See mp_mask.mask for API details.

The masking library contains a mapping for a number of common field names. You can view this list by displaying the attribute:

data_obfus.OBFUS_COL_MAP

In the first example, the TenantId and ResourceGroup columns have been masked.

display(netflow_df.head(3))
netflow_df.head(3).mp_mask.mask()

Warning

The pandas extension and method were renamed in msticpy 0.9.0 from mp_obfus.obfuscate() to mp_mask.mask().

Input DataFrame

TenantId                              TimeGenerated            FlowStartTime            ResourceGroup          VMName           VMIPAddress  PublicIPs                           SrcIP  DestIP  L4Protocol  AllExtIPs
------------------------------------  -----------------------  -----------------------  ---------------------  ---------------  -----------  ----------------------------------  -----  ------  ----------  -------------
52b1ab41-869e-4138-9e40-2a4457f09bf0  2019-02-12 14:22:40.697  2019-02-12 13:00:07.000  asihuntomsworkspacerg  msticalertswin1  10.0.3.5     ['65.55.44.109']                    nan    nan     T           65.55.44.109
52b1ab41-869e-4138-9e40-2a4457f09bf0  2019-02-12 14:22:40.681  2019-02-12 13:00:48.000  asihuntomsworkspacerg  msticalertswin1  10.0.3.5     ['13.71.172.130', '13.71.172.128']  nan    nan     T           13.71.172.128
52b1ab41-869e-4138-9e40-2a4457f09bf0  2019-02-12 14:22:40.681  2019-02-12 13:00:48.000  asihuntomsworkspacerg  msticalertswin1  10.0.3.5     ['13.71.172.130', '13.71.172.128']  nan    nan     T           13.71.172.130

Output DataFrame

TenantId                              TimeGenerated            FlowStartTime            ResourceGroup          VMName           VMIPAddress  PublicIPs                           SrcIP  DestIP  L4Protocol  AllExtIPs
------------------------------------  -----------------------  -----------------------  ---------------------  ---------------  -----------  ----------------------------------  -----  ------  ----------  -------------
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.697  2019-02-12 13:00:07.000  ibmkajbmepnmiaeilfofa  msticalertswin1  10.0.3.5     ['65.55.44.109']                    nan    nan     T           65.55.44.109
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.681  2019-02-12 13:00:48.000  ibmkajbmepnmiaeilfofa  msticalertswin1  10.0.3.5     ['13.71.172.130', '13.71.172.128']  nan    nan     T           13.71.172.128
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.681  2019-02-12 13:00:48.000  ibmkajbmepnmiaeilfofa  msticalertswin1  10.0.3.5     ['13.71.172.130', '13.71.172.128']  nan    nan     T           13.71.172.130

TenantId and ResourceGroup have been masked but VMName and the IPAddress fields have not.

Adding custom column mappings

In the previous example you probably spotted that the VMIPAddress, PublicIPs and AllExtIPs columns were all unchanged. This is because there is no default mapping for these column names in the builtin mapping table.

We can add these columns to a custom mapping dictionary and re-run the obfuscation. See the later section on Creating custom mappings.

col_map = {
    "VMName": ".",
    "VMIPAddress": "ip",
    "PublicIPs": "ip",
    "AllExtIPs": "ip"
}

netflow_df.head(3).mp_mask.mask(column_map=col_map)

Output DataFrame after applying custom column mappings

TenantId                              TimeGenerated            FlowStartTime            ResourceGroup          VMName           VMIPAddress      PublicIPs                           SrcIP  DestIP  L4Protocol  AllExtIPs
------------------------------------  -----------------------  -----------------------  ---------------------  ---------------  ---------------  ----------------------------------  -----  ------  ----------  -------------
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.697  2019-02-12 13:00:07.000  ibmkajbmepnmiaeilfofa  fmlmbnlpdcbnbnn  149.172.239.103  ['62.100.208.57']                   nan    nan     T           62.100.208.57
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.681  2019-02-12 13:00:48.000  ibmkajbmepnmiaeilfofa  fmlmbnlpdcbnbnn  149.172.239.103  ['156.64.40.139', '156.64.40.236']  nan    nan     T           156.64.40.236
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.681  2019-02-12 13:00:48.000  ibmkajbmepnmiaeilfofa  fmlmbnlpdcbnbnn  149.172.239.103  ['156.64.40.139', '156.64.40.236']  nan    nan     T           156.64.40.139

mask_df

You can also call the standard function mask_df to perform the same operation on the DataFrame passed as the data parameter.

Warning

This function was renamed from obfuscate_df to mask_df in msticpy 0.9.0. The previous function name still exists as an alias of mask_df.

data_obfus.mask_df(data=netflow_df.head(3), column_map=col_map)

TenantId                              TimeGenerated            FlowStartTime            ResourceGroup          VMName           VMIPAddress      PublicIPs                           SrcIP  DestIP  L4Protocol  AllExtIPs
------------------------------------  -----------------------  -----------------------  ---------------------  ---------------  ---------------  ----------------------------------  -----  ------  ----------  -------------
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.697  2019-02-12 13:00:07.000  ibmkajbmepnmiaeilfofa  fmlmbnlpdcbnbnn  149.172.239.103  ['62.100.208.57']                   nan    nan     T           62.100.208.57
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.681  2019-02-12 13:00:48.000  ibmkajbmepnmiaeilfofa  fmlmbnlpdcbnbnn  149.172.239.103  ['156.64.40.139', '156.64.40.236']  nan    nan     T           156.64.40.236
68a5a31d-7516-4c54-ad27-3b1360ce0b56  2019-02-12 14:22:40.681  2019-02-12 13:00:48.000  ibmkajbmepnmiaeilfofa  fmlmbnlpdcbnbnn  149.172.239.103  ['156.64.40.139', '156.64.40.236']  nan    nan     T           156.64.40.139

Creating custom mappings

A custom mapping dictionary has entries in the following form:

"ColumnName": "operation"

The operation defines the type of masking method used for that column. Both the column and the operation code must be quoted.

operation code  masking function
--------------  ----------------
"uuid"          replace_guid
"ip"            hash_ip
"str"           hash_string
"dict"          hash_dict
"list"          hash_list
"sid"           hash_sid
"null"          "null"*
None            hash_string*
delims_str      hash_item*

*The last three items require some explanation:

  • null - the null operation code means set the value to empty - i.e. delete the value in the output frame.

  • None (i.e. the dictionary value is None) - defaults to hash_string.

  • delims_str - any string other than those named above is assumed to be a string of delimiters.

See next section for a discussion of use of delimiters.

Note

If you want to only use custom mappings and ignore the builtin mapping table, specify use_default=False as a parameter to either mp_mask.mask() or mask_df.
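
The way an operation code resolves to a masking function can be sketched as a small dispatcher. This is an illustration of the rules listed above, not MSTICPy's code: resolve_operation and its parameters are hypothetical, the masking functions are passed in as plain callables, and representing a "null" result as an empty string is an assumption.

```python
from typing import Callable, Dict, Optional

Masker = Callable[[str], str]


def resolve_operation(
    operation: Optional[str],
    named_ops: Dict[str, Masker],
    default_op: Masker,
    delim_op: Callable[[str, str], str],
) -> Masker:
    """Pick the masking callable for one column-map entry (illustrative only)."""
    if operation is None:
        return default_op  # None -> default string hashing
    if operation == "null":
        return lambda value: ""  # "null" -> blank out the value
    if operation in named_ops:
        return named_ops[operation]  # "uuid", "ip", "str", ...
    # Any other string is treated as a set of delimiter characters.
    return lambda value: delim_op(value, operation)


# Stand-in callables so the sketch runs without MSTICPy installed.
named = {"str": str.upper, "ip": lambda value: "x.x.x.x"}
mask_ip = resolve_operation("ip", named, str.upper, lambda value, delims: value)
print(mask_ip("10.0.0.1"))  # x.x.x.x
```

The ordering matters: named codes are checked before the delimiter fallback, so a delimiter string that happens to equal a named code (for example "ip") would be read as the named operation.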

Using hash_item to preserve the structure/look of the hashed input

Using hash_item with a delimiters string lets you create output that reflects the structure of the input. The delimiters string is specified as a simple string of delimiter characters, e.g. “@,-”

The input string is broken into substrings using each of the delimiters in the delims_str. The substrings are individually hashed and the resulting substrings joined together using the original delimiters. The string is split in the order of the characters in the delims string.

This allows you to create hashed values that bear some resemblance to the original structure of the string. This might be useful for email addresses, qualified domain names and other structured text.

For example: “ian@mydomain.com”

Using the simple hash_string function, the output bears no resemblance to an email address:

hash_string("ian@mydomain.com")
'prqocjmdpbodrafn'

Using hash_item and specifying the expected delimiters we get something like an email address in the output.

hash_item("ian@mydomain.com", "@.")
'bnm@blbbrfbk.pjb'

You use hash_item in your Custom Mapping dictionary by specifying a delimiters string as the operation.
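
The split, hash and rejoin steps described in this section can be sketched as follows. This is a simplified illustration, not the library's implementation: it splits on all delimiter characters in a single pass rather than one delimiter at a time, which gives the same pieces because each piece is hashed independently. The names sketch_hash_item and _mask_piece are hypothetical.

```python
import hashlib
import re


def _mask_piece(piece: str) -> str:
    """Hypothetical stand-in: hash a substring to lowercase letters of equal length."""
    digest = hashlib.sha256(piece.encode("utf-8")).digest()
    return "".join(
        chr(ord("a") + digest[i % len(digest)] % 26) for i in range(len(piece))
    )


def sketch_hash_item(input_item: str, delim: str) -> str:
    """Hash the substrings between delimiter characters, keeping the delimiters."""
    if not delim:
        return _mask_piece(input_item)
    # The capturing group makes re.split return the delimiters as well.
    pieces = re.split(f"([{re.escape(delim)}])", input_item)
    delim_chars = set(delim)
    return "".join(p if p in delim_chars else _mask_piece(p) for p in pieces)
```

With the delimiters "@." the masked form of an email address keeps its "@" and "." in their original positions, so the output still looks like an email address.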

Checking Your Masking Results

Use the check_masking function to ensure that you have masked all of the data columns that you need.

Use silent=False to print out the results. If you use silent=True (the default) it will return 2 lists of unchanged and obfuscated columns.

Note

By default this will check only the first row of the data. You can check other rows using the index parameter.

Warning

The two DataFrames should have a matching index and ordering because the check works by comparing the values in each column, judging that column values that do not match have been masked.

We create partially and fully masked DataFrames to test and run the check against the first of these. We can see that several important columns are listed as unchanged.

partly_obfus_df = netflow_df.head(3).mp_mask.mask()
fully_obfus_df = netflow_df.head(3).mp_mask.mask(column_map=col_map)

data_obfus.check_masking(partly_obfus_df, netflow_df.head(3), silent=False)
===== Start Check ====
Unchanged columns:
------------------
AllExtIPs: 65.55.44.109
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
PublicIPs: ['65.55.44.109']
TimeGenerated: 2019-02-12 14:22:40.697
VMIPAddress: 10.0.3.5
VMName: msticalertswin1

Obfuscated columns:
--------------------
DestIP:   nan ----> nan
ResourceGroup:   asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP:   nan ----> nan
TenantId:   52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa
====== End Check =====

Testing the fully masked data, we can see that all desired columns have been transformed.

data_obfus.check_masking(fully_obfus_df, netflow_df.head(3), silent=False)
===== Start Check ====
Unchanged columns:
------------------
FlowStartTime: 2019-02-12 13:00:07.000
L4Protocol: T
TimeGenerated: 2019-02-12 14:22:40.697

Obfuscated columns:
--------------------
AllExtIPs:   65.55.44.109 ----> 239.3.143.131
DestIP:   nan ----> nan
PublicIPs:   ['65.55.44.109'] ----> ['239.3.143.131']
ResourceGroup:   asihuntomsworkspacerg ----> ibmkajbmepnmiaeilfofa
SrcIP:   nan ----> nan
TenantId:   52b1ab41-869e-4138-9e40-2a4457f09bf0 ----> 56260b2e-9d3f-4ad9-8e65-e4a9230fd5aa
VMIPAddress:   10.0.3.5 ----> 224.21.98.125
VMName:   msticalertswin1 ----> fmlmbnlpdcbnbnn
====== End Check =====
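
The value-by-value comparison that the check relies on can be sketched with plain dictionaries standing in for a single DataFrame row. This is an assumption about the general approach, not the library's code, and sketch_check_row is a hypothetical name. It also shows one side effect of comparing by equality: nan never compares equal to nan, which may be why DestIP and SrcIP appear under "Obfuscated columns" in the output above even though both values are nan.

```python
import math
from typing import Any, Dict, List, Tuple


def sketch_check_row(
    masked_row: Dict[str, Any], original_row: Dict[str, Any]
) -> Tuple[List[str], List[str]]:
    """Return (unchanged, changed) column names for one pair of rows."""
    unchanged: List[str] = []
    changed: List[str] = []
    for col in sorted(original_row):
        # Values that compare equal are judged to be unmasked.
        if masked_row.get(col) == original_row[col]:
            unchanged.append(col)
        else:
            changed.append(col)
    return unchanged, changed


orig = {"TenantId": "52b1ab41", "L4Protocol": "T", "SrcIP": math.nan}
masked = {"TenantId": "56260b2e", "L4Protocol": "T", "SrcIP": float("nan")}
print(sketch_check_row(masked, orig))
# (['L4Protocol'], ['SrcIP', 'TenantId'])
```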