msticpy.transform.base64unpack module

base64_unpack.

The main function of this module is to decode and unpack strings that are obfuscated using base64 and/or certain compression algorithms such as gzip and zip.

The main entry point is unpack_items, which takes either a string or a pandas dataframe (with a specified column) as input. It returns a string with the obfuscated parts replaced by their decoded equivalents. If a part decodes to binary that cannot be rendered as text, a placeholder is used instead.

Other helper functions may also be useful standalone:

  • get_items_from_gzip(binary): Return decompressed gzip content of byte string

  • get_items_from_zip(binary): Return dictionary of zip contents from byte string

  • get_items_from_tar(binary): Return dictionary of tar file contents

  • get_hashes(binary): Return md5, sha1 and sha256 hashes of input byte string

class msticpy.transform.base64unpack.B64ExtractAccessor(pandas_obj)

Bases: object

Base64 Unpack pandas extension.

Initialize the extension.

extract(column, **kwargs) DataFrame

Base64 decode strings taken from a pandas dataframe.

Parameters:
  • data (pd.DataFrame) – dataframe containing column to decode

  • column (str) – Name of dataframe text column

  • trace (bool, optional) – Show additional status (the default is None)

  • utf16 (bool, optional) – Attempt to decode UTF16 byte strings

Returns:

Decoded string and additional metadata in dataframe

Return type:

pd.DataFrame

Notes

Items that decode to utf-8 or utf-16 strings are substituted, in decoded form, for the encoded text in the original string. If an item decodes to a known binary type, the file type is identified and the hashes of the file are returned. If the binary is a known archive type (zip, tar, gzip), the contents of the archive are unpacked. For any binary item, the decoded file is returned as a byte array and as a printable list of byte values.

The columns of the output DataFrame are:

  • decoded string: this is the input string with any decoded sections replaced by the results of the decoding

  • reference : this is an index that matches an index number in the decoded string (e.g. <<encoded binary type=pdf index=1.2>>).

  • original_string : the string prior to decoding

  • file_type : the type of file if this could be determined

  • file_hashes : a dictionary of hashes (the md5, sha1 and sha256 hashes are broken out into separate columns)

  • input_bytes : the binary image as a byte array

  • decoded_string : printable form of the decoded string (either string or list of hex byte values)

  • encoding_type : utf-8, utf-16 or binary

  • md5, sha1, sha256 : the respective hashes of the binary. file_type, file_hashes, input_bytes, md5, sha1 and sha256 will be null if this item is decoded to a string.

  • src_index : the index of the source row in the input frame.

class msticpy.transform.base64unpack.BinaryRecord(reference, original_string, file_name, file_type, input_bytes, decoded_string, encoding_type, file_hashes, md5, sha1, sha256, printable_bytes)

Bases: tuple

Create new instance of BinaryRecord(reference, original_string, file_name, file_type, input_bytes, decoded_string, encoding_type, file_hashes, md5, sha1, sha256, printable_bytes)

count(value, /)

Return number of occurrences of value.

decoded_string

Alias for field number 5

encoding_type

Alias for field number 6

file_hashes

Alias for field number 7

file_name

Alias for field number 2

file_type

Alias for field number 3

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

input_bytes

Alias for field number 4

md5

Alias for field number 8

original_string

Alias for field number 1

printable_bytes

Alias for field number 11

reference

Alias for field number 0

sha1

Alias for field number 9

sha256

Alias for field number 10
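The field aliases above follow from BinaryRecord being a named tuple. The sketch below rebuilds an equivalent tuple with the standard library instead of importing it from msticpy, so the name BinaryRecordSketch and all sample values are assumptions made for illustration; only the field names and their order come from the listing above.

```python
from collections import namedtuple

# Stand-in for BinaryRecord: same field names in the same order.
BinaryRecordSketch = namedtuple(
    "BinaryRecordSketch",
    [
        "reference",
        "original_string",
        "file_name",
        "file_type",
        "input_bytes",
        "decoded_string",
        "encoding_type",
        "file_hashes",
        "md5",
        "sha1",
        "sha256",
        "printable_bytes",
    ],
)

# Hypothetical record for an item that decoded to plain text.
record = BinaryRecordSketch(
    reference="1.1",
    original_string="aGVsbG8gd29ybGQ=",
    file_name=None,
    file_type=None,
    input_bytes=None,
    decoded_string="hello world",
    encoding_type="utf-8",
    file_hashes=None,
    md5=None,
    sha1=None,
    sha256=None,
    printable_bytes=None,
)

# Each attribute is an alias for a positional field.
assert record.decoded_string == record[5]
assert record.printable_bytes == record[11]
```

Because the record is a plain tuple, the inherited count and index methods operate on the field values, not on the field names.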

msticpy.transform.base64unpack.get_hashes(binary: bytes) Dict[str, str]

Return md5, sha1 and sha256 hashes of input byte string.

Parameters:

binary (bytes) – byte string of item to be hashed

Returns:

dictionary of hash algorithm + hash value

Return type:

Dict[str, str]
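As a rough illustration of the return shape described above, the following standard-library sketch computes the same three digests with hashlib. It is not the msticpy source; the function name hashes_sketch and the dictionary keys "md5", "sha1" and "sha256" are assumptions.

```python
import hashlib
from typing import Dict


def hashes_sketch(binary: bytes) -> Dict[str, str]:
    """Return hex digests of the input keyed by algorithm name."""
    return {
        "md5": hashlib.md5(binary).hexdigest(),
        "sha1": hashlib.sha1(binary).hexdigest(),
        "sha256": hashlib.sha256(binary).hexdigest(),
    }


digests = hashes_sketch(b"hello")
for algorithm, value in digests.items():
    print(f"{algorithm}: {value}")
```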

msticpy.transform.base64unpack.get_items_from_gzip(binary: bytes) Tuple[str, Dict[str, bytes]]

Return decompressed gzip contents.

Parameters:

binary (bytes) – byte array of gz file

Returns:

File type + decompressed file

Return type:

Tuple[str, Dict[str, bytes]]
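The sketch below shows one way a gzip byte string can be decompressed into the (file type, contents) pair that the signature describes, using only the standard library. The function name gzip_items_sketch, the "gzip" label and the "gzip_file" key are placeholders chosen for this example, not values confirmed by the documentation above.

```python
import gzip
from typing import Dict, Tuple


def gzip_items_sketch(binary: bytes) -> Tuple[str, Dict[str, bytes]]:
    """Decompress a gzip byte string into a one-entry dictionary."""
    content = gzip.decompress(binary)
    # The label and key below are illustrative placeholders.
    return "gzip", {"gzip_file": content}


# Round trip: compress a sample payload, then unpack it again.
gz_payload = gzip.compress(b"whoami /all")
gz_type, gz_items = gzip_items_sketch(gz_payload)
print(gz_type, gz_items)
```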

msticpy.transform.base64unpack.get_items_from_tar(binary: bytes) Tuple[str, Dict[str, bytes]]

Return dictionary of tar file contents.

Parameters:

binary (bytes) – byte array of tar file

Returns:

Filetype + dictionary of file name + file content

Return type:

Tuple[str, Dict[str, bytes]]
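A minimal standard-library sketch of the same idea follows: it reads a tar archive held in memory and returns a dictionary of file name to file content. The function name tar_items_sketch and the "tar" label are assumptions, and skipping non-file members is a choice made for this example.

```python
import io
import tarfile
from typing import Dict, Tuple


def tar_items_sketch(binary: bytes) -> Tuple[str, Dict[str, bytes]]:
    """Read every regular file from a tar archive held in memory."""
    items: Dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(binary), mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links and devices
            extracted = archive.extractfile(member)
            if extracted is not None:
                items[member.name] = extracted.read()
    return "tar", items


# Build a small archive in memory to exercise the sketch.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as archive:
    script = b"echo hi"
    info = tarfile.TarInfo(name="run.sh")
    info.size = len(script)
    archive.addfile(info, io.BytesIO(script))

tar_type, tar_items = tar_items_sketch(tar_buffer.getvalue())
print(tar_type, tar_items)
```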

msticpy.transform.base64unpack.get_items_from_zip(binary: bytes) Tuple[str, Dict[str, bytes]]

Return dictionary of zip contents.

Parameters:

binary (bytes) – byte array of zip file

Returns:

Filetype + dictionary of file name + file content

Return type:

Tuple[str, Dict[str, bytes]]
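The zip case can be sketched the same way with the standard zipfile module. As before, zip_items_sketch and the "zip" label are illustrative names, not the library's own.

```python
import io
import zipfile
from typing import Dict, Tuple


def zip_items_sketch(binary: bytes) -> Tuple[str, Dict[str, bytes]]:
    """Return the contents of an in-memory zip archive keyed by file name."""
    with zipfile.ZipFile(io.BytesIO(binary)) as archive:
        items = {name: archive.read(name) for name in archive.namelist()}
    return "zip", items


# Build a two-file archive in memory to exercise the sketch.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, mode="w") as archive:
    archive.writestr("note.txt", "hello")
    archive.writestr("bin/data.bin", b"\x00\x01")

zip_type, zip_items = zip_items_sketch(zip_buffer.getvalue())
print(zip_type, sorted(zip_items))
```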

msticpy.transform.base64unpack.unpack(input_string: str, trace: bool = False, utf16: bool = False) Tuple[str, DataFrame]

Base64 decode an input string.

Parameters:
  • input_string (str) – single string to decode

  • trace (bool, optional) – Show additional status (the default is False)

  • utf16 (bool, optional) – Attempt to decode UTF16 byte strings

Returns:

Decoded string and additional metadata

Return type:

Tuple[str, pd.DataFrame]

Notes

Items that decode to utf-8 or utf-16 strings are substituted, in decoded form, for the encoded text in the original string. If an item decodes to a known binary type, the function identifies the file type and returns the hashes of the file. If the binary is a known archive type (zip, tar, gzip), it unpacks the contents of the archive. For any binary item, it returns the decoded file as a byte array and as a printable list of byte values.

If the input is a string the function returns:

  • decoded string: this is the input string with any decoded sections replaced by the results of the decoding
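The decode-and-replace behavior described in these notes can be approximated with the standard library. The sketch below is not the msticpy algorithm: the candidate regex, the NUL-byte heuristic for UTF-16, the placeholder format and the record keys are all assumptions made to keep the example short, and it does not identify file types or unpack archives.

```python
import base64
import binascii
import re
from typing import List, Optional, Tuple

# Assumed pattern: runs of 20+ base64 characters with optional padding.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")


def _to_text(raw: bytes, utf16: bool) -> Tuple[Optional[str], str]:
    """Return (text, encoding), or (None, "binary") if raw is not text."""
    # Heuristic for this sketch: NUL bytes suggest UTF-16 rather than UTF-8.
    if utf16 and b"\x00" in raw:
        try:
            return raw.decode("utf-16-le"), "utf-16"
        except UnicodeDecodeError:
            pass
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return None, "binary"


def decode_candidates(text: str, utf16: bool = False) -> Tuple[str, List[dict]]:
    """Replace base64 runs in text and collect one record per decoded item."""
    records: List[dict] = []

    def _replace(match: re.Match) -> str:
        encoded = match.group(0)
        try:
            raw = base64.b64decode(encoded, validate=True)
        except (binascii.Error, ValueError):
            return encoded  # not valid base64, leave the text untouched
        decoded, encoding = _to_text(raw, utf16)
        index = len(records) + 1
        records.append(
            {
                "reference": index,
                "original_string": encoded,
                "encoding_type": encoding,
                "input_bytes": raw,
            }
        )
        # Hypothetical placeholder for items that are not text.
        return decoded if decoded is not None else f"<binary index={index}>"

    return B64_CANDIDATE.sub(_replace, text), records


encoded_cmd = base64.b64encode("Write-Host hello".encode("utf-16-le")).decode("ascii")
result, items = decode_candidates("powershell -enc " + encoded_cmd, utf16=True)
print(result)  # powershell -enc Write-Host hello
```

The records list plays the role of the metadata table: each entry keeps the original encoded text next to the decoded bytes, so a placeholder in the output string can be traced back to its source.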

msticpy.transform.base64unpack.unpack_df(data: DataFrame, column: str, trace: bool = False, utf16: bool = False) DataFrame

Base64 decode strings taken from a pandas dataframe.

Parameters:
  • data (pd.DataFrame) – dataframe containing column to decode

  • column (str) – Name of dataframe text column

  • trace (bool, optional) – Show additional status (the default is False)

  • utf16 (bool, optional) – Attempt to decode UTF16 byte strings

Returns:

Decoded string and additional metadata in dataframe

Return type:

pd.DataFrame

Notes

Items that decode to utf-8 or utf-16 strings are substituted, in decoded form, for the encoded text in the original string. If an item decodes to a known binary type, the function identifies the file type and returns the hashes of the file. If the binary is a known archive type (zip, tar, gzip), it unpacks the contents of the archive. For any binary item, it returns the decoded file as a byte array and as a printable list of byte values.

The columns of the output DataFrame are:

  • decoded string: this is the input string with any decoded sections replaced by the results of the decoding

  • reference : this is an index that matches an index number in the decoded string (e.g. <<encoded binary type=pdf index=1.2>>).

  • original_string : the string prior to decoding

  • file_type : the type of file if this could be determined

  • file_hashes : a dictionary of hashes (the md5, sha1 and sha256 hashes are broken out into separate columns)

  • input_bytes : the binary image as a byte array

  • decoded_string : printable form of the decoded string (either string or list of hex byte values)

  • encoding_type : utf-8, utf-16 or binary

  • md5, sha1, sha256 : the respective hashes of the binary. file_type, file_hashes, input_bytes, md5, sha1 and sha256 will be null if this item is decoded to a string.

  • src_index : the index of the source row in the input frame.

msticpy.transform.base64unpack.unpack_items(input_string: str | None = None, data: DataFrame | None = None, column: str | None = None, trace: bool = False, utf16: bool = False) Any

Base64 decode an input string or strings taken from a pandas dataframe.

Parameters:
  • input_string (str, optional) – single string to decode (the default is None)

  • data (pd.DataFrame, optional) – dataframe containing column to decode (the default is None)

  • column (str, optional) – Name of dataframe text column (the default is None)

  • trace (bool, optional) – Show additional status (the default is False)

  • utf16 (bool, optional) – Attempt to decode UTF16 byte strings

Returns:

  • Tuple[str, pd.DataFrame] (if input_string) – Decoded string and additional metadata

  • pd.DataFrame – Decoded string and additional metadata in dataframe

Notes

If the input is a dataframe you must supply the name of the column to use.

Items that decode to utf-8 or utf-16 strings are substituted, in decoded form, for the encoded text in the original string. If an item decodes to a known binary type, the function identifies the file type and returns the hashes of the file. If the binary is a known archive type (zip, tar, gzip), it unpacks the contents of the archive. For any binary item, it returns the decoded file as a byte array and as a printable list of byte values.

If the input is a string the function returns:

  • decoded string: this is the input string with any decoded sections replaced by the results of the decoding

It also returns the data as a Pandas DataFrame with the following columns:

  • reference : this is an index that matches an index number in the returned string (e.g. <<encoded binary type=pdf index=1.2>>).

  • original_string : the string prior to decoding

  • file_type : the type of file if this could be determined

  • file_hashes : a dictionary of hashes (the md5, sha1 and sha256 hashes are broken out into separate columns)

  • input_bytes : the binary image as a byte array

  • decoded_string : printable form of the decoded string (either string or list of hex byte values)

  • encoding_type : utf-8, utf-16 or binary

  • md5, sha1, sha256 : the respective hashes of the binary. file_type, file_hashes, input_bytes, md5, sha1 and sha256 will be null if this item is decoded to a string.

If the input is a dataframe the output dataframe will also include the following column:

  • src_index : the index of the source row in the input frame. This allows you to re-join the output data to the input data.
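To make the re-join idea concrete without depending on pandas, the sketch below uses plain dictionaries. The row contents, host names and reference values are invented for illustration; only the role of src_index comes from the description above.

```python
from typing import Dict, List

# Hypothetical input rows, keyed by their index in the source frame.
source_rows: Dict[int, dict] = {
    0: {"host": "web01", "command": "cmd /c dir"},
    1: {"host": "web02", "command": "powershell -enc <base64 text>"},
}

# Hypothetical decoder output: one record per decoded item,
# each tagged with the index of the row it came from.
decoded_rows: List[dict] = [
    {"src_index": 1, "reference": "1.1", "encoding_type": "utf-16"},
]

# Re-join: place the source row's fields next to each decoded record.
joined = [{**source_rows[row["src_index"]], **row} for row in decoded_rows]

for row in joined:
    print(row["host"], row["reference"], row["encoding_type"])
```

With pandas the equivalent step would be a merge of the output frame's src_index column against the index of the input frame.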