msticpy.analysis.eventcluster module

eventcluster module.

This module is intended to be used to summarize large numbers of events into clusters of different patterns. High volume repeating events can often make it difficult to see unique and interesting items.

The module contains functions to generate clusterable features from string data. For example, an administration command that does some maintenance on thousands of servers with a commandline such as::

    install-update -hostname {host.fqdn} -tmp:/tmp/{GUID}/rollback

can be collapsed into a single cluster pattern by ignoring the character values in the string and using delimiters or tokens to group the values.
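
As a rough illustration of that idea (not the module's own code), the sketch below reduces each command line to the sequence of delimiter characters it contains, so that command lines differing only in host name or temporary directory fall into the same group. The sample command lines and the helper delimiter_signature are assumptions made for this example; the DELIMS pattern is the default delim_list documented for delim_count.

```python
import re
from collections import defaultdict

# Delimiter character class, copied from the documented default delim_list.
DELIMS = r'[\s\-\\/\.,"\'|&:;%$()]'


def delimiter_signature(command_line: str) -> str:
    """Return only the delimiter characters, in order (hypothetical helper)."""
    return "".join(re.findall(DELIMS, command_line))


# Made-up sample data: the same command with different host and directory values.
command_lines = [
    "install-update -hostname web01.contoso.com -tmp:/tmp/1f0c9a7e/rollback",
    "install-update -hostname sql17.contoso.com -tmp:/tmp/9b44d2aa/rollback",
    "whoami /all",
]

# Group command lines that share the same delimiter sequence.
groups = defaultdict(list)
for cmd in command_lines:
    groups[delimiter_signature(cmd)].append(cmd)

for signature, members in groups.items():
    print(repr(signature), len(members))
```

The two install-update command lines share one signature and the whoami command gets its own, which is the kind of grouping the module aims for with numeric features.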

This is an unsupervised learning module implemented using scikit-learn DBSCAN.

Contains:

  • dbcluster_events: generic clustering method using DBSCAN, designed to summarize process events and other similar data by grouping on common features.

  • add_process_features: derives numerical features from text features such as commandline and process path.

msticpy.analysis.eventcluster.add_process_features(input_frame: DataFrame, path_separator: Optional[str] = None, force: bool = False) DataFrame

Add numerical features based on patterns of command line and process name.

Parameters
  • input_frame (pd.DataFrame) – The input dataframe

  • path_separator (str, optional) – Path separator. If not supplied, try to determine from ‘NewProcessName’ column of first 10 rows (the default is None)

  • force (bool, optional) – Forces re-calculation of feature columns even if they already exist (the default is False)

Returns

Copy of the dataframe with the additional numeric features

Return type

pd.DataFrame

Notes

Features added:

  • processNameLen: length of process file name (inc path)

  • processNameTokens: the number of elements in the path

  • processName: the process file name (minus path)

  • commandlineTokens: number of space-separated tokens in the command line

  • commandlineLen: length of the command line

  • commandlineLogLen: log10 length of commandline

  • isSystemSession: 1 if session Id is 0x3e7 for Windows or -1 for Linux

  • commandlineTokensFull: counts number of token separators in commandline [\s\-\\/\.,"\'|&:;%$()]

  • pathScore: sum of ord() value of characters in path

  • pathLogScore: log10 of pathScore

  • commandlineScore: sum of ord() value of characters in commandline

  • commandlineLogScore: log10 of commandlineScore
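
To make the feature list more concrete, here is a minimal sketch that computes several of these values for a single event in plain Python. It is an approximation based only on the descriptions above, not the function's actual implementation; the helper sketch_process_features, its parameters, and the sample path and command line are all hypothetical, and it assumes non-empty inputs so that log10 is defined.

```python
import math
import re

# Token separator class as documented for commandlineTokensFull.
DELIMS = r'[\s\-\\/\.,"\'|&:;%$()]'


def sketch_process_features(process_path: str, command_line: str,
                            path_separator: str = "\\") -> dict:
    """Compute a few of the listed features for one event (hypothetical helper)."""
    path_parts = process_path.split(path_separator)
    path_score = sum(ord(ch) for ch in process_path)
    return {
        "processNameLen": len(process_path),            # length including path
        "processNameTokens": len(path_parts),           # elements in the path
        "processName": path_parts[-1],                  # file name without path
        "commandlineTokens": len(command_line.split(" ")),
        "commandlineLen": len(command_line),
        "commandlineLogLen": math.log10(len(command_line)),
        "commandlineTokensFull": len(re.findall(DELIMS, command_line)),
        "pathScore": path_score,
        "pathLogScore": math.log10(path_score),
    }


# Made-up Windows process event.
features = sketch_process_features("C:\\Windows\\System32\\cmd.exe", "cmd.exe /c dir")
for name, value in features.items():
    print(name, value)
```

The log-scaled variants compress the range of the raw lengths and scores, which keeps a few very long command lines from dominating distance calculations.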

msticpy.analysis.eventcluster.char_ord_score(value: str, scale: int = 1) int

Return sum of ord values of characters in string.

Parameters
  • value (str) – Data to process

  • scale (int, optional) – Reduces the scale of the feature, and so the influence of variations in this feature on the clustering algorithm (the default is 1)

Returns

The sum of the ordinal values of the characters in value.

Return type

int

Notes

This function sums the ordinal value of each character in the input string. Two strings with minor differences will result in a similar score. However, for strings with highly variable content (e.g. command lines or http requests containing GUIDs) this may result in too much variance to be useful when you are trying to detect similar patterns. You can use the scale parameter to reduce the influence of features using this function on clustering and anomaly algorithms.
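
The behaviour described in these notes can be sketched in a few lines. The helper ord_score below is hypothetical, and the way scale is applied (integer division of the sum) is an assumption made for illustration; check the module source for the exact arithmetic.

```python
def ord_score(value: str, scale: int = 1) -> int:
    """Sum the code points of value, divided by scale and rounded down (assumed)."""
    return sum(ord(ch) for ch in value) // scale


a = ord_score("cmd.exe /c dir")
b = ord_score("cmd.exe /c dis")  # one character changed: 'r' -> 's'
print(a, b, b - a)               # scores differ by exactly 1

# A larger scale shrinks the feature, so it contributes less to distances.
print(ord_score("cmd.exe /c dir", scale=10))
```

Because the score is a plain sum, it ignores character order: "abc" and "cba" score the same. Treat it as a coarse similarity signal, not an identifier.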

msticpy.analysis.eventcluster.char_ord_score_df(data: DataFrame, column: str, scale: int = 1) Series

Return sum of ord values of characters in each string of the column.

Parameters
  • data (pd.DataFrame) – The DataFrame to process

  • column (str) – Column name to process

  • scale (int, optional) – Reduces the scale of the feature, and so the influence of variations in this feature on the clustering algorithm (the default is 1)

Returns

The sum of the ordinal values of the characters in column.

Return type

pd.Series

Notes

This function sums the ordinal value of each character in the input string. Two strings with minor differences will result in a similar score. However, for strings with highly variable content (e.g. command lines or http requests containing GUIDs) this may result in too much variance to be useful when you are trying to detect similar patterns. You can use the scale parameter to reduce the influence of features using this function on clustering and anomaly algorithms.

msticpy.analysis.eventcluster.crc32_hash(value: str) int

Return the CRC32 hash of the input string.

Parameters

value (str) – Data to process

Returns

CRC32 hash

Return type

int
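
For readers unfamiliar with CRC32 as a feature, a minimal sketch using the standard library follows. The helper crc32_of is hypothetical, and hashing the UTF-8 encoding of the string is an assumption about how the input is converted to bytes.

```python
import zlib


def crc32_of(value: str) -> int:
    """CRC32 of the UTF-8 bytes of value (hypothetical helper)."""
    return zlib.crc32(value.encode("utf-8"))


print(crc32_of("cmd.exe"))
print(crc32_of("cmd.exf"))  # a one-character change gives an unrelated value
```

Unlike an ordinal-sum score, a hash only tells you whether two strings are identical; near-identical strings produce values that are far apart, which matters when the hash is fed to a distance-based algorithm such as DBSCAN.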

msticpy.analysis.eventcluster.crc32_hash_df(data: DataFrame, column: str) Series

Return the CRC32 hash of the input column.

Parameters
  • data (pd.DataFrame) – The DataFrame to process

  • column (str) – Column name to process

Returns

CRC32 hash of input column

Return type

pd.Series

msticpy.analysis.eventcluster.dbcluster_events(data: Any, cluster_columns: Optional[List[Any]] = None, verbose: bool = False, normalize: bool = True, time_column: str = 'TimeCreatedUtc', max_cluster_distance: float = 0.01, min_cluster_samples: int = 2, **kwargs) Tuple[DataFrame, sklearn.cluster.DBSCAN, ndarray]

Cluster data set according to cluster_columns features.

Parameters
  • data (Any) – Input data as a pandas DataFrame or numpy array

  • cluster_columns (List[Any], optional) – List of columns to use as features. For a DataFrame this is a list of column names; for a numpy array it is a list of column indexes (the default is None)

  • verbose (bool, optional) – Print additional information about clustering results (the default is False)

  • normalize (bool, optional) – Normalize the input data; should probably always be True (the default is True)

  • time_column (str, optional) – If there is a time column the output data will be ordered by this (the default is ‘TimeCreatedUtc’)

  • max_cluster_distance (float, optional) – DBSCAN eps (max cluster member distance) (the default is 0.01)

  • min_cluster_samples (int, optional) – DBSCAN min_samples (the minimum cluster size) (the default is 2)

  • kwargs – Other arguments are passed to the DBSCAN constructor

Returns

Tuple of the output dataframe with clustered rows, the DBSCAN model, and the normalized data set

Return type

Tuple[pd.DataFrame, DBSCAN, np.ndarray]
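
To show how max_cluster_distance (DBSCAN eps) and min_cluster_samples (DBSCAN min_samples) shape the result, the sketch below implements a simplified DBSCAN in pure Python. It is for illustration only: the module itself uses scikit-learn's DBSCAN, and the function dbscan_sketch and the sample points (assumed to be already normalized feature rows) are invented for this example.

```python
from math import dist


def dbscan_sketch(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise (simplified DBSCAN)."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:  # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels


# Made-up, already-normalized two-feature rows.
points = [(0.10, 0.20), (0.10, 0.205), (0.105, 0.20),
          (0.90, 0.90), (0.50, 0.10), (0.50, 0.105)]

labels = dbscan_sketch(points, eps=0.01, min_samples=2)
print(labels)  # [0, 0, 0, -1, 1, 1]

stricter = dbscan_sketch(points, eps=0.01, min_samples=3)
print(stricter)  # [0, 0, 0, -1, -1, -1]
```

With the documented defaults (0.01 and 2), any two rows closer than 0.01 in normalized feature space form a cluster; raising min_cluster_samples pushes small groups into the noise label -1.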

msticpy.analysis.eventcluster.delim_count(value: str, delim_list: str = '[\\s\\-\\\\/\\.,"\\\'|&:;%$()]') int

Count the delimiters in the input string.

Parameters
  • value (str) – Data to process

  • delim_list (str, optional) –

    delimiters to use. The default is:

    [\s\-\\/\.,"\'|&:;%$()]
    

Returns

Count of delimiters in the string.

Return type

int
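
The delim_list argument is a regular-expression character class, so counting delimiters amounts to counting regex matches. A minimal sketch, assuming that approach (the helper count_delims is hypothetical):

```python
import re

# The documented default delimiter class.
DEFAULT_DELIMS = r'[\s\-\\/\.,"\'|&:;%$()]'


def count_delims(value: str, delim_list: str = DEFAULT_DELIMS) -> int:
    """Count characters in value that match the delimiter class (hypothetical helper)."""
    return len(re.findall(delim_list, value))


print(count_delims("cmd.exe /c dir"))           # 4: '.', ' ', '/', ' '
print(count_delims("cmd.exe /c dir", r"[\s]"))  # 2: whitespace only
```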

msticpy.analysis.eventcluster.delim_count_df(data: DataFrame, column: str, delim_list: str = '[\\s\\-\\\\/\\.,"\\\'|&:;%$()]') Series

Count the delimiters in input column.

Parameters
  • data (pd.DataFrame) – The DataFrame to process

  • column (str) – The name of the column to process

  • delim_list (str, optional) –

    delimiters to use. The default is:

    [\s\-\\/\.,"\'|&:;%$()]
    

Returns

Count of delimiters in the string in column.

Return type

pd.Series

msticpy.analysis.eventcluster.delim_hash(value: str, delim_list: str = '[\\s\\-\\\\/\\.,"\\\'|&:;%$()]') int

Return a hash (CRC32) of the delimiters from the input string.

Parameters
  • value (str) – Data to process

  • delim_list (str, optional) –

    delimiters to use. The default is:

    [\s\-\\/\.,"\'|&:;%$()]
    

Returns

Hash of delimiter set in the string.

Return type

int
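
One way to read this function is: extract the delimiters in order, join them, and hash the result into a single number that can serve as a feature column. The sketch below follows that reading; the helper hash_delims is hypothetical, and the use of zlib.crc32 over UTF-8 bytes is an assumption.

```python
import re
import zlib

# The documented default delimiter class.
DEFAULT_DELIMS = r'[\s\-\\/\.,"\'|&:;%$()]'


def hash_delims(value: str, delim_list: str = DEFAULT_DELIMS) -> int:
    """CRC32 of the ordered delimiter characters in value (hypothetical helper)."""
    delims = "".join(re.findall(delim_list, value))
    return zlib.crc32(delims.encode("utf-8"))


# Same command shape, different argument values: the hashes match.
same_shape = hash_delims("ping -n 1 10.0.0.1") == hash_delims("ping -n 5 10.9.8.7")
print(same_shape)

# A different command shape has a different delimiter sequence.
print(hash_delims("ping -n 1 10.0.0.1"), hash_delims("ping 10.0.0.1"))
```

Two strings with the same delimiter layout always hash to the same value, regardless of the non-delimiter characters between them.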

msticpy.analysis.eventcluster.plot_cluster(db_cluster: sklearn.cluster.DBSCAN, data: DataFrame, x_predict: ndarray, plot_label: Optional[str] = None, plot_features: Tuple[int, int] = (0, 1), verbose: bool = False, cut_off: int = 3, xlabel: Optional[str] = None, ylabel: Optional[str] = None)

Plot clustered data as scatter chart.

Parameters
  • db_cluster (DBSCAN) – DBSCAN cluster model (from scikit-learn DBSCAN).

  • data (pd.DataFrame) – Dataframe containing original data.

  • x_predict (np.ndarray) – The DBSCAN predict numpy array

  • plot_label (str, optional) – If set, the column used to label data points (the default is None)

  • plot_features (Tuple[int, int], optional) – Which two features in x_predict to plot (the default is (0, 1))

  • verbose (bool, optional) – Verbose execution with some extra info (the default is False)

  • cut_off (int, optional) – The cluster size below which items are considered outliers (the default is 3)

  • xlabel (str, optional) – x-axis label (the default is None)

  • ylabel (str, optional) – y-axis label (the default is None)

msticpy.analysis.eventcluster.token_count(value: str, delimiter: str = ' ') int

Return count of delimiter-separated tokens in the input string.

Parameters
  • value (str) – Data to process

  • delimiter (str, optional) – Delimiter used to split the string (the default is ' ')

Returns

count of tokens

Return type

int
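
A minimal sketch of token counting, assuming it is a plain split on the delimiter (the helper count_tokens is hypothetical):

```python
def count_tokens(value: str, delimiter: str = " ") -> int:
    """Number of pieces produced by splitting value on delimiter (hypothetical helper)."""
    return len(value.split(delimiter))


print(count_tokens("cmd.exe /c dir"))                       # 3
print(count_tokens("C:\\Windows\\System32\\cmd.exe", "\\"))  # 4
print(count_tokens("a  b"))                                  # 3: the double space yields an empty token
```

Under this assumption, repeated delimiters inflate the count because each one produces an empty token, so padded command lines score higher than their trimmed equivalents.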

msticpy.analysis.eventcluster.token_count_df(data: DataFrame, column: str, delimiter: str = ' ') Series

Return count of delimiter-separated tokens in each string of a pd.Series column.

Parameters
  • data (pd.DataFrame) – The DataFrame to process

  • column (str) – Column name to process

  • delimiter (str, optional) – Delimiter used to split the column string (the default is ' ')

Returns

count of tokens in strings in column

Return type

pd.Series