msticpy.analysis package
Subpackages
- msticpy.analysis.anomalous_sequence package
  - Subpackages
    - msticpy.analysis.anomalous_sequence.utils package
      - Submodules
        - msticpy.analysis.anomalous_sequence.utils.cmds_only module
        - msticpy.analysis.anomalous_sequence.utils.cmds_params_only module
        - msticpy.analysis.anomalous_sequence.utils.cmds_params_values module
        - msticpy.analysis.anomalous_sequence.utils.data_structures module
        - msticpy.analysis.anomalous_sequence.utils.laplace_smooth module
        - msticpy.analysis.anomalous_sequence.utils.probabilities module
      - Module contents
  - Submodules
    - msticpy.analysis.anomalous_sequence.anomalous module
    - msticpy.analysis.anomalous_sequence.model module
    - msticpy.analysis.anomalous_sequence.sessionize module
  - Module contents
Submodules
msticpy.analysis.cluster_auditd module
Auditd cluster function.
- msticpy.analysis.cluster_auditd.cluster_auditd_processes(audit_data: pandas.core.frame.DataFrame, app: Optional[str] = None) pandas.core.frame.DataFrame
Clusters process data into specific processes.
- Parameters
audit_data (pd.DataFrame) – The Audit data containing process creation events
app (str, optional) – The name of a specific app you wish to cluster
- Returns
Details of the clustered process
- Return type
pd.DataFrame
msticpy.analysis.eventcluster module
eventcluster module.
This module is intended to be used to summarize large numbers of events into clusters of different patterns. High volume repeating events can often make it difficult to see unique and interesting items.
The module contains functions to generate clusterable features from string data. For example, an administration command that does some maintenance on thousands of servers with a commandline such as::
install-update -hostname {host.fqdn} -tmp:/tmp/{GUID}/rollback
can be collapsed into a single cluster pattern by ignoring the character values in the string and using delimiters or tokens to group the values.
This is an unsupervised learning module implemented using scikit-learn's DBSCAN.
Contains: dbcluster_events: generic clustering method using DBSCAN designed to summarize process events and other similar data by grouping on common features.
add_process_features: derives numerical features from text features such as commandline and process path.
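The collapsing idea described above can be sketched in plain Python: two command lines that differ only in host name and path values produce identical token and delimiter counts, so they map to the same cluster pattern. The delimiter set below is modeled on the module's documented default; the helper is illustrative, not the msticpy API.

```python
import re

# Delimiter class modeled on the module's documented default (an assumption here).
DELIMS = r'[\s\-\\/\.,"\'|&:;%$()]'

def features(cmdline: str) -> tuple:
    """Return (space-separated token count, delimiter count) for a command line."""
    return (len(cmdline.split(" ")), len(re.findall(DELIMS, cmdline)))

# Same command shape, different host and GUID-like values:
cmd1 = "install-update -hostname srv01.contoso.com -tmp:/tmp/aaa/rollback"
cmd2 = "install-update -hostname srv02.fabrikam.com -tmp:/tmp/bbb/rollback"
print(features(cmd1) == features(cmd2))  # True - they collapse together
```

Because the feature vectors are identical, a clustering pass over these features groups both events into one pattern.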
- msticpy.analysis.eventcluster.add_process_features(input_frame: pandas.core.frame.DataFrame, path_separator: Optional[str] = None, force: bool = False) pandas.core.frame.DataFrame
Add numerical features based on patterns of command line and process name.
- Parameters
input_frame (pd.DataFrame) – The input dataframe
path_separator (str, optional) – Path separator. If not supplied, try to determine from ‘NewProcessName’ column of first 10 rows (the default is None)
force (bool, optional) – Forces re-calculation of feature columns even if they already exist (the default is False)
- Returns
Copy of the dataframe with the additional numeric features
- Return type
pd.DataFrame
Notes
Features added:
processNameLen: length of process file name (inc path)
processNameTokens: the number of elements in the path
processName: the process file name (minus path)
commandlineTokens: number of space-separated tokens in the command line
commandlineLen: length of the command line
commandlineLogLen: log10 length of commandline
isSystemSession: 1 if session Id is 0x3e7 for Windows or -1 for Linux
commandlineTokensFull: counts number of token separators in commandline [\s\-\\/\.,"\'|&:;%$()]
pathScore: sum of ord() value of characters in path
pathLogScore: log10 of pathScore
commandlineScore: sum of ord() value of characters in commandline
commandlineLogScore: log10 of commandlineScore
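A few of the features listed above can be computed by hand for a single event. This is an illustrative sketch of the feature definitions, not the add_process_features implementation; the sample process path and command line are invented.

```python
import math

process = "C:\\Windows\\System32\\cmd.exe"
commandline = 'cmd.exe /c "ping 10.0.0.1"'
path_sep = "\\"  # on Linux data this would be "/"

feats = {
    "processNameLen": len(process),                       # length incl. path
    "processNameTokens": len(process.split(path_sep)),    # path elements
    "processName": process.split(path_sep)[-1],           # file name only
    "commandlineTokens": len(commandline.split(" ")),     # space-separated tokens
    "commandlineLen": len(commandline),
    "commandlineLogLen": math.log10(len(commandline)),
    "commandlineScore": sum(ord(ch) for ch in commandline),
}
print(feats["processName"], feats["commandlineTokens"])
```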
- msticpy.analysis.eventcluster.char_ord_score(value: str, scale: int = 1) int
Return sum of ord values of characters in string.
- Parameters
value (str) – Data to process
scale (int, optional) – Reduce the scale of the feature, reducing the influence of variations in this feature on the clustering algorithm (the default is 1)
- Returns
Sum of the ordinal values of the characters in value (reduced by scale).
- Return type
int
Notes
This function sums the ordinal value of each character in the input string. Two strings with minor differences will result in a similar score. However, for strings with highly variable content (e.g. command lines or http requests containing GUIDs) this may result in too much variance to be useful when you are trying to detect similar patterns. You can use the scale parameter to reduce the influence of features using this function on clustering and anomaly algorithms.
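A pure-Python sketch of this scoring, with integer division standing in for the scale reduction (the exact rounding msticpy applies is an assumption here):

```python
def char_ord_score(value: str, scale: int = 1) -> int:
    """Sum the ordinal values of the characters, reduced by scale."""
    return sum(ord(ch) for ch in value) // scale

# Similar strings score similarly; scale damps the residual variance.
print(char_ord_score("net user admin"), char_ord_score("net user admn1"))
print(char_ord_score("net user admin", scale=10))
```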
- msticpy.analysis.eventcluster.char_ord_score_df(data: pandas.core.frame.DataFrame, column: str, scale: int = 1) pandas.core.series.Series
Return sum of ord values of characters in string.
- Parameters
data (pd.DataFrame) – The DataFrame to process
column (str) – Column name to process
scale (int, optional) – Reduce the scale of the feature, reducing the influence of variations in this feature on the clustering algorithm (the default is 1)
- Returns
The sum of the ordinal values of the characters in column.
- Return type
pd.Series
Notes
This function sums the ordinal value of each character in the input string. Two strings with minor differences will result in a similar score. However, for strings with highly variable content (e.g. command lines or http requests containing GUIDs) this may result in too much variance to be useful when you are trying to detect similar patterns. You can use the scale parameter to reduce the influence of features using this function on clustering and anomaly algorithms.
- msticpy.analysis.eventcluster.crc32_hash(value: str) int
Return the CRC32 hash of the input column.
- Parameters
value (str) – Data to process
- Returns
CRC32 hash
- Return type
int
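The same hash can be approximated with the standard library's zlib.crc32; the UTF-8 encoding choice here is an assumption, not necessarily what msticpy uses internally.

```python
import zlib

def crc32_hash(value: str) -> int:
    """Return the CRC32 hash of a string (sketch of the documented function)."""
    return zlib.crc32(value.encode("utf-8"))

# Identical strings hash identically; a one-character change gives a
# completely different value, so this works as a categorical feature.
print(crc32_hash("cmd.exe /c dir"), crc32_hash("cmd.exe /c dlr"))
```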
- msticpy.analysis.eventcluster.crc32_hash_df(data: pandas.core.frame.DataFrame, column: str) pandas.core.series.Series
Return the CRC32 hash of the input column.
- Parameters
data (pd.DataFrame) – The DataFrame to process
column (str) – Column name to process
- Returns
CRC32 hash of input column
- Return type
pd.Series
- msticpy.analysis.eventcluster.dbcluster_events(data: Any, cluster_columns: Optional[List[Any]] = None, verbose: bool = False, normalize: bool = True, time_column: str = 'TimeCreatedUtc', max_cluster_distance: float = 0.01, min_cluster_samples: int = 2, **kwargs) Tuple[pandas.core.frame.DataFrame, sklearn.cluster.DBSCAN, numpy.ndarray]
Cluster data set according to cluster_columns features.
- Parameters
data (Any) – Input data as a pandas DataFrame or numpy array
cluster_columns (List[Any], optional) – List of columns to use for features - for DataFrame this is a list of column names - for numpy array this is a list of column indexes
verbose (bool, optional) – Print additional information about clustering results (the default is False)
normalize (bool, optional) – Normalize the input data (should probably always be True)
time_column (str, optional) – If there is a time column the output data will be ordered by this (the default is ‘TimeCreatedUtc’)
max_cluster_distance (float, optional) – DBSCAN eps (max cluster member distance) (the default is 0.01)
min_cluster_samples (int, optional) – DBSCAN min_samples (the minimum cluster size) (the default is 2)
kwargs – Other arguments are passed to the DBSCAN constructor
- Returns
Output dataframe with clustered rows, the fitted DBSCAN model, and the normalized data set
- Return type
Tuple[pd.DataFrame, DBSCAN, np.ndarray]
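The general shape of what dbcluster_events does can be sketched with scikit-learn directly. The feature values below are invented, and StandardScaler is used here as a stand-in for the function's normalization step (an assumption, not msticpy's exact choice).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (e.g. commandlineLen, commandlineTokens)
# for four events: three near-identical admin commands and one oddity.
features = np.array([
    [120.0, 8.0],
    [121.0, 8.0],
    [119.0, 8.0],
    [45.0, 3.0],
])

x_norm = StandardScaler().fit_transform(features)
model = DBSCAN(eps=0.5, min_samples=2).fit(x_norm)
print(model.labels_)  # the lone event is labeled -1 (noise)
```

The three repeated commands collapse into one cluster; the dissimilar event is left as DBSCAN noise, which is exactly the summarization behavior the module description promises.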
- msticpy.analysis.eventcluster.delim_count(value: str, delim_list: str = '[\\s\\-\\\\/\\.,"\\\'|&:;%$()]') int
Count the delimiters in input column.
- Parameters
value (str) – Data to process
delim_list (str, optional) –
delimiters to use. The default is:
[\s\-\\/\.,"\'|&:;%$()]
- Returns
Count of delimiters in the string.
- Return type
int
- msticpy.analysis.eventcluster.delim_count_df(data: pandas.core.frame.DataFrame, column: str, delim_list: str = '[\\s\\-\\\\/\\.,"\\\'|&:;%$()]') pandas.core.series.Series
Count the delimiters in input column.
- Parameters
data (pd.DataFrame) – The DataFrame to process
column (str) – The name of the column to process
delim_list (str, optional) –
delimiters to use. The default is:
[\s\-\\/\.,"\'|&:;%$()]
- Returns
Count of delimiters in the string in column.
- Return type
pd.Series
- msticpy.analysis.eventcluster.delim_hash(value: str, delim_list: str = '[\\s\\-\\\\/\\.,"\\\'|&:;%$()]') int
Return a hash (CRC32) of the delimiters from input column.
- Parameters
value (str) – Data to process
delim_list (str, optional) –
delimiters to use. The default is:
[\s\-\\/\.,"\'|&:;%$()]
- Returns
Hash of delimiter set in the string.
- Return type
int
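The idea behind delim_hash can be sketched as: keep only the delimiter characters and hash that sequence, so command lines with the same "shape" but different values collapse to the same hash. The delimiter set and encoding are assumptions modeled on the documented default.

```python
import re
import zlib

DELIMS = r'[\s\-\\/\.,"\'|&:;%$()]'  # modeled on the documented default

def delim_hash(value: str) -> int:
    """CRC32 of the sequence of delimiter characters in the string."""
    return zlib.crc32("".join(re.findall(DELIMS, value)).encode("utf-8"))

a = delim_hash("copy /y fileA.txt d:\\backup")
b = delim_hash("copy /y fileB.txt e:\\backup")
print(a == b)  # True - same delimiter pattern, different values
```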
- msticpy.analysis.eventcluster.plot_cluster(db_cluster: sklearn.cluster.DBSCAN, data: pandas.core.frame.DataFrame, x_predict: numpy.ndarray, plot_label: Optional[str] = None, plot_features: Tuple[int, int] = (0, 1), verbose: bool = False, cut_off: int = 3, xlabel: Optional[str] = None, ylabel: Optional[str] = None)
Plot clustered data as scatter chart.
- Parameters
db_cluster (DBSCAN) – DBSCAN cluster object (from scikit-learn DBSCAN).
data (pd.DataFrame) – Dataframe containing original data.
x_predict (np.ndarray) – The DBSCAN predict numpy array
plot_label (str, optional) – If set the column to use to label data points (the default is None)
plot_features (Tuple[int, int], optional) – Which two features in x_predict to plot (the default is (0, 1))
verbose (bool, optional) – Verbose execution with some extra info (the default is False)
cut_off (int, optional) – The cluster size below which items are considered outliers (the default is 3)
xlabel (str, optional) – x-axis label (the default is None)
ylabel (str, optional) – y-axis label (the default is None)
- msticpy.analysis.eventcluster.token_count(value: str, delimiter: str = ' ') int
Return count of delimiter-separated tokens in the input string.
- Parameters
value (str) – Data to process
delimiter (str, optional) – Delimiter used to split the column string. (the default is ‘ ‘)
- Returns
count of tokens
- Return type
int
- msticpy.analysis.eventcluster.token_count_df(data: pandas.core.frame.DataFrame, column: str, delimiter: str = ' ') pandas.core.series.Series
Return count of delimiter-separated tokens in each string of the input column.
- Parameters
data (pd.DataFrame) – The DataFrame to process
column (str) – Column name to process
delimiter (str, optional) – Delimiter used to split the column string. (the default is ‘ ‘)
- Returns
count of tokens in strings in column
- Return type
pd.Series
msticpy.analysis.outliers module
Outlier detection class. TODO Preliminary.
Similar to the eventcluster module but a little bit more experimental (read ‘less tested’). It uses scikit-learn's IsolationForest to identify outlier events in a single data set, or using one data set as training data and another on which to predict outliers.
- msticpy.analysis.outliers.identify_outliers(x: numpy.ndarray, x_predict: numpy.ndarray, contamination: float = 0.05) Tuple[sklearn.ensemble.IsolationForest, numpy.ndarray, numpy.ndarray]
Identify outlier items using SkLearn IsolationForest.
- Parameters
x (np.ndarray) – Input data
x_predict (np.ndarray) – Data set on which to predict outliers
contamination (float) – Percentage contamination (the default is 0.05)
- Returns
IsolationForest model, X_Outliers, y_pred_outliers
- Return type
Tuple[IsolationForest, np.ndarray, np.ndarray]
- msticpy.analysis.outliers.plot_outlier_results(clf: sklearn.ensemble.IsolationForest, x: numpy.ndarray, x_predict: numpy.ndarray, x_outliers: numpy.ndarray, feature_columns: List[int], plt_title: str)
Plot Isolation Forest results.
- Parameters
clf (IsolationForest) – Isolation Forest model
x (np.ndarray) – Input data
x_predict (np.ndarray) – Prediction
x_outliers (np.ndarray) – Set of outliers
feature_columns (List[int]) – list of feature columns to display
plt_title (str) – Plot title
- msticpy.analysis.outliers.remove_common_items(data: pandas.core.frame.DataFrame, columns: List[str]) pandas.core.frame.DataFrame
Remove rows with commonly occurring values (in the given columns) from the input DataFrame.
- Parameters
data (pd.DataFrame) – Input dataframe
columns (List[str]) – Column list to filter
- Returns
Filtered DataFrame
- Return type
pd.DataFrame
msticpy.analysis.timeseries module
Module for timeseries analysis functions.
- msticpy.analysis.timeseries.create_time_period_kqlfilter(periods: Dict[datetime.datetime, datetime.datetime]) str
Create KQL time filter expression from time periods dict.
- Parameters
periods (Dict[datetime, datetime]) – Dict of start, end periods
- Returns
KQL filter clause
- Return type
str
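An illustrative construction of the kind of KQL time filter this function produces; the column name TimeGenerated and the exact clause text are assumptions, not necessarily the string msticpy emits.

```python
from datetime import datetime

periods = {
    datetime(2022, 1, 1, 10): datetime(2022, 1, 1, 12),
    datetime(2022, 1, 2, 3): datetime(2022, 1, 2, 4),
}

# One range clause per (start, end) pair, OR-ed together.
clauses = [
    f"(TimeGenerated >= datetime({start.isoformat()}) "
    f"and TimeGenerated <= datetime({end.isoformat()}))"
    for start, end in periods.items()
]
kql_filter = "| where " + " or ".join(clauses)
print(kql_filter)
```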
- msticpy.analysis.timeseries.extract_anomaly_periods(data: pandas.core.frame.DataFrame, time_column: str = 'TimeGenerated', period: str = '1H', pos_only: bool = True) Dict[datetime.datetime, datetime.datetime]
Merge adjacent anomaly periods.
- Parameters
data (pd.DataFrame) – The data to process
time_column (str, optional) – The name of the time column
period (str, optional) – pandas-compatible time period designator, by default “1H”
pos_only (bool, optional) – If True only extract positive anomaly periods, else extract both positive and negative. By default, True
- Returns
start_period, end_period
- Return type
Dict[datetime, datetime]
- msticpy.analysis.timeseries.find_anomaly_periods(data: pandas.core.frame.DataFrame, time_column: str = 'TimeGenerated', period: str = '1H', pos_only: bool = True) List[msticpy.common.timespan.TimeSpan]
Merge adjacent anomaly periods.
- Parameters
data (pd.DataFrame) – The data to process
time_column (str, optional) – The name of the time column
period (str, optional) – pandas-compatible time period designator, by default “1H”
pos_only (bool, optional) – If True only extract positive anomaly periods, else extract both positive and negative. By default, True
- Returns
TimeSpan(start, end)
- Return type
List[TimeSpan]
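The merging step both extract_anomaly_periods and find_anomaly_periods describe can be sketched in plain Python: consecutive anomalous time buckets (one period apart or less) are coalesced into single (start, end) spans. The data and merging details here are illustrative.

```python
from datetime import datetime, timedelta

# Hourly buckets flagged as anomalous (illustrative timestamps).
anomaly_times = [
    datetime(2022, 1, 1, 1), datetime(2022, 1, 1, 2),  # adjacent hours
    datetime(2022, 1, 1, 9),                           # isolated hour
]

period = timedelta(hours=1)
merged = {}
start = end = anomaly_times[0]
for ts in anomaly_times[1:]:
    if ts - end <= period:
        end = ts                     # extend the current period
    else:
        merged[start] = end + period # close it and start a new one
        start = end = ts
merged[start] = end + period         # close the final period
print(merged)
```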
- msticpy.analysis.timeseries.set_new_anomaly_threshold(data: pandas.core.frame.DataFrame, threshold: int, threshold_low: Optional[int] = None) pandas.core.frame.DataFrame
Return DataFrame with anomalies calculated based on new threshold.
- Parameters
data (pd.DataFrame) – Input DataFrame
threshold (int) – Threshold above (beyond) which values will be marked as anomalies. Used as positive and negative threshold unless threshold_low is specified.
threshold_low (Optional[int], optional) – The threshold below which values will be reported as anomalies, by default None.
- Returns
Output DataFrame with recalculated anomalies.
- Return type
pd.DataFrame
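The per-value re-flagging this function performs can be sketched as follows; the score values and the treatment of threshold_low are assumptions for illustration, not msticpy's exact column logic.

```python
def flag_anomaly(score: float, threshold: float, threshold_low=None) -> int:
    """Return 1, -1, or 0 depending on which threshold the score crosses."""
    # With no threshold_low, the positive threshold is mirrored for negatives.
    low = -threshold if threshold_low is None else -threshold_low
    if score > threshold:
        return 1
    if score < low:
        return -1
    return 0

scores = [0.5, 3.2, -4.0, 1.0]  # illustrative anomaly scores
print([flag_anomaly(s, threshold=3) for s in scores])
```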
- msticpy.analysis.timeseries.timeseries_anomalies_stl(data: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame
Return anomalies in Timeseries using STL.
- Parameters
data (pd.DataFrame) – DataFrame as a time series data set retrieved from a data connector or external data source. The DataFrame must have two columns: a time column (set as the index) and a numeric value column.
time_column (str, optional) – If the input data is not indexed on the time column, use this column as the time index
data_column (str, optional) – Use named column if the input data has more than one column.
seasonal (int, optional) – Seasonality period of the input data required for STL. Must be an odd integer, and should normally be >= 7 (default).
period (int, optional) – Periodicity of the input data, by default 24 (hourly).
score_threshold (float, optional) – Z-score threshold (in standard deviations) used to flag anomalies, by default 3
- Returns
Returns the input dataframe with additional columns from decomposing the time series into residual, trend, seasonal, weights, baseline, score and anomalies. The anomalies column contains 0, 1 or -1 depending on score_threshold.
- Return type
pd.DataFrame
Notes
The decomposition method is STL - Seasonal-Trend Decomposition using LOESS
- msticpy.analysis.timeseries.ts_anomalies_stl(data: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame
Return anomalies in Timeseries using STL.
- Parameters
data (pd.DataFrame) – DataFrame as a time series data set retrieved from a data connector or external data source. The DataFrame must have two columns: a time column (set as the index) and a numeric value column.
time_column (str, optional) – If the input data is not indexed on the time column, use this column as the time index
data_column (str, optional) – Use named column if the input data has more than one column.
seasonal (int, optional) – Seasonality period of the input data required for STL. Must be an odd integer, and should normally be >= 7 (default).
period (int, optional) – Periodicity of the input data, by default 24 (hourly).
score_threshold (float, optional) – Z-score threshold (in standard deviations) used to flag anomalies, by default 3
- Returns
Returns the input dataframe with additional columns from decomposing the time series into residual, trend, seasonal, weights, baseline, score and anomalies. The anomalies column contains 0, 1 or -1 depending on score_threshold.
- Return type
pd.DataFrame
Notes
The decomposition method is STL - Seasonal-Trend Decomposition using LOESS
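The final scoring step described above (flag residuals whose Z-score exceeds score_threshold) can be sketched with the standard library; the STL decomposition itself (via statsmodels) is omitted, and the residual values here are invented.

```python
from statistics import mean, pstdev

# Hypothetical STL residuals: small noise plus one large spike.
residuals = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1, 0.2, 6.0]

mu, sigma = mean(residuals), pstdev(residuals)
score_threshold = 3  # the documented default

scores = [(r - mu) / sigma for r in residuals]
anomalies = [1 if s > score_threshold else -1 if s < -score_threshold else 0
             for s in scores]
print(anomalies)
```

Only the spike's Z-score exceeds the threshold, so only that point is marked 1; a large negative residual would be marked -1.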
Module contents
MSTIC Analysis Tools.