msticpy.analysis package
msticpy.analysis.anomalous_sequence subpackage
Wrapper module for Model class for modelling sessions.
In particular, this module is for both modelling and visualising your session data.
- msticpy.analysis.anomalous_sequence.anomalous.score_and_visualise_sessions(data: pandas.core.frame.DataFrame, session_column: str, window_length: int, time_column: str, likelihood_upper_bound: Optional[float] = None, source_columns: Optional[list] = None)
Model sessions and then produce an interactive timeline visualisation plot.
In particular, the sessions are modelled using a sliding window approach within a Markov model. The visualisation plot has time on the x-axis and the modelled session likelihood metric on the y-axis.
- Parameters
data (pd.DataFrame) – Dataframe which contains at least columns for time and sessions
session_column (str) –
name of the column which contains the sessions. The values in the session column should take one of the following formats:
['Set-User', 'Set-Mailbox']
[Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
[Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
The Cmd datatype can be accessed from msticpy.analysis.anomalous_sequence.utils.data_structures.Cmd
window_length (int) –
length of the sliding window to use when computing the likelihood metrics for each session.
This should be set to an integer >= 2. Note that sessions which have fewer commands than the chosen window_length + 1 will not appear in the visualisation. (The + 1 is because we append a dummy end_token to each session before starting the sliding window, so a session of length 2 would be treated as length 3.)
time_column (str) – name of the column which contains a timestamp
likelihood_upper_bound (float, optional) – an optional upper bound on the likelihood metrics for the visualisation plot. This can help to zoom in on the more anomalous sessions
source_columns (list, optional) – An optional list of source columns to include in the tooltips in the visualisation. Note, the content of each of these columns should be json serializable in order to be compatible with the figure
- Returns
- Return type
figure
- msticpy.analysis.anomalous_sequence.anomalous.score_sessions(data: pandas.core.frame.DataFrame, session_column: str, window_length: int) pandas.core.frame.DataFrame
Model sessions using a sliding window approach within a Markov model.
- Parameters
data (pd.DataFrame) – Dataframe which contains at least a column for sessions
session_column (str) –
name of the column which contains the sessions. The values in the session column should take one of the following formats:
['Set-User', 'Set-Mailbox']
[Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
[Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
The Cmd datatype can be accessed from msticpy.analysis.anomalous_sequence.utils.data_structures.Cmd
window_length (int) – length of the sliding window to use when computing the likelihood metrics for each session. This should be set to an integer >= 2. Note that sessions which have fewer commands than the chosen window_length + 1 will end up with a np.nan score. (The + 1 is because we append a dummy end_token to each session before starting the sliding window, so a session of length 2 would be treated as length 3.)
- Returns
- Return type
input dataframe with two additional columns appended.
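The three accepted session formats can be sketched as below. This is only an illustration: the Cmd namedtuple is a hypothetical stand-in that has the same two attributes (name, params) as the library's Cmd datatype, and describe() is a helper invented for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for the library's Cmd type (assumption: only the
# "name" and "params" attributes matter for this illustration).
Cmd = namedtuple("Cmd", ["name", "params"])

# Format 1: commands only.
cmds_only = ["Set-User", "Set-Mailbox"]

# Format 2: commands with a set of parameter names.
cmds_params = [
    Cmd(name="Set-User", params={"Identity", "Force"}),
    Cmd(name="Set-Mailbox", params={"Identity", "AuditEnabled"}),
]

# Format 3: commands with a dict of parameter names and values.
cmds_params_values = [
    Cmd(name="Set-User", params={"Identity": "blahblah", "Force": "true"}),
    Cmd(name="Set-Mailbox", params={"Identity": "blahblah", "AuditEnabled": "false"}),
]


def describe(session):
    """Return a label for which of the three formats a session uses."""
    first = session[0]
    if isinstance(first, str):
        return "cmds_only"
    if isinstance(first.params, dict):
        return "cmds_params_values"
    return "cmds_params_only"
```

All sessions in one column would normally share a single format.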
- msticpy.analysis.anomalous_sequence.anomalous.visualise_scored_sessions(data_with_scores: pandas.core.frame.DataFrame, time_column: str, score_column: str, window_column: str, score_upper_bound: Optional[float] = None, source_columns: Optional[list] = None)
Visualise the scored sessions on an interactive timeline.
- Parameters
data_with_scores (pd.DataFrame) – Dataframe which contains at least columns for time, session score, window representing the session
time_column (str) – name of the column which contains a timestamp
score_column (str) – name of the column which contains a numerical score for each of the sessions
window_column (str) – name of the column which contains a representation of each of the sessions. This representation will appear in the tooltips in the figure. For example, it could be the rarest window of the session, or the full session etc.
score_upper_bound (float, optional) – an optional upper bound on the score for the visualisation figure. This can help to zoom in on the more anomalous sessions
source_columns (list, optional) – an optional list of source columns to include in the tooltips in the visualisation. Note, the content of each of these columns should be json serializable in order to be compatible with the figure
- Returns
- Return type
figure
Module for Model class for modelling sessions data.
- class msticpy.analysis.anomalous_sequence.model.Model(sessions: List[List[Union[str, msticpy.analysis.anomalous_sequence.utils.data_structures.Cmd]]], modellable_params: Optional[set] = None)
Bases:
object
Class for modelling sessions data.
Instantiate the Model class.
This Model class can be used to model sessions, where each session is a sequence of commands. We use a sliding window approach to calculate the rarest part of each session. We can view the sessions in ascending order of this metric to see if the top sessions are anomalous/malicious.
- Parameters
sessions (List[List[Union[str, Cmd]]]) –
list of sessions, where each session is a list of either strings or Cmd objects.
The Cmd datatype should have "name" and "params" as attributes, where "name" is the name of the command (string) and "params" is either a set of accompanying params or a dict of accompanying params and values.
Example formats of a session:
1) ['Set-User', 'Set-Mailbox']
2) [Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
3) [Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
modellable_params (set, optional) – set of params which you deem to have categorical values which are suitable for modelling. Note this argument will only have an effect if your sessions include commands, params and values. If your sessions include commands, params and values and this argument is not set, then some rough heuristics will be used to determine which params have values which are suitable for modelling.
- compute_geomean_lik_of_sessions()
Compute the geometric mean of the likelihood for each of the sessions.
This is done by raising the likelihood of the session to the power of (1 / k) where k is the length of the session.
Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
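The effect of the geometric mean can be illustrated with a small sketch. The function name and the numbers are made up for the example; it only shows the arithmetic of raising a likelihood to the power of (1 / k).

```python
import math


def geometric_mean_likelihood(likelihood, session_length):
    """Raise a session likelihood to the power 1/k, where k is the session length."""
    return likelihood ** (1.0 / session_length)


# Two hypothetical sessions in which every step has probability 0.5.
short_lik = 0.5 ** 2   # session of length 2
long_lik = 0.5 ** 10   # session of length 10

# The raw likelihood of the long session is much smaller, but after taking
# the geometric mean both sessions score the same per-step value.
short_geo = geometric_mean_likelihood(short_lik, 2)
long_geo = geometric_mean_likelihood(long_lik, 10)
```

Both geometric means come out at the per-step probability, so the two sessions become directly comparable.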
- compute_likelihoods_of_sessions(use_start_end_tokens: bool = True)
Compute the likelihoods for each of the sessions.
Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
- Parameters
use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to the session respectively before the calculations are done
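A minimal sketch of a first-order Markov session likelihood, with and without the start and end tokens, might look like this. The token names, the probability tables and the function are assumptions made for illustration, not the library's internals.

```python
import math

START, END = "<start>", "<end>"  # hypothetical token names

# Hypothetical transition probabilities P(next | current); values are made up.
trans_probs = {
    (START, "Set-User"): 0.5,
    ("Set-User", "Set-Mailbox"): 0.4,
    ("Set-Mailbox", END): 0.25,
}
# Hypothetical prior probabilities, used for the first command when no
# start token is prepended.
prior_probs = {"Set-User": 0.2, "Set-Mailbox": 0.3}


def session_likelihood(session, use_start_end_tokens=True):
    """Multiply the transition probabilities along the session."""
    if use_start_end_tokens:
        seq = [START] + list(session) + [END]
        lik = 1.0
    else:
        seq = list(session)
        lik = prior_probs[seq[0]]
    for cur, nxt in zip(seq, seq[1:]):
        lik *= trans_probs[(cur, nxt)]
    return lik


with_tokens = session_likelihood(["Set-User", "Set-Mailbox"])
without_tokens = session_likelihood(["Set-User", "Set-Mailbox"], use_start_end_tokens=False)
```

Every extra factor lies between 0 and 1, which is why longer sessions tend to have smaller raw likelihoods.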
- compute_rarest_windows(window_len: int, use_start_end_tokens: bool = True, use_geo_mean: bool = False)
Find the rarest window and corresponding likelihood for each session.
In particular, uses a sliding window approach to find the rarest window and corresponding likelihood for that window for each session.
If we have a long session filled with benign activity except for a small window of suspicious behaviour, then this approach should be able to identify the session as anomalous. This approach should be more effective than simply taking the geometric mean of the full session likelihood. This is because the small window of suspicious behaviour might get averaged out by the majority benign behaviour in the session when using the geometric mean approach.
Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then because we will be appending self.end_token to the session, the session will be treated as a session of length k+1, therefore, we will end up with a non np.nan value.
- Parameters
window_len (int) – length of sliding window for likelihood calculations
use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to each session respectively before the calculations are done
use_geo_mean (bool) – if True, then each of the likelihoods of the sliding windows will be raised to the power of (1/window_len)
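The sliding window idea, including the np.nan case for sessions shorter than the window, can be sketched as follows. This is a simplified illustration under stated assumptions: the end-token name and transition table are hypothetical, only the end token is appended, and a window's likelihood is taken as the product of its transitions.

```python
import math

END = "<end>"  # hypothetical end-token name

# Hypothetical transition probabilities P(next | current).
trans_probs = {
    ("A", "A"): 0.6, ("A", "B"): 0.1, ("A", END): 0.3,
    ("B", "A"): 0.8, ("B", END): 0.2,
}


def window_likelihood(window, probs):
    """Product of transition probabilities inside one window."""
    lik = 1.0
    for cur, nxt in zip(window, window[1:]):
        lik *= probs.get((cur, nxt), 0.0)  # unseen transitions score 0 in this sketch
    return lik


def rarest_window(session, window_len, probs, use_end_token=True, use_geo_mean=False):
    """Return (window, likelihood) for the least likely window, or ([], nan)."""
    seq = list(session) + ([END] if use_end_token else [])
    best_window, best_lik = [], math.nan
    # The range is empty when the sequence is shorter than the window.
    for i in range(len(seq) - window_len + 1):
        window = seq[i:i + window_len]
        lik = window_likelihood(window, probs)
        if use_geo_mean:
            lik = lik ** (1.0 / window_len)
        if math.isnan(best_lik) or lik < best_lik:
            best_window, best_lik = window, lik
    return best_window, best_lik
```

For example, a session of length 2 with a window of length 3 yields nan without the end token, but a real value once the end token extends it to length 3.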
- compute_scores(use_start_end_tokens: bool)
Compute some likelihood based scores/metrics for each of the sessions.
In particular, computes the likelihoods and geometric mean of the likelihoods for each of the sessions. Also, uses the sliding window approach to compute the rarest window likelihoods for each of the sessions. It does this for windows of length 2 and 3.
Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then because we will be appending self.end_token to the session, the session will be treated as a session of length k+1, therefore, we will end up with a non np.nan value for that session.
- Parameters
use_start_end_tokens (bool) – if True, then self.start_token and self.end_token will be prepended and appended to each of the sessions respectively before the calculations are done.
- compute_setof_params_cond_cmd(use_geo_mean: bool)
Compute likelihood of combinations of params conditional on the cmd.
In particular, go through each command from each session and compute the probability of that set of params (and values if provided) appearing conditional on the command.
This can help us to identify unlikely combinations of params (and values if provided) for each distinct command.
Note, this method is only available if each session is a list of the Cmd datatype. It will result in an Exception if you try and use it when each session is a list of strings.
- Parameters
use_geo_mean (bool) –
if True, then the probabilities will be raised to the power of (1/K)
- case1: we have only params:
Then K is the number of distinct params which appeared for the given cmd across all the sessions.
- case2: we have params and values:
Then K is the number of distinct params which appeared for the given cmd across all the sessions + the number of values which we included in the modelling for this cmd.
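One plausible reading of case1 (params only) is sketched below. It is an assumption for illustration rather than the library's confirmed method: each param is treated as independently present or absent given the command, and the function names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_param_probs(sessions):
    """Estimate P(param present | cmd) from (name, params) pairs."""
    cmd_counts = Counter()
    param_counts = defaultdict(Counter)
    for session in sessions:
        for name, params in session:
            cmd_counts[name] += 1
            for param in params:
                param_counts[name][param] += 1
    return {
        name: {p: c / cmd_counts[name] for p, c in counts.items()}
        for name, counts in param_counts.items()
    }


def param_set_likelihood(name, params, probs, use_geo_mean=False):
    """Likelihood of a set of params conditional on the command name."""
    cmd_probs = probs[name]
    lik = 1.0
    for param, prob in cmd_probs.items():
        lik *= prob if param in params else (1.0 - prob)
    if use_geo_mean:
        # K = number of distinct params seen for this cmd across all sessions.
        lik = lik ** (1.0 / len(cmd_probs))
    return lik


# Hypothetical training data: "Force" is used in only one of four calls.
sessions = [
    [("Set-User", {"Identity", "Force"})],
    [("Set-User", {"Identity"})],
    [("Set-User", {"Identity"})],
    [("Set-User", {"Identity"})],
]
probs = fit_param_probs(sessions)
```

Under this reading, the rarely used "Force" param makes its combination less likely than the common one.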
- train()
Train the model by computing counts and probabilities.
In particular, computes the counts and probabilities of the commands (and possibly the params if provided, and possibly the values if provided)
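As a rough illustration of what computing counts and probabilities can mean for command-only sessions, the sketch below counts transitions and normalises them. The token names and function are hypothetical, and no smoothing is applied here, which may differ from the library.

```python
import math
from collections import Counter, defaultdict

START, END = "<start>", "<end>"  # hypothetical token names


def train_counts(sessions):
    """Count command transitions and normalise them into probabilities."""
    trans_counts = defaultdict(Counter)
    for session in sessions:
        seq = [START] + list(session) + [END]
        for cur, nxt in zip(seq, seq[1:]):
            trans_counts[cur][nxt] += 1
    trans_probs = {
        cur: {nxt: count / sum(nxts.values()) for nxt, count in nxts.items()}
        for cur, nxts in trans_counts.items()
    }
    return trans_counts, trans_probs


sessions = [["Set-User", "Set-Mailbox"], ["Set-User", "Set-User"]]
counts, probs = train_counts(sessions)
```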
- class msticpy.analysis.anomalous_sequence.model.SessionType
Bases:
object
Class for storing the types of accepted sessions.
- cmds_only = 'cmds_only'
- cmds_params_only = 'cmds_params_only'
- cmds_params_values = 'cmds_params_values'
Module for creating sessions out of raw data.
- msticpy.analysis.anomalous_sequence.sessionize.create_session_col(data: pandas.core.frame.DataFrame, user_identifier_cols: List[str], time_col: str, max_session_time_mins: int, max_event_separation_mins: int) pandas.core.frame.DataFrame
Create a “session_ind” column in the dataframe.
In particular, the session_ind column will be incremented each time a new session starts.
- Parameters
data (pd.DataFrame) – This dataframe should contain at least the following columns: - time stamp column - columns related to user name and/or computer name and/or ip address etc
user_identifier_cols (List[str]) – Name of the columns which contain username and/or computer name and/or ip address etc. Each time the value of one of these columns changes, a new session will be started.
time_col (str) – Name of the column which contains a time stamp. If this column is not already in datetime64[ns, UTC] format, it will be cast to it.
max_session_time_mins (int) – The maximum length of a session in minutes. If a sequence of events for the same user_identifier_cols values exceeds this length, then a new session will be started.
max_event_separation_mins (int) – The maximum length in minutes between two events in a session. If we have 2 events for the same user_identifier_cols values, and if those two events are more than max_event_separation_mins apart, then a new session will be started.
- Returns
- Return type
pd.DataFrame with an additional “session_ind” column
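The three rules that start a new session (change of user identifier, gap between events, total session length) can be sketched in plain Python. The function below is a hypothetical illustration that assumes events are already sorted by user and time and that the index starts at 0.

```python
from datetime import datetime, timedelta


def assign_session_ind(events, max_session_time_mins, max_event_separation_mins):
    """events: list of (user, timestamp) sorted by user then time."""
    max_len = timedelta(minutes=max_session_time_mins)
    max_gap = timedelta(minutes=max_event_separation_mins)
    session_inds = []
    session_ind = 0
    prev_user = prev_time = session_start = None
    for user, ts in events:
        if session_start is None:
            session_start = ts  # very first event opens session 0
        elif (
            user != prev_user                 # user identifier changed
            or ts - prev_time > max_gap       # events too far apart
            or ts - session_start > max_len   # session has run too long
        ):
            session_ind += 1
            session_start = ts
        session_inds.append(session_ind)
        prev_user, prev_time = user, ts
    return session_inds


t0 = datetime(2020, 1, 1, 9, 0)
events = [
    ("alice", t0),
    ("alice", t0 + timedelta(minutes=5)),
    ("alice", t0 + timedelta(minutes=40)),  # gap of 35 minutes
    ("bob", t0 + timedelta(minutes=41)),    # user changes
    ("bob", t0 + timedelta(minutes=45)),
    ("bob", t0 + timedelta(minutes=55)),
    ("bob", t0 + timedelta(minutes=65)),    # 24 minutes after session start
]
session_inds = assign_session_ind(events, max_session_time_mins=20, max_event_separation_mins=15)
```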
- msticpy.analysis.anomalous_sequence.sessionize.sessionize_data(data: pandas.core.frame.DataFrame, user_identifier_cols: List[str], time_col: str, max_session_time_mins: int, max_event_separation_mins: int, event_col: str) pandas.core.frame.DataFrame
Sessionize the input data.
In particular, the resulting dataframe will have 1 row per session. It will contain the following columns: the user_identifier_cols, <time_col>_min, <time_col>_max, <event_col>_list, duration (<time_col>_max - <time_col>_min), number_events (length of the <event_col>_list value)
- Parameters
data (pd.DataFrame) – This dataframe should contain at least the following columns: - time stamp column - columns related to user name and/or computer name and/or ip address etc - column containing an event
user_identifier_cols (List[str]) – Name of the columns which contain username and/or computer name and/or ip address etc. Each time the value of one of these columns changes, a new session will be started.
time_col (str) – Name of the column which contains a time stamp. If this column is not already in datetime64[ns, UTC] format, it will be cast to it.
max_session_time_mins (int) – The maximum length of a session in minutes. If a sequence of events for the same user_identifier_cols values exceeds this length, then a new session will be started.
max_event_separation_mins (int) – The maximum length in minutes between two events in a session. If we have 2 events for the same user_identifier_cols values, and if those two events are more than max_event_separation_mins apart, then a new session will be started.
event_col (str) – Name of the column which contains the event of interest. For example, if we are interested in sessionizing exchange admin commands, the “event_col” could contain values like: “Set-Mailbox” or “Set-User” etc.
- Returns
- Return type
pd.DataFrame containing the sessionized data. 1 row per session.
msticpy.analysis.timeseries
Module for timeseries analysis functions.
- msticpy.analysis.timeseries.create_time_period_kqlfilter(periods: Dict[datetime.datetime, datetime.datetime]) str
Create KQL time filter expression from time periods dict.
- Parameters
periods (Dict[datetime, datetime]) – Dict of start, end periods
- Returns
KQL filter clause
- Return type
str
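A filter clause of this kind could be assembled as in the sketch below. The function name, the default time column and the exact string layout are assumptions for illustration; the text the library actually emits may differ.

```python
from datetime import datetime


def build_kql_time_filter(periods, time_column="TimeGenerated"):
    """Join one bracketed time-range condition per period with 'or'."""
    if not periods:
        return ""
    clauses = [
        f"({time_column} >= datetime({start.isoformat()})"
        f" and {time_column} <= datetime({end.isoformat()}))"
        for start, end in periods.items()
    ]
    return "| where " + " or ".join(clauses)


periods = {datetime(2020, 1, 1, 0, 0): datetime(2020, 1, 1, 2, 0)}
kql_filter = build_kql_time_filter(periods)
```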
- msticpy.analysis.timeseries.extract_anomaly_periods(data: pandas.core.frame.DataFrame, time_column: str = 'TimeGenerated', period: str = '1H', pos_only: bool = True) Dict[datetime.datetime, datetime.datetime]
Merge adjacent anomaly periods.
- Parameters
data (pd.DataFrame) – The data to process
time_column (str, optional) – The name of the time column
period (str, optional) – pandas-compatible time period designator, by default “1H”
pos_only (bool, optional) – If True only extract positive anomaly periods, else extract both positive and negative. By default, True
- Returns
start_period, end_period
- Return type
Dict[datetime, datetime]
- msticpy.analysis.timeseries.find_anomaly_periods(data: pandas.core.frame.DataFrame, time_column: str = 'TimeGenerated', period: str = '1H', pos_only: bool = True) List[msticpy.common.timespan.TimeSpan]
Merge adjacent anomaly periods.
- Parameters
data (pd.DataFrame) – The data to process
time_column (str, optional) – The name of the time column
period (str, optional) – pandas-compatible time period designator, by default “1H”
pos_only (bool, optional) – If True only extract positive anomaly periods, else extract both positive and negative. By default, True
- Returns
TimeSpan(start, end)
- Return type
List[TimeSpan]
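Merging adjacent anomaly periods can be pictured with the sketch below. It assumes, for illustration only, that each anomalous timestamp is padded by one period on either side and that overlapping or touching ranges are joined; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def merge_anomaly_periods(anomaly_times, period=timedelta(hours=1)):
    """Pad each anomaly by one period and merge overlapping ranges."""
    merged = []
    for ts in sorted(anomaly_times):
        start, end = ts - period, ts + period
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous range, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


day = datetime(2020, 1, 1)
anomaly_times = [day.replace(hour=10), day.replace(hour=11), day.replace(hour=15)]
merged_periods = merge_anomaly_periods(anomaly_times)
```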
- msticpy.analysis.timeseries.set_new_anomaly_threshold(data: pandas.core.frame.DataFrame, threshold: int, threshold_low: Optional[int] = None) pandas.core.frame.DataFrame
Return DataFrame with anomalies calculated based on new threshold.
- Parameters
data (pd.DataFrame) – Input DataFrame
threshold (int) – Threshold above (beyond) which values will be marked as anomalies. Used as positive and negative threshold unless threshold_low is specified.
threshold_low (Optional[int], optional) – The threshold below which values will be reported as anomalies, by default None.
- Returns
Output DataFrame with recalculated anomalies.
- Return type
pd.DataFrame
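The re-labelling logic can be sketched on a plain list of scores. This is an illustration under an assumption: threshold_low is treated as a positive magnitude that is negated for the lower bound, and the function name is hypothetical.

```python
def relabel_anomalies(scores, threshold, threshold_low=None):
    """Return 1, -1 or 0 for each score, based on the thresholds."""
    # Without threshold_low, the same magnitude is used in both directions.
    low = threshold if threshold_low is None else threshold_low
    labels = []
    for score in scores:
        if score >= threshold:
            labels.append(1)
        elif score <= -low:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


scores = [0.5, 3.2, -2.5, -4.0]
symmetric = relabel_anomalies(scores, threshold=3)
asymmetric = relabel_anomalies(scores, threshold=3, threshold_low=2)
```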
- msticpy.analysis.timeseries.timeseries_anomalies_stl(data: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame
Return anomalies in Timeseries using STL.
- Parameters
data (pd.DataFrame) – DataFrame containing a time series data set retrieved from a data connector or external data source. The DataFrame must have 2 columns: the time column, set as the index, and one numeric value column.
time_column (str, optional) – If the input data is not indexed on the time column, use this column as the time index
data_column (str, optional) – Use named column if the input data has more than one column.
seasonal (int, optional) – Seasonality period of the input data required for STL. Must be an odd integer, and should normally be >= 7 (default).
period (int, optional) – Periodicity of the input data, by default 24 (hourly).
score_threshold (float, optional) – standard deviation threshold value calculated using Z-score used to flag anomalies, by default 3
- Returns
Returns a dataframe with additional columns by decomposing time series data into residual, trend, seasonal, weights, baseline, score and anomalies. The anomalies column will have values of 0, 1 or -1, based on the score_threshold set.
- Return type
pd.DataFrame
Notes
The decomposition method is STL - Seasonal-Trend Decomposition using LOESS
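The scoring step that follows the decomposition can be sketched as below. The sketch assumes the residuals have already been produced (for example by an STL decomposition) and only shows a Z-score being compared with score_threshold; the function name and data are hypothetical.

```python
import statistics


def flag_residuals(residuals, score_threshold=3.0):
    """Z-score each residual and flag values beyond the threshold."""
    mean = statistics.fmean(residuals)
    std = statistics.pstdev(residuals)
    scores = [(r - mean) / std for r in residuals]
    flags = [
        1 if s > score_threshold else -1 if s < -score_threshold else 0
        for s in scores
    ]
    return scores, flags


# Hypothetical residuals: flat apart from one spike and one dip.
residuals = [0.0] * 20 + [10.0, -10.0]
scores, flags = flag_residuals(residuals)
```

Here the spike and the dip are the only points flagged, as 1 and -1 respectively.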
- msticpy.analysis.timeseries.ts_anomalies_stl(data: pandas.core.frame.DataFrame, **kwargs) pandas.core.frame.DataFrame
Return anomalies in Timeseries using STL.
- Parameters
data (pd.DataFrame) – DataFrame containing a time series data set retrieved from a data connector or external data source. The DataFrame must have 2 columns: the time column, set as the index, and one numeric value column.
time_column (str, optional) – If the input data is not indexed on the time column, use this column as the time index
data_column (str, optional) – Use named column if the input data has more than one column.
seasonal (int, optional) – Seasonality period of the input data required for STL. Must be an odd integer, and should normally be >= 7 (default).
period (int, optional) – Periodicity of the input data, by default 24 (hourly).
score_threshold (float, optional) – standard deviation threshold value calculated using Z-score used to flag anomalies, by default 3
- Returns
Returns a dataframe with additional columns by decomposing time series data into residual, trend, seasonal, weights, baseline, score and anomalies. The anomalies column will have values of 0, 1 or -1, based on the score_threshold set.
- Return type
pd.DataFrame
Notes
The decomposition method is STL - Seasonal-Trend Decomposition using LOESS
msticpy.analysis.eventcluster module
Deprecated placeholder for eventcluster.py.
msticpy.analysis.outliers module
Outlier detection class. TODO Preliminary.
Similar to the eventcluster module, but a little more experimental (read 'less tested'). It uses the scikit-learn Isolation Forest to identify outlier events, either within a single data set or by training on one data set and predicting outliers in another.
- msticpy.analysis.outliers.identify_outliers(x: numpy.ndarray, x_predict: numpy.ndarray, contamination: float = 0.05) Tuple[sklearn.ensemble.IsolationForest, numpy.ndarray, numpy.ndarray]
Identify outlier items using SkLearn IsolationForest.
- Parameters
x (np.ndarray) – Input data
x_predict (np.ndarray) – Model
contamination (float) – Proportion of contamination (outliers) expected in the data (default: 0.05)
- Returns
IsolationForest model, X_Outliers, y_pred_outliers
- Return type
Tuple[IsolationForest, np.ndarray, np.ndarray]
- msticpy.analysis.outliers.plot_outlier_results(clf: sklearn.ensemble.IsolationForest, x: numpy.ndarray, x_predict: numpy.ndarray, x_outliers: numpy.ndarray, feature_columns: List[int], plt_title: str)
Plot Isolation Forest results.
- Parameters
clf (IsolationForest) – Isolation Forest model
x (np.ndarray) – Input data
x_predict (np.ndarray) – Prediction
x_outliers (np.ndarray) – Set of outliers
feature_columns (List[int]) – list of feature columns to display
plt_title (str) – Plot title
- msticpy.analysis.outliers.remove_common_items(data: pandas.core.frame.DataFrame, columns: List[str]) pandas.core.frame.DataFrame
Remove rows from input DataFrame.
- Parameters
data (pd.DataFrame) – Input dataframe
columns (List[str]) – Column list to filter
- Returns
Filtered DataFrame
- Return type
pd.DataFrame
msticpy.analysis.cluster_auditd module
Auditd cluster function.
- msticpy.analysis.cluster_auditd.cluster_auditd_processes(audit_data: pandas.core.frame.DataFrame, app: Optional[str] = None) pandas.core.frame.DataFrame
Clusters process data into specific processes.
- Parameters
audit_data (pd.DataFrame) – The Audit data containing process creation events
app (str, optional) – The name of a specific app you wish to cluster
- Returns
Details of the clustered process
- Return type
pd.DataFrame