msticpy.analysis package
msticpy.analysis.anomalous_sequence subpackage
Wrapper module for Model class for modelling sessions.
In particular, this module is for both modelling and visualising your session data.

msticpy.analysis.anomalous_sequence.anomalous.score_and_visualise_sessions(data: pandas.core.frame.DataFrame, session_column: str, window_length: int, time_column: str, likelihood_upper_bound: float = None, source_columns: list = None)
Model sessions and then produce an interactive timeline visualisation plot.
In particular, the sessions are modelled using a sliding window approach within a Markov model. The visualisation plot has time on the x-axis and the modelled session likelihood metric on the y-axis.
Parameters:  data (pd.DataFrame) – Dataframe which contains at least columns for time and sessions
 session_column (str) – name of the column which contains the sessions. The values in the session column should take one of the following formats:
 ['SetUser', 'SetMailbox']
 [Cmd(name='SetUser', params={'Identity', 'Force'}), Cmd(name='SetMailbox', params={'Identity', 'AuditEnabled'})]
 [Cmd(name='SetUser', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='SetMailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
The Cmd datatype can be accessed from msticpy.analysis.anomalous_sequence.utils.data_structures.Cmd
 window_length (int) – length of the sliding window to use when computing the likelihood metrics for each session. This should be set to an integer >= 2. Note that sessions which have fewer commands than the chosen window_length + 1 will not appear in the visualisation. (The + 1 is because we append a dummy end_token to each session before starting the sliding window, so a session of length 2 would be treated as length 3.)
 time_column (str) – name of the column which contains a timestamp
 likelihood_upper_bound (float, optional) – an optional upper bound on the likelihood metrics for the visualisation plot. This can help to zoom in on the more anomalous sessions.
 source_columns (list, optional) – an optional list of source columns to include in the tooltips in the visualisation. Note that the content of each of these columns should be JSON serializable in order to be compatible with the figure.
Returns: figure

msticpy.analysis.anomalous_sequence.anomalous.score_sessions(data: pandas.core.frame.DataFrame, session_column: str, window_length: int) → pandas.core.frame.DataFrame
Model sessions using a sliding window approach within a Markov model.
Parameters:  data (pd.DataFrame) – Dataframe which contains at least a column for sessions
 session_column (str) – name of the column which contains the sessions. The values in the session column should take one of the following formats:
 ['SetUser', 'SetMailbox']
 [Cmd(name='SetUser', params={'Identity', 'Force'}), Cmd(name='SetMailbox', params={'Identity', 'AuditEnabled'})]
 [Cmd(name='SetUser', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='SetMailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
The Cmd datatype can be accessed from msticpy.analysis.anomalous_sequence.utils.data_structures.Cmd
 window_length (int) – length of the sliding window to use when computing the likelihood metrics for each session. This should be set to an integer >= 2. Note that sessions which have fewer commands than the chosen window_length + 1 will end up with a np.nan score. (The + 1 is because we append a dummy end_token to each session before starting the sliding window, so a session of length 2 would be treated as length 3.)
Returns: input dataframe with two additional columns appended.

msticpy.analysis.anomalous_sequence.anomalous.visualise_scored_sessions(data_with_scores: pandas.core.frame.DataFrame, time_column: str, score_column: str, window_column: str, score_upper_bound: float = None, source_columns: list = None)
Visualise the scored sessions on an interactive timeline.
Parameters:  data_with_scores (pd.DataFrame) – Dataframe which contains at least columns for time, session score, window representing the session
 time_column (str) – name of the column which contains a timestamp
 score_column (str) – name of the column which contains a numerical score for each of the sessions
 window_column (str) – name of the column which contains a representation of each of the sessions. This representation will appear in the tooltips in the figure. For example, it could be the rarest window of the session, or the full session etc.
 score_upper_bound (float, optional) – an optional upper bound on the score for the visualisation figure. This can help to zoom in on the more anomalous sessions
 source_columns (list, optional) – an optional list of source columns to include in the tooltips in the visualisation. Note that the content of each of these columns should be JSON serializable in order to be compatible with the figure.
Returns: figure
Module for Model class for modelling sessions data.

class msticpy.analysis.anomalous_sequence.model.Model(sessions: List[List[Union[str, msticpy.analysis.anomalous_sequence.utils.data_structures.Cmd]]], modellable_params: set = None)
Bases: object
Class for modelling sessions data.
Instantiate the Model class.
This Model class can be used to model sessions, where each session is a sequence of commands. We use a sliding window approach to calculate the rarest part of each session. We can view the sessions in ascending order of this metric to see if the top sessions are anomalous/malicious.
Parameters:  sessions (List[List[Union[str, Cmd]]]) – list of sessions, where each session is a list of either strings or Cmd objects.
The Cmd datatype should have “name” and “params” as attributes, where “name” is the name of the command (string) and “params” is either a set of accompanying params or a dict of accompanying params and values.
Example formats of a session:
 1) ['SetUser', 'SetMailbox']
 2) [Cmd(name='SetUser', params={'Identity', 'Force'}), Cmd(name='SetMailbox', params={'Identity', 'AuditEnabled'})]
 3) [Cmd(name='SetUser', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='SetMailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
 modellable_params (set, optional) – set of params which you deem to have categorical values which are suitable for modelling. Note this argument will only have an effect if your sessions include commands, params and values. If your sessions include commands, params and values and this argument is not set, then some rough heuristics will be used to determine which params have values which are suitable for modelling.

compute_geomean_lik_of_sessions()
Compute the geometric mean of the likelihood for each of the sessions.
This is done by raising the likelihood of the session to the power of (1 / k) where k is the length of the session.
Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
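The length normalisation described above can be illustrated with a minimal sketch. It is not the msticpy implementation: the helper name geomean_likelihood and the per-step probability of 0.5 are assumptions made only for this example.

```python
import math


def geomean_likelihood(likelihood: float, session_length: int) -> float:
    """Raise a session likelihood to the power (1 / k), k being the session length."""
    if session_length < 1:
        raise ValueError("session_length must be at least 1")
    return likelihood ** (1.0 / session_length)


# Hypothetical data: every step in both sessions has probability 0.5.
step_prob = 0.5
short_lik = step_prob ** 2    # session with 2 commands
long_lik = step_prob ** 10    # session with 10 commands

# The raw likelihood of the long session is much smaller...
print(short_lik, long_lik)

# ...but the geometric means are comparable across lengths.
short_geo = geomean_likelihood(short_lik, 2)
long_geo = geomean_likelihood(long_lik, 10)
print(short_geo, long_geo)
```

Both sessions are equally "ordinary" per step, and the geometric mean reflects that even though the raw likelihoods differ by orders of magnitude.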

compute_likelihoods_of_sessions(use_start_end_tokens: bool = True)
Compute the likelihoods for each of the sessions.
Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
Parameters: use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to the session respectively before the calculations are done

compute_rarest_windows(window_len: int, use_start_end_tokens: bool = True, use_geo_mean: bool = False)
Find the rarest window and corresponding likelihood for each session.
In particular, uses a sliding window approach to find the rarest window and corresponding likelihood for that window for each session.
If we have a long session filled with benign activity except for a small window of suspicious behaviour, then this approach should be able to identify the session as anomalous. This approach should be more effective than simply taking the geometric mean of the full session likelihood. This is because the small window of suspicious behaviour might get averaged out by the majority benign behaviour in the session when using the geometric mean approach.
Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then because we will be appending self.end_token to the session, the session will be treated as a session of length k+1, therefore, we will end up with a non np.nan value.
Parameters:  window_len (int) – length of sliding window for likelihood calculations
 use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to each session respectively before the calculations are done
 use_geo_mean (bool) – if True, then each of the likelihoods of the sliding windows will be raised to the power of (1/window_len)
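The sliding window idea and the np.nan edge case can be sketched as follows. This is a simplified illustration, not the library's code. It assumes that a window's likelihood is the product of the transition probabilities inside it, it only appends an end token (the start token is omitted for brevity), and the token name, the command names and the probabilities in TRANS are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

END_TOKEN = "##END##"  # hypothetical token name, chosen for this sketch


def window_likelihood(window: List[str],
                      trans_probs: Dict[Tuple[str, str], float]) -> float:
    """Multiply the transition probabilities along one window."""
    lik = 1.0
    for prev, cur in zip(window, window[1:]):
        # Transitions never seen get probability 0.0 in this sketch.
        lik *= trans_probs.get((prev, cur), 0.0)
    return lik


def rarest_window(session: List[str], window_len: int,
                  trans_probs: Dict[Tuple[str, str], float],
                  use_end_token: bool = True,
                  use_geo_mean: bool = False) -> Tuple[List[str], float]:
    """Return the least likely window of the session and its likelihood."""
    padded = session + [END_TOKEN] if use_end_token else list(session)
    if len(padded) < window_len:
        return [], math.nan  # session too short for even one window
    best_window: List[str] = []
    best_lik = math.inf
    for i in range(len(padded) - window_len + 1):
        window = padded[i:i + window_len]
        lik = window_likelihood(window, trans_probs)
        if use_geo_mean:
            lik = lik ** (1.0 / window_len)
        if lik < best_lik:
            best_window, best_lik = window, lik
    return best_window, best_lik


# Hypothetical transition probabilities.
TRANS = {
    ("SetUser", "SetMailbox"): 0.9,
    ("SetMailbox", "SetUser"): 0.8,
    ("SetMailbox", "AddMailboxPermission"): 0.01,
    ("AddMailboxPermission", END_TOKEN): 0.5,
    ("SetMailbox", END_TOKEN): 0.2,
}

session = ["SetUser", "SetMailbox", "SetUser", "SetMailbox", "AddMailboxPermission"]
window, lik = rarest_window(session, 2, TRANS)

# A session of length k = 2 with a window of length k + 1 = 3.
short = ["SetUser", "SetMailbox"]
_, lik_no_token = rarest_window(short, 3, TRANS, use_end_token=False)
_, lik_token = rarest_window(short, 3, TRANS)
```

In this sketch the mostly benign session is flagged by its single unusual transition, and the short session only receives a numeric score once the end token has been appended.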

compute_scores(use_start_end_tokens: bool)
Compute some likelihood based scores/metrics for each of the sessions.
In particular, computes the likelihoods and geometric mean of the likelihoods for each of the sessions. Also, uses the sliding window approach to compute the rarest window likelihoods for each of the sessions. It does this for windows of length 2 and 3.
Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then because we will be appending self.end_token to the session, the session will be treated as a session of length k+1, therefore, we will end up with a non np.nan value for that session.
Parameters: use_start_end_tokens (bool) – if True, then self.start_token and self.end_token will be prepended and appended to each of the sessions respectively before the calculations are done.

compute_setof_params_cond_cmd(use_geo_mean: bool)
Compute likelihood of combinations of params conditional on the cmd.
In particular, go through each command from each session and compute the probability of that set of params (and values if provided) appearing conditional on the command.
This can help us to identify unlikely combinations of params (and values if provided) for each distinct command.
Note, this method is only available if each session is a list of the Cmd datatype. It will result in an Exception if you try and use it when each session is a list of strings.
Parameters: use_geo_mean (bool) – if True, then the probabilities will be raised to the power of (1/K)
 case1: we have only params:
 Then K is the number of distinct params which appeared for the given cmd across all the sessions.
 case2: we have params and values:
 Then K is the number of distinct params which appeared for the given cmd across all the sessions + the number of values which we included in the modelling for this cmd.
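As a rough illustration of case1 (params only), the sketch below estimates how likely a set of params is for a given command. It rests on an assumption that this page does not confirm: each param's presence is treated as independent given the command, with no smoothing. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def param_presence_probs(
    commands: Iterable[Tuple[str, Set[str]]]
) -> Dict[str, Dict[str, float]]:
    """Estimate P(param present | cmd) from (cmd name, param set) pairs."""
    cmd_counts: Counter = Counter()
    param_counts: Dict[str, Counter] = defaultdict(Counter)
    for name, params in commands:
        cmd_counts[name] += 1
        for param in params:
            param_counts[name][param] += 1
    return {
        name: {p: c / cmd_counts[name] for p, c in counts.items()}
        for name, counts in param_counts.items()
    }


def params_likelihood(name: str, params: Set[str],
                      probs: Dict[str, Dict[str, float]],
                      use_geo_mean: bool = False) -> float:
    """Likelihood of this param set given the cmd (independence assumed)."""
    cmd_probs = probs.get(name, {})
    lik = 1.0
    for param, prob in cmd_probs.items():
        lik *= prob if param in params else (1.0 - prob)
    if use_geo_mean and cmd_probs:
        # K = number of distinct params seen for this cmd.
        lik = lik ** (1.0 / len(cmd_probs))
    return lik


# Hypothetical observations: three common invocations and one unusual one.
observed = [("SetMailbox", {"Identity", "AuditEnabled"})] * 3
observed.append(("SetMailbox", {"Identity", "Force"}))

probs = param_presence_probs(observed)
common = params_likelihood("SetMailbox", {"Identity", "AuditEnabled"}, probs)
rare = params_likelihood("SetMailbox", {"Identity", "Force"}, probs)
rare_geo = params_likelihood("SetMailbox", {"Identity", "Force"}, probs,
                             use_geo_mean=True)
```

Here K is 3 (Identity, AuditEnabled, Force), so the geometric mean variant takes the cube root of the raw likelihood.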

train()
Train the model by computing counts and probabilities.
In particular, computes the counts and probabilities of the commands (and possibly the params if provided, and possibly the values if provided)
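For sessions made of command names only, "counts and probabilities" can be pictured with the sketch below. It is a plain first-order Markov estimate without smoothing or unknown-token handling, which the real model may well include. The token names and helper functions are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

START_TOKEN, END_TOKEN = "##START##", "##END##"  # hypothetical token names


def count_sessions(sessions: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count single commands and consecutive command pairs."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for session in sessions:
        padded = [START_TOKEN] + list(session) + [END_TOKEN]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def transition_probs(unigrams: Counter,
                     bigrams: Counter) -> Dict[Tuple[str, str], float]:
    """Estimate P(cur | prev) as count(prev, cur) / count(prev)."""
    return {
        (prev, cur): count / unigrams[prev]
        for (prev, cur), count in bigrams.items()
    }


sessions = [
    ["SetUser", "SetMailbox"],
    ["SetUser", "SetMailbox"],
    ["SetUser", "SetUser"],
]
unigrams, bigrams = count_sessions(sessions)
probs = transition_probs(unigrams, bigrams)

# Probabilities of everything that can follow "SetUser".
after_set_user = {cur: p for (prev, cur), p in probs.items() if prev == "SetUser"}
```

Because every command is followed by either another command or the end token, the outgoing probabilities for each command sum to 1.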

class msticpy.analysis.anomalous_sequence.model.SessionType
Bases: object
Class for storing the types of accepted sessions.

cmds_only = 'cmds_only'

cmds_params_only = 'cmds_params_only'

cmds_params_values = 'cmds_params_values'
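The three session types correspond to the three accepted session formats. The sketch below shows one way to tell them apart. The namedtuple Cmd is only a stand-in for the library's Cmd datatype (which the page says has “name” and “params” attributes), and infer_session_type is a hypothetical helper, not part of msticpy.

```python
from collections import namedtuple
from typing import List, Union

# Stand-in for the library's Cmd datatype, for illustration only.
Cmd = namedtuple("Cmd", ["name", "params"])


def infer_session_type(session: List[Union[str, Cmd]]) -> str:
    """Classify a session by inspecting its first element."""
    if not session:
        raise ValueError("session must contain at least one command")
    first = session[0]
    if isinstance(first, str):
        return "cmds_only"
    if isinstance(first.params, dict):
        return "cmds_params_values"
    return "cmds_params_only"


type_1 = infer_session_type(["SetUser", "SetMailbox"])
type_2 = infer_session_type(
    [Cmd(name="SetUser", params={"Identity", "Force"})]
)
type_3 = infer_session_type(
    [Cmd(name="SetUser", params={"Identity": "blahblah", "Force": "true"})]
)
print(type_1, type_2, type_3)
```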

Module for creating sessions out of raw data.

msticpy.analysis.anomalous_sequence.sessionize.create_session_col(data: pandas.core.frame.DataFrame, user_identifier_cols: List[str], time_col: str, max_session_time_mins: int, max_event_separation_mins: int) → pandas.core.frame.DataFrame
Create a “session_ind” column in the dataframe.
In particular, the session_ind column will be incremented each time a new session starts.
Parameters:  data (pd.DataFrame) – This dataframe should contain at least the following columns: a time stamp column, and columns related to user name and/or computer name and/or ip address etc.
 user_identifier_cols (List[str]) – Name of the columns which contain username and/or computer name and/or ip address etc. Each time the value of one of these columns changes, a new session will be started.
 time_col (str) – Name of the column which contains a time stamp. If this column is not already in datetime64[ns, UTC] format, it will be cast to it.
 max_session_time_mins (int) – The maximum length of a session in minutes. If a sequence of events for the same user_identifier_cols values exceeds this length, then a new session will be started.
 max_event_separation_mins (int) – The maximum length in minutes between two events in a session. If we have 2 events for the same user_identifier_cols values, and if those two events are more than max_event_separation_mins apart, then a new session will be started.
Returns: pd.DataFrame with an additional “session_ind” column
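The three rules that start a new session (identifier change, gap too long, session too long) can be sketched without pandas. This is an illustration of the rules as documented, not the library's code. The function name, the tuple layout of the events and the choice to start numbering at 0 are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def assign_session_ind(events: List[Tuple[str, datetime]],
                       max_session_time_mins: int,
                       max_event_separation_mins: int) -> List[int]:
    """Label each (user, timestamp) event, sorted by user then time."""
    max_len = timedelta(minutes=max_session_time_mins)
    max_gap = timedelta(minutes=max_event_separation_mins)
    session_ind = -1
    prev_user = prev_time = session_start = None
    labels = []
    for user, ts in events:
        new_session = (
            prev_user is None
            or user != prev_user              # identifier changed
            or ts - prev_time > max_gap       # events too far apart
            or ts - session_start > max_len   # session has run too long
        )
        if new_session:
            session_ind += 1
            session_start = ts
        labels.append(session_ind)
        prev_user, prev_time = user, ts
    return labels


# Hypothetical events.
events = [
    ("alice", datetime(2020, 1, 1, 10, 0)),
    ("alice", datetime(2020, 1, 1, 10, 5)),
    ("alice", datetime(2020, 1, 1, 10, 40)),  # 35 minute gap
    ("bob", datetime(2020, 1, 1, 10, 41)),    # new user
    ("bob", datetime(2020, 1, 1, 10, 50)),
    ("bob", datetime(2020, 1, 1, 11, 0)),
    ("bob", datetime(2020, 1, 1, 11, 10)),    # 29 minutes after session start
]
labels = assign_session_ind(events, max_session_time_mins=20,
                            max_event_separation_mins=15)
print(labels)
```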

msticpy.analysis.anomalous_sequence.sessionize.sessionize_data(data: pandas.core.frame.DataFrame, user_identifier_cols: List[str], time_col: str, max_session_time_mins: int, max_event_separation_mins: int, event_col: str) → pandas.core.frame.DataFrame
Sessionize the input data.
In particular, the resulting dataframe will have 1 row per session. It will contain the following columns: the user_identifier_cols, <time_col>_min, <time_col>_max, <event_col>_list, duration (<time_col>_max - <time_col>_min), number_events (length of the <event_col>_list value).
Parameters:  data (pd.DataFrame) – This dataframe should contain at least the following columns: a time stamp column, columns related to user name and/or computer name and/or ip address etc, and a column containing an event.
 user_identifier_cols (List[str]) – Name of the columns which contain username and/or computer name and/or ip address etc. Each time the value of one of these columns changes, a new session will be started.
 time_col (str) – Name of the column which contains a time stamp. If this column is not already in datetime64[ns, UTC] format, it will be cast to it.
 max_session_time_mins (int) – The maximum length of a session in minutes. If a sequence of events for the same user_identifier_cols values exceeds this length, then a new session will be started.
 max_event_separation_mins (int) – The maximum length in minutes between two events in a session. If we have 2 events for the same user_identifier_cols values, and if those two events are more than max_event_separation_mins apart, then a new session will be started.
 event_col (str) – Name of the column which contains the event of interest. For example, if we are interested in sessionizing exchange admin commands, the “event_col” could contain values like: “SetMailbox” or “SetUser” etc.
Returns: pd.DataFrame containing the sessionized data. 1 row per session.
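The shape of the sessionized output (one row per session, with min/max time, event list, duration and event count) can be sketched with plain dictionaries. This assumes the rows already carry a session index and are sorted by it. The key names and the function name are hypothetical and only mirror the column naming described above.

```python
from datetime import datetime, timedelta
from itertools import groupby
from typing import Dict, List


def summarise_sessions(rows: List[Dict]) -> List[Dict]:
    """Collapse event rows (sorted by session_ind, then time) to one row per session."""
    summary = []
    for _, group in groupby(rows, key=lambda row: row["session_ind"]):
        items = list(group)
        times = [row["time"] for row in items]
        summary.append({
            "user": items[0]["user"],
            "time_min": min(times),
            "time_max": max(times),
            "event_list": [row["event"] for row in items],
            "duration": max(times) - min(times),
            "number_events": len(items),
        })
    return summary


# Hypothetical, already labelled events.
rows = [
    {"user": "alice", "time": datetime(2020, 1, 1, 10, 0),
     "event": "SetUser", "session_ind": 0},
    {"user": "alice", "time": datetime(2020, 1, 1, 10, 5),
     "event": "SetMailbox", "session_ind": 0},
    {"user": "bob", "time": datetime(2020, 1, 1, 10, 41),
     "event": "SetMailbox", "session_ind": 1},
]
sessions = summarise_sessions(rows)
```

The event_list value of each row is the kind of list that the session_column parameter of score_sessions expects.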
msticpy.analysis.timeseries
Module for timeseries analysis functions.

msticpy.analysis.timeseries.create_time_period_kqlfilter(periods: Dict[datetime.datetime, datetime.datetime]) → str
Create KQL time filter expression from time periods dict.
Parameters: periods (Dict[datetime, datetime]) – Dict of start, end periods
Returns: KQL filter clause
Return type: str
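A filter of this kind could be assembled as in the sketch below. The exact text produced by the library is not shown on this page, so the clause layout, the default column name TimeGenerated and the function name are assumptions. The sketch only relies on the KQL between operator and datetime() literals.

```python
from datetime import datetime
from typing import Dict


def time_periods_to_kql(periods: Dict[datetime, datetime],
                        time_column: str = "TimeGenerated") -> str:
    """Build a KQL where clause matching any of the start/end periods."""
    clauses = [
        f"{time_column} between "
        f"(datetime({start.isoformat()}) .. datetime({end.isoformat()}))"
        for start, end in periods.items()
    ]
    if not clauses:
        return ""
    return "| where " + " or ".join(clauses)


# Hypothetical periods.
periods = {
    datetime(2020, 1, 1, 9, 0): datetime(2020, 1, 1, 12, 0),
    datetime(2020, 1, 1, 14, 0): datetime(2020, 1, 1, 16, 0),
}
kql_filter = time_periods_to_kql(periods)
print(kql_filter)
```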

msticpy.analysis.timeseries.extract_anomaly_periods(data: pandas.core.frame.DataFrame, time_column: str = 'TimeGenerated', period: str = '1H', pos_only: bool = True) → Dict[datetime.datetime, datetime.datetime]
Merge adjacent anomaly periods.
Parameters:  data (pd.DataFrame) – The data to process
 time_column (str, optional) – The name of the time column
 period (str, optional) – pandas-compatible time period designator, by default “1H”
 pos_only (bool, optional) – If True only extract positive anomaly periods, else extract both positive and negative. By default, True
Returns: start_period, end_period
Return type: Dict[datetime, datetime]
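One plausible reading of "merge adjacent anomaly periods" is sketched below. It assumes each anomaly timestamp is widened by one period on either side and that overlapping or touching intervals are merged. The page does not spell out the library's exact rule, so treat the function and its behaviour as illustrative.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable


def merge_anomaly_periods(anomaly_times: Iterable[datetime],
                          period: timedelta = timedelta(hours=1)
                          ) -> Dict[datetime, datetime]:
    """Return {start: end} intervals covering the anomalies."""
    merged: Dict[datetime, datetime] = {}
    start = end = None
    for ts in sorted(anomaly_times):
        if end is not None and ts - period <= end:
            # This anomaly's interval touches the current one: extend it.
            end = ts + period
        else:
            if start is not None:
                merged[start] = end
            start, end = ts - period, ts + period
    if start is not None:
        merged[start] = end
    return merged


# Hypothetical anomaly timestamps.
anomalies = [
    datetime(2020, 1, 1, 10, 0),
    datetime(2020, 1, 1, 11, 0),
    datetime(2020, 1, 1, 15, 0),
]
merged_periods = merge_anomaly_periods(anomalies)
print(merged_periods)
```

The resulting dict has the same start/end shape as the periods argument of create_time_period_kqlfilter.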

msticpy.analysis.timeseries.set_new_anomaly_threshold(data: pandas.core.frame.DataFrame, threshold: int, threshold_low: Optional[int] = None) → pandas.core.frame.DataFrame
Return DataFrame with anomalies calculated based on new threshold.
Parameters:  data (pd.DataFrame) – Input DataFrame
 threshold (int) – Threshold above (beyond) which values will be marked as anomalies. Used as positive and negative threshold unless threshold_low is specified.
 threshold_low (Optional[int], optional) – The threshold below which values will be reported as anomalies, by default None.
Returns: Output DataFrame with recalculated anomalies.
Return type: pd.DataFrame
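The relabelling can be pictured with the sketch below, which works on a plain list of scores instead of a DataFrame. It assumes that threshold_low is given as a positive magnitude applied to negative scores and that the comparisons are strict. Neither detail is stated on this page, and the function name is hypothetical.

```python
from typing import List, Optional


def relabel_anomalies(scores: List[float], threshold: float,
                      threshold_low: Optional[float] = None) -> List[int]:
    """Label scores as 1 (high anomaly), -1 (low anomaly) or 0 (normal)."""
    # Assumption: with no threshold_low, the same magnitude is used both ways.
    low = threshold if threshold_low is None else threshold_low
    labels = []
    for score in scores:
        if score > threshold:
            labels.append(1)
        elif score < -low:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


scores = [0.5, 3.5, -2.5, -4.0]  # hypothetical anomaly scores
symmetric = relabel_anomalies(scores, threshold=3)
asymmetric = relabel_anomalies(scores, threshold=3, threshold_low=2)
print(symmetric, asymmetric)
```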

msticpy.analysis.timeseries.timeseries_anomalies_stl(data: pandas.core.frame.DataFrame, **kwargs) → pandas.core.frame.DataFrame
Return anomalies in Timeseries using STL.
Parameters: data (pd.DataFrame) – DataFrame as a time series data set retrieved from a data connector or external data source. The DataFrame must have 2 columns, with the time column set as the index and the other column holding numeric values.
Other Parameters:  seasonal (int, optional) – Seasonality period of the input data required for STL. Must be an odd integer, and should normally be >= 7 (default).
 period (int, optional) – Periodicity of the input data, by default 24 (Hourly).
 score_threshold (float, optional) – standard deviation threshold value calculated using Z-score, used to flag anomalies, by default 3
Returns: Returns a dataframe with additional columns by decomposing time series data into residual, trend, seasonal, weights, baseline, score and anomalies. The anomalies column will have 0, 1, -1 values based on the score_threshold set.
Return type: pd.DataFrame
Notes
The decomposition method is STL: Seasonal-Trend Decomposition using LOESS
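The scoring step that follows the decomposition can be sketched on its own. The example below does not perform STL. It assumes the residuals are already available and shows how a Z-score with a threshold of 3 yields the 0, 1, -1 flags. The function name and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def zscore_flags(residuals: List[float],
                 score_threshold: float = 3.0) -> Tuple[List[float], List[int]]:
    """Return Z-scores of the residuals and their 1 / -1 / 0 anomaly flags."""
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        # No variation at all: nothing can be anomalous.
        return [0.0] * len(residuals), [0] * len(residuals)
    scores = [(value - mu) / sigma for value in residuals]
    flags = [
        1 if score > score_threshold
        else -1 if score < -score_threshold
        else 0
        for score in scores
    ]
    return scores, flags


# Hypothetical residuals: twenty quiet points and one spike.
residuals = [0.0] * 20 + [10.0]
scores, flags = zscore_flags(residuals)
print(flags)
```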
msticpy.analysis.eventcluster module
Deprecated placeholder for eventcluster.py.
msticpy.analysis.outliers module
Outlier detection class. TODO Preliminary.
Similar to the eventcluster module but a little bit more experimental (read ‘less tested’). It uses SkLearn Isolation Forest to identify outlier events in a single data set or using one data set as training data and another on which to predict outliers.

msticpy.analysis.outliers.identify_outliers(x: numpy.ndarray, x_predict: numpy.ndarray, contamination: float = 0.05) → Tuple[IsolationForest, numpy.ndarray, numpy.ndarray]
Identify outlier items using SkLearn IsolationForest.
Parameters:  x (np.ndarray) – Input data
 x_predict (np.ndarray) – Model
 contamination (float) – Percentage contamination (default: 0.05)
Returns: IsolationForest model, X_Outliers, y_pred_outliers
Return type: Tuple[IsolationForest, np.ndarray, np.ndarray]

msticpy.analysis.outliers.plot_outlier_results(clf: IsolationForest, x: numpy.ndarray, x_predict: numpy.ndarray, x_outliers: numpy.ndarray, feature_columns: List[int], plt_title: str)
Plot Isolation Forest results.
Parameters:  clf (IsolationForest) – Isolation Forest model
 x (np.ndarray) – Input data
 x_predict (np.ndarray) – Prediction
 x_outliers (np.ndarray) – Set of outliers
 feature_columns (List[int]) – list of feature columns to display
 plt_title (str) – Plot title

msticpy.analysis.outliers.remove_common_items(data: pandas.core.frame.DataFrame, columns: List[str]) → pandas.core.frame.DataFrame
Remove rows from input DataFrame.
Parameters:  data (pd.DataFrame) – Input dataframe
 columns (List[str]) – Column list to filter
Returns: Filtered DataFrame
Return type: pd.DataFrame
msticpy.analysis.cluster_auditd module
Auditd cluster function.

msticpy.analysis.cluster_auditd.cluster_auditd_processes(audit_data: pandas.core.frame.DataFrame, app: str = None) → pandas.core.frame.DataFrame
Clusters process data into specific processes.
Parameters:  audit_data (pd.DataFrame) – The Audit data containing process creation events
 app (str, optional) – The name of a specific app you wish to cluster
Returns: Details of the clustered process
Return type: pd.DataFrame