msticpy.analysis.anomalous_sequence package
Subpackages
- msticpy.analysis.anomalous_sequence.utils package
- Submodules
- msticpy.analysis.anomalous_sequence.utils.cmds_only module
- msticpy.analysis.anomalous_sequence.utils.cmds_params_only module
- msticpy.analysis.anomalous_sequence.utils.cmds_params_values module
- msticpy.analysis.anomalous_sequence.utils.data_structures module
- msticpy.analysis.anomalous_sequence.utils.laplace_smooth module
- msticpy.analysis.anomalous_sequence.utils.probabilities module
- Module contents
Submodules
msticpy.analysis.anomalous_sequence.anomalous module
Wrapper module for Model class for modelling sessions.
In particular, this module is for both modelling and visualising your session data.
- msticpy.analysis.anomalous_sequence.anomalous.score_and_visualise_sessions(data: pandas.core.frame.DataFrame, session_column: str, window_length: int, time_column: str, likelihood_upper_bound: Optional[float] = None, source_columns: Optional[list] = None)
Model sessions and then produce an interactive timeline visualisation plot.
In particular, the sessions are modelled using a sliding window approach within a Markov model. The visualisation plot has time on the x-axis and the modelled session likelihood metric on the y-axis.
- Parameters
data (pd.DataFrame) – Dataframe which contains at least columns for time and sessions
session_column (str) –
name of the column which contains the sessions. The values in the session column should take one of the following formats:
['Set-User', 'Set-Mailbox']
[Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
[Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
The Cmd datatype can be accessed from anomalous_sequence.utils.data_structures.Cmd
window_length (int) –
length of the sliding window to use when computing the likelihood metrics for each session.
This should be set to an integer >= 2. Note that sessions which have fewer commands than the chosen window_length + 1 will not appear in the visualisation. (The + 1 is because we append a dummy end_token to each session before starting the sliding window, so a session of length 2 would be treated as length 3.)
time_column (str) – name of the column which contains a timestamp
likelihood_upper_bound (float, optional) – an optional upper bound on the likelihood metrics for the visualisation plot. This can help to zoom in on the more anomalous sessions.
source_columns (list, optional) – an optional list of source columns to include in the tooltips in the visualisation. Note that the content of each of these columns should be JSON serializable in order to be compatible with the figure.
- Return type
figure
- msticpy.analysis.anomalous_sequence.anomalous.score_sessions(data: pandas.core.frame.DataFrame, session_column: str, window_length: int) pandas.core.frame.DataFrame
Model sessions using a sliding window approach within a Markov model.
- Parameters
data (pd.DataFrame) – Dataframe which contains at least a column for sessions
session_column (str) –
name of the column which contains the sessions. The values in the session column should take one of the following formats:
['Set-User', 'Set-Mailbox']
[Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
[Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
The Cmd datatype can be accessed from anomalous_sequence.utils.data_structures.Cmd
window_length (int) – length of the sliding window to use when computing the likelihood metrics for each session. This should be set to an integer >= 2. Note that sessions which have fewer commands than the chosen window_length + 1 will end up with a np.nan score. (The + 1 is because we append a dummy end_token to each session before starting the sliding window, so a session of length 2 would be treated as length 3.)
- Return type
input dataframe with two additional columns appended.
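The effect of the appended end token on short sessions can be illustrated with a minimal, library-independent sketch. The helper `sliding_windows` and the token string `"##END##"` are hypothetical names used only for illustration; they are not taken from the msticpy source.

```python
END_TOKEN = "##END##"  # hypothetical placeholder for the dummy end_token


def sliding_windows(session, window_length, append_end_token=True):
    """Return every contiguous window of `window_length` items in the session."""
    items = list(session)
    if append_end_token:
        items.append(END_TOKEN)
    last_start = len(items) - window_length
    # range() is empty when the session is shorter than the window
    return [items[i:i + window_length] for i in range(last_start + 1)]


session = ["Set-User", "Set-Mailbox"]

# With the end token, the 2-command session is treated as length 3,
# so exactly one window of length 3 fits.
with_token = sliding_windows(session, window_length=3)

# Without it, no window fits, so there is nothing to score
# (the case the documentation describes as np.nan).
without_token = sliding_windows(session, window_length=3, append_end_token=False)

print(with_token)
print(without_token)
```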
- msticpy.analysis.anomalous_sequence.anomalous.visualise_scored_sessions(data_with_scores: pandas.core.frame.DataFrame, time_column: str, score_column: str, window_column: str, score_upper_bound: Optional[float] = None, source_columns: Optional[list] = None)
Visualise the scored sessions on an interactive timeline.
- Parameters
data_with_scores (pd.DataFrame) – Dataframe which contains at least columns for time, session score, window representing the session
time_column (str) – name of the column which contains a timestamp
score_column (str) – name of the column which contains a numerical score for each of the sessions
window_column (str) – name of the column which contains a representation of each of the sessions. This representation will appear in the tooltips in the figure. For example, it could be the rarest window of the session, or the full session etc.
score_upper_bound (float, optional) – an optional upper bound on the score for the visualisation figure. This can help to zoom in on the more anomalous sessions.
source_columns (list, optional) – an optional list of source columns to include in the tooltips in the visualisation. Note that the content of each of these columns should be JSON serializable in order to be compatible with the figure.
- Return type
figure
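Because the source_columns content should be JSON serializable, it can be useful to check candidate columns before plotting. The following is a minimal sketch using only the standard library; the helper `non_serializable_columns` and the column names are hypothetical, and the rows are plain dictionaries standing in for dataframe records.

```python
import json


def non_serializable_columns(rows, columns):
    """Return the columns holding at least one value that json.dumps rejects."""
    bad = []
    for col in columns:
        try:
            for row in rows:
                json.dumps(row[col])
        except (TypeError, ValueError):
            bad.append(col)
    return bad


rows = [
    {"UserId": "alice", "ClientIP": "10.0.0.1", "Params": {"Identity", "Force"}},
    {"UserId": "bob", "ClientIP": "10.0.0.2", "Params": {"Identity"}},
]

# Python sets are not JSON serializable, so "Params" is flagged.
problem_cols = non_serializable_columns(rows, ["UserId", "ClientIP", "Params"])
print(problem_cols)
```

With a pandas dataframe, the rows could be obtained with `data.to_dict("records")`; flagged columns could then be converted (for example, sets to sorted lists) or left out of source_columns.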
msticpy.analysis.anomalous_sequence.model module
Module for Model class for modelling sessions data.
- class msticpy.analysis.anomalous_sequence.model.Model(sessions: List[List[Union[str, msticpy.analysis.anomalous_sequence.utils.data_structures.Cmd]]], modellable_params: Optional[set] = None)
Bases: object
Class for modelling sessions data.
Instantiate the Model class.
This Model class can be used to model sessions, where each session is a sequence of commands. We use a sliding window approach to calculate the rarest part of each session. We can view the sessions in ascending order of this metric to see if the top sessions are anomalous/malicious.
- Parameters
sessions (List[List[Union[str, Cmd]]]) –
list of sessions, where each session is a list of either strings or Cmd objects.
The Cmd datatype should have “name” and “params” as attributes where “name” is the name of the command (string) and “params” is either a set of accompanying params or a dict of accompanying params and values.
Example formats of a session:
1) ['Set-User', 'Set-Mailbox']
2) [Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
3) [Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
modellable_params (set, optional) – set of params which you deem to have categorical values which are suitable for modelling. Note this argument will only have an effect if your sessions include commands, params and values. If your sessions include commands, params and values and this argument is not set, then some rough heuristics will be used to determine which params have values which are suitable for modelling.
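The three session formats can be told apart by inspecting the first element of a session. The sketch below uses a namedtuple as a stand-in for the Cmd datatype so that it runs without msticpy installed; the stand-in and the helper `describe_session` are assumptions for illustration, not part of the library.

```python
from collections import namedtuple

# Stand-in with the two documented attributes ("name" and "params").
# In real use, Cmd comes from anomalous_sequence.utils.data_structures.
Cmd = namedtuple("Cmd", ["name", "params"])

cmds_only = ["Set-User", "Set-Mailbox"]
cmds_params = [
    Cmd(name="Set-User", params={"Identity", "Force"}),
    Cmd(name="Set-Mailbox", params={"Identity", "AuditEnabled"}),
]
cmds_params_values = [
    Cmd(name="Set-User", params={"Identity": "blahblah", "Force": "true"}),
    Cmd(name="Set-Mailbox", params={"Identity": "blahblah", "AuditEnabled": "false"}),
]


def describe_session(session):
    """Classify a session by the type of its first element (hypothetical helper)."""
    first = session[0]
    if isinstance(first, str):
        return "commands only"
    if isinstance(first.params, dict):
        return "commands, params and values"
    return "commands and params"


for example in (cmds_only, cmds_params, cmds_params_values):
    print(describe_session(example))
```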
- compute_geomean_lik_of_sessions()
Compute the geometric mean of the likelihood for each of the sessions.
This is done by raising the likelihood of the session to the power of (1 / k) where k is the length of the session.
Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
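A small numeric sketch of the length effect described in the note: two sessions whose commands are all equally likely have very different raw likelihoods, but similar geometric means. The function name and the probability of 0.5 per command are assumptions chosen for illustration.

```python
def geometric_mean_likelihood(likelihood, session_length):
    """Raise the session likelihood to the power 1/k, where k is the session length."""
    return likelihood ** (1 / session_length)


# Assume every command in both sessions contributes a probability of 0.5.
short_likelihood = 0.5 ** 2    # 2-command session
long_likelihood = 0.5 ** 10    # 10-command session

short_geomean = geometric_mean_likelihood(short_likelihood, 2)
long_geomean = geometric_mean_likelihood(long_likelihood, 10)

print(short_likelihood, long_likelihood)  # raw likelihoods differ widely
print(short_geomean, long_geomean)        # both are close to 0.5
```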
- compute_likelihoods_of_sessions(use_start_end_tokens: bool = True)
Compute the likelihoods for each of the sessions.
Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
- Parameters
use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to the session respectively before the calculations are done
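As a rough sketch of what a session likelihood under a first-order Markov model looks like, the function below multiplies transition probabilities along the session. The token strings, the dictionary layout and the use of a prior probability for the first command when no start token is added are all assumptions made for this illustration; the library's internal data structures may differ.

```python
START, END = "##START##", "##END##"  # hypothetical token names


def session_likelihood(session, prior_probs, trans_probs, use_start_end_tokens=True):
    """Likelihood of a session as a product of first-order transition probabilities."""
    seq = list(session)
    if use_start_end_tokens:
        seq = [START] + seq + [END]
        likelihood = 1.0
    else:
        # Without a start token, begin from the prior of the first command.
        likelihood = prior_probs[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        likelihood *= trans_probs[(prev, cur)]
    return likelihood


# Made-up probabilities, for illustration only.
prior_probs = {"Set-User": 0.5, "Set-Mailbox": 0.5}
trans_probs = {
    (START, "Set-User"): 0.5,
    ("Set-User", "Set-Mailbox"): 0.25,
    ("Set-Mailbox", END): 0.5,
}

session = ["Set-User", "Set-Mailbox"]
with_tokens = session_likelihood(session, prior_probs, trans_probs)
without_tokens = session_likelihood(
    session, prior_probs, trans_probs, use_start_end_tokens=False
)
print(with_tokens, without_tokens)
```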
- compute_rarest_windows(window_len: int, use_start_end_tokens: bool = True, use_geo_mean: bool = False)
Find the rarest window and corresponding likelihood for each session.
In particular, uses a sliding window approach to find the rarest window and corresponding likelihood for that window for each session.
If we have a long session filled with benign activity except for a small window of suspicious behaviour, then this approach should be able to identify the session as anomalous. This approach should be more effective than simply taking the geometric mean of the full session likelihood. This is because the small window of suspicious behaviour might get averaged out by the majority benign behaviour in the session when using the geometric mean approach.
Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then because we will be appending self.end_token to the session, the session will be treated as a session of length k+1, therefore, we will end up with a non np.nan value.
- Parameters
window_len (int) – length of sliding window for likelihood calculations
use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to each session respectively before the calculations are done
use_geo_mean (bool) – if True, then each of the likelihoods of the sliding windows will be raised to the power of (1/window_len)
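The rarest-window idea can be sketched as follows. This is a simplified illustration, not the library's implementation: each window is scored only by the product of the transition probabilities inside it, the end-token string is hypothetical, and the probabilities are made up.

```python
import math

END = "##END##"  # hypothetical end-token string


def rarest_window(session, window_len, trans_probs,
                  use_end_token=True, use_geo_mean=False):
    """Return (window, likelihood) for the least likely window in the session."""
    seq = list(session) + ([END] if use_end_token else [])
    best_window, best_lik = [], math.nan
    for i in range(len(seq) - window_len + 1):
        window = seq[i:i + window_len]
        lik = 1.0
        for prev, cur in zip(window, window[1:]):
            lik *= trans_probs[(prev, cur)]
        if use_geo_mean:
            lik = lik ** (1 / window_len)
        if math.isnan(best_lik) or lik < best_lik:
            best_window, best_lik = window, lik
    return best_window, best_lik


trans_probs = {
    ("Set-User", "Set-User"): 0.5,
    ("Set-User", "Set-Mailbox"): 0.03125,  # the unusual transition
    ("Set-Mailbox", "Set-User"): 0.25,
    ("Set-User", END): 0.5,
}
session = ["Set-User", "Set-User", "Set-Mailbox", "Set-User"]

window, lik = rarest_window(session, window_len=2, trans_probs=trans_probs)
print(window, lik)

# A window longer than the session yields no windows, hence NaN.
empty_window, nan_lik = rarest_window(
    ["Set-User"], window_len=3, trans_probs=trans_probs, use_end_token=False
)
```

The single unusual transition determines the session's score even though the other windows are common, which is the behaviour the description above contrasts with the full-session geometric mean.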
- compute_scores(use_start_end_tokens: bool)
Compute some likelihood based scores/metrics for each of the sessions.
In particular, computes the likelihoods and geometric mean of the likelihoods for each of the sessions. Also, uses the sliding window approach to compute the rarest window likelihoods for each of the sessions. It does this for windows of length 2 and 3.
Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then because we will be appending self.end_token to the session, the session will be treated as a session of length k+1, therefore, we will end up with a non np.nan value for that session.
- Parameters
use_start_end_tokens (bool) – if True, then self.start_token and self.end_token will be prepended and appended to each of the sessions respectively before the calculations are done.
- compute_setof_params_cond_cmd(use_geo_mean: bool)
Compute likelihood of combinations of params conditional on the cmd.
In particular, go through each command from each session and compute the probability of that set of params (and values if provided) appearing conditional on the command.
This can help us to identify unlikely combinations of params (and values if provided) for each distinct command.
Note that this method is only available if each session is a list of the Cmd datatype. It will raise an Exception if you try to use it when each session is a list of strings.
- Parameters
use_geo_mean (bool) –
if True, then the probabilities will be raised to the power of (1/K)
- case1: we have only params:
Then K is the number of distinct params which appeared for the given cmd across all the sessions.
- case2: we have params and values:
Then K is the number of distinct params which appeared for the given cmd across all the sessions + the number of values which we included in the modelling for this cmd.
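One plausible way to read case1 is to treat each distinct param seen with a command as an independent present/absent indicator. The sketch below follows that reading; it is an assumption made for illustration and may not match the library's exact calculation. The function name and the probabilities are hypothetical.

```python
def params_given_cmd_likelihood(params, param_cond_cmd_probs, use_geo_mean=False):
    """Likelihood of a set of params given the command (params-only case).

    param_cond_cmd_probs maps every distinct param seen with this command
    to an assumed P(param present | cmd).
    """
    likelihood = 1.0
    for param, prob in param_cond_cmd_probs.items():
        # Present params contribute prob, absent params contribute 1 - prob.
        likelihood *= prob if param in params else 1 - prob
    if use_geo_mean:
        k = len(param_cond_cmd_probs)  # K = number of distinct params for the cmd
        likelihood = likelihood ** (1 / k)
    return likelihood


# Made-up conditional probabilities for a single command.
probs_for_set_user = {"Identity": 0.75, "Force": 0.25}

common = params_given_cmd_likelihood({"Identity"}, probs_for_set_user)
unusual = params_given_cmd_likelihood({"Force"}, probs_for_set_user)
unusual_geo = params_given_cmd_likelihood(
    {"Force"}, probs_for_set_user, use_geo_mean=True
)
print(common, unusual, unusual_geo)
```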
- train()
Train the model by computing counts and probabilities.
In particular, computes the counts and probabilities of the commands (and possibly the params if provided, and possibly the values if provided)
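For the commands-only case, computing counts and probabilities can be sketched as below. This is a minimal illustration under stated assumptions: the token strings are hypothetical, the dictionary layout is chosen for readability, and no smoothing is applied to the counts.

```python
from collections import Counter, defaultdict

START, END = "##START##", "##END##"  # hypothetical token names


def count_and_normalise(sessions):
    """Count commands and command-to-command transitions, then normalise."""
    cmd_counts = Counter()
    trans_counts = defaultdict(Counter)
    for session in sessions:
        seq = [START] + list(session) + [END]
        cmd_counts.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            trans_counts[prev][cur] += 1

    total = sum(cmd_counts.values())
    prior_probs = {cmd: n / total for cmd, n in cmd_counts.items()}
    trans_probs = {
        prev: {cur: n / sum(nxt.values()) for cur, n in nxt.items()}
        for prev, nxt in trans_counts.items()
    }
    return prior_probs, trans_probs


sessions = [["Set-User", "Set-Mailbox"], ["Set-User", "Set-User"]]
prior_probs, trans_probs = count_and_normalise(sessions)
print(trans_probs["Set-User"])
```

A transition never seen in training has no entry here, so scoring it would fail; this is the kind of gap that smoothing the counts is meant to close.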
msticpy.analysis.anomalous_sequence.sessionize module
Module for creating sessions out of raw data.
- msticpy.analysis.anomalous_sequence.sessionize.create_session_col(data: pandas.core.frame.DataFrame, user_identifier_cols: List[str], time_col: str, max_session_time_mins: int, max_event_separation_mins: int) pandas.core.frame.DataFrame
Create a “session_ind” column in the dataframe.
In particular, the session_ind column will be incremented each time a new session starts.
- Parameters
data (pd.DataFrame) – This dataframe should contain at least the following columns:
- time stamp column
- columns related to user name and/or computer name and/or IP address etc.
user_identifier_cols (List[str]) – Names of the columns which contain username and/or computer name and/or IP address etc. Each time the value of one of these columns changes, a new session will be started.
time_col (str) – Name of the column which contains a time stamp. If this column is not already in datetime64[ns, UTC] format, it will be cast to that format.
max_session_time_mins (int) – The maximum length of a session in minutes. If a sequence of events for the same user_identifier_cols values exceeds this length, then a new session will be started.
max_event_separation_mins (int) – The maximum length in minutes between two events in a session. If we have 2 events for the same user_identifier_cols values, and if those two events are more than max_event_separation_mins apart, then a new session will be started.
- Return type
pd.DataFrame with an additional “session_ind” column
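The three rules for starting a new session (identifier change, event gap, total session length) can be sketched without pandas as follows. The function `assign_session_ind`, the zero-based index and the strict "greater than" comparisons are assumptions for illustration; the events are assumed to be sorted by identifier and then by time.

```python
from datetime import datetime, timedelta


def assign_session_ind(events, max_session_time_mins, max_event_separation_mins):
    """events: list of (user_key, timestamp) sorted by user_key, then timestamp."""
    max_len = timedelta(minutes=max_session_time_mins)
    max_gap = timedelta(minutes=max_event_separation_mins)
    session_inds = []
    session_ind = 0
    prev = None           # (user_key, timestamp) of the previous event
    session_start = None  # timestamp of the first event in the current session
    for user, ts in events:
        if prev is None:
            session_start = ts
        elif (
            user != prev[0]                # identifier values changed
            or ts - prev[1] > max_gap      # too long since the previous event
            or ts - session_start > max_len  # session has run too long
        ):
            session_ind += 1
            session_start = ts
        session_inds.append(session_ind)
        prev = (user, ts)
    return session_inds


t0 = datetime(2020, 1, 1, 9, 0)
minutes = lambda m: t0 + timedelta(minutes=m)
events = [
    (("alice",), minutes(0)),
    (("alice",), minutes(5)),
    (("alice",), minutes(50)),  # 45-minute gap: new session
    (("bob",), minutes(1)),     # different user: new session
    (("bob",), minutes(10)),
    (("bob",), minutes(19)),
    (("bob",), minutes(28)),    # 27 minutes since session start: new session
]
session_inds = assign_session_ind(
    events, max_session_time_mins=25, max_event_separation_mins=20
)
print(session_inds)
```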
- msticpy.analysis.anomalous_sequence.sessionize.sessionize_data(data: pandas.core.frame.DataFrame, user_identifier_cols: List[str], time_col: str, max_session_time_mins: int, max_event_separation_mins: int, event_col: str) pandas.core.frame.DataFrame
Sessionize the input data.
In particular, the resulting dataframe will have 1 row per session. It will contain the following columns: the user_identifier_cols, <time_col>_min, <time_col>_max, <event_col>_list, duration (<time_col>_max - <time_col>_min), number_events (length of the <event_col>_list value)
- Parameters
data (pd.DataFrame) – This dataframe should contain at least the following columns:
- time stamp column
- columns related to user name and/or computer name and/or IP address etc.
- column containing an event
user_identifier_cols (List[str]) – Names of the columns which contain username and/or computer name and/or IP address etc. Each time the value of one of these columns changes, a new session will be started.
time_col (str) – Name of the column which contains a time stamp. If this column is not already in datetime64[ns, UTC] format, it will be cast to that format.
max_session_time_mins (int) – The maximum length of a session in minutes. If a sequence of events for the same user_identifier_cols values exceeds this length, then a new session will be started.
max_event_separation_mins (int) – The maximum length in minutes between two events in a session. If we have 2 events for the same user_identifier_cols values, and if those two events are more than max_event_separation_mins apart, then a new session will be started.
event_col (str) – Name of the column which contains the event of interest. For example, if we are interested in sessionizing Exchange admin commands, the “event_col” could contain values like “Set-Mailbox” or “Set-User”.
- Return type
pd.DataFrame containing the sessionized data. 1 row per session.
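The shape of the sessionized output (one row per session, with the columns listed above) can be sketched with plain dictionaries. The column names `UserId`, `TimeGenerated` and `Operation`, the helper `aggregate_sessions`, and the pre-computed `session_ind` values are hypothetical; rows are assumed to be ordered so that each session's events are contiguous.

```python
from datetime import datetime, timedelta
from itertools import groupby


def aggregate_sessions(rows, user_col, time_col, event_col):
    """Collapse event rows into one summary dictionary per session_ind."""
    sessions = []
    for _, group in groupby(rows, key=lambda r: r["session_ind"]):
        group = list(group)
        times = [r[time_col] for r in group]
        events = [r[event_col] for r in group]
        sessions.append({
            user_col: group[0][user_col],
            f"{time_col}_min": min(times),
            f"{time_col}_max": max(times),
            f"{event_col}_list": events,
            "duration": max(times) - min(times),
            "number_events": len(events),
        })
    return sessions


t0 = datetime(2020, 1, 1, 9, 0)
rows = [
    {"UserId": "alice", "TimeGenerated": t0,
     "Operation": "Set-User", "session_ind": 0},
    {"UserId": "alice", "TimeGenerated": t0 + timedelta(minutes=5),
     "Operation": "Set-Mailbox", "session_ind": 0},
    {"UserId": "bob", "TimeGenerated": t0 + timedelta(minutes=1),
     "Operation": "Set-Mailbox", "session_ind": 1},
]

sessions = aggregate_sessions(rows, "UserId", "TimeGenerated", "Operation")
for s in sessions:
    print(s["UserId"], s["Operation_list"], s["duration"], s["number_events"])
```

The `<event_col>_list` values produced this way have the commands-only session format, so they could serve as the session column for the scoring functions in the anomalous module.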
Module contents
MSTIC Anomalous Sequence Modelling Tools.