msticpy.analysis.anomalous_sequence.model module
Module for Model class for modelling sessions data.
- class msticpy.analysis.anomalous_sequence.model.Model(sessions: List[List[str | Cmd]], modellable_params: set | None = None)
Bases: object
Class for modelling sessions data.
Instantiate the Model class.
This Model class can be used to model sessions, where each session is a sequence of commands. We use a sliding window approach to calculate the rarest part of each session. We can then view the sessions in ascending order of this metric, so that the sessions containing the rarest windows come first, and check whether those top sessions are anomalous/malicious.
- Parameters:
sessions (List[List[Union[str, Cmd]]]) – list of sessions, where each session is a list of either strings or Cmd objects.
The Cmd datatype should have "name" and "params" as attributes, where "name" is the name of the command (string) and "params" is either a set of accompanying params or a dict of accompanying params and values.
Example formats of a session:
1) ['Set-User', 'Set-Mailbox']
2) [Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
3) [Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]
modellable_params (set, optional) – set of params which you deem to have categorical values which are suitable for modelling. Note this argument will only have an effect if your sessions include commands, params and values. If your sessions include commands, params and values and this argument is not set, then some rough heuristics will be used to determine which params have values which are suitable for modelling.
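The three session formats can be written out as plain Python data. This is a minimal sketch: the `Cmd` namedtuple below is a hypothetical stand-in exposing the "name" and "params" attributes described above, not the library's own Cmd class, and `session_kind` is an illustrative helper that is not part of the Model API.

```python
from collections import namedtuple

# Hypothetical stand-in for the Cmd datatype: any object with "name" and
# "params" attributes matches the description in the parameter docs.
Cmd = namedtuple("Cmd", ["name", "params"])

# Format 1: commands only.
session_cmds = ["Set-User", "Set-Mailbox"]

# Format 2: commands with a set of param names.
session_params = [
    Cmd(name="Set-User", params={"Identity", "Force"}),
    Cmd(name="Set-Mailbox", params={"Identity", "AuditEnabled"}),
]

# Format 3: commands with a dict of params and their values.
session_values = [
    Cmd(name="Set-User", params={"Identity": "blahblah", "Force": "true"}),
    Cmd(name="Set-Mailbox", params={"Identity": "blahblah", "AuditEnabled": "false"}),
]


def session_kind(session):
    """Classify a non-empty session by the format of its first element."""
    first = session[0]
    if isinstance(first, str):
        return "cmds_only"
    if isinstance(first.params, dict):
        return "cmds_params_values"
    return "cmds_params_only"
```

According to the description above, modellable_params only has an effect for the third format, where values are present.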
- compute_geomean_lik_of_sessions()
Compute the geometric mean of the likelihood for each of the sessions.
This is done by raising the likelihood of the session to the power of (1 / k) where k is the length of the session.
Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
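The effect of the (1 / k) exponent can be seen with a small worked example. This is a sketch of the arithmetic only, not the library's implementation; `geomean_likelihood` is a hypothetical helper and the per-command probability of 0.5 is made up for illustration.

```python
def geomean_likelihood(likelihood: float, session_len: int) -> float:
    """Return likelihood ** (1 / k), where k is the session length."""
    if session_len <= 0:
        raise ValueError("session_len must be positive")
    return likelihood ** (1.0 / session_len)


# Two hypothetical sessions in which every command has probability 0.5.
short_lik = 0.5 ** 2    # session of 2 commands
long_lik = 0.5 ** 10    # session of 10 commands

# The raw likelihood penalises the longer session purely for its length...
raw_gap = short_lik / long_lik

# ...while the geometric mean puts both sessions on the same scale.
short_geo = geomean_likelihood(short_lik, 2)
long_geo = geomean_likelihood(long_lik, 10)
```

Here the raw likelihoods differ by a factor of 256, although both sessions are equally "typical" per command; after taking the geometric mean both come out at roughly 0.5.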
- compute_likelihoods_of_sessions(use_start_end_tokens: bool = True)
Compute the likelihoods for each of the sessions.
Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
- Parameters:
use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to the session respectively before the calculations are done
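To illustrate what prepending and appending the tokens does, the sketch below assumes a first-order model in which each command's probability depends only on the preceding command. That model, the token strings, the probability tables, and the `session_likelihood` helper are all assumptions made for illustration; the text above does not spell out the probability model.

```python
# Hypothetical token names, used only in this sketch.
START, END = "##START##", "##END##"


def session_likelihood(session, prior_probs, trans_probs, use_start_end_tokens=True):
    """Multiply step probabilities along the session.

    prior_probs[cmd] is P(cmd); trans_probs[prev][cur] is P(cur | prev).
    """
    seq = list(session)
    if use_start_end_tokens:
        # The start token is always present, so it contributes probability 1.
        seq = [START, *seq, END]
        lik = 1.0
    else:
        if not seq:
            raise ValueError("session must not be empty")
        lik = prior_probs[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        lik *= trans_probs[prev][cur]
    return lik


# Made-up probabilities for a two-command vocabulary.
prior_probs = {"A": 0.5, "B": 0.5}
trans_probs = {
    START: {"A": 0.5, "B": 0.5},
    "A": {"B": 0.5, END: 0.5},
    "B": {"A": 0.25, END: 0.75},
}

with_tokens = session_likelihood(["A", "B"], prior_probs, trans_probs, True)
without_tokens = session_likelihood(["A", "B"], prior_probs, trans_probs, False)
```

With the tokens, the likelihood also accounts for how plausible it is that the session starts with "A" and ends after "B", which adds two extra factors to the product.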
- compute_rarest_windows(window_len: int, use_start_end_tokens: bool = True, use_geo_mean: bool = False)
Find the rarest window and corresponding likelihood for each session.
In particular, uses a sliding window approach to find the rarest window and corresponding likelihood for that window for each session.
If we have a long session filled with benign activity except for a small window of suspicious behaviour, then this approach should be able to identify the session as anomalous. This approach should be more effective than simply taking the geometric mean of the full session likelihood. This is because the small window of suspicious behaviour might get averaged out by the majority benign behaviour in the session when using the geometric mean approach.
Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then self.end_token is appended to the session, so the session is treated as a session of length k+1 and we end up with a value that is not np.nan.
- Parameters:
window_len (int) – length of sliding window for likelihood calculations
use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to each session respectively before the calculations are done
use_geo_mean (bool) – if True, then each of the likelihoods of the sliding windows will be raised to the power of (1/window_len)
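The sliding window search can be sketched as follows. This is a simplified illustration, not the library's code: it assumes each step of the session already has a probability attached (the hypothetical `step_probs` list) and treats a window's likelihood as the product of its step probabilities.

```python
import math


def rarest_window(step_probs, window_len, use_geo_mean=False):
    """Return (start_index, likelihood) of the least likely window.

    step_probs[i] is the (assumed, precomputed) probability of step i.
    Returns (None, nan) when the window is longer than the session.
    """
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    if window_len > len(step_probs):
        return None, math.nan
    best_start, best_lik = None, math.inf
    for start in range(len(step_probs) - window_len + 1):
        lik = math.prod(step_probs[start:start + window_len])
        if use_geo_mean:
            lik = lik ** (1.0 / window_len)
        if lik < best_lik:
            best_start, best_lik = start, lik
    return best_start, best_lik


# Mostly benign session with two unusual steps in the middle (made-up values).
step_probs = [0.5, 0.5, 0.125, 0.25, 0.5]

start, lik = rarest_window(step_probs, window_len=2)

# A window longer than the session yields nan...
too_long = rarest_window(step_probs, window_len=6)

# ...unless an extra step (for example an end token) lengthens the session.
with_end_step = rarest_window(step_probs + [0.5], window_len=6)
```

The window starting at index 2 covers the two unusual steps and has likelihood 0.03125, so it is the one reported for this session.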
- compute_scores(use_start_end_tokens: bool)
Compute some likelihood-based scores/metrics for each of the sessions.
In particular, computes the likelihoods and geometric mean of the likelihoods for each of the sessions. Also, uses the sliding window approach to compute the rarest window likelihoods for each of the sessions. It does this for windows of length 2 and 3.
Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then self.end_token is appended to the session, so the session is treated as a session of length k+1 and we end up with a value that is not np.nan for that session.
- Parameters:
use_start_end_tokens (bool) – if True, then self.start_token and self.end_token will be prepended and appended to each of the sessions respectively before the calculations are done.
- compute_setof_params_cond_cmd(use_geo_mean: bool)
Compute likelihood of combinations of params conditional on the cmd.
In particular, go through each command from each session and compute the probability of that set of params (and values if provided) appearing conditional on the command.
This can help us to identify unlikely combinations of params (and values if provided) for each distinct command.
Note that this method is only available if each session is a list of the Cmd datatype. It will raise an Exception if you try to use it when each session is a list of strings.
- Parameters:
use_geo_mean (bool) – if True, then the probabilities will be raised to the power of (1/K)
- case1: we have only params:
Then K is the number of distinct params which appeared for the given cmd across all the sessions.
- case2: we have params and values:
Then K is the number of distinct params which appeared for the given cmd across all the sessions + the number of values which we included in the modelling for this cmd.
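A sketch of case1 (params only) is shown below. It assumes that params are treated as independent given the command, so that the probability of a param set is a product over every distinct param seen for that command. That independence assumption, the probability table, and the `prob_param_set_given_cmd` helper are illustrative and not taken from the text above.

```python
def prob_param_set_given_cmd(params, param_cond_cmd_probs, use_geo_mean=False):
    """Probability of seeing exactly this set of params for one command.

    param_cond_cmd_probs maps each distinct param seen for the command
    to an (assumed) P(param | cmd).
    """
    prob = 1.0
    for param, p in param_cond_cmd_probs.items():
        # Present params contribute p; absent params contribute (1 - p).
        prob *= p if param in params else (1.0 - p)
    k = len(param_cond_cmd_probs)  # K for case1: number of distinct params
    if use_geo_mean and k > 0:
        prob = prob ** (1.0 / k)
    return prob


# Made-up conditional probabilities for a single command.
probs = {"Identity": 0.75, "Force": 0.25, "AuditEnabled": 0.5}

common = prob_param_set_given_cmd({"Identity"}, probs)  # 0.75 * 0.75 * 0.5
rare = prob_param_set_given_cmd({"Force"}, probs)       # 0.25 * 0.25 * 0.5
rare_geo = prob_param_set_given_cmd({"Force"}, probs, use_geo_mean=True)
```

Raising to the power of (1/K) plays the same role as in the session-level geometric mean: it makes the scores comparable between commands with few distinct params and commands with many.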
- train()
Train the model by computing counts and probabilities.
In particular, computes the counts and probabilities of the commands (and of the params and values, if they are provided).
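The idea of turning counts into probabilities can be sketched for the commands-only case. This is a minimal illustration under stated assumptions: the token strings and the `count_and_normalise` helper are hypothetical, only command-to-command transitions are counted, and no smoothing is applied, whereas a real implementation may smooth counts or handle unseen commands differently.

```python
from collections import Counter, defaultdict

# Hypothetical token names, used only in this sketch.
START, END = "##START##", "##END##"


def count_and_normalise(sessions):
    """Count command transitions and convert them to probabilities."""
    trans_counts = defaultdict(Counter)
    for session in sessions:
        seq = [START, *session, END]
        for prev, cur in zip(seq, seq[1:]):
            trans_counts[prev][cur] += 1
    trans_probs = {
        prev: {cur: n / sum(counts.values()) for cur, n in counts.items()}
        for prev, counts in trans_counts.items()
    }
    return trans_counts, trans_probs


sessions = [["A", "B"], ["A", "C"], ["A", "B"], ["B"]]
trans_counts, trans_probs = count_and_normalise(sessions)
```

In this toy data three of the four sessions start with "A", so the estimated probability of "A" following the start token is 0.75.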