msticpy.analysis package

msticpy.analysis.anomalous_sequence subpackage

Wrapper module for Model class for modelling sessions.

In particular, this module is for both modelling and visualising your session data.

msticpy.analysis.anomalous_sequence.anomalous.score_and_visualise_sessions(data: pandas.core.frame.DataFrame, session_column: str, window_length: int, time_column: str, likelihood_upper_bound: float = None, source_columns: list = None)

Model sessions and then produce an interactive timeline visualisation plot.

In particular, the sessions are modelled using a sliding window approach within a Markov model. The visualisation plot has time on the x-axis and the modelled session likelihood metric on the y-axis.

Parameters:
  • data (pd.DataFrame) – Dataframe which contains at least columns for time and sessions
  • session_column (str) –

    name of the column which contains the sessions. The values in the session column should take one of the following formats:

    1. ['Set-User', 'Set-Mailbox']
    2. [Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
    3. [Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]

    The Cmd datatype can be accessed from anomalous_sequence.utils.data_structures.Cmd

  • window_length (int) –

    length of the sliding window to use when computing the likelihood metrics for each session.

    This should be set to an integer >= 2. Note that sessions which have fewer commands than the chosen window_length + 1 will not appear in the visualisation. (The + 1 is because we append a dummy end_token to each session before starting the sliding window, so a session of length 2 would be treated as length 3.)

  • time_column (str) – name of the column which contains a timestamp
  • likelihood_upper_bound (float, optional) – an optional upper bound on the likelihood metrics for the visualisation plot. This can help to zoom in on the more anomalous sessions
  • source_columns (list, optional) – an optional list of source columns to include in the tooltips in the visualisation. Note that the content of each of these columns should be JSON serializable in order to be compatible with the figure
Returns:

Return type:

figure

msticpy.analysis.anomalous_sequence.anomalous.score_sessions(data: pandas.core.frame.DataFrame, session_column: str, window_length: int) → pandas.core.frame.DataFrame

Model sessions using a sliding window approach within a Markov model.

Parameters:
  • data (pd.DataFrame) – Dataframe which contains at least a column for sessions
  • session_column (str) –

    name of the column which contains the sessions. The values in the session column should take one of the following formats:

    1. ['Set-User', 'Set-Mailbox']
    2. [Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
    3. [Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]

    The Cmd datatype can be accessed from anomalous_sequence.utils.data_structures.Cmd

  • window_length (int) – length of the sliding window to use when computing the likelihood metrics for each session. This should be set to an integer >= 2. Note that sessions which have fewer commands than the chosen window_length + 1 will end up with a np.nan score. (The + 1 is because we append a dummy end_token to each session before starting the sliding window, so a session of length 2 would be treated as length 3.)
Returns:

Return type:

input dataframe with two additional columns appended.

msticpy.analysis.anomalous_sequence.anomalous.visualise_scored_sessions(data_with_scores: pandas.core.frame.DataFrame, time_column: str, score_column: str, window_column: str, score_upper_bound: float = None, source_columns: list = None)

Visualise the scored sessions on an interactive timeline.

Parameters:
  • data_with_scores (pd.DataFrame) – Dataframe which contains at least columns for time, session score, window representing the session
  • time_column (str) – name of the column which contains a timestamp
  • score_column (str) – name of the column which contains a numerical score for each of the sessions
  • window_column (str) – name of the column which contains a representation of each of the sessions. This representation will appear in the tooltips in the figure. For example, it could be the rarest window of the session, or the full session etc.
  • score_upper_bound (float, optional) – an optional upper bound on the score for the visualisation figure. This can help to zoom in on the more anomalous sessions
  • source_columns (list, optional) – an optional list of source columns to include in the tooltips in the visualisation. Note that the content of each of these columns should be JSON serializable in order to be compatible with the figure
Returns:

Return type:

figure

Module for Model class for modelling sessions data.

class msticpy.analysis.anomalous_sequence.model.Model(sessions: List[List[Union[str, msticpy.analysis.anomalous_sequence.utils.data_structures.Cmd]]], modellable_params: set = None)

Bases: object

Class for modelling sessions data.

Instantiate the Model class.

This Model class can be used to model sessions, where each session is a sequence of commands. We use a sliding window approach to calculate the rarest part of each session. We can view the sessions in ascending order of this metric to see if the top sessions are anomalous/malicious.

Parameters:
  • sessions (List[List[Union[str, Cmd]]]) –

    list of sessions, where each session is either a list of strings or a list of the Cmd datatype.

    The Cmd datatype should have “name” and “params” as attributes where “name” is the name of the command (string) and “params” is either a set of accompanying params or a dict of accompanying params and values.

    Example formats of a session:

    1. ['Set-User', 'Set-Mailbox']
    2. [Cmd(name='Set-User', params={'Identity', 'Force'}), Cmd(name='Set-Mailbox', params={'Identity', 'AuditEnabled'})]
    3. [Cmd(name='Set-User', params={'Identity': 'blahblah', 'Force': 'true'}), Cmd(name='Set-Mailbox', params={'Identity': 'blahblah', 'AuditEnabled': 'false'})]

  • modellable_params (set, optional) – set of params which you deem to have categorical values which are suitable for modelling. Note this argument will only have an effect if your sessions include commands, params and values. If your sessions include commands, params and values and this argument is not set, then some rough heuristics will be used to determine which params have values which are suitable for modelling.
compute_geomean_lik_of_sessions()

Compute the geometric mean of the likelihood for each of the sessions.

This is done by raising the likelihood of the session to the power of (1 / k) where k is the length of the session.

Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.
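
The length effect described in the note can be seen with a small standalone sketch. It does not call msticpy; the function name `geomean_likelihood` is hypothetical, and k is taken here to be the number of per-step probabilities multiplied together, which is an assumption about how the session length is counted.

```python
import math


def geomean_likelihood(step_probs):
    """Return (likelihood, geometric mean) for one session.

    step_probs holds the per-step probabilities (each between 0 and 1)
    that are multiplied together to form the session likelihood.
    """
    k = len(step_probs)
    likelihood = math.prod(step_probs)
    # Raising to the power 1/k undoes the shrinkage caused by session length.
    return likelihood, likelihood ** (1 / k)


# Two sessions whose steps are all equally likely, but of different lengths.
short_lik, short_geo = geomean_likelihood([0.5] * 2)
long_lik, long_geo = geomean_likelihood([0.5] * 8)

print(short_lik, long_lik)  # raw likelihoods differ widely
print(short_geo, long_geo)  # geometric means are comparable
```

The raw likelihood of the longer session is much smaller even though every step is just as probable, while the geometric means of the two sessions agree.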

compute_likelihoods_of_sessions(use_start_end_tokens: bool = True)

Compute the likelihoods for each of the sessions.

Note: If the lengths (number of commands) of the sessions vary a lot, then you may not be able to fairly compare the likelihoods between a long session and a short session. This is because longer sessions involve multiplying more numbers together which are between 0 and 1. Therefore the length of the session will be negatively correlated with the likelihoods. If you take the geometric mean of the likelihood, then you can compare the likelihoods more fairly across different session lengths.

Parameters: use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to the session respectively before the calculations are done
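
As a rough illustration of what use_start_end_tokens changes, the sketch below computes a first-order Markov likelihood for one session from hand-written probability tables. It is not the library's implementation: the function name `session_likelihood`, the token strings and the probability values are all assumptions made for the example.

```python
def session_likelihood(session, prior_probs, trans_probs,
                       use_start_end_tokens=True,
                       start_token="##START##", end_token="##END##"):
    """Likelihood of one session under a first-order Markov model (sketch)."""
    if use_start_end_tokens:
        # Wrap the session so the first and last commands are also scored
        # as transitions from start_token and to end_token.
        tokens = [start_token] + list(session) + [end_token]
        likelihood = 1.0
    else:
        # Without tokens, the first command is scored by its prior probability.
        tokens = list(session)
        likelihood = prior_probs[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        likelihood *= trans_probs[prev][cur]
    return likelihood


# Hypothetical probabilities, chosen only for illustration.
prior_probs = {"Set-User": 0.5, "Set-Mailbox": 0.5}
trans_probs = {
    "##START##": {"Set-User": 0.5, "Set-Mailbox": 0.5},
    "Set-User": {"Set-Mailbox": 0.5, "##END##": 0.5},
    "Set-Mailbox": {"Set-User": 0.25, "##END##": 0.75},
}

session = ["Set-User", "Set-Mailbox"]
with_tokens = session_likelihood(session, prior_probs, trans_probs, True)
without_tokens = session_likelihood(session, prior_probs, trans_probs, False)
print(with_tokens, without_tokens)
```

With the tokens, the likelihood also reflects how usual it is for a session to start with the first command and to end after the last one.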
compute_rarest_windows(window_len: int, use_start_end_tokens: bool = True, use_geo_mean: bool = False)

Find the rarest window and corresponding likelihood for each session.

In particular, uses a sliding window approach to find the rarest window and corresponding likelihood for that window for each session.

If we have a long session filled with benign activity except for a small window of suspicious behaviour, then this approach should be able to identify the session as anomalous. This approach should be more effective than simply taking the geometric mean of the full session likelihood. This is because the small window of suspicious behaviour might get averaged out by the majority benign behaviour in the session when using the geometric mean approach.

Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then because we will be appending self.end_token to the session, the session will be treated as a session of length k+1, therefore, we will end up with a non np.nan value.

Parameters:
  • window_len (int) – length of sliding window for likelihood calculations
  • use_start_end_tokens (bool) – if True, then start_token and end_token will be prepended and appended to each session respectively before the calculations are done
  • use_geo_mean (bool) – if True, then each of the likelihoods of the sliding windows will be raised to the power of (1/window_len)
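
The np.nan behaviour described above can be sketched without msticpy. In this simplified version, a window's likelihood is just the product of the transition probabilities inside the window, and only the end token is appended; how the library treats the start token and the first command of a window is not modelled here. The function name `rarest_window`, the token string and the probabilities are hypothetical.

```python
import math


def rarest_window(session, trans_probs, window_len,
                  use_start_end_tokens=True, end_token="##END##"):
    """Return (rarest window, its likelihood) for one session (sketch)."""
    tokens = list(session)
    if use_start_end_tokens:
        # A session of k commands is now treated as having length k + 1.
        tokens.append(end_token)
    best_window, best_lik = [], math.nan
    # The range is empty when the window is longer than the token list.
    for i in range(len(tokens) - window_len + 1):
        window = tokens[i:i + window_len]
        lik = 1.0
        for prev, cur in zip(window, window[1:]):
            lik *= trans_probs[prev][cur]
        if math.isnan(best_lik) or lik < best_lik:
            best_window, best_lik = window, lik
    return best_window, best_lik


# Hypothetical transition probabilities, for illustration only.
trans_probs = {
    "Set-User": {"Set-Mailbox": 0.5, "##END##": 0.5},
    "Set-Mailbox": {"Set-User": 0.25, "##END##": 0.75},
}

session = ["Set-User", "Set-Mailbox"]  # k = 2 commands
no_tokens = rarest_window(session, trans_probs, 3, use_start_end_tokens=False)
with_tokens = rarest_window(session, trans_probs, 3, use_start_end_tokens=True)
print(no_tokens)    # no window of length 3 fits, so the likelihood is nan
print(with_tokens)  # the end token makes one window of length 3 available
```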
compute_scores(use_start_end_tokens: bool)

Compute some likelihood based scores/metrics for each of the sessions.

In particular, computes the likelihoods and geometric mean of the likelihoods for each of the sessions. Also, uses the sliding window approach to compute the rarest window likelihoods for each of the sessions. It does this for windows of length 2 and 3.

Note that if we have a session of length k, and we use a sliding window of length k+1, then we will end up with np.nan for the rarest window likelihood metric for that session. However, if use_start_end_tokens is set to True, then because we will be appending self.end_token to the session, the session will be treated as a session of length k+1, therefore, we will end up with a non np.nan value for that session.

Parameters: use_start_end_tokens (bool) – if True, then self.start_token and self.end_token will be prepended and appended to each of the sessions respectively before the calculations are done.
compute_setof_params_cond_cmd(use_geo_mean: bool)

Compute likelihood of combinations of params conditional on the cmd.

In particular, go through each command from each session and compute the probability of that set of params (and values if provided) appearing conditional on the command.

This can help us to identify unlikely combinations of params (and values if provided) for each distinct command.

Note, this method is only available if each session is a list of the Cmd datatype. It will result in an Exception if you try and use it when each session is a list of strings.

Parameters: use_geo_mean (bool) –

if True, then the probabilities will be raised to the power of (1/K).

Case 1: we have only params. Then K is the number of distinct params which appeared for the given cmd across all the sessions.

Case 2: we have params and values. Then K is the number of distinct params which appeared for the given cmd across all the sessions, plus the number of values which we included in the modelling for this cmd.
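
One plausible reading of the params-only case is sketched below. It assumes each param is treated as an independent present/absent indicator given the command, which is an assumption of this example rather than something stated here; the function name `prob_params_given_cmd` and the probability values are likewise hypothetical.

```python
def prob_params_given_cmd(params, param_probs_for_cmd, use_geo_mean=False):
    """Probability of seeing exactly this set of params for one cmd (sketch).

    param_probs_for_cmd maps every distinct param seen for the cmd to the
    probability that the param accompanies the cmd.
    """
    likelihood = 1.0
    for param, prob in param_probs_for_cmd.items():
        # Present params contribute prob, absent params contribute 1 - prob.
        likelihood *= prob if param in params else (1.0 - prob)
    if use_geo_mean:
        # K is the number of distinct params seen for this cmd.
        k = len(param_probs_for_cmd)
        likelihood **= 1 / k
    return likelihood


# Hypothetical conditional probabilities for a single command.
param_probs = {"Identity": 0.9, "Force": 0.1}

common = prob_params_given_cmd({"Identity"}, param_probs)
rare = prob_params_given_cmd({"Force"}, param_probs)
rare_geo = prob_params_given_cmd({"Force"}, param_probs, use_geo_mean=True)
print(common, rare, rare_geo)
```

An unusual combination, such as Force without Identity in this toy table, receives a much lower probability than the common one.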
train()

Train the model by computing counts and probabilities.

In particular, computes the counts and probabilities of the commands (and possibly the params if provided, and possibly the values if provided)
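
For the commands-only case, the counting step can be pictured with the following minimal sketch. It uses plain relative frequencies with no smoothing or handling of unseen commands, which the real model may well add; the function name `train_counts` and the token strings are hypothetical.

```python
from collections import Counter, defaultdict


def train_counts(sessions, start_token="##START##", end_token="##END##"):
    """Estimate prior and transition probabilities from sessions (sketch)."""
    cmd_counts = Counter()
    trans_counts = defaultdict(Counter)
    for session in sessions:
        tokens = [start_token] + list(session) + [end_token]
        cmd_counts.update(tokens)
        # Count each adjacent pair (previous command, next command).
        for prev, cur in zip(tokens, tokens[1:]):
            trans_counts[prev][cur] += 1
    total = sum(cmd_counts.values())
    prior_probs = {cmd: n / total for cmd, n in cmd_counts.items()}
    trans_probs = {
        prev: {cur: n / sum(nxt.values()) for cur, n in nxt.items()}
        for prev, nxt in trans_counts.items()
    }
    return prior_probs, trans_probs


sessions = [["Set-User", "Set-Mailbox"], ["Set-User", "Set-User"]]
prior_probs, trans_probs = train_counts(sessions)
print(trans_probs["Set-User"])
```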

class msticpy.analysis.anomalous_sequence.model.SessionType

Bases: object

Class for storing the types of accepted sessions.

cmds_only = 'cmds_only'
cmds_params_only = 'cmds_params_only'
cmds_params_values = 'cmds_params_values'
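
The three values correspond to the three accepted session formats. A sketch of how a session could be mapped to one of them is shown below; the `Cmd` namedtuple is only a stand-in for the library's Cmd datatype, and `infer_session_type` is a hypothetical helper, not part of msticpy.

```python
from collections import namedtuple

# Stand-in for the library's Cmd datatype (assumption for this example).
Cmd = namedtuple("Cmd", ["name", "params"])


def infer_session_type(session):
    """Classify a non-empty session by inspecting its first element (sketch)."""
    first = session[0]
    if isinstance(first, str):
        return "cmds_only"
    if isinstance(first.params, dict):
        # A dict maps each param to its value.
        return "cmds_params_values"
    # Otherwise params is a set of param names without values.
    return "cmds_params_only"


type_1 = infer_session_type(["Set-User", "Set-Mailbox"])
type_2 = infer_session_type([Cmd("Set-User", {"Identity", "Force"})])
type_3 = infer_session_type(
    [Cmd("Set-User", {"Identity": "blahblah", "Force": "true"})]
)
print(type_1, type_2, type_3)
```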

Module for creating sessions out of raw data.

msticpy.analysis.anomalous_sequence.sessionize.create_session_col(data: pandas.core.frame.DataFrame, user_identifier_cols: List[str], time_col: str, max_session_time_mins: int, max_event_separation_mins: int) → pandas.core.frame.DataFrame

Create a “session_ind” column in the dataframe.

In particular, the session_ind column will be incremented each time a new session starts.

Parameters:
  • data (pd.DataFrame) – This dataframe should contain at least the following columns: a time stamp column; columns related to user name and/or computer name and/or ip address etc
  • user_identifier_cols (List[str]) – Name of the columns which contain username and/or computer name and/or ip address etc. Each time the value of one of these columns changes, a new session will be started.
  • time_col (str) – Name of the column which contains a time stamp. If this column is not already in datetime64[ns, UTC] format, it will be cast to it.
  • max_session_time_mins (int) – The maximum length of a session in minutes. If a sequence of events for the same user_identifier_cols values exceeds this length, then a new session will be started.
  • max_event_separation_mins (int) – The maximum length in minutes between two events in a session. If we have 2 events for the same user_identifier_cols values, and if those two events are more than max_event_separation_mins apart, then a new session will be started.
Returns:

Return type:

pd.DataFrame with an additional “session_ind” column
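
The three rules for starting a new session (identifier change, maximum event separation, maximum session length) can be sketched with the standard library alone. This is not the library's implementation: the function name `assign_session_ind` is hypothetical, events are assumed to be pre-sorted by user and time, and starting the index at 0 is an assumption.

```python
from datetime import datetime, timedelta


def assign_session_ind(events, max_session_time_mins, max_event_separation_mins):
    """Assign a session index to each (user, timestamp) event (sketch).

    events must already be sorted by user and then by timestamp.
    """
    max_len = timedelta(minutes=max_session_time_mins)
    max_gap = timedelta(minutes=max_event_separation_mins)
    session_inds = []
    session_ind = 0
    prev_user = prev_time = session_start = None
    for user, ts in events:
        new_session = (
            prev_user is None                # very first event
            or user != prev_user             # identifier value changed
            or ts - prev_time > max_gap      # events too far apart
            or ts - session_start > max_len  # session has run too long
        )
        if new_session:
            if prev_user is not None:
                session_ind += 1
            session_start = ts
        session_inds.append(session_ind)
        prev_user, prev_time = user, ts
    return session_inds


t0 = datetime(2020, 1, 1, 9, 0)
events = [
    ("alice", t0),
    ("alice", t0 + timedelta(minutes=5)),
    ("alice", t0 + timedelta(minutes=60)),  # 55 minute gap
    ("bob", t0 + timedelta(minutes=61)),    # different user
    ("bob", t0 + timedelta(minutes=70)),
    ("bob", t0 + timedelta(minutes=85)),
    ("bob", t0 + timedelta(minutes=100)),   # 39 minutes after session start
]
session_inds = assign_session_ind(events, 30, 20)
print(session_inds)
```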

msticpy.analysis.anomalous_sequence.sessionize.sessionize_data(data: pandas.core.frame.DataFrame, user_identifier_cols: List[str], time_col: str, max_session_time_mins: int, max_event_separation_mins: int, event_col: str) → pandas.core.frame.DataFrame

Sessionize the input data.

In particular, the resulting dataframe will have 1 row per session. It will contain the following columns: the user_identifier_cols, <time_col>_min, <time_col>_max, <event_col>_list, duration (<time_col>_max - <time_col>_min), number_events (length of the <event_col>_list value)

Parameters:
  • data (pd.DataFrame) – This dataframe should contain at least the following columns: a time stamp column; columns related to user name and/or computer name and/or ip address etc; a column containing an event
  • user_identifier_cols (List[str]) – Name of the columns which contain username and/or computer name and/or ip address etc. Each time the value of one of these columns changes, a new session will be started.
  • time_col (str) – Name of the column which contains a time stamp. If this column is not already in datetime64[ns, UTC] format, it will be cast to it.
  • max_session_time_mins (int) – The maximum length of a session in minutes. If a sequence of events for the same user_identifier_cols values exceeds this length, then a new session will be started.
  • max_event_separation_mins (int) – The maximum length in minutes between two events in a session. If we have 2 events for the same user_identifier_cols values, and if those two events are more than max_event_separation_mins apart, then a new session will be started.
  • event_col (str) – Name of the column which contains the event of interest. For example, if we are interested in sessionizing exchange admin commands, the “event_col” could contain values like: “Set-Mailbox” or “Set-User” etc.
Returns:

Return type:

pd.DataFrame containing the sessionized data. 1 row per session.
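
The one-row-per-session output can be pictured with the sketch below, which collapses events that already carry a session index into per-session records. It uses plain dictionaries rather than a DataFrame; the function name `aggregate_sessions` and the key names are hypothetical stand-ins for the <time_col>_min, <time_col>_max and <event_col>_list columns.

```python
from datetime import datetime, timedelta


def aggregate_sessions(rows):
    """Collapse event rows into one record per session_ind (sketch)."""
    sessions = {}
    for row in rows:
        sess = sessions.setdefault(row["session_ind"], {
            "user": row["user"],
            "time_min": row["time"],
            "time_max": row["time"],
            "event_list": [],
        })
        sess["time_min"] = min(sess["time_min"], row["time"])
        sess["time_max"] = max(sess["time_max"], row["time"])
        sess["event_list"].append(row["event"])
    for sess in sessions.values():
        # Derived columns: duration and number of events in the session.
        sess["duration"] = sess["time_max"] - sess["time_min"]
        sess["number_events"] = len(sess["event_list"])
    return list(sessions.values())


t0 = datetime(2020, 1, 1, 9, 0)
rows = [
    {"user": "alice", "time": t0, "event": "Set-User", "session_ind": 0},
    {"user": "alice", "time": t0 + timedelta(minutes=5),
     "event": "Set-Mailbox", "session_ind": 0},
    {"user": "bob", "time": t0 + timedelta(minutes=61),
     "event": "Set-User", "session_ind": 1},
]
sessionized = aggregate_sessions(rows)
print(sessionized[0]["event_list"], sessionized[0]["duration"])
```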

msticpy.analysis.timeseries

Module for timeseries analysis functions.

msticpy.analysis.timeseries.timeseries_anomalies_stl(data: pandas.core.frame.DataFrame, **kwargs) → pandas.core.frame.DataFrame

Discover anomalies in Timeseries data using STL (Seasonal-Trend Decomposition using LOESS).

Parameters:

data (pd.DataFrame) – DataFrame as a time series data set retrieved from a data connector or external data source. The DataFrame must have 2 columns, with the time column set as the index and the other column containing a numeric value.

Other Parameters:
 
  • seasonal (int, optional) – Seasonality period of the input data required for STL. Must be an odd integer, and should normally be >= 7 (default).
  • period (int, optional) – Periodicity of the input data, by default 24 (hourly).
  • score_threshold (float, optional) – standard deviation threshold value calculated using Z-score used to flag anomalies, by default 3
Returns:

Returns a dataframe with additional columns produced by decomposing the time series data: residual, trend, seasonal, weights, baseline, score and anomalies. The anomalies column will contain the values 0, 1 or -1, depending on the score_threshold that was set.

Return type:

pd.DataFrame
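
The role of score_threshold can be sketched on a list of residuals, leaving the STL decomposition itself aside. The mapping used here (1 for a Z-score above the threshold, -1 for a Z-score below its negative, 0 otherwise) is an assumption made for illustration, and the library's exact rule may differ; `flag_anomalies` is a hypothetical name.

```python
import statistics


def flag_anomalies(residuals, score_threshold=3.0):
    """Return (Z-scores, flags) for a list of residuals (sketch).

    Assumes the residuals are not all identical, otherwise the
    standard deviation is zero and the Z-score is undefined.
    """
    mean = statistics.fmean(residuals)
    std = statistics.pstdev(residuals)
    scores = [(r - mean) / std for r in residuals]
    flags = [
        1 if s > score_threshold else -1 if s < -score_threshold else 0
        for s in scores
    ]
    return scores, flags


# Hypothetical residuals: flat apart from one spike and one dip.
residuals = [0.0] * 30
residuals[5] = 10.0
residuals[20] = -10.0

scores, flags = flag_anomalies(residuals, score_threshold=3.0)
print(flags[5], flags[20], flags[0])
```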