Writing and Contributing a Data Provider

See Query Providers Usage (common to all data sources) for more details on use of data providers.

A data provider lets you query data from a notebook in a standardized way. Before reading further you should familiarize yourself with how the data providers work from the Querying and Importing Data section of the MSTICPy documentation.

The term provider describes a concept rather than a single piece of code. There are several components to a provider, the most important of which is a driver. The driver class encapsulates the following functionality:

  • Authentication to the service (usually from configuration values but can also support specifying parameters such as passwords at run time).

  • Querying data from the service - queries can be either:

    • ad hoc queries as strings

    • templated queries allowing substitutable parameters for common items such as time range, account and host names, etc.

  • Returning the data as a pandas DataFrame. The driver is responsible for converting data types if needed. This is particularly important for datetime data that is usually returned as a string. Most MSTICPy functionality expects datetime in a timezone-aware pandas Timestamp.

Implementing a data provider

To implement a data provider you need to do the following:

  1. Write the driver

  2. Customize the driver (optional)

  3. Register the driver

  4. Add queries

  5. Add settings definition

  6. Create documentation

  7. Create unit tests

1. Write the driver class

This must be derived from DriverBase (DriverBase source). You should implement the following methods:

  • __init__

  • connect

  • query

  • query_with_results (optional)

Also see 2. Customize the driver below.

__init__

See DriverBase.__init__

This initializes your driver with anything it needs to load. It should call super().__init__(**kwargs).

Keyword arguments are passed from the QueryProvider class when it is initialized with your provider name. These kwargs will always include data_environment - the name of your provider (see DataEnvironment) and may include the bool debug, which you can use to output optional debug information. Any other kwargs from QueryProvider are passed to your driver class.

At minimum you should set the instance attribute self._loaded to True when your driver __init__ completes successfully.

connect

See DriverBase.connect

This method is called from QueryProvider.connect and is used to authenticate to the data service. It takes an optional connect_str parameter and a kwargs keyword argument dictionary.

Any per-connection configuration settings can be read in here using the DriverBase._get_config_settings(ProviderName) method. This returns the args section of your configuration settings from msticpyconfig.yaml.

Some existing drivers use an API key to authenticate, some use name/password and others use Azure Active Directory (AAD). See KqlDriver (KqlDriver source) for an example of the latter.

On successful authentication, set self._connected to True. On failure, you can raise a MsticpyConnectionError that explains to the user why the connection failed. See SplunkDriver for an example.
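The __init__ and connect requirements described above can be sketched as follows. This is a minimal illustration, not MSTICPy code: DriverBase and MsticpyConnectionError are reduced stand-ins defined here only so the example runs without MSTICPy installed, and DemoDriver and its api_key parameter are hypothetical. A real driver would import both classes from MSTICPy and would normally read its settings with _get_config_settings.

```python
from typing import Any, Optional


class MsticpyConnectionError(Exception):
    """Stand-in for the MSTICPy exception of the same name."""


class DriverBase:
    """Stand-in for MSTICPy's DriverBase, reduced to what this sketch needs."""

    def __init__(self, **kwargs: Any) -> None:
        self._kwargs = kwargs
        self._loaded = False
        self._connected = False


class DemoDriver(DriverBase):
    """Hypothetical driver for a service that authenticates with an API key."""

    def __init__(self, **kwargs: Any) -> None:
        super().__init__(**kwargs)
        # "debug" is optional; "data_environment" is always supplied
        self._debug = kwargs.get("debug", False)
        self._api_key: Optional[str] = None
        self._loaded = True  # signals that initialization completed

    def connect(self, connect_str: Optional[str] = None, **kwargs: Any) -> None:
        # Assumption: the key arrives as a run-time parameter or connect_str.
        # A real driver would fall back to its msticpyconfig.yaml settings.
        api_key = kwargs.get("api_key") or connect_str
        if not api_key:
            raise MsticpyConnectionError(
                "No API key supplied. Pass api_key=... or add it to your settings."
            )
        self._api_key = api_key
        self._connected = True


driver = DemoDriver(data_environment="Demo", debug=True)
driver.connect(api_key="not-a-real-key")
print(driver._loaded, driver._connected)
```

Raising a descriptive error rather than returning silently gives the notebook user something actionable when authentication fails.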

query

See DriverBase.query

This takes the following parameters:

  • query - string of query text

  • query_source - this is populated if the query is a MSTICPy template query read from a query yaml file (see Creating new queries) and is an instance of QuerySource. This is a representation of the yaml query with extracted parameters and metadata available as explicit attributes

  • kwargs - any other keyword arguments passed when running the query that are not consumed as query parameters, etc.

This method should submit the query to the service and handle the returned data. The data should be returned as a pandas DataFrame.

Note

You should convert data types to their expected format. For example, dates and numeric values are often returned as strings. It is particularly important to convert datetime values. MSTICPy expects datetime values to be in pandas Timestamp format and timezone-aware (usually UTC, but this is not mandatory).

In case of a query failure, it can return the failure response instead of a DataFrame.
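As an illustration of the type conversion described in the note, the sketch below converts string columns with pandas. The column names (TimeGenerated, EventCount) and the helper name convert_types are hypothetical; a real driver would apply this to whichever columns its service returns.

```python
import pandas as pd


def convert_types(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with datetime and numeric columns converted from strings."""
    data = raw.copy()
    # utc=True produces timezone-aware (UTC) pandas Timestamps
    data["TimeGenerated"] = pd.to_datetime(data["TimeGenerated"], utc=True)
    data["EventCount"] = pd.to_numeric(data["EventCount"])
    return data


# Simulated service response in which every value is a string
raw = pd.DataFrame(
    {
        "TimeGenerated": ["2023-01-15T08:30:00Z", "2023-01-16T09:45:00Z"],
        "EventCount": ["3", "17"],
    }
)
converted = convert_types(raw)
print(converted.dtypes)
```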

query_with_results

See DriverBase.query_with_results

Implementing this is optional. Use it if you need to return the raw response as well as the data in DataFrame format. However, this method isn’t exposed in the data provider framework, so it is mainly useful for experimentation and debugging. The query method can call this method to avoid duplicating code.

If you do not implement any logic for this, you must still define a query_with_results method in your class that returns None, None.
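One way to arrange query and query_with_results so that the logic lives in one place is sketched below. The class is a hypothetical fragment: _call_service and the shape of its response are invented for illustration, and a real driver would derive from DriverBase and call its actual service.

```python
from typing import Any, Dict, Tuple, Union

import pandas as pd


class DemoDriver:
    """Hypothetical driver fragment; a real driver derives from DriverBase."""

    def _call_service(self, query: str) -> Dict[str, Any]:
        # Placeholder for the real service call; returns a canned response
        if not query:
            return {"status": "error", "message": "empty query"}
        return {"status": "ok", "rows": [{"host": "host1", "count": 2}]}

    def query_with_results(self, query: str, **kwargs) -> Tuple[pd.DataFrame, Any]:
        """Return the data as a DataFrame together with the raw response."""
        response = self._call_service(query)
        data = pd.DataFrame(response.get("rows", []))
        return data, response

    def query(
        self, query: str, query_source: Any = None, **kwargs
    ) -> Union[pd.DataFrame, Any]:
        """Return a DataFrame, or the failure response if the query failed."""
        data, response = self.query_with_results(query, **kwargs)
        return data if response.get("status") == "ok" else response


driver = DemoDriver()
print(driver.query("search *"))
print(driver.query(""))
```

A driver that has no use for the raw response can instead define query_with_results with a body of just return None, None and keep all of its logic in query.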

2. Customize the driver

This section is optional but is needed for many providers.

Exposing attributes via the QueryProvider

QueryProvider is a facade class for the driver classes. The user interacts with the former but not directly with the latter.

If you want to expose an attribute from the driver class as an attribute of query provider you can do the following:

  • implement the attribute that you want to expose in the driver (this can be a method or other type)

  • set self.public_attribs to a Python dictionary of { name: value } where name is the name of the attribute you want to appear and value is the value of the attribute supplied by the driver, as shown in the example below.

self.public_attribs = {
    "client": self.service,
    "saved_searches": self._saved_searches,
    "fired_alerts": self._fired_alerts,
}

Custom parameter formatting

The formats for dates and lists differ between query languages. The driver can implement a custom formatter to render datetime or list parameters into the correct format before they are substituted into the query string.

Datetime formatter functions should take a Python datetime and return a string. List formatter functions should take an Iterable and return a string.

# Parameter Formatting methods
@staticmethod
def _format_datetime(date_time: datetime) -> str:
    """Return datetime-formatted string."""
    return f'"{date_time.isoformat(sep=" ")}"'

@staticmethod
def _format_list(param_list: Iterable[Any]) -> str:
    """Return formatted list parameter."""
    fmt_list = [f'"{item}"' for item in param_list]
    return ",".join(fmt_list)

You must register these functions in the driver __init__ method as follows:

self.formatters = {
    Formatters.DATETIME: self._format_datetime,
    Formatters.LIST: self._format_list,
}

See SplunkDriver (SplunkDriver source) for an example.
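To see what the two formatters shown above produce, the sketch below reproduces them as module-level functions (without the leading underscore, purely so they can be called directly) and prints the strings that would be substituted into a query. The sample values are made up.

```python
from datetime import datetime, timezone
from typing import Any, Iterable


def format_datetime(date_time: datetime) -> str:
    """Return the datetime as a double-quoted ISO string with a space separator."""
    return f'"{date_time.isoformat(sep=" ")}"'


def format_list(param_list: Iterable[Any]) -> str:
    """Return the items as a comma-separated list of double-quoted strings."""
    return ",".join(f'"{item}"' for item in param_list)


start = datetime(2023, 1, 15, 8, 30, tzinfo=timezone.utc)
print(format_datetime(start))              # "2023-01-15 08:30:00+00:00"
print(format_list(["host1", "host2"]))     # "host1","host2"
```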

Customizing the query parameter substitution

MSTICPy uses the Python str.format method to substitute named parameters. Here is an example query in a query yaml file:

  sources:
      list_files:
          description: Lists all file events by filename
          metadata:
          args:
          query: '
              {table}
              | where Timestamp >= datetime({start})
              | where Timestamp <= datetime({end})
              | where FileName has "{file_name}"
              {add_query_items}'

Each value surrounded by braces is considered to be a substitutable parameter name. If you need to include explicit brace characters in the string you can escape the substitution using double braces sequences: {{ and }}. These get converted to single braces by str.format().
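The behavior of str.format with named parameters and escaped braces can be checked with a few lines of Python. The query text and parameter values here are illustrative only.

```python
# Named parameters are replaced; doubled braces become literal single braces
query_template = (
    "{table} "
    "| where Timestamp >= datetime({start}) "
    '| where FileName has "{file_name}" '
    '| extend Note = "{{literal braces}}"'
)

query = query_template.format(
    table="DeviceFileEvents",
    start="2023-01-15 08:30:00",
    file_name="cmd.exe",
)
print(query)
```

The printed query contains `{literal braces}` with single braces, while every named parameter has been replaced by its value.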

While this works well for most query languages, in some cases (like queries expressed as JSON strings), replacing all braces with escaped double-braces is onerous. In this case you can opt to do the parameter substitution in the driver itself. To do this implement a method that expects two parameters:

  • query - the raw query string from the yaml file

  • param_dict - a dictionary of parameter name, parameter value

The param_dict values will already have been formatted into a suitable string format using any methods you specified in Custom parameter formatting. Substitute the parameter values into the raw query string and return the query string. The query string will be passed to your driver’s query method.

You need to register the parameter substitution function in your driver’s __init__ method:

self.formatters = {
    Formatters.PARAM_HANDLER: self._custom_param_handler,
    Formatters.DATETIME: self._format_datetime,
    Formatters.LIST: self._format_list,
}
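A parameter handler for a JSON-style query might look like the sketch below. It is one possible approach, assuming that parameters appear in the raw query as {name} tokens and that a plain text replacement is sufficient; the function name, query and parameter values are hypothetical. In a driver this would be a method such as _custom_param_handler.

```python
import json
from typing import Dict


def custom_param_handler(query: str, param_dict: Dict[str, str]) -> str:
    """Replace each {name} token with its already-formatted value."""
    for name, value in param_dict.items():
        query = query.replace("{" + name + "}", str(value))
    return query


# The JSON braces stay unescaped; only {start} and {end} are parameters
raw_query = '{"range": {"@timestamp": {"gte": {start}, "lte": {end}}}}'
params = {
    "start": '"2023-01-15T00:00:00Z"',
    "end": '"2023-01-16T00:00:00Z"',
}

final_query = custom_param_handler(raw_query, params)
print(json.loads(final_query))
```

Because only known parameter names are replaced, the structural braces of the JSON document pass through untouched.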

3. Register the driver

There are two updates to classes that you need to make to register your driver.

Add the provider as a DataEnvironment

In the enum DataEnvironment (DataEnvironments source) add an entry for your provider using the next available enum value.

  @export
  class DataEnvironment(Enum):
      """
      Enumeration of data environments.

      Used to identify which queries are relevant for which
      data sources.
      """

      Unknown = 0
      AzureSentinel = 1  # alias of LogAnalytics
      LogAnalytics = 1
      MSSentinel = 1
      Kusto = 2
      ...
      ResourceGraph = 9
      Sumologic = 10
      M365D = 11
      Cybereason = 12
      Elastic = 14
      YourProvider = 15

You can also add aliases by re-using the same value (see the MSSentinel, AzureSentinel and LogAnalytics entries).
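The aliasing behavior comes from Python's standard Enum class and can be tried with a trimmed-down copy of the enumeration. DemoEnvironment is a hypothetical stand-in, not the real DataEnvironment class.

```python
from enum import Enum


class DemoEnvironment(Enum):
    """Trimmed, hypothetical copy of DataEnvironment for illustration."""

    Unknown = 0
    AzureSentinel = 1
    LogAnalytics = 1  # same value, so an alias of the first name defined
    MSSentinel = 1
    Kusto = 2


# All three names resolve to the same member
print(DemoEnvironment.MSSentinel is DemoEnvironment.AzureSentinel)  # True
print(DemoEnvironment["LogAnalytics"].name)                         # AzureSentinel
# Aliases are skipped when iterating over the enum
print([member.name for member in DemoEnvironment])
```

Note that Python treats the first name defined with a given value as the canonical member, so lookups by any alias report that first name.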

Add an entry to the driver dynamic load table

In the __init__.py module of the data drivers sub-package (drivers sub-package __init__ source), add an entry for your driver to the dynamic load table.

4. Add queries

Create a folder in msticpy/data/queries with the name of your DataEnvironment and add queries. The folder name must match the item that you added to the DataEnvironment Enum class in step 3 above. For more details on creating queries, see Creating new queries.

Query parameter names

While you can choose whatever parameter names you like for your queries, certain functionality in MSTICPy (e.g. Pivot functions) will use standardized names to add additional functionality. For example, all queries with the host_name parameter are automatically added as enrichment functions to the Host entity.

This is a list of commonly used parameter names:

  Parameter name      Use
  ----------------    --------------------------
  start               Query start time
  end                 Query end time
  account_name        User account name
  commandline         Process command line
  domain              DNS domain name
  file_hash           File hash string
  host_name           Host name (FQDN or simple)
  ip_address          Dotted IP address string
  logon_session_id    User logon session
  process_id          Process ID
  process_name        Process or file name
  resource_id         Azure resource ID
  url                 URL

5. Add settings definition

MSTICPy’s settings editor uses configuration from a YAML file to create UI settings. This allows users to edit settings interactively.

Define whatever settings you need as sub-keys of the Args key:

DataProviders:
  MicrosoftDefender:
    Args:
      ClientId: str(format=uuid)
      TenantId: str(format=uuid)
      # [SuppressMessage("Microsoft.Security", "CS002:SecretInNextLine", Justification="Test code")]
      ClientSecret: *cred_key

Use the examples and documentation in mpconfig_defaults.yaml to specify your settings.

The special value *cred_key is a YAML macro and used where you need to store a secret of some kind. Items of this type allow the user to store the value in an environment variable or as an Azure Key Vault secret rather than in the msticpyconfig file.

6. Add provider documentation

A data provider should have documentation describing its configuration and use. This should be written in reStructuredText so that Sphinx can generate document pages from it.

See the examples Splunk Provider and Sumologic Provider.

7. Create driver unit tests

Please add a unit test using mocks to simulate the service responses. Code coverage should be at least 80%.

Do not add unit tests that call the live service as part of the normal test run. You can include such tests, but you must mark them to be skipped during normal unit test runs.

See the examples in MSTICPy data drivers unit tests
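The pattern of mocking the service and skipping live tests can be sketched with the standard library's unittest module. Everything here is hypothetical: DemoDriver, its client.search call and the DEMO_LIVE_TESTS environment variable are invented for the example, and the driver returns plain rows rather than a DataFrame only to keep the sketch dependency-free.

```python
import os
import unittest
from unittest.mock import MagicMock


class DemoDriver:
    """Hypothetical driver that wraps a service client."""

    def __init__(self, client):
        self._client = client

    def query(self, query: str):
        response = self._client.search(query)
        return response["rows"]


class TestDemoDriver(unittest.TestCase):
    def test_query_uses_mocked_service(self):
        # The mock simulates the service response; no network call is made
        client = MagicMock()
        client.search.return_value = {"rows": [{"host": "host1"}]}
        driver = DemoDriver(client)

        self.assertEqual(driver.query("search *"), [{"host": "host1"}])
        client.search.assert_called_once_with("search *")

    @unittest.skipUnless(
        os.environ.get("DEMO_LIVE_TESTS"), "live service tests are disabled"
    )
    def test_live_query(self):
        """Would connect to the real service; skipped in normal test runs."""


if __name__ == "__main__":
    unittest.main()
```

Gating the live test on an environment variable keeps it in the repository for manual runs while ensuring it is skipped by default.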