API

class anamod.core.model_analyzer.ModelAnalyzer(model, data, targets, **kwargs)[source]

Analyzes properties of learned models.

Required parameters:

model: object

A model object providing a ‘predict’ function that returns the model’s predictions on input data, i.e., predictions = model.predict(data)

For instance, this may be a simple wrapper around a scikit-learn or Tensorflow model.

data: 2D numpy array

Test data matrix of instances x features.

targets: 1D numpy array

A vector containing targets for each instance in the test data.
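
For illustration, the following minimal sketch wraps a scikit-learn regressor and runs the analysis on synthetic data. The ModelWrapper class and the synthetic data are illustrative, not part of anamod; any object exposing predict(data) works.

    import numpy as np
    from sklearn.linear_model import Ridge
    from anamod.core.model_analyzer import ModelAnalyzer

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 5))  # instances x features
    targets = data @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(size=100)

    class ModelWrapper:
        """Illustrative wrapper exposing the 'predict' interface anamod expects."""
        def __init__(self, estimator):
            self.estimator = estimator

        def predict(self, data):
            return self.estimator.predict(data)

    # Ideally the model is fit on separate training data and analyzed on held-out test data
    model = ModelWrapper(Ridge().fit(data, targets))
    analyzer = ModelAnalyzer(model, data, targets)
    features = analyzer.analyze()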

Common optional parameters:

output_dir: str, default: ‘anamod_outputs’

Directory to write logs, intermediate files, and outputs to.

num_permutations: int, default: 50

Number of permutations to perform in permutation test.

permutation_test_statistic: str, choices: {‘mean_loss’, ‘mean_log_loss’, ‘median_loss’, ‘relative_mean_loss’, ‘sign_loss’}, default: ‘mean_loss’

Test statistic to use for computing empirical p-values.

feature_names: list of strings, default: None

List of names to assign to features.

If None, features will be identified using their indices as names.

If feature_hierarchy is provided, names from the hierarchy will be used instead.

feature_hierarchy: anytree.Node object, default: None

Hierarchy over features, defined as an anytree Node or a JSON file. anytree allows importing trees from multiple formats (Python dict, JSON).

If no hierarchy is provided, a flat hierarchy will be auto-generated over base features.

Supersedes feature_names as the source of feature names.
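
A hierarchy can be constructed directly with anytree, as in the minimal sketch below. The group and feature names are illustrative, and the details of how leaf nodes are associated with data columns follow anamod’s own conventions; consult the package documentation rather than this sketch.

    from anytree import Node
    from anamod.core.model_analyzer import ModelAnalyzer

    root = Node("all_features")
    demographics = Node("demographics", parent=root)
    Node("age", parent=demographics)
    Node("income", parent=demographics)
    vitals = Node("vitals", parent=root)
    Node("heart_rate", parent=vitals)

    # model, data, targets as in the earlier sketch
    analyzer = ModelAnalyzer(model, data, targets, feature_hierarchy=root)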

visualize: bool, default: True

Flag to control output visualization.

seed: int, default: 13997

Seed for random number generator (used to order features to be analyzed).

loss_function: str, choices: {‘quadratic_loss’, ‘binary_cross_entropy’, ‘zero_one_loss’, None, ‘absolute_difference_loss’}, default: None

Loss function to apply to model outputs. If no loss function is specified, then quadratic loss is chosen for continuous targets and binary cross-entropy is chosen for binary targets.

importance_significance_level: float, default: 0.1

Significance level and FDR control level used for hypothesis testing to assess feature importance.

compile_results_only: bool, default: False

Flag to compile results only (assuming they already exist), skipping the launch of analysis jobs.

HTCondor parameters:

condor: bool, default: False

Flag to enable parallelization using HTCondor. Requires PyPI package htcondor to be installed.

shared_filesystem: bool, default: False

Flag to indicate a shared filesystem, making file/software transfer unnecessary for running condor.

cleanup: bool, default: True

Remove intermediate condor files upon completion. Enabled by default to reduce space usage and clutter; disable to retain intermediate files for debugging.

features_per_worker: int, default: 1

Number of features to test per condor job. Using fewer features per job reduces per-job load at the cost of more jobs. TODO: If none provided, this will be chosen automatically to create up to 100 jobs.

memory_requirement: int, default: 8

Memory requirement in GB.

disk_requirement: int, default: 8

Disk requirement in GB.

model_loader_filename: str, default: None

Python script that provides functions to load/save the model. Required for condor, since each job runs in its own environment. If none is provided, cloudpickle will be used; see model_loader for a template.
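
Since the exact interface is defined by the model_loader template, the sketch below shows only a hypothetical cloudpickle-based load/save pair; the function names and signatures anamod actually expects should be taken from that template.

    import cloudpickle

    def save_model(model, filename):  # hypothetical name; follow the model_loader template
        with open(filename, "wb") as model_file:
            cloudpickle.dump(model, model_file)

    def load_model(filename):  # hypothetical name; follow the model_loader template
        with open(filename, "rb") as model_file:
            return cloudpickle.load(model_file)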

avoid_bad_hosts: bool, default: False

Avoid condor hosts that intermittently cause issues. Enable to reduce the likelihood of failures at the cost of increased runtime. List of hosts: [‘mastodon-5.biostat.wisc.edu’, ‘qlu-1.biostat.wisc.edu’, ‘e269.chtc.wisc.edu’, ‘e1039.chtc.wisc.edu’, ‘chief.biostat.wisc.edu’, ‘mammoth-1.biostat.wisc.edu’, ‘nebula-7.biostat.wisc.edu’, ‘mastodon-1.biostat.wisc.edu’]

retry_arbitrary_failures: bool, default: False

Retry jobs that fail for any reason, up to a maximum of 50 attempts per job. Use with caution; enable only if failures stem from condor issues.

analyze()[source]

Performs feature importance analysis of the model and returns feature objects.

In addition, writes out:

  • a table summarizing feature importance: <output_dir>/feature_importance.csv

  • a visualization of the feature importance hierarchy: <output_dir>/feature_importance_hierarchy.png

Returns

features – List of feature objects with feature importance attributes:

  • feature.important: flag to indicate whether the feature is important

  • feature.importance_score: degree of importance

  • feature.pvalue: p-value for importance test

Return type

list <feature object>
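
The returned objects can be post-processed directly. This sketch uses only the attributes documented above:

    # analyzer as constructed earlier
    features = analyzer.analyze()
    important = [feature for feature in features if feature.important]
    for feature in sorted(important, key=lambda f: f.importance_score, reverse=True):
        print(f"score={feature.importance_score:.3f}, p-value={feature.pvalue:.3g}")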

class anamod.core.model_analyzer.TemporalModelAnalyzer(model, data, targets, **kwargs)[source]

Analyzes properties of learned temporal models.

Required parameters:

model: object

A model object providing a ‘predict’ function that returns the model’s predictions on input data, i.e., predictions = model.predict(data)

For instance, this may be a simple wrapper around a scikit-learn or Tensorflow model.

data: 3D numpy array

Test data tensor of instances x features x timesteps.

targets: 1D numpy array

A vector containing targets for each instance in the test data.
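
As with ModelAnalyzer, any object exposing predict(data) works. The sketch below uses a toy model over synthetic 3D data; the wrapper and data are illustrative, not part of anamod.

    import numpy as np
    from anamod.core.model_analyzer import TemporalModelAnalyzer

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 5, 20))  # instances x features x timesteps
    targets = data[:, 0, -5:].mean(axis=1)  # toy continuous target

    class TemporalModelWrapper:
        """Illustrative model: averages feature 0 over the last 5 timesteps."""
        def predict(self, data):
            return data[:, 0, -5:].mean(axis=1)

    analyzer = TemporalModelAnalyzer(TemporalModelWrapper(), data, targets)
    features = analyzer.analyze()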

Temporal model analysis parameters:

window_search_algorithm: str, choices: {‘effect_size’}, default: ‘effect_size’

Search algorithm used to identify the relevant window (TODO: document).

window_effect_size_threshold: float, default: 0.01

Fraction of the total feature importance (effect size) permitted outside the window while searching for the relevant window.

Common optional parameters:

output_dir: str, default: ‘anamod_outputs’

Directory to write logs, intermediate files, and outputs to.

num_permutations: int, default: 50

Number of permutations to perform in permutation test.

permutation_test_statistic: str, choices: {‘mean_loss’, ‘mean_log_loss’, ‘median_loss’, ‘relative_mean_loss’, ‘sign_loss’}, default: ‘mean_loss’

Test statistic to use for computing empirical p-values.

feature_names: list of strings, default: None

List of names to assign to features.

If None, features will be identified using their indices as names.

If feature_hierarchy is provided, names from the hierarchy will be used instead.

feature_hierarchy: anytree.Node object, default: None

Hierarchy over features, defined as an anytree Node or a JSON file. anytree allows importing trees from multiple formats (Python dict, JSON).

If no hierarchy is provided, a flat hierarchy will be auto-generated over base features.

Supersedes feature_names as the source of feature names.

visualize: bool, default: True

Flag to control output visualization.

seed: int, default: 13997

Seed for random number generator (used to order features to be analyzed).

loss_function: str, choices: {‘quadratic_loss’, ‘binary_cross_entropy’, ‘zero_one_loss’, None, ‘absolute_difference_loss’}, default: None

Loss function to apply to model outputs. If no loss function is specified, then quadratic loss is chosen for continuous targets and binary cross-entropy is chosen for binary targets.

importance_significance_level: float, default: 0.1

Significance level and FDR control level used for hypothesis testing to assess feature importance.

compile_results_only: bool, default: False

Flag to compile results only (assuming they already exist), skipping the launch of analysis jobs.

HTCondor parameters:

condor: bool, default: False

Flag to enable parallelization using HTCondor. Requires PyPI package htcondor to be installed.

shared_filesystem: bool, default: False

Flag to indicate a shared filesystem, making file/software transfer unnecessary for running condor.

cleanup: bool, default: True

Remove intermediate condor files upon completion. Enabled by default to reduce space usage and clutter; disable to retain intermediate files for debugging.

features_per_worker: int, default: 1

Number of features to test per condor job. Using fewer features per job reduces per-job load at the cost of more jobs. TODO: If none provided, this will be chosen automatically to create up to 100 jobs.

memory_requirement: int, default: 8

Memory requirement in GB.

disk_requirement: int, default: 8

Disk requirement in GB.

model_loader_filename: str, default: None

Python script that provides functions to load/save the model. Required for condor, since each job runs in its own environment. If none is provided, cloudpickle will be used; see model_loader for a template.

avoid_bad_hosts: bool, default: False

Avoid condor hosts that intermittently cause issues. Enable to reduce the likelihood of failures at the cost of increased runtime. List of hosts: [‘mastodon-5.biostat.wisc.edu’, ‘qlu-1.biostat.wisc.edu’, ‘e269.chtc.wisc.edu’, ‘e1039.chtc.wisc.edu’, ‘chief.biostat.wisc.edu’, ‘mammoth-1.biostat.wisc.edu’, ‘nebula-7.biostat.wisc.edu’, ‘mastodon-1.biostat.wisc.edu’]

retry_arbitrary_failures: bool, default: False

Retry jobs that fail for any reason, up to a maximum of 50 attempts per job. Use with caution; enable only if failures stem from condor issues.

analyze()[source]

Performs feature importance analysis of the model and returns feature objects.

In addition, writes out:

  • a table summarizing feature importance: <output_dir>/feature_importance.csv

  • a visualization of important windows: <output_dir>/feature_importance_windows.png

Returns

features – List of feature objects with feature importance attributes:

  • feature.important: flag to indicate whether the feature is important

  • feature.importance_score: degree of importance

  • feature.pvalue: p-value for importance test

  • feature.ordering_important: flag to indicate whether the feature’s overall ordering is important

  • feature.ordering_pvalue: p-value for overall ordering importance test

  • feature.window: (left, right) timestep boundaries of the important window (0-indexed)

  • feature.window_important: flag to indicate whether the window is important

  • feature.window_importance_score: degree of importance of window

  • feature.window_pvalue: p-value for window importance test

  • feature.window_ordering_important: flag to indicate whether ordering within the window is important

  • feature.window_ordering_pvalue: p-value for window ordering importance test

Return type

list <feature object>
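
The temporal attributes can be inspected in the same way. This sketch uses only the attributes documented above:

    for feature in features:
        if feature.window_important:
            left, right = feature.window
            print(f"window: timesteps {left}-{right}, "
                  f"p-value={feature.window_pvalue:.3g}, "
                  f"ordering important: {feature.window_ordering_important}")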