API
- class anamod.core.model_analyzer.ModelAnalyzer(model, data, targets, **kwargs)
Analyzes properties of learned models.
Required parameters:
- model: object
A model object that provides a ‘predict’ function that returns the model’s predictions on input data, i.e. predictions = model.predict(data)
For instance, this may be a simple wrapper around a scikit-learn or TensorFlow model.
- data: 2D numpy array
Test data matrix of instances x features.
- targets: 1D numpy array
A vector containing targets for each instance in the test data.
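A minimal sketch of the required parameters (the linear model and data here are synthetic stand-ins; only the 'predict' contract matters):

```python
import numpy as np

class LinearModel:
    """Toy wrapper satisfying the required 'predict' contract."""
    def __init__(self, weights):
        self.weights = weights

    def predict(self, data):
        # predictions = model.predict(data): one prediction per instance
        return data @ self.weights

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))                          # instances x features
model = LinearModel(np.array([1.0, 0.5, 0.0, 0.0, 0.0]))
targets = model.predict(data)                             # 1D target vector

try:
    from anamod.core.model_analyzer import ModelAnalyzer
    features = ModelAnalyzer(model, data, targets, visualize=False).analyze()
except Exception:
    features = None  # anamod unavailable (or analysis failed) in this environment
```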
Common optional parameters:
- output_dir: str, default: ‘anamod_outputs’
Directory to write logs, intermediate files, and outputs to.
- num_permutations: int, default: 50
Number of permutations to perform in permutation test.
- permutation_test_statistic: str, choices: [‘mean_loss’, ‘mean_log_loss’, ‘median_loss’, ‘relative_mean_loss’, ‘sign_loss’], default: ‘mean_loss’
Test statistic used to compute empirical p-values.
- feature_names: list of strings, default: None
List of names to be assigned to features.
If None, features will be identified using their indices as names.
If feature_hierarchy is provided, names from it will be used instead.
- feature_hierarchy: anytree.Node object, default: None
Hierarchy over features, defined as an anytree Node or a JSON file. anytree allows importing trees from multiple formats (Python dict, JSON).
If no hierarchy is provided, a flat hierarchy will be auto-generated over base features.
Supersedes feature_names as the source of feature names.
- visualize: bool, default: True
Flag to control output visualization.
- seed: int, default: 13997
Seed for random number generator (used to order features to be analyzed).
- loss_function: str, choices: {‘quadratic_loss’, ‘binary_cross_entropy’, ‘zero_one_loss’, None, ‘absolute_difference_loss’}, default: None
Loss function to apply to model outputs. If no loss function is specified, then quadratic loss is chosen for continuous targets and binary cross-entropy is chosen for binary targets.
- importance_significance_level: float, default: 0.1
Significance level and FDR control level used for hypothesis testing to assess feature importance.
- compile_results_only: bool, default: False
Flag to compile results only (assuming they already exist), skipping job launch.
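The feature_hierarchy parameter can be built from a nested Python dict via anytree's DictImporter; the group and feature names below are hypothetical:

```python
# Hypothetical two-level hierarchy over five base features
hierarchy = {
    "name": "all_features",
    "children": [
        {"name": "demographics", "children": [{"name": "age"}, {"name": "sex"}]},
        {"name": "vitals", "children": [{"name": "heart_rate"}, {"name": "bp"}, {"name": "temp"}]},
    ],
}

try:
    from anytree.importer import DictImporter
    root = DictImporter().import_(hierarchy)  # anytree.Node, usable as feature_hierarchy
except ImportError:
    root = None  # anytree not installed in this environment
```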
HTCondor parameters:
- condor: bool, default: False
Flag to enable parallelization using HTCondor. Requires PyPI package htcondor to be installed.
- shared_filesystem: bool, default: False
Flag to indicate a shared filesystem, making file/software transfer unnecessary for running condor.
- cleanup: bool, default: True
Remove intermediate condor files upon completion (disable to retain files for debugging). Enabled by default to reduce disk usage and clutter.
- features_per_worker: int, default: 1
Number of features to test per condor job. Fewer features per job reduces job load at the cost of more jobs. TODO: If none provided, this will be chosen automatically to create up to 100 jobs.
- memory_requirement: int, default: 8
Memory requirement in GB.
- disk_requirement: int, default: 8
Disk requirement in GB.
- model_loader_filename: str, default: None
Python script that provides functions to load/save model. Required for condor since each job runs in its own environment. If none is provided, cloudpickle will be used - see model_loader for a template.
- avoid_bad_hosts: bool, default: False
Avoid condor hosts that intermittently give issues. Enable to reduce likelihood of failures at the cost of increased runtime. List of hosts: [‘mastodon-5.biostat.wisc.edu’, ‘qlu-1.biostat.wisc.edu’, ‘e269.chtc.wisc.edu’, ‘e1039.chtc.wisc.edu’, ‘chief.biostat.wisc.edu’, ‘mammoth-1.biostat.wisc.edu’, ‘nebula-7.biostat.wisc.edu’, ‘mastodon-1.biostat.wisc.edu’]
- retry_arbitrary_failures: bool, default: False
Retry failing jobs due to any reason, up to a maximum of 50 attempts per job. Use with caution - enable if failures stem from condor issues.
- analyze()
Performs feature importance analysis of the model and returns feature objects.
In addition, writes out:
- a table summarizing feature importance: <output_dir>/feature_importance.csv
- a visualization of the feature importance hierarchy: <output_dir>/feature_importance_hierarchy.png
- Returns
features – List of feature objects with feature importance attributes:
feature.important: flag to indicate whether the feature is important
feature.importance_score: degree of importance
feature.pvalue: p-value for importance test
- Return type
list <feature object>
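A sketch of consuming the returned list; the SimpleNamespace objects below are stand-ins with hypothetical values, mirroring the attributes documented above:

```python
from types import SimpleNamespace

# Stand-ins for feature objects returned by analyze()
features = [
    SimpleNamespace(name="age", important=True, importance_score=0.42, pvalue=0.003),
    SimpleNamespace(name="noise", important=False, importance_score=0.01, pvalue=0.61),
]

important_names = [f.name for f in features if f.important]
print(important_names)  # -> ['age']
```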
- class anamod.core.model_analyzer.TemporalModelAnalyzer(model, data, targets, **kwargs)
Analyzes properties of learned temporal models.
Required parameters:
- model: object
A model object that provides a ‘predict’ function that returns the model’s predictions on input data, i.e. predictions = model.predict(data)
For instance, this may be a simple wrapper around a scikit-learn or TensorFlow model.
- data: 3D numpy array
Test data tensor of instances x features x sequences.
- targets: 1D numpy array
A vector containing targets for each instance in the test data.
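A minimal sketch of the temporal setup with synthetic data; the toy model below reads only the final timestep of the first feature:

```python
import numpy as np

class LastValueModel:
    """Toy temporal model satisfying the 'predict' contract."""
    def predict(self, data):
        # data: instances x features x sequence length
        return data[:, 0, -1]

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4, 20))   # instances x features x sequences
model = LastValueModel()
targets = model.predict(data) + rng.normal(scale=0.1, size=50)

try:
    from anamod.core.model_analyzer import TemporalModelAnalyzer
    analyzer = TemporalModelAnalyzer(model, data, targets, visualize=False)
except Exception:
    analyzer = None  # anamod unavailable in this environment
```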
Temporal model analysis parameters:
- window_search_algorithm: str, choices: {‘effect_size’}, default: ‘effect_size’
Search algorithm used to search for the relevant window (TODO: document).
- window_effect_size_threshold: float, default: 0.01
Fraction of total feature importance (effect size) permitted outside window while searching for relevant window.
Common optional parameters:
- output_dir: str, default: ‘anamod_outputs’
Directory to write logs, intermediate files, and outputs to.
- num_permutations: int, default: 50
Number of permutations to perform in permutation test.
- permutation_test_statistic: str, choices: [‘mean_loss’, ‘mean_log_loss’, ‘median_loss’, ‘relative_mean_loss’, ‘sign_loss’], default: ‘mean_loss’
Test statistic used to compute empirical p-values.
- feature_names: list of strings, default: None
List of names to be assigned to features.
If None, features will be identified using their indices as names.
If feature_hierarchy is provided, names from it will be used instead.
- feature_hierarchy: anytree.Node object, default: None
Hierarchy over features, defined as an anytree Node or a JSON file. anytree allows importing trees from multiple formats (Python dict, JSON).
If no hierarchy is provided, a flat hierarchy will be auto-generated over base features.
Supersedes feature_names as the source of feature names.
- visualize: bool, default: True
Flag to control output visualization.
- seed: int, default: 13997
Seed for random number generator (used to order features to be analyzed).
- loss_function: str, choices: {‘quadratic_loss’, ‘binary_cross_entropy’, ‘zero_one_loss’, None, ‘absolute_difference_loss’}, default: None
Loss function to apply to model outputs. If no loss function is specified, then quadratic loss is chosen for continuous targets and binary cross-entropy is chosen for binary targets.
- importance_significance_level: float, default: 0.1
Significance level and FDR control level used for hypothesis testing to assess feature importance.
- compile_results_only: bool, default: False
Flag to compile results only (assuming they already exist), skipping job launch.
HTCondor parameters:
- condor: bool, default: False
Flag to enable parallelization using HTCondor. Requires PyPI package htcondor to be installed.
- shared_filesystem: bool, default: False
Flag to indicate a shared filesystem, making file/software transfer unnecessary for running condor.
- cleanup: bool, default: True
Remove intermediate condor files upon completion (disable to retain files for debugging). Enabled by default to reduce disk usage and clutter.
- features_per_worker: int, default: 1
Number of features to test per condor job. Fewer features per job reduces job load at the cost of more jobs. TODO: If none provided, this will be chosen automatically to create up to 100 jobs.
- memory_requirement: int, default: 8
Memory requirement in GB.
- disk_requirement: int, default: 8
Disk requirement in GB.
- model_loader_filename: str, default: None
Python script that provides functions to load/save model. Required for condor since each job runs in its own environment. If none is provided, cloudpickle will be used - see model_loader for a template.
- avoid_bad_hosts: bool, default: False
Avoid condor hosts that intermittently give issues. Enable to reduce likelihood of failures at the cost of increased runtime. List of hosts: [‘mastodon-5.biostat.wisc.edu’, ‘qlu-1.biostat.wisc.edu’, ‘e269.chtc.wisc.edu’, ‘e1039.chtc.wisc.edu’, ‘chief.biostat.wisc.edu’, ‘mammoth-1.biostat.wisc.edu’, ‘nebula-7.biostat.wisc.edu’, ‘mastodon-1.biostat.wisc.edu’]
- retry_arbitrary_failures: bool, default: False
Retry failing jobs due to any reason, up to a maximum of 50 attempts per job. Use with caution - enable if failures stem from condor issues.
- analyze()
Performs feature importance analysis of the model and returns feature objects.
In addition, writes out:
- a table summarizing feature importance: <output_dir>/feature_importance.csv
- a visualization of important windows: <output_dir>/feature_importance_windows.png
- Returns
features – List of feature objects with feature importance attributes:
feature.important: flag to indicate whether the feature is important
feature.importance_score: degree of importance
feature.pvalue: p-value for importance test
feature.ordering_important: flag to indicate whether the feature’s overall ordering is important
feature.ordering_pvalue: p-value for overall ordering importance test
feature.window: (left, right) timestep boundaries of important window (0-indexed)
feature.window_important: flag to indicate whether the window is important
feature.window_importance_score: degree of importance of window
feature.window_pvalue: p-value for window importance test
feature.window_ordering_important: flag to indicate whether ordering within the window is important
feature.window_ordering_pvalue: p-value for window ordering importance test
- Return type
list <feature object>
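A sketch of reading the temporal attributes; the SimpleNamespace below is a stand-in with hypothetical values, mirroring the attributes documented above:

```python
from types import SimpleNamespace

# Stand-in for one feature object returned by TemporalModelAnalyzer.analyze()
feature = SimpleNamespace(
    name="heart_rate",
    important=True, importance_score=0.37, pvalue=0.004,
    ordering_important=True, ordering_pvalue=0.02,
    window=(5, 12), window_important=True,
    window_importance_score=0.31, window_pvalue=0.008,
    window_ordering_important=False, window_ordering_pvalue=0.24,
)

if feature.important and feature.window_important:
    left, right = feature.window  # 0-indexed timestep boundaries
    print(f"{feature.name}: timesteps {left}-{right} drive its importance")
```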