API
- class anamod.core.model_analyzer.ModelAnalyzer(model, data, targets, **kwargs)
Analyzes properties of learned models.
Required parameters:
- model: object
A model object that provides a ‘predict’ function that returns the model’s predictions on input data, i.e. predictions = model.predict(data)
For instance, this may be a simple wrapper around a scikit-learn or TensorFlow model.
- data: 2D numpy array
Test data matrix of instances x features.
- targets: 1D numpy array
A vector containing targets for each instance in the test data.
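A minimal sketch of the required parameters (the linear model and data here are synthetic stand-ins; only the 'predict' contract matters):

```python
import numpy as np

class LinearModel:
    """Toy wrapper satisfying the required 'predict' contract."""
    def __init__(self, weights):
        self.weights = weights

    def predict(self, data):
        # predictions = model.predict(data): one prediction per instance
        return data @ self.weights

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))                          # instances x features
model = LinearModel(np.array([1.0, 0.5, 0.0, 0.0, 0.0]))
targets = model.predict(data)                             # 1D target vector

try:
    from anamod.core.model_analyzer import ModelAnalyzer
    features = ModelAnalyzer(model, data, targets, visualize=False).analyze()
except Exception:
    features = None  # anamod unavailable (or analysis failed) in this environment
```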
Common optional parameters:
- output_dir: str, default: ‘anamod_outputs’
Directory to write logs, intermediate files, and outputs to.
- num_permutations: int, default: 50
Number of permutations to perform in permutation test.
- permutation_test_statistic: str, choices: [‘mean_loss’, ‘mean_log_loss’, ‘median_loss’, ‘relative_mean_loss’, ‘sign_loss’], default: ‘mean_loss’
Test statistic used to compute empirical p-values.
- feature_names: list of strings, default: None
List of names to be assigned to features.
If None, features will be identified using their indices as names.
If feature_hierarchy is provided, names from it will be used instead.
- feature_hierarchy: anytree.Node object, default: None
Hierarchy over features, defined as an anytree Node or a JSON file. anytree allows importing trees from multiple formats (Python dict, JSON).
If no hierarchy is provided, a flat hierarchy will be auto-generated over base features.
Supersedes feature_names as the source of feature names.
- visualize: bool, default: True
Flag to control output visualization.
- seed: int, default: 13997
Seed for random number generator (used to order features to be analyzed).
- loss_function: str, choices: {‘quadratic_loss’, ‘binary_cross_entropy’, ‘zero_one_loss’, None, ‘absolute_difference_loss’}, default: None
Loss function to apply to model outputs. If no loss function is specified, then quadratic loss is chosen for continuous targets and binary cross-entropy is chosen for binary targets.
- importance_significance_level: float, default: 0.1
Significance level and FDR control level used for hypothesis testing to assess feature importance.
- compile_results_only: bool, default: False
Flag to compile results only (assuming they already exist), skipping job launch.
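The feature_hierarchy parameter can be built from a nested Python dict via anytree's DictImporter; the group and feature names below are hypothetical:

```python
# Hypothetical two-level hierarchy over five base features
hierarchy = {
    "name": "all_features",
    "children": [
        {"name": "demographics", "children": [{"name": "age"}, {"name": "sex"}]},
        {"name": "vitals", "children": [{"name": "heart_rate"}, {"name": "bp"}, {"name": "temp"}]},
    ],
}

try:
    from anytree.importer import DictImporter
    root = DictImporter().import_(hierarchy)  # anytree.Node, usable as feature_hierarchy
except ImportError:
    root = None  # anytree not installed in this environment
```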
HTCondor parameters:
- condor: bool, default: False
Flag to enable parallelization using HTCondor. Requires PyPI package htcondor to be installed.
- shared_filesystem: bool, default: False
Flag to indicate a shared filesystem, making file/software transfer unnecessary for running condor.
- cleanup: bool, default: True
Remove intermediate condor files upon completion (disable to retain files for debugging). Enabled by default to reduce disk usage and clutter.
- features_per_worker: int, default: 1
Number of features to test per condor job. Fewer features per job reduces job load at the cost of more jobs. TODO: If none provided, this will be chosen automatically to create up to 100 jobs.
- memory_requirement: int, default: 8
Memory requirement in GB.
- disk_requirement: int, default: 8
Disk requirement in GB.
- model_loader_filename: str, default: None
Python script that provides functions to load/save model. Required for condor since each job runs in its own environment. If none is provided, cloudpickle will be used - see model_loader for a template.
- avoid_bad_hosts: bool, default: False
Avoid condor hosts that intermittently give issues. Enable to reduce likelihood of failures at the cost of increased runtime. List of hosts: [‘mastodon-5.biostat.wisc.edu’, ‘qlu-1.biostat.wisc.edu’, ‘e269.chtc.wisc.edu’, ‘e1039.chtc.wisc.edu’, ‘chief.biostat.wisc.edu’, ‘mammoth-1.biostat.wisc.edu’, ‘nebula-7.biostat.wisc.edu’, ‘mastodon-1.biostat.wisc.edu’]
- retry_arbitrary_failures: bool, default: False
Retry failing jobs due to any reason, up to a maximum of 50 attempts per job. Use with caution - enable if failures stem from condor issues.
- analyze()
Performs feature importance analysis of the model and returns feature objects.
In addition, writes out:
- a table summarizing feature importance: <output_dir>/feature_importance.csv
- a visualization of the feature importance hierarchy: <output_dir>/feature_importance_hierarchy.png
- Returns
features – List of feature objects with feature importance attributes:
feature.important: flag to indicate whether the feature is important
feature.importance_score: degree of importance
feature.pvalue: p-value for importance test
- Return type
list <feature object>
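A sketch of consuming the returned list; the SimpleNamespace objects below are stand-ins with hypothetical values, mirroring the attributes documented above:

```python
from types import SimpleNamespace

# Stand-ins for feature objects returned by analyze()
features = [
    SimpleNamespace(name="age", important=True, importance_score=0.42, pvalue=0.003),
    SimpleNamespace(name="noise", important=False, importance_score=0.01, pvalue=0.61),
]

important_names = [f.name for f in features if f.important]
print(important_names)  # -> ['age']
```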
- class anamod.core.model_analyzer.TemporalModelAnalyzer(model, data, targets, **kwargs)
Analyzes properties of learned temporal models.
Required parameters:
- model: object
A model object that provides a ‘predict’ function that returns the model’s predictions on input data, i.e. predictions = model.predict(data)
For instance, this may be a simple wrapper around a scikit-learn or TensorFlow model.
- data: 3D numpy array
Test data tensor of instances x features x sequences.
- targets: 1D numpy array
A vector containing targets for each instance in the test data.
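A minimal sketch of the temporal setup with synthetic data; the toy model below reads only the final timestep of the first feature:

```python
import numpy as np

class LastValueModel:
    """Toy temporal model satisfying the 'predict' contract."""
    def predict(self, data):
        # data: instances x features x sequence length
        return data[:, 0, -1]

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4, 20))   # instances x features x sequences
model = LastValueModel()
targets = model.predict(data) + rng.normal(scale=0.1, size=50)

try:
    from anamod.core.model_analyzer import TemporalModelAnalyzer
    analyzer = TemporalModelAnalyzer(model, data, targets, visualize=False)
except Exception:
    analyzer = None  # anamod unavailable in this environment
```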
Temporal model analysis parameters:
- window_search_algorithm: str, choices: {‘effect_size’}, default: ‘effect_size’
Search algorithm used to search for the relevant window (TODO: document).
- window_effect_size_threshold: float, default: 0.01
Fraction of total feature importance (effect size) permitted outside window while searching for relevant window.
Common optional parameters:
- output_dir: str, default: ‘anamod_outputs’
Directory to write logs, intermediate files, and outputs to.
- num_permutations: int, default: 50
Number of permutations to perform in permutation test.
- permutation_test_statistic: str, choices: [‘mean_loss’, ‘mean_log_loss’, ‘median_loss’, ‘relative_mean_loss’, ‘sign_loss’], default: ‘mean_loss’
Test statistic used to compute empirical p-values.
- feature_names: list of strings, default: None
List of names to be assigned to features.
If None, features will be identified using their indices as names.
If feature_hierarchy is provided, names from it will be used instead.
- feature_hierarchy: anytree.Node object, default: None
Hierarchy over features, defined as an anytree Node or a JSON file. anytree allows importing trees from multiple formats (Python dict, JSON).
If no hierarchy is provided, a flat hierarchy will be auto-generated over base features.
Supersedes feature_names as the source of feature names.
- visualize: bool, default: True
Flag to control output visualization.
- seed: int, default: 13997
Seed for random number generator (used to order features to be analyzed).
- loss_function: str, choices: {‘quadratic_loss’, ‘binary_cross_entropy’, ‘zero_one_loss’, None, ‘absolute_difference_loss’}, default: None
Loss function to apply to model outputs. If no loss function is specified, then quadratic loss is chosen for continuous targets and binary cross-entropy is chosen for binary targets.
- importance_significance_level: float, default: 0.1
Significance level and FDR control level used for hypothesis testing to assess feature importance.
- compile_results_only: bool, default: False
Flag to compile results only (assuming they already exist), skipping job launch.
HTCondor parameters:
- condor: bool, default: False
Flag to enable parallelization using HTCondor. Requires PyPI package htcondor to be installed.
- shared_filesystem: bool, default: False
Flag to indicate a shared filesystem, making file/software transfer unnecessary for running condor.
- cleanup: bool, default: True
Remove intermediate condor files upon completion (disable to retain files for debugging). Enabled by default to reduce disk usage and clutter.
- features_per_worker: int, default: 1
Number of features to test per condor job. Fewer features per job reduces job load at the cost of more jobs. TODO: If none provided, this will be chosen automatically to create up to 100 jobs.
- memory_requirement: int, default: 8
Memory requirement in GB.
- disk_requirement: int, default: 8
Disk requirement in GB.
- model_loader_filename: str, default: None
Python script that provides functions to load/save model. Required for condor since each job runs in its own environment. If none is provided, cloudpickle will be used - see model_loader for a template.
- avoid_bad_hosts: bool, default: False
Avoid condor hosts that intermittently give issues. Enable to reduce likelihood of failures at the cost of increased runtime. List of hosts: [‘mastodon-5.biostat.wisc.edu’, ‘qlu-1.biostat.wisc.edu’, ‘e269.chtc.wisc.edu’, ‘e1039.chtc.wisc.edu’, ‘chief.biostat.wisc.edu’, ‘mammoth-1.biostat.wisc.edu’, ‘nebula-7.biostat.wisc.edu’, ‘mastodon-1.biostat.wisc.edu’]
- retry_arbitrary_failures: bool, default: False
Retry failing jobs due to any reason, up to a maximum of 50 attempts per job. Use with caution - enable if failures stem from condor issues.
- analyze()
Performs feature importance analysis of the model and returns feature objects.
In addition, writes out:
- a table summarizing feature importance: <output_dir>/feature_importance.csv
- a visualization of important windows: <output_dir>/feature_importance_windows.png
- Returns
features – List of feature objects with feature importance attributes:
feature.important: flag to indicate whether the feature is important
feature.importance_score: degree of importance
feature.pvalue: p-value for importance test
feature.ordering_important: flag to indicate whether the feature’s overall ordering is important
feature.ordering_pvalue: p-value for overall ordering importance test
feature.window: (left, right) timestep boundaries of important window (0-indexed)
feature.window_important: flag to indicate whether the window is important
feature.window_importance_score: degree of importance of window
feature.window_pvalue: p-value for window importance test
feature.window_ordering_important: flag to indicate whether ordering within the window is important
feature.window_ordering_pvalue: p-value for window ordering importance test
- Return type
list <feature object>
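A sketch of reading the temporal attributes; the SimpleNamespace below is a stand-in with hypothetical values, mirroring the attributes documented above:

```python
from types import SimpleNamespace

# Stand-in for one feature object returned by TemporalModelAnalyzer.analyze()
feature = SimpleNamespace(
    name="heart_rate",
    important=True, importance_score=0.37, pvalue=0.004,
    ordering_important=True, ordering_pvalue=0.02,
    window=(5, 12), window_important=True,
    window_importance_score=0.31, window_pvalue=0.008,
    window_ordering_important=False, window_ordering_pvalue=0.24,
)

if feature.important and feature.window_important:
    left, right = feature.window  # 0-indexed timestep boundaries
    print(f"{feature.name}: timesteps {left}-{right} drive its importance")
```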