vhammll #
fn analyze_dataset #
fn analyze_dataset(opts Options) AnalyzeResult
analyze_dataset returns a struct with information about a datafile.
Optional:
if show_flag is true, displays on the console (using show_analyze):
1. a list of attributes, their types, the unique values, and a count of
missing values;
2. a table with counts for each type of attribute;
3. a list of discrete attributes useful for training a classifier;
4. a list of continuous attributes useful for training a classifier;
5. a breakdown of the class attribute, showing counts for each class.
class_missing_purge_flag (-pmc): if true, removes instances whose class
value is missing before analysis;
outputfile_path: if specified, saves the analysis results.
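As an illustrative sketch (the import path and the ability to set these Options fields via a struct literal are assumptions, not confirmed by this chunk):

```v
import holder66.vhammll

fn main() {
	// hypothetical: analyze a dataset and show the tables on the console
	opts := vhammll.Options{
		datafile_path: 'datasets/iris.tab'
		show_flag:     true
	}
	result := vhammll.analyze_dataset(opts)
	// AnalyzeResult fields such as class_counts are documented below
	println(result.class_counts)
}
```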
fn append_instances #
fn append_instances(cl Classifier, instances_to_append ValidateResult, opts Options) Classifier
append_instances extends a classifier by adding more instances. It returns the extended classifier struct.
Output options:
show_flag: display results on the console;
outputfile_path: saves the extended classifier to a file.
fn auc_roc #
fn auc_roc(roc_point_array []RocPoint) f64
auc_roc returns the area under the Receiver Operating Characteristic (ROC) curve for an array of ROC points.
fn cli #
fn cli(cli_options CliOptions) !
cli() is the command line interface app for the holder66.vhammll ML library.
Usage: v run main.v [command] [flags and options] <path_to_file>
Datafiles should be either tab-delimited, or have extension .csv or .arff
Commands: analyze | append | cross | display | examples | explore
| make | optimals | orange | partition | query | rank | validate | verify
To get help with individual commands, type `v run main.v [command] -h`
Flags and options (note that most are specific to commands):
-a --attributes, followed by one, two, or three integers: Parameters.number_of_attributes
-aa --all-attributes: for each category of optimals, retain all settings (the
default is to retain only settings with unique attribute numbers in each category);
-af --all-flags: Options.traverse_all_flags
-b --bins, followed by one, two, or three integers: Binning
A single value will be used for all attributes; two integers for a range of bin
values; a third integer specifies an interval for the range (note that
the binning range is from the upper to the lower value);
note: when doing an explore, the first integer specifies the lower
limit for the number of bins, and the second gives the upper value
for the explore range. Example: explore -b 3,6 would first use 3 - 3,
then 3 - 4, then 3 - 5, and finally 3 - 6 for the binning ranges.
If the uniform flag is true, then a single integer specifies
the number of bins for all continuous attributes; two integers for a
range of uniform bin values for the explore command; a third integer
for the interval to be used over the explore range;
-bp --balanced-prevalences: Parameters.balance_prevalences_flag
-bpt --balance-prevalences-threshold, followed by a float: the ratio threshold
below which class prevalences are considered imbalanced (default 0.9):
LoadOptions.balance_prevalences_threshold
-c --concurrent, permit parallel processing to use multiple cores: Options.concurrency_flag
-cl --combination-limits: sets minimum and maximum lengths for combinations
of multiple classifiers: Options.DisplaySettings.CombinationSizeLimits;
entering values for limits also sets the generate_combinations_flag so
that classifier combinations will be generated.
-e --expanded, expanded results on the console: DisplaySettings.expanded_flag
-ea: display information about trained attributes on the console, for
classification operations: DisplaySettings.show_attributes_flag
-exr --explore-rank, followed by e.g. '2,7', will repeat the ranking
exercise over the binning range from 2 through 7
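A few illustrative invocations from the repository root, using the flags listed above (hypothetical; adjust paths to your setup):

```
v run main.v analyze datasets/iris.tab
v run main.v cross -c -a 2 -b 3,6 datasets/iris.tab
v run main.v rank -g -x datasets/iris.tab
```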
fn cross_validate #
fn cross_validate(opts Options) CrossVerifyResult
cross_validate performs n-fold cross-validation on a dataset: it partitions the dataset's instances into n folds; for each fold, it trains a classifier on all the instances not in that fold and then uses this classifier to classify the fold's cases. The classification results over all n folds are summarized.
Options (also see the Options struct):
bins: range for binning or slicing of continuous attributes;
number_of_attributes: the number of attributes to use, in descending
order of rank value;
exclude_flag: excludes missing values when ranking attributes;
weighting_flag: nearest neighbor counts are weighted by
class prevalences;
folds: number of folds n to use for n-fold cross-validation (default
is leave-one-out cross-validation);
repetitions: number of times to repeat n-fold cross-validations;
random-pick: choose instances randomly for n-fold cross-validations.
balance_prevalences_flag / balance_prevalences_threshold: when set,
duplicates minority-class instances until prevalences are balanced;
threshold is the min/max ratio below which balancing triggers (default 0.9).
traverse_all_flags: when used with multiple_flag, iterates over all
combinations of multi_strategy and break_on_all_flag and prints each
result side by side (3 strategies × 2 break_on_all_flag settings).
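The options above might be combined as follows (a minimal sketch; the assumption that number_of_attributes is an []int field, and the import path, are not confirmed by this chunk):

```v
import holder66.vhammll

fn main() {
	// sketch: 10-fold cross-validation using the 2 highest-ranked attributes,
	// with nearest-neighbor counts weighted by class prevalences
	opts := vhammll.Options{
		datafile_path:        'datasets/iris.tab'
		folds:                10
		number_of_attributes: [2]
		weighting_flag:       true
		show_flag:            true
	}
	result := vhammll.cross_validate(opts)
	// CrossVerifyResult fields such as correct_count are documented below
	println(result.correct_count)
}
```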
fn display_file #
fn display_file(path string, in_opts Options)
display_file displays on the console a results file produced by other vhammll functions; a multiple-classifier settings file; or graphs for explore, ranking, or cross-validation results.
display_file('path_to_saved_results_file', expanded_flag: true)
Output options:
expanded_flag: display additional information on the console, including
a confusion matrix for cross-validation or verification operations;
graph_flag: generates plots for display in the default web browser;
Options for displaying classifier settings files (suffix .opts):
show_attributes_flag: list the attributes used by each classifier;
classifiers: a list of classifier IDs to display.
fn explore #
fn explore(opts Options) ExploreResult
explore runs a series of cross-validations or verifications, over a range of attributes and a range of binning values.
Options (also see the Options struct):
append_settings_flag: if true, appends classifier settings to an
opts file;
bins: range for binning or slicing of continuous attributes;
uniform_bins: same number of bins for all continuous attributes;
number_of_attributes: range for attributes to include;
exclude_flag: excludes missing values when ranking attributes;
weighting_flag: nearest neighbor counts are weighted by
class prevalences;
folds: number of folds n to use for n-fold cross-validation (default
is leave-one-out cross-validation);
repetitions: number of times to repeat n-fold cross-validations;
random-pick: choose instances randomly for n-fold cross-validations.
Output options:
show_flag: display results on the console;
expanded_flag: display additional information on the console, including
a confusion matrix for each explore step;
graph_flag: generate plots of Receiver Operating Characteristics (ROC)
by attributes used; ROC by bins used, and accuracy by attributes
used.
balance_prevalences_flag / balance_prevalences_threshold: when set,
duplicates minority-class instances until prevalences are balanced;
threshold is the min/max ratio below which balancing triggers (default 0.9);
traverse_all_flags: repeat the explore operation for all possible
combinations of the flags uniform_bins, weight_ranking_flag, etc;
note that if -bp is also set, then -af will only traverse the settings
which make sense in the context of balancing prevalences;
outputfile_path: saves the result to a file.
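A sketch of an explore run over an attribute range and a binning range (field types for bins and number_of_attributes are assumed to be []int ranges; adjust to the Options struct in your version):

```v
import holder66.vhammll

fn main() {
	// sketch: sweep from 1 to 4 attributes over binning range 2 through 6
	opts := vhammll.Options{
		datafile_path:        'datasets/iris.tab'
		number_of_attributes: [1, 4]
		bins:                 [2, 6]
		show_flag:            true
	}
	er := vhammll.explore(opts)
	// one CrossVerifyResult per explore step
	println(er.array_of_results.len)
}
```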
fn file_type #
fn file_type(path string) string
file_type returns a string identifying how a dataset is structured or formatted, e.g. 'orange_newer', 'orange_older', 'arff', or 'csv'. On the assumption that an 'orange_older' file always identifies a class attribute by having 'c' or 'class' in the third header line, all other tab-delimited datafiles are typed as 'orange_newer'.
Example
assert file_type('datasets/iris.tab') == 'orange_older'
fn get_environment #
fn get_environment() Environment
get_environment collects information about the computer, the operating system and its version, the version and build of V, the version of vhammll, and the date and time, and returns it as an Environment struct.
fn load_classifier_file #
fn load_classifier_file(path string) !Classifier
load_classifier_file loads a file generated by make_classifier(); returns a Classifier struct.
Example
cl := load_classifier_file('tempfolder/saved_classifier.txt')
fn load_file #
fn load_file(path string, opts LoadOptions) Dataset
load_file returns a struct containing the datafile's contents, suitable for generating a classifier
Example
ds := load_file('datasets/iris.tab')
fn load_instances_file #
fn load_instances_file(path string) !ValidateResult
load_instances_file loads a file generated by validate() or query(), and returns it as a struct, suitable for appending to a classifier.
Example
instances := load_instances_file('tempfolder/saved_validate_result.txt')
fn make_classifier #
fn make_classifier(opts Options) Classifier
make_classifier returns a Classifier struct, given a Dataset (as created by load_file).
Options (also see the Options struct):
bins: range for binning or slicing of continuous attributes;
uniform_bins: same number of bins for continuous attributes;
number_of_attributes: the number of highest-ranked attributes to include;
exclude_flag: excludes missing values when ranking attributes;
purge_flag: removes instances that are duplicates after binning,
considering only the attributes to be used;
balance_prevalences_flag: when true, duplicates instances of minority
classes until class prevalences are sufficiently balanced;
balance_prevalences_threshold: the min/max class-count ratio below which
balancing is triggered (default 0.9; set via -bpt);
switches_flag: when true and the dataset has exactly 2 classes, excludes
bin counts whose dominant-class switch count exceeds switches_threshold
from the rank value search; attributes where every bin count exceeds
the threshold receive rank value 0;
switches_threshold: maximum permitted switches per bin count when
switches_flag is active (default 2);
outputfile_path: if specified, saves the classifier to this file;
append_settings_flag / settingsfile_path: if set, appends the classifier
parameters as a ClassifierSettings entry to the given settings file;
class_missing_purge_flag (-pmc): if true, removes instances whose class
value is missing before training.
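A training sketch, assuming make_classifier loads the dataset from opts.datafile_path as the CLI 'make' command does (the []int type for number_of_attributes and the import path are assumptions):

```v
import holder66.vhammll

fn main() {
	// sketch: train on the 2 highest-ranked attributes and save the classifier
	opts := vhammll.Options{
		datafile_path:        'datasets/iris.tab'
		number_of_attributes: [2]
		exclude_flag:         true
		outputfile_path:      'tempfolder/saved_classifier.txt'
	}
	cl := vhammll.make_classifier(opts)
	// Classifier fields such as attribute_ordering are documented below
	println(cl.attribute_ordering)
}
```

The saved file can later be reloaded with load_classifier_file() and extended with append_instances().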
fn one_vs_rest_verify #
fn one_vs_rest_verify(opts Options) CrossVerifyResult
one_vs_rest_verify classifies all the cases in a verification datafile (specified by opts.testfile_path) using an array of trained Classifiers, one per class; each classifier is trained with one class vs all the other classes. It returns metrics comparing the inferred classes to the labeled (assigned) classes of the verification datafile.
Optional (also see `make_classifier.v` for options in training a classifier):
weighting_flag: nearest neighbor counts are weighted by
class prevalences.
Output options:
show_flag: display results on the console;
expanded_flag: display additional information on the console, including
a confusion matrix.
outputfile_path: saves the result as a json file.
fn optimals #
fn optimals(path string, opts Options) OptimalsResult
optimals determines which classifiers provide the best balanced accuracy, best Matthews Correlation Coefficient (MCC), highest total for correct inferences, and highest correct inferences per class, for multiple classifiers whose settings are stored in a settings file.
Options:
-cl --combination-limits: enumerate all combinations of the optimal classifiers
and compute each combination's AUC; an optional pair of integers sets the
minimum and maximum combination size.
-e --expanded: for each setting, print the Parameters, results obtained, and Metrics.
-g --graph: without -cl, plots the ROC curve for individual classifiers (binary only);
with -cl, plots AUC vs rank for every multi-classifier combination, one scatter
trace per combination length; silently skipped for multi-class datasets.
-l --limit-output: with -g -cl, caps each combination-length trace to the top N
combinations by AUC (0 = show all).
-p --purge: discard duplicate settings (identical parameters, different IDs).
-aa --all-attributes: show all settings in each category; default shows only those
with unique attribute counts.
-o --output: path to a file in which to save the (purged) settings.
-s --show: print only classifier IDs for each category.
fn opts #
fn opts(s string, c Cmd) Options
opts takes a string of command-line arguments and a Cmd giving the command name, and returns the corresponding Options struct.
fn plot_auc_combinations #
fn plot_auc_combinations(combos []AucClassifiers, files RocFiles, top_n int)
plot_auc_combinations generates an interactive scatter plot of the Area Under the ROC Curve (AUC) for each multi-classifier combination, with one trace per combination length. X-axis is rank within the length group (1 = highest AUC); Y-axis is AUC; hovering over a marker shows the constituent classifier IDs. The incoming combos slice must already be sorted descending by AUC (as done in optimals()). If top_n is greater than zero, only the top_n highest-AUC combinations are shown for each length; passing zero shows all combinations. Called by optimals when -g and -cl are both active on a binary dataset.
fn plot_mult_roc #
fn plot_mult_roc(rocdata_array []RocData, files RocFiles)
plot_mult_roc generates an interactive Receiver Operating Characteristic plot in the default web browser for one or more ROC traces. Each element of rocdata_array produces one scatter trace; the AUC is computed and shown in the plot title.
fn purge_instances_for_missing_class_values_not_inline #
fn purge_instances_for_missing_class_values_not_inline(mut ds Dataset) Dataset
purge_instances_for_missing_class_values_not_inline removes all instances whose class value is in the missings list, returning the modified dataset. This is a non-method wrapper around the method form; prefer the method form where possible.
fn query #
fn query(cl Classifier, opts Options) ClassifyResult
query takes a trained classifier and performs an interactive session with the user at the console, asking the user to input a value for each trained attribute. It then asks to confirm or redo the responses. Once confirmed, the instance is classified and the inferred class is shown. The classified instance can optionally be saved in a file. The saved instance can be appended to the classifier using append_instances().
fn rank_attributes #
fn rank_attributes(opts Options) RankingResult
rank_attributes takes a Dataset and returns a list of all the dataset's usable attributes, ranked in order of each attribute's ability to separate the classes.
Algorithm:
for each attribute:
create a matrix with attribute values for row headers, and
class values for column headers;
for each unique value `val` for that attribute:
for each unique value `class` of the class attribute:
for each instance:
accumulate a count for those instances whose class value
equals `class`;
populate the matrix with these accumulated counts;
for each `val`:
get the absolute values of the differences between accumulated
counts for each pair of `class` values;
add those absolute differences;
total those added absolute differences to get the raw rank value
for that attribute.
To obtain rank values weighted by class prevalences, use the same algorithm
except before taking the difference of each pair of accumulated counts,
multiply each count of the pair by the class prevalence of the other class.
(Note: rank_attributes always uses class prevalences as weights)
Obtain a maximum rank value by calculating a rank value for the class
attribute itself.
To obtain normalized rank values:
for each attribute:
divide its raw rank value by the maximum rank value and multiply by 100.
Sort the attributes by descending rank values.
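The raw rank-value computation above can be sketched for a single two-class discrete attribute (a standalone toy illustration, not library code):

```v
fn main() {
	// toy attribute values and class labels (two classes, 'x' and 'y')
	vals := ['a', 'a', 'b', 'b', 'b']
	classes := ['x', 'y', 'x', 'x', 'y']
	mut rank := 0
	for val in ['a', 'b'] {
		// accumulate per-class counts for this attribute value
		mut count_x := 0
		mut count_y := 0
		for i, v in vals {
			if v == val {
				if classes[i] == 'x' { count_x++ } else { count_y++ }
			}
		}
		// absolute difference between the pair of accumulated class counts
		diff := count_x - count_y
		rank += if diff < 0 { -diff } else { diff }
	}
	// 'a' contributes |1 - 1| = 0; 'b' contributes |2 - 1| = 1
	println(rank) // raw rank value for this attribute: 1
}
```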
Options:
`binning`: specifies the range for binning (slicing) continuous attributes;
`weight_ranking_flag`: applies the prevalences of each class in calculating rankings;
`exclude_flag`: exclude missing values when calculating rank values;
`explore_rank`: gives start and end values for the maximum binning number,
over which ranking is explored for different binning values;
`class_missing_purge_flag` (-pmc): if true, removes instances whose class
value is missing before ranking;
Output options:
`show_flag`: print the ranked list to the console;
`graph_flag`: generate a rank-values plot (continuous attributes, y axis)
vs number of bins (x axis); skipped silently when no continuous
attributes exist. When `switches_flag` is also set on a binary dataset,
a second plot of dominant-class switches vs bins is produced.
`overfitting_flag`: generates metrics/plots to help determine, for continuous
attributes, whether overfitting is occurring.
`weighting_flag`: for the hits per bin graph produced by the overfitting flag,
weights and normalizes the hits.
`outputfile_path`: saves the result as json.
rank_attributes loads a dataset from opts.datafile_path and ranks its attributes. This is the entry point used by the CLI 'rank' command and any caller that works with a file path. Callers that already have an in-memory Dataset (e.g. per-fold training partitions in cross-validation) should call rank_dataset directly to avoid reloading the full file.
fn rank_one_vs_rest #
fn rank_one_vs_rest(opts Options) RankingResult
rank_one_vs_rest ranks the dataset's attributes using a one-vs-rest strategy: for each class, instances of that class are treated as positives and all other instances as negatives. Attributes are ranked by their ability to separate each class from the rest. The result and output options are the same as for rank_attributes().
Algorithm: identical to the algorithm described above for rank_attributes(), applied separately for each class, with that class's instances as positives and all other instances as negatives.
Options:
-b --bins: specifies the range for binning (slicing) continuous attributes;
-x --exclude: exclude missing values when calculating rank values;
Output options:
`show_flag` to print the ranked list to the console;
`graph_flag` to generate plots of rank values for each attribute on the
y axis, with number of bins on the x axis.
`outputfile_path`, saves the result as json.
fn roc_values #
fn roc_values(pairs [][]f64, classifier_ids [][]int) []RocPoint
roc_values takes a list of pairs of sensitivity and specificity values, along with the corresponding list of classifier IDs, and returns a list of Receiver Operating Characteristic plot points (sensitivity vs 1 - specificity).
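The roc_values() and auc_roc() functions can be chained; here is a sketch with made-up sensitivity/specificity pairs and classifier IDs (the values and the import path are hypothetical):

```v
import holder66.vhammll

fn main() {
	// hypothetical (sensitivity, specificity) pairs with matching classifier IDs
	pairs := [[0.95, 0.60], [0.80, 0.85], [0.60, 0.95]]
	ids := [[1], [2], [3]]
	points := vhammll.roc_values(pairs, ids)
	// area under the resulting ROC curve
	println(vhammll.auc_roc(points))
}
```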
fn save_json_file #
fn save_json_file[T](u T, path string)
save_json_file serialises the value u to JSON and writes it to path, overwriting any existing file. Works with any type T.
fn show_analyze #
fn show_analyze(result AnalyzeResult)
show_analyze prints to the console a series of tables detailing a dataset. It takes as input an AnalyzeResult struct generated by analyze_dataset().
fn show_classifier #
fn show_classifier(cl Classifier)
show_classifier outputs to the console information about a classifier.
fn show_crossvalidation #
fn show_crossvalidation(result CrossVerifyResult, opts Options)
show_crossvalidation prints to the console the results of a cross_validate() run: the partitioning scheme, classifier parameters (or multiple-classifier settings), instance counts, per-class accuracy, confusion matrix (when expanded), and binary metrics for two-class problems.
fn show_rank_attributes #
fn show_rank_attributes(result RankingResult)
show_rank_attributes prints to the console the ranked list of attributes in result, showing each attribute's name, index, type, rank value, and bin count. Output is limited to result.limit_output entries when that field is non-zero.
fn show_validate #
fn show_validate(result ValidateResult)
show_validate prints to the console the results of a validate() run: the classifier parameters, the number of instances validated, and for each instance its index, inferred class, and nearest-neighbor counts by class.
fn show_verify #
fn show_verify(result CrossVerifyResult, opts Options)
show_verify prints to the console the results of a verify() run: the classifier parameters (or multiple-classifier settings), instance counts, per-class accuracy, confusion matrix (when expanded), and binary metrics for two-class problems.
fn transpose #
fn transpose[T](matrix [][]T) [][]T
transpose returns the transpose of a 2D array (rows become columns and vice versa).
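For example (a sketch; the import path is an assumption):

```v
import holder66.vhammll

fn main() {
	m := [[1, 2, 3], [4, 5, 6]]
	println(vhammll.transpose(m)) // [[1, 4], [2, 5], [3, 6]]
}
```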
fn validate #
fn validate(cl Classifier, opts Options) !ValidateResult
validate classifies each instance of a validation datafile against a trained Classifier; returns the predicted classes for each case of the validation_set. The file to be validated is specified by opts.testfile_path. Optionally, saves the cases and their predicted classes in a file. This file can be used to append these cases to the classifier.
fn verify #
fn verify(opts Options) CrossVerifyResult
verify classifies all the instances in a verification datafile (specified by opts.testfile_path) using a trained Classifier; returns metrics comparing the inferred classes to the labeled (assigned) classes of the verification datafile.
Optional (also see `make_classifier.v` for options in training a classifier):
weighting_flag: nearest neighbor counts are weighted by
class prevalences.
traverse_all_flags: when used with multiple_flag, iterates over all
combinations of multi_strategy and break_on_all_flag and prints each
result side by side (3 strategies × 2 break_on_all_flag settings).
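A verification sketch, assuming verify() trains on opts.datafile_path and classifies the cases in opts.testfile_path (the train/test file paths here are hypothetical):

```v
import holder66.vhammll

fn main() {
	// hypothetical paths: train on one file, verify against another
	opts := vhammll.Options{
		datafile_path:  'datasets/train.tab'
		testfile_path:  'datasets/test.tab'
		weighting_flag: true
		expanded_flag:  true
	}
	result := vhammll.verify(opts)
	// CrossVerifyResult fields such as confusion_matrix are documented below
	println(result.confusion_matrix)
}
```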
type StringFloatMap #
type StringFloatMap = map[string]f64
struct AnalyzeResult #
struct AnalyzeResult {
LoadOptions
pub mut:
struct_type string = '.AnalyzeResult'
environment Environment
datafile_path string
datafile_type string
class_name string
class_index int
class_counts map[string]int
attributes []Attribute
overall_min f32
overall_max f32
use_inferred_types_flag bool
}
AnalyzeResult is returned by analyze_dataset(); it contains per-attribute statistics, dataset-level metadata (path, type, class breakdown), and overall min/max values.
struct Attribute #
struct Attribute {
pub mut:
id int
name string
count int
counts_map map[string]int
uniques int
missing int
raw_type string
att_type string
inferred_type string
for_training bool
min f32
max f32
mean f32
median f32
}
Attribute holds descriptive statistics and metadata for a single attribute in a dataset, as produced by analyze_dataset(): name, type, unique-value count, missing-value count, and (for continuous attributes) min, max, mean, and median.
struct BinaryCounts #
struct BinaryCounts {
pub mut:
t_p int
f_n int
t_n int
f_p int
}
BinaryCounts holds the raw binary-classification confusion counts (true positives, false negatives, true negatives, false positives) before metric calculation.
struct BinaryMetrics #
struct BinaryMetrics {
pub mut:
t_p int
f_n int
t_n int
f_p int
raw_acc f64
bal_acc f64
sens f64
spec f64
ppv f64
npv f64
f1_score_binary f64
mcc f64 // Matthews Correlation Coefficient
}
BinaryMetrics holds performance metrics for a binary classifier: TP, FP, TN, FN counts; raw and balanced accuracy; sensitivity; specificity; PPV; NPV; F1 score; and the Matthews Correlation Coefficient (MCC).
struct Binning #
struct Binning {
mut:
lower int
upper int
interval int
}
Binning specifies the lower bound, upper bound, and step interval for binning (discretising) continuous attribute values.
struct Class #
struct Class {
pub mut:
class_name string // the attribute which holds the class
class_index int
classes []string // to ensure that the ordering remains the same
// positive_class string
class_values []string
missing_class_values []int // these are the indices of the original class values array
class_counts map[string]int
pre_balance_prevalences_class_counts map[string]int
lcm_class_counts i64
prepurge_class_values_len int
postpurge_class_counts map[string]int
postpurge_lcm_class_counts i64
}
Class holds all class-attribute metadata for a dataset: the class attribute name and index, the unique class labels, per-class instance counts, and pre/post-purge statistics.
struct Classifier #
struct Classifier {
History
Parameters
LoadOptions
Class
pub mut:
struct_type string = '.Classifier'
datafile_path string
attribute_ordering []string
trained_attributes map[string]TrainedAttribute
// maximum_hamming_distance int
indices []int
instances [][]u8
// history []HistoryEvent
}
Classifier is a fully trained classifier produced by make_classifier(). It contains the trained attribute maps, encoded instance byte arrays, class information, and the creation history needed to reproduce or extend the classifier.
struct ClassifierID #
struct ClassifierID {
pub mut:
classifier_id int
datafile_path string
}
ClassifierID links a numeric classifier identifier to the datafile path from which the classifier was trained.
struct ClassifierSettings #
struct ClassifierSettings {
Parameters
BinaryMetrics
Metrics
LoadOptions
ClassifierID
}
ClassifierSettings bundles all parameters needed to recreate a single classifier, together with the binary and multi-class performance metrics recorded when it was evaluated.
struct ClassifyResult #
struct ClassifyResult {
LoadOptions
Class
pub mut:
struct_type string = '.ClassifyResult'
index int
inferred_class string
inferred_class_array []string
labeled_class string
nearest_neighbors_by_class []int
nearest_neighbors_array [][]int
classes []string
class_counts map[string]int
weighting_flag bool
weighting_flag_array []bool
multiple_flag bool
hamming_distance int
sphere_index int
}
ClassifyResult holds the outcome of classifying a single instance: the inferred class, nearest-neighbor counts by class, the labeled class (if known), Hamming distance, and sphere index reached.
struct CliOptions #
struct CliOptions {
LoadOptions
pub mut:
args []string
astr string
}
CliOptions allows the cli() function to be driven programmatically: pass either a pre-split args slice or a single space-separated string (astr). If both are empty, os.args is used.
struct Cmd #
struct Cmd {
pub mut:
cmd string
}
Cmd carries the command name used by opts() to set Options.command when constructing an Options struct from a string.
struct CombinationSizeLimits #
struct CombinationSizeLimits {
pub mut:
generate_combinations_flag bool
min int
max int
}
CombinationSizeLimits controls the minimum and maximum number of classifiers to combine when generating multi-classifier combinations. Setting min/max also activates the generate_combinations_flag.
struct CrossVerifyResult #
struct CrossVerifyResult {
Parameters
LoadOptions
DisplaySettings
Metrics
BinaryMetrics
MultipleOptions // MultipleClassifierSettingsArray
Class
pub mut:
struct_type string = '.CrossVerifyResult'
command string
datafile_path string
testfile_path string
multiple_classify_options_file_path string
multiple_classifier_settings []ClassifierSettings
labeled_classes []string
actual_classes []string
inferred_classes []string
nearest_neighbors_by_class [][]int
instance_indices []int
// classes []string
// class_counts map[string]int
// pre_balance_prevalences_class_counts map[string]int
train_dataset_class_counts map[string]int
labeled_instances map[string]int
correct_inferences map[string]int
incorrect_inferences map[string]int
wrong_inferences map[string]int
true_positives map[string]int
false_positives map[string]int
true_negatives map[string]int
false_negatives map[string]int
// outer key: actual class; inner key: predicted class
confusion_matrix_map map[string]StringFloatMap
pos_neg_classes []string
correct_count int
incorrects_count int
wrong_count int
total_count int
bin_values []int // used for displaying the binning range for explore
attributes_used int
prepurge_instances_counts_array []int
classifier_instances_counts []int
repetitions int
confusion_matrix [][]string
// trained_attribute_maps_array []map[string]TrainedAttribute
trained_attribute_maps_array []map[string]TrainedAttribute
}
CrossVerifyResult is returned by cross_validate() and verify(). It contains the inferred and actual class arrays, a full confusion matrix, per-class inference counts, accuracy metrics, and provenance information (file paths, classifier settings).
struct Dataset #
struct Dataset {
Class // DataDict
LoadOptions
pub mut:
struct_type string = '.Dataset'
path string
attribute_names []string
attribute_flags []string
raw_attribute_types []string
attribute_types []string
inferred_attribute_types []string
data [][]string
useful_continuous_attributes map[int][]f32
useful_discrete_attributes map[int][]string
row_identifiers []string
}
Dataset is the primary data structure produced by load_file(). It holds all attribute data and types, class information, and pre-computed maps of useful continuous and discrete attributes ready for training a classifier.
fn (Dataset) purge_instances_for_missing_class_values #
fn (mut ds Dataset) purge_instances_for_missing_class_values() Dataset
purge_instances_for_missing_class_values removes all instances in ds whose class value is in the missings list, updating class_values, class_counts, data, and the useful attribute maps in place, then returns the modified dataset.
struct DefaultVals #
struct DefaultVals {
pub mut:
missings []string = ['?', '', 'NA', ' ']
integer_range_for_discrete []int = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
DefaultVals holds configurable default values used during dataset loading: the string tokens recognised as missing values, and the integer range treated as discrete rather than continuous.
struct DisplaySettings #
struct DisplaySettings {
CombinationSizeLimits
pub mut:
show_flag bool
expanded_flag bool
show_attributes_flag bool
graph_flag bool
help_flag bool
verbose_flag bool
generate_roc_flag bool
limit_output int
limit_continuous int
overfitting_flag bool
all_attributes_flag bool
}
DisplaySettings aggregates all flags and limits that control what is printed to the console or generated as plots: show, expand, graph, verbose, ROC, overfitting, output limits, and combination size limits.
struct Environment #
struct Environment {
pub mut:
vhammll_version string
// cached_cpuinfo map[string]string
os_kind string
os_details string
arch_details []string
vexe_mtime string
v_full_version string
vflags string
}
Environment captures a snapshot of the runtime environment (OS kind and details, architecture, V executable mtime and version, and VFLAGS) recorded in classifier history events.
struct ExploreResult #
struct ExploreResult {
Class
Parameters
LoadOptions
AttributeRange
DisplaySettings
pub mut:
struct_type string = '.ExploreResult'
path string
testfile_path string
pos_neg_classes []string
array_of_results []CrossVerifyResult
// accuracy_types []string = ['raw accuracy', 'balanced accuracy', ' MCC (Matthews Correlation Coefficient)']
// analytics []MaxSettings
// analytics map[string]Analytics
args []string
}
ExploreResult is returned by explore(); it holds the array of CrossVerifyResults produced over a parameter sweep, together with the attribute range, binning, and display settings used.
struct History #
struct History {
pub mut:
history_events []HistoryEvent
}
History wraps the ordered list of HistoryEvent records that track how a Classifier was created and subsequently extended.
struct HistoryEvent #
struct HistoryEvent {
Environment
pub mut:
event_date string
instances_count int
prepurge_instances_count int
// event_environment Environment
event string
file_path string
}
HistoryEvent records a single event (create or append) in a classifier's lifecycle, capturing the date, instance counts before and after any purging, and the source file path.
struct LoadOptions #
struct LoadOptions {
DefaultVals
pub mut:
positive_class string
class_missing_purge_flag bool
balance_prevalences_flag bool
balance_prevalences_threshold f64 = 0.9
}
LoadOptions are passed to load_file() to control dataset loading: the positive class label, whether to purge instances with a missing class value, and whether to balance class prevalences.
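These fields are also promoted into Options, so they can be supplied to any command that loads a datafile. A sketch (the path is hypothetical) using analyze_dataset, whose signature appears above:

```v
import vhammll

// purge instances whose class value is missing, and
// balance class prevalences during loading
opts := vhammll.Options{
	datafile_path: 'datasets/class_missing.tab' // hypothetical path
	class_missing_purge_flag: true
	balance_prevalences_flag: true
}
result := vhammll.analyze_dataset(opts)
```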
struct Metrics #
struct Metrics {
pub mut:
precision []f64
recall []f64
f1_score []f64
avg_precision []f64
avg_recall []f64
avg_f1_score []f64
avg_type []string
balanced_accuracy f64
class_counts_int []int
correct_counts []int
incorrect_counts []int
}
Metrics holds multi-class accuracy metrics computed for a verification or cross-validation: precision, recall, and F1 per class; their averages; balanced accuracy; and per-class instance, correct, and incorrect counts.
struct MultipleOptions #
struct MultipleOptions {
TotalNnParams
pub mut:
break_on_all_flag bool
multi_strategy string // '', 'combined', or 'totalnn'
classifiers []int // refers to an array of classifier ID values
}
MultipleOptions holds settings that govern how multiple classifiers are combined: whether to stop as soon as all classifiers agree (break_on_all_flag), which combination strategy to use (multi_strategy: '', 'combined', or 'totalnn'), and which classifier IDs to include.
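Since MultipleOptions is embedded in Options, these settings can be passed the same way as any other option; a sketch with illustrative classifier IDs:

```v
import vhammll

opts := vhammll.Options{
	multiple_flag: true // from embedded Parameters
	multi_strategy: 'combined' // '', 'combined', or 'totalnn'
	classifiers: [0, 3, 7] // illustrative classifier ID values
	break_on_all_flag: true // stop once all classifiers agree
}
```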
struct OneVsRestClassifier #
struct OneVsRestClassifier {
Parameters
LoadOptions
Class
History
pub mut:
struct_type string = '.OneVsRestClassifier'
datafile_path string
}
OneVsRestClassifier holds metadata for a one-vs-rest classification strategy, used when classifying multiclass problems by training a separate binary classifier for each class against all other classes.
struct OptimalsResult #
struct OptimalsResult {
RocData
RocFiles
pub mut:
settings_length int
settings_purged int
all_attributes_flag bool
settingsfile_path string
datafile_path string
class_counts []int
best_balanced_accuracies []f64
best_balanced_accuracies_classifiers_all [][]int // refers to an array of classifier ID values
best_balanced_accuracies_classifiers [][]int
mcc_max f64
mcc_max_classifiers_all []int // refers to an array of classifier ID values
mcc_max_classifiers []int
correct_inferences_total_max int
correct_inferences_total_max_classifiers_all []int // refers to an array of classifier ID values
correct_inferences_total_max_classifiers []int
classes []string
correct_inferences_by_class_max []int
correct_inferences_by_class_max_classifiers_all [][]int // refers to an array of classifier ID values
correct_inferences_by_class_max_classifiers [][]int
receiver_operating_characteristic_settings []int
reversed_receiver_operating_characteristic_settings []int
all_optimals []int
all_optimals_unique_attributes []int
multi_classifier_combinations_for_auc []AucClassifiers
}
OptimalsResult is returned by optimals(); it identifies which classifier combinations achieve the best balanced accuracy, highest Matthews Correlation Coefficient (MCC), highest total correct inferences, and highest per-class correct inferences.
struct Options #
struct Options {
Parameters
LoadOptions
DisplaySettings
MultipleOptions // MultipleClassifierSettingsArray
pub mut:
struct_type string = '.Options'
non_options []string
bins []int = [2, 16]
explore_rank []int
partition_sizes []int
concurrency_flag bool
datafile_path string
traverse_all_flags bool
testfile_path string
outputfile_path string
classifierfile_path string
instancesfile_path string
multiple_classify_options_file_path string
multiple_classifier_settings []ClassifierSettings
settingsfile_path string
roc_settingsfile_path string
partitionfiles_paths []string
append_settings_flag bool
command string
args []string
kagglefile_path string
}
Options is the main all-in-one options struct used throughout the library. It embeds Parameters, LoadOptions, DisplaySettings, and MultipleOptions, and adds file paths (data, test, classifier, output, settings) and runtime flags such as concurrency and traverse_all_flags. It can be used as the last parameter of a function to pass named options with defaults.
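Since every embedded field is promoted, an Options value can be built as a single struct literal of named fields, with everything unspecified left at its default. A minimal sketch (the file path is hypothetical), passed to analyze_dataset, whose signature appears above:

```v
import vhammll

opts := vhammll.Options{
	datafile_path: 'datasets/developer.tab' // hypothetical path
	bins: [3, 6] // overrides the default [2, 16]
	number_of_attributes: [4] // from embedded Parameters
	weighting_flag: true // from embedded Parameters
	show_flag: true // from embedded DisplaySettings
	concurrency_flag: true
}
result := vhammll.analyze_dataset(opts)
```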
struct Parameters #
struct Parameters {
pub mut:
binning Binning
number_of_attributes []int = [0]
uniform_bins bool
exclude_flag bool
purge_flag bool
weighting_flag bool
weight_ranking_flag bool
// switches_flag enables the dominant-class switch metric for 2-class
// datasets during attribute ranking (-sw / --switches). When true, bin
// counts whose switch count exceeds switches_threshold are excluded from
// the search for the maximum rank value; attributes where every bin count
// exceeds the threshold receive rank value 0 (treated as noise).
switches_flag bool
// switches_threshold sets the maximum permitted number of dominant-class
// switches per bin count when switches_flag is active (-swt). Valid values
// are 1 through the upper binning limit; default 2 allows U-shaped /
// inverted-U dose-response curves while blocking likely noise.
switches_threshold int = 2
one_vs_rest_flag bool
multiple_flag bool
folds int
repetitions int
random_pick bool
// balance_prevalences_flag bool
maximum_hamming_distance int
}
Parameters holds the core training and cross-validation settings shared across many operations: binning range, number of attributes, purge/weighting/one-vs-rest flags, fold and repetition counts, and the maximum Hamming distance.
struct PlotResult #
struct PlotResult {
pub mut:
bin int
attributes_used int
correct_count int
total_count int
}
PlotResult holds a single data point for plotting accuracy vs parameter settings: bin count, number of attributes used, correct-inference count, and total instance count.
struct RankedAttribute #
struct RankedAttribute {
pub mut:
attribute_index int
attribute_name string
attribute_type string
rank_value f32
rank_value_array []f32
bins int
array_of_hits_arrays [][][]int
// switches is the number of dominant-class flips across bins at the
// selected bin count; -1 means not applicable (multi-class or discrete).
switches int = -1
switches_array []int
}
RankedAttribute represents a single attribute together with its computed rank value, optimal bin count, and supporting hit arrays.
struct RankingResult #
struct RankingResult {
Class
LoadOptions
DisplaySettings
pub mut:
struct_type string = '.RankingResult'
path string
exclude_flag bool
weight_ranking_flag bool
switches_flag bool
switches_threshold int
binning Binning
array_of_ranked_attributes []RankedAttribute
}
RankingResult is returned by rank_attributes() and rank_one_vs_rest(); it contains the ordered list of ranked attributes and the options used to produce the ranking.
struct ResultForClass #
struct ResultForClass {
pub mut:
labeled_instances int
correct_inferences int
incorrect_inferences int
wrong_inferences int
confusion_matrix_row map[string]int
}
ResultForClass holds per-class tabulation of labeled, correct, incorrect, and wrong inferences for a single class in a verification or cross-validation run.
struct RocFiles #
struct RocFiles {
pub mut:
datafile string
testfile string
settingsfile string
}
RocFiles holds the file paths associated with a ROC plot: the training datafile, the test/verification file, and the classifier settings file used to generate the curves.
struct SettingsForROC #
struct SettingsForROC {
pub mut:
master_class_index int
classifiers_for_roc []ClassifierSettings
array_of_correct_counts [][]int
}
SettingsForROC collects the per-fold classifier settings and correct-count arrays needed to generate a ROC curve after a cross-validation run.
struct TrainedAttribute #
struct TrainedAttribute {
pub mut:
attribute_type string
translation_table map[string]int
minimum f32
maximum f32
bins int
rank_value f32
index int
// switches is the number of dominant-class flips across bins at the
// selected bin count; -1 means not applicable (multi-class or discrete).
switches int = -1
folds_count int // for cross-validations, this tracks how many folds use this attribute
}
TrainedAttribute holds the training-time representation of a single attribute: its type, the value-to-integer translation table (discrete) or range and bin count (continuous), rank value, and fold-usage counter.
struct ValidateResult #
struct ValidateResult {
Class
Parameters
LoadOptions
pub mut:
struct_type string = '.ValidateResult'
datafile_path string
validate_file_path string
row_identifiers []string
inferred_classes []string
counts [][]int
instances [][]u8
attributes_used int
prepurge_instances_counts_array []int
classifier_instances_counts []int
}
ValidateResult is returned by validate(); it contains the inferred classes for an unlabeled dataset, the encoded instance arrays, and provenance metadata. The result can be saved and later used to extend a classifier via append_instances().
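The round trip might look like the sketch below: classify an unlabeled file, then fold the inferred instances back into the classifier. The make_classifier() and validate() signatures here are assumptions (only append_instances() is documented above), and the file paths are hypothetical:

```v
import vhammll

opts := vhammll.Options{
	datafile_path: 'train.tab' // hypothetical path
	testfile_path: 'unlabeled.tab' // hypothetical path
}
// assumed signatures: make_classifier(Options) Classifier,
// validate(Classifier, Options) !ValidateResult
cl := vhammll.make_classifier(opts)
val_result := vhammll.validate(cl, opts) or { panic(err) }
// append_instances() is documented above; it returns the extended classifier
extended := vhammll.append_instances(cl, val_result, opts)
```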