vhammll #
fn analyze_dataset #
fn analyze_dataset(opts Options) AnalyzeResult
analyze_dataset returns a struct with information about a datafile.
Optional:
if show_flag is true, displays on the console (using show_analyze):
1. a list of attributes, their types, the unique values, and a count of
missing values;
2. a table with counts for each type of attribute;
3. a list of discrete attributes useful for training a classifier;
4. a list of continuous attributes useful for training a classifier;
5. a breakdown of the class attribute, showing counts for each class.
class_missing_purge_flag (-pmc): if true, removes instances whose class
value is missing before analysis;
outputfile_path: if specified, saves the analysis results.
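As an illustrative sketch (the import path and the ability to set these Options fields via a struct literal are assumptions, not confirmed by this chunk):

```v
import holder66.vhammll

fn main() {
	// hypothetical: analyze a dataset and show the tables on the console
	opts := vhammll.Options{
		datafile_path: 'datasets/iris.tab'
		show_flag:     true
	}
	result := vhammll.analyze_dataset(opts)
	// AnalyzeResult fields such as class_counts are documented below
	println(result.class_counts)
}
```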
fn append_instances #
fn append_instances(cl Classifier, instances_to_append ValidateResult, opts Options) Classifier
append_instances extends a classifier by adding more instances. It returns the extended classifier struct.
Output options:
show_flag: display results on the console;
outputfile_path: saves the extended classifier to a file.
fn auc_roc #
fn auc_roc(roc_point_array []RocPoint) f64
auc_roc returns the area under the Receiver Operating Characteristic (ROC) curve for an array of ROC points.
fn cli #
fn cli(cli_options CliOptions) !
cli() is the command line interface app for the holder66.vhammll ML library.
Usage: v run main.v [command] [flags and options] <path_to_file>
Datafiles should be either tab-delimited, or have extension .csv or .arff
Commands: analyze | append | cross | display | examples | explore
| make | optimals | orange | partition | query | rank | validate | verify
To get help with individual commands, type `v run main.v [command] -h`
Flags and options (note that most are specific to commands):
-a --attributes, followed by one, two, or three integers: Parameters.number_of_attributes
-aa --all-attributes: for each category of optimals, retain all settings (the
default is to retain only settings with unique attribute numbers in each category);
-af --all-flags: Options.traverse_all_flags
-b --bins, followed by one, two, or three integers: Binning
A single value will be used for all attributes; two integers for a range of bin
values; a third integer specifies an interval for the range (note that
the binning range is from the upper to the lower value);
note: when doing an explore, the first integer specifies the lower
limit for the number of bins, and the second gives the upper value
for the explore range. Example: explore -b 3,6 would first use 3 - 3,
then 3 - 4, then 3 - 5, and finally 3 - 6 for the binning ranges.
If the uniform flag is true, then a single integer specifies
the number of bins for all continuous attributes; two integers for a
range of uniform bin values for the explore command; a third integer
for the interval to be used over the explore range;
-bp --balanced-prevalences: Parameters.balance_prevalences_flag
-bpt --balance-prevalences-threshold, followed by a float: the ratio threshold
below which class prevalences are considered imbalanced (default 0.9):
LoadOptions.balance_prevalences_threshold
-c --concurrent, permit parallel processing to use multiple cores: Options.concurrency_flag
-cl --combination-limits: sets minimum and maximum lengths for combinations
of multiple classifiers: Options.DisplaySettings.CombinationSizeLimits;
entering values for limits also sets the generate_combinations_flag so
that classifier combinations will be generated.
-e --expanded, expanded results on the console: DisplaySettings.expanded_flag
-ea: display information about trained attributes on the console, for
classification operations: DisplaySettings.show_attributes_flag
-exr --explore-rank, followed by e.g. '2,7', will repeat the ranking
exercise over the binning range from 2 through 7
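A few illustrative invocations from the repository root, using the flags listed above (hypothetical; adjust paths to your setup):

```
v run main.v analyze datasets/iris.tab
v run main.v cross -c -a 2 -b 3,6 datasets/iris.tab
v run main.v rank -g -x datasets/iris.tab
```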
fn cross_validate #
fn cross_validate(opts Options) CrossVerifyResult
cross_validate performs n-fold cross-validation on a dataset: it partitions the dataset's instances into n folds; for each fold, it trains a classifier on all the instances not in that fold and then uses this classifier to classify the fold's cases. The classification results over all n folds are summarized.
Options (also see the Options struct):
bins: range for binning or slicing of continuous attributes;
number_of_attributes: the number of attributes to use, in descending
order of rank value;
exclude_flag: excludes missing values when ranking attributes;
weighting_flag: nearest neighbor counts are weighted by
class prevalences;
folds: number of folds n to use for n-fold cross-validation (default
is leave-one-out cross-validation);
repetitions: number of times to repeat n-fold cross-validations;
random-pick: choose instances randomly for n-fold cross-validations.
balance_prevalences_flag / balance_prevalences_threshold: when set,
duplicates minority-class instances until prevalences are balanced;
threshold is the min/max ratio below which balancing triggers (default 0.9).
traverse_all_flags: when used with multiple_flag, iterates over all
combinations of multi_strategy and break_on_all_flag and prints each
result side by side (3 strategies × 2 break_on_all_flag settings).
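The options above might be combined as follows (a minimal sketch; the assumption that number_of_attributes is an []int field, and the import path, are not confirmed by this chunk):

```v
import holder66.vhammll

fn main() {
	// sketch: 10-fold cross-validation using the 2 highest-ranked attributes,
	// with nearest-neighbor counts weighted by class prevalences
	opts := vhammll.Options{
		datafile_path:        'datasets/iris.tab'
		folds:                10
		number_of_attributes: [2]
		weighting_flag:       true
		show_flag:            true
	}
	result := vhammll.cross_validate(opts)
	// CrossVerifyResult fields such as correct_count are documented below
	println(result.correct_count)
}
```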
fn display_file #
fn display_file(path string, in_opts Options)
display_file displays on the console a results file produced by other vhammll functions; a multiple-classifier settings file; or graphs for explore, ranking, or cross-validation results.
display_file('path_to_saved_results_file', expanded_flag: true)
Output options:
expanded_flag: display additional information on the console, including
a confusion matrix for cross-validation or verification operations;
graph_flag: generates plots for display in the default web browser;
Options for displaying classifier settings files (suffix .opts):
show_attributes_flag: list the attributes used by each classifier;
classifiers: a list of classifier IDs to display.
fn explore #
fn explore(opts Options) ExploreResult
explore runs a series of cross-validations or verifications, over a range of attributes and a range of binning values.
Options (also see the Options struct):
append_settings_flag: if true, appends classifier settings to an
opts file;
bins: range for binning or slicing of continuous attributes;
uniform_bins: same number of bins for all continuous attributes;
number_of_attributes: range for attributes to include;
exclude_flag: excludes missing values when ranking attributes;
weighting_flag: nearest neighbor counts are weighted by
class prevalences;
folds: number of folds n to use for n-fold cross-validation (default
is leave-one-out cross-validation);
repetitions: number of times to repeat n-fold cross-validations;
random-pick: choose instances randomly for n-fold cross-validations.
Output options:
show_flag: display results on the console;
expanded_flag: display additional information on the console, including
a confusion matrix for each explore step;
graph_flag: generate plots of Receiver Operating Characteristics (ROC)
by attributes used; ROC by bins used, and accuracy by attributes
used.
balance_prevalences_flag / balance_prevalences_threshold: when set,
duplicates minority-class instances until prevalences are balanced;
threshold is the min/max ratio below which balancing triggers (default 0.9);
traverse_all_flags: repeat the explore operation for all possible
combinations of the flags uniform_bins, weight_ranking_flag, etc;
note that if -bp is also set, then -af will only traverse the settings
which make sense in the context of balancing prevalences;
outputfile_path: saves the result to a file.
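A sketch of an explore run over an attribute range and a binning range (field types for bins and number_of_attributes are assumed to be []int ranges; adjust to the Options struct in your version):

```v
import holder66.vhammll

fn main() {
	// sketch: sweep from 1 to 4 attributes over binning range 2 through 6
	opts := vhammll.Options{
		datafile_path:        'datasets/iris.tab'
		number_of_attributes: [1, 4]
		bins:                 [2, 6]
		show_flag:            true
	}
	er := vhammll.explore(opts)
	// one CrossVerifyResult per explore step
	println(er.array_of_results.len)
}
```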
fn file_type #
fn file_type(path string) string
file_type returns a string identifying how a dataset is structured or formatted, e.g. 'orange_newer', 'orange_older', 'arff', or 'csv'. On the assumption that an 'orange_older' file always identifies a class attribute by having 'c' or 'class' in the third header line, all other tab-delimited datafiles are typed as 'orange_newer'.
Example
assert file_type('datasets/iris.tab') == 'orange_older'
fn get_environment #
fn get_environment() Environment
get_environment collects information about the computer, the operating system and its version, the version and build of V, the version of vhammll, and the date and time, and returns it as an Environment struct.
fn load_classifier_file #
fn load_classifier_file(path string) !Classifier
load_classifier_file loads a file generated by make_classifier(); returns a Classifier struct.
Example
cl := load_classifier_file('tempfolder/saved_classifier.txt')
fn load_file #
fn load_file(path string, opts LoadOptions) Dataset
load_file returns a struct containing the datafile's contents, suitable for generating a classifier
Example
ds := load_file('datasets/iris.tab')
fn load_instances_file #
fn load_instances_file(path string) !ValidateResult
load_instances_file loads a file generated by validate() or query(), and returns it as a struct, suitable for appending to a classifier.
Example
instances := load_instances_file('tempfolder/saved_validate_result.txt')
fn make_classifier #
fn make_classifier(opts Options) Classifier
make_classifier returns a Classifier struct, given a Dataset (as created by load_file).
Options (also see the Options struct):
bins: range for binning or slicing of continuous attributes;
uniform_bins: same number of bins for continuous attributes;
number_of_attributes: the number of highest-ranked attributes to include;
exclude_flag: excludes missing values when ranking attributes;
purge_flag: removes instances that are duplicates after binning,
considering only the attributes to be used;
balance_prevalences_flag: when true, duplicates instances of minority
classes until class prevalences are sufficiently balanced;
balance_prevalences_threshold: the min/max class-count ratio below which
balancing is triggered (default 0.9; set via -bpt);
switches_flag: when true and the dataset has exactly 2 classes, excludes
bin counts whose dominant-class switch count exceeds switches_threshold
from the rank value search; attributes where every bin count exceeds
the threshold receive rank value 0;
switches_threshold: maximum permitted switches per bin count when
switches_flag is active (default 2);
outputfile_path: if specified, saves the classifier to this file;
append_settings_flag / settingsfile_path: if set, appends the classifier
parameters as a ClassifierSettings entry to the given settings file;
class_missing_purge_flag (-pmc): if true, removes instances whose class
value is missing before training.
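A training sketch, assuming make_classifier loads the dataset from opts.datafile_path as the CLI 'make' command does (the []int type for number_of_attributes and the import path are assumptions):

```v
import holder66.vhammll

fn main() {
	// sketch: train on the 2 highest-ranked attributes and save the classifier
	opts := vhammll.Options{
		datafile_path:        'datasets/iris.tab'
		number_of_attributes: [2]
		exclude_flag:         true
		outputfile_path:      'tempfolder/saved_classifier.txt'
	}
	cl := vhammll.make_classifier(opts)
	// Classifier fields such as attribute_ordering are documented below
	println(cl.attribute_ordering)
}
```

The saved file can later be reloaded with load_classifier_file() and extended with append_instances().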
fn one_vs_rest_verify #
fn one_vs_rest_verify(opts Options) CrossVerifyResult
one_vs_rest_verify classifies all the cases in a verification datafile (specified by opts.testfile_path) using an array of trained Classifiers, one per class; each classifier is trained with one class vs all the other classes. It returns metrics comparing the inferred classes to the labeled (assigned) classes of the verification datafile.
Optional (also see `make_classifier.v` for options in training a classifier):
weighting_flag: nearest neighbor counts are weighted by
class prevalences.
Output options:
show_flag: display results on the console;
expanded_flag: display additional information on the console, including
a confusion matrix.
outputfile_path: saves the result as a json file.
fn optimals #
fn optimals(path string, opts Options) OptimalsResult
optimals determines which classifiers provide the best balanced accuracy, best Matthews Correlation Coefficient (MCC), highest total for correct inferences, and highest correct inferences per class, for multiple classifiers whose settings are stored in a settings file.
Options:
-cl --combination-limits: enumerate all combinations of the optimal classifiers
and compute each combination's AUC; an optional pair of integers sets the
minimum and maximum combination size.
-e --expanded: for each setting, print the Parameters, results obtained, and Metrics.
-g --graph: without -cl, plots the ROC curve for individual classifiers (binary only);
with -cl, plots AUC vs rank for every multi-classifier combination, one scatter
trace per combination length; silently skipped for multi-class datasets.
-l --limit-output: with -g -cl, caps each combination-length trace to the top N
combinations by AUC (0 = show all).
-p --purge: discard duplicate settings (identical parameters, different IDs).
-aa --all-attributes: show all settings in each category; default shows only those
with unique attribute counts.
-o --output: path to a file in which to save the (purged) settings.
-s --show: print only classifier IDs for each category.
fn opts #
fn opts(s string, c Cmd) Options
opts takes a string of command-line arguments and a Cmd giving the command name, and returns the corresponding Options struct.
fn plot_auc_combinations #
fn plot_auc_combinations(combos []AucClassifiers, files RocFiles, top_n int)
plot_auc_combinations generates an interactive scatter plot of the Area Under the ROC Curve (AUC) for each multi-classifier combination, with one trace per combination length. X-axis is rank within the length group (1 = highest AUC); Y-axis is AUC; hovering over a marker shows the constituent classifier IDs. The incoming combos slice must already be sorted descending by AUC (as done in optimals()). If top_n is greater than zero, only the top_n highest-AUC combinations are shown for each length; passing zero shows all combinations. Called by optimals when -g and -cl are both active on a binary dataset.
fn plot_mult_roc #
fn plot_mult_roc(rocdata_array []RocData, files RocFiles)
plot_mult_roc generates an interactive Receiver Operating Characteristic plot in the default web browser for one or more ROC traces. Each element of rocdata_array produces one scatter trace; the AUC is computed and shown in the plot title.
fn purge_instances_for_missing_class_values_not_inline #
fn purge_instances_for_missing_class_values_not_inline(mut ds Dataset) Dataset
purge_instances_for_missing_class_values_not_inline removes all instances whose class value is in the missings list, returning the modified dataset. This is a non-method wrapper around the method form; prefer the method form where possible.
fn query #
fn query(cl Classifier, opts Options) ClassifyResult
query takes a trained classifier and performs an interactive session with the user at the console, asking the user to input a value for each trained attribute. It then asks to confirm or redo the responses. Once confirmed, the instance is classified and the inferred class is shown. The classified instance can optionally be saved in a file. The saved instance can be appended to the classifier using append_instances().
fn rank_attributes #
fn rank_attributes(opts Options) RankingResult
rank_attributes takes a Dataset and returns a list of all the dataset's usable attributes, ranked in order of each attribute's ability to separate the classes.
Algorithm:
for each attribute:
create a matrix with attribute values for row headers, and
class values for column headers;
for each unique value `val` for that attribute:
for each unique value `class` of the class attribute:
for each instance:
accumulate a count for those instances whose class value
equals `class`;
populate the matrix with these accumulated counts;
for each `val`:
get the absolute values of the differences between accumulated
counts for each pair of `class` values;
add those absolute differences;
total those added absolute differences to get the raw rank value
for that attribute.
To obtain rank values weighted by class prevalences, use the same algorithm
except before taking the difference of each pair of accumulated counts,
multiply each count of the pair by the class prevalence of the other class.
(Note: rank_attributes always uses class prevalences as weights)
Obtain a maximum rank value by calculating a rank value for the class
attribute itself.
To obtain normalized rank values:
for each attribute:
divide its raw rank value by the maximum rank value and multiply by 100.
Sort the attributes by descending rank values.
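The raw rank-value computation above can be sketched for a single two-class discrete attribute (a standalone toy illustration, not library code):

```v
fn main() {
	// toy attribute values and class labels (two classes, 'x' and 'y')
	vals := ['a', 'a', 'b', 'b', 'b']
	classes := ['x', 'y', 'x', 'x', 'y']
	mut rank := 0
	for val in ['a', 'b'] {
		// accumulate per-class counts for this attribute value
		mut count_x := 0
		mut count_y := 0
		for i, v in vals {
			if v == val {
				if classes[i] == 'x' { count_x++ } else { count_y++ }
			}
		}
		// absolute difference between the pair of accumulated class counts
		diff := count_x - count_y
		rank += if diff < 0 { -diff } else { diff }
	}
	// 'a' contributes |1 - 1| = 0; 'b' contributes |2 - 1| = 1
	println(rank) // raw rank value for this attribute: 1
}
```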
Options:
`binning`: specifies the range for binning (slicing) continuous attributes;
`weight_ranking_flag`: applies the prevalences of each class in calculating rankings;
`exclude_flag`: exclude missing values when calculating rank values;
`explore_rank`: gives start and end values for the maximum binning number,
over which ranking is explored for different binning values;
`class_missing_purge_flag` (-pmc): if true, removes instances whose class
value is missing before ranking;
Output options:
`show_flag`: print the ranked list to the console;
`graph_flag`: generate a rank-values plot (continuous attributes, y axis)
vs number of bins (x axis); skipped silently when no continuous
attributes exist. When `switches_flag` is also set on a binary dataset,
a second plot of dominant-class switches vs bins is produced.
`overfitting_flag`: generates metrics/plots to help determine, for continuous
attributes, whether overfitting is occurring.
`weighting_flag`: for the hits per bin graph produced by the overfitting flag,
weights and normalizes the hits.
`outputfile_path`: saves the result as json.
rank_attributes loads a dataset from opts.datafile_path and ranks its attributes. This is the entry point used by the CLI 'rank' command and any caller that works with a file path. Callers that already have an in-memory Dataset (e.g. per-fold training partitions in cross-validation) should call rank_dataset directly to avoid reloading the full file.
fn rank_one_vs_rest #
fn rank_one_vs_rest(opts Options) RankingResult
rank_one_vs_rest ranks the dataset's attributes using a one-vs-rest strategy: for each class, instances of that class are treated as positives and all other instances as negatives. Attributes are ranked by their ability to separate each class from the rest. The result and output options are the same as for rank_attributes().
Algorithm: identical to the algorithm described above for rank_attributes(), applied separately for each class, with that class's instances as positives and all other instances as negatives.
Options:
-b --bins: specifies the range for binning (slicing) continuous attributes;
-x --exclude: exclude missing values when calculating rank values;
Output options:
`show_flag` to print the ranked list to the console;
`graph_flag` to generate plots of rank values for each attribute on the
y axis, with number of bins on the x axis.
`outputfile_path`, saves the result as json.
fn roc_values #
fn roc_values(pairs [][]f64, classifier_ids [][]int) []RocPoint
roc_values takes a list of pairs of sensitivity and specificity values, along with the corresponding list of classifier IDs, and returns a list of Receiver Operating Characteristic plot points (sensitivity vs 1 - specificity).
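The roc_values() and auc_roc() functions can be chained; here is a sketch with made-up sensitivity/specificity pairs and classifier IDs (the values and the import path are hypothetical):

```v
import holder66.vhammll

fn main() {
	// hypothetical (sensitivity, specificity) pairs with matching classifier IDs
	pairs := [[0.95, 0.60], [0.80, 0.85], [0.60, 0.95]]
	ids := [[1], [2], [3]]
	points := vhammll.roc_values(pairs, ids)
	// area under the resulting ROC curve
	println(vhammll.auc_roc(points))
}
```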
fn save_json_file #
fn save_json_file[T](u T, path string)
save_json_file serialises the value u to JSON and writes it to path, overwriting any existing file. Works with any type T.
fn show_analyze #
fn show_analyze(result AnalyzeResult)
show_analyze prints to the console a series of tables detailing a dataset. It takes as input an AnalyzeResult struct generated by analyze_dataset().
fn show_classifier #
fn show_classifier(cl Classifier)
show_classifier outputs to the console information about a classifier.
fn show_crossvalidation #
fn show_crossvalidation(result CrossVerifyResult, opts Options)
show_crossvalidation prints to the console the results of a cross_validate() run: the partitioning scheme, classifier parameters (or multiple-classifier settings), instance counts, per-class accuracy, confusion matrix (when expanded), and binary metrics for two-class problems.
fn show_rank_attributes #
fn show_rank_attributes(result RankingResult)
show_rank_attributes prints to the console the ranked list of attributes in result, showing each attribute's name, index, type, rank value, and bin count. Output is limited to result.limit_output entries when that field is non-zero.
fn show_validate #
fn show_validate(result ValidateResult)
show_validate prints to the console the results of a validate() run: the classifier parameters, the number of instances validated, and for each instance its index, inferred class, and nearest-neighbor counts by class.
fn show_verify #
fn show_verify(result CrossVerifyResult, opts Options)
show_verify prints to the console the results of a verify() run: the classifier parameters (or multiple-classifier settings), instance counts, per-class accuracy, confusion matrix (when expanded), and binary metrics for two-class problems.
fn transpose #
fn transpose[T](matrix [][]T) [][]T
transpose returns the transpose of a 2D array (rows become columns and vice versa).
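For example (a sketch; the import path is an assumption):

```v
import holder66.vhammll

fn main() {
	m := [[1, 2, 3], [4, 5, 6]]
	println(vhammll.transpose(m)) // [[1, 4], [2, 5], [3, 6]]
}
```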
fn validate #
fn validate(cl Classifier, opts Options) !ValidateResult
validate classifies each instance of a validation datafile against a trained Classifier; returns the predicted classes for each case of the validation_set. The file to be validated is specified by opts.testfile_path. Optionally, saves the cases and their predicted classes in a file. This file can be used to append these cases to the classifier.
fn verify #
fn verify(opts Options) CrossVerifyResult
verify classifies all the instances in a verification datafile (specified by opts.testfile_path) using a trained Classifier; returns metrics comparing the inferred classes to the labeled (assigned) classes of the verification datafile.
Optional (also see `make_classifier.v` for options in training a classifier):
weighting_flag: nearest neighbor counts are weighted by
class prevalences.
traverse_all_flags: when used with multiple_flag, iterates over all
combinations of multi_strategy and break_on_all_flag and prints each
result side by side (3 strategies × 2 break_on_all_flag settings).
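A verification sketch, assuming verify() trains on opts.datafile_path and classifies the cases in opts.testfile_path (the train/test file paths here are hypothetical):

```v
import holder66.vhammll

fn main() {
	// hypothetical paths: train on one file, verify against another
	opts := vhammll.Options{
		datafile_path:  'datasets/train.tab'
		testfile_path:  'datasets/test.tab'
		weighting_flag: true
		expanded_flag:  true
	}
	result := vhammll.verify(opts)
	// CrossVerifyResult fields such as confusion_matrix are documented below
	println(result.confusion_matrix)
}
```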
type StringFloatMap #
type StringFloatMap = map[string]f64
struct AnalyzeResult #
struct AnalyzeResult {
LoadOptions
pub mut:
struct_type string = '.AnalyzeResult'
environment Environment
datafile_path string
datafile_type string
class_name string
class_index int
class_counts map[string]int
attributes []Attribute
overall_min f32
overall_max f32
use_inferred_types_flag bool
}
AnalyzeResult is returned by analyze_dataset(); it contains per-attribute statistics, dataset-level metadata (path, type, class breakdown), and overall min/max values.
struct Attribute #
struct Attribute {
pub mut:
id int
name string
count int
counts_map map[string]int
uniques int
missing int
raw_type string
att_type string
inferred_type string
for_training bool
min f32
max f32
mean f32
median f32
}
Attribute holds descriptive statistics and metadata for a single attribute in a dataset, as produced by analyze_dataset(): name, type, unique-value count, missing-value count, and (for continuous attributes) min, max, mean, and median.
struct BinaryCounts #
struct BinaryCounts {
pub mut:
t_p int
f_n int
t_n int
f_p int
}
BinaryCounts holds the raw binary-classification confusion counts (true positives, false negatives, true negatives, false positives) before metric calculation.
struct BinaryMetrics #
struct BinaryMetrics {
pub mut:
t_p int
f_n int
t_n int
f_p int
raw_acc f64
bal_acc f64
sens f64
spec f64
ppv f64
npv f64
f1_score_binary f64
mcc f64 // Matthews Correlation Coefficient
}
BinaryMetrics holds performance metrics for a binary classifier: TP, FP, TN, FN counts; raw and balanced accuracy; sensitivity; specificity; PPV; NPV; F1 score; and the Matthews Correlation Coefficient (MCC).
struct Binning #
struct Binning {
mut:
lower int
upper int
interval int
}
Binning specifies the lower bound, upper bound, and step interval for binning (discretising) continuous attribute values.
struct Class #
struct Class {
pub mut:
class_name string // the attribute which holds the class
class_index int
classes []string // to ensure that the ordering remains the same
// positive_class string
class_values []string
missing_class_values []int // these are the indices of the original class values array
class_counts map[string]int
pre_balance_prevalences_class_counts map[string]int
lcm_class_counts i64
prepurge_class_values_len int
postpurge_class_counts map[string]int
postpurge_lcm_class_counts i64
}
Class holds all class-attribute metadata for a dataset: the class attribute name and index, the unique class labels, per-class instance counts, and pre/post-purge statistics.
struct Classifier #
struct Classifier {
History
Parameters
LoadOptions
Class
pub mut:
struct_type string = '.Classifier'
datafile_path string
attribute_ordering []string
trained_attributes map[string]TrainedAttribute
// maximum_hamming_distance int
indices []int
instances [][]u8
// history []HistoryEvent
}
Classifier is a fully trained classifier produced by make_classifier(). It contains the trained attribute maps, encoded instance byte arrays, class information, and the creation history needed to reproduce or extend the classifier.
struct ClassifierID #
struct ClassifierID {
pub mut:
classifier_id int
datafile_path string
}
ClassifierID links a numeric classifier identifier to the datafile path from which the classifier was trained.
struct ClassifierSettings #
struct ClassifierSettings {
Parameters
BinaryMetrics
Metrics
LoadOptions
ClassifierID
}
ClassifierSettings bundles all parameters needed to recreate a single classifier, together with the binary and multi-class performance metrics recorded when it was evaluated.
struct ClassifyResult #
struct ClassifyResult {
LoadOptions
Class
pub mut:
struct_type string = '.ClassifyResult'
index int
inferred_class string
inferred_class_array []string
labeled_class string
nearest_neighbors_by_class []int
nearest_neighbors_array [][]int
classes []string
class_counts map[string]int
weighting_flag bool
weighting_flag_array []bool
multiple_flag bool
hamming_distance int
sphere_index int
}
ClassifyResult holds the outcome of classifying a single instance: the inferred class, nearest-neighbor counts by class, the labeled class (if known), Hamming distance, and sphere index reached.
struct CliOptions #
struct CliOptions {
LoadOptions
pub mut:
args []string
astr string
}
CliOptions allows the cli() function to be driven programmatically: pass either a pre-split args slice or a single space-separated string (astr). If both are empty, os.args is used.
struct Cmd #
struct Cmd {
pub mut:
cmd string
}
Cmd carries the command name used by opts() to set Options.command when constructing an Options struct from a string.
struct CombinationSizeLimits #
struct CombinationSizeLimits {
pub mut:
generate_combinations_flag bool
min int
max int
}
CombinationSizeLimits controls the minimum and maximum number of classifiers to combine when generating multi-classifier combinations. Setting min/max also activates the generate_combinations_flag.
struct CrossVerifyResult #
struct CrossVerifyResult {
Parameters
LoadOptions
DisplaySettings
Metrics
BinaryMetrics
MultipleOptions // MultipleClassifierSettingsArray
Class
pub mut:
struct_type string = '.CrossVerifyResult'
command string
datafile_path string
testfile_path string
multiple_classify_options_file_path string
multiple_classifier_settings []ClassifierSettings
labeled_classes []string
actual_classes []string
inferred_classes []string
nearest_neighbors_by_class [][]int
instance_indices []int
// classes []string
// class_counts map[string]int
// pre_balance_prevalences_class_counts map[string]int
train_dataset_class_counts map[string]int
labeled_instances map[string]int
correct_inferences map[string]int
incorrect_inferences map[string]int
wrong_inferences map[string]int
true_positives map[string]int
false_positives map[string]int
true_negatives map[string]int
false_negatives map[string]int
// outer key: actual class; inner key: predicted class
confusion_matrix_map map[string]StringFloatMap
pos_neg_classes []string
correct_count int
incorrects_count int
wrong_count int
total_count int
bin_values []int // used for displaying the binning range for explore
attributes_used int
prepurge_instances_counts_array []int
classifier_instances_counts []int
repetitions int
confusion_matrix [][]string
// trained_attribute_maps_array []map[string]TrainedAttribute
trained_attribute_maps_array []map[string]TrainedAttribute
}
CrossVerifyResult is returned by cross_validate() and verify(). It contains the inferred and actual class arrays, a full confusion matrix, per-class inference counts, accuracy metrics, and provenance information (file paths, classifier settings).
struct Dataset #
struct Dataset {
Class // DataDict
LoadOptions
pub mut:
struct_type string = '.Dataset'
path string
attribute_names []string
attribute_flags []string
raw_attribute_types []string
attribute_types []string
inferred_attribute_types []string
data [][]string
useful_continuous_attributes map[int][]f32
useful_discrete_attributes map[int][]string
row_identifiers []string
}
Dataset is the primary data structure produced by load_file(). It holds all attribute data and types, class information, and pre-computed maps of useful continuous and discrete attributes ready for training a classifier.
fn (Dataset) purge_instances_for_missing_class_values #
fn (mut ds Dataset) purge_instances_for_missing_class_values() Dataset
purge_instances_for_missing_class_values removes all instances in ds whose class value is in the missings list, updating class_values, class_counts, data, and the useful attribute maps in place, then returns the modified dataset.
struct DefaultVals #
struct DefaultVals {
pub mut:
missings []string = ['?', '', 'NA', ' ']
integer_range_for_discrete []int = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
DefaultVals holds configurable default values used during dataset loading: the string tokens recognised as missing values, and the integer range treated as discrete rather than continuous.
struct DisplaySettings #
struct DisplaySettings {
CombinationSizeLimits
pub mut:
show_flag bool
expanded_flag bool
show_attributes_flag bool
graph_flag bool
help_flag bool
verbose_flag bool
generate_roc_flag bool
limit_output int
limit_continuous int
overfitting_flag bool
all_attributes_flag bool
}
DisplaySettings aggregates all flags and limits that control what is printed to the console or generated as plots: show, expand, graph, verbose, ROC, overfitting, output limits, and combination size limits.
struct Environment #
struct Environment {
pub mut:
vhammll_version string
// cached_cpuinfo map[string]string
os_kind string
os_details string
arch_details []string
vexe_mtime string
v_full_version string
vflags string
}
Environment captures a snapshot of the runtime environment (OS kind and details, architecture, V executable mtime and version, and VFLAGS) recorded in classifier history events.
struct ExploreResult #
struct ExploreResult {
Class
Parameters
LoadOptions
AttributeRange
DisplaySettings
pub mut:
struct_type string = '.ExploreResult'
path string
testfile_path string
pos_neg_classes []string
array_of_results []CrossVerifyResult
// accuracy_types []string = ['raw accuracy', 'balanced accuracy', ' MCC (Matthews Correlation Coefficient)']
// analytics []MaxSettings
// analytics map[string]Analytics
args []string
}
ExploreResult is returned by explore(); it holds the array of CrossVerifyResults produced over a parameter sweep, together with the attribute range, binning, and display settings used.
struct History #
struct History {
pub mut:
history_events []HistoryEvent
}
History wraps the ordered list of HistoryEvent records that track how a Classifier was created and subsequently extended.
struct HistoryEvent #
struct HistoryEvent {
Environment
pub mut:
event_date string
instances_count int
prepurge_instances_count int
// event_environment Environment
event string
file_path string
}
HistoryEvent records a single event (create or append) in a classifier's lifecycle, capturing the date, instance counts before and after any purging, and the source file path.
struct LoadOptions #
struct LoadOptions {
DefaultVals
pub mut:
positive_class string
class_missing_purge_flag bool
balance_prevalences_flag bool
balance_prevalences_threshold f64 = 0.9
}
LoadOptions are passed to load_file() to control dataset loading: the positive class label, whether to purge instances with a missing class value, and whether to balance class prevalences.
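These fields are also promoted into Options, so they can be supplied to any command that loads a datafile. A sketch (the path is hypothetical) using analyze_dataset, whose signature appears above:

```v
import vhammll

// purge instances whose class value is missing, and
// balance class prevalences during loading
opts := vhammll.Options{
	datafile_path: 'datasets/class_missing.tab' // hypothetical path
	class_missing_purge_flag: true
	balance_prevalences_flag: true
}
result := vhammll.analyze_dataset(opts)
```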
struct Metrics #
struct Metrics {
pub mut:
precision []f64
recall []f64
f1_score []f64
avg_precision []f64
avg_recall []f64
avg_f1_score []f64
avg_type []string
balanced_accuracy f64
class_counts_int []int
correct_counts []int
incorrect_counts []int
}
Metrics holds multi-class accuracy metrics computed for a verification or cross-validation: precision, recall, and F1 per class; their averages; balanced accuracy; and per-class instance, correct, and incorrect counts.
struct MultipleOptions #
struct MultipleOptions {
TotalNnParams
pub mut:
break_on_all_flag bool
multi_strategy string // '', 'combined', or 'totalnn'
classifiers []int // refers to an array of classifier ID values
}
MultipleOptions holds settings that govern how multiple classifiers are combined: whether to stop as soon as all classifiers agree (break_on_all_flag), which combination strategy to use (multi_strategy: '', 'combined', or 'totalnn'), and which classifier IDs to include.
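Since MultipleOptions is embedded in Options, these settings can be passed the same way as any other option; a sketch with illustrative classifier IDs:

```v
import vhammll

opts := vhammll.Options{
	multiple_flag: true // from embedded Parameters
	multi_strategy: 'combined' // '', 'combined', or 'totalnn'
	classifiers: [0, 3, 7] // illustrative classifier ID values
	break_on_all_flag: true // stop once all classifiers agree
}
```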
struct OneVsRestClassifier #
struct OneVsRestClassifier {
Parameters
LoadOptions
Class
History
pub mut:
struct_type string = '.OneVsRestClassifier'
datafile_path string
}
OneVsRestClassifier holds metadata for a one-vs-rest classification strategy, used when classifying multiclass problems by training a separate binary classifier for each class against all other classes.
struct OptimalsResult #
struct OptimalsResult {
RocData
RocFiles
pub mut:
settings_length int
settings_purged int
all_attributes_flag bool
settingsfile_path string
datafile_path string
class_counts []int
best_balanced_accuracies []f64
best_balanced_accuracies_classifiers_all [][]int // refers to an array of classifier ID values
best_balanced_accuracies_classifiers [][]int
mcc_max f64
mcc_max_classifiers_all []int // refers to an array of classifier ID values
mcc_max_classifiers []int
correct_inferences_total_max int
correct_inferences_total_max_classifiers_all []int // refers to an array of classifier ID values
correct_inferences_total_max_classifiers []int
classes []string
correct_inferences_by_class_max []int
correct_inferences_by_class_max_classifiers_all [][]int // refers to an array of classifier ID values
correct_inferences_by_class_max_classifiers [][]int
receiver_operating_characteristic_settings []int
reversed_receiver_operating_characteristic_settings []int
all_optimals []int
all_optimals_unique_attributes []int
multi_classifier_combinations_for_auc []AucClassifiers
}
OptimalsResult is returned by optimals(); it identifies which classifier combinations achieve the best balanced accuracy, highest Matthews Correlation Coefficient (MCC), highest total correct inferences, and highest per-class correct inferences.
struct Options #
struct Options {
Parameters
LoadOptions
DisplaySettings
MultipleOptions // MultipleClassifierSettingsArray
pub mut:
struct_type string = '.Options'
non_options []string
bins []int = [2, 16]
explore_rank []int
partition_sizes []int
concurrency_flag bool
datafile_path string
traverse_all_flags bool
testfile_path string
outputfile_path string
classifierfile_path string
instancesfile_path string
multiple_classify_options_file_path string
multiple_classifier_settings []ClassifierSettings
settingsfile_path string
roc_settingsfile_path string
partitionfiles_paths []string
append_settings_flag bool
command string
args []string
kagglefile_path string
}
Options is the main all-in-one options struct used throughout the library. It embeds Parameters, LoadOptions, DisplaySettings, and MultipleOptions, and adds file paths (data, test, classifier, output, settings) and runtime flags such as concurrency and traverse_all_flags. It can be used as the last parameter of a function to pass named options with defaults.
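Since every embedded field is promoted, an Options value can be built as a single struct literal of named fields, with everything unspecified left at its default. A minimal sketch (the file path is hypothetical), passed to analyze_dataset, whose signature appears above:

```v
import vhammll

opts := vhammll.Options{
	datafile_path: 'datasets/developer.tab' // hypothetical path
	bins: [3, 6] // overrides the default [2, 16]
	number_of_attributes: [4] // from embedded Parameters
	weighting_flag: true // from embedded Parameters
	show_flag: true // from embedded DisplaySettings
	concurrency_flag: true
}
result := vhammll.analyze_dataset(opts)
```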
struct Parameters #
struct Parameters {
pub mut:
binning Binning
number_of_attributes []int = [0]
uniform_bins bool
exclude_flag bool
purge_flag bool
weighting_flag bool
weight_ranking_flag bool
// switches_flag enables the dominant-class switch metric for 2-class
// datasets during attribute ranking (-sw / --switches). When true, bin
// counts whose switch count exceeds switches_threshold are excluded from
// the search for the maximum rank value; attributes where every bin count
// exceeds the threshold receive rank value 0 (treated as noise).
switches_flag bool
// switches_threshold sets the maximum permitted number of dominant-class
// switches per bin count when switches_flag is active (-swt). Valid values
// are 1 through the upper binning limit; default 2 allows U-shaped /
// inverted-U dose-response curves while blocking likely noise.
switches_threshold int = 2
one_vs_rest_flag bool
multiple_flag bool
folds int
repetitions int
random_pick bool
// balance_prevalences_flag bool
maximum_hamming_distance int
}
Parameters holds the core training and cross-validation settings shared across many operations: binning range, number of attributes, purge/weighting/one-vs-rest flags, fold and repetition counts, and the maximum Hamming distance.
struct PlotResult #
struct PlotResult {
pub mut:
bin int
attributes_used int
correct_count int
total_count int
}
PlotResult holds a single data point for plotting accuracy vs parameter settings: bin count, number of attributes used, correct-inference count, and total instance count.
struct RankedAttribute #
struct RankedAttribute {
pub mut:
attribute_index int
attribute_name string
attribute_type string
rank_value f32
rank_value_array []f32
bins int
array_of_hits_arrays [][][]int
// switches is the number of dominant-class flips across bins at the
// selected bin count; -1 means not applicable (multi-class or discrete).
switches int = -1
switches_array []int
}
RankedAttribute represents a single attribute together with its computed rank value, optimal bin count, and supporting hit arrays.
struct RankingResult #
struct RankingResult {
Class
LoadOptions
DisplaySettings
pub mut:
struct_type string = '.RankingResult'
path string
exclude_flag bool
weight_ranking_flag bool
switches_flag bool
switches_threshold int
binning Binning
array_of_ranked_attributes []RankedAttribute
}
RankingResult is returned by rank_attributes() and rank_one_vs_rest(); it contains the ordered list of ranked attributes and the options used to produce the ranking.
struct ResultForClass #
struct ResultForClass {
pub mut:
labeled_instances int
correct_inferences int
incorrect_inferences int
wrong_inferences int
confusion_matrix_row map[string]int
}
ResultForClass holds per-class tabulation of labeled, correct, incorrect, and wrong inferences for a single class in a verification or cross-validation run.
struct RocFiles #
struct RocFiles {
pub mut:
datafile string
testfile string
settingsfile string
}
RocFiles holds the file paths associated with a ROC plot: the training datafile, the test/verification file, and the classifier settings file used to generate the curves.
struct SettingsForROC #
struct SettingsForROC {
pub mut:
master_class_index int
classifiers_for_roc []ClassifierSettings
array_of_correct_counts [][]int
}
SettingsForROC collects the per-fold classifier settings and correct-count arrays needed to generate a ROC curve after a cross-validation run.
struct TrainedAttribute #
struct TrainedAttribute {
pub mut:
attribute_type string
translation_table map[string]int
minimum f32
maximum f32
bins int
rank_value f32
index int
// switches is the number of dominant-class flips across bins at the
// selected bin count; -1 means not applicable (multi-class or discrete).
switches int = -1
folds_count int // for cross-validations, this tracks how many folds use this attribute
}
TrainedAttribute holds the training-time representation of a single attribute: its type, the value-to-integer translation table (discrete) or range and bin count (continuous), rank value, and fold-usage counter.
struct ValidateResult #
struct ValidateResult {
Class
Parameters
LoadOptions
pub mut:
struct_type string = '.ValidateResult'
datafile_path string
validate_file_path string
row_identifiers []string
inferred_classes []string
counts [][]int
instances [][]u8
attributes_used int
prepurge_instances_counts_array []int
classifier_instances_counts []int
}
ValidateResult is returned by validate(); it contains the inferred classes for an unlabeled dataset, the encoded instance arrays, and provenance metadata. The result can be saved and later used to extend a classifier via append_instances().
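The round trip might look like the sketch below: classify an unlabeled file, then fold the inferred instances back into the classifier. The make_classifier() and validate() signatures here are assumptions (only append_instances() is documented above), and the file paths are hypothetical:

```v
import vhammll

opts := vhammll.Options{
	datafile_path: 'train.tab' // hypothetical path
	testfile_path: 'unlabeled.tab' // hypothetical path
}
// assumed signatures: make_classifier(Options) Classifier,
// validate(Classifier, Options) !ValidateResult
cl := vhammll.make_classifier(opts)
val_result := vhammll.validate(cl, opts) or { panic(err) }
// append_instances() is documented above; it returns the extended classifier
extended := vhammll.append_instances(cl, val_result, opts)
```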