VHamMLL

A machine learning (ML) library for classification using a nearest neighbor algorithm based on Hamming distances.
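
As a reminder of the general idea only (a textbook sketch, not a description of VHamMLL's exact procedure), the Hamming distance between two instances with n discrete attribute values counts the attributes on which they differ. A nearest neighbor classifier then assigns a new instance q the class of the closest training instance. The symbols T (training set), c(t) (class of training instance t), and q are introduced here for illustration:

```latex
d_H(x, y) = \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i],
\qquad
\hat{c}(q) = c\!\left(\operatorname*{arg\,min}_{t \in T} d_H(q, t)\right)
```

For example, x = (1, 0, 2, 1) and y = (1, 1, 2, 0) differ in the second and fourth positions, so their Hamming distance is 2.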

You can incorporate the VHamMLL functions into your own code, or use the included Command Line Interface app (cli.v).

Link to html documentation for the library functions and structs

You can use VHamMLL with your own datasets, or with the selection of publicly available datasets in the datasets directory, which are widely used for demonstrating and testing ML classifiers. These files are mostly in Orange file format; there are also datasets in ARFF (Attribute-Relation File Format) and in comma-separated values (CSV) format, as used in Kaggle.

This table reports balanced accuracy results for classification of a variety of publicly available datasets.
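
For readers unfamiliar with the metric, balanced accuracy is conventionally defined as the mean of the per-class recalls. The formula below assumes that standard definition; the symbols K (number of classes), TP_k (correctly classified instances of class k), and N_k (total instances of class k) are introduced here for illustration:

```latex
\text{balanced accuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{N_k}
```

As a worked example, if 90 of 100 instances of one class and 6 of 10 instances of another are classified correctly, raw accuracy is 96/110 (about 87%), while balanced accuracy is (0.90 + 0.60) / 2 = 0.75. The metric therefore penalizes poor performance on the smaller class.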

What, another AI package? Is that necessary? And have a look here for a more complete description and potential use cases.

Glossary of terms

For interactive descriptions of the two key algorithms used by VHamMLL, download the Numbers app spreadsheets: Description of Ranking Algorithm and Description of Classification Algorithm.

Usage:

To use the VHamMLL library in an existing Vlang project:

v install holder66.vhammll

In your v code, add: import holder66.vhammll

To use the library with the Command Line Interface (CLI):

First, install V, if not already installed. On macOS, Linux, etc., you need git and a C compiler. (For Windows or Android environments, see the V lang documentation.)

In a terminal:

git clone https://github.com/vlang/v
cd v
make
sudo ./v symlink
v install holder66.vhammll

On older Macs, if the make process fails, you may also need to do:

brew install bdw-gc
cp /usr/local/Cellar/bdw-gc/8.2.8/lib/libgc.a .thirdparty/tcc/lib/libgc.a

Then repeat the make in the v directory. Finally, export VFLAGS="-d dynamic_boehm"

See above re needed dependencies.

In a folder or directory that you want to use for your project, you will need to create a file with module main, and a function main(). You can do this in the terminal, or with a text editor. The file should contain:

module main
import holder66.vhammll

fn main() {
    vhammll.cli()!
}
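
One way to create this file from the terminal is sketched below. It assumes a Unix-like shell and uses vhamml as the directory name (any name works); the heredoc simply writes out the V source shown above.

```shell
# Create the project directory (the name is an arbitrary choice).
mkdir -p vhamml

# Write main.v. Quoting 'EOF' stops the shell from expanding
# anything inside the heredoc, so the V source is written verbatim.
cat > vhamml/main.v <<'EOF'
module main
import holder66.vhammll

fn main() {
    vhammll.cli()!
}
EOF

# Print the file for a quick visual check.
cat vhamml/main.v
```

Once the holder66.vhammll module is installed, the program can be run from inside that directory.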

Assuming you've named the directory or folder vhamml and the file within it main.v, run v run . in the terminal from inside that directory, followed by the command line arguments, e.g. v run . --help or v run . analyze <path_to_dataset_file>. Command-specific help is available, like so: v run . explore --help or v run . explore -h

Note that the publicly available datasets included with the VHamMLL distribution can be found at ~/.vmodules/holder66/vhammll/datasets.

That's it!

Tutorial:

v run . examples go

Updating:

v up
v update
v .

Getting help:

The V lang community meets on Discord

For bug reports, feature requests, etc., please raise an issue on GitHub.

Speed things up:

Use the -c (--concurrent) argument in the CLI to make use of available CPU cores for some vhammll functions; this may speed things up (timings are on a 2019 MacBook Pro):

v main.v
./main explore ~/.vmodules/holder66/vhammll/datasets/iris.tab
./main explore -c ~/.vmodules/holder66/vhammll/datasets/iris.tab

A huge speedup usually happens if you compile using the -prod (for production) option. The compilation itself takes longer, but the resulting code is highly optimized.

v -prod main.v
./main explore ~/.vmodules/holder66/vhammll/datasets/iris.tab
./main explore -c ~/.vmodules/holder66/vhammll/datasets/iris.tab

Note that in this case, there is no speedup for -prod when the -c argument is used.
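
To check these comparisons on your own machine, a small timing helper can be handy. The sketch below is only an illustration under stated assumptions: time_cmd is a hypothetical helper name, it measures whole seconds using date +%s, and it expects the compiled ./main binary from the steps above to be in the current directory (the runs are skipped if it is not).

```shell
# Hypothetical helper: run a command, discard its standard output,
# and print the elapsed wall-clock time in whole seconds.
time_cmd() {
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo "$((end - start)) s: $*"
}

# Location of the bundled iris dataset, as given earlier in this README.
DATASET="$HOME/.vmodules/holder66/vhammll/datasets/iris.tab"

# Only time the runs if the compiled binary is present.
if [ -x ./main ]; then
    time_cmd ./main explore "$DATASET"      # without -c
    time_cmd ./main explore -c "$DATASET"   # with -c (concurrent)
else
    echo "./main not found; compile it first with: v main.v"
fi
```

Running the same script after compiling with v main.v and again after v -prod main.v gives the four timings being compared. The shell's built-in time command is an alternative if sub-second resolution is needed.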

Examples showing use of the Command Line Interface

Please see examples_of_command_line_usage.md

Example: typical use case, a clinical risk calculator

Health care professionals frequently use calculators to inform clinical decision-making. Data on symptoms, physical examination findings, laboratory and imaging results, and outcomes (such as diagnosis, risk of developing a condition, or response to specific treatments) are collected for a sample of patients. These data then form the basis of a formula that predicts the outcome of interest for a new patient, based on how that patient's symptoms and findings compare to those in the dataset.

Please see clinical_calculator_example.md.

Example: finding useful information embedded in noise

Please see a worked example here: noisy_data.md

MNIST dataset

The mnist_train.tab file is too large to keep in the repository. If you wish to experiment with it, you can download it by right-clicking on this link in a web browser, or from the command line:

wget https://henry.olders.ca/datasets/mnist_train.tab

The process of development in its early stages is described in this essay written in 1989.