Carmelo Saffioti's Blog: WEKA from the command prompt

Thursday, 12 February 2009

WEKA from the command prompt

WEKA, the very well-known data mining suite, can also be run from the command prompt, which gives full and convenient control over the parameters for batch processing.

Here is how to specify the parameters for classification:

Example: java weka.classifiers.trees.J48 -t data/weather.arff
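The example command only works if the WEKA classes are on the Java classpath. Below is a minimal sketch of a wrapper that passes the classpath explicitly. The location /opt/weka/weka.jar and the names WEKA_JAR, DRY_RUN and run_weka are assumptions made for illustration, not part of WEKA. By default the wrapper only prints the command, so it can be checked before anything is run.

```shell
#!/bin/sh
# Assumption: hypothetical location of weka.jar; override WEKA_JAR for your installation.
WEKA_JAR="${WEKA_JAR:-/opt/weka/weka.jar}"
# Dry run by default: the command is printed, not executed. Set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"

# Run (or print) a WEKA class and its options with weka.jar on the classpath.
run_weka() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "java -cp $WEKA_JAR $*"
    else
        java -cp "$WEKA_JAR" "$@"
    fi
}

run_weka weka.classifiers.trees.J48 -t data/weather.arff
```

Setting DRY_RUN=0 in the environment makes the same script actually invoke Java. Alternatively, exporting the CLASSPATH environment variable lets the plain `java weka...` form from the example work as written.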

-t	specifies the training file (ARFF format)
-T	specifies the test file (ARFF format). If this parameter is missing, a cross-validation is performed (default: ten-fold).
-x	determines the number of folds for the cross-validation. Cross-validation is only performed if -T is missing.
-c	sets the class variable with a one-based index, as in weka.filters.
-d	The model after training can be saved via this parameter. Each classifier has a different binary format for the model, so it can only be read back by the exact same classifier on a compatible dataset. Only the model on the training set is saved, not the multiple models generated via cross-validation.
-l	Loads a previously saved model, usually for testing on new, previously unseen data. In that case, a compatible test file should be specified, i.e. the same attributes in the same order.
-p #	If a test file is specified, this parameter shows you the predictions and one attribute (0 for none) for all test instances. If no test file is specified, this outputs nothing. In that case, you will have to use callClassifier from Appendix A.
-i	Additionally outputs a more detailed performance description: precision, recall, and true and false positive rates. All these values can also be computed from the confusion matrix.
-o	This parameter switches the human-readable output of the model description off. In case of support vector machines or NaiveBayes, this makes some sense unless you want to parse and visualize a lot of information.
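The options above can be combined into a small batch workflow: cross-validate, train and save a model, then reload it for testing. This is a sketch under stated assumptions. The file names (data/train.arff, data/test.arff, j48.model), the jar location and the run_weka helper are hypothetical, and commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Assumption: hypothetical location of weka.jar; override WEKA_JAR for your installation.
WEKA_JAR="${WEKA_JAR:-/opt/weka/weka.jar}"
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Run (or print) a WEKA class and its options with weka.jar on the classpath.
run_weka() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "java -cp $WEKA_JAR $*"
    else
        java -cp "$WEKA_JAR" "$@"
    fi
}

# Hypothetical file names, used only for illustration.
TRAIN=data/train.arff
TEST=data/test.arff
MODEL=j48.model

# 1. Five-fold cross-validation (-x 5; no -T given) with per-class statistics (-i).
run_weka weka.classifiers.trees.J48 -t "$TRAIN" -x 5 -i

# 2. Train on the training file and save the resulting model (-d).
run_weka weka.classifiers.trees.J48 -t "$TRAIN" -d "$MODEL"

# 3. Load the saved model (-l) and evaluate it on a compatible test file (-T).
run_weka weka.classifiers.trees.J48 -l "$MODEL" -T "$TEST"
```

Step 3 assumes the test file has the same attributes in the same order as the training file, as required for -l.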

Clusterers can be used in much the same way as classifiers. Here is how to specify the parameters for clustering:

Example: java weka.clusterers.EM -I 10 -t train.arff

-t	specifies the training file
-T	specifies the test file
-p	for outputting predictions (if a test file is present, then for this one, otherwise the train file)
-c	performs a classes to clusters evaluation (during training, the class attribute will be automatically ignored)
-x	performs cross-validation for density-based clusterers (no classes to clusters evaluation possible!). With weka.clusterers.MakeDensityBasedClusterer, any clusterer can be turned into a density-based one.
-d and -l	for saving and loading serialized models
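The clustering options can be combined in the same way as for classifiers. This is a rough sketch, not a definitive recipe. The file names, the class index 5, the jar location and the run_weka helper are hypothetical. It also assumes that -c takes a one-based class index, as it does for classifiers. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Assumption: hypothetical location of weka.jar; override WEKA_JAR for your installation.
WEKA_JAR="${WEKA_JAR:-/opt/weka/weka.jar}"
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Run (or print) a WEKA class and its options with weka.jar on the classpath.
run_weka() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "java -cp $WEKA_JAR $*"
    else
        java -cp "$WEKA_JAR" "$@"
    fi
}

# Hypothetical file names and class position, used only for illustration.
TRAIN=train.arff
TEST=test.arff
MODEL=em.model
CLASS_INDEX=5   # assumed one-based index of the class attribute

# 1. Fit EM with at most 10 iterations (-I 10) and save the model (-d).
run_weka weka.clusterers.EM -I 10 -t "$TRAIN" -d "$MODEL"

# 2. Classes to clusters evaluation: the class attribute is ignored during training.
run_weka weka.clusterers.EM -I 10 -t "$TRAIN" -c "$CLASS_INDEX"

# 3. Load the saved model (-l) and evaluate it on a test file (-T).
run_weka weka.clusterers.EM -l "$MODEL" -T "$TEST"
```

EM is used throughout because it appears in the example command; other clusterers accept the same general options but have their own algorithm-specific ones in place of -I.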