WEKA FOR PREDICTIONS

Gold Ochim
8 min read · Aug 26, 2021
Image extracted from: National Geographic Society

People love predictions; everyone wants to know what will happen in the next two or five years, or how much they'll make in the next quarter. Data has become the tool people use to make these predictions. It has helped people make important business, health, financial, educational and even marital decisions. How do people make these predictions? They use datasets to develop models that can then predict future occurrences. In building some of these models, the concept of Machine Learning comes in: as the name implies, the machine (computer) learns from previous datasets and applies the observed patterns to predict future occurrences.

What is WEKA

There are several machine learning tools available today, including RapidMiner, KNIME and MOA. WEKA is one of such tools. Briefly, WEKA was created at the University of Waikato, New Zealand, and was developed alongside the book “Data Mining: Practical Machine Learning Tools and Techniques”. WEKA is written in Java.

WEKA is great for many reasons. One reason is that its algorithms come with sensible default hyperparameter settings, so a model often performs reasonably well out of the box. That is not to say there is nothing you can do in WEKA to make a model better, but the defaults are generally a good starting point.
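If you're curious what those defaults actually are, WEKA's Java API exposes them. Below is a minimal sketch (the class name and the alternative option values are purely illustrative) that prints J48's default options and shows how they could be overridden if you do want to tune further:

```java
import weka.classifiers.trees.J48;
import weka.core.Utils;

public class J48Defaults {
    public static void main(String[] args) throws Exception {
        J48 tree = new J48();

        // Print the options J48 starts with; a confidence factor of 0.25 and
        // a minimum of 2 instances per leaf are the usual defaults
        System.out.println(String.join(" ", tree.getOptions()));

        // The defaults can be overridden if needed, e.g. heavier pruning
        // and larger leaves (these particular values are just an example)
        tree.setOptions(Utils.splitOptions("-C 0.1 -M 5"));
    }
}
```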

WEKA can be used for data preparation, classification, regression, clustering, association rule mining and visualization. It contains numerous machine learning algorithms and can be downloaded from the University of Waikato's website.

Learning how to use WEKA is quite a broad topic, but the focus here will simply be on performing predictions with it.

When you open WEKA, you are met with the interface shown below:

The WEKA interface

As seen above, WEKA has 5 major applications. For the predictions we are about to do, we will make use of the Explorer application (the first). When you click on the Explorer application, the interface that opens looks like this:

WEKA Explorer

We will be using the popular Titanic dataset (it contains details of passengers who boarded the Titanic, the ship that sank in 1912). The dataset I used has just 7 columns.

I have split the dataset into two: 787 instances for training and 100 for testing the model. Note that WEKA can do this split automatically when set to. I also removed the Name column, both because it wouldn't be relevant to our model and predictions and because the full stops in the passengers' titles cause an error with the CSV format in WEKA, so take note of the contents of your datasets. I also changed the 0's and 1's in the Survived column to No and Yes respectively. WEKA distinguishes between nominal and numerical columns, and giving it a numerical column instead of a nominal one (or vice versa) affects the visualizations and may prevent you from using certain algorithms.
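If you'd rather do this preparation in code, the sketch below shows roughly the same steps with WEKA's Java API. The file name and attribute positions are assumptions made for illustration; adjust them to your own CSV layout.

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.attribute.Remove;

public class PrepareTitanic {
    public static void main(String[] args) throws Exception {
        // Load the raw CSV (the file name is only an example)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("titanic.csv"));
        Instances data = loader.getDataSet();

        // Drop the Name column; "2" assumes Name is the second attribute
        Remove removeName = new Remove();
        removeName.setAttributeIndices("2");
        removeName.setInputFormat(data);
        data = Filter.useFilter(data, removeName);

        // Convert the numeric Survived column (0/1) into a nominal attribute,
        // assuming Survived is the first attribute. The labels will be "0" and
        // "1" unless renamed (I renamed mine to No/Yes directly in the CSV).
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("1");
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);

        System.out.println(data.toSummaryString());
    }
}
```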

Loading the Dataset

The first thing to do is load the dataset. This can be done simply by clicking on “Open file”, then selecting the dataset to be used.

The open file button on the WEKA Explorer interface used for loading in datasets

When selecting the dataset, the file type filter can be changed to locate the dataset more easily, especially if it isn't in the ARFF format.

In this case, our dataset is in CSV format.
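The same loading step can be done programmatically. This is a minimal sketch; DataSource picks a suitable loader from the file extension, and the file name here is just a placeholder for my training split.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // DataSource chooses a loader based on the file extension,
        // so both ARFF and CSV files work here
        Instances train = new DataSource("titanic_train.csv").getDataSet();

        System.out.println("Loaded " + train.numInstances() + " instances with "
                + train.numAttributes() + " attributes");
    }
}
```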

Once loaded, some visualizations are automatically generated. More visualizations of the other columns appear when you click on “Visualize All”, and details are shown when you hover over the blocks in the charts:

The interface with loaded dataset

Editing Dataset within WEKA

Back to the dataset. If for any reason the dataset needs to be edited within the WEKA interface, click on the “Edit” button as seen above. The dataset is opened and becomes editable: double-clicking on an instance allows you to change it, and right-clicking on a column heading gives column-related options.

The Titanic Dataset to be used in the WEKA Edit/Viewer Interface

Before an algorithm is selected, one ought to at least have an idea of what each algorithm is best suited for. Here, a classification algorithm, specifically J48, will be used. J48 is a decision tree algorithm (WEKA's implementation of C4.5).
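For readers who prefer code to the Explorer, training J48 through the Java API looks roughly like this. It's a sketch under the assumptions I've used so far: the training file name is a placeholder and Survived is taken to be the first column.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("titanic_train.csv").getDataSet();
        train.setClassIndex(0); // assuming Survived is the first attribute

        J48 tree = new J48();        // C4.5 decision tree with default settings
        tree.buildClassifier(train); // train the model
        System.out.println(tree);    // prints the learned tree as text
    }
}
```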

When you click on the “Classify” menu, you see the following:

The WEKA interface showing the information under the Classify menu

Take note of the “(Nom) Survived” attribute shown above. That should be the column determining the classification (it is what every instance in the dataset is, or should be, classified into). If the wrong column is selected by WEKA, it can be changed from this drop-down. WEKA's convention is to treat the last attribute as the class, so placing the classifying (output) column last in your CSV file helps it pick the right one by default.
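In code, the equivalent of that drop-down is setClassIndex. A tiny helper like the one below (the class and method names are my own, purely illustrative) selects the class attribute by name so you don't have to rely on its position at all:

```java
import weka.core.Instances;

public class ClassAttributeHelper {

    /** Select the class attribute by name, e.g. "Survived", instead of by position. */
    public static void setClassByName(Instances data, String attributeName) {
        data.setClassIndex(data.attribute(attributeName).index());
    }
}
```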

Selecting an Algorithm and a Test Option

Once the dataset is correctly loaded, an algorithm needs to be selected and the model trained and subsequently tested. WEKA performs this test by using the trained model to predict the outputs of the test dataset and comparing these predicted values with the actual values in the test dataset; this is how the performance of the model is determined. The two steps, training and testing, are done one after the other but appear almost simultaneous in WEKA because of how quickly it trains and then tests the model. That is not to say that some training runs won't take much longer.

The algorithm can be selected with the “Choose” button, and the data split for training and testing can be set with the test options provided above. “Use training set” means that the entire dataset used for training will also be used for testing the performance of the model. “Supplied test set” means that an external test dataset kept aside can be uploaded and used for testing, while the previously uploaded dataset is used only for training. “Cross-validation” means that the dataset is divided into a number of folds (10 by default): 9 folds are used for training and 1 for testing, the process is repeated ten times with a different fold held out each time, and the results from all the test folds are averaged. “Percentage split” simply divides the dataset into two parts, e.g. 80% for training and 20% for testing.
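These test options map directly onto WEKA's Evaluation class if you ever want to script them. The sketch below shows cross-validation and a percentage split; the file name, the class position and the 80/20 ratio are assumptions for illustration.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TestOptions {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("titanic_train.csv").getDataSet();
        data.setClassIndex(0); // assuming Survived is the first attribute

        // 10-fold cross-validation (10 is the Explorer's default; it is configurable)
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("Cross-validation accuracy: " + cv.pctCorrect() + "%");

        // Percentage split, e.g. 80% for training and 20% for testing
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.8);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);
        Evaluation split = new Evaluation(train);
        split.evaluateModel(tree, test);
        System.out.println("Percentage-split accuracy: " + split.pctCorrect() + "%");
    }
}
```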

We will select the J48 algorithm (found under trees) and the “Supplied test set” option. The Start button begins the training and subsequently the testing.

Interpreting WEKA Results

There is a whole lot to interpret in WEKA's results. Generally, it is advisable to learn about performance measures so as to know which ones best explain the performance of a model.

In this case, the accuracy of the model and a confusion matrix are used to explain its performance.
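For reference, the same numbers can be pulled out programmatically. This sketch assumes the 787-instance training split and the 100-instance test split sit in two placeholder files, with Survived as the first column.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateOnTestSet {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("titanic_train.csv").getDataSet();
        Instances test  = new DataSource("titanic_test.csv").getDataSet();
        train.setClassIndex(0);
        test.setClassIndex(0);

        J48 tree = new J48();
        tree.buildClassifier(train);

        // Evaluate the trained model on the supplied test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);

        System.out.println(eval.toSummaryString());      // accuracy and other summary statistics
        System.out.println("Accuracy: " + eval.pctCorrect() + "%");
        System.out.println(eval.toMatrixString());       // confusion matrix
    }
}
```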

Predictions:

Predictions can be made in two ways: with the actual (correct) values already known (this is referred to as testing, because predicting outputs whose actual values are known is usually done simply to check how the model is faring), or with the actual values unknown. It is advisable to do a test prediction (which has already been done above) before a prediction in which the outputs are not known, so that you know how much you can trust your predicted outputs.

For a prediction in which the outputs aren't known, make sure that the column names correspond to those used in the training dataset. The output column also needs to be present, but it can be filled with question marks (?), since the values are, after all, unknown.
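The same kind of prediction can be scripted. The sketch below loops over an unlabelled file (the name and column positions are assumptions) and prints the predicted class with its probability; for nominal input columns it works best when the unlabelled file shares the exact structure of the training file, e.g. an ARFF copy with an identical header.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictUnknowns {
    public static void main(String[] args) throws Exception {
        Instances train   = new DataSource("titanic_train.csv").getDataSet();
        Instances unknown = new DataSource("titanic_unlabelled.csv").getDataSet(); // Survived filled with "?"
        train.setClassIndex(0);
        unknown.setClassIndex(0);

        J48 tree = new J48();
        tree.buildClassifier(train);

        for (int i = 0; i < unknown.numInstances(); i++) {
            double predicted = tree.classifyInstance(unknown.instance(i));
            double[] dist    = tree.distributionForInstance(unknown.instance(i));
            String label     = train.classAttribute().value((int) predicted);
            System.out.printf("Instance %d: %s (probability %.2f)%n",
                    i + 1, label, dist[(int) predicted]);
        }
    }
}
```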

With the test option still set to “Supplied test set”, select the dataset prepared for prediction. (Note that I'll still use the same test dataset for the purpose of explaining, but with the values in the “Survived” column replaced with question marks (?).)

Make sure to choose the output format that “Output predictions” (under “More options…”) is set to. This step is shown below:

I prefer the HTML format because of how it is displayed. The output looks like the following when the run finishes in WEKA:

To view it better, copy the text into a text editor, save it with the .html extension, then open it with a browser. You get the following:

Above, you can see that the “actual” column is filled with question marks; the good thing is that we are interested in the “predicted” column. The displayed predicted values are arranged serially, in the same order as the instances in the CSV file, so they can simply be copied and added to the CSV file so that the inputs can be seen alongside the predicted values. The “prediction” column contains the probability with which each prediction was made. For example, instance number 14 was predicted with a probability of 1 (100%) based on the trained model.

Sharing Prediction Results

So generally, it will be clearer to output the final results as shown below, or in whatever way you prefer, as long as it is easy to understand which inputs determined the output of every instance. For example, from the second row shown below, we can see that if a 20-year-old male is in passenger class 3, has no siblings or children aboard, and paid a fare of 7.75, then the passenger isn't likely to survive on the Titanic. (Note that the fare column could be removed, as its values vary a lot.)
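If you'd rather skip the copy-and-paste, predictions can also be written back next to their inputs in one go. This sketch (file names are placeholders, and it assumes the unlabelled file declares Survived as a nominal No/Yes attribute, which is easiest with an ARFF copy of the test file) fills in the predicted class and saves everything as a new CSV:

```java
import java.io.File;

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.CSVSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class SavePredictions {
    public static void main(String[] args) throws Exception {
        Instances train   = new DataSource("titanic_train.csv").getDataSet();
        Instances unknown = new DataSource("titanic_unlabelled.arff").getDataSet();
        train.setClassIndex(0);
        unknown.setClassIndex(0);

        J48 tree = new J48();
        tree.buildClassifier(train);

        // Copy the unlabelled data and fill its class values with the predictions
        Instances labelled = new Instances(unknown);
        for (int i = 0; i < labelled.numInstances(); i++) {
            double predicted = tree.classifyInstance(labelled.instance(i));
            labelled.instance(i).setClassValue(predicted);
        }

        // Write inputs and predictions side by side to a new CSV file
        CSVSaver saver = new CSVSaver();
        saver.setInstances(labelled);
        saver.setFile(new File("titanic_predictions.csv"));
        saver.writeBatch();
    }
}
```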

And that’s how prediction is done!

In conclusion, it is important to know that the J48 model used here can be visualized as a decision tree (although not displayed here). From this decision tree, IF-ELSE statements can be generated and thus translated into working software. So if predicting within WEKA isn't enough for you, you can go further and build software around the model. There is a whole lot you can do with WEKA, so why not explore deeper if it tickles your fancy!
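As a pointer for that last step, the tree's structure is easy to get at in code: J48's text output already reads like nested IF-ELSE rules, and it can also emit DOT graph source for visualization. A minimal sketch, with the usual placeholder file name and class position:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PrintTree {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("titanic_train.csv").getDataSet();
        train.setClassIndex(0); // assuming Survived is the first attribute

        J48 tree = new J48();
        tree.buildClassifier(train);

        // The textual tree reads like nested IF-ELSE rules, one branch per line
        System.out.println(tree.toString());

        // DOT source for the same tree, usable with Graphviz or WEKA's tree visualizer
        System.out.println(tree.graph());
    }
}
```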

I’d be glad to answer questions if you’ve got any on the use of WEKA for predictions.

REFERENCES

Cross Validation in Weka (pentaho.com)

https://waikato.github.io/weka-wiki/making_predictions/
