This assessment task relates to Topics 1 – 5.
For this assessment, you are required to use Weka 3.8.3 or a later version (available at https://www.cs.waikato.ac.nz/ml/weka/downloading.html), which you will use throughout this subject. You will also need a text editor such as Notepad (Windows) or TextEdit (Mac).
Task 1: Create and explore ARFF data for Weka [30 marks]
In this task you are expected to convert a text file into an ARFF file for Weka. The text file contains a sample of real-life data related to parking fines in Australia. You are then asked to explore the data using Weka. The specific task requirements are listed below.
Download the text file called ParkingFines.csv and open it using a text editor such as Notepad (Windows) or TextEdit (Mac). The file ParkingFines.csv has been partially formatted as an ARFF file. Identify any errors in the file and complete the formatting to obtain a valid ARFF file saved as ParkingFines.arff. Itemise any errors identified and include a screenshot of your corrected ARFF file to support the itemised errors identified as part of your submission. [20 marks]
Explore the ParkingFines.arff file you just created using Weka Explorer and answer the following questions. Make sure to include screenshots of the visualisations to support your answers.
What proportion of people who committed the offence “Contravene No Stopping” actually paid their fine? [5 marks]
What proportion of people who were fined $50 were exempted from paying the fine? [5 marks]
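As a reminder of the layout a valid ARFF file follows (a @relation line, one @attribute line per column, then @data followed by comma-separated rows), the sketch below holds a small illustrative ARFF text and runs a few basic structural checks over it. The attribute names and values are hypothetical and are not the columns of ParkingFines.csv, and the check_arff helper is an assumption made for illustration only. It covers just a handful of common mistakes and is no substitute for loading the file in Weka.

```python
import csv

# Hypothetical ARFF content: names and values are illustrative only,
# NOT the real columns of ParkingFines.csv.
ARFF_TEXT = """\
% Lines starting with % are comments
@relation parking_example

@attribute offence {'Contravene No Stopping','Overstay Time Limit'}
@attribute amount numeric
@attribute status {paid,unpaid,exempt}

@data
'Contravene No Stopping',50,paid
'Overstay Time Limit',80,unpaid
'Contravene No Stopping',50,exempt
"""


def check_arff(text):
    """Return (attribute_names, rows); raise ValueError on simple structural errors.

    Simplification: attribute names are assumed to contain no spaces.
    """
    attributes, rows = [], []
    seen_relation = False
    in_data = False
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if not in_data:
            if lower.startswith("@relation"):
                seen_relation = True
            elif lower.startswith("@attribute"):
                parts = line.split(None, 2)  # keyword, name, type
                if len(parts) < 3:
                    raise ValueError(f"line {line_no}: attribute needs a name and a type")
                attributes.append(parts[1])
            elif lower == "@data":
                in_data = True
            else:
                raise ValueError(f"line {line_no}: unexpected header line")
        else:
            # Single quotes protect values that contain spaces or commas.
            fields = next(csv.reader([line], quotechar="'"))
            if len(fields) != len(attributes):
                raise ValueError(
                    f"line {line_no}: expected {len(attributes)} values, got {len(fields)}"
                )
            rows.append(fields)
    if not seen_relation or not in_data:
        raise ValueError("missing @relation or @data section")
    return attributes, rows


attributes, rows = check_arff(ARFF_TEXT)
print(attributes)
print(len(rows), "data rows")
```

A row with too few or too many values, or a missing @data line, makes check_arff raise a ValueError that names the offending line.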
Task 2: Explore and Analyse adult.arff data using Weka [35 marks]
In this task you will explore the adult.arff dataset using Weka Explorer. The adult dataset, which comes as part of the Weka installation, contains various attributes of individuals obtained through a census of people living in the US. The dataset was curated for building a model that predicts whether or not an individual earns more than $50K, based on that individual's other attribute values. Load the adult.arff data file available in Weka and answer the following questions with justifications and screenshots.
With the aid of a visualisation, identify the most populous age bracket. [5 marks]
With the aid of visualisations, compare the distribution of the female population in this dataset (adult.arff) to Australia’s female population distribution in 2019, as shown in the image below obtained from the Australian Bureau of Statistics (ABS). The distribution shown in the image reflects the entire population distribution of both females and males regardless of their income. Briefly discuss any two similarities/differences between the age distribution of females in the adult.arff dataset and the 2019 distribution of females in Australia. [15 marks]
With the aid of visualisations, justify whether you agree or disagree with the following statement: From the adult.arff dataset, there are more men who earn less than 50K than there are women who earn less than 50K. [15 marks]
Task 3: Decision Tree Analysis [35 marks]
The table below shows a dataset for a binary class problem. Using information gain, justify with calculations which attribute (A or B) the decision tree algorithm will choose to split on. [20 marks]
Explain whether or not you think gain ratio could be a better metric for this example. [15 marks]
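For reference, the two metrics named in Task 3 can be sketched as follows, using the standard definitions: information gain is the class entropy before the split minus the weighted class entropy after it, and gain ratio divides that gain by the split information (the entropy of the attribute's own value distribution). The four-row dataset and the attribute names X and Y are hypothetical and are not the table for this task. The function names are likewise illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())


def information_gain(values, labels):
    """Class entropy before the split minus weighted class entropy after it."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by split information (entropy of the attribute)."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # attribute has a single value, so it cannot split the data
    return information_gain(values, labels) / split_info


# Hypothetical toy data, NOT the assessment table.
labels = ["+", "+", "-", "-"]
x = ["T", "T", "F", "F"]  # separates the classes perfectly
y = ["T", "F", "T", "F"]  # carries no information about the class

print("gain(X) =", information_gain(x, labels))
print("gain(Y) =", information_gain(y, labels))
print("gain ratio(X) =", gain_ratio(x, labels))
```

In this toy case X splits the classes cleanly and Y does not, so a tree using information gain would split on X. Gain ratio gives the same ranking here because both attributes have two equally sized values and therefore the same split information.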
This assessment task will assess the following learning outcomes:
be able to identify and analyse business requirements for the identification of patterns and trends in data sets.
be able to appraise the different approaches and categories of data mining problems.
be able to compare and evaluate output patterns.
be able to explore and critically analyse data sets and evaluate their data quality, integrity and security requirements.
be able to compare and evaluate appropriate techniques for detecting and evaluating patterns in a given data set.