Machine Learning - Assignment 6

CSCI 446
Artificial Intelligence
Fall 2015


ASSIGNMENT 6

Using the programming language of your choice, you are to write a machine learning program, using at least one, but possibly more, of the following techniques:

	Instance Based Learning - 50 points
	Clustering - 50 points
	Rule Based Learning - 100 points
	Decision Trees - 100 points
	Artificial Neural Network - 100 points
	Genetic Algorithm - 100 points

You must implement at least one of these. Each additional implementation will get you extra credit points.

No matter which technique(s) you choose, you must remember that you should use a portion of your data for training and a portion for testing. The technique(s) you choose should be able to read in a training data set and produce a structure that represents what it has learned. It should also be able to read in a test data set (without the outcome variable) and predict what that outcome should be. You probably want to work out some way of measuring how successful the predictions were.
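As a rough illustration of the train/test idea above, here is a minimal Python sketch. The function names, the 30% test fraction, and the fixed random seed are assumptions made for this example, not requirements of the assignment; any reasonable split and success measure will do.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and return (train, test) lists.

    test_fraction and seed are illustrative defaults, not assignment rules.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one row for testing.
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

To use it, train on the first list, strip the outcome column from the second list before predicting, and then compare your predictions to the outcomes you held back.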

There are three datasets posted on the website. Each of these is in CSV format, so you should be able to read it in as a text file. Some files have a header row (attribute names), some don't. Also, the DeerHunter dataset is provided in Excel format so you can see the explanations of the attributes if you hover the mouse over the attribute names. The datasets are:

Weather: the toy dataset we used in class to determine whether to play or not. There are 14 instances in this dataset, and all data is nominal (symbolic). Results I've gotten on this set range from 50% to 65% correct.

Class: a toy dataset to determine whether one of my classes will be held on time, late, very late, or cancelled. There are 21 instances in this dataset and all data is nominal. Results I've gotten on this set are around 65% correct.

DeerHunter: a real dataset from a survey of hunters on whether they would pay $x more to hunt the next time than they paid previously. Open the Excel version of this file for descriptions of the attributes if you're interested in their meaning. All data in this set is numeric, though much of it (or all of it, if you choose to do it that way) could be discretized. There are 6,062 instances in this dataset and, like real data, it has noise and errors in it. Results I've gotten on this set range from 60% to 65% correct.
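Since some files have header data and some don't, and the numeric DeerHunter attributes could be discretized, the following Python sketch shows one possible way to handle both. The function names and the choice of equal-width binning with three bins are assumptions for illustration only; other discretization schemes (equal-frequency, hand-picked thresholds) are just as valid.

```python
import csv


def read_rows(lines, has_header):
    """Parse CSV lines (an open file or a list of strings).

    Returns (header, rows); header is None when the file has no header row.
    """
    rows = [row for row in csv.reader(lines) if row]  # skip blank lines
    if has_header and rows:
        return rows[0], rows[1:]
    return None, rows


def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin index in 0..n_bins-1.

    Bins split the range [min, max] into n_bins intervals of equal width.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so everything falls into one bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For a file on disk you would pass an open file object to read_rows, convert the column you care about to float, and then call equal_width_bins on that column.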

Submission. What you should turn in:

1. Electronic Version of Source Code 
2. Compilation Instructions 
3. Run Instructions 
4. If you modified the datasets to work with your code, send me the modified versions also. 
5. A Description of what you did and the results: 
	A. Tell me how you treated the data (how much you used for training and testing, did you discretize, did you normalize, etc.) 
	B. Tell me any assumptions you made in your algorithm (did you use pre-pruning or post-pruning to compensate for noise, etc.) 
	C. Tell me the results of running your algorithm on the test data – what percentage did your approach get correct, etc. Note: this implies that the structure you built must be usable for predicting an outcome, that is, decision trees or rules must be executable in some form. Worst case, you can produce a set of rules or a tree and manually walk through the test data, but if you do this, tell me this is what you did. Also, tell me if there are any datasets that your code will not work on, e.g. if it won't work with numeric data or with missing data, etc.
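To make "executable in some form" concrete, here is a minimal Python sketch of a decision tree stored as nested data and walked to produce a prediction. The tree below is written by hand for illustration; it is not learned from any of the posted datasets, and the attribute names and values (outlook, humidity, windy) are hypothetical placeholders that may not match the posted files.

```python
# A leaf is a plain string (the predicted outcome).
# An internal node is a tuple: (attribute_name, {attribute_value: subtree}).
# This tree is hand-written and hypothetical, not output from a real learner.
EXAMPLE_TREE = (
    "outlook",
    {
        "sunny": ("humidity", {"high": "no", "normal": "yes"}),
        "overcast": "yes",
        "rainy": ("windy", {"true": "no", "false": "yes"}),
    },
)


def predict(tree, instance, default=None):
    """Walk the tree using the instance's attribute values.

    instance is a dict of attribute name -> value, without the outcome.
    Returns default if the instance has a value the tree never saw.
    """
    while isinstance(tree, tuple):
        attribute, branches = tree
        value = instance.get(attribute)
        if value not in branches:
            return default
        tree = branches[value]
    return tree
```

A rule set can be made executable the same way, for example as an ordered list of (conditions, outcome) pairs where the first matching rule wins.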

Page last updated: October 28, 2015