CSCI 136
Fundamentals of Computer Science II
Spring 2014

Montana Tech
Computer Science & Software Engineering



ASSIGNMENT #3 - Predictive Keyboard

In this assignment, you will be building a predictive keyboard. The keyboard learns to make predictions from text given at startup and from what the user types into the program. As the user types, the program predicts the most likely word given the currently entered word prefix. So for example, if the user has typed th, the keyboard will probably predict the.

You will learn how to use the Java HashMap collection, which is crucial to making the keyboard's learning and predictions fast and efficient.

Language modeling 101. Our predictive keyboard will make use of a very simple model of language, a unigram language model. Our keyboard will keep track of every unique word it has seen. For each unique word, it will count how many times that word has occurred. A simple maximum likelihood estimate for the unigram probability of a word is to take the number of times you've seen that word occur divided by the total number of words you've encountered. For example, here is a small bit of text:
The cat is in the corner of their garage.
There are 9 total words in our text. Ignoring case and punctuation, the word the appears twice. The unigram probability of the is thus P(the) = 2/9. The unigram probability of cat is P(cat) = 1/9. The other six words (is, in, corner, of, their, garage) each also have a probability of 1/9. If a word has never been seen, its probability is zero: P(zebra) = 0/9 = 0.0.

Given this data, if the user were to type in c, both cat and corner are equally likely. But if they type ca, we would predict cat, while if they typed co we'd predict corner. If the user typed ci, we would make no prediction (since none of our words start with ci).
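The counting and probability arithmetic above can be sketched in a few lines of Java. This is only an illustration of the math, not the assignment's WordPredictor class; the class and method names here are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the maximum likelihood unigram estimate:
// P(word) = count(word) / total words seen.
public class UnigramDemo {
    // Count each lowercase word in the text, ignoring punctuation.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^a-z']+")) {
            if (word.isEmpty()) continue;
            counts.put(word, counts.getOrDefault(word, 0) + 1);
        }
        return counts;
    }

    // Unigram probability of a word; 0.0 if never seen.
    public static double prob(Map<String, Integer> counts, String word) {
        long total = 0;
        for (int c : counts.values()) total += c;
        return total == 0 ? 0.0 : counts.getOrDefault(word, 0) / (double) total;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countWords("The cat is in the corner of their garage.");
        System.out.println(prob(counts, "the"));   // 2/9
        System.out.println(prob(counts, "zebra")); // 0.0
    }
}
```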

Files. Start by downloading keyboard.zip. This contains the stub files for the two classes you will be developing. It also contains two support classes, Stats.java and StdDraw.java. The Stats class is used by our test method to track elapsed time and memory. The zip also contains some text files for use in training your model.

Word and probability class. In this assignment we need a class to represent a word and its unigram probability. We have given you a completed version of DictEntry; you shouldn't need to modify this file. We won't be using the audio functionality of DictEntry in this assignment (though you could imagine it being handy if your program were used as a voice output communication aid). Here is the API for DictEntry:
public class DictEntry
------------------------------------------------------------------------------------------------------------------------------
         DictEntry(String word, double prob)                  // create a new entry given a word and probability
         DictEntry(String word, double prob, String filename) // create a new entry given a word, probability, and audio filename
  String getAudio()                                           // getter for audio filename
  String getWord()                                            // getter for the word 
  double getProb()                                            // getter for the probability
 boolean matchPattern(String pattern)                         // does this word match the given pattern?

Prediction class. The brains of this whole operation is the class WordPredictor. This class learns how to make word predictions based on being shown training data. It can learn from all the words in a file (via the train() method) or from a single individual word (via the trainWord() method). After new training data is added, the build() method must be called so the class can recompute the most likely word for all possible prefixes. Here is the API for the WordPredictor class:
public class WordPredictor
-----------------------------------------------------------------------------------------
      void train(String trainingFile) // train the unigram model on all the words in the given file
      void trainWord(String word)     // train on a single word, convert to lowercase, strip anything not a-z or '
      long getTrainingCount()         // get the number of total words we've trained on
       int getWordCount(String word)  // get the number of times we've seen this word (0 if never)
      void build()                    // recompute the word probabilities and prefix mapping
 DictEntry getBest(String prefix)     // return the most likely DictEntry object given a word prefix

Training the model. Model training occurs in the train() and trainWord() methods. train() should parse out each word in the specified file on disk. If the file cannot be read, it should print out an error, "Could not open training file: file.txt". All training words are converted to lowercase and stripped of any characters that are not a-z or the single apostrophe. During training, you need to update the instance variables:
	private HashMap<String, Integer> wordToCount;
	private long total;
The wordToCount instance variable is a map where the keys are the unique words encountered in the training data. The values are the integer count of how many times we've seen each word in the data. Only words seen in the training data will have an entry in the HashMap. The total instance variable tracks how many words of training data we have seen. That is, total should equal the sum of all integer counts stored in your map. Training is cumulative for repeated calls to train() and trainWord(), so you just keep increasing the counts stored in your wordToCount map and your total count of training words.
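One possible sketch of this bookkeeping is shown below. It is not the required implementation, just one way the count map and running total could be updated; method names match the WordPredictor API, but the class name here is hypothetical.

```java
import java.util.HashMap;

// Sketch of cumulative training: normalize each word, bump its count,
// and keep total equal to the sum of all counts in the map.
public class TrainSketch {
    private HashMap<String, Integer> wordToCount = new HashMap<>();
    private long total = 0;

    public void trainWord(String word) {
        // Lowercase and strip anything that is not a-z or an apostrophe.
        String cleaned = word.toLowerCase().replaceAll("[^a-z']", "");
        if (cleaned.isEmpty()) return;
        wordToCount.put(cleaned, wordToCount.getOrDefault(cleaned, 0) + 1);
        total++;  // one more word of training data seen
    }

    public long getTrainingCount() { return total; }

    public int getWordCount(String word) {
        return wordToCount.getOrDefault(word, 0);
    }
}
```

Note that repeated calls keep increasing the stored counts, so training is cumulative as the assignment requires.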

Making predictions. The class needs to be able to predict the most probable word for a given word prefix, given the statistics found during training. For example, after training on mobydick.txt, if the user types the prefix wh, wha, whal or whale, the most likely word is whale. The job of mapping word prefix strings to the most probable dictionary entry falls to the third instance variable:
	private HashMap<String, DictEntry> prefixToEntry;
A Map collection in Java can map multiple keys to the same value. This is exactly what we need in this case since a set of strings (such as wh, wha, whal and whale) may all need to map to a single DictEntry object. You need to ensure that each prefix maps to its most likely word. So even if the word thespian was encountered first in the training data, for any sensible training text, th should probably map to the and not thespian.
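The core of this idea can be sketched as follows. For simplicity this sketch maps prefixes to plain word strings rather than DictEntry objects, and compares raw counts instead of probabilities (equivalent here, since all probabilities share the same denominator); the class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the prefix mapping: every prefix of every trained word maps
// to whichever word with that prefix has the highest count.
public class PrefixSketch {
    public static Map<String, String> buildPrefixMap(Map<String, Integer> wordToCount) {
        Map<String, String> prefixToBest = new HashMap<>();
        for (String word : wordToCount.keySet()) {
            // Generate every prefix of the word: "the" -> "t", "th", "the".
            for (int i = 1; i <= word.length(); i++) {
                String prefix = word.substring(0, i);
                String current = prefixToBest.get(prefix);
                // Keep whichever word is more frequent for this prefix.
                if (current == null
                        || wordToCount.get(word) > wordToCount.get(current)) {
                    prefixToBest.put(prefix, word);
                }
            }
        }
        return prefixToBest;
    }
}
```

With counts like {the=100, thespian=1}, the prefix th maps to the, while the longer prefix thes still maps to thespian, matching the behavior described above.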

Every time the build() method is called, you need to re-initialize the contents of prefixToEntry. This ensures that you take into account any new words or updated counts present in wordToCount. Here is an overview of the algorithm you should be aiming for in build():