CSCI 470
Web Science
Spring 2013

Montana Tech
Computer Science & Software Engineering



Assignment #9 - Cipher Solver
Due: 10PM, Sunday 4/7.
In this assignment you will be implementing a program that can (usually) break monoalphabetic substitution ciphers. This assignment is worth 30 points. There will also be a 15 point extra-credit bonus competition during class on Monday 4/8 (see below).


Overview. A monoalphabetic substitution cipher replaces each plain text character (denoted here by a lowercase letter) with a cipher character (denoted by an uppercase letter). The Caesar shift is a special case of a monoalphabetic substitution cipher with a key that is a shifted version of the letters A-Z:
plain  abcdefghijklmnopqrstuvwxyz
key    DEFGHIJKLMNOPQRSTUVWXYZABC
In general, the key may be any permutation of the letters A-Z:
plain  abcdefghijklmnopqrstuvwxyz
key    XBSINEMURZOQKAHCPVYLGFJTWD
A key might also be set using a password or phrase, removing any repeated letters and filling the end with the remaining letters:
plain  abcdefghijklmnopqrstuvwxyz
key    UGLYBACKSWNDEFHIJMOPQRTVXZ
Your mission, if you choose to accept it, is to develop a Java program that given some ciphertext, recovers the key (and thus plaintext) based on only the ciphertext.

Search algorithm. You can assume the original plaintext consisted of only the letters a-z with spaces removed. Ciphertexts only contain the letters A-Z and resulted from applying a monoalphabetic substitution cipher. The exact specifics for searching for the key is up to you. At a minimum, I would expect an algorithm capable of producing sensible results on general substitution ciphers (no fair just handling Caesar shifts). I suggest looking at Algorithm 1 in the paper Solving Substitution Ciphers.

Language modeling. A key part of your algorithm will be to judge the "goodness" of a particular key. To do this, you will decipher using a candidate key and then see how "English" looking the resulting plaintext is. I suggest doing this by measuring the probability of the plaintext under a language model trained on a large sample of newswire text. Text that is more consistent with expected letter sequences in English will have a higher probability.

The order of a model determines how many previous characters are used to predict the next character. A 3-gram model predicts the next character based on the previous two characters, e.g. P(e | th). Longer models tend to make better predictions, but are also slower to use. The provided language models have been trained on 2B words of newswire text, stripped to contain only the letters a-z. In case you're interested, the models were trained using SRILM and used Witten-Bell smoothing. Witten-Bell smoothing is equivalent to prediction by partial match (PPM) method C which we'll learn about in the lectures on compression.

You have been provided with the NGramLM Java class and associated support classes. You should not need to change these source files. The NGramLM class loads an ARPA format language model which is stored as an ASCII text file. The main method you will need in NGramLM is GetSentLogProb(String text). This method returns the log probability of the given sentence. Log probs are used to avoid underflow and are always negative. The closer a log prob is to zero, the more likely the given sentence was under the model.

The provided language models are letter-based with 26 "words", one for each of the letters a-z. The language modeling class uses spaces to separate "words" in a sentence. Thus you should always put spaces between the letters in the plaintext when asking for a log prob calculation.

CipherSolver. You have been given the partially completed program CipherSolver.java. Currently it accepts three command line arguments: Your submission should continue to use these first three command line arguments, but you are free to add additional parameters to control the behavior of your algorithm. I have included the class Stats which can be used to determine the elapsed time of your program. Once the time limit is exceeded, the program should terminate but first output the following: You are free to output other things during the optimization run such as the best key found thus far and the current best log prob. But make sure the last two things output are the above list.

Example ciphers. I have provided a variety of ciphered text files. If your program is working, you should be able to decipher the majority of these. However, the smaller files (e.g. tiny.txt and small.txt) may prove challenging.

The competition. We will have an in-class competition pitting our solvers against each other. If a student solver wins, that person or pair of people will be award 15 points of assignment extra credit. Rules:
Submission. Submit your source code to Moodle dropbox. The Moodle timestamp is the completion time for the assignment.

Page last updated: May 15, 2013