CSCI 470
Web Science
Spring 2014

Montana Tech
Computer Science & Software Engineering



Cipher Solver
In this assignment you will be implementing a program that can (usually) break monoalphabetic substitution ciphers. This assignment is worth 30 points. There will also be a 5 point extra-credit bonus competition during class (see below).


Overview. A monoalphabetic substitution cipher replaces each plain text character with a cipher character. Throughout this description, plaintext characters are denoted by lowercase letters, cipher characters are denoted by uppercase letters. The Caesar shift is a special case of a monoalphabetic substitution cipher with a key that is a shifted version of the letters A-Z, e.g.:
plain  abcdefghijklmnopqrstuvwxyz
key    DEFGHIJKLMNOPQRSTUVWXYZABC
In general, the key may be any permutation of the letters A-Z:
plain  abcdefghijklmnopqrstuvwxyz
key    XBSINEMURZOQKAHCPVYLGFJTWD
A key might also be set using a password or phrase, removing any repeated letters and filling the end with the remaining letters:
plain  abcdefghijklmnopqrstuvwxyz
key    UGLYBACKSWNDEFHIJMOPQRTVXZ
Your mission, if you choose to accept it, is to develop a Java program that given some ciphertext, recovers the key (and thus plaintext) based on only the ciphertext.

Search algorithm. You can assume the original English plaintext consists of only the letters a-z with spaces removed. Ciphertexts only contain the letters A-Z and resulted from applying a monoalphabetic substitution cipher. The exact specifics for searching for the key is up to you. At a minimum, I would expect an algorithm capable of producing sensible results on general substitution ciphers (no fair just handling Caesar shifts). I suggest looking at Algorithm 1 in the paper Solving Substitution Ciphers.

Language modeling. A key part of your algorithm will be to judge the "goodness" of a particular key. To do this, you will decipher using a candidate key and then see how "English" looking the resulting plaintext is. I suggest doing this by measuring the probability of the plaintext under a language model trained on a large sample of newswire text. Text that is more consistent with expected letter sequences in English will have a higher probability.

The order of a model determines how many previous characters are used to predict the next character. A 3-gram model predicts the next character based on the previous two characters, e.g. P(e | th). Longer models tend to make better predictions, but are also slower to use. The provided language models have been trained on 2B words of newswire text, stripped to contain only the letters a-z. In case you're interested, the models were trained using SRILM and Witten-Bell smoothing. Witten-Bell smoothing is equivalent to prediction by partial match (PPM) method C which we'll learn about in the lectures on compression.

In cipher-solver.zip, you have been provided with the NGramLM Java class and associated support classes. You should not need to change these source files. The NGramLM class loads an ARPA format language model which is stored as an ASCII text file. The main method you will need in NGramLM is GetSentLogProb(String text). This method returns the log probability of the given sentence. Log probs are used to avoid underflow and are always negative. The closer a log prob is to zero, the more likely the given sentence was under the model.

The provided language models are letter-based with 26 "words", one for each of the letters a-z. The language modeling class uses spaces to separate "words" in a sentence. Thus you should always put spaces between the letters in the plaintext when asking for a log prob calculation.

CipherSolver. You have been given the partially completed program CipherSolver.java. Currently it accepts four command line arguments: Your submission should continue to use these first four command line arguments, but you are free to add additional parameters to control the behavior of your algorithm. I have included the class Stats which can be used to determine the elapsed time of your program. Once the time limit is exceeded, the program should terminate but first output exactly three lines to the filename specified as the 4th command line argument: You are free to output whatever you like to standard output during the optimization run such as the best key found thus far and the current best log prob. This is not required, but will make it more fun to watch your solver work.

Example ciphers. I have provided a variety of ciphered text files. If your program is working, you should be able to decipher the majority of these. However, the smaller files (e.g. tiny.txt and small.txt) may prove challenging.

The competition. We will have an in-class competition pitting our solvers against each other. The winning person or pair will be awarded 5 points of assignment extra credit. Rules:
Submission. Submit your source code to Moodle dropbox. The Moodle timestamp is the completion time for the assignment.

Page last updated: December 30, 2014