Web Science - CSCI 470

CSCI 470
Web Science
Spring 2013

Schedule | Assignments | Resources | Course syllabus | Moodle

Assignment #9 - Cipher Solver
Due: 10PM, Sunday 4/7.
In this assignment you will be implementing a program that can (usually) break monoalphabetic substitution ciphers. This assignment is worth 30 points. There will also be a 15 point extra-credit bonus competition during class on Monday 4/8 (see below).

Overview. A monoalphabetic substitution cipher replaces each plain text character (denoted here by a lowercase letter) with a cipher character (denoted by an uppercase letter). The Caesar shift is a special case of a monoalphabetic substitution cipher with a key that is a shifted version of the letters A-Z:

plain  abcdefghijklmnopqrstuvwxyz
key    DEFGHIJKLMNOPQRSTUVWXYZABC

In general, the key may be any permutation of the letters A-Z:

plain  abcdefghijklmnopqrstuvwxyz
key    XBSINEMURZOQKAHCPVYLGFJTWD

A key might also be set using a password or phrase, removing any repeated letters and filling the end with the remaining letters:

plain  abcdefghijklmnopqrstuvwxyz
key    UGLYBACKSWNDEFHIJMOPQRTVXZ

Your mission, if you choose to accept it, is to develop a Java program that given some ciphertext, recovers the key (and thus plaintext) based on only the ciphertext.

Search algorithm. You can assume the original plaintext consisted of only the letters a-z with spaces removed. Ciphertexts only contain the letters A-Z and resulted from applying a monoalphabetic substitution cipher. The exact specifics for searching for the key is up to you. At a minimum, I would expect an algorithm capable of producing sensible results on general substitution ciphers (no fair just handling Caesar shifts). I suggest looking at Algorithm 1 in the paper Solving Substitution Ciphers.

Language modeling. A key part of your algorithm will be to judge the "goodness" of a particular key. To do this, you will decipher using a candidate key and then see how "English" looking the resulting plaintext is. I suggest doing this by measuring the probability of the plaintext under a language model trained on a large sample of newswire text. Text that is more consistent with expected letter sequences in English will have a higher probability.

The order of a model determines how many previous characters are used to predict the next character. A 3-gram model predicts the next character based on the previous two characters, e.g. P(e | th). Longer models tend to make better predictions, but are also slower to use. The provided language models have been trained on 2B words of newswire text, stripped to contain only the letters a-z. In case you're interested, the models were trained using SRILM and used Witten-Bell smoothing. Witten-Bell smoothing is equivalent to prediction by partial match (PPM) method C which we'll learn about in the lectures on compression.

You have been provided with the NGramLM Java class and associated support classes. You should not need to change these source files. The NGramLM class loads an ARPA format language model which is stored as an ASCII text file. The main method you will need in NGramLM is GetSentLogProb(String text). This method returns the log probability of the given sentence. Log probs are used to avoid underflow and are always negative. The closer a log prob is to zero, the more likely the given sentence was under the model.

The provided language models are letter-based with 26 "words", one for each of the letters a-z. The language modeling class uses spaces to separate "words" in a sentence. Thus you should always put spaces between the letters in the plaintext when asking for a log prob calculation.

CipherSolver. You have been given the partially completed program CipherSolver.java. Currently it accepts three command line arguments:

Cipher file - filename of a text file containing the cipher text
LM file - filename of an ARPA format letter-based language model
Time limit - time limit in seconds for searching for the best key

Your submission should continue to use these first three command line arguments, but you are free to add additional parameters to control the behavior of your algorithm. I have included the class Stats which can be used to determine the elapsed time of your program. Once the time limit is exceeded, the program should terminate but first output the following:

26-letters of the best key found
The plaintext resulting from deciphering using the best key

You are free to output other things during the optimization run such as the best key found thus far and the current best log prob. But make sure the last two things output are the above list.

Example ciphers. I have provided a variety of ciphered text files. If your program is working, you should be able to decipher the majority of these. However, the smaller files (e.g. tiny.txt and small.txt) may prove challenging.

The competition. We will have an in-class competition pitting our solvers against each other. If a student solver wins, that person or pair of people will be award 15 points of assignment extra credit. Rules:

All solvers will be given a fixed time period to work on a particular cipher.
Solvers will be run one-at-a-time and thus have dedicated use of my Macbook Pro (4 cores, 8GB memory).
Solvers will compete in a series of heats on cipher texts of different lengths, source texts and key types.
If a solver has optional command-line arguments, the author(s) can tell me what to use for each heat.
If a solver requires non-default JVM settings, the author(s) can specify that as well (e.g. increasing heap space).
The winner of a heat will be the solver that produces the lowest character error rate (CER) for a given cipher. The CER is the number of character insertions, substitutions and deletions required to transform the solution into the original plaintext.
A solver gets 3 points for first place, 2 points for second place, 1 point for third place, 0 points if it crashes or exceeds the time limit by more than 1 second.
I get to choose the set of texts to cipher as well as the nature of their keys, but I will not tune my solver to these ciphers. My solver is a pretty straightforward implementation using the algorithm in the paper cited above. It should be beatable.

Submission. Submit your source code to Moodle dropbox. The Moodle timestamp is the completion time for the assignment.

Page last updated: May 15, 2013