Mini-Project 6 : Cracking a Cipher

For this mini-project, you are going to produce a series of functions to help encipher and decipher strings (or at least make a good attempt). The specific functions you will be responsible for writing (along with some associated helper functions) are:

create-cipher
encipher
create-inventory
find-vector-max
reverse-cipher!
decipher

You should write your program in a file called cipher.scm and submit it to Gradescope when you are done.

Part 1: Creating Substitution Ciphers

A cipher is a specified method to disguise writing so it can only be read by intended recipients. One of the most common (and basic) versions of a cipher is known as a substitution cipher, where the alphabet is scrambled when writing the message: each letter in the alphabet is assigned to a random letter instead, so that the scrambled text appears as gibberish to anyone who does not know the substitutions used.

For example, say our cipher includes the following letter substitutions:

‘h’ to ‘c’
‘e’ to ‘g’
‘l’ to ‘t’
‘o’ to ‘a’

Under this partial cipher, the string ‘hello’ would become ‘cgtta’ instead.

There are a few possible choices for how to represent a substitution cipher in scamper, but we are going to use a vector for this mini-project. Each index of the vector will represent a letter in our alphabet, and each element in the vector will represent a substitution. For this assignment, we will use the following alphabet (stored as a list of characters):

(define example-alphabet (string->list "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ "))

Our cipher will be a vector of the same length. Each entry in the vector corresponds to the letter at the same index of this alphabet. So the entry at index 7 corresponds to the letter ‘h’ in the original message. Meanwhile, the value stored at that index in the vector will be an integer representing the index of the letter that will substituted for the original. So if ‘h’ was to be replaced with a ‘c’ instead, the entry at index 7 would be 2, i.e., (vector-ref cipher 7) should give 2 as the result.

As a small scale example, let’s consider a simplified alphabet (string->list "abcde"). If our cipher is the vector (vector 3 2 4 1 0), then our substitutions are:

‘a’ (index 0) to ‘d’ (index 3)
‘b’ (index 1) to ‘c’ (index 2)
‘c’ (index 2) to ‘e’ (index 4)
‘d’ (index 3) to ‘b’ (index 1)
‘e’ (index 4) to ‘a’ (index 0)

We can see now that a substitution cipher is a permuted (rearranged) vector of the integers 0 to n-1, where n is the number of characters in the alphabet. This gives us a method for creating our own random ciphers: create a vector of the integers 0 to n-1 and randomly permute/shuffle it!

To help you with this task, you are being provided with the code for a (shuffle! vec) function that will randomly rearrange the elements of an input vector. You may use these functions directly in your submission, but make sure that you properly attribute this assignment writeup at the top of your file. (In the future, these functions may be included in the scamper library.) Note that the shuffle! function rearranges the input vector in-place, actually altering the vector as an effect. The function itself returns only void.

;;; (swap! vec i j) -> void
;;;   vec : vector?
;;;   i : integer?
;;;   j : integer?
;;; Swaps the ith and jth entries in the provided vector vec.
(define swap!
  (lambda (vec i j)
    (let ([temp (vector-ref vec i)])
      (begin
        (vector-set! vec i (vector-ref vec j))
        (vector-set! vec j temp)))))

;;; (shuffle-helper! vec n) -> void
;;;   vec : vector?
;;;   n : integer?
;;; Recursively selects the next random element and swaps it into location n.
(define shuffle-helper!
  (lambda (vec n)
    (if (zero? n)
      void
      (begin
        (swap! vec (random (+ n 1)) n)
        (shuffle-helper! vec (- n 1))
        void))))

;;; (shuffle! vec) -> void
;;;   vec : vector?
;;; Randomly shuffles the elements of the input vector.
(define shuffle!
  (lambda (vec)
    (shuffle-helper! vec ( - (vector-length vec) 1))))

Now you are ready to create your own ciphers!

Write a function (create-cipher n) that takes a single input integer n, and returns a cipher of length n.

Testing: Create 2-3 ciphers of different size. Check to make sure the vectors at least appear to be random, and that all of the necessary elements are in the vectors.

Demo: Create a random cipher using your function and save it for later use. Make sure your cipher is actually displayed when your code is run.

Part 2: Enciphering

Now that we have the ability to create ciphers, it would be nice if we could actually use them to encipher messages. Let’s first consider how to encipher a single letter. We’ll need to use our alphabet to figure out which entry we should look at in our cipher vector, then use our alphabet again to discover what letter that cipher entry maps to.

Write a function (encipher-single-char ch cipher alphabet) that takes as input a character ch with the cipher and alphabet, and returns the enciphered character that will replace the original in the enciphered message. If the input character is not in the alphabet, leave it unchanged (return the same ch as was input). You may want to use the (list-contains lst val) function provided below.

Testing: Create a small test-alphabet, something like (string->list "abcde"), and manually create a small test-cipher of the same size (do not call the function from Part 1, since that will produce a random cipher, which will not work well for testing). Use test-case to write several tests using the cipher you created. The small example cipher from Part 1 could make a good test case.

;;; (list-contains lst val) -> boolean?
;;;   lst : list?
;;;   val : any
;;; Returns #t if and only if the list contains the value.
(define list-contains
  (lambda (lst val)
    (match lst
      [null #f]
      [(cons head tail) (or (equal? head val) (list-contains tail val))])))

Once you have the ability to encipher a single character, a simple map would allow us to encipher every character in the original message.

Write a function (encipher str cipher alphabet) that takes a string str as input, along with the cipher and alphabet, and uses vectors to create and return the enciphered string. There are several correct ways to do this, but your implementation should use vectors for the sake of computational efficiency. Your function should leave any character that is not in the alphabet unchanged in the returned string.

Testing: Using your small test-alphabet and test-cipher, write several tests of your encipher function. There should be at least one test that includes characters that are not in your alphabet.

Demo: Define your own message string (aim for something that is a couple paragraphs in length), and encipher it using the full cipher and alphabet you created in Part 1 of the assignment. You should check a few letters in the enciphered string to ensure that they were mapped to the correct characters according to your cipher. Make sure to save both the original and scrambled messages for later use. Make sure both are actually displayed when your code is run.

Part 3: Letter Inventories and Ciphers

Now that you have written code to create and use ciphers, we will explore a method for attempting to decipher an encoded message. Our deciphering method will rely on letter inventories. It turns out that the frequency with which a certain letter appears in written text is fairly consistent for a given language. For example, ‘e’ is the most commonly written letter in the English language. If you consider any text written in English, the letter ‘e’ will be the most frequent letter you see (except in rare special circumstances).

But what if the text has been enciphered? Then all of the ‘e’s in the text will instead be replaced by a different letter. Critically, however, all of those ‘e’s were replaced by the same letter, while all other characters in the original text were replaced by other letters. So which letter replaced the ‘e’s? The most frequently observed letter in the enciphered text! The second most frequently observed letter will correspond to ‘t’ (the second most frequent letter in English), the third most frequently observed letter will correspond to ‘a’ (the third most frequent in English), etc.

The first step for deciphering a message is to count the number of times each letter from the alphabet appears in the string. Write a function (create-inventory str alphabet) that takes in an input string str and the alphabet, and returns a vector inventory of the letters in the string. The inventory should be the same length as the alphabet, and the ith entry of the vector should be the number of times the letter (list-ref example-alphabet i) appears in the input string. You will probably want to use for-range in this function, and you may also want to consider writing a helper function that updates the inventory based on a single input character. Your function should ignore any characters that are not in the alphabet.

Testing: Using your small test-alphabet, write several tests of your create-inventory function: create several strings using letters from your test-alphabet and make sure the resulting inventory correctly records the count of each letter you chose.

Demo: Obtain the inventories for both your original message and the enciphered version you created in Part 2. Make sure to save both inventories for later use. Make sure both are actually displayed when your code is run.

Part 4: Deciphering

With the ability to create letter inventories, we can return to our deciphering strategy: the most frequent letter in the enciphered message will correspond to the most frequent letter in the original message.

The most frequent letter will correspond to the maximum value in the letter inventory, so we need a few more functions for finding the index of that maximum value.

Write a function (find-vector-max vec) that returns the index of the maximum value in the input vector. You may use the vector-max and vector-index-of functions provided below, or you may write a recursive function to find the index directly.

Testing: Make sure you provide sufficient test cases for your find-vector-max function.

;;; (vector-max-helper vec max n) -> any
;;;   vec : vector?
;;;   max : any
;;;   n : integer?
;;; Recursively finds the maximum value of the vector.
(define vector-max-helper
  (lambda (vec max n)
    (if (< n 0)
      max
      (let ([current (vector-ref vec n)])
        (if (> current max)
          (vector-max-helper vec current (- n 1))
          (vector-max-helper vec max (- n 1)))))))

;;; (vector-max vec) -> any
;;;   vec : vector?
;;; Returns the maximum value in the vector.
(define vector-max
  (lambda (vec)
    (vector-max-helper vec (vector-ref vec (- (vector-length vec) 1)) (- (vector-length vec) 1))))

;;; (vector-index-of-helper vec val n) -> integer?
;;;   vec : vector?
;;;   val : any
;;;   n : integer?
;;; Recursively determines the first index of val in vec, or returns -1 if it is not present.
(define vector-index-of-helper
  (lambda (vec val n)
    (if (> n (- (vector-length vec) 1))
      -1
      (if (equal? (vector-ref vec n) val)
        n
        (vector-index-of-helper vec val (+ n 1))))))

;;; (vector-index-of vec val) -> integer?
;;;   vec : vector?
;;;   val : any
;;; Returns the index of the first occurrence of val in vec or -1 if val is not in vec.
(define vector-index-of
  (lambda (vec val)
    (vector-index-of-helper vec val 0)))

We are now ready to implement our deciphering strategy. We are going to use our letter inventories, along with this find-vector-max function, to create a reverse cipher that will let us decipher an encoded message. The idea here is to repetitively find the next most frequent letter in each inventory and associate those letters together in the reverse inventory.

The most frequent letters in each string correspond to the maximum values in their inventories. So we need to find the indices of those maximum values. Our reverse cipher will then associate the index of the enciphered letter, i, with the index of the original letter, j, so that the value of (vector-ref reverse-cipher i) will be j.

But once we’ve found and associated the most frequent letters in each string, we need to find the second most frequent letters, which would be the next largest value in each inventory. How do we find the second-to-largest value in a vector? The trick here: when we find the maximum, replace it with -1. Since all entries in the inventory are non-negative, this trick ensures that the new value will be less than all other entries in the vector. With the old maximum value replaced by -1, the second largest value will become the new maximum! If you repeat this process n times (where n is the number of characters in the alphabet), you will have completed the reverse cipher.

Write a function (reverse-cipher! encoded-inv ref-inv) that takes in the inventory of the encoded string and a reference inventory (from an unencoded text), and returns the reverse cipher. You may use numerical recursion or for-range to obtain the desired repetitive behavior.

This function has an unfortunate side-effect: it must edit all entries in both inventories and replace them all with -1. Below is the provided code for a copy-vector function. This will allow you to copy the inventories and edit the copies so that you leave the originals in tact. You are not required to use this function, but it has been provided in case you would like to.

;;; (copy-vector vec) -> vector?
;;;   vec : vector?
;;; Returns a copy of the original vector (will not work in some rare specific cases).
(define copy-vector
  (lambda (vec)
    (let ([copy (vector-range (vector-length vec))])
      (begin
        (for-range 0 (vector-length vec) (lambda (n) (vector-set! copy n (vector-ref vec n))))
        copy))))

Demo: You should now calculate the reverse cipher for your messages from Parts 1 and 2, using the inventories you created in Part 3. Make sure the reverse cipher is actually displayed when your code is run.

Part 5: Does it work?

Now let’s put everything together to actually decipher an input string. Write a function (decipher scrambled alphabet ref-inv) that takes in an encoded string scrambled, the alphabet, and a reference inventory ref-inv. This function should return the (attempted) deciphered message. Note that once you have the reverse cipher, you can use your encipher function to apply the reverse cipher to the scrambled string, which will decipher it!

Demo: Using your enciphered message from Part 2 along with the inventory for the original string that you created in Part 3 (be careful in case you edited it since creating it), test your decipher function. You should get something that is… readable…

What happened? What causes the mistakes in our deciphering?

There are a few things going on here. First, the strings you were using were probably fairly short (a few paragraphs long, as instructed). With strings that short, there aren’t enough total letters to guarantee a proper distinction between them. As an extreme example, the string ‘hello world’ has fewer ‘e’s than ‘o’s, even though ‘e’ should be the most frequent letter in an English text.

Another issue with this approach is that while English is fairly consistent with letter frequencies, it is not quite consistent enough to guarantee that this trick will work in general. A much better approach is to look at the frequency with which one specific letter follows another, and use that to figure out the substitutions. The number of times a ‘g’ follows an ‘a’, or an ‘e’ follows a ‘b’, etc., is actually much more consistent across all English texts. Recording these more complicated transition inventories and optimizing the possible reverse-ciphers to best match the transition frequencies will lead to significantly improved results. Although that optimization process will require some more significant mathematics as well.

There is one final discussion that we should have about this mini-project. Technically, these instructions have cheated a little bit. When you were testing your code, you were instructed to use the inventory of the original message to decipher the encoded message. That’s cheating because in a real world situation, where you are trying to crack an unknown cipher, you usually wont have access to the original message!

Now, you might think that this doesn’t matter much, but it actually makes a huge difference. If you would like to experiment, we suggest that you take a few minutes to use with-file and your create-inventory function to create an inventory for a very large reference text (you could find and download the text of a large book like ‘A Tale of Two Cities’ by Charles Dickens, which is freely available online: https://www.gutenberg.org/files/98/98-0.txt). This is a common approach when dealing with real life language processing tasks: finding a very large text to use as a reference for what ‘true English’ looks like (or whatever language you are working with, since this trick can work with many other languages as well).

(Side note: If you choose to do this, expect that it will take several minutes to run. You will likely have to instruct VSCode to keep waiting several times before the inventory is created. You should not try to save the inventory, since it can only be obtained with a button click. However, you don’t want to calculate this more than once. So we suggest that you copy-paste the resulting inventory into your code and save that vector directly.)

Now try to decipher your encoded message using this new reference inventory. You’ll notice that it does a very poor job. But why? Shouldn’t such a large text work better than the much shorter original message?

But remember: we cheated! We used the original message’s inventory as the reference inventory. Because both the original message and the encoded message are actually the same English words, the inventories will perfectly match! This means that we should expect near-perfect results when we use the original message as a reference when we decipher the encoded message. Why not exactly perfect? The strings are so short that some letters have the same frequency. If both ‘F’ and ‘q’ show up the same number of times, we wont be able to tell the difference between them.

Summary of Deliverables

Part 1
create-cipher
Part 2
encipher-single-char
encipher
Part 3
create-inventory
Part 4
find-vector-max
reverse-cipher!
Part 5
decipher

Copyright © Eric Autry, Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Nicole Eikmeier, Fahmida Hamid, Priscilla Jiménez, Barbara Johnson, Titus Klinge, Peter-Michael Osera, Samuel A. Rebelsky, John David Stone, Anya Vostinar, Henry Walker, and Jerod Weinman.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

This website was built using Jekyll, Twitter Bootstrap, and the Bootswatch Cosmo Theme.