In this article we are going to create a Rust project that generates realistic-looking but random text using Markov chains.
You can find the finished code on GitHub.
What is a Markov chain?
According to Wikipedia, a Markov chain is
a stochastic process that satisfies the Markov property (usually characterized as “memorylessness”).
Um, OK, that helped a lot. Let’s try a different approach. Look at the following diagram:
By Joxemai4 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=10284158
Let’s say that we start at the A state. The diagram says that we have a 60% chance of staying at the A state and a 40% chance of moving to the E state. Let’s ask Python where we should go:
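Here is a sketch of that step in Rust (the language we'll use for the rest of the project) rather than Python. The "random" draw is hard-coded so the result is reproducible, and `step_from_a` is just a made-up name:

```rust
// Decide where to go from state A, given a number r in [0, 1):
// with probability 0.6 we stay at A, otherwise we move to E.
fn step_from_a(r: f64) -> char {
    if r < 0.6 { 'A' } else { 'E' }
}

fn main() {
    let r = 0.42; // pretend this came from a random number generator
    println!("next state: {}", step_from_a(r)); // 0.42 < 0.6, so we stay at A
}
```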
Ok, we are staying at A.
Now we are moving to E. The diagram says that we have a 30% chance of staying at E and a 70% chance of moving back to A. Let’s write a function to decide where we should go from here:
Now let’s find out where we are going:
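As a sketch of that function and the walk it produces, here is a Rust version (again standing in for the original interactive Python), with a fixed list of "random" draws so the output is reproducible:

```rust
// Given the current state and a number r in [0, 1), return the next state.
fn next_state(current: char, r: f64) -> char {
    match current {
        'A' => if r < 0.6 { 'A' } else { 'E' }, // 60% stay at A, 40% move to E
        'E' => if r < 0.3 { 'E' } else { 'A' }, // 30% stay at E, 70% move to A
        _ => unreachable!(),
    }
}

fn main() {
    // Fixed draws instead of a real RNG, so the walk is reproducible.
    let draws = [0.5, 0.7, 0.2, 0.9, 0.8, 0.4, 0.1, 0.65];
    let mut state = 'A';
    let mut chain = String::from("A");
    for &r in &draws {
        state = next_state(state, r);
        chain.push(state);
    }
    println!("{}", chain); // AAEEAEAAE
}
```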
I think that’s enough moves for now. What we just did is generate the string AAEEAEAAE by using a Markov chain. Let’s look back at that Wikipedia definition again:
a stochastic process that satisfies the Markov property (usually characterized as “memorylessness”).
“Stochastic” just means that the process is random. Our experiment showed that we could work out the possibilities for the next item in the chain just by knowing the current item. That is the Markov property in a nutshell. Putting it all together: a Markov chain is the result of running a random process with the property that we can predict the next result just by knowing the current result; no amount of prior results will help us make a better prediction.
Now let’s find out how we can use Markov chains to generate text.
n-grams
There is one more thing that we need to learn about before we can start creating our book writing bot, and that is n-grams. Luckily, they are easier to understand than Markov chains. The Wikipedia definition is excellent:
an n-gram is a contiguous sequence of n items from a corpus (a given sequence of text or speech).
For example, if we had the sentence “I like to have pizza and ice cream for dinner”, then “I like”, “like to”, “to have”, “have pizza”, “pizza and”, “and ice”, “ice cream”, “cream for”, and “for dinner” are all 2-grams from that sentence. (In fact, they are all of the 2-grams.) We can also have 3-grams, 4-grams, or any positive number for n, but 2 and 3 seem to be the most common.
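Extracting 2-grams is mechanical enough to show in code. Here is a small Rust sketch (`two_grams` is a made-up helper name):

```rust
// Split the sentence into words and pair each word with its successor.
fn two_grams(text: &str) -> Vec<(&str, &str)> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(2).map(|pair| (pair[0], pair[1])).collect()
}

fn main() {
    for (a, b) in two_grams("I like to have pizza and ice cream for dinner") {
        println!("{} {}", a, b);
    }
}
```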
Let’s write some code!
Now that we’ve got all that theory out of the way, we can start coding. First, make sure that you’ve installed Rust correctly by running the following command in your terminal:
If it outputs something like rustc 1.13.0, then you should be good to go. Otherwise, follow these instructions to install Rust and Cargo.
If you haven’t already, open up a terminal and cd to wherever you want to keep your project. Then run cargo new --bin markov to create a new Rust project in a folder named markov, which will also be the name of our binary. Now cd into the newly created markov directory.
Cargo will also create a new git repository in the project folder, so if you want to use it you should probably create a “Created project” commit.
Finally, open your favorite editor and open up the markov folder as a project.
Where are my crates?
We are going to need a couple of crates for this project, so let’s open up Cargo.toml and add them. When you first open up Cargo.toml, it should look like this:
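For a project generated by this era’s cargo new, the default Cargo.toml looks roughly like this (the authors line is filled in from your git configuration; the name and email here are placeholders):

```toml
[package]
name = "markov"
version = "0.1.0"
authors = ["Your Name <you@example.com>"]

[dependencies]
```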
We need to add the following two lines to the end of the file (right after [dependencies]):
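Something like the following; the exact version numbers are my guess at the era-appropriate releases (rand 0.3 and clap 2 were current around Rust 1.13):

```toml
rand = "0.3"
clap = "2"
```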
This will add the rand and clap crates to our project, which we will need for generating random numbers and for parsing command line arguments.
Parsing the arguments
Now we are going to create the parser for the command line arguments. Open up src/main.rs, delete everything in it, and add the following code:
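A clap 2-style parser for this app could look like the following sketch; the argument names and flags are my guesses, and the name and email are placeholders (this needs the clap crate from Cargo.toml to build):

```rust
extern crate clap;

use clap::{App, Arg};

fn main() {
    let matches = App::new("markov")
        .version("0.1.0")
        .author("Your Name <you@example.com>")
        .about("Generates random text from a corpus using Markov chains")
        .arg(Arg::with_name("corpus")
            .help("The corpus file to generate text from")
            .required(true)
            .index(1))
        .get_matches();
    // `matches` will be used in the following sections.
    let _ = matches;
}
```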
(of course you should replace my name and email with yours)
Let’s also add a println! call so that we can test that the code is working properly:
Now let’s try to build and run our project:
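Assuming the corpus path is the app’s first positional argument, that is:

```shell
cargo build
cargo run -- corpus.txt
```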
If the build succeeds and you see your test output, then you are good to go; otherwise, double-check that you entered the above code exactly.
Reading the corpus file
Now let’s move on to reading the corpus file. First, we need to import some standard library modules that deal with I/O and files. Stick the following code after the extern crate clap line:
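The imports in question are:

```rust
use std::io;
use std::io::prelude::*;
use std::fs::File;
```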
This code imports the io module, and brings everything in the io::prelude module, plus File from the fs module, into the current scope. Now we can write our file reading function:
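Here is a sketch of read_file. The Rust-1.13-era code used the try! macro; I’ve written it with the equivalent ? operator, and added a tiny main so the example runs on its own:

```rust
use std::io;
use std::io::prelude::*;
use std::fs::File;

fn read_file(filename: &str) -> io::Result<String> {
    // `?` propagates any io::Error to the caller with an early return
    // (this is what the try! macro did in older Rust).
    let mut file = File::open(filename)?;
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;
    Ok(contents)
}

fn main() {
    std::fs::write("demo.txt", "hello corpus").unwrap(); // demo setup
    println!("{:?}", read_file("demo.txt"));
}
```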
We return the contents of the file as an io::Result to make error handling easier by letting us use the try! macro (whose job is done by the ? operator in modern Rust). It automatically handles errors and early returns for us.
Now let’s add some code to the main function to use our shiny new read_file function:
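A sketch of the addition (read_file is repeated so the example compiles on its own, and the filename is hard-coded where the real app takes it from the command line arguments):

```rust
use std::io;
use std::io::prelude::*;
use std::fs::File;
use std::process::exit;

// read_file from the previous section, repeated so this sketch compiles.
fn read_file(filename: &str) -> io::Result<String> {
    let mut file = File::open(filename)?;
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;
    Ok(contents)
}

fn main() {
    std::fs::write("corpus.txt", "A tiny demo corpus.").unwrap(); // demo setup
    let filename = "corpus.txt"; // the real app gets this from clap
    let corpus = match read_file(filename) {
        Ok(contents) => contents,
        Err(e) => {
            println!("Error reading file {}: {}", filename, e);
            exit(1);
        }
    };
    println!("Read {} bytes", corpus.len());
}
```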
Pretty simple stuff. We pattern match on the result of the read_file function. If it is Ok we just assign the file contents to corpus, but if it is Err we print an error message and exit.
Try building and running the project again to make sure that it reads a corpus file correctly and gracefully exits if there is a problem opening or reading the file.
Generating bigrams
We’ve gotten to the point of generating bigrams. Let’s create a new function to do that:
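Here is a sketch of such a function. It slides a three-word window over the corpus so that each pair of consecutive words maps to the list of words seen after that pair (generate_bigrams is my name for it, and the Bigram type it uses is the one discussed just below):

```rust
use std::collections::HashMap;

// A bigram map: a pair of consecutive words maps to every word that
// follows that pair somewhere in the corpus.
type Bigram<'a> = HashMap<(&'a str, &'a str), Vec<&'a str>>;

fn generate_bigrams<'a>(corpus: &'a str) -> Bigram<'a> {
    let words: Vec<&str> = corpus.split_whitespace().collect();
    let mut bigrams: Bigram = HashMap::new();
    for window in words.windows(3) {
        bigrams
            .entry((window[0], window[1]))
            .or_insert_with(Vec::new)
            .push(window[2]);
    }
    bigrams
}

fn main() {
    let map = generate_bigrams("I like to have pizza and ice cream");
    println!("{:?}", map);
}
```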
Before we can use this code, we need to import HashMap and define the Bigram type. I’m just using a HashMap of tuples and vectors to keep things simple.
While that type looks complicated, it’s actually quite simple. A Bigram is just a HashMap with keys of type (&'a str, &'a str) (a tuple of two str references with lifetime 'a) and values of type Vec<&'a str> (a vector of str references that also have lifetime 'a).
The Book has a great intro to lifetimes if you haven’t encountered them before.
We need to add some more code to the main function to make use of the bigram generator:
In order to test our code, we need a real corpus to work with. I’m using the following small corpus to test out the code:
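Any few sentences will do as a stand-in corpus; for example:

```
The dog chased the cat. The cat climbed the tree. The dog barked at the tree.
```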
Save that in a file named corpus.txt, and run the following commands:
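Assuming the corpus path is passed as the first argument:

```shell
cargo build
cargo run -- corpus.txt
```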
You should get the following output:
Generating random words and sentences
We’ve finally gotten to the point of generating random words! 😀
We need to add the following extern and use to the top of our file:
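For the rand 0.3 releases that were current at the time, that would be:

```rust
extern crate rand;

use rand::Rng;
```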
Now we can write our function for generating random words:
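The app itself uses the rand crate directly. To keep this sketch dependency-free and testable, I’ve parameterized it over a pick(n) helper that returns an index below n; with rand you would pass something like |n| rand::thread_rng().gen_range(0, n):

```rust
use std::collections::HashMap;

type Bigram<'a> = HashMap<(&'a str, &'a str), Vec<&'a str>>;

// `pick(n)` returns an index in 0..n; the real app drives this with the
// rand crate instead of a hand-rolled helper.
fn generate_words<'a, F>(bigrams: &Bigram<'a>, count: usize, mut pick: F) -> String
where
    F: FnMut(usize) -> usize,
{
    // Start from a random bigram key (assumes a non-empty map).
    let keys: Vec<&(&str, &str)> = bigrams.keys().collect();
    let (mut w1, mut w2) = *keys[pick(keys.len())];
    let mut words = vec![w1, w2]; // we always emit at least the start pair
    while words.len() < count {
        match bigrams.get(&(w1, w2)) {
            Some(followers) if !followers.is_empty() => {
                let next = followers[pick(followers.len())];
                words.push(next);
                w1 = w2;
                w2 = next;
            }
            _ => break, // dead end: no recorded follower for this pair
        }
    }
    words.join(" ")
}

fn main() {
    let mut b: Bigram = HashMap::new();
    b.insert(("a", "b"), vec!["a"]);
    b.insert(("b", "a"), vec!["b"]);
    println!("{}", generate_words(&b, 6, |_| 0));
}
```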
Now let’s generate random sentences. We’ll start by creating a function that generates one random sentence. Its logic is much like generate_words(), but it has some special code to find words that start and end sentences in the corpus.
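Here is a dependency-free sketch of that logic, again using a pick(n) index helper in place of the rand crate. I treat a capitalized word as a sentence starter and ., !, or ? as sentence enders; the real function’s heuristics may differ:

```rust
use std::collections::HashMap;

type Bigram<'a> = HashMap<(&'a str, &'a str), Vec<&'a str>>;

fn generate_sentence<'a, F>(bigrams: &Bigram<'a>, mut pick: F) -> String
where
    F: FnMut(usize) -> usize,
{
    // Start pairs are keys whose first word looks like a sentence opener,
    // i.e. it begins with an uppercase letter (assumes at least one exists).
    let starts: Vec<(&str, &str)> = bigrams
        .keys()
        .filter(|&&(w, _)| w.chars().next().map_or(false, |c| c.is_uppercase()))
        .cloned()
        .collect();
    let (mut w1, mut w2) = starts[pick(starts.len())];
    let mut sentence = format!("{} {}", w1, w2);
    // Keep walking the chain until we hit sentence-ending punctuation.
    while !w2.ends_with('.') && !w2.ends_with('!') && !w2.ends_with('?') {
        match bigrams.get(&(w1, w2)) {
            Some(followers) if !followers.is_empty() => {
                let next = followers[pick(followers.len())];
                sentence.push(' ');
                sentence.push_str(next);
                w1 = w2;
                w2 = next;
            }
            _ => break, // dead end
        }
    }
    sentence
}

fn main() {
    let mut b: Bigram = HashMap::new();
    b.insert(("Hello", "there,"), vec!["world."]);
    b.insert(("there,", "world."), vec![]);
    println!("{}", generate_sentence(&b, |_| 0)); // Hello there, world.
}
```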
We can use this function to easily create a function to generate multiple sentences:
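A sketch of that wrapper (a minimal generate_sentence is repeated so the example compiles on its own):

```rust
use std::collections::HashMap;

type Bigram<'a> = HashMap<(&'a str, &'a str), Vec<&'a str>>;

// Minimal generate_sentence, as discussed in the previous section.
fn generate_sentence<'a, F>(bigrams: &Bigram<'a>, mut pick: F) -> String
where
    F: FnMut(usize) -> usize,
{
    let starts: Vec<(&str, &str)> = bigrams
        .keys()
        .filter(|&&(w, _)| w.chars().next().map_or(false, |c| c.is_uppercase()))
        .cloned()
        .collect();
    let (mut w1, mut w2) = starts[pick(starts.len())];
    let mut sentence = format!("{} {}", w1, w2);
    while !w2.ends_with('.') && !w2.ends_with('!') && !w2.ends_with('?') {
        match bigrams.get(&(w1, w2)) {
            Some(followers) if !followers.is_empty() => {
                let next = followers[pick(followers.len())];
                sentence.push(' ');
                sentence.push_str(next);
                w1 = w2;
                w2 = next;
            }
            _ => break,
        }
    }
    sentence
}

// Generate `count` sentences and join them with spaces.
fn generate_sentences<'a, F>(bigrams: &Bigram<'a>, count: usize, mut pick: F) -> String
where
    F: FnMut(usize) -> usize,
{
    let mut sentences = Vec::new();
    for _ in 0..count {
        sentences.push(generate_sentence(bigrams, &mut pick));
    }
    sentences.join(" ")
}

fn main() {
    let mut b: Bigram = HashMap::new();
    b.insert(("Hello", "world."), vec![]);
    println!("{}", generate_sentences(&b, 3, |_| 0));
}
```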
Putting it all together
We can finally finish up our main function and have our finished app. First, remove all the println! calls (except for the error messages) from the main function, and then add the following code at the end:
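A fragment sketch of what the end of main could look like, assuming the bigram and sentence helpers described above, a clap argument named sentences for how many sentences to print (the flag name is my guess), and a rand-driven pick closure:

```rust
let count: usize = matches.value_of("sentences")
    .and_then(|n| n.parse().ok())
    .unwrap_or(1);
let bigrams = generate_bigrams(&corpus);
let mut rng = rand::thread_rng();
println!("{}", generate_sentences(&bigrams, count, |n| rng.gen_range(0, n)));
```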
Try building it with cargo build to make sure that there are no errors. Then build it again with cargo build --release to get an executable that runs much, much faster than the debug build. Project Gutenberg has many free, plain-text ebooks that work perfectly as corpora.
To close this article, I’ll include some random sentences generated from The Cat of Bubastes: A Tale of Ancient Egypt by G. A. Henty.
Ameres. He was devoted to his disadvantage.
Ruth looked the elder of the things you have.
I understand now.
God.” “They are to stay here, my lord,” Amuba answered; “but at the head and made a motion of the two.
The shields had a weapon of some dangerous creature.
On ceremonial occasions, as the slaying of the bales of goods were made in the search will be willing to admit that the battle in which Ameres was bold enough to see that we should find a ship to take the former had recovered to some seats placed beneath trees on the head had, to Amuba, a lad who was bleeding freely, was with Mysa at the point that was carrying them; but the completeness of the Roman connection with them, who was to dispose of the Rebu, where, indeed, even if the eldest son was, he was lifted out, carried into an orchard, and lie there with only your face that he is left to himself, but he thought it best that you were doing.” Jethro then returned across the country.
Amuba, let us land and run and fetch a hoe.” Throwing down his merchandise by ship and start upon their flocks for the coast.
I admit it is all I don’t suppose we shall start exactly at sunrise.
Have you heard aught in the afternoon when the two lads bowed respectfully to the state of change.
Ameres,” Jethro said.
Credits
This article really helped me to understand how text generation using Markov chains works, and its code was the inspiration for the Rust code created in this post.
Wikipedia helped out with some definitions.
The Markov chain diagram was by Joxemai4 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=10284158