Generating gibberish

Generating gibberish is not really a very useful thing to do. But it's a funny thing to do, and it's not even hard. For the purposes of this blog post, generating gibberish is defined as:

  • Taking some amount of real, English text as input.
  • Being able to generate an (in principle) infinite amount of text that looks like real English as output.

We of course want the output to look silly enough for it to be funny. I think the method I have used here is what's generally known as using Markov chains to generate text, but I haven't read enough about Markov chains to be sure. In any case it looks very much like the same thing, so I have used the term order to describe the constant that defines how the program learns, just as one would when describing Markov chains. We need some data structure to map 'beginnings' to 'possible finishes'; in my case I have used a Python dictionary (otherwise known as a hashtable, hashmap, or just hash in other languages). The dictionary is the most essential part of the state of the program, but the vital part is that it is there, not that it is a dictionary: a binary-search tree would work, as would an ordered array. The following shows how the state of a program may look.

order = 3
word_based = False
state = {'the': [' ', 'y', 'm'], 'he ': ['i', 's', ...], ...}

Training

The correlation between the order, word_based and the state should be fairly obvious, and the way we build up a state should also be easy to grasp. The following pseudo-code illustrates this:

if word_based
    input = words(input)
for index = order; index < input.length; index = index + 1
    key = input.subsequence(index - order, index)
    if word_based
        key = unwords(key)
    if state.has(key)
        state[key].append(input[index])
    else
        state[key] = [input[index]]

Here, input.subsequence is assumed to be exclusive in the right endpoint. That is, "foobar".subsequence(0, 3) would return "foo", not "foob". Our algorithm so far has two problems. First of all, we're adding duplicate values: every time 'the ' occurs in the text, we add another space to state['the']. This gives us weighting, but that might not be what we want. It is, however, easy to check for duplicates before adding a value. The second problem is that we do not keep track of 'starting points' for sentences. I shall explain why this is important in the next section.
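For reference, here is the pseudo-code as a minimal runnable Python sketch. The name train and the use of setdefault are my own choices; it mirrors the pseudo-code faithfully, duplicate values, missing starting points and all:

def train(text, order, word_based=False, state=None):
    # Build (or extend) a state dictionary from the input text.
    if state is None:
        state = {}
    # For word-based training we work on a list of words instead of characters.
    units = text.split() if word_based else text
    for index in range(order, len(units)):
        key = units[index - order:index]
        if word_based:
            key = ' '.join(key)  # lists aren't hashable, so join the words into a string key
        # setdefault folds the 'has key, else create an empty list' check into one step
        state.setdefault(key, []).append(units[index])
    return state

As a small check, train('banana', 3) gives {'ban': ['a'], 'ana': ['n'], 'nan': ['a']}, which will become relevant again a bit later.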

Generating output

The generating output part is pretty easy. Make a result (it may be a string or a list, it does not really matter). Select a random key from our state, and concatenate it onto the result. Ideally, we want this random key to be one that is logical to start a sentence with, so it may be a good idea to keep track of those aside from just our simple state dictionary. To see why, consider what happens if we accidentally select 'end' as the random key. Next, the program might find that it can continue 'end' by adding a dot ('.'), resulting in 'end.'. Then it turns out that 'nd.' is not a key, and it can't make a longer output. This is of course just an example, but it shows that there are reasons for keeping track of other things than just exactly what you need. Anyway, from state[key], select a random unit and concatenate it onto result.

Now we have a result. The algorithm, then, is as follows (a runnable sketch follows the list):

  1. Select the last order units of your result as your new key.
  2. Concatenate a random value from state[key] onto your result.
  3. Not done yet? Go back to 1).
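Here is a matching Python sketch. The name generate, the starts parameter for the tracked sentence starts, and the max_length cap (which anticipates the problem in the next section) are all my own additions:

import random

def generate(state, order, word_based=False, max_length=500, starts=None):
    joiner = ' ' if word_based else ''
    # Prefer a tracked sentence start if we have one; otherwise any key will do.
    key = random.choice(starts if starts else list(state))
    result = key.split() if word_based else list(key)
    while len(result) < max_length:
        key = joiner.join(result[-order:])
        if key not in state:
            break  # no known continuation, so we are done
        result.append(random.choice(state[key]))
    return joiner.join(result)

Called as generate(train(text, 3), 3), this produces one chunk of gibberish from text.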

The banananana problem

One would think that the condition for stopping the generation of output would be to stop when your key is no longer in the state dictionary. This is obvious: we can't make more output when that happens. However, there is no guarantee that this will ever happen. Consider:

state = {'nan': ['a'], 'ana': ['n']}

It should be obvious that this causes a loop. The easiest way to prevent this from putting your program into an infinite loop is to limit the possible length of the output as well. This will occasionally give you output with repeats, but it won't freeze the program and eat all your memory.
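With the sketch from the previous section, the cap is easy to see in action (the exact output depends on which key the random start picks):

state = {'nan': ['a'], 'ana': ['n']}
print(generate(state, order=3, max_length=20))
# prints e.g. 'nananananananananana', capped at 20 units instead of looping forever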

Order

Varying the order varies the quality of the output (assuming that the input is of a decent quality as well). When you use a very high order, like, say, 6, you will get output that is very close to the input; often several sentences will be identical. Generally, using anything above 4 for character-based training gives you output that is too similar to the input to be very fun. When order is in {3, 4} you usually get pretty funny output. When you use word-based training, you should use {2, 3}.
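If you want to get a feel for this yourself, a quick comparison loop using the sketches above works well; 'input.txt' here is a stand-in for whatever training text you have:

# compare the flavour of the output at different orders
with open('input.txt') as f:  # any reasonably large piece of English text
    text = f.read()

for order in (2, 3, 4, 6):
    state = train(text, order)
    print(order, generate(state, order, max_length=300))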

Samples

I felt that I couldn't really close off this post without showing some samples, so here they are. The program that generated this output was fed this article (except for the code), and used order-3, letter-based training.

We next as for it hashtable, has two program, but it does no look silly seen there, not keeping a diction for stop when it happen. Concatenate it looks like real english is not reasons following duplicatenate it happen. Consider what's not there, input. Then, is ever, word_based a Python dictionary-search tree would return "foobar".subsequences.

Here, not keeping anything, but to map 'beginning you need some amount of 'starting, so I have useful think there adding somethink the starting, but that has two pretty easiest way to longer output.subsequences will occasionary. A binary is of reasons following some datastructure to 1)

It is, how the text, it is is what mighting, but is, "foo", not even hard. For things' to check if we want in my case I haven't make a starting, but it's a funny. I shall explain why, continue endpoint.

Conclusion

The point of this post was to be a guide to my method of solving the problem of generating pseudo-English text. It is by no means the only, or best, method, and it might have serious shortcomings. I don't think there are any errors in the pseudocode, but I can't guarantee that. Oh, and it is hilarious to train programs like these on Steve Yegge posts (Paul Graham is a good second, for hilarity).
