Markov chains and generating names

In a post recently I wrote about how to model patient flow in ER using markov chains and hidden markov model. In that example I built configuration for chains by hand, only mentioning in passing that it could be done programmatically. In this post I’m going to have a look how to build a simple model of names and use that to generate similar looking names.

The example will be using same chain-factory as in modeling patient flow, with one minor tweak. Instead of terminating chain when an empty array is returned as list of possible next element, I’m supplying a function from outside that does the job. It will be called every time a new element is being generated and will instruct the process to continue or halt. Advantage of this is that I have better control when to terminate with different kinds of chains. Now it’s possible to build chains that will contain 0, nil or empty list as an element.

First step is to obtain some sample data. I’m using a list of ancient greek names that I found from internet. I’m not going to repeat it here, as it’s not particularly interesting or needed to understand the gist of the post.

(setv data ["acacius", "aristodemos", "herodes", "polykarpos" ...]

The idea is that we build a configuration that reflects several things: what letter pairs can start a chain, how letter pairs follow is each in chain and what terminates a chain. In this example I’m not going to try and model frequencies of pairs. Each and every pair that can follow certain pair has equal chance of showing up.

I’m also using a group function I wrote some while ago. It simply takes a list, splits it into sublists of given length and returns the whole result as a list. List of lists of strings is a bit unwieldy to handle, so first we need a function that nicely splits a given string into a list of substrings of given length.

(defn split-into-parts [name length]
  "split name into parts of given length"
  (list (ap-map (.join "" it) (group name length))))

=> (split-into-parts "aristodemos" 2)
['ar', 'is', 'to', 'de', 'mo', 's']

By shortening length of chain links we get more variation in generated names and by lengthening it we get less variation. At least with the sample data I used length of 2 worked nicely. This is not the only way of splitting names into chain links. One could write more sophisticated routine that would split them by syllables for example.

Then comes the meat of the post: building configuration for chain-factory. We’ll tackle the easier one first: figuring out what pairs can start a chain:

(defn add-starting-link [starting-links parts-list]
  "add starting link using standard frequency"
  (let [[first-item (, (first parts-list) 0 10)]]
    (when (not (in first-item starting-links))
      (.append starting-links first-item))))

Very straightforward piece of code that just takes the first pair of supplied parts-list and appends it into starting-link list with frequency information. As previously said, I’m not trying to model frequencies, so each pair will have the same range. This will cause the markov chain routine to randomly choose between all of them. If I wanted to do model the frequencies, this is the part where I would have to keep track of how frequent any given pair is and adjust their random ranges accordingly.

Building a dictionary containing all the possible letter pairs and transitions from them it’s a bit more complex code. The basic idea is to take a pair and the pair following it. If pair doesn’t exists in the dictionary, add it there. Add the second pair with frequency information in the possible list of transitions if it’s not already there. In a special case that there is no second pair (we have reaced end of the word), add nil with frequency information signifying that this pair might end the generated name.

Traveling a list and processing it a pair of elements at a time can be written in various ways. Here I chose to use recursive method, as it made the routine easier to write and length of words is small. Problem with using recursive methods in Python is that since there is no tail-call optimization, one could run out of stack space. It’s not a problem if you know size of the data being processed, but it’s still worth keeping in mind.

(defn add-to-links [links parts-list]
  "add list of chain parts into links, each with same frequency"
  (when parts-list
    (let [[first-item (first parts-list)]
          [second-item (ap-if (second parts-list)
                              (, it 0 10)
                              (, nil 0 10))]
          [tail (list (rest parts-list))]]
      (if (not (in first-item links))
        (assoc links first-item [second-item])
        (when (not (in second-item (get links first-item)))
          (.append (get links first-item) second-item)))
      (when tail (add-to-links links tail)))))

To finish configuration code, I added one more function that processes given list of names using the functions described earlier. Links and starting-links parameters use to control where final configuration is being stored. In the future I should change the code to actually create and return them, instead of just modifying some already existing (albeit hopefully empty) data structures.

(defn add-names [links starting-links length names]
  "process a list of names and prepare configuration for markov chain factory"
  (ap-each names (do (add-starting-link starting-links (split-into-parts it length))
                     (add-to-links links (split-into-parts it length)))))

Finally it’s time to tie everything together and a a convenience method that first generates a chain of word pairs and then assembles a string from the result:

(setv starting-links [])
(setv links {})
(add-names links starting-links 2 names)
(setv markov-names (chain-factory starting-links
                                  (fn [item] (not (is item nil)))))
(defn generate-name []
  "generate a name"
  (.capitalize (.join "" (list (markov-names)))))

Testing the code will create a list of sort of believable looking (at least if you don’t actual speak greek:

=> (list (ap-map (it) (repeat generate-name 5)))
['Eraskleodoulosios', 'Procastoclides', 'Iasosilinus', 'Ianos', 'Xantios']

From here process could be fine tuned by adding more data to draw samples from, fine tuning the algorithm used to form chain links (after all, they don’t have to be same size) and adjusting the configuration to take different link frequencies into account.


One thought on “Markov chains and generating names

  1. Pingback: Generating scrolls | Engineer's Journey

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s