Finite automata and pattern matching – the barbara problem

Intro

Ever heard of the name Barbara? It's fairly common in the place where I live. Except the commonness, word Barbara has some special features. It consist only of 3 letters and it the substring bar appears twice. This word is a perfect example on how to use deterministic finite automata (DFA, explained in the article) for pattern matching and substring searching.

Finite automata?

An automaton (in automata theory) is a 5-tuple (Q, Σ, δ, q0, F) defined as following:

  • Q – Finite set of states
  • Σ – Alphabet
  • δ – Transition function (δ: Q × Σ → Q)
  • q0 – First (starting) state
  • F – Set of finishing (accept) states

Before I go further, I have to define that a string is an array of symbols over Σ (labeled as Σ*, where * is a Kleene star, also called an iterator). Basically, an automaton is a machine that takes a string and either accepts it or declines it (go figure). The string is accepted if the automaton is left in a finishing (accept) state after "reading" it. "Reading" the string is done symbol by symbol where each symbol is fed to the transition function that determines what the next state will be. If the automaton is not in an accept state at the end, the string gets declined. Finite automata can be divided into two subgroups. Automata can be either deterministic (DFA) or nondeterministic (NFA), based on the way they "read". From now on, I'll use acronyms for both. DFA, is, as the name states, deterministic, meaning the transition from one state to another is unique. NFA is the opposite, transition doesn't have to be unique, meaning that automaton can go from one state to several different states by "reading" just one symbol. That said, DFA accepts a string if there's a unique path from starting state to one of the accept states, and NFA accepts a string if a path exists, no matter how it's formed. Even though these two seem quite different, both of them accept the same group of languages over Σ, and every NFA can be converted to DFA and vice versa. Languages that finite automata accept are called Regular languages (Kleene theorem) and can be represented by a regular expression. I believe that's quite enough of math theory for now. The last thing I need to say is that finite automata is usually represented by a directed graph where arrows represent the transition function.

The barbara

As I've stated in the intro, barbara is a fun word. It consists of just three letters (b, a, r) and has the substring bar repeated twice. That makes it specially interesting. Why? Because you can't just check only if b precedes a, a precedes r and so on. Given a random string of text, how can we determine if barbara appears in it? Sure, we could use one of the many substring searching algorithms (Knuth–Morris–Pratt, Boyer-Moore and many other), but why not do something a lot more interesting, like constructing a DFA for it? If you read the theory in the paragraph above, you know that a properly constructed DFA will accept a random string iff (if and only if) it is left in an accept state after "reading" the string. Automaton given on the graph below will be left in an accept state iff it contains barbara as a substring. Let's get on with the construction.

In order to detect barbara, we will need 8 states and only one of them will be an accept state (first seven will be for checking what precedes what). So, our set Q will be Q = {0, 1, 2, 3, 4, 5, 6, 7}. For the alphabet Σ, we will use the standard English alphabet. Strings over Σ are all words that can be generated using letters of the English alphabet (labeled as Σ*) (they absolutely don't have to have a meaning, as this article is not connected to linguistics in any way). The transition function δ is represented by the arrows on the graph below. Our starting state will be 0, and accept state set will be F = {7}. Desiging δ is the hardest part of the process, of course.

The transition function displayed on the graph is the following:

	δ(3, b) = 4; 		// bar -> barb
	δ(6, b) = 4; 		// barbar -> barb
	δ({0, s}, b) = 1;   		// x -> b

	δ(1, a) = 2; 		// b -> ba
	δ(4, a) = 5;		// barb -> barba
	δ(6, a) = 7;		// barbar -> barbara (the end)
	δ(s, a) = 0;		// x -> λ (empty word)

	δ(2, r) = 3;		// ba -> bar
	δ(5, r) = 6;		// barba -> barbar
	δ(s, r) = 0;		// x -> λ

	δ(s, x) = 0;		// x -> λ	

    s is a variable state not included in the definition

As you can see, the automaton will be in the finishing state iff it recognizes barbara as a substring. When it gets to the finishing state, the loop will make it stay there. Let's try it out and see how it works. For the example, I'll use the transition function and I'll show you each step. Before that, I have to state that δ(s, abc) = δ(δ(s, a), bc), meaning that the word can be broken at any place and we will still get the same results.

	δ(0, oifsfscnbarbakjkjibarbarabkf) = 

	δ(δ(0, oifsfscn), barbakjkjibarbarabkf) = 

	δ(0, barbakjkjibarbarabkf) = 

	δ(δ(0, bar), bakjkjibarbarabkf) = 

	δ(3, bakjkjibarbarabkf) = 

	δ(δ(3, ba), kjkjibarbarabkf) = 

	δ(5, kjkjibarbarabkf) = 

	δ(δ(5, kjkji), barbarabkf) = 

	δ(0, barbarabkf) = 

	δ(δ(0, barbara), bkf) = 

	δ(7, bkf) = 7 which is an accept state.

The algorithms

The last part of the article, are, of course, the implementations. It wouldn't make a lot of sense to just talk about math and not utilize the talk in some way. This time, I'll implement the algorithm in three different languages, just so you can see the differences between three programming paradigms. The first one is Scheme (functional language, a lisp dialect), the second one is Prolog (a widely known logic language) and the third one is C++ (a purely procedural language) this time. I'm giving Java a lil' break today.

Scheme code

(define findbarbara (lambda (string)
  (letrec
    (
      (barbaraDFA (lambda (state string)
        (if (= state 7)
        	#t
        	(if (null? string)
				#f
        		(barbaraDFA (transit state (car string)) (cdr string))
        	)
       	)
       )
      )       

      (transit (lambda (state input)
			  (cond
			    ((eqv? input 'b)
			      (cond
			        ((= state 3) 4)
			        ((= state 6) 4)
			        (else 1)
			      )
			    )
			    ((eqv? input 'a)
			      (cond
			        ((= state 1) 2)
			        ((= state 4) 5)
			        ((= state 6) 7) ; We're done
			        (else 0)
			      )
			    )
			    ((eqv? input 'r)
			      (cond
			        ((= state 2) 3)
			        ((= state 5) 6)
			        (else 0)
			      )
			    )
			    (else 0)  ; The X arrow
			  )
			 )
       )
    )
    (barbaraDFA 0 string) ; 0 is the starting state
  )
)
)

Prolog code

barbaraDFA([_|_], 7).
barbaraDFA([H|T], X) :- X =\= 7, transit(H, X, X1), barbaraDFA(T, X1).

transit(b, 3, 4).
transit(b, 6, 4).
transit(b, _, 1).
	
transit(a, 1, 2).
transit(a, 4, 5).
transit(a, 6, 7).
transit(a, _, 0).
	
transit(r, 2, 3).
transit(r, 5, 6).
transit(4, _, 0).
	
transit(_, _, 0).
	
findBarbara(X) :- barbaraDFA(X, 0).

C++ code

#include <iostream>

using namespace std;

int main() {
	char str[100];
	int state = 0, i = 0;

	cin >> str;

	while(state != 7 && str[i] != '\0') {
		switch(str[i++]) {
			case 'b':
				if(state == 3 || state == 6) 
					state = 4;
				else 
					state = 1;
			break;
			case 'a':
				if(state == 1)
					state = 2;
				else if(state == 4)
					state = 5;
				else if(state == 6)
					state = 7;
				else 
					state = 0;
			break;
			case 'r':
				if(state == 2)
					state = 3;
				else if(state == 5)
					state = 6;
				else 
					state = 0;
			break;
			default:
				state = 0;
			break;
		}
	}

	if(state == 7)
		cout << "#t";
	else 
		cout << "#f";

	return 0;
}

Outro

Even though this possibly seems as a harder way to achieve substring search than using regular widely-known algorithms, finite automata can also be used to match not only substrings but also regular expressions of all kinds. As I've said, DFA's accept regular languages and a language is regular iff it has a regular expression representing it.