Context Free Grammars And Push Down Automata
CFG => NDPM
Continuing the discussion of context free grammars and non deterministic machines, we are now going to see that CFGs and NDPMs are equivalent. When we looked at finite state machines we showed that they are equivalent to regular expressions and to linear grammars: a big cycle of equivalence between these formalisms and non deterministic machines. We have already ascertained that those are all the same thing.
Here, however, everything is not the same: determinism for push down machines is not the same thing as non determinism, as it was for regular languages. Non deterministic push down machines can do more than deterministic machines; they are strictly more powerful. The equivalence that we can show at this level is that context free grammars are the same thing as non deterministic push down machines. We will examine transforming a context free grammar into a push down machine; it is also possible to do the inverse and convert a non deterministic push down machine into a context free grammar, but that process is far more complicated.
To help us simplify this process we will be using Chomsky Normal Form; this is the first of three constructions for which we will use it.
Chomsky Normal Form
Here is an example of a simple grammar in Chomsky Normal Form; we will convert it into an equivalent machine.
A → BC | AB | 1
B → AA | 0
C → CB | 1 | 0
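To keep the later steps concrete, the grammar can be written down as plain data. Here is a minimal sketch in Python (the dictionary encoding and the helper name is_cnf are illustrative choices of ours, not anything from the discussion): in Chomsky Normal Form every production body is either exactly two non terminals or a single terminal.

```python
# The example grammar, encoded as a dictionary from each non terminal
# to its list of production bodies. (This encoding is our own choice.)
GRAMMAR = {
    "A": [("B", "C"), ("A", "B"), ("1",)],
    "B": [("A", "A"), ("0",)],
    "C": [("C", "B"), ("1",), ("0",)],
}

def is_cnf(grammar):
    """Check that every body is two non terminals or one terminal."""
    for bodies in grammar.values():
        for body in bodies:
            two_nonterminals = len(body) == 2 and all(s in grammar for s in body)
            one_terminal = len(body) == 1 and body[0] not in grammar
            if not (two_nonterminals or one_terminal):
                return False
    return True

assert is_cnf(GRAMMAR)
```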
As an aside, there also exists a set of grammars called LR(k) grammars that are equivalent to deterministic push down machines. This is more deeply explored in information sources that pertain to compilers; we will skirt around this subject for now, but should keep in mind this important subset of the context free grammars, the one that relates to the most practical class of deterministic push down machines.
10110
We are going to examine a conversion that produces a non deterministic machine, one that a tool such as YACC would refuse, complaining that it does not know which production to use. We start with A and will try to make a leftmost derivation of 10110. As we guess which production to use, we will always substitute for the leftmost non terminal that is still remaining, and we should notice something very nice about the way productions are organised in Chomsky Normal Form.
Leftmost Derivation
A → AB → 1B → 1AA → 1BCA → 10CA
→ 101A → 101AB → 1011B → 10110
As we begin making substitutions for our non terminals in this leftmost-first format, terminals accumulate at the left of the string while the remaining non terminals sit at the right; the resultant string grows from left to right.
For demonstrative purposes, here is another leftmost derivation of the same word, using the same grammar.
A → BC → AAC → ABAC → 1BAC → 10AC
→ 101C → 101CB → 1011B → 10110
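Each arrow above rewrites the leftmost non terminal, so a derivation is easy to check mechanically. Here is a small sketch of such a checker, reusing the GRAMMAR dictionary from the earlier snippet (the helper names are again our own):

```python
def leftmost_successors(grammar, form):
    """All forms reachable by rewriting the leftmost non terminal in `form`."""
    for i, symbol in enumerate(form):
        if symbol in grammar:  # first non terminal found
            return [form[:i] + "".join(body) + form[i + 1:]
                    for body in grammar[symbol]]
    return []  # no non terminals left: the form is a finished string

def check_leftmost(grammar, derivation):
    """Verify each consecutive pair of forms is one legal leftmost step."""
    return all(after in leftmost_successors(grammar, before)
               for before, after in zip(derivation, derivation[1:]))

first  = ["A", "AB", "1B", "1AA", "1BCA", "10CA", "101A", "101AB", "1011B", "10110"]
second = ["A", "BC", "AAC", "ABAC", "1BAC", "10AC", "101C", "101CB", "1011B", "10110"]
assert check_leftmost(GRAMMAR, first) and check_leftmost(GRAMMAR, second)
```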
Whichever way we apply the grammar we still get our string. Both trees were formed with leftmost derivations, and the trees for the two derivations look different, from which we can conclude that this is an ambiguous grammar. Ambiguous grammars are bad for parsers, which do not know what to do at any one moment, so we would try to fix this grammar if at all possible and make it unambiguous. But all that we need to understand for now is the parsing mechanism: that there are choices here, and that we could tighten the grammar so that a wrong choice would not accept the string and only one derivation would remain.
We can see that both our parse trees conclude in the same number of steps; for Chomsky Normal Form this is always twice the length of the string minus one, which for us here is nine steps. Chomsky Normal Form grammars are very easy to control; they all work in this very specific, predictable way. This is clear if we consider that for the five symbols of the target string we must take five steps, each substituting a terminal symbol for a capital letter. We start with a single capital letter, and each double (two capital letter) production adds one more, so reaching five capitals takes four steps: n + (n - 1) = 2n - 1 steps, here 5 + 4 = 9.
We have generated our parse trees just from looking at our simple grammar. A brute force algorithm would, in the worst case, have tried on the order of 2^9 trees to find out whether the desired string can be generated, and in this case it would have found at least these two parse trees that accept the string. We are now going to try to make a non deterministic push down machine that can simulate whatever it was that we did to get our tree. We can definitely write a program to do this, we have just described one, but what we are looking to achieve is a non deterministic machine with only one stack that can do it, rather than a brute force algorithm.
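As a sketch of that brute force program (our own illustration, reusing GRAMMAR and leftmost_successors from the earlier snippets): enumerate leftmost derivations breadth first for at most 2n - 1 steps, pruning any sentential form already longer than the target, which is safe because CNF productions never shrink a form.

```python
def derives(grammar, start, target):
    """Brute force: explore all leftmost derivations, pruning oversized forms."""
    frontier = {start}
    for _ in range(2 * len(target) - 1):           # 2n - 1 steps suffice in CNF
        next_frontier = set()
        for form in frontier:
            for successor in leftmost_successors(grammar, form):
                if successor == target:
                    return True
                if len(successor) <= len(target):  # a form never shrinks in CNF
                    next_frontier.add(successor)
        frontier = next_frontier
    return False

assert derives(GRAMMAR, "A", "10110")
```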
When we think of the way that our left derived string grew as we chose from the available productions, it is very natural to use a stack to hold these letters: we must access each leftmost letter in turn to expand it into the symbols it produces, and in the normal order of the operation that is always the letter most recently placed on the stack. Which is to say that a stack is a very natural data structure for this operation.
Changing A into BC has been shown in full, with the several micro states that are required to make the transition; the other two possible loops for A have been displayed in a more concise form, along with those that depict the consumption, or passing over, of a terminal on the input string or tape.
And then for C, here displayed apart for simplicity...
In a Chomsky Normal Form grammar the stack cannot empty and then grow again, because of the way the productions are structured: either you succeed and accept the string, or you fail. The machine has three primary states, start, parse and accept; whilst in the parse state it passes through the numerous micro states required to apply all of the productions of the grammar, and like so we construct our machine.
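Here is a minimal simulation of that machine's behaviour in Python (our own sketch; the micro states are collapsed into a single replace-or-match step, and backtracking plays the role of the non determinism): the stack starts holding the start symbol, a non terminal on top is replaced by a guessed production body, and a terminal on top must match and consume the next input symbol.

```python
def pda_accepts(grammar, start, text):
    """Simulate the non deterministic PDA built from a CNF grammar.

    A configuration is (input position, stack); the stack top is the
    last list element. Backtracking stands in for non determinism.
    """
    def run(pos, stack):
        if pos + len(stack) > len(text):  # each stack symbol yields >= 1 terminal
            return False
        if not stack:
            return pos == len(text)       # accept: input consumed, stack empty
        top = stack[-1]
        if top in grammar:                # non terminal on top: guess a production
            return any(run(pos, stack[:-1] + list(reversed(body)))
                       for body in grammar[top])
        # terminal on top: it must match the next input symbol
        return pos < len(text) and text[pos] == top and run(pos + 1, stack[:-1])
    return run(0, [start])

assert pda_accepts(GRAMMAR, "A", "10110")
assert not pda_accepts(GRAMMAR, "A", "10111")
```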
Pumping Lemma for Context Free Grammars
We have seen that for finite state machines we have the pumping lemma, a tool which enabled us to show that a string is outside of a regular language, or a specific machine, without using the 'heavy guns' of diagonalisation. It basically says that if you have long enough strings that are supposed to be accepted by your machine, those strings will induce loops within the computation, and therefore parts of them can be pumped, and you must accept all of the pumped strings too. There are lots of sets that do not abide by this property: you can show that no matter how they are split up, there are loops that, pumped enough times, will give you strings that are not in the set. This is how we use the pumping lemma for finite state machines.
Now if we try to use the pumping lemma for push down machines, something goes wrong. If we have some long string that we put through a machine like our previous example, which has only three states, and our string has nine symbols, then we must have a loop in the computation somewhere. Suppose a loop on 010, with 11 before it and 01 after it, such as:
11 010 01
11 010 010 01
Why is it that in a push down machine this loop is not the same as for a finite state machine? The essential difference is that the machine's behaviour now also depends upon the stack: even though we have arrived at the same state, the machine's reaction may not be the same at all, depending upon the symbols previously stored on the stack. This means that we cannot assume a loop in the states just because there is a loop in the string; we would also need to return to the same stack contents as well as the same state.
The way to find the pumping lemma for a push down machine is to look at the grammar and at the parse trees for the loop. The best way to understand what this means is to see an example. We will use the following grammar to describe the string 11100001, and examine the parse tree that the grammar generates in order to produce the string: within that parse tree, following the paths from the terminal nodes back up to the root, the same non terminal will appear twice along a path.
A → BC | 0
B → BA | 1 | CC
C → AB | 0
11100001
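One standard way to test membership mechanically for a grammar in Chomsky Normal Form is the CYK algorithm; it is not part of this discussion, but a sketch of it (with the grammar in the same dictionary encoding as before) confirms that the string really is generated.

```python
# The pumping lemma example grammar, in the same encoding as before.
GRAMMAR2 = {
    "A": [("B", "C"), ("0",)],
    "B": [("B", "A"), ("1",), ("C", "C")],
    "C": [("A", "B"), ("0",)],
}

def cyk(grammar, start, s):
    """CYK membership test: table[i][j] = non terminals deriving s[i:j+1]."""
    n = len(s)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):                      # length 1 substrings
        for head, bodies in grammar.items():
            if (ch,) in bodies:
                table[i][i].add(head)
    for length in range(2, n + 1):                  # longer substrings
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                   # split point
                for head, bodies in grammar.items():
                    for body in bodies:
                        if (len(body) == 2 and body[0] in table[i][k]
                                and body[1] in table[k + 1][j]):
                            table[i][j].add(head)
    return start in table[0][n - 1]

assert cyk(GRAMMAR2, "A", "11100001")
```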
So long as we have a parse tree with a path of four or more productions along it, we can find a duplication when starting from the terminals and working up to the root: since this grammar has only three non terminals, a path through four of them must repeat one, by the pigeonhole principle.
How long does a string need to be to ensure that there is a path depth of at least four in a CNF grammar? For short strings there may be no duplicates; you have to have a string that is long enough. The string must be at the very least 2^n + 1 symbols long, where n is the number of non terminals, to guarantee a repetition: some exponential number. Had our string contained only four terminals, we could generate it with a parse tree whose paths are only three nodes deep, one node being the terminal and another the root, giving no possibility for any repetition at all.
If our grammar has ten non terminals, then our string would have to be about a thousand symbols long, around 2^10, to ensure that there is a pumping lemma repetition in it. This is not linear as it is for finite state machines; it is exponential.
The way in which the pumping occurs is also different. If we take our marked repetition of A and replace the second A with the tree as it is formed at the first A node, our string grows with this pumping of the formation.
When we replace the lower A branch with the complete A node from above, consider the 10 that is the terminal yield of the lower node being replaced: this same string also exists within the terminals of the new, fuller node. In effect we are doubling the sections of string on either side of this central piece.
11100001
11110000001
Applying the same again, we double this effect on either side of the central piece: a form of the pumping lemma, different to the one we observed for regular expressions.
11111000000001
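A decomposition consistent with the three strings above (the particular pieces below are inferred from the example, not stated in it) is u = 1, v = 1, x = 10, y = 00, z = 01, where x is the yield of the lower A and v, y are the flanks contributed by the upper A; each pump repeats v and y together.

```python
# A decomposition s = u v x y z consistent with the pumped strings above.
# (These particular pieces are inferred from the example, not given in it.)
u, v, x, y, z = "1", "1", "10", "00", "01"

def pump(i):
    """Replace the lower A subtree i times: u v^(i+1) x y^(i+1) z."""
    return u + v * (i + 1) + x + y * (i + 1) + z

assert pump(0) == "11100001"
assert pump(1) == "11110000001"
assert pump(2) == "11111000000001"
```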