Context Free Grammars And Push Down Automata
CFG => NDPM
Continuing the discussion of context free grammars and non deterministic machines, we are now going to see that CFGs and NDPMs are equivalent. When we looked at finite state machines we showed that they are equivalent to regular expressions and to linear grammars: a big cycle of equivalence between these formalisms and non deterministic machines. We have already ascertained that those are all the same thing.
Here, however, everything is not the same: determinism for push down machines is not the same thing as non determinism, as it was for regular languages. Non deterministic push down machines can do more than deterministic machines; they are strictly more powerful. The equivalence that we can show at this level is that context free grammars are the same thing as non deterministic push down machines. We will examine transforming a context free grammar into a push down machine; it is also possible to do the inverse and convert a non deterministic push down machine into a context free grammar, but that process is far more complicated.
To help us simplify this process we will be using Chomsky Normal Form; this is the first of three constructions for which we will use it.
Chomsky Normal Form
Here is an example of a simple grammar in Chomsky Normal Form; we will convert it into an equivalent machine.
A → BC | AB | 1
B → AA | 0
C → CB | 1 | 0
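To keep the later steps concrete, the grammar can be written down as plain data. Here is a minimal sketch in Python (the dictionary encoding and the helper name is_cnf are illustrative choices of ours, not anything from the discussion): in Chomsky Normal Form every production body is either exactly two non terminals or a single terminal.

```python
# The example grammar, encoded as a dictionary from each non terminal
# to its list of production bodies. (This encoding is our own choice.)
GRAMMAR = {
    "A": [("B", "C"), ("A", "B"), ("1",)],
    "B": [("A", "A"), ("0",)],
    "C": [("C", "B"), ("1",), ("0",)],
}

def is_cnf(grammar):
    """Check that every body is two non terminals or one terminal."""
    for bodies in grammar.values():
        for body in bodies:
            two_nonterminals = len(body) == 2 and all(s in grammar for s in body)
            one_terminal = len(body) == 1 and body[0] not in grammar
            if not (two_nonterminals or one_terminal):
                return False
    return True

assert is_cnf(GRAMMAR)
```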
As an aside, there also exists a set of grammars called LR(k) grammars that are equivalent to deterministic push down machines. This is more deeply explored in information sources that pertain to compilers; we will skirt around this subject for now, but should keep in mind this important subset of the context free grammars, the one that relates to the most practical class of deterministic push down machines.
10110
We are going to examine a conversion that produces a non deterministic machine, one that a tool such as YACC would refuse, complaining that it does not know which production to use. We start with A and will try to make a leftmost derivation of 10110. As we guess which production to use, we will always substitute for the leftmost non terminal that is still remaining, and we should notice something very nice about the way productions are organised in Chomsky Normal Form.
Leftmost Derivation
A → AB → 1B → 1AA → 1BCA → 10CA
→ 101A → 101AB → 1011B → 10110
As we begin making substitutions for our non terminals in this leftmost-first format, terminals accumulate at the left of the string while the remaining non terminals sit at the right; the resultant string grows from left to right.
For demonstrative purposes, here is another leftmost derivation of the same word, using the same grammar.
A → BC → AAC → ABAC → 1BAC → 10AC
→ 101C → 101CB → 1011B → 10110
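Each arrow above rewrites the leftmost non terminal, so a derivation is easy to check mechanically. Here is a small sketch of such a checker, reusing the GRAMMAR dictionary from the earlier snippet (the helper names are again our own):

```python
def leftmost_successors(grammar, form):
    """All forms reachable by rewriting the leftmost non terminal in `form`."""
    for i, symbol in enumerate(form):
        if symbol in grammar:  # first non terminal found
            return [form[:i] + "".join(body) + form[i + 1:]
                    for body in grammar[symbol]]
    return []  # no non terminals left: the form is a finished string

def check_leftmost(grammar, derivation):
    """Verify each consecutive pair of forms is one legal leftmost step."""
    return all(after in leftmost_successors(grammar, before)
               for before, after in zip(derivation, derivation[1:]))

first  = ["A", "AB", "1B", "1AA", "1BCA", "10CA", "101A", "101AB", "1011B", "10110"]
second = ["A", "BC", "AAC", "ABAC", "1BAC", "10AC", "101C", "101CB", "1011B", "10110"]
assert check_leftmost(GRAMMAR, first) and check_leftmost(GRAMMAR, second)
```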
Whichever way we apply the grammar we still get our string. Both trees were formed with leftmost derivations, and the trees for the two derivations look different, from which we can conclude that this is an ambiguous grammar. Ambiguous grammars are bad for parsers, which do not know what to do at any one moment, so we would try to fix this grammar if at all possible and make it unambiguous. But all that we need to understand for now is the parsing mechanism: that there are choices here, and that we could tighten the grammar so that a wrong choice would not accept the string and only one derivation would remain.
We can see that both our parse trees conclude in the same number of steps; for Chomsky Normal Form this is always twice the length of the string minus one, which for us here is nine steps. Chomsky Normal Form grammars are very easy to control; they all work in this very specific, predictable way. This is clear if we consider that for the five symbols of the target string we must take five steps, each substituting a terminal symbol for a capital letter. We start with a single capital letter, and each double (two capital letter) production adds one more, so reaching five capitals takes four steps: n + (n - 1) = 2n - 1 steps, here 5 + 4 = 9.
We have generated our parse trees just from looking at our simple grammar. A brute force algorithm would, in the worst case, have tried on the order of 2^9 trees to find out whether the desired string can be generated, and in this case it would have found at least these two parse trees that accept the string. We are now going to try to make a non deterministic push down machine that can simulate whatever it was that we did to get our tree. We can definitely write a program to do this, we have just described one, but what we are looking to achieve is a non deterministic machine with only one stack that can do it, rather than a brute force algorithm.
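As a sketch of that brute force program (our own illustration, reusing GRAMMAR and leftmost_successors from the earlier snippets): enumerate leftmost derivations breadth first for at most 2n - 1 steps, pruning any sentential form already longer than the target, which is safe because CNF productions never shrink a form.

```python
def derives(grammar, start, target):
    """Brute force: explore all leftmost derivations, pruning oversized forms."""
    frontier = {start}
    for _ in range(2 * len(target) - 1):           # 2n - 1 steps suffice in CNF
        next_frontier = set()
        for form in frontier:
            for successor in leftmost_successors(grammar, form):
                if successor == target:
                    return True
                if len(successor) <= len(target):  # a form never shrinks in CNF
                    next_frontier.add(successor)
        frontier = next_frontier
    return False

assert derives(GRAMMAR, "A", "10110")
```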
When we think of the way that our left derived string grew as we chose from the available productions, it is very natural to use a stack to hold these letters: we must access each leftmost letter in turn to expand it into the symbols it produces, and in the normal order of the operation that is always the letter most recently placed on the stack. Which is to say that a stack is a very natural data structure for this operation.
Changing A into BC has been shown in full, with the several micro states that are required to make the transition; the other two possible loops for A have been displayed in a more concise form, along with those that depict the consumption, or passing over, of a terminal on the input string or tape.
And then for C, here displayed apart for simplicity...
In a Chomsky Normal Form grammar the stack cannot empty and then grow again, because of the way the productions are structured: either you succeed and accept the string, or you fail. The machine has three primary states, start, parse and accept; whilst in the parse state it passes through the numerous micro states required to apply all of the productions of the grammar, and like so we construct our machine.
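Here is a minimal simulation of that machine's behaviour in Python (our own sketch; the micro states are collapsed into a single replace-or-match step, and backtracking plays the role of the non determinism): the stack starts holding the start symbol, a non terminal on top is replaced by a guessed production body, and a terminal on top must match and consume the next input symbol.

```python
def pda_accepts(grammar, start, text):
    """Simulate the non deterministic PDA built from a CNF grammar.

    A configuration is (input position, stack); the stack top is the
    last list element. Backtracking stands in for non determinism.
    """
    def run(pos, stack):
        if pos + len(stack) > len(text):  # each stack symbol yields >= 1 terminal
            return False
        if not stack:
            return pos == len(text)       # accept: input consumed, stack empty
        top = stack[-1]
        if top in grammar:                # non terminal on top: guess a production
            return any(run(pos, stack[:-1] + list(reversed(body)))
                       for body in grammar[top])
        # terminal on top: it must match the next input symbol
        return pos < len(text) and text[pos] == top and run(pos + 1, stack[:-1])
    return run(0, [start])

assert pda_accepts(GRAMMAR, "A", "10110")
assert not pda_accepts(GRAMMAR, "A", "10111")
```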
Pumping Lemma for Context Free Grammars
We have seen that for finite state machines we have the pumping lemma, a tool which enabled us to show that a string is outside of a regular language, or a specific machine, without using the 'heavy guns' of diagonalisation. It basically says that if you have long enough strings that are supposed to be accepted by your machine, those strings will induce loops within the computation, and therefore parts of them can be pumped, and you must accept all of the pumped strings too. There are lots of sets that do not abide by this property: you can show that no matter how they are split up, there are loops that, pumped enough times, will give you strings that are not in the set. This is how we use the pumping lemma for finite state machines.
Now if we try to use the pumping lemma for push down machines, something goes wrong. If we have some long string that we put through a machine like our previous example, which has only three states, and our string has nine symbols, then we must have a loop in the computation somewhere. Suppose a loop on 010, with 11 before it and 01 after it, such as:
11 010 01
11 010 010 01
Why is it that in a push down machine this loop is not the same as for a finite state machine? The essential difference is that the machine's behaviour now also depends upon the stack: even though we have arrived at the same state, the machine's reaction may not be the same at all, depending upon the symbols previously stored on the stack. This means that we cannot assume a loop in the states just because there is a loop in the string; we would also need to return to the same stack contents as well as the same state.
The way to find the pumping lemma for a push down machine is to look at the grammar and at the parse trees for the loop. The best way to understand what this means is to see an example. We will use the following grammar to describe the string 11100001, and examine the parse tree that the grammar generates in order to produce the string: within that parse tree, following the paths from the terminal nodes back up to the root, the same non terminal will appear twice along a path.
A → BC | 0
B → BA | 1 | CC
C → AB | 0
11100001
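One standard way to test membership mechanically for a grammar in Chomsky Normal Form is the CYK algorithm; it is not part of this discussion, but a sketch of it (with the grammar in the same dictionary encoding as before) confirms that the string really is generated.

```python
# The pumping lemma example grammar, in the same encoding as before.
GRAMMAR2 = {
    "A": [("B", "C"), ("0",)],
    "B": [("B", "A"), ("1",), ("C", "C")],
    "C": [("A", "B"), ("0",)],
}

def cyk(grammar, start, s):
    """CYK membership test: table[i][j] = non terminals deriving s[i:j+1]."""
    n = len(s)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):                      # length 1 substrings
        for head, bodies in grammar.items():
            if (ch,) in bodies:
                table[i][i].add(head)
    for length in range(2, n + 1):                  # longer substrings
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):                   # split point
                for head, bodies in grammar.items():
                    for body in bodies:
                        if (len(body) == 2 and body[0] in table[i][k]
                                and body[1] in table[k + 1][j]):
                            table[i][j].add(head)
    return start in table[0][n - 1]

assert cyk(GRAMMAR2, "A", "11100001")
```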
So long as we have a parse tree with a path of four or more productions along it, we can find a duplication when starting from the terminals and working up to the root: since this grammar has only three non terminals, a path through four of them must repeat one, by the pigeonhole principle.
How long does a string need to be to ensure that there is a path depth of at least four in a CNF grammar? For short strings there may be no duplicates; you have to have a string that is long enough. The string must be at the very least 2^n + 1 symbols long, where n is the number of non terminals, to guarantee a repetition: some exponential number. Had our string contained only four terminals, we could generate it with a parse tree whose paths are only three nodes deep, one node being the terminal and another the root, giving no possibility for any repetition at all.
If our grammar has ten non terminals, then our string would have to be about a thousand symbols long, around 2^10, to ensure that there is a pumping lemma repetition in it. This is not linear as it is for finite state machines; it is exponential.
The way in which the pumping occurs is also different. If we take our marked repetition of A and replace the second A with the tree as it is formed at the first A node, our string grows with this pumping of the formation.
When we replace the lower A branch with the complete A node from above, consider the 10 that is the terminal yield of the lower node being replaced: this same string also exists within the terminals of the new, fuller node. In effect we are doubling the sections of string on either side of this central piece.
11100001
11110000001
Applying the same again, we double this effect on either side of the central piece: a form of the pumping lemma, different to the one we observed for regular expressions.
11111000000001
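A decomposition consistent with the three strings above (the particular pieces below are inferred from the example, not stated in it) is u = 1, v = 1, x = 10, y = 00, z = 01, where x is the yield of the lower A and v, y are the flanks contributed by the upper A; each pump repeats v and y together.

```python
# A decomposition s = u v x y z consistent with the pumped strings above.
# (These particular pieces are inferred from the example, not given in it.)
u, v, x, y, z = "1", "1", "10", "00", "01"

def pump(i):
    """Replace the lower A subtree i times: u v^(i+1) x y^(i+1) z."""
    return u + v * (i + 1) + x + y * (i + 1) + z

assert pump(0) == "11100001"
assert pump(1) == "11110000001"
assert pump(2) == "11111000000001"
```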