Computation and Chomsky Normal Forms
- Relationship between compiling and programming languages
- Chomsky Normal Form
- Every string of length n is derivable in 2n-1 steps.
- Proof of pumping lemma for CFL's.
- Every CFG has an equivalent NDPM.
- CYK O(n^3) algorithm for the membership problem.
<stmt> → <assign>|<if-then>|<if-then-else>|<begin-end>
<if-then> → if <expression> then <stmt>
<if-then-else> → if <expression> then <stmt> else <stmt>
<begin-end> → begin <stmt-list> end
<stmt-list> → <stmt-list><stmt>|<stmt>
<assign> → <ID> := <expression>
<ID> → <letter>|<tmp>
<tmp> → <tmp><letter>|<tmp><digit>|e
Within the above snippet of a grammar there are two lines
that are sometimes displayed using syntax diagram,
distinguishable by the flow of logic using arrows that can form
into loops, these loops are denoting recursion such as the 0*
from our previous grammars. In the above the repartition of
<stmt>
within the <list-stmt>
expansion and also the <tmp>
production, both
are displayed as loops in such a diagram.
CNF Chomsky Normal Form
All productions in CNF are the same, they have a capital letter on the left and a pair of non terminal capital letters on the right, or a capital letter on the left with a single terminal on the right, a double non terminal or a single terminal.
A→BC
D→•
This provides enough flexibility to capture any context free language, taking any context free grammar, it can be converted into a grammar that has this form, if we insist that both non-terminal symbols on the right are terminal, then this becomes a regular set.
Next we will show that any grammar can be turned into a grammar of this form, step by step in an attempt to make it intuitive.
This fraction of a grammar that is not in CNF, what do we need to do
to this grammar to transform it so that it is? S→C
is too
small, DEF
is too big, etc.
S → 0A1B | C | DEF
A → e | CC
Before going any further we can simplify our task by stating that
O→1
and Z→0
and then replacing our 1's and
0's removing this output data from our grammar.
O → 1
Z → 0
S → ZAOB | C | DEF
A → e | CC
Now all of our productions can be reduced into much simpler
categories, we now have only production that are either too long or too
short. To deal with ZAOB
we will split the production into
several groupings, first Z
and AOB
we will
make the latter into a new production assigning it a new letter
M
.
On repeating the process we next target OB
replacing that
with N
.
We need to deal with the e
productions and the unit, or
single productions, such as C
in our grammar, in this
process there are some subtleties, which we will hide initially, these
nuances would if ignored, allow for some particularly sneaky context
free grammar to cause the program to crash or open it up for
exploitation; However the complexity of this issue detracts from the
main idea and as such we will put it aside for the time being.
e ... the empty string
Supposing the following production:
A → ZNONZ
N → e
What can we do to reduce this long production and how can we loose
N
going to the empty string N→e
entirely, all
without loosing any information? What we are going to have to do is to
find every single appearance of N
within the grammar, and
make it disappear, making all of the substitution in advance whilst
leaving all possibilities intact.
We can make both the N
disappear by including unions
with new productions, one production for each form of the grammar both
with and without the e
. Were it that we had 3
N
's in our grammar, then we would require 2^3 new
productions to replace all of the e
terminals, minus the
empty string and the original production, we end up adding 6
productions.
A → ZNONZ | ZONZ | ZNOZ | ZOZ
N → e
The order of operations
- e productions
- unit productions
- long productions
e productions
Highlighting a subtlety
Here is a made up grammar just to highlight a subtlety, it is not an interesting grammar, the fragment of a larger grammar, it does serve to point out this subtlety.
S → 0 | X0 | ZYX
X → Y | e
Y → 1 | X
The only e
production that we can see is
X→Y|e
so on examining our grammar we can see that we must
add O
as a production so as to remove the X
,
in doing this we can see why it is that we do e
productions before unit productions, because they create unit
productions.
S → 0 | XO | ZYZ | O
X → Y | e
Y → 1 | X
We could at this point believe that we are finished, we do not want
to replace the X
in the last production with an
e
production, we should not add extra e
's.
But we are not done, we will have missed information here if we stop,
because X
can go to Y
, and Y
can
go to X
then X
goes to e
, so
Y
can also go to e
!
So the first thing that we must do is to look for both
e
productions as well as these paths towards nullable
e
productions, known as nullable non
terminals.
How do we check for nullable non-terminals, the algorithm or
procedure is this, first look for all the e
productions,
next look for all of the non-terminal symbols that go to e
as one of their productions, such as X
in our example.
From there we then look for productions that contain both
X
's and e
's such as Y
.
e
X, e
Y, X, e
If S
were to go to XY
then we would add
S
to this list. We continue doing this until we finally
reach a stage that gives us the same output as the previous one, at
this point we stop. There is inevitably a stopping point, because the
grammar can only have a finite number of non-terminals.
Once you have all of the symbols that go to e
we can
return to our grammar and start adding productions so as to remove
them. We can add both O
and ZZ
, then we can
remove all of the e
productions.
unit productions
Subtlety in unit productions
In the following grammar, how can we remove B
from the
production of A
? ...
A
goes to
B
so we simply add all of the productions of
B
on to A
.
A → CC | DD | BOZO | CC | ABC | 0
B → BOZO | CC | ABC | 0
We don't need CC
twice, but it does not do any harm whilst
we are reducing our grammar.
Consider this new grammar, if we write a computer program to change this grammar into Chomsky normal form, what do we want to do here?
S → A | 11
A → B | 1
B → S | 0
Taking first the S
which goes to A
, we go
to A
and see that it generates B|1
, we do not
want to add B
to our S
productions, we do not
want to add new unit productions but we will add the 1
terminal symbol.
A
goes to B
so this means that A
can also go to 0
.
B
goes to S
which generates
11|1
.
Something is not right here, S
can be 0
and we did not see this, why is that?
This does not happen often but it occurs when we get these chains in unit productions, your program could run for a year without ever seeing a grammar like this, but when it does, it will give you the wrong Chomsky Normal Form.
So what is the subtlety here, it is similar to that which we have
already seen in nullable productions. When you look at a unit
production, such as the initial S→A
we look forwards to
see if A
has any further possible unit productions, making
a list of all of the single non-terminals that S
can form,
even if this entails more than one step; Where previously we have
stepped backwards from e
null terminals until we reach
S
, now we move forwards though the layers looking for unit
productions. From this search we can make lists of related unit
productions.
S A B
A,B B,S S,A
Approaching the same example again the correct way, by looking ahead before removing any productions we see that we finish up with much more of the same in each and every one of our extended productions, this way we do not miss anything as we have done in out previous attempt.
Analysing our initial grammar, we can ask ourselves 'can we remove any of the aspects of the productions, is any of the grammar superfluous?' and the answer is not at all clear. The same can not be said of our new version of that grammar, all of the productions are useless, we have no way of inducing any production from the within the others.
We have seen already how by removing e
can give unit
productions, now we see that when removing unit productions can produce
useless symbols, symbols that do nothing in the grammar. We can now see
that starting at S
can no longer reach either of
A
or B
at all!
Next we will see how to remove these useless symbols, this procedure must be carried out at both the beginning and the very end of the process, because the procedure itself can generate more useless symbols it is not enough to perform it only once.
Useless Symbols
Symbols are useless when they can not be reached from S
these symbols in our cave analogy, are like rooms that are not
connected to the cave structure that we are in, and can not be
reached.
There is another way that symbols can be useless, if they either
include symbols that do not have productions, or even worse, symbols
that have productions that do not terminate, such as
CC
.
There are two reasons that a terminal can be useless either nothing can reach you or if something can reach you and you do not do anything useful, we need to check for both. It makes a very big difference which ones you check for first.
- Find all the non-terminals that can generate output, keep those.
- Find all non-terminals that can be reached from
S
, keep those.
All of the productions that do not do anything, we get rid of along
with all of the productions in which they appear. Supposing a
production BC
we would have to get rid of that, because it
is tied to the C
, B
is made useless by its
connection to C
.
There are two reasons that a terminal can be useless either nothing can reach it or it does nothing at all when it does.
example
Both S
and A
can generate something, we keep
both, B
does nothing so we can remove it, along with all
of its productions and any that it appears in.
S
can not reach any non-terminal, so we need only keep
S
and we can remove A
because it cannot be
reached.
So this grammar reduces to S
goes to 0
,
all of the other symbols are useless.
Returning to the previous method, so as to highlight this very difficult to predict subtlety. Approaching it in the opposite order, as most folk do approach this sort of problem, starting from the beginning rather than going to the ends first.
Why not first check which symbols we can get to, and then check whether they can actually terminate?
We examine our grammar to see whether all symbols can be reached, and
they can, as such we include both, next we check which non terminals
can generate something, S
can A
can and
B
can't, so we now get rid of AB
and we think
that we are done, but we have made a mistake, A
is useless
and unreachable.
The reason here is subtle, it is possible to reach parts of the grammar
going forwards, such as B
that we later on can not reach
due to the removal of a part of the grammar because of a useless
terminal.