Computation and Chomsky Normal Forms

Relationship between compiling and programming languages
Chomsky Normal Form

Every string of length n is derivable in 2n-1 steps.
Proof of pumping lemma for CFL's.
Every CFG has an equivalent NDPM.
CYK O(n^3) algorithm for the membership problem.


        <stmt> → <assign>|<if-then>|<if-then-else>|<begin-end>

     <if-then> → if <expression> then <stmt>

<if-then-else> → if <expression> then <stmt> else <stmt>

   <begin-end> → begin <stmt-list> end

   <stmt-list> → <stmt-list><stmt>|<stmt>

      <assign> → <ID> := <expression>

          <ID> → <letter>|<tmp>

         <tmp> → <tmp><letter>|<tmp><digit>|e

Within the above snippet of a grammar there are two lines that are sometimes displayed using syntax diagram, distinguishable by the flow of logic using arrows that can form into loops, these loops are denoting recursion such as the 0* from our previous grammars. In the above the repartition of <stmt> within the <list-stmt> expansion and also the <tmp> production, both are displayed as loops in such a diagram.

CNF Chomsky Normal Form

All productions in CNF are the same, they have a capital letter on the left and a pair of non terminal capital letters on the right, or a capital letter on the left with a single terminal on the right, a double non terminal or a single terminal.


					A→BC

					D→•

This provides enough flexibility to capture any context free language, taking any context free grammar, it can be converted into a grammar that has this form, if we insist that both non-terminal symbols on the right are terminal, then this becomes a regular set.

Next we will show that any grammar can be turned into a grammar of this form, step by step in an attempt to make it intuitive.

This fraction of a grammar that is not in CNF, what do we need to do to this grammar to transform it so that it is? S→C is too small, DEF is too big, etc.


					S → 0A1B | C | DEF

					A → e | CC

Before going any further we can simplify our task by stating that O→1 and Z→0 and then replacing our 1's and 0's removing this output data from our grammar.


					O → 1

					Z → 0

					S → ZAOB | C | DEF

					A → e | CC

Now all of our productions can be reduced into much simpler categories, we now have only production that are either too long or too short. To deal with ZAOB we will split the production into several groupings, first Z and AOB we will make the latter into a new production assigning it a new letter M.

On repeating the process we next target OB replacing that with N.

We need to deal with the e productions and the unit, or single productions, such as C in our grammar, in this process there are some subtleties, which we will hide initially, these nuances would if ignored, allow for some particularly sneaky context free grammar to cause the program to crash or open it up for exploitation; However the complexity of this issue detracts from the main idea and as such we will put it aside for the time being.

e ... the empty string

Supposing the following production:


					A → ZNONZ

					N → e

What can we do to reduce this long production and how can we loose N going to the empty string N→e entirely, all without loosing any information? What we are going to have to do is to find every single appearance of N within the grammar, and make it disappear, making all of the substitution in advance whilst leaving all possibilities intact.

We can make both the N disappear by including unions with new productions, one production for each form of the grammar both with and without the e. Were it that we had 3 N's in our grammar, then we would require 2^3 new productions to replace all of the e terminals, minus the empty string and the original production, we end up adding 6 productions.


					A → ZNONZ | ZONZ | ZNOZ | ZOZ

					N → e

The order of operations

e productions
unit productions
long productions

e productions

Highlighting a subtlety

Here is a made up grammar just to highlight a subtlety, it is not an interesting grammar, the fragment of a larger grammar, it does serve to point out this subtlety.


					S → 0 | X0 | ZYX

					X → Y | e

					Y → 1 | X

The only e production that we can see is X→Y|e so on examining our grammar we can see that we must add O as a production so as to remove the X, in doing this we can see why it is that we do e productions before unit productions, because they create unit productions.


					S → 0 | XO | ZYZ | O

					X → Y | e

					Y → 1 | X

We could at this point believe that we are finished, we do not want to replace the X in the last production with an e production, we should not add extra e's. But we are not done, we will have missed information here if we stop, because X can go to Y, and Y can go to X then X goes to e, so Y can also go to e!

So the first thing that we must do is to look for both e productions as well as these paths towards nullable e productions, known as nullable non terminals.

How do we check for nullable non-terminals, the algorithm or procedure is this, first look for all the e productions, next look for all of the non-terminal symbols that go to e as one of their productions, such as X in our example. From there we then look for productions that contain both X's and e's such as Y.


					e

					X, e

					Y, X, e

If S were to go to XY then we would add S to this list. We continue doing this until we finally reach a stage that gives us the same output as the previous one, at this point we stop. There is inevitably a stopping point, because the grammar can only have a finite number of non-terminals.

Once you have all of the symbols that go to e we can return to our grammar and start adding productions so as to remove them. We can add both O and ZZ, then we can remove all of the e productions.

a
grammar with the e terminals crossed out

unit productions

Subtlety in unit productions

In the following grammar, how can we remove B from the production of A? ...
A goes to B so we simply add all of the productions of B on to A.

a
grammar with a single production crossed out


					A → CC | DD | BOZO | CC | ABC | 0

					B → BOZO | CC | ABC | 0

We don't need CC twice, but it does not do any harm whilst we are reducing our grammar.

Consider this new grammar, if we write a computer program to change this grammar into Chomsky normal form, what do we want to do here?


					S → A | 11
					
					A → B | 1

					B → S | 0

Taking first the S which goes to A, we go to A and see that it generates B|1, we do not want to add B to our S productions, we do not want to add new unit productions but we will add the 1 terminal symbol.

A goes to B so this means that A can also go to 0.

B goes to S which generates 11|1.

Something is not right here, S can be 0 and we did not see this, why is that?

This does not happen often but it occurs when we get these chains in unit productions, your program could run for a year without ever seeing a grammar like this, but when it does, it will give you the wrong Chomsky Normal Form.

So what is the subtlety here, it is similar to that which we have already seen in nullable productions. When you look at a unit production, such as the initial S→A we look forwards to see if A has any further possible unit productions, making a list of all of the single non-terminals that S can form, even if this entails more than one step; Where previously we have stepped backwards from e null terminals until we reach S, now we move forwards though the layers looking for unit productions. From this search we can make lists of related unit productions.

					
S      A      B

A,B    B,S    S,A

Approaching the same example again the correct way, by looking ahead before removing any productions we see that we finish up with much more of the same in each and every one of our extended productions, this way we do not miss anything as we have done in out previous attempt.

Analysing our initial grammar, we can ask ourselves 'can we remove any of the aspects of the productions, is any of the grammar superfluous?' and the answer is not at all clear. The same can not be said of our new version of that grammar, all of the productions are useless, we have no way of inducing any production from the within the others.

We have seen already how by removing e can give unit productions, now we see that when removing unit productions can produce useless symbols, symbols that do nothing in the grammar. We can now see that starting at S can no longer reach either of A or B at all!

Next we will see how to remove these useless symbols, this procedure must be carried out at both the beginning and the very end of the process, because the procedure itself can generate more useless symbols it is not enough to perform it only once.

Useless Symbols

Symbols are useless when they can not be reached from S these symbols in our cave analogy, are like rooms that are not connected to the cave structure that we are in, and can not be reached.

There is another way that symbols can be useless, if they either include symbols that do not have productions, or even worse, symbols that have productions that do not terminate, such as CC.

a
grammar with production crossed out and another that does not
terminate

There are two reasons that a terminal can be useless either nothing can reach you or if something can reach you and you do not do anything useful, we need to check for both. It makes a very big difference which ones you check for first.

Find all the non-terminals that can generate output, keep those.
Find all non-terminals that can be reached from S, keep those.

All of the productions that do not do anything, we get rid of along with all of the productions in which they appear. Supposing a production BC we would have to get rid of that, because it is tied to the C, B is made useless by its connection to C.

removing useless productions from a grammar

There are two reasons that a terminal can be useless either nothing can reach it or it does nothing at all when it does.

example

Both S and A can generate something, we keep both, B does nothing so we can remove it, along with all of its productions and any that it appears in.

the same grammar with a useless production crossed out

S can not reach any non-terminal, so we need only keep S and we can remove A because it cannot be reached.

the same grammar with a useless symbol crossed out

So this grammar reduces to S goes to 0, all of the other symbols are useless.

Returning to the previous method, so as to highlight this very difficult to predict subtlety. Approaching it in the opposite order, as most folk do approach this sort of problem, starting from the beginning rather than going to the ends first.

Why not first check which symbols we can get to, and then check whether they can actually terminate?

We examine our grammar to see whether all symbols can be reached, and they can, as such we include both, next we check which non terminals can generate something, S can A can and B can't, so we now get rid of AB and we think that we are done, but we have made a mistake, A is useless and unreachable.

The reason here is subtle, it is possible to reach parts of the grammar going forwards, such as B that we later on can not reach due to the removal of a part of the grammar because of a useless terminal.