Context Free Languages
There are essentially two things we will be looking at here, and instead of starting with the machines we will start with the grammars. We have already seen right and left linear grammars when looking at finite state machines; now we will look at grammars in relation to context free languages. The bigger picture from here is Turing machines, which we will reach at a later stage, but for now we will refine the notion of context free languages into two parts. The inner part, closer to finite state machines, consists of the machines that are deterministic, called deterministic push down machines or DPDMs. On the outside are the non deterministic push down machines.
Context free languages are equivalent to non deterministic push down machines. Deterministic push down machines have less power than non deterministic ones and recognise a subset of the context free languages: a very special subset whose grammars are difficult to describe easily, but which is also very important, the LR(k) grammars.
LR(k) grammars are the grammars that most compilers are built from; programming languages are described by LR(k) grammars. The implication is that if you have an LR(k) grammar then a compiler is easy to build around it, whereas for a more general context free grammar a compiler is more difficult to construct: you need determinism to build a compiler.
For finite state machines, determinism and non determinism are the same thing in disguise; for context free languages they are distinctly different, the deterministic languages forming a genuine subset.
The definition of a context free grammar is essentially that it is any grammar where the left side of every production is a single non terminal symbol. There is a single capital letter on the left of the arrow, whereas the right side is completely unrestricted.
Linear grammars are more restricted than context free grammars: they are context free, with a single non terminal on the left, but the right side is constrained. In a left linear grammar, productions have the form non terminal followed by terminal; in a right linear grammar, terminal followed by non terminal. They can not be anything else, whereas a context free grammar can have anything at all on the right.
Grammars are very powerful things, and it is very hard to predict exactly what they are doing. There are a lot of techniques for designing grammars, a lot of techniques for understanding them, and a lot of techniques for making the connection between grammars and parsing.
S→0S1|e
S can be replaced by 0S1:
S→0S1→00S11→0011
This grammar generates 0011. It also generates 0^n1^n, any string of 0's followed by the same number of 1's, giving us the language:
{0^n1^n | n >= 0}
Here we find the first language we considered when investigating the pumping lemma, where we proved that no finite state machine accepts it. So immediately, with a context free grammar that is unrestricted on the right, as every context free grammar is, we have generated a set that is definitely not regular, demonstrating that we are no longer within the domain of finite state machines.
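As a small illustration (a Python sketch; the function names are my own), the derivation above can be mechanised, and membership in {0^n1^n | n >= 0} can be tested directly by stripping matched ends:

```python
def generate(n):
    """Apply S -> 0S1 n times, then S -> e, mirroring S→0S1→00S11→0011."""
    s = "S"
    for _ in range(n):
        s = s.replace("S", "0S1")
    return s.replace("S", "")

def in_language(w):
    """Recognise 0^n 1^n directly by stripping matched 0...1 pairs."""
    while w.startswith("0") and w.endswith("1"):
        w = w[1:-1]
    return w == ""

print(generate(2), in_language("0011"), in_language("0101"))  # 0011 True False
```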
Balanced Parentheses Strings
The following grammar may be thought of as a generator of certain strings.
S→0S1|SS|e
Consider the following string, a balanced parentheses string where 0 is an open parenthesis and 1 is a close parenthesis: could our grammar have generated it?
How does this grammar work? If you have any S then you may put a 0 on the left of it and a 1 on the right: this is the rule about putting an open and a close around something that is already balanced. The other rule of parentheses is that if you have two balanced strings, you can concatenate them together.
00100110110011
We can use a strategy of pairing the 0's and 1's, making certain that each occurrence of a 0 or 1 has a corresponding partner satisfying the grammar's requirements.
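That pairing strategy can be phrased as the standard counter check: a 0/1 string is balanced exactly when no prefix has more 1's than 0's and the totals come out equal. A minimal Python sketch (the function name is my own):

```python
def balanced(w):
    """0 is an open paren, 1 a close paren. The string is generated by
    S -> 0S1 | SS | e iff the depth never goes negative and ends at zero."""
    depth = 0
    for c in w:
        depth += 1 if c == "0" else -1
        if depth < 0:          # a close paren with nothing open
            return False
    return depth == 0

print(balanced("00100110110011"))  # True: the example string is derivable
```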
Considering the ordering of the grammar: if we only ever used the first production, 0S1, we would never succeed in parsing this string, because the nesting depth returns to zero in the middle, so the whole string is not an open parenthesis wrapped around one balanced string. We must consider all the productions to parse the string fully.
As we step through the string, at each point there is essentially one choice of production: we must begin with SS, and then in both halves that follow we must use 0S1. The only freedom is which S to expand first, the left or the right. Once that choice has been made, the same pattern is typically used throughout the derivation; this is what is known as a leftmost or rightmost derivation of the grammar.
S→SS→S0S1→0S10S1
Is the same thing as:
S→SS→0S1S→0S10S1
Another way of writing the same thing is as a tree. When we write it like this, the question of leftmost or rightmost derivation becomes moot, and there is only one tree for a given structure. However, there can still be different trees for the same string: when we expand S into SS, there is ambiguity as to which part of the string should go to which S. This ambiguity can cause differences in semantic meaning when dealing with programming languages.
This tree is known as a parse tree; it represents a derivation of a particular string. If more than one parse tree for the same string is derivable from a grammar, the grammar is ambiguous.
A grammar is said to be ambiguous if any string in the language has two or more parse trees. The grammar is unambiguous only if every single string in the language has its own unique parse tree, or, equivalently, a unique leftmost derivation.
Our current grammar is highly ambiguous, the SS production in particular. We could replace any one of the S's in our parse trees with an SS in which one branch leads to e, and any branch created where there is an SS can be split in either of two ways, which differ whenever the productions that follow the left and right S are not exactly the same.
It is pertinent to highlight now that the question "is an arbitrary grammar unambiguous?" is not decidable. If a grammar has a discernible ambiguity, a string with two parse trees, that proves it ambiguous; but there is no general procedure that decides ambiguity for every grammar. The problem is undecidable.
Another Grammar
This is a famous grammar.
S→S+S|S*S|0|1|2|…|9
The + and the * are terminal symbols, as are the digits 0 through 9; this grammar generates strings over the twelve symbol alphabet {+, *, 0, 1, …, 9}.
Considering the string 3+4*5, what parse trees can this string generate?
This grammar generates arithmetic expressions with the integer values 0 through 9 separated by + and * symbols; it is part of any language that lets you have expressions on the right hand side of assignment statements.
Once again we can easily see that our grammar is ambiguous: the string can be produced by different parse trees.
The reason that this is a problem is that a compiler, when it reads these parse trees, can arrive at different semantic meanings depending on which tree it uses. In the case of our two parse trees, they evaluate to 23 and 35 respectively, which is clearly problematic if we are depending upon the result of the calculation. There is no notion of precedence in the grammar as we have defined it.
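To make the stakes concrete, here is a small Python sketch (hand-built trees rather than a parser, with names of my own) of the two parse trees for 3+4*5 and what each one evaluates to:

```python
def eval_tree(t):
    """Evaluate a parse tree written as nested tuples (op, left, right)."""
    if isinstance(t, int):
        return t
    op, left, right = t
    l, r = eval_tree(left), eval_tree(right)
    return l + r if op == "+" else l * r

tree_a = ("+", 3, ("*", 4, 5))   # the tree shaped like 3+(4*5)
tree_b = ("*", ("+", 3, 4), 5)   # the tree shaped like (3+4)*5
print(eval_tree(tree_a), eval_tree(tree_b))  # 23 35
```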
In mathematics we fall back on a precedence convention when such ambiguity occurs; some use the acronym PEMDAS as a memory aid for it (Parentheses, Exponents, Multiplication and Division, Addition and Subtraction). By that convention it is the parse tree that evaluates to 23 that we would expect to be used and consider the 'correct' reading.
By adding two more symbols to our alphabet we can resolve this issue.
S→(S+S)|(S*S)|0|1|2|…|9
Our string 3+4*5 is no longer a legitimate string in this language. (3+(4*5)) would be, as would ((3+4)*5), each with its own unique parse tree.
This resolves our ambiguity issue. The point here is to understand that we can have many different grammars for the same language, some ambiguous and others not.
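Because the fully parenthesised grammar is unambiguous, a simple recursive descent parser can both check and evaluate a string in a single deterministic pass. A Python sketch (the structure is my own; each call to expr consumes exactly one S):

```python
def parse(s):
    """Parse S -> (S+S) | (S*S) | 0 | ... | 9 and return the expression's value."""
    def expr(i):
        # returns (value, index just past this S)
        if s[i].isdigit():
            return int(s[i]), i + 1
        if s[i] != "(":
            raise SyntaxError(f"expected digit or '(' at position {i}")
        left, i = expr(i + 1)
        op = s[i]
        if op not in "+*":
            raise SyntaxError(f"expected '+' or '*' at position {i}")
        right, i = expr(i + 1)
        if s[i] != ")":
            raise SyntaxError(f"expected ')' at position {i}")
        return (left + right if op == "+" else left * right), i + 1

    value, end = expr(0)
    if end != len(s):
        raise SyntaxError("trailing input")
    return value

print(parse("(3+(4*5))"), parse("((3+4)*5)"))  # 23 35
```

Note that the unparenthesised 3+4*5 is rejected, exactly as the grammar demands.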
Building Grammars
How do we construct grammars to solve problems? Building grammars tends to be more challenging than building finite state machines. Where machines are an iterative process that can be stepped through, grammars can be far more elusive to pin down. One style of grammar formation is to build the grammar recursively. Another style is, as with machines, to give each non terminal a semantic meaning and keep that meaning consistent.
- Recursive
- Semantic Meaning Non-terminals
Equal 0's and 1's
Accept strings that have an equal quantity of 0's and 1's in any order. This is definitely not a regular set, but we can make a context free grammar that generates these strings, being careful to verify both that the grammar creates every single string with an equal number of 0's and 1's and that it rejects all strings that do not, checking both the positive and the negative cases.
Semantic Meaning
We have two choices for the first symbol: to start with a 0 or to start with a 1. If we start with a 0, let us call the semantic state that follows A; if we start with a 1, another state called B; and of course we could also go straight to empty. Next we look at state A and all the possibilities we could reach from there, and likewise for B.
S = even
A = I owe you a 1
B = I owe you a 0
S→ 0A | 1B | e
A→ 1S | 0AA
B→ 0S | 1BB
This is how we create a grammar in this style: thinking of the semantic meaning of the non terminals and then defining them in terms of each other.
If we consider the nature of our grammar we might ask whether or not it is ambiguous. The clue to the answer, which is that the grammar is ambiguous, is in the double A and double B in the productions for A and B: in both cases, either copy can be satisfied by any number of different strings, so long as each settles the debt of one extra 1 or one extra 0. We need to be careful with grammars, as it is really easy to have the wrong instinct about them; they are really very complex, harder to understand than state machines. The recursive way of conceiving a grammar for this language is far simpler:
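A quick, brute-force way to gain confidence in a grammar like this is to enumerate everything it derives up to some length and compare against the intended set. A minimal Python sketch (the search procedure and the encoding of the productions are my own):

```python
from itertools import product

# Productions of S -> 0A | 1B | e, A -> 1S | 0AA, B -> 0S | 1BB
RULES = {"S": ["0A", "1B", ""], "A": ["1S", "0AA"], "B": ["0S", "1BB"]}

def derivable(limit):
    """All terminal strings of length <= limit derivable from S, found by an
    exhaustive search over sentential forms (expanding the leftmost non-terminal)."""
    seen, out, todo = {"S"}, set(), ["S"]
    while todo:
        form = todo.pop()
        i = next((k for k, c in enumerate(form) if c in RULES), None)
        if i is None:            # no non-terminals left: a generated string
            out.add(form)
            continue
        for rhs in RULES[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # every A or B still owes at least one terminal, so prune forms
            # that can no longer shrink to <= limit characters
            if sum(c != "S" for c in new) <= limit and new not in seen:
                seen.add(new)
                todo.append(new)
    return out

limit = 6
expected = {"".join(p) for n in range(limit + 1)
            for p in product("01", repeat=n)
            if p.count("0") == p.count("1")}
assert derivable(limit) == expected   # the grammar matches equal-0s-and-1s up to length 6
print("ok")
```

This checks both directions at once: every equal-count string up to the bound is generated, and nothing else is.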
Recursive
Equal 0's and 1's
Here is another grammar for the same language. This time we have not proven that it works; that would have to be done with induction, as the grammar is recursive, but we can try to develop intuition about it.
S→ SAB | e
A→ 0S1 | e
B→ 1S0 | e
Can it generate the string 00011011? It can, but we quickly see that the grammar is extremely ambiguous: there are in fact infinitely many derivations of this string, since S→SAB can be applied any number of times with the surplus A's and B's erased to e.
So we have another equivalent grammar. It would be really nice if we had a way to decide whether two grammars are equivalent, but this is absolutely undecidable. One of the few things we can decide about a context free grammar is whether or not it generates any string at all, and we cannot even decide that for a Turing machine!
The language 0^n1^n0^n has no context free grammar that describes it. We will prove this later on; for now we will just accept it. We will see that a context sensitive grammar can very easily express this language, and that such grammars look very much like machines.
If you can think of the functionality of a machine, you can also think of a context dependent grammar that will describe it; this does not work for context free grammars. In exploring this connection between machines and grammars we will come to understand the connection between computation and grammars. The relationship between context free grammars and machines is complicated, so this example serves as a good introduction to it.
Imagine a shooting gallery at a funfair, with a plastic duck that travels back and forth for you to shoot at, left and right. This is what our grammar is doing: L and R are the left and right extremities, D is the duck, and A, B and C are the non terminals that will turn into the required output, the A's being the 0's on the left, the B's the 1's, and the C's the 0's on the right.
Context dependent grammar
0^n1^n0^n
S→L D A B C R
The duck moves from left to right, examining and acting upon each character as it does so. Upon arriving at the end of the line at R it turns back towards L, and at the end of a run all the A's can turn into 0's, the B's into 1's and the C's into 0's to generate the output for that derivation. The grammar can only produce output if the duck makes it all the way back to the beginning, having read an equal number of A's, B's and C's. The duck is thought of as the movement of a machine, as if reading a tape and acting on that tape depending upon what is read: the tape LDA would be read by the duck and rewritten as LAAD, which is to say that the duck, upon reading the first A of a run, produces AA as it passes by.
LDA→LAAD
The notion of this piece of tape is very similar to that of a Turing machine: the tape has a state, and as the duck moves along it alters this state. The left side LDA is not a single symbol, as it was in all our previous grammars; this is not context free, it is context sensitive. The A cannot be rewritten on its own; it is the presence of the LD in front of the A that gives rise to the modification. This is a context sensitive grammar.
In effect a context free grammar applies a rule to a non terminal wherever it appears, regardless of its surroundings, whereas a contextual grammar lets the rule depend on the neighbouring symbols. This gives you the ability to generate far more varied scenarios; it allows you to build machines.
The next production for our machine is for ADA: what the duck does here is pass over the A, giving AAD, without changing a thing; again the D, the duck, behaves just like a machine head. The next possible case is ADB, from which our duck generates another B, giving ABBD. If we are between two B's then again nothing changes and the D just moves along, and so on…
LDA→LAAD
ADA→AAD
ADB→ABBD
BDB→BBD
BDC→BCCD
CDC→CCD
DR→ER
When we get to the end of the string and encounter DR, the duck turns around to return to the start, and on its return side it is painted E. So we now need a group of productions for our E painted duck that take it from the right back to the left, returning to the start.
CE→EC
BE→EB
AE→EA
LE→LD
A→0
B→1
C→0
R→e
LD→e
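The whole derivation can be simulated as string rewriting. A Python sketch follows (the rule encoding and control flow are my own; for a new pass to be able to start, the sketch has the returning duck turn back into D at the left wall, i.e. it uses LE→LD as the final return rule):

```python
PASS_RULES = [
    ("LDA", "LAAD"), ("ADA", "AAD"), ("ADB", "ABBD"), ("BDB", "BBD"),
    ("BDC", "BCCD"), ("CDC", "CCD"), ("DR", "ER"),
    ("CE", "EC"), ("BE", "EB"), ("AE", "EA"),
    ("LE", "LD"),   # the duck bounces off the left wall and heads right again
]
FINISH_RULES = [("LD", ""), ("A", "0"), ("B", "1"), ("C", "0"), ("R", "")]

def step(s):
    """Apply the first pass rule that matches; return the new string and the rule used."""
    for lhs, rhs in PASS_RULES:
        i = s.find(lhs)
        if i >= 0:
            return s[:i] + rhs + s[i + len(lhs):], lhs
    raise RuntimeError("stuck: no rule applies")

def derive(passes):
    """Run the duck back and forth `passes` times, then terminate.
    Each round trip adds one A, one B and one C, so the result is
    0^n 1^n 0^n with n = passes + 1."""
    s = "LDABCR"                       # S -> L D A B C R
    for _ in range(passes):
        rule = None
        while rule != "LE":
            s, rule = step(s)
    for lhs, rhs in FINISH_RULES:      # LD -> e, A -> 0, B -> 1, C -> 0, R -> e
        s = s.replace(lhs, rhs)
    return s

print(derive(0), derive(1), derive(2))  # 010 001100 000111000
```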
Our grammar does not deal with the empty string itself, but this issue can be resolved very simply: all that we need do is make a union with a grammar that does.
S→L D A B C R | e
Context free grammars are closed under union: if we take any context free grammar and join it with another, giving a new start symbol whose productions go to the two old start symbols, the result is always another context free grammar. This is the simplest operation that we can possibly perform on two grammars.
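The union construction can be sketched concretely. Representing a grammar as a dict from non-terminal to a list of alternatives, each alternative a tuple of symbols (an encoding of my own choosing), the whole construction is one new start production:

```python
# Non-terminal names of the two grammars are assumed to be disjoint;
# rename them first if they clash.
g1 = {"S1": [("0", "S1", "1"), ()]}                 # generates {0^n 1^n}
g2 = {"S2": [("0", "S2", "1"), ("S2", "S2"), ()]}   # balanced parentheses

def union(ga, sa, gb, sb, start="S"):
    """New start symbol with one production to each old start symbol."""
    return {start: [(sa,), (sb,)], **ga, **gb}

g = union(g1, "S1", g2, "S2")
print(sorted(g))  # ['S', 'S1', 'S2']
```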
0^n1^n0^p
Considering the above language descriptor, 0 to the n, 1 to the n, followed by any number of 0's, we can take two context free grammars and concatenate them together. From this we can deduce that context free languages are also closed under concatenation.
S→AB
A→0A1|e 0^n1^n
B→0B|e 0*
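Representing a grammar as a dict from non-terminal to alternatives (each alternative a tuple of symbols; the encoding is my own), the standard concatenation construction for this language is a single new start production deriving the first grammar's start symbol followed by the second's:

```python
# Non-terminal names of the two grammars are assumed to be disjoint.
g1 = {"A": [("0", "A", "1"), ()]}   # generates 0^n 1^n
g2 = {"B": [("0", "B"), ()]}        # generates 0*
concat = {"S": [("A", "B")], **g1, **g2}   # S -> AB: first language, then second
print(concat["S"])  # [('A', 'B')]
```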
To show that a given grammar generates a given language, you have to show that there exists a parse tree for every string in the language. You also have to show that the grammar cannot generate a parse tree for any string that is not in the language.
It does not matter that the grammar may generate so many trees that you cannot do anything with the sheer quantity; as far as a Turing machine would be concerned, the redundant derivations are just bad computations, and we do not care about them.
0^q1^m0^m
Considering the above language descriptor, any number of 0's followed by an equal number of 1's and 0's, we split it into two parts, making one non terminal for the 0* prefix and another for the equal number of 1's and 0's, as we have already seen.
S→AB
A→0A|e
B→1B0|e
For the intersection of the above two language descriptions, what they have in common is another language, one that we have already explored: 0^n1^n0^n. We could not describe this with a context free grammar; it required a context sensitive grammar to generate it.
This language is the intersection of two context free languages, so we can conclude that context free languages are not closed under intersection. It follows that they are also not closed under complement, because they are always closed under union: if they were closed under both union and complement then, by De Morgan's laws, they would also be closed under intersection.
Context free languages are closed under concatenation and under union, and not closed under complement or intersection. This does not mean that the complement of a context free language is definitely not context free; it might be, or it might not. You can never say in general whether the complement of a context free language is going to be context free; some are, but closure alone cannot determine it.
The complement of 0^n1^n is context free, and the complement of the palindromes is context free. The language of strings made of two equal halves, ww, is not context free, yet its complement, the strings whose two halves are not equal, is context free.