<sheaf title="Concepts of Programming Languages"
subtitle="Parsing, Interpretation, Compilation, Type Systems, and Programming Paradigms"
author="Andrei Lapets"
authorlink="http://lapets.io"
>
<section title="Introduction, Background, and Motivation">
<subsection title="Overview">
<text><![CDATA[
A large variety of modern technologies, from mobile devices and personal computers to datacenters and entire infrastructures, are programmable. These technologies are controlled, operated, maintained, and monitored using a variety of interfaces and languages, which might take the form of graphical user interfaces, libraries, APIs, embedded languages, domain-specific languages, and general-purpose programming languages.
]]></text>
<text><![CDATA[
Many of these interfaces and languages are defined and designed according to principles and conventions that have been developed over the last several decades.
]]></text>
<text><![CDATA[
There are three major themes around which this course is organized. Each of the concepts, examples, and problems discussed in this course will relate to one or more of these themes:
]]></text>
<orderedlist>
<item title="Defining and working with programming languages"><![CDATA[
We will define and become familiar with the basic concepts that underlie the definition and design of common programming languages, and we will learn to identify and implement the common components found in static analysis tools, interpreters, and compilers.
]]></item>
<item title="Programming paradigms"><![CDATA[
You should now be familiar with the imperative and object-oriented programming paradigms. In this course, we will provide a high-level picture of the landscape of different programming paradigms, and will focus in particular on the declarative and functional programming paradigms. We will learn to identify problems that can be solved more efficiently, more concisely, and with greater generality using programming languages that support the functional and declarative paradigms. This includes gaining experience taking advantage of features such as comprehensions, algebraic data types, static type checking, and higher-order functions, and techniques such as recursion.
]]></item>
<item title="Languages as solutions"><![CDATA[
The purpose of a programming language is to provide a data representation and storage format, a communication protocol, or a control interface that can be used across time and space, between different humans and/or devices. Thus, as with any language, any programming language can be viewed as a solution to a representation, communication, or control problem. We will learn to identify when a programming language may be an appropriate solution for a problem, what trade-offs should be considered when deciding whether to design a point solution or a language, and which mathematical concepts and software tools are appropriate for doing so.
]]></item>
</orderedlist>
<text><![CDATA[
Throughout this course, we use formal mathematical conventions and notations developed by the community of computer scientists and mathematicians to communicate with one another about the definitions and implementations of programming languages. Some of these notations are used throughout mathematics and computer science, while others are more obscure and are used primarily by the community of programming language theorists. Throughout your career, you will be required to learn a variety of new notations for a variety of new technologies. This course provides an opportunity to <i>practice learning new notations</i> quickly. This is a valuable skill to possess even if your area of expertise lies or will lie outside the topics covered in this course.
]]></text>
</subsection>
<subsection title="Background and prerequisites">
<paragraph title="Basic logic" hooks="math"><![CDATA[
It is expected that the reader is familiar with the basic concepts of mathematical logic, including formulas, predicates, relational operators, and terms.
]]></paragraph>
<paragraph title="Basic set theory" hooks="math"><![CDATA[
It is expected that the reader is familiar with the notion of a set (both finite and infinite), set size, and the set comprehension notation, e.g.,
\begin{eqnarray}
{1,2,3,4} is a set
\end{eqnarray}
\begin{eqnarray}
{ %x | %x \in \Z, %x > 0 } & = & {1, 2, 3, ...}
\end{eqnarray}
and the membership and subset relations between elements and sets, e.g.,
\begin{eqnarray}
1 & \in & { 1, 2, 3 } \\
{2, 3} & \subset & { 1, 2, 3 }
\end{eqnarray}
]]></paragraph>
<paragraph title="Programming using imperative and object-oriented languages"><![CDATA[
It is expected that the reader is familiar with at least one imperative, object-oriented language, such as Java, C++, or Python.
]]></paragraph>
</subsection>
</section>
<section title="Defining Programming Languages">
<text><![CDATA[
In order to define and reason about programming languages, and in order to write automated tools and algorithms that can operate on programs written using programming languages, we must be able to define formally (i.e., mathematically) what a programming language is and what a program is.
]]></text>
<subsection title="Sets of character strings">
<text><![CDATA[
In computer science, one common way to mathematically model a formal language is to introduce a finite set of symbols (an <i>alphabet</i>). A <i>language</i> is then any subset of the set of all strings consisting of symbols from that alphabet.
]]></text>
<definition required="true" hooks="math" id="95e72ad05e66427281549720c9ed975f"><![CDATA[
An <i>alphabet</i> is a finite set %A. We will call elements %a \in %A of the set <i>characters</i>. The typical alphabet we will use in this course is the set of 128 <a href="http://en.wikipedia.org/wiki/Ascii">ASCII</a> characters. However, any finite set can be an alphabet.
]]></definition>
<definition required="true" hooks="math" id="3822c49489b942f59671177f1ada70ca"><![CDATA[
Given an alphabet %A, a <i>character string</i> or <i>string</i> is any ordered finite sequence of characters from that alphabet. We will denote the empty string (containing no characters) using the symbol \varepsilon, and we will denote the length of a string %s using the notation |%s|.
]]></definition>
<example required="true" hooks="math" id="daa30c4fc2be458baf5bff14cd632cd9"><![CDATA[
The set %A = {@A,@B,@C} is an alphabet. @A@A@A, @A@B, @B, @C@B@A, and \varepsilon are all examples of strings built using the alphabet %A.
]]></example>
<definition required="true" hooks="math" id="d1e9fd2dc4724ef58c57cd881340e32d"><![CDATA[
Given an alphabet %A, let %U be the set of all finite strings that can be built using characters from %A (including the empty string, which we will call \varepsilon). In other words, we can say:
\begin{eqnarray}
%U & = & { %s | %s is a finite string of characters from alphabet %A }
\end{eqnarray}
Any subset %L \subset %U is a <i>language</i>. That is, any set of character strings (whether the set is finite or infinite) is a language.
]]></definition>
<example required="true" hooks="math" id="a32c619802234e94a307d71f676e3186"><![CDATA[
The set %A = {@X, @Y, @Z} is an alphabet. The set of strings { \varepsilon, @X, @Y, @Z } is one example of a language. The infinite set of strings { \varepsilon, @X, @X@X, @X@X@X, @X@X@X@X, ... } is also a language. The infinite set of strings { @Y@Z, @Y@Z@Y@Z, @Y@Z@Y@Z@Y@Z, ... } is also a language.
]]></example>
<example required="true" hooks="math" id="fc5990b5b76040539c931b1ca279e222"><![CDATA[
The set %A = {@0, @1, @2, @3, @4, @5, @6, @7, @8, @9} is an alphabet. The following set of strings can be a language for representing positive integers between 1 and 9999 (inclusive):
\begin{eqnarray}
%L & = & { %s | %s is a string of characters from %A, %s does not begin with @0, 1 \leq |%s| \leq 4 }
\end{eqnarray}
]]></example>
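<example>
<text><![CDATA[
As a rough illustration (not part of the formal definition), membership in the language of the previous example can be checked with a short Python function. The names <code>DIGITS</code> and <code>in_language</code> are hypothetical and were chosen only for this sketch.
]]></text>
<code class="py"><![CDATA[
```python
DIGITS = set("0123456789")  # the alphabet from the previous example

def in_language(s):
    """Return True if s has 1 to 4 digits and does not begin with 0."""
    return (
        1 <= len(s) <= 4
        and all(c in DIGITS for c in s)
        and not s.startswith("0")
    )

print(in_language("42"))     # True
print(in_language("0042"))   # False: begins with 0
print(in_language("12345"))  # False: more than four characters
print(in_language(""))       # False: the empty string is excluded
```
]]></code>
<text><![CDATA[
Because the language is finite, the function could in principle be replaced by an explicit set of 9999 strings; the predicate is simply a shorter way to write down the same set.
]]></text>
</example>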
</subsection>
<subsection title="Regular expressions">
<text hooks="math"><![CDATA[
If languages are merely subsets of the set of all finite character strings that can be built using the characters in an alphabet, then we can always specify any finite subset (and thus language) by simply writing down every string in the language. However, what if we want to write down the definition of that subset using a shorter representation, or we want to specify an infinitely large subset of character strings (e.g., all strings consisting of one or more copies of the character <b>a</b>: {@a, @a@a, @a@a@a, ...})?
]]></text>
<paragraph><![CDATA[
Regular expressions are a notation for precisely and concisely defining certain sets of character strings.
]]></paragraph>
<definition required="true" hooks="math" id="06defac822974dd688904a07cf4f6f18">
<text><![CDATA[
Given an alphabet %A, a regular expression over %A can contain any of the characters in %A as well as any number of the following special characters:
]]></text>
<unorderedlist>
<item><![CDATA[the "or" symbol <b>|</b> consisting of a single vertical bar;]]></item>
<item><![CDATA[the grouping symbols <b>(</b> and <b>)</b>, which must always be balanced the same way that parentheses must be balanced in arithmetic expressions;]]></item>
<item><![CDATA[the <b>+</b> and <b>*</b> symbols.]]></item>
</unorderedlist>
<text><![CDATA[
The subset of character strings that a regular expression defines can be determined in the following way:
]]></text>
<unorderedlist>
<item><![CDATA[a regular expression %r that contains no special characters defines a set of strings containing just one string: {%r};]]></item>
<item><![CDATA[if a regular expression %r_1 defines a set %S_1 and a regular expression %r_2 defines a set %S_2, then the regular expression %r_1 <b>|</b> %r_2 defines the set of character strings %S_1 \cup %S_2;]]></item>
<item><![CDATA[if a regular expression %r defines a set %S, then the regular expression <b>(</b>%r<b>)</b> defines the set of character strings %S (i.e., there is no difference);]]></item>
<item><![CDATA[if a regular expression %r defines a set %S, then the regular expression %r<b>+</b> defines the set of character strings %S' that consists of all possible concatenations of one or more strings from %S;]]></item>
<item><![CDATA[if a regular expression %r defines a set %S, then the regular expression %r<b>*</b> defines the set of character strings %S' that consists of all possible concatenations of one or more strings from %S, as well as the empty string.]]></item>
</unorderedlist>
</definition>
<example required="true" hooks="math" id="1deeccf7a0e5426db9bd875b3b727e49">
<text><![CDATA[
Assume that our alphabet consists of all alphanumeric characters. For each of the following regular expressions, describe verbally what set of character strings it defines:
]]></text>
<unorderedlist>
<item><![CDATA[@a@b@c@d]]></item>
<item><![CDATA[<b>(</b>@a@b@c@d<b>)</b>+]]></item>
<item><![CDATA[<b>red | green | blue</b>]]></item>
<item><![CDATA[<b>(</b><b>red | green | blue</b><b>)</b>+]]></item>
<item><![CDATA[<b>(0+) | (1+)</b>]]></item>
<item><![CDATA[<b>(0 | 1)</b>+]]></item>
</unorderedlist>
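<text><![CDATA[
One way to check a verbal description is to try a few strings. The sketch below compares the last two expressions in the list using Python's <code>re.fullmatch</code>; the variable names <code>same_digit</code> and <code>any_bits</code> are hypothetical labels introduced only for this illustration.
]]></text>
<code class="py"><![CDATA[
```python
import re

# The last two expressions from the list, written in Python syntax.
same_digit = re.compile(r"(0+)|(1+)")  # one or more 0s, or one or more 1s
any_bits = re.compile(r"(0|1)+")       # any non-empty mix of 0s and 1s

for s in ["000", "11", "0101", ""]:
    print(repr(s), bool(same_digit.fullmatch(s)), bool(any_bits.fullmatch(s)))
# '000' True True
# '11' True True
# '0101' False True
# '' False False
```
]]></code>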
</example>
<example required="true" hooks="math" id="98592789a35440a8ac6d28bbc97e0302">
<text><![CDATA[
Assume that our alphabet consists of all alphanumeric characters. For each of the following verbal descriptions of sets of character strings, find a regular expression that defines the described set:
]]></text>
<unorderedlist>
<item><![CDATA[the set of character strings that consist entirely of vowels;]]></item>
<item><![CDATA[the set of all integers (negative and positive);]]></item>
<item><![CDATA[the set of all arithmetic expressions involving binary numbers, addition, and subtraction;]]></item>
<item><![CDATA[the set of all words that begin with a vowel and end with a vowel;]]></item>
<item><![CDATA[the set of all numbers of even length.]]></item>
</unorderedlist>
</example>
<example required="true" hooks="math" id="791effdeab954f668592fa5324aff657">
<text><![CDATA[
In practice, programming languages that provide libraries of functions and procedures for working with regular expressions also support other special characters. For example, Python regular expressions may contain some of the following special characters:
]]></text>
<unorderedlist>
<item><![CDATA[the <code>\(</code>, <code>\)</code> special characters make it possible to include parentheses in expressions in a way that does not cause them to be interpreted as regular expression grouping symbols;]]></item>
<item><![CDATA[the special symbol <code>\s</code> matches a single whitespace character;]]></item>
<item><![CDATA[the special symbol <code>[0-9]</code> matches a single numeric digit;]]></item>
<item><![CDATA[the special symbol <code>[a-z]</code> matches a single lowercase letter;]]></item>
<item><![CDATA[the special symbol <code>[A-Z]</code> matches a single uppercase letter;]]></item>
<item><![CDATA[the special symbol <code>[a-zA-Z]</code> matches a single letter;]]></item>
<item><![CDATA[the special symbol <code>[a-zA-Z0-9]</code> matches a single alphanumeric character.]]></item>
</unorderedlist>
</example>
<example required="true" id="71f7125be5124059a51f065d1950e6e8">
<text><![CDATA[
In the Python programming language, the <code>re</code> module provides functionality for automatically checking whether a string matches a particular regular expression. In order to check whether a string exactly matches a regular expression, it is necessary to wrap the regular expression in parentheses and then add special markers to ensure that the regular expression matches from the beginning of the string to the end of the string.
]]></text>
<paragraph><![CDATA[
If a match succeeds, a match object is returned that contains additional information (e.g., the position of the match); otherwise, <code>None</code> is returned.
]]></paragraph>
<code class="py"><![CDATA[
>>> import re
>>> re.match(r"^(a|b|c)$", "a") # Succeeds.
<re.Match object; span=(0, 1), match='a'>
>>> print(re.match(r"^(a|b|c)$", "def")) # Fails.
None
>>> re.match(r"^((red|green|blue)+)$", "redgreenblueredblue")
<re.Match object; span=(0, 19), match='redgreenblueredblue'>
>>> re.match(r"^([a-zA-Z0-9]+)$", "redgreenblueredblue")
<re.Match object; span=(0, 19), match='redgreenblueredblue'>
>>> print(re.match(r"^([a-zA-Z0-9]+)$", "!@#$"))
None
]]></code>
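<text><![CDATA[
In Python 3.4 and later, <code>re.fullmatch</code> offers an alternative to adding the markers by hand: it succeeds only if the entire string matches. The sketch below wraps it in a hypothetical helper named <code>matches_exactly</code> (the name is ours, not part of the <code>re</code> module). One difference worth knowing is that <code>$</code> also matches just before a newline at the very end of the string, whereas <code>re.fullmatch</code> does not accept such a trailing newline.
]]></text>
<code class="py"><![CDATA[
```python
import re

def matches_exactly(pattern, s):
    """Return True if the whole string s matches the regular expression."""
    return re.fullmatch(pattern, s) is not None

print(matches_exactly(r"a|b|c", "a"))                         # True
print(matches_exactly(r"a|b|c", "def"))                       # False
print(matches_exactly(r"(red|green|blue)+", "redgreenblue"))  # True

# "$" tolerates one trailing newline; fullmatch does not.
print(re.match(r"^(a|b|c)$", "a\n") is not None)              # True
print(matches_exactly(r"a|b|c", "a\n"))                       # False
```
]]></code>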
</example>
<fact required="true" hooks="math" id="174fccbd604846ad92d764daf2393aa3">
<text><![CDATA[
Regular expressions are not powerful enough to describe many common languages in which we may be interested. Examples of sets of character strings that cannot be defined using regular expressions include:
]]></text>
<unorderedlist>
<item><![CDATA[the set of arithmetic expressions with balanced parentheses;]]></item>
<item><![CDATA[the set of all palindromes;]]></item>
<item><![CDATA[the set of all strings in which the same string is repeated exactly two times in a row.]]></item>
</unorderedlist>
<text><![CDATA[
For languages such as the above, a more powerful tool for describing sets of character strings is needed.
]]></text>
</fact>
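<example>
<text><![CDATA[
Deciding membership in such languages is not difficult for a program; the limitation lies in the notation. As an informal sketch, the hypothetical functions below check whether the parentheses in a string are balanced and whether a string is a palindrome. The first needs a counter that can grow without bound and the second needs to compare the string with its reverse, and neither facility is available among the regular expression operators defined above.
]]></text>
<code class="py"><![CDATA[
```python
def balanced(s):
    """Return True if the parentheses in s are balanced."""
    depth = 0  # number of "(" seen so far that are still unmatched
    for c in s:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "(" before it
                return False
    return depth == 0

def is_palindrome(s):
    """Return True if s reads the same forwards and backwards."""
    return s == s[::-1]

print(balanced("(1+(2*3))"))  # True
print(balanced("(1+2))("))    # False
print(is_palindrome("abcba")) # True
```
]]></code>
</example>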
</subsection>
<subsection title="Sets of token sequences">
<text><![CDATA[
Unlike human languages, programming languages usually have a relatively small collection of symbol strings (e.g., commands or instructions) that are used to construct programs. Thus, we can adjust the definition of what constitutes a language to account for this.
]]></text>
<definition required="true" hooks="math" id="fe3a9fbf541445de977ab5df0e57621f"><![CDATA[
Given an alphabet %A, a <i>token</i> is a finite, non-empty (usually short) string of characters from that alphabet.
]]></definition>
<definition required="true" hooks="math" id="8f78a2be462e425da6c87339850e7f8b"><![CDATA[
Given a set of tokens %T, let %U be the set of all finite sequences that can be built using tokens from %T (including the empty sequence, which we will call \varepsilon). In other words, we can say:
\begin{eqnarray}
%U & = & { %s | %s is a finite sequence of tokens from %T }
\end{eqnarray}
Any subset %L \subset %U is a <i>language</i>. That is, any set of token sequences (whether the set is finite or infinite) is a language.
]]></definition>
<example required="true" hooks="math" id="06d4c9d409dc44ad8c1dccbef3412163"><![CDATA[
Consider the following set of tokens:
\begin{eqnarray}
%T & = & { <b>true</b>, <b>false</b>, <b>or</b>, <b>and</b>, <b>not</b>, <b>(</b>, <b>)</b>, <b>,</b> }
\end{eqnarray}
The set of token sequences that represent valid boolean formulas is a language:
\begin{eqnarray}
%L & = & { & %~ & <b>or ( false , and ( true , false ) )</b>, \\
& %~ & & %~ & <b>and ( true , false )</b>, \\
& %~ & & %~ & <b>not</b> <b>(</b> <b>false</b> <b>)</b>, \\
& %~ & & %~ & <b>true</b>, \\
& %~ & & %~ & <b>false</b>, \\
& %~ & & %~ & ... }
\end{eqnarray}
]]></example>
<text><![CDATA[
If a language is just a subset of the set of all possible token sequences, how do we mathematically specify interesting subsets?
]]></text>
</subsection>
<subsection title="Language syntax and Backus-Naur Form (BNF) notation">
<text><![CDATA[
If a language is just a set of finite token strings, then defining a language amounts to defining this set. How can we define this set? By introducing a collection of rules or constraints governing how characters and/or tokens can be assembled to form strings or sequences in the language.
]]></text>
<definition required="true" hooks="math" id="7e82906541a4467faea16807f380dd44"><![CDATA[
Given a token set %T and a language %L consisting of finite sequences of tokens from %T, the <i>syntax</i> of %L is the set of rules defining <i>exactly</i> which token sequences are in %L.
These rules are sometimes also called <i>syntactic rules</i>, a <i>formal grammar</i>, or simply a <i>grammar</i>.
]]></definition>
<text><![CDATA[
There are many possible ways by which we could represent syntactic rules, and these rules can be classified, or stratified, according to their expressive power. A more extensive coverage of this topic is beyond the scope of this course, and is normally presented in courses on the theory of computation and automata. In this course, we will focus on two particular kinds of grammar: regular grammars and context-free grammars. The most common representation for such grammars is Backus-Naur Form, abbreviated henceforward as <i>BNF</i>.
]]></text>
<definition required="true" hooks="math" id="2075c983aff44402b4e5467cdd450d53"><![CDATA[
A grammar definition consists of one or more <i>productions</i> (or <i>production rules</i>). Each production has the following form:
\begin{eqnarray}
<i>non-terminal</i> & ::= & <i>terminal_or_non-terminal</i> ... <i>terminal_or_non-terminal</i> \\
& | & <i>terminal_or_non-terminal</i> ... <i>terminal_or_non-terminal</i> \\
& \vdots & \\
& | & <i>terminal_or_non-terminal</i> ... <i>terminal_or_non-terminal</i>
\end{eqnarray}
The left-hand side (to the left of the ::= symbol) in each production is called a <i>non-terminal</i>. The right-hand side of each production is an unordered collection of <i>choices</i> separated by the | symbol. Each choice is a <i>sequence</i> in which every element is either a non-terminal (which must appear on the left-hand side of some production in the grammar) or a <i>terminal</i> (a terminal is a token).
]]></definition>
<text><![CDATA[
The production rules in a grammar's BNF representation can be viewed both as a way to construct an element (i.e., a token sequence that is a program) in the language, and as a way to break down a token sequence piece by piece until nothing is left.
]]></text>
<example required="true" hooks="math" id="16147b7c6f904109b9900322be8ad838"><![CDATA[
Let %T be a token set:
\begin{eqnarray}
%T & = & { <b>true</b> }
\end{eqnarray}
The following is a very simple programming language that contains only a single possible token sequence consisting of the single token <b>true</b>:
\begin{eqnarray}
<i>program</i> & ::= & <b>true</b>
\end{eqnarray}
In this case, the language is finite and small, so we can actually write it down as a set:
\begin{eqnarray}
%L & = & { <b>true</b> }
\end{eqnarray}
]]></example>
<example required="true" hooks="math" id="b8f92b53cb7449998112583773b3d97c"><![CDATA[
We can extend the language by adding another token:
\begin{eqnarray}
%T & = & { <b>true</b>, <b>false</b> }
\end{eqnarray}
The following BNF grammar definition now contains two choices (each choice is a sequence consisting of a single terminal):
\begin{eqnarray}
<i>program</i> & ::= & <b>true</b> \\
& | & <b>false</b>
\end{eqnarray}
This programming language now contains two token sequences:
\begin{eqnarray}
%L & = & { <b>true</b>, <b>false</b> }
\end{eqnarray}
]]></example>
<example required="true" hooks="math" id="9380e58ddbde4fe88ddb2daff5bd4617"><![CDATA[
We can extend the language definition further:
\begin{eqnarray}
%T & = & { <b>true</b>, <b>false</b>, <b>or</b>, <b>and</b>, <b>not</b>, <b>(</b>, <b>)</b>, <b>,</b> }
\end{eqnarray}
The following BNF grammar definition now contains five choices (each choice is a sequence consisting of non-terminals and terminals):
\begin{eqnarray}
<i>program</i> & ::= & <b>true</b> \\
& | & <b>false</b> \\
& | & <b>and (</b> <i>program</i> <b>,</b> <i>program</i> <b>)</b> \\
& | & <b>or (</b> <i>program</i> <b>,</b> <i>program</i> <b>)</b> \\
& | & <b>not (</b> <i>program</i> <b>)</b>
\end{eqnarray}
This programming language now contains infinitely many finite token sequences:
\begin{eqnarray}
%L & = & { & %~ & <b>or ( false , and ( true , false ) )</b>, \\
& %~ & & %~ & <b>and ( true , false )</b>, \\
& %~ & & %~ & <b>not</b> <b>(</b> <b>false</b> <b>)</b>, \\
& %~ & & %~ & <b>true</b>, \\
& %~ & & %~ & <b>false</b>, \\
& %~ & & %~ & ... }
\end{eqnarray}
]]></example>
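<example>
<text><![CDATA[
Reading the productions as construction rules suggests a direct way to enumerate part of the language. The sketch below is one possible illustration: the hypothetical function <code>programs</code> returns every token sequence (written as a space-separated string) that can be built using at most a given number of nested operators.
]]></text>
<code class="py"><![CDATA[
```python
def programs(depth):
    """Return the set of programs with at most `depth` nested operators."""
    if depth == 0:
        return {"true", "false"}  # the two choices without non-terminals
    smaller = programs(depth - 1)
    result = set(smaller)
    for p in smaller:
        result.add("not ( " + p + " )")
        for q in smaller:
            result.add("and ( " + p + " , " + q + " )")
            result.add("or ( " + p + " , " + q + " )")
    return result

print(len(programs(0)))  # 2
print(len(programs(1)))  # 12
print("or ( false , and ( true , false ) )" in programs(2))  # True
```
]]></code>
<text><![CDATA[
The sets grow quickly as the depth increases and no depth exhausts the language, which is one way to see that the language is infinite even though every token sequence in it is finite.
]]></text>
</example>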
<example required="true" hooks="math" id="8c1203dd65bb49868abc64ad5353725f"><![CDATA[
Let us consider another example: a language of non-negative integers.
\begin{eqnarray}
%T & = & { <b>0</b>, <b>1</b>, <b>2</b>, <b>3</b>, <b>4</b>, <b>5</b>, <b>6</b>, <b>7</b>, <b>8</b>, <b>9</b> }
\end{eqnarray}
We can define the following grammar:
\begin{eqnarray}
<i>number</i> & ::= & <b>0</b> | <b>1</b> | <b>2</b> | <b>3</b> | <b>4</b> | <b>5</b> | <b>6</b> | <b>7</b> | <b>8</b> | <b>9</b> \\
& | & <b>1</b> <i>number</i> | <b>2</b> <i>number</i> | <b>3</b> <i>number</i> | <b>4</b> <i>number</i> \\
& | & <b>5</b> <i>number</i> | <b>6</b> <i>number</i> | <b>7</b> <i>number</i> | <b>8</b> <i>number</i> | <b>9</b> <i>number</i>
\end{eqnarray}
However, in any number with more than one digit, the above allows a <b>0</b> to appear only as the last digit: it can produce <b>10</b>, but not <b>100</b> or <b>105</b>. One way to fix this (there are many other ways) is to introduce more productions into the grammar:
\begin{eqnarray}
<i>nozero</i> & ::= & <b>1</b> | <b>2</b> | <b>3</b> | <b>4</b> | <b>5</b> | <b>6</b> | <b>7</b> | <b>8</b> | <b>9</b> \\
<i>digit</i> & ::= & <b>0</b> | <i>nozero</i> \\
<i>digits</i> & ::= & <i>digit</i> | <i>digit</i> <i>digits</i> \\
<i>number</i> & ::= & <i>digit</i> | <i>nozero</i> <i>digits</i>
\end{eqnarray}
]]></example>
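<example>
<text><![CDATA[
Both grammars in the previous example happen to describe sets that regular expressions can also define, which gives a quick way to experiment with them. The two patterns below are our own translation of the productions into Python regular expressions (an assumption of this sketch, not something stated in the example), and the names <code>FIRST</code>, <code>SECOND</code>, and <code>derivable</code> are hypothetical.
]]></text>
<code class="py"><![CDATA[
```python
import re

# First grammar: any run of non-zero digits followed by one final digit.
FIRST = re.compile(r"[1-9]*[0-9]")
# Second grammar: number ::= digit | nozero digits
SECOND = re.compile(r"[0-9]|[1-9][0-9]+")

def derivable(pattern, s):
    """Return True if the whole string s matches the compiled pattern."""
    return pattern.fullmatch(s) is not None

for s in ["7", "100", "105", "007"]:
    print(s, derivable(FIRST, s), derivable(SECOND, s))
# 7 True True
# 100 False True
# 105 False True
# 007 False False
```
]]></code>
</example>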
<text>
Note that the grammar may contain multiple productions. Any non-terminal defined in a production can appear on the right-hand side of any of the productions.
</text>
<example required="true" hooks="math" id="19dc317f45ec4ae18018ca5b17fce114"><![CDATA[
We can extend the language by adding a production for statements, and allowing a program to be a sequence of one or more statements.
\begin{eqnarray}
%T & = & { <b>true</b>, <b>false</b>, <b>or</b>, <b>and</b>, <b>not</b>, <b>(</b>, <b>)</b>, <b>,</b>, <b>print</b>, <b>skip</b>, <b>;</b> }
\end{eqnarray}
The following is the new grammar definition for this language.
\begin{eqnarray}
<i>program</i> & ::= & <i>statement</i> \\
& | & <i>statement</i> <i>program</i> \\
<i>statement</i> & ::= & <b>print</b> <i>formula</i> <b>;</b> \\
& | & <b>skip</b> <b>;</b> \\
<i>formula</i> & ::= & <b>true</b> \\
& | & <b>false</b> \\
& | & <b>not (</b> <i>formula</i> <b>)</b> \\
& | & <b>and (</b> <i>formula</i> <b>,</b> <i>formula</i> <b>)</b> \\
& | & <b>or (</b> <i>formula</i> <b>,</b> <i>formula</i> <b>)</b>
\end{eqnarray}
]]></example>
</subsection>
</section>
<section title="Parsing">
<text><![CDATA[
Given a programming language definition, we want to have the ability to operate on programs written in that language using a computer. To do so, we need to convert the character string representations of programs in that programming language into instances of a data structure; each data structure instance would then be a representation of the program as data.
]]></text>
<subsection title="Concrete and abstract syntaxes">
<definition required="true" id="e601deb568ed46a1a1d741907a6dcfa9"><![CDATA[
Given an alphabet, token set, and grammar definition (e.g., represented using BNF notation), we define the <i>concrete syntax</i> to be the set of all character strings that conform to the grammar definition. We call a particular character string that conforms to the grammar definition a <i>concrete syntax instance</i>.
]]></definition>
<definition required="true" id="db9e874ec6b6475a93bc3ef4db0f2066"><![CDATA[
For a particular programming language definition, we define the <i>abstract syntax</i> to be the set of all data structure instances that correspond to a character string that conforms to the grammar definition for that language. An instance of the abstract syntax is sometimes called a <i>parse tree</i>.
]]></definition>
<example required="true" id="4ba33545333f48e88baddfa4507a1db0">
<text hooks="math"><![CDATA[
Consider again the language that conforms to the following grammar:
\begin{eqnarray}
<i>program</i> & ::= & <b>true</b> | <b>false</b> | <b>not (</b> <i>program</i> <b>)</b> \\
& | & <b>and (</b> <i>program</i> <b>,</b> <i>program</i> <b>)</b> | <b>or (</b> <i>program</i> <b>,</b> <i>program</i> <b>)</b>
\end{eqnarray}
The following is the character string of one possible program in the language. This character string is an instance of the concrete syntax of the language.
]]></text>
<code class="py"><![CDATA[
and (or (and (true, false), not(false)), or (true, false))
]]></code>
<text><![CDATA[
The above character string might be converted into a structured representation of the program within some host language (i.e., the programming language being used to operate on these programs: checking for errors, interpreting, or compiling the program). Below, we present one possible Python representation of the instance of the abstract syntax (i.e., the parse tree) corresponding to the concrete syntax instance above. This representation uses nested Python dictionaries and lists to represent the parse tree, with strings being used to represent node labels and leaves.
]]></text>
<code class="py"><![CDATA[
{ "And": [
    { "Or": [
        { "And": ["True","False"]},
        { "Not": ["False"]}
      ]
    },
    { "Or": ["True","False"]}
  ]
}
]]></code>
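<text><![CDATA[
To make the correspondence between the two syntaxes concrete, the following sketch walks a parse tree in this representation and rebuilds a character string from it. The function name <code>unparse</code> and the choice of spacing in its output are assumptions made for this illustration.
]]></text>
<code class="py"><![CDATA[
```python
def unparse(tree):
    """Convert a parse tree (nested dicts, lists, strings) into a string."""
    if isinstance(tree, str):  # a leaf: "True" or "False"
        return tree.lower()
    # An internal node is a dictionary with exactly one key: its label.
    [(label, children)] = tree.items()
    arguments = ", ".join(unparse(child) for child in children)
    return label.lower() + " (" + arguments + ")"

tree = {"And": [
    {"Or": [{"And": ["True", "False"]}, {"Not": ["False"]}]},
    {"Or": ["True", "False"]},
]}
print(unparse(tree))
# and (or (and (true, false), not (false)), or (true, false))
```
]]></code>
<text><![CDATA[
The result differs from the original character string only in whitespace (<code>not (false)</code> rather than <code>not(false)</code>). The parse tree does not record such details, which is one sense in which it is more abstract than the character string.
]]></text>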
</example>
</subsection>
<subsection title="Lexing (a.k.a., tokenizing) and parsing">
<definition required="true" id="3f625b68bc8a426c80fe1c774e04c362"><![CDATA[
A <i>lexer</i> or <i>tokenizer</i> is an algorithm that converts an instance of the concrete syntax of a language (i.e., a character string that conforms to the grammar definition for the language) into a sequence of tokens.
]]></definition>
<example required="true" id="2c7cc0e69813454eafc7128e5ac57055">
<text hooks="math"><![CDATA[
Consider again the language that conforms to the following grammar:
\begin{eqnarray}
<i>program</i> & ::= & <b>true</b> | <b>false</b> | <b>not (</b> <i>program</i> <b>)</b> \\
& | & <b>and (</b> <i>program</i> <b>,</b> <i>program</i> <b>)</b> | <b>or (</b> <i>program</i> <b>,</b> <i>program</i> <b>)</b>
\end{eqnarray}
The following Python implementation of a tokenizing algorithm for this language uses regular expressions to split a character string into a sequence of individual tokens from the token set.
]]></text>
<code class="py"><![CDATA[
import re
def tokenize(s):
    # Use a regular expression to split the string into
    # tokens or sequences of one or more whitespace characters.
    tokens = re.split(r"(\s+|true|false|and|or|not|,|\(|\))", s)
    # Throw out the whitespace and empty strings and return the result.
    return [t for t in tokens if not t.isspace() and not t == ""]
]]></code>
<text><![CDATA[
Below is an example input and output.
]]></text>
<code class="py"><![CDATA[
>>> tokenize("and (or (and (true, false), not(false)), or (true, false))")
['and', '(', 'or', '(', 'and', '(', 'true', ',', 'false', ')', ',', 'not',
'(', 'false', ')', ')', ',', 'or', '(', 'true', ',', 'false', ')', ')']
]]></code>
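<text><![CDATA[
One property of the splitting approach is worth seeing concretely: any text that lies between recognized tokens is kept as a piece of the result rather than rejected. The sketch below repeats the tokenizer and adds a hypothetical helper, <code>unrecognized</code>, that lists the pieces that are not in the token set. Both the helper and the <code>TOKENS</code> constant are assumptions made for illustration.
]]></text>
<code class="py"><![CDATA[
```python
import re

# The token set for the grammar above.
TOKENS = {"true", "false", "not", "and", "or", ",", "(", ")"}

def tokenize(s):
    # Same splitting approach as the tokenizer above.
    parts = re.split(r"(\s+|true|false|and|or|not|,|\(|\))", s)
    return [t for t in parts if not t.isspace() and not t == ""]

def unrecognized(tokens):
    # Hypothetical helper: every piece that is not in the token set.
    return [t for t in tokens if t not in TOKENS]

print(unrecognized(tokenize("and (true, false)")))
print(unrecognized(tokenize("and (tru, false)")))
```
]]></code>
<text><![CDATA[
A caller could use such a check to report a lexical error before the token sequence is handed to a parser.
]]></text>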
</example>
<definition required="true" id="3c944fff48f246aba9409f76e48ccef4"><![CDATA[
A <i>parser</i> is an algorithm that converts a token sequence into an instance of the abstract syntax (i.e., a parse tree).
]]></definition>
<text><![CDATA[
The tokenizer and parser are then composed to transform a character string into a parse tree.
]]></text>
<diagram id="a928f667dad647659519275ac46bbe29"><![CDATA[
<table class="container">
<tr>
<td class="box" style="background-color:powderblue;">character string<br/>(concrete syntax)</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:lightyellow;">lexer/<br/>tokenizer</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:powderblue;">token<br/>sequence</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:lightyellow;">parser</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:powderblue;">parse tree<br/>(abstract syntax)</td>
</tr>
</table>
]]></diagram>
<text><![CDATA[
Often, the tokenizer and parser are together called a <i>parser</i>. In situations where this can cause confusion, we will refer to the actual process that converts token sequences into parse trees as the <i>parsing algorithm</i>.
]]></text>
<diagram id="60d3952f533a4306a9c3167d6a33bae1"><![CDATA[
<table class="container">
<tr>
<td class="box" style="background-color:powderblue;">character string<br/>(concrete syntax)</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:#EFEFEF;">
<table class="container">
<tr>
<td class="box" style="background-color:lightyellow;">lexer/<br/>tokenizer</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:powderblue;">token<br/>sequence</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:lightyellow;">parsing<br/>algorithm</td>
</tr>
</table>
<br/>parser
</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:powderblue;">parse tree<br/>(abstract syntax)</td>
</tr>
</table>
]]></diagram>
<text><![CDATA[
The BNF representation of a grammar can be converted into a parsing algorithm that turns a token sequence into an abstract syntax data structure instance (i.e., a parse tree). How easily this can be done depends on the properties of the grammar.
]]></text>
<fact required="true" id="47254608df414ace8d04c630a2b15689"><![CDATA[
Given a BNF representation of a grammar, if for every production in the grammar, each choice begins with a unique terminal (i.e., a terminal that is <i>not</i> the first terminal in any other choice within that production), then we say the grammar is in <i>LL(1)</i> form, and we can implement a <i>predictive recursive descent</i> parsing algorithm to parse any token sequence that conforms to this grammar (note that these are only <i>sufficient</i> conditions for the grammar to be in LL(1) form; less restrictive conditions also exist).
]]></fact>
<text><![CDATA[
A predictive recursive descent parser can effectively run in linear time; it decomposes the token sequence from left to right while assembling a parse tree.
]]></text>
<example required="true" id="2cd418f2876c42d59a57e34bd6288f22">
<text hooks="math"><![CDATA[
Consider again the language that conforms to the following grammar:
\begin{eqnarray}
<i>program</i> & ::= & <b>true</b> | <b>false</b> | <b>not (</b> <i>program</i> <b>)</b> \\
& | & <b>and (</b> <i>program</i> <b>,</b> <i>program</i> <b>)</b> | <b>or (</b> <i>program</i> <b>,</b> <i>program</i> <b>)</b>
\end{eqnarray}
The following Python implementation of a predictive recursive descent parsing algorithm for this language builds a parse tree using the nested dictionary representation seen in <a href="#4ba33545333f48e88baddfa4507a1db0">a previous example</a>. This recursive algorithm takes a single argument: a sequence of tokens. It returns two results: a parse tree, and the remainder of the token sequence. <b>Note that the order in which the choices in the production are being handled is not determined by the order of the choices in the production.</b> The choices in a production are unordered; any parser implementation that captures all the possible choices conforms to the grammar definition.
]]></text>
<code class="py"><![CDATA[
def parse(tokens):
    # The first token selects exactly one choice, so the
    # branches are mutually exclusive.
    if tokens[0] == 'true':
        return ('True', tokens[1:])
    elif tokens[0] == 'false':
        return ('False', tokens[1:])
    elif tokens[0] == 'not' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ')':
            return ({'Not':[e1]}, tokens[1:])
    elif tokens[0] == 'or' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ',':
            (e2, tokens) = parse(tokens[1:])
            if tokens[0] == ')':
                return ({'Or':[e1,e2]}, tokens[1:])
    elif tokens[0] == 'and' and tokens[1] == '(':
        (e1, tokens) = parse(tokens[2:])
        if tokens[0] == ',':
            (e2, tokens) = parse(tokens[1:])
            if tokens[0] == ')':
                return ({'And':[e1,e2]}, tokens[1:])
]]></code>
<text><![CDATA[
Below is an example input and output. Notice that no tokens are left in the token sequence once the result is returned.
]]></text>
<code class="py"><![CDATA[
>>> import json
>>> (tree, tokens) =\
parse(tokenize(\
"and (or (and (true, false), not(false)), or (true, false))"\
)\
)
>>> print(json.dumps(tree, sort_keys=True, indent=2))
{
"And": [
{
"Or": [
{
"And": [
"True",
"False"
]
},
{
"Not": [
"False"
]
}
]
},
{
"Or": [
"True",
"False"
]
}
]
}
>>> print(tokens)
[]
]]></code>
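<text><![CDATA[
The composition of tokenizer and parser described earlier can be sketched as a single function. The wrapper name <code>parse_program</code> is hypothetical, and the parser shown here is a condensed copy of the predictive parser above (the <b>and</b> and <b>or</b> choices share a branch). The wrapper assumes that malformed input surfaces either as a leftover token sequence, as a <code>None</code> result, or as an <code>IndexError</code> or <code>TypeError</code> raised by the unchecked indexing and unpacking in the parser.
]]></text>
<code class="py"><![CDATA[
```python
import re

def tokenize(s):
    parts = re.split(r"(\s+|true|false|and|or|not|,|\(|\))", s)
    return [t for t in parts if not t.isspace() and not t == ""]

def parse(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])
    if tokens[0] == 'false':
        return ('False', tokens[1:])
    if tokens[0] == 'not' and tokens[1] == '(':
        (e1, rest) = parse(tokens[2:])
        if rest[0] == ')':
            return ({'Not': [e1]}, rest[1:])
    if tokens[0] in ('and', 'or') and tokens[1] == '(':
        label = tokens[0].capitalize()  # 'and' -> 'And', 'or' -> 'Or'
        (e1, rest) = parse(tokens[2:])
        if rest[0] == ',':
            (e2, rest) = parse(rest[1:])
            if rest[0] == ')':
                return ({label: [e1, e2]}, rest[1:])
    # No choice matched: fall through and return None.

def parse_program(s):
    # Hypothetical wrapper: returns a parse tree, or None if the
    # string is not a complete program in the language.
    try:
        r = parse(tokenize(s))
    except (IndexError, TypeError):
        return None
    if r is None:
        return None
    (tree, rest) = r
    # A complete program leaves no tokens behind.
    return tree if len(rest) == 0 else None

print(parse_program("or (true, not (false))"))
print(parse_program("true false"))
```
]]></code>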
</example>
<fact required="true" id="95cb72f5b4d24ddba2c8081c7b42618a"><![CDATA[
If we relax the conditions on the grammar definition that make it LL(1) by dropping the requirement that the first terminal in each sequence within a production must be unique, we can no longer use a predictive recursive descent parsing algorithm to parse a language corresponding to this grammar, because the first terminal in each sequence within a production no longer uniquely determines which choice within a production should be used to continue parsing a token sequence. However, as long as every sequence within every production starts with a terminal, we can implement a <i>backtracking recursive descent</i> parsing algorithm to parse token sequences in the language.
]]></fact>
<example required="true" id="5917f6f5de7c46079190f4b6c4961ca3">
<text hooks="math"><![CDATA[
In the previous example, the grammar only allowed prefix logical operators. Suppose we wanted to parse token sequences for a grammar with infix operators.
\begin{eqnarray}
<i>program</i> & ::= & <b>true</b> | <b>false</b> | <b>not (</b> <i>program</i> <b>)</b> \\
& | & <b>(</b> <i>program</i> <b>and</b> <i>program</i> <b>)</b> | <b>(</b> <i>program</i> <b>or</b> <i>program</i> <b>)</b>
\end{eqnarray}
It is no longer possible to implement a predictive recursive descent parser. We must instead employ backtracking, and we must also keep track of whether we have consumed all the tokens.
]]></text>
<code class="py"><![CDATA[
def parse(tmp, top = True):
    tokens = tmp[0:]
    if tokens[0] == 'true':
        tokens = tokens[1:]
        if not top or len(tokens) == 0:
            return ('True', tokens)
    tokens = tmp[0:]
    if tokens[0] == 'false':
        tokens = tokens[1:]
        if not top or len(tokens) == 0:
            return ('False', tokens)
    tokens = tmp[0:]
    if tokens[0] == 'not' and tokens[1] == '(':
        tokens = tokens[2:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == ')':
                tokens = tokens[1:]
                if not top or len(tokens) == 0:
                    return ({'Not':[e1]}, tokens)
    tokens = tmp[0:]
    if tokens[0] == '(':
        tokens = tokens[1:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == 'or':
                tokens = tokens[1:]
                r = parse(tokens, False)
                if not r is None:
                    (e2, tokens) = r
                    if tokens[0] == ')':
                        tokens = tokens[1:]
                        if not top or len(tokens) == 0:
                            return ({'Or':[e1,e2]}, tokens)
    tokens = tmp[0:]
    if tokens[0] == '(':
        tokens = tokens[1:]
        r = parse(tokens, False)
        if not r is None:
            (e1, tokens) = r
            if tokens[0] == 'and':
                tokens = tokens[1:]
                r = parse(tokens, False)
                if not r is None:
                    (e2, tokens) = r
                    if tokens[0] == ')':
                        tokens = tokens[1:]
                        if not top or len(tokens) == 0:
                            return ({'And':[e1,e2]}, tokens)
]]></code>
<text><![CDATA[
The above code is far too repetitive. However, we can take the parts that repeat and turn them into a loop body that loops over all the possible choices in the production.
]]></text>
<code class="py"><![CDATA[
def parse(tmp, top = True):
    seqs = [\
        ('True', ['true']), \
        ('False', ['false']), \
        ('Not', ['not', '(', parse, ')']), \
        ('And', ['(', parse, 'and', parse, ')']), \
        ('Or', ['(', parse, 'or', parse, ')']) \
      ]
    # Try each choice sequence.
    for (label, seq) in seqs:
        tokens = tmp[0:]
        ss = [] # To store matched terminals.
        es = [] # To collect parse trees from recursive calls.
        # Walk through the sequence and either
        # match terminals to tokens or make
        # recursive calls depending on whether
        # the sequence entry is a terminal or
        # parsing function.
        for x in seq:
            if type(x) == type(""): # Terminal.
                if tokens[0] == x: # Does terminal match token?
                    tokens = tokens[1:]
                    ss = ss + [x]
                else:
                    break # Terminal did not match token.
            else: # Parsing function.
                # Call parsing function recursively
                r = x(tokens, False)
                if not r is None:
                    (e, tokens) = r
                    es = es + [e]
        # Check that we got either a matched token
        # or a parse tree for each sequence entry.
        if len(ss) + len(es) == len(seq):
            if not top or len(tokens) == 0:
                return ({label:es} if len(es) > 0 else label, tokens)
]]></code>
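<text><![CDATA[
To see the backtracking behavior on a concrete input, the following sketch restates the table-driven parser with two additions that are assumptions of this sketch rather than part of the code above: it checks that tokens remain before indexing into them, and it abandons a choice as soon as a recursive call fails. The token lists are written out by hand because the earlier tokenizer does not need to be repeated here.
]]></text>
<code class="py"><![CDATA[
```python
def parse(tmp, top=True):
    # Same table of choices as in the loop-based parser above.
    seqs = [
        ('True', ['true']),
        ('False', ['false']),
        ('Not', ['not', '(', parse, ')']),
        ('And', ['(', parse, 'and', parse, ')']),
        ('Or', ['(', parse, 'or', parse, ')'])
    ]
    for (label, seq) in seqs:
        tokens = tmp[0:]   # Backtrack: restart from the original tokens.
        es = []            # Parse trees from recursive calls.
        matched = 0        # Number of sequence entries matched so far.
        for x in seq:
            if isinstance(x, str):
                # Terminal: stop if tokens ran out or do not match.
                if len(tokens) == 0 or tokens[0] != x:
                    break
                tokens = tokens[1:]
            else:
                # Parsing function: stop if the recursive call fails.
                r = x(tokens, False)
                if r is None:
                    break
                (e, tokens) = r
                es = es + [e]
            matched = matched + 1
        if matched == len(seq):
            if not top or len(tokens) == 0:
                return ({label: es} if len(es) > 0 else label, tokens)
    return None

# The 'And' choice is tried first and abandoned at the token 'or';
# the parser then backtracks and succeeds with the 'Or' choice.
print(parse("( true or ( false and true ) )".split()))
print(parse(['(', 'true', 'or']))
```
]]></code>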
</example>
</subsection>
<subsection title="More parsing examples and building parsers for other classes of grammar">
<exercise required="true" id="7c9969979aa04b03a4fbecdaca21880e">
<text hooks="math"><![CDATA[
Consider the following grammar definition:
\begin{eqnarray}
<i>command</i> & ::= & <b>start</b> \\
& | & <b>suspend</b> \\
& | & <b>wake</b> \\
& | & <b>terminate</b> \\
& | & <b>reboot</b> \\
& | & <b>if</b> <i>condition</i> <b>then</b> <i>command</i> \\
& | & <b>if</b> <i>condition</i> <b>then</b> <i>command</i> <b>else</b> <i>command</i> \\
& | & <b>repeat</b> <i>command</i> \\
& | & <b>while</b> <i>condition</i> <b>then</b> <i>command</i> \\
<i>condition</i> & ::= & <b>power low</b> \\
& | & <b>temperature high</b> \\
& | & <b>temperature very low</b> \\
& | & <b>user input</b>
\end{eqnarray}
Suppose we want to implement a parser for the above grammar. A partial implementation of a parser for the above grammar (what has been presented in lecture so far) is provided below.
]]></text>
<code class="py"><![CDATA[
def command(tokens, top = True):
    seqs = {\
        ("Start", "start"),\
        ("Suspend", "suspend"),\
        ("Wake", "wake"),\
        ("Terminate", "terminate"),\
        ("Reboot", "reboot")\
      }
    # The remaining tokens are kept in a separate variable (rest) so
    # that a failed choice does not affect the choices tried after it.
    for (key, value) in seqs:
        if tokens[0] == value:
            rest = tokens[1:]
            if not top or len(rest) == 0:
                return (key, rest)
    if tokens[0] == 'repeat':
        r = command(tokens[1:], False)
        if not r is None:
            (e, rest) = r
            if not top or len(rest) == 0:
                return ({"Repeat": [e]}, rest)
    if tokens[0] == 'while':
        r = condition(tokens[1:], False)
        if not r is None:
            (e1, rest) = r
            if rest[0] == 'then':
                r = command(rest[1:], False)
                if not r is None:
                    (e2, rest) = r
                    if not top or len(rest) == 0:
                        return ({"While": [e1,e2]}, rest)
def condition(tokens, top = True):
    seqs = [\
        ("PowerLow", ["power", "low"]),\
        ("TempHigh", ["temperature", "high"]),\
        ("UserInput", ["user", "input"])\
      ]
    for (key, seq) in seqs:
        if tokens[0] == seq[0]:
            if tokens[1] == seq[1]:
                rest = tokens[2:]
                if not top or len(rest) == 0:
                    return (key, rest)
]]></code>
<text><![CDATA[
It is possible to replace the <code>condition()</code> function with a generic parser for base cases that contain an arbitrary number of terminals in each sequence.
Any call to <code>condition(...)</code> can then be replaced with a call to <code>parseBaseCases(seqsCondition, ...)</code>.
]]></text>
<code class="py"><![CDATA[
seqsCondition = [\
    ("PowerLow", ["power", "low"]),\
    ("TempHigh", ["temperature", "high"]),\
    ("TempVeryLow", ["temperature", "very", "low"]),\
    ("UserInput", ["user", "input"])\
  ]
def parseBaseCases(seqs, tokens, top = True):
    for (key, seq) in seqs:
        # Check if token sequence matches sequence.
        i = 0
        for terminal in seq:
            # Stop at a mismatch or if the tokens run out.
            if i < len(tokens) and terminal == tokens[i]:
                pass
            else:
                break
            i = i + 1
        # Check if the previous loop succeeded.
        if i == len(seq):
            rest = tokens[len(seq):]
            if not top or len(rest) == 0:
                return (key, rest)
]]></code>
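<text><![CDATA[
The same idea can be expressed more compactly by comparing a slice of the token sequence against each sequence of terminals. The sketch below is one possible variant, not the version presented in lecture; the names <code>parse_base</code>, <code>CONDITIONS</code>, and <code>BASE_COMMANDS</code> are hypothetical. Slicing past the end of a Python list yields a shorter list rather than an error, so no explicit bounds check is needed.
]]></text>
<code class="py"><![CDATA[
```python
CONDITIONS = [
    ("PowerLow", ["power", "low"]),
    ("TempHigh", ["temperature", "high"]),
    ("TempVeryLow", ["temperature", "very", "low"]),
    ("UserInput", ["user", "input"])
]

BASE_COMMANDS = [
    ("Start", ["start"]),
    ("Suspend", ["suspend"]),
    ("Wake", ["wake"]),
    ("Terminate", ["terminate"]),
    ("Reboot", ["reboot"])
]

def parse_base(seqs, tokens, top=True):
    for (key, seq) in seqs:
        # Compare the first len(seq) tokens with the terminals.
        if tokens[:len(seq)] == seq:
            rest = tokens[len(seq):]
            if not top or len(rest) == 0:
                return (key, rest)
    return None

print(parse_base(CONDITIONS, ["temperature", "very", "low"]))
print(parse_base(CONDITIONS, ["power", "low", "then", "start"], False))
print(parse_base(BASE_COMMANDS, ["reboot"]))
```
]]></code>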
</exercise>
<example required="true" id="b1e8bae08f134922b2bdc85fd206cbe4">
<text hooks="math"><![CDATA[
Consider the following grammar definition:
\begin{eqnarray}
<i>formula</i> & ::= & <b>true</b> \\
& | & <b>false</b> \\
& | & <b>not</b> <i>formula</i> \\
& | & <b>(</b> <i>formula</i> <b>)</b> \\
& | & <i>formula</i> <b>and</b> <i>formula</i> \\
& | & <i>formula</i> <b>or</b> <i>formula</i>
\end{eqnarray}
Implementing a naive recursive descent parser, predictive or backtracking, would not work for this grammar. Consider what would happen if we ran the following code on an input for which none of the earlier cases returns a result:
]]></text>
<code class="py"><![CDATA[
def parse(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])
    if tokens[0] == 'false':
        return ('False', tokens[1:])
    # ...
    # Recursive call, but no tokens consumed.
    (e1, tokens) = parse(tokens)
    if tokens[0] == 'and':
        (e2, tokens) = parse(tokens[1:])
        return ({'And':[e1,e2]}, tokens)
    # ...
]]></code>
<text hooks="math"><![CDATA[
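Whenever execution reaches the recursive call that consumes no tokens, the above code never makes progress: each call immediately calls <code>parse</code> again on the same token sequence. After a large number of recursive calls are made, the Python interpreter raises a <code>RecursionError</code> indicating that the maximum recursion depth has been exceeded.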
To get around this problem, one option is to perform <i>left-recursion elimination</i> on the grammar so that a recursive call never occurs first for any of the choices.
\begin{eqnarray}
<i>formula</i> & ::= & <i>left</i> <b>and</b> <i>formula</i> \\
& | & <i>left</i> <b>or</b> <i>formula</i> \\
& | & <i>left</i> \\
<i>left</i> & ::= & <b>true</b> \\
& | & <b>false</b> \\
& | & <b>not</b> <i>formula</i> \\
& | & <b>(</b> <i>formula</i> <b>)</b>
\end{eqnarray}
The above is usually acceptable if the operator, such as <b>and</b>, is associative or right-associative. However, if the operator is left-associative, the above strategy would not necessarily lead to a correct parse tree. Can you explain why? In such a scenario, other techniques would need to be employed. If the operator is indeed associative or right-associative, however, the parser implementation could then look something like the following:
]]></text>
<code class="py"><![CDATA[
def formula(tmp):
    tokens = tmp[0:]
    (e1, tokens) = left(tokens)
    if tokens[0] == 'and':
        (e2, tokens) = formula(tokens[1:])
        # Return the tokens that remain after the right-hand operand.
        return ({'And':[e1,e2]}, tokens)
    # ...
def left(tokens):
    if tokens[0] == 'true':
        return ('True', tokens[1:])
    if tokens[0] == 'false':
        return ('False', tokens[1:])
    # ...
]]></code>
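<text><![CDATA[
The associativity caveat can be seen concretely in the shape of the resulting trees. The sketch below restricts the language to <b>true</b>, <b>false</b>, and <b>and</b> to stay short. <code>formula_right</code> follows the grammar after left-recursion elimination and nests to the right; <code>formula_left</code> is one possible alternative technique (a loop that folds operands into a left-nested tree). Both function names are hypothetical, and the loop is offered only as an example of the kind of technique a left-associative operator might call for.
]]></text>
<code class="py"><![CDATA[
```python
def left(tokens):
    # Base cases only, to keep the sketch short.
    if len(tokens) > 0 and tokens[0] == 'true':
        return ('True', tokens[1:])
    if len(tokens) > 0 and tokens[0] == 'false':
        return ('False', tokens[1:])
    return None

def formula_right(tokens):
    # Mirrors the grammar after left-recursion elimination:
    # the right operand is itself a formula, so the tree nests right.
    r = left(tokens)
    if r is None:
        return None
    (e1, tokens) = r
    if len(tokens) > 0 and tokens[0] == 'and':
        r = formula_right(tokens[1:])
        if r is None:
            return None
        (e2, tokens) = r
        return ({'And': [e1, e2]}, tokens)
    return (e1, tokens)

def formula_left(tokens):
    # Alternative: parse operands in a loop and fold each new
    # operand into a tree that nests to the left.
    r = left(tokens)
    if r is None:
        return None
    (e, tokens) = r
    while len(tokens) > 0 and tokens[0] == 'and':
        r = left(tokens[1:])
        if r is None:
            return None
        (e2, tokens) = r
        e = {'And': [e, e2]}
    return (e, tokens)

ts = ['true', 'and', 'false', 'and', 'true']
print(formula_right(ts)[0])
print(formula_left(ts)[0])
```
]]></code>
<text><![CDATA[
For an associative operator such as <b>and</b>, both trees have the same meaning. For an operator such as subtraction, only the left-nested tree would match the usual convention.
]]></text>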
<text hooks="math"><![CDATA[
Note that performing left-recursion elimination does <i>not</i> necessarily change the definition of the concrete syntax, the definition of the abstract syntax of a language (i.e., the set of parse trees), or the meaning of a language. Left-recursion elimination is a strategy for converting a grammar definition into a definition that is easier to implement using a recursive descent parser; it is an implementation strategy. Thus, the resulting parse trees should contain no record or indication that left-recursion elimination was performed on the grammar before the parser was implemented.
]]></text>
<paragraph hooks="math"><![CDATA[
Also note that the resulting parser implementation <code>formula()</code> uses backtracking because the <i>formula</i> production rule has three choices that all begin with the same non-terminal <i>left</i>. It would be possible to perform left factoring on the <i>formula</i> production rule as follows, which would then make it possible to implement <code>formula()</code> as a predictive parser:
\begin{eqnarray}
<i>formula</i> & ::= & <i>left</i> <i>rest</i> \\
<i>rest</i> & ::= & <b>and</b> <i>formula</i> \\
& | & <b>or</b> <i>formula</i> \\
& | &
\end{eqnarray}
]]></paragraph>
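<text><![CDATA[
A predictive parser for the left-factored grammar might look like the following sketch. The function names mirror the nonterminals; passing the already-parsed left operand into <code>rest</code> and raising <code>ValueError</code> on malformed input are design choices assumed for this sketch. Note that parentheses leave no trace in the resulting parse tree.
]]></text>
<code class="py"><![CDATA[
```python
def formula(tokens):
    # formula ::= left rest
    (e1, tokens) = left(tokens)
    return rest(e1, tokens)

def rest(e1, tokens):
    # rest ::= and formula | or formula | (empty)
    # The next token alone decides which choice applies.
    if len(tokens) > 0 and tokens[0] in ('and', 'or'):
        label = tokens[0].capitalize()  # 'and' -> 'And', 'or' -> 'Or'
        (e2, tokens) = formula(tokens[1:])
        return ({label: [e1, e2]}, tokens)
    return (e1, tokens)

def left(tokens):
    # left ::= true | false | not formula | ( formula )
    if len(tokens) == 0:
        raise ValueError("unexpected end of input")
    if tokens[0] == 'true':
        return ('True', tokens[1:])
    if tokens[0] == 'false':
        return ('False', tokens[1:])
    if tokens[0] == 'not':
        (e, tokens) = formula(tokens[1:])
        return ({'Not': [e]}, tokens)
    if tokens[0] == '(':
        (e, tokens) = formula(tokens[1:])
        if len(tokens) > 0 and tokens[0] == ')':
            return (e, tokens[1:])
        raise ValueError("expected ')'")
    raise ValueError("unexpected token: " + tokens[0])

print(formula(['true', 'or', 'false', 'and', 'true']))
print(formula(['(', 'true', 'or', 'false', ')', 'and', 'true']))
```
]]></code>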
</example>
</subsection>
<!--include sheaf="assignment-one.xml"/-->
</section>
<section title="Semantics, Evaluation, and Interpretation">
<subsection title="Formally defining an abstract syntax">
<text><![CDATA[
While the abstract syntax of a programming language is the set of data structure instances that represent programs, it is also useful to model the abstract syntax as a mathematical object in its own right. This makes it possible to define formally (i.e., mathematically, independently of any implementation language, platform, operating system, and so on) the meaning of a language, and how programs can be run. It also makes it possible to formally define analyses on programs, as well as properties of transformations over programs.
]]></text>
<paragraph><![CDATA[
Typically, an abstract syntax definition will closely match a concrete syntax definition, except that there is no need to specify the token set, and redundant syntactic constructs and syntactic sugar will be eliminated. For example, parentheses are a syntactic convention that is not necessary if one is working with parse trees because parse trees are already grouped implicitly due to the tree structure of the abstract syntax instance.
]]></paragraph>
<example required="true" id="7bf280c88ee14a299c3490762a33e4dd">
<text hooks="math"><![CDATA[
The following is an example of an abstract syntax definition.
\begin{eqnarray}
<i>formula</i> & ::= & <b>true</b> | <b>false</b> | <b>not</b> <i>formula</i> | <i>formula</i> <b>and</b> <i>formula</i> | <i>formula</i> <b>or</b> <i>formula</i>
\end{eqnarray}
Notice the omission of the parentheses. Also, there is no need to be concerned with the fixity (i.e., infix vs. prefix) of binary operators, since this definition is not being used to implement a parsing algorithm.
]]></text>
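<text><![CDATA[
If abstract syntax instances are represented with the nested-dictionary convention used in the parsing examples, membership in this abstract syntax can be checked mechanically. The sketch below is one way to do so; the function name <code>is_formula</code> and the arity table are assumptions made for illustration.
]]></text>
<code class="py"><![CDATA[
```python
def is_formula(tree):
    # true | false
    if tree == 'True' or tree == 'False':
        return True
    # Every other instance is a one-entry dictionary that maps
    # a node label to a list of subformulas.
    if not isinstance(tree, dict) or len(tree) != 1:
        return False
    (label, children) = list(tree.items())[0]
    if not isinstance(children, list):
        return False
    # not takes one subformula; and, or take two.
    arity = {'Not': 1, 'And': 2, 'Or': 2}
    if label not in arity or len(children) != arity[label]:
        return False
    return all(is_formula(c) for c in children)

print(is_formula({'And': ['True', {'Not': ['False']}]}))
print(is_formula({'And': ['True']}))
```
]]></code>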
</example>
</subsection>
<subsection title="Denotational semantics and operational semantics">
<text><![CDATA[
The abstract syntax of a programming language is a set of symbolic objects (i.e., the abstract syntax instances, such as programs) that have no meaning unless a meaning is assigned to them. There are two ways in which we can assign meaning to these objects. We can choose to assign a mathematical object to each abstract syntax instance, or we can define a collection of deterministic transformations that specify how we can convert each abstract syntax instance into another abstract syntax instance. Roughly speaking, assigning a mathematical object to each program tells us what it <i>means</i>, while specifying symbolic conversion rules tells us how to <i>run</i> the program.
]]></text>
<definition required="true" hooks="math" id="41aa3d27c2754fa880cb8c2280322f63"><![CDATA[
The <i>denotational semantics</i> of an abstract syntax is a mapping from the set of abstract syntax instances %A to some mathematical set of objects %D, which is often called a <i>semantic domain</i> or just <i>domain</i>. The mapping from %A to %D itself is often denoted using the circumfix Oxford double bracket notation [[ ... ]], and the definition of a denotational semantics of %A (i.e., the definition of this mapping [[ ... ]]) is often specified using a collection of inference rules.
]]></definition>
<definition required="true" hooks="math" id="63092f7e13d44b0cac8bde679e6f2c6d"><![CDATA[
Let %A be an abstract syntax of a programming language. The <i>operational semantics</i> of an abstract syntax is a set of rules that specify how each abstract syntax instance %a \in %A can be transformed into some kind of object that represents the result of performing the computation described by %a.
]]></definition>
<text><![CDATA[
There are distinct kinds of operational semantics, such as <i>small-step semantics</i> and <i>big-step semantics</i> (also known as <i>natural semantics</i>). In this course, the operational semantics we will be using is closest to big-step semantics, with some simplifications. We adopt this particular approach to defining operational semantics because it corresponds more closely to a functional, recursive implementation of an algorithm for interpreting programs.
<br/><br/>
The operational semantics for a programming language represents a <i>contract</i>, a set of <i>constraints</i>, or a set of <i>requirements</i> that an algorithm implementing an interpreter or compiler of that language must respect in order to be considered correct. However, whoever builds an implementation of an interpreter or compiler for a language has full freedom and flexibility in how they choose to implement the interpreter in all other aspects as long as it conforms to the operational semantics. This is what makes it possible to introduce optimizations into interpreters and compilers (such as optimizations to improve performance or reduce use of memory) while preserving the correctness of their behavior. The operational semantics for a programming language is defined using a collection of <i>inference rules</i>.
]]></text>
<definition required="true" id="9af40ca04b2f44c795bbb02e7bed83ef">
<text hooks="math"><![CDATA[
An <i>inference rule</i> is a notation used within mathematics and computer science to define relationships between mathematical facts and formulas. Each inference rule consists of a horizontal line with zero or more logical formulas above the line and one logical formula below the line. The logical formulas above the line are called <i>premises</i>, and the formula below the line is called the <i>conclusion</i>.
]]></text>
<inferences hooks="math">
<inference title="Name-of-Inference-Rule">
<premises><![CDATA[<i>premise</i> %~ %~ <i>premise</i>]]></premises>
<conclusion><![CDATA[<i>conclusion</i>]]></conclusion>
</inference>
<inference title="Example">
<premises><![CDATA[<b>sun is out</b> %~ %~ <b>sky is clear</b>]]></premises>
<conclusion><![CDATA[<b>it is not raining</b>]]></conclusion>
</inference>
</inferences>
<text hooks="math"><![CDATA[
An inference rule can be interpreted as a portion of a larger algorithm. The premises specify the recursive calls, or calls to other functions, that may need to be made, and the results that are obtained from those invocations. The conclusion specifies what inputs can be handled by that inference rule, and what outputs should be returned given those inputs and the premises.
]]></text>
<inferences hooks="math">
<inference title="Algorithm-Case">
<premises><![CDATA[<i>input_1</i> \Downarrow <i>output_1</i> %~ %~ <i>input_2</i> \Downarrow <i>output_2</i>]]></premises>
<conclusion><![CDATA[<i>input_0</i> \Downarrow <i>output_0</i>]]></conclusion>
</inference>
</inferences>
<text hooks="math"><![CDATA[
Note that in the above, <i>input_1</i> and <i>input_2</i> may depend on <i>input_0</i>, and <i>output_0</i> may depend on <i>output_1</i> and <i>output_2</i>. In other words, one could rewrite an inference rule in the following way using natural language:
]]></text>
<inferences hooks="math">
<inference title="Algorithm-Case">
<premises><![CDATA[<b>invoking this or another algorithm with</b> <i>input_1</i> <b>yields</b> <i>output_1</i>]]></premises>
<conclusion><![CDATA[<b>given</b> <i>input_0</i>, <b> and if premises above are true, then output</b> <i>output_0</i>]]></conclusion>
</inference>
</inferences>
</definition>
</subsection>
<subsection title="Evaluation of expressions">
<text><![CDATA[
The abstract syntax, or a subset of the abstract syntax, of a programming language is considered to be a set of <i>expressions</i> if the language's operational semantics do not impose any restrictions on the <i>order</i> in which a computation can operate on the expression to produce a result, called a <i>value</i>. This is possible because expressions usually represent operations with no <i>side effects</i>, such as emitting output to a screen, reading or writing files, looking at a clock, controlling a device, and so on.
]]></text>
<definition required="true" hooks="math" id="b16e887dc1bc430da525e124a5f603b1"><![CDATA[
Let %A be an abstract syntax of a programming language, and let %V be some subset of %A that we will call the <i>value set</i>. This set will represent the possible meanings of parse trees in %A, and it will represent the possible results of evaluating parse trees in %A. Values that can occur directly within abstract syntax trees of the language (e.g., numeric and string literals, constructors, and so on) are usually called <i>constants</i>.
]]></definition>
<definition required="true" hooks="math" id="53f0258543cc4f4592e7dc534bdfff60"><![CDATA[
An <i>evaluation algorithm</i> converts any abstract syntax tree that represents an expression into an abstract syntax tree that represents a value.
<div class="diagram">
<table class="container">
<tr>
<td class="box" style="background-color:powderblue;">expressions<br/>(abstract syntax)</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:lightyellow;">evaluation<br/>algorithm</td>
<td><span style="font-size:20px;">⇒</span></td>
<td class="box" style="background-color:powderblue;">values<br/>(abstract syntax)</td>
</tr>
</table>
</div>
]]></definition>
<example required="true" id="634d9dc447034b4abc749a6713b63c19">
<text hooks="math"><![CDATA[
Define the abstract syntax according to the following grammar, with %A consisting of all formula abstract syntax instances, and %V = {<b>true</b>, <b>false</b>} consisting of all value abstract syntax instances:
\begin{eqnarray}
<i>formula</i> & ::= & <i>value</i> | <b>not</b> <i>formula</i> | <i>formula</i> <b>and</b> <i>formula</i> | <i>formula</i> <b>or</b> <i>formula</i> \\
<i>value</i> & ::= & <b>true</b> | <b>false</b>
\end{eqnarray}
The following is a definition of an operational semantics for this language.
]]></text>
<inferences hooks="math">
<inference title="True">
<premises><![CDATA[]]></premises>
<conclusion><![CDATA[<b>true</b> \Downarrow <b>true</b>]]></conclusion>
</inference>
<inference title="False">
<premises><![CDATA[]]></premises>
<conclusion><![CDATA[<b>false</b> \Downarrow <b>false</b>]]></conclusion>
</inference>
<inference title="Not-True">
<premises><![CDATA[%f \Downarrow <b>true</b>]]></premises>
<conclusion><![CDATA[<b>not</b> %f \Downarrow <b>false</b>]]></conclusion>
</inference>
<inference title="Not-False">
<premises><![CDATA[%f \Downarrow <b>false</b>]]></premises>
<conclusion><![CDATA[<b>not</b> %f \Downarrow <b>true</b>]]></conclusion>
</inference>
<inference title="And-True-True">
<premises><![CDATA[%f_1 \Downarrow <b>true</b> %~ %~ %f_2 \Downarrow <b>true</b>]]></premises>
<conclusion><![CDATA[%f_1 <b>and</b> %f_2 \Downarrow <b>true</b>]]></conclusion>
</inference>
<inference title="And-True-False">
<premises><![CDATA[%f_1 \Downarrow <b>true</b> %~ %~ %f_2 \Downarrow <b>false</b>]]></premises>
<conclusion><![CDATA[%f_1 <b>and</b> %f_2 \Downarrow <b>false</b>]]></conclusion>
</inference>
<inference title="And-False-True">
<premises><![CDATA[%f_1 \Downarrow <b>false</b> %~ %~ %f_2 \Downarrow <b>true</b>]]></premises>
<conclusion><![CDATA[%f_1 <b>and</b> %f_2 \Downarrow <b>false</b>]]></conclusion>
</inference>
<inference title="And-False-False">
<premises><![CDATA[%f_1 \Downarrow <b>false</b> %~ %~ %f_2 \Downarrow <b>false</b>]]></premises>
<conclusion><![CDATA[%f_1 <b>and</b> %f_2 \Downarrow <b>false</b>]]></conclusion>
</inference>
<inference title="Or-True-True">
<premises><![CDATA[%f_1 \Downarrow <b>true</b> %~ %~ %f_2 \Downarrow <b>true</b>]]></premises>
<conclusion><![CDATA[%f_1 <b>or</b> %f_2 \Downarrow <b>true</b>]]></conclusion>
</inference>
<inference title="Or-True-False">
<premises><![CDATA[%f_1 \Downarrow <b>true</b> %~ %~ %f_2 \Downarrow <b>false</b>]]></premises>
<conclusion><![CDATA[%f_1 <b>or</b> %f_2 \Downarrow <b>true</b>]]></conclusion>
</inference>
<inference title="Or-False-True">
<premises><![CDATA[%f_1 \Downarrow <b>false</b> %~ %~ %f_2 \Downarrow <b>true</b>]]></premises>
<conclusion><![CDATA[%f_1 <b>or</b> %f_2 \Downarrow <b>true</b>]]></conclusion>
</inference>
<inference title="Or-False-False">
<premises><![CDATA[%f_1 \Downarrow <b>false</b> %~ %~ %f_2 \Downarrow <b>false</b>]]></premises>
<conclusion><![CDATA[%f_1 <b>or</b> %f_2 \Downarrow <b>false</b>]]></conclusion>
</inference>
</inferences>
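<text><![CDATA[
Reading each inference rule as one case of a recursive algorithm, the rules above can be sketched as the following evaluation function. It assumes the nested-dictionary parse tree representation from the parsing examples, with the strings <code>'True'</code> and <code>'False'</code> standing for the values; the function name <code>evaluate</code> is hypothetical. Each premise becomes a recursive call, and each conclusion becomes a returned value.
]]></text>
<code class="py"><![CDATA[
```python
def evaluate(f):
    # [True] and [False]: values evaluate to themselves.
    if f == 'True' or f == 'False':
        return f
    for (label, children) in f.items():
        # Premises: evaluate each subformula with a recursive call.
        vs = [evaluate(c) for c in children]
        if label == 'Not':
            # [Not-True] and [Not-False]
            return 'False' if vs[0] == 'True' else 'True'
        if label == 'And':
            # The four And rules: true only when both premises yield true.
            return 'True' if vs == ['True', 'True'] else 'False'
        if label == 'Or':
            # The four Or rules: false only when both premises yield false.
            return 'False' if vs == ['False', 'False'] else 'True'

tree = {'And': [
    {'Or': [{'And': ['True', 'False']}, {'Not': ['False']}]},
    {'Or': ['True', 'False']}
]}
print(evaluate(tree))
```
]]></code>
<text><![CDATA[
Because both subformulas are evaluated before the operator is applied, the sketch follows the rules literally; an implementation would be free to skip the second subformula when the first already determines the result, as long as the outcome agrees with the rules.
]]></text>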
</example>
<example required="true" id="60b6dc0bb16b498594677f3bb08ccc82">
<text hooks="math"><![CDATA[
The rules in the above example are numerous, and having this many rules in a definition becomes impractical (especially with more complex operators and language constructs). To address this, we can define a meta-language on the set of values %V = {<b>true</b>, <b>false</b>}.
\begin{eqnarray}
\neg <b>true</b> & = & <b>false</b>\\
\neg <b>false</b> & = & <b>true</b>\\
<b>true</b> \wedge <b>true</b> & = & <b>true</b>\\
<b>true</b> \wedge <b>false</b> & = & <b>false</b>\\
<b>false</b> \wedge <b>true</b> & = & <b>false</b>\\