EiffelParse Tutorial
OVERVIEW
Parsing is the task of analyzing the structure of documents such as programs, specifications or other structured texts.
Many systems need to parse documents. The best-known examples are compilers, interpreters and other software development tools; but as soon as a system provides its users with a command language, or processes input data with a non-trivial structure, it will need parsing facilities.
This chapter describes the EiffelParse library, which you can use to process documents of many different types. It provides a simple and flexible parsing scheme, resulting from the full application of object-oriented principles.
Because it concentrates on the higher-level structure, the EiffelParse library requires auxiliary mechanisms for identifying a document's lexical components: words, numbers and other such elementary units. To address this need it is recommended, although not required, to complement EiffelParse with the EiffelLex library studied in the previous chapter.
Figure 1 shows the inheritance structure of the classes discussed in this chapter.
Figure 1: Parse class structure
WHY USE THE EIFFELPARSE LIBRARY
Let us fist look at the circumstances under which you may want - or not want - to use the EiffelParse library.
The EiffelParse library vs. parser generators
Parsing is a heavily researched area of computing science and many tools are available to generate parsers. In particular, the popular Yacc tool, originally developed for Unix, is widely used to produce parsers.
In some cases Yacc or similar tools are perfectly adequate. It is also sometimes desirable to write a special-purpose parser for a language, not relying on any parser generator. Several circumstances may, however, make the Parse library attractive:
- The need to interface the parsing tasks with the rest of an object-oriented system (such as a compiler or more generally a "document processor" as defined below) in the simplest and most convenient way.
- The desire to apply object-oriented principles as fully as possible to all aspects of a system, including parsing, so as to gain the method's many benefits, in particular reliability, reusability and extendibility.
- The need to tackle languages whose structure is not easily reconciled with the demands of common parser generator, which usually require the grammar to be LALR (1). (The EiffelParse library uses a more tolerant LL scheme, whose only significant constraint is absence of left-recursivity; the library provides mechanisms to detect this condition, which is easy to correct.)
- The need to define several possible semantic treatments on the same syntactic structure.
The last reason may be the most significant practical argument in favor of using EiffelParse. Particularly relevant is the frequent case of a software development environment in which a variety of tools all work on the same basic syntactic structure. For example an environment supporting a programming language such as Pascal or Eiffel may include a compiler, an interpreter, a pretty-printer, software documentation tools (such as Eiffel's short and flat-short facilities), browsing tools and several other mechanisms that all need to perform semantic actions on software texts that have the same syntactic structure. With common parser generators such as Yacc, the descriptions of syntactic structure and semantic processing are inextricably mixed, so that you normally need one new specification for each tool. This makes design, evolution and reuse of specifications difficult and error-prone.
In contrast, the EiffelParse library promotes a specification style whereby syntax and semantics are kept separate, and uses inheritance to allow many different semantic descriptions to rely on the same syntactic stem. This will make EiffelParse particularly appropriate in such cases.
A word of caution
At the time of publication the EiffelParse library has not reached the same degree of maturity as the other libraries presented in this book. It should thus be used with some care. You will find at the end of this chapter a few comments about the work needed to bring the library to its full realization.
AIMS AND SCOPE OF THE EIFFELPARSE LIBRARY
To understand the EiffelParse library it is necessary to appreciate the role of parsing and its place in the more general task of processing documents of various kinds.
Basic terminology
First, some elementary conventions. The word document will denote the texts to be parsed. The software systems which perform parsing as part of their processing will be called document processors.
Typical document processors are compilers, interpreters, program checkers, specification analyzers and documentation tools. For example the EiffelStudio environment contains a number of document processors, used for compiling, documentation and browsing.
Parsing, grammars and semantics
Parsing is seldom an end in itself; rather, it serves as an intermediate step for document processors which perform various other actions.
Parsing takes care of one of the basic tasks of a document processor: reconstructing the logical organization of a document, which must conform to a certain syntax (or structure), defined by a grammar.
Once parsing has reconstructed the structure of a document, the document processor will perform various operations on the basis of that structure. For example a compiler will generate target code corresponding to the original text; a command language interpreter will execute the operations requested in the commands; and a documentation tool such as the short and flat-short commands for Eiffel will produce some information on the parsed document. Such operations are called semantic actions. One of the principal requirements on a good parsing mechanism is that it should make it easy to graft semantics onto syntax, by adding semantic actions of many possible kinds to the grammar.
The EiffelParse library provides predefined classes which handle the parsing aspect automatically and provide the hooks for adding semantic actions in a straightforward way. This enables developers to write full document processors - handling both syntax and semantics - simply and efficiently.
As noted at the beginning of this chapter, it is possible to build a single syntactic base and use it for several processors (such as a compiler and a documentation tool) with semantically different goals, such as compilation and documentation. In the EiffelParse library the semantic hooks take the form of deferred routines, or of routines with default implementations which you may redefine in descendants.
LIBRARY CLASSES
The EiffelParse library contains a small number of classes which cover common document processing applications. The classes, whose inheritance structure was shown at the beginning of this chapter, are:
- CONSTRUCT , describing the general notion of syntactical construct.
- AGGREGATE , describing constructs of the "aggregate" form.
- CHOICE , describing constructs of the "choice" form.
- REPETITION , describing constructs of the "repetition" form.
- TERMINAL , describing "terminal" constructs with no further structure.
- KEYWORD , describing how to handle keywords.
- L_INTERFACE , providing a simple interface with the lexical analysis process and the Lex library.
- INPUT , describing how to handle the input document.
EXAMPLES
The EiffelStudio delivery includes (in the examples/library/parse subdirectory) a simple example using the EiffelParse Library classes. The example is a processor for "documents" which describe computations involving polynomials with variables. The corresponding processor is a system which obtains polynomial specifications and variable values from a user, and computes the corresponding polynomials. This example illustrates the most important mechanisms of the EiffelParse Library and provides a guide for using the facilities described in this chapter. The components of its grammar appear as illustrations in the next sections.
CONSTRUCTS AND PRODUCTIONS
A set of documents possessing common properties, such as the set of all valid Eiffel classes or the set of all valid polynomial descriptions, is called a language.
In addition to its lexical aspects, the description of a language includes both syntactic and semantic properties. The grammar - the syntactic specification - describes the structure of the language (for example how an Eiffel class is organized into a number of clauses); the semantic specification defines the meaning of documents written in the language (for example the run-time properties of instances of the class, and the effect of feature calls).
To discuss the EiffelParse library, it is simpler to consider "language' as a purely syntactic notion; in other words, a language is simply the set of documents conforming to a certain syntactic grammar (taken here to include the supporting lexical grammar). Any semantic aspect will be considered to belong to the province of a specific document processor for the language, although the technique used for specifying the grammar will make it easy to add the specification of the semantics, or several alternative semantic specifications if desired.
This section explains how you may define the syntactic base - the grammar.
Constructs
A grammar consists of a number of constructs, each representing the structure of documents, or document components, called the specimens of the construct. For example, a grammar for Eiffel will contain constructs such as Class, Feature_clause and Instruction. A particular class text is a specimen of construct Class.
Each construct will be defined by a production, which gives the structure of the construct's specimens. For example, a production for Class in an Eiffel grammar should express that a class (a specimen of the Class construct) is made of an optional Indexing part, a Class_header, an optional Formal_generics part and so on. The production for Indexing will indicate that any specimen of this construct - any Indexing part - consists of the keyword indexing followed by zero or more specimens of Index_clause.
Although some notations for syntax descriptions such as BNF allow more than one production per construct, the EiffelParse library relies on the convention that every construct is defined by at most one production. Depending on whether there is indeed such a production, the construct is either non-terminal or terminal:
- A non-terminal construct (so called because it is defined in terms of others) is specified by a production, which may be of one of three types: aggregate, choice and repetition. The construct will accordingly be called an aggregate, choice or repetition construct.
- A terminal construct has no defining production. This means that it must be defined outside of the syntactical grammar. Terminals indeed come from the lexical grammar. Every terminal construct corresponds to a token type (regular expression or keyword) of the lexical grammar, for which the parsing duty will be delegated to lexical mechanisms, assumed in the rest of this chapter to be provided by the Lex library although others may be substituted if appropriate.
All specimens of terminal constructs are instances of class TERMINAL . A special case is that of keyword constructs, which have a single specimen corresponding to a keyword of the language. For example, if
is a keyword of Eiffel. Keywords are described by class KEYWORD , an heir of TERMINAL .
The rest of this section concentrates on the parsing-specific part: non-terminal constructs and productions. Terminals will be studied in the discussion of how to interface parsing with lexical analysis.
Varieties of non-terminal constructs and productions
An aggregate production defines a construct whose specimens are obtained by concatenating ("aggregating") specimens of a list of specified constructs, some of which may be optional. For example, the production for construct Conditional in an Eiffel grammar may read:Conditional [=] if Then_part_list [Else_part] end
This means that a specimen of Conditional (a conditional instruction) is made of the keyword if
, followed by a specimen of Then_part_list, followed by zero or one specimen of Else_part (the square brackets represent an optional component), followed by the keyword end
.
A choice production defines a construct whose specimens are specimens of one among a number of specified constructs. For example, the production for construct Type in an Eiffel grammar may read:Type [=] Class_type | Class_type_expanded | Formal_generic_name | Anchored | Bit_type
This means that a specimen of Type is either a specimen of Class_type, or a specimen of Class_type_expanded etc.
Finally, a repetition production defines a construct whose specimens are sequences of zero or more specimens of a given construct (called the base of the repetition construct), separated from each other by a separator. For example, the production for construct Compound in an Eiffel grammar may read Compound [=] {Instruction ";" ...}
This means that a specimen of Compound is made of zero or more specimens of Instruction, each separated from the next (if any) by a semicolon.
These three mechanisms - aggregate, choice and repetition - suffice to describe the structure of a wide array of practical languages. Properties which cannot be handled by them should be dealt with through semantic actions, as explained below.
An example grammar
The example directory included in the delivery implements a processor for a grammar describing a simple language for expressing polynomials. A typical "document" in this language is the line
x; y: x * (y + 8 - (2 * x))
The beginning of the line, separated from the rest by a colon, is the list of variables used in the polynomial, separated by semicolons. The rest of the line is the expression defining the polynomial.
Using the conventions defined above, the grammar may be written as:Line [=] Variables ":" Sum
Variables [=] {Identifier ";" ...}
Sum [=] {Diff "+" ...}
Diff [=] {Product "-" ...}
Product [=] {Term " * " ...}
Term [=] Simple_var Int_constant Nested
Nested [=] "(" Sum ")"
This grammar assumes a terminal Identifier, which must be defined as a token type in the lexical grammar. The other terminals are keywords, shown as strings appearing in double quotes, for example "+".
PARSING CONCEPTS
The EiffelParse library supports a parsing mechanism based on the concepts of object-oriented software construction.
Class CONSTRUCT
The deferred class CONSTRUCT describes the general notion of construct; instances of this class and its descendants are specimens of the constructs of a grammar.
Deferred though it may be, CONSTRUCT defines some useful general patterns; for example, its procedure process appears as: parse
if parsed then
semantics
end
where procedures
Such descendants, given in the library, are classes AGGREGATE , CHOICE , REPETITION and TERMINAL . They describe the corresponding types of construct, with features providing the operations for parsing their specimens and applying the associated semantic actions.
Building a processor
To build a processor for a given grammar, you write a class, called a construct class, for every construct of the grammar, terminal or non-terminal. The class should inherit from AGGREGATE , CHOICE , REPETITION or TERMINAL depending on the nature of the construct. It describes the production for the construct and any associated semantic actions.
To complete the processor, you must choose a "top construct" for that particular processor, and write a root class. In accordance with the object-oriented method, which implies that "roots" and "tops" should be chosen last, these steps are explained at the end of this chapter.
The next sections explain how to write construct classes, how to handle semantics, and how to interface parsing with the lexical analysis process. All these tasks rely on a fundamental data abstraction, the notion of abstract syntax tree.
Abstract syntax trees
The effect of processing a document with a processor built from a combination of construct classes is to build an abstract syntax tree for that document, and to apply any requested semantic actions to that tree.
The syntax tree is said to be abstract because it only includes important structural information and does not retain the concrete information such as keywords and separators. Such concrete information, sometimes called "syntactic sugar", serves only external purposes but is of no use for semantic processing.
The combination of Eiffel techniques and libraries yields a very simple approach to building and processing abstract syntax trees. Class CONSTRUCT is a descendant of the Data Structure Library class TWO_WAY_TREE , describing a versatile implementation of trees; so, as a consequence, are CONSTRUCT's own descendants. The effect of parsing any specimen of a construct is therefore to create an instance of the corresponding construct class. This instance is (among other things) a tree node, and is automatically inserted at its right place in the abstract syntax tree.
As noted in the discussion of trees, class TWO_WAY_TREE makes no formal distinction between the notions of tree and tree node. So you may identify the abstract syntax tree with the object (instance of CONSTRUCT ) representing the topmost construct specimen in the structure of the document being analyzed.
The production function
A construct class describes the syntax of a given construct through a function called production, which is a direct representation of the corresponding production. This function is declared in CONSTRUCT as production: LINKED_LIST [CONSTRUCT]
-- Right-hand side of the production for the construct
deferred
end
Function production remains deferred in classes AGGREGATE , CHOICE and REPETITION . Every effective construct class that you write must provide an effecting of that function. It is important for the efficiency of the parsing process that every effective version of production be a
Classes AGGREGATE , CHOICE , REPETITION and TERMINAL also have a deferred function construct_name: STRING
-- Symbolic name of the construct
once
Result := "INSTRUCTION"
end
The examples of the next few sections, which explain how to write construct classes, are borrowed from the small "Polynomial" language mentioned above, which may be found in the examples directory in the ISE Eiffel delivery.
PREPARING GRAMMARS
Having studied the EiffelParse library principles, let us see how to write grammar productions for various kinds of construct. The main task is to write the production function for each construct class.
The production function for a descendant of AGGREGATE will describe how to build a specimen of the corresponding function from a sequence of specimens of each of the constituent constructs. Writing this function from the corresponding production is straightforward.
As an example, consider the production function of class LINE for the Polynomial example language. The corresponding production is Line [=] Variables ":" Sum
where Variables and Sum are other constructs, and the colon ":" is a terminal. This means that every specimen of Line consists of a specimen of Variables, followed by a colon, followed by a specimen of Sum.
Here is the corresponding production function as it appears in class LINE: production: LINKED_LIST [CONSTRUCT]
local
var: VARIABLES
sum: SUM
once
create Result.make
Result.forth
create var.make
put (var)
keyword (":")
create sum.make
put (sum)
end
As shown by this example, the production function for an aggregate construct class should declare a local entity (here var
and sum
) for each non-keyword component of the right-hand side, the type of each entity being the corresponding construct class (here VARIABLES and SUM).
The body of the function should begin with create Result.make
Result.forth
to create the object containing the result. Then for each non-keyword component, represented by the local entity component
(this applies to var
and sum
in the example), there should be a sequence of two instructions, of the form create component.make
put (component)
For any keyword of associated string symbol, such as the colon ":" in the example, there should be a call to keyword (symbol)
The order of the various calls to
All components in the above example are required. In the general case an aggregate production may have optional components. To signal that a component component of the right-hand side is optional, include a call of the form component.set_optional
This call may appear anywhere after the corresponding create component
put (symbol)
component.set_optional
Choices
The
As an example, consider the Term [=] Simple_var Poly_integer Nested
where Simple_var, Poly_integer and Nested are other constructs. This means that every specimen of Term consists of one specimen of any one of these three constructs. Here is the corresponding production: LINKED_LIST [CONSTRUCT]
local
id: SIMPLE_VAR
val: POLY_INTEGER
nest: NESTED
once
create Result.make
Result.forth
createid.make
put (id)
create val.make
put (val)
create nest.make
put (nest)
end
As shown by this example, the id
, val
and nest
- for each alternative component of the right-hand side. The type of each entity is the corresponding construct class - here
The body of the function must begin by create Result.make
Result.forth
Then for each alternative component represented by a local entity component (in the example this applies to id
, val
and nest
) there should be two instructions of the form create component.make
put (component)
Caution: The order of the various calls to
Repetitions
The
As an example, consider the construct Variables in the Polynomial example language. The right-hand side of the corresponding production is Variables [=] {Identifier ";" ...}
where Identifier is another construct, and the semicolon ";" is a terminal. This means that every specimen of Variables consists of zero or more specimens of Identifier, separated from each other (if more than one) by semicolons.
Here are the corresponding production: LINKED_LIST [IDENTIFIER]
local
base: VAR
once
create Result.make
Result.forth
create base.make
put (base)
end
separator: STRING = ";"
As shown by this example, function base
, is required; its type must be the class corresponding to the construct serving as the base of the repetition, VAR in the example.
INTERFACE TO LEXICAL ANALYSIS
One more type of construct class remains to be discussed: terminal construct classes. Since terminal constructs serve to elevate lexical tokens (regular expressions and keywords) to the dignity of syntactical construct, we must first take a look at how the EiffelParse library classes collaborate with their counterparts in the EiffelLex library.
The notion of lexical interface class
To parse a document, you need to get tokens from a lexical analyzer. This is achieved by making some construct classes, in particular those describing terminals, descendants of one of the lexical classes.
The best technique is usually to write a class covering the lexical needs of the language at hand, from which all construct classes that have some lexical business will inherit. Such a class is called a lexical interface class.
Lexical interface classes usually follow a common pattern. To take advantage of this uniformity, the EiffelParse library includes a deferred class L_INTERFACE which describes that pattern. Specific lexical interface classes may be written as descendants of L_INTERFACE.
L_INTERFACE is a simple deferred class, with a deferred procedure
Obtaining a lexical analyzer
An effective descendant of L_INTERFACE must define procedure
- You may build the lexical analyzer by defining its regular expressions one by one, using the procedures described in the presentation of METALEX, in particular
put_expression andput_keyword . - You may use use procedure
retrieve_analyzer from METALEX to retrieve an analyzer which a previous session saved into a file. - Finally, you may write a lexical grammar file (or reuse an existing one) and process it on the spot by using procedure
read_grammar from METALEX.
A lexical interface class
An example of a lexical interface class is POLY_LEX for the Polynomial example language. Here is the complete text of that class:indexing
description: "Lexical interface class for the Polynomial language"
class
POLY_LEX
inherit
L_INTERFACE
CONSTANTS
undefine
consistent,
copy,
is_equal,
setup
end
feature {NONE}
obtain_analyzer
-- Create lexical analyzer for the Polynomial language
do
ignore_case
keywords_ignore_case
build_expressions
build_keywords
set_separator_type (blanks)
end
build_expressions
-- Define regular expressions
-- for the Polynomial language
do
put_expression (special_expression, special, "Special")
put_expression ("*('a'..'z')", simple_identifier, "Simple_identifier")
put_expression ("+('0'..'9')", integer_constant, "Integer_constant")
put_expression ("+('\t'|'\n'|' ')", blanks, "Blanks")
end
special_expression: STRING
-- Regular expression describing Special
once
create Result.make (80)
Result.append ("('\050'..'\057')|")
Result.append ("('\072'..'\076')|")
Result.append ("'['|']'|'|'|'{'|'}'|%"->%"|%":=%"")
end
build_keywords
-- Define keywords (special symbols)
-- for the Polynomial language
do
put_keyword ("+", special)
put_keyword ("-", special)
put_keyword (";", special)
put_keyword (":", special)
put_keyword ("(", special)
put_keyword (")", special)
put_keyword ("*", special)
end
end
This class illustrates the straightforward scheme for writing lexical interface classes. It introduces constants such as Special to represent the regular expressions supported, and effects procedure
All the classes of a document processor that need to interact with the lexical analysis should inherit from a lexical interface class such as
More on terminal constructs
Terminal construct classes are examples of classes that need to interact with the lexical analysis, and should thus inherit from the lexical interface class.
Class class
INT_CONSTANT
inherit
TERMINAL
CONSTANTS
feature
token_type: INTEGER
do
Result := integer_constant
end
feature {NONE}
construct_name: STRING
once
Result := "INT_CONSTANT"
end
end
SPECIFYING THE SEMANTICS
As mentioned at the beginning of this chapter, parsing is usually done not for itself but as a way to perform some semantic processing. The EiffelParse Library classes define the general framework for grafting such semantics onto a syntactical stem.
Semantic procedures
The principal procedures for defining semantic actions are
As defined in
For
Often, the semantic procedures need to compute various elements of information. These may be recorded using appropriate attributes of the corresponding construct classes.
Polynomial semantics
As an example let us examine the semantics of the Product construct for the polynomial language. It is a repetition construct, with Term as the base construct; in other words a specimen of Product is a sequence of one or more terms, representing the product term1 * term2 ... * termn. Here is the
post_action
local
int_value: INTEGER
do
if not no_components then
from
child_start
if not child_after then
int_value := 1
end
until
child_after
loop
child.post_action
nt_value := int_value * info.child_value
child_forth
end
info.set_child_value (int_value)
end
end
Here each relevant construct class has an attribute
Keeping syntax and semantics separate
For simple examples such as the Polynomial language, it is convenient to use a single class to describe both the syntax of a construct (through the production function and associated features) and its semantics (action routines and associated features).
For more ambitious languages and processors, however, it is often preferable to keep the two aspects separate. Such separation of syntax and semantics, and in particular the sharing of the same syntax for different processors with different semantic actions, is hard or impossible to obtain with traditional document processing tools such as Yacc on Unix. Here is how to achieve it with the EiffelParse library:
- First write purely syntactic classes, that is to say construct classes which only effect the syntactical part (in particular function production). As a consequence, these classes usually remain deferred. The recommended convention for such syntactic classes is to use names beginning with
S_ , for exampleS_INSTRUCTION orS_LOOP . - Then for each construct for which a processor defines a certain semantics, define another class, called a semantic class, which inherits from the corresponding syntactic class. The recommended convention for semantic classes is to give them names which directly reflect the corresponding construct name, as in
INSTRUCTION orLOOP .
To build a semantic class in in step 2 it is often convenient to use multiple inheritance from a syntactic class and a "semantics-only" class. For example in a processor for Eiffel class
One of the advantages of this scheme is that it makes it easy to associate two or more types of processing with a single construct, by keeping the same syntactic class (such as
As noted earlier in this chapter, this is particularly useful in an environment where different processors need to perform differents actions on specimens of the same construct. In an Eiffel environment, for example, processors that manipulate classes and other Eiffel construct specimens may include a compiler, an interpreter, a flattener (producing the flat form), a class abstracter (producing the short or flat-short form), and various browsing tools such as those provided by Eiffel Software.
For obvious reasons of convenience and ease of maintenance, it is desirable to let these processors share the same syntactic descriptions. The method just described, relying on multiple inheritance, achieves this goal.
HOW PARSING WORKS
Classes AGGREGATE, CHOICE, TERMINAL and REPETITION are written in such a way that you do not need to take care of the parsing process. They make it possible to parse any language built according to the rules given - with one limitation, left recursion, discussed below. You can then concentrate on writing the interesting part - semantic processing.
To derive the maximum benefit from the EiffelParse library, however, it is useful to gain a little more insight into the way parsing works. Let us raise the veil just enough to see any remaining property that is relevant to the building of parsers and document processors.
The parsing technique
The EiffelParse library relies on a general approach known as recursive descent, meaning that various choices will be tried in sequence and recursively to recognize a certain specimen.
If a choice is attempted and fails (because it encounters input that does not conform to what is expected), the algorithm will try remaining choices, after having moved the input cursor back to where it was before the choice that failed. This process is called backtracking. It is handled by the parsing algorithms in an entirely automatic fashion, without programmer intervention.
Left recursion
Recursive descent implies the danger of infinite looping when parsing is attempted for left-recursive productions of the form A [=] A ...
or, more generally, cases in which the left recursion is indirect, as in A [=] B ...
B [=] C ...
...
L [=] A ...
Direct left recursion is easy to avoid, but indirect recursion may sneak in in less trivial ways.
To determine whether the production for a construct is directly or indirectly left-recursive, use the query left_recursion from class
Backtracking and the commit procedure
Another potential problem may arise from too much backtracking. In contrast with left recursion, this is a performance issue, not a threat to the correctness of the parsing algorithm. Automatic backtracking is in fact essential to the generality and flexibility of the recursive descent parsing algorithm; but too much of it may degrade the efficiency of the parsing mechanism.
Two techniques are available to minimize backtracking. One, mentioned above, is to organize the production functions for choice construct classes so that they list the most frequent cases first. The other is to use the commit procedure in the production functions for aggregate constructs.
A call to commit in an aggregate A is a hint to the parser, which means:
If you get to this point in trying to recognize a specimen of A as one among several possible choices for a choice construct C, and you later fail to obtain an A, then forget about other choices for C: you won't be able to find a C here. You may go back to the next higher-level choice before C - or admit failure if there is no such choice left.
Such a hint is useful when you want to let the parser benefit from some higher-level knowledge about the grammar, which is not directly deducible from the way the productions have been written.
Here is an example. The production function for (s)
where s is a specimen of production: LINKED_LIST [CONSTRUCT]
local
expression: SUM
once
create Result.make
Result.forth
keyword ("(")
commit
create expression.make
put (expression)
keyword (")")
end
The commit after the recognition of the keyword "(" is there to use the following piece of higher-level knowledge:
No choice production of the grammar that has NESTED as one of its alternatives has another alternative construct whose specimens could begin with an opening parenthesis "(".
Because of this property, if the parser goes so far as to recognize an opening parenthesis as part of parsing any construct
In this example,
The use of commit assumes global knowledge about the grammar and its future extensions, which is somewhat at odds with the evolutionary approach suggested by the Eiffel method. Applied improperly, this mechanism could lead to the rejection of valid texts as invalid. Used with care, however, it helps in obtaining high-performance parsing without impairing too much the simplicity of preparing parsers and other document processors.
BUILDING A DOCUMENT PROCESSOR
We are ready now to put together the various elements required to build a document processor based on the EiffelParse library.
The overall picture
The documents to be processed will be specimens of a certain construct. This construct is called the top construct for that particular processing.
Caution: Be sure to note that with the EiffelParse library there is no room for a concept of top construct of a grammar: the top construct is only defined with respect to a particular processor for that grammar.
Attempting to define the top of a grammar would be contrary to the object-oriented approach, which de-emphasizes any notion of top component of a system.
Different processors for the same grammar may use different top constructs.
A document processor will be a particular system made of construct classes, complemented by semantic classes, and usually by other auxiliary classes. One of the construct classes corresponds to the top construct and is called the top construct class.
Steps in the execution of a document processor
As any root class of a system, the root of a document processor must have a creation procedure which starts the execution. Here the task of this class is the following:
- Define an object representing the input document to be processed; this will be an instance of class
INPUT . - Obtain a lexical analyzer applicable to the language, and connect it with the document.
- Select an input file, containing the actual document to be processed.
- Process the document: in other words, parse it and, if parsing is successful, apply the semantics.
- To execute these steps in a simple and convenient fashion, it is useful to declare the root class as a descendant of the lexical interface class. The root class, being an heir to the top construct class, will also be a descendant of
CONSTRUCT .
Connecting with lexical analysis
To achieve the effect of steps 1 and 2 , a simple call instruction suffices: just call the procedure build, inherited from build (document)
Although you may use this line as a recipe with no need for further justification, it is interesting to see what build does. Feature document describes the input document to be processed; it is introduced as a Once function in class build (doc: INPUT)
-- Create lexical analyzer and set doc
-- to be the input document.
require
document_exists: doc /= void
do
metalex_make
obtain_analyzer
make_analyzer
doc.set_lexical (analyzer)
end
The call to obtain_analyzer defines the regular grammar for the language at hand. Recall that obtain_analyzer is deferred in
Starting the actual processing
The call build (
document takes care of steps 1 and 2 of the root's creation procedure. Step 3 selects the file containing the input document; this is achieved through the call document.set_input_file (some_file_name)
where set_input_file, from class
Finally, step 4 (processing the document) is simply a call to procedure process, obtained from CONSTRUCT . Recall that this procedure simply executes parse
if parsed then
semantics
end
The structure of a full example
The polynomial example provides a simple example of a full document processor, which you may use as a guide for your own processors. The root class of that example is root_line: LINE
make
local
text_name: STRING
do
create root_line.make
build (root_line.document)
... Instructions prompting the user for the name of the
file to be parsed, and assigning it to text_name ...
root_line.document.set_input_file (text_name)
root_line.process
end
Although it covers a small language, this example may serve as a blueprint for most applications of the EiffelParse library.
FUTURE WORK
It was mentioned at the beginning of this chapter that further work is desirable to make the EiffelParse library reach its full bloom. Here is a glimpse of future improvements.
Expressions
Many languages include an expression construct having the properties of traditional arithmetic expressions:
- An expression is a succession of basic operands and operators.
- The basic operands are lexical elements, such as identifiers and constants.
- Operators are used in prefix mode (as in - a) or infix mode (as in b - a).
- Each operator has a precedence level; precedence levels determine the abstract syntactic structure of expressions and consequently their semantics. For example the abstract structure of a + b * c shows this expression to be the application of the operator + to a and to the application of the operator * to b and c. That this is the correct interpretation of the instruction follows from the property that * has a higher precedence ("binds more tightly") than +.
- Parentheses pairs, such as ( ) or [ ], can be used to enforce a structure different from what the precedence rules would imply, as in (a + b) * c.
- Some infix operators may be applied to more than two arguments; in this case it must be clear whether they are right-associative (in other words, a ^ b ^ c means a ^ (b ^ c), the conventional interpretation if ^ denotes the power operator) or left-associative.
It is of course possible to apply the EiffelParse library in its current state to support expressions, as illustrated by this extract from the Polynomial grammar given in full above:Variables [=] {Identifier ";" ...}
Sum [=] {Diff "+" ...}
Diff [=] {Product "-" ...}
Product [=] {Term "*" ...}
The problem then is not expressiveness but efficiency. For such expressions the recursive descent technique, however well adapted to the higher-level structures of a language, takes too much time and generates too many tree nodes. Efficient bottom-up parsing techniques are available for this case.
The solution is straightforward: write a new heir
Beyond the addition of an
Yooc
To describe the syntax of a language, it is convenient to use a textual format such as the one that has served in this chapter to illustrate the various forms of production. The correspondence between such a format and the construct classes is straightforward; for example, as explained above, the production Line [=] Variables ":" Sum
will yield the classclass
LINE
inherit
AGGREGATE
feature
production: LINKED_LIST [CONSTRUCT]
local
var: VARIABLES
sum: SUM
once
create Result.make
Result.forth
create var.make
put (var)
keyword (":")
create sum.make
put (sum)
end
...
end
This transformation of the textual description of the grammar into its equivalent Eiffel form is simple and unambiguous; but it is somewhat annoying to have to perform it manually.
A tool complementing the EiffelParse library and known as YOOC ("Yes! an Object-Oriented Compiler", a name meant as an homage to the venerable Yacc) has been planned for future releases of EiffelParse. Yooc, a translator, will take a grammar specification as input and transform it into a set of parsing classes, all descendants of CONSTRUCT and built according to the rules defined above. The input format for syntax specification, similar to the conventions used throughout this chapter, is a variant of LDL (Language Description Language), a component of the ArchiText structural document processing system.
Further reading
The following article describes some advanced uses of the EiffelParse library as well as a Yooc-like translator called PG: Per Grape and Kim Walden: Automating the Development of Syntax Tree Generators for an Evolving Language, in Proceedings of TOOLS 8 (Technology of Object-Oriented Languages and Systems), Prentice Hall, 1992, pages 185-195.