<\author-address>
  LIX, CNRS, École polytechnique, 91128 Palaiseau Cedex, France

<\doc-note>
  This work has been supported by the ANR-09-JCJC-0098-01 project, as well as a Digiteo 2009-36HD grant and Région Ile-de-France.

<\abstract>
  Currently, there exists a big gap between formal computer-understandable mathematics and informal mathematics, as written by humans. On closer inspection, there are two important subproblems: making documents written by humans at least syntactically understandable for computers, and the formal verification of the actual mathematics in the documents. In this paper, we will focus on the first problem. For the time being, most authors use TeX, LaTeX, or one of its graphical front-ends in order to write documents with many mathematical formulas. In the past decade, we have developed an alternative wysiwyg system, GNU TeXmacs, which is not based on TeX. All these systems are only adequate for visual typesetting and do not carry much semantics. Stated in the MathML jargon, they concentrate on presentation markup, not content markup. In recent versions of TeXmacs, we have started to integrate facilities for the semantic editing of formulas. In this paper, we will describe these facilities and expand on the underlying motivation and design choices. In short, we continue to allow the user to enter formulas in a visually oriented way. In the background, we continuously run a packrat parser, which attempts to convert (potentially incomplete) formulas into content markup. As long as all formulas remain sufficiently correct, the editor can operate on both the visual and the semantic level, independently of the low-level representation being used. An important related topic, which will also be discussed at length, is the automatic correction of syntax errors in existing mathematical documents. In particular, the syntax corrector that we have implemented enables us to upgrade existing documents and to test our parsing grammar on various books and papers from different sources. We will provide a detailed analysis of these experiments.

One major challenge for the design of mathematical text editors is the possibility to give mathematical formulas more semantics. There are many potential applications of mathematical texts with a richer semantics: it would be easier and more robust to copy and paste formulas between a text and a computer algebra system, one might search for formulas on the web, various kinds of ``typos'' could be detected automatically while formulas are being entered, etc. Currently, most mathematicians write their documents in TeX, LaTeX, or one of its graphical front-ends. Such documents usually focus on presentation and not on mathematical correctness, not even syntactic correctness. In the past decade, we have developed GNU TeXmacs as an alternative structured wysiwyg text editor. TeXmacs does not rely on TeX, and can be freely downloaded from the TeXmacs website. The main aims of TeXmacs are user-friendliness, high quality typesetting, and its use as an interface for external systems. However, until recently, mathematical formulas inside TeXmacs only carried presentation semantics. Giving mathematical formulas more semantics can be regarded as a gradual process. Starting with formulas which only carry presentation semantics, we ultimately would like to reach the stage where all formulas can be checked by a formal proof checker. In between, the formulas might at least be correct from the syntactic point of view, or only involve non ambiguous symbols.
In this paper, we focus on techniques for making it easy for authors to create papers in which all formulas are correct from the syntactic point of view. In the MathML jargon, this means that we can produce both presentation markup and content markup for all formulas. From now on, when we speak about semantics, we will always mean syntactical semantics. For instance, in a formula such as a+b, we merely want to detect or enforce that the operator + applies to a and b; we are not interested in the actual mathematical meaning of addition. If we are allowed to constrain the user to enter all texts in a way which is comfortable for the designer of the editor, then syntactic correctness can easily be achieved: we might constrain the user to directly enter content markup. Similarly, if the author is only allowed to write text in a specific mathematical sub-language (say, all formulas which can be parsed by some computer algebra system), then it might be possible to develop tools which fulfill our requirements. Therefore, the real challenge is to develop a general purpose mathematical editor, which imposes minimal extra constraints on the user with respect to a presentation-oriented editor, yet allows for the production of syntactically correct documents. As often in the area of user interfaces, there is a psychological factor in whether a particular solution is perceived as satisfactory: certain hackers might happily enter content markup directly. Instead of proposing a single best solution, we will therefore present a collection of ideas and techniques for simplifying the task of writing syntactically correct formulas; time will tell which of them will be most useful for the typical user. One of the central ideas behind our approach is to stick as much as possible to an existing user friendly and presentation-oriented editor, while using a ``syntax checker/corrector'' in the background. This syntax checker will ensure that it is always possible to formally parse the formulas in the document, even though these formulas are represented using presentation markup. We will see that this is possible after a few minor modifications of the editor and the development of a suitable series of tools for the manipulation of formal languages. All our implementations were done in TeXmacs, but our techniques might be implemented in other editors as well. An interesting related question is whether the production of syntactically correct documents is indeed best achieved during the editing phase: there might exist some magical algorithm for giving more semantics to most existing documents. In this paper, we will also investigate this idea and single out some of the most important patterns which cause problems. Of course, from the perspective of a piece of software with a non trivial user base, it is desirable that the provided syntax corrector can also be used in order to upgrade existing documents. The paper is organized into two main parts: the first part (sections , and ) does not depend on the packrat-based formal language tools, whereas the second part (sections , and ) crucially depends on these tools. The last section contains some applications and ideas for future developments. In section , we first review the specificities of various formats for mathematical formulas. In section , we discuss elementary techniques for giving existing documents more semantics and the main obstacles encountered in this process. In section , we present a few minimal changes which were made inside TeXmacs in order to remedy some of these obstacles. In section , we recall the concept of a packrat grammar.
This kind of grammars are both natural to specify, quick to parse and convenient for the specification of on-the-fly grammars, which can be locally modified inside a text. We have implemented a formal language facility inside , which is based on packrat parsing, but also contains a few additional features. In section, we will discuss the development of a standard grammar for mathematics, and different approaches for customizating this grammar. In section, we present several grammar-assisted editing tools which have been implemented inside. In order to analyze the adequacy of the proposals in this paper, we have made a selection of a few books and other documents from various sources, and tested our algorithms on this material (1303 pages with 59504 formulas in total). These experiments, which are discussed in detail in sections and, have given us a better understanding of the most common syntactical errors, how to correct them, the quality of our parsing grammar and remaining challenges. We have one our best to make our experiments as reproducible as possible. Our results were obtained using the SVN development revision 4088 of , which should be more or less equivalent to the unstable release 1.0.7.10. \ For detailed information on how to use the semantic editing features, please consult the integrated documentation by clicking on the menu entry . The present document is also intended as a source for complementary information for users who are interested in semantic mathematical editing. With the kind permission of their authors, the test documents are publicly available from <\code> \ \ \ \ \ \ \ \ As a disclaimer, we would like to add that we take responsibility for any bad usage that the authors may have made of . Except for a few genuine typos, the semantic errors which were detected in their work are mostly due to current imperfections in . Taking into account the numerous existing formula editors, it is a subtle task to compare every single idea in this paper to previous work. Indeed, many of the individual ideas and remarks can probably be traced back to earlier implementations. However, such earlier implementations are often experiments in a more restricted context. We recall that the aim of the current paper is to provide a general purpose tool with the average ``working mathematician'' (or scientist) as its prototype user. The main challenge is thus to fit the various building bricks together such that this goal is achieved. Nevertheless, we would like to mention a few people who directly influenced this paper. First of all, the decision to make the user interface of purely graphically oriented was inspired by a talk of on the system. This system was merely intended as a general purpose interface for computer algebra systems. It also integrated aparser for making the editor's behaviour depend on the semantics, although the decoupling between the editor and the parser was not as clean as in our present implementation. We are also grateful to for pointing us to the concept of a packrat grammar. Finally, some early related, but unpublished work on semantic editing was done jointly with and . In the present paper, we follow a different approachthough. We finally notice that some semantic editing features are integrated into the proprietary computer algebra systems and . However, these features are not that much intended for general purpose mathematical editing, but rather for the creation of documented worksheets. 
Since scientists do not receive free copies of these systems, we were unable to evaluate the features in detail. > The current standard for mathematical documents is / . There are three main features which make convenient for typing mathematical documents: <\enumerate> The use of a cryptic, yet suggestive pseudo-language for ASCII source code. The possibility for the user to extend the language with new macros. A high typographic quality. In addition, provides the user with some rudimentary support for structuring documents. One major drawback of / is that it is not really a data format, but rather aprogramming language. This language is very unconventional in the sense that it does not admit a formal syntax. Indeed, the syntax of can be ``hacked'' on the fly, and may for instance depend on the parity of the current page. This is actually one important reason for the``success'' of the system: the ill-defined syntax makes it very hard to reliably convert existing documents into other formats. In particular, the only reliable parser for is itself. If, for convenience, we want to consider / as a format, then we have to restrict ourselves to asublanguage, on which we impose severe limits to the definability of new macros and syntax extensions. Unfortunately, in practice, few existing documents conform to such more rigourous versions of , so the conversion of / to other formats necessarily involves a certain amount of heuristics. This is even true for papers which conform to a journal style with dedicated macros, since the user is not forced to use only a restricted subset of the available primitives. Even if we ignore the inconsistent syntax of , another drawback of / is its lack of formal semantics. Consider for instance the formulas and . Are and variables or functions? In the formulas and , what is the scope of the big summation? Should we consider to be incorrect, or did we intend to write a french-style interval? Some heuristics for answering these questions will be discussed below. There are more problems with /, such as inconsistencies in semantics of the many existing style packages, but these problems will be of lesser importance for what follows. The format was designed from scratch, with two main objectives: <\enumerate> Having a standard format for mathematics on the web, which assumes compatibility with existing technology. Defining a DTD with a precise semantics, which is sufficient for the representation of mathematical formulas. In particular, the format provides a remedy to most of the major drawbacks of / that we mentioned above. In fact, is divided into two main parts. The first part concerns and provides constructs for describing the presentation of mathematical formulas. The second part concerns and describes the actual syntactic meaning of a mathematical expression. For instance, returning to the formula , the typical presentation markup would be as follows <\verbatim-code> \mrow\ \ \ \mi\a\/mi\ \ \ \mo\⁢\!-- ⁢ --\\/mo\ \ \ \mrow\ \ \ \ \ \mo\(\/mo\ \ \ \ \ \mi\b\/mi\ \ \ \ \ \mo\+\/mo\ \ \ \ \ \mi\c\/mi\ \ \ \ \ \mo\)\/mo\ \ \ \/mrow\ \/mrow\ We observe two interesting things here: first of all, the invisible multiplication makes it clear that we intended to multiply with in the formula >. Secondly, the (optional) inner mrow\> and /mrow\> suggest the scope and intended meaning of the brackets. 
In principle, the above piece of presentation markup therefore already carries enough semantics so as to produce the corresponding content markup: <\verbatim-code> \apply\ \ \ \times/\ \ \ \ci\a\ci\ \ \ \apply\ \ \ \ \ \plus/\ \ \ \ \ \ci\b\/ci\ \ \ \ \ \ci\c\/ci\ \ \ \/apply\ \/apply\ This fragment of code can be regarded as a verbose variant of the expression <\scm-code> (* a (+ b c)) More generally, as illustrated by the above examples, tends to be very precise and fairly complete, but also very verbose. For this reason, has not yet succeeded to impose itself as the new standard for the working scientist. Browser support has also been quite poor, although this situation is slowly improving. > The three main objects of the original project were the following: <\enumerate> Provide a free and user friendly wysiwyg for mathematical formulas. Provide a typesetting quality which is as least as good as the quality of . Make it possible to use as an interface for other software, such as computer algebra systems, while keeping a good typesetting quality for large formulas. One consequence of the first point is that we require an internal format for formulas which is not read-only, but also suitable for modifications. For a wysiwyg editor, it is also more convenient to work with tree representations of documents, rather than ASCII strings. provides three ways to ``serialize'' internal trees as strings (native human readable, XML and ). Nevertheless, the important issue is not that much the syntax of the format, but rather asufficiently rich semantics which makes it possible to documents into other formats. In this respect, modern would have been perfectly suitable as a format for . In order to be as compatible with / as possible, the original internal format very much corresponded to a ``clean'' tree representation for / documents. Some drawbacks of , such as lacking scopes of brackets and big operators, were therefore also present in . On the other hand, incites users to make adistinction between multiplication and function application, which is important for the use of as an interface to computer algebra systems. For instance, > is entered by typing , and internally represented as a string leaf <\tm-fragment> a*(b+c) in all versions prior to 1.0.7.6. More generally, provides non ambiguous symbols for various mathematical constants (>, >, >, ) and operators. However, as ageneral rule, traditional documents remain presentation oriented. There are many other formats for the machine representation of mathematical texts. or expressions have traditionally been used as a precursor for content markup, but without asimilar effort towards standardization. In this paper, we will use notation for content markup, since it is less verbose than . We notice that is also used inside as an extension language (similar to - in the case of ). For some applications, such as automatic theorem proving and communication between computer algebra systems, it is important to develop mathematical data formats with even more semantics. Some initiatives in this direction are and . However, these topics are beyond the scope of this paper. The multiplication versus function application ambiguity mentioned in section is probably the most important obstacle to the automatic association of a semantics to mathematical formulas. There are several other annoying notational ambiguities, which are mostly due to the use of the same glyph for different purposes. 
In this section, we will list the most frequent ambiguities, which were encountered in our own documents and in a collection of third party documents to be described in section below. <\description> Besides multiplication and function application, there are several other invisible operators and symbols: <\itemize> Invisible separators, as in matrix or tensor notation j>|)>>. Invisible addition, as in >. Invisible ``wildcards'', as in the increment operator +1> (also denoted by >+1>) Invisible brackets, for forced matching if we want to omit a bracket. Invisible ellipses as in the center of the matrix 1>>|>|n>>>|>||>>|1>>|>|n>>>>>>>. Invisible zeros as in 1>>||>||>|>|||n>>>>>>>. Invisible operators have been incorporated in the standard, even though invisible wildcards, brackets, ellipses and zeros are not yet supported. We also notice that function application is implicit in formulas such as > and explicit in formulas such as or . Vertical bars are used in many circumstances: <\itemize> As brackets in absolute values > or ``ket'' notation |A|\|>>. As ``such that'' separators in sets X\|a\0|}>> or lists. As the ``divides'' predicate 1001> (both in binary and decimal notation). As separators |a|\|>a|\>>. For restricting domains or images of applications: \|>, \|>. The possibility to use bars as brackets is particularly problematic for parsers, since it is not known whether we are dealing with an opening bracket, a closing bracket, or no bracket at all. The comma may either be a separator, as in >, or a decimal comma, as in 14159\>, or a grouping symbol, as in 000\000>. The period symbol``.'' can be used instead of a comma in numbers, but also as a data access operator , or as a connector in lambda expressions x.x>. Moreover, in the formula <\equation*> a+b=c, ponctuation is used in the traditional, non-mathematical way. The semicolon ``>'' is sometimes used as an alternative for such that (example: X:x\0|}>>), but might also indicate division or the binary infix ``of type'', as in >. In what follows, will refer to two semantically different symbols which are graphically depicted by the same glyph (and apotentially different spacing). In particular, the various invisible operators mentioned above are homoglyphs. Some other common homoglyphs are as follows: <\itemize> The backslash is often used for ``subtraction'' of sets Y>. The dot >> can be used as a symbol for multiplication (example: \\>) or as a wildcard (example: the norm >|\|>>). The wedge > can be used as the logical and operator or as the wedge product. It should be noticed that support for homoglyphs is quite poor; see section for a discussion. Authors who are not aware of the ambiguities described in the previous section are inclined to produce documents with syntax errors. Following / habits, some authors consistently omit multiplications and function applications. Others systematically replace multiplications by function applications. Besides errors due to ambiguities, the following kinds of errors are also quite common: <\description> Inside , where we recall that spaces correspond to function application, users often leave superfluous spaces at the end of formulas (after changing their mind on completing a formula) or around operators (following habits for typing ASCII text). In wysiwyg editors, context changes (such as font changes, or switches between text mode and math mode) are invisible. Even though provides visual hints to indicate the current context, misplaced context changes are a common mistake. 
Using notation, the most common erroneous or inappropriate patterns are as follows: <\itemize> Misplaced parts: . Breaks: . Redundancies I: . Redundancies II: . Notice that some of these patterns are introduced in a natural way through certain operations, such as copy and paste, if no special routines are implemented in order to avoid this kind of nuisance. Both in ASCII-based and wysiwyg presentation-oriented editors, there is no reliable way to detect matching brackets, if no special markup is introduced to distinguish between opening and closing brackets. A reasonable heuristic is that opening and closing brackets should be of the same ``type'', such as and or and . However, this heuristic is violated for several common notations: <\itemize> Interval notation > or >. Ket notation |A|\|>>. Absolute values are also particularly difficult to handle, since the opening and closing brackets coincide. If no special markup is present to indicate the scope of a big operator, then it is difficult to determine appropriate scopes in formulas such as and . Although provided an invisible closing bracket, users (including ourselves) tended to ignore or misuse it (of course, this really means that the introduction of invisible closing brackets was not the right solution to the scoping problem). One common aspect of all these errors is that authors who only care about presentation will usually not notice them at all: the printed versions of erroneous and correct texts are generally the same, or only differ in their spacings. implements several heuristics to detect and correct syntax errors. Some of these heuristics are ``conservative'' in the sense that they will only perform corrections on incorrect texts and when we are sure or pretty sure that the corrections are right. Other heuristics are more ``agressive'' and may for instance replace spaces by multiplications or whenever this seems reasonable. However, in unlucky cases, the agressive heuristics might replace a correct symbol by an unwanted one. Notice that none of the syntactic corrections alters the presentation of the document, except for some differences in the spacing. One major technique which is used in many of the heuristics is to associate asymbol type to each symbol in a mathematical expression. For instance, we associate the type ``infix'' to >>, ``opening bracket'' to , and special types to subscripts, superscripts and primes. Using these types, we can detect immediately the incorrectness of an expression such as b)\a|)>>. Some of our algorithms also rely on the binding forces of mathematical symbols. For instance, the binding force of multiplication is larger than the binding force of addition. It is important to apply the different algorithms for syntax correction in an appropriate order, since certain heuristics may become more efficient on partially corrected text. For instance, it is recommended to replace by before launching the heuristic for determining matching brackets. We will now list, in the appropriate order, the most important heuristics for syntax correction which are implemented in . <\description> It is fairly easy to correct misplaced invisible markup, detection and correction of the various patterns that we have described in the previous section. There are several heuristics for the determination of matching brackets, starting with the most conservative ones and ending with the most agressive ones if no appropriate matches were found. 
The action of the routine for bracket matching can be thought of as the insertion of appropriate tags in or pairs in . For instance, might be transformed into . Each of the heuristics proposes an ``opening'', ``closing'', ``middle'' or ``unknown'' status for each of the brackets in the formula and then launches a straightforward matching algorithm. The first most conservative heuristic first checks whether there are any possibly incorrect brackets in the formula, such as the closing bracket in>>, and turns their status into ``unknown''. The status of the remaining brackets is the default one: opening for , , , and closing for , , . We next launch a series of more and more agressive heuristics for the detection of particular patters, such as french intervals >, absolute values >, ket notation|a|\|>>,etc. If, at the end of all these heuristics, some brackets still do not match, then we (optionally) match them with empty brackets. For instance, will be turned into . We notice that the heuristics may benefit from matching brackets which are detected by earlier heuristics. For instance, in the formula |)>=g+|)>>, the brackets > are matched at an early stage, after which we only need to correct the subformulas > and +> instead of the formula as a whole. The determination of scopes of big operators is intertwined with bracket matching. The main heuristic we use is based on the binding forces of infix operations inside the formula and the binding forces of the big operators themselves. For instance, in the formula <\equation> |>a>=|>b*|>c>>, the binding force of the equality is weaker than the binding force of a summation, whereas the binding force of the invisible multiplication is larger. This heuristic turns out to be extremely reliable, even though it would be incorrect on the formula <\equation*> |a>+b, where the big summation is taken to have a slightly larger binding force than the addition. The heuristic might be further improved by determining which summation variables occur in which summands. For instance, in the formula(), the occurrenceof inside > gives a second indication that we are dealing with a nested summation. Trailing or other superfluous invisible symbols are easily detected and removed from formulas. Another easy task is the detection of errors which can be repaired through the replacement of a symbol by a homoglyph. For instance, if the backslash symbol (with type ``basic symbol'') occurs between two letters (also with types ``basic symbol''), as in , then we may replace it by the ``setminus'' infix. The insertion of missing invisible symbols (or replacement of incorrect ones) is one of the hardest tasks for syntax correction. Two main situations need to be distinguished: <\itemize> The text to be corrected was written with and we may assume that most of the multiply/apply ambiguities were already resolved by the author. The text to be corrected was imported from /, so that no multiplications or function applications were present in the source document. In the first case, most multiplications and function applications are already correct, so the correction algorithm should be very conservative. In the second case, the insertions should be very agressive, at the risk of being incorrect in many cases. At a first stage, it is important to determine situations in which it is more or less clear whether to insert a multiplication or a function application. For instance, between a number and a letter, it is pretty sure that we have to insert a multiplication. 
Similarly, after operators such as >, it is pretty sure that we have to insert a space. At a second stage, we determine how all letter symbols occurring in the document are ``used'', whether they occur as lefthand or righthand arguments of multiplications or function applications. It is likely that they need to be used in asimilar way throughout the document, thereby providing useful hints on what todo. At the end, we insert the remaining missing symbols, while pondering our decisions as a function of the hints and whether we need to be conservative or not. In this section, we report on the performance of the syntax corrector. The results were obtained by activating the debugging tool inside as well as the option. This allows the user to display detailed status reports on the number of errors in all loaded files using . We have tested our corrector on various types of documents from various sources: <\description> This book on algorithms in real algebraic geometry was originally written in and then converted to and transformed into a book with extra interactive features. >>The -th chapter of BPR. This corresponds to a collection of six courses for students on various topics in mathematics. >The third paper in COL on information theory. This Master's Thesis was written by an early user and physicist. Our own book on transseries, which was started quite long ago with an early version of and completed during the last decade. This habilitation thesis was written more recently. >>The -th chapter of HAB. In the introduction, we have mentioned URLs where these documents can be downloaded. We also repeat that (and not the authors) should be held responsible for most of the semantic errors which were found in these documents. In table, we have compiled the main results of the syntax corrector. Concerning the total number of formulas, we notice that each non empty entry of an equation array is counted as a separate formula. For the books, we only ran the corrector on the main chapters, not on the glossaries, indexes, etc. The concept of ``error'' will be described in more detail in section below. Roughly speaking, a correct formula is a formula that can be parsed in suitable (tolerant) grammar, which will be described in section. In section, we will discuss in more detail to which extent our notion of ``error'' corresponds to the intuitive notion of a syntax error or ``typo''. The corrector tends to reduce the number of errors by a factor between and . The typical percentage of remaining errors varies between and . This error rate tends to be slightly higher for documents with many complex formulas, such as MT; in ordinary mathematical documents, most formulas are very short, whence trivially correct. Furthermore, many errors tend to be concentrated at certain portions of a document. For instance, the error rate of the individual document COL is as about twice as low as the error rate of the collection COL. Similarly, the bulk of the errors in LN are concentrated in the first chapters. Finally, errors tend to be repeated according to patterns, induced by systematic erroneous notation. ||||||>|>||>|||>|||||||||>|||||||||>|||||||||>|||||||||>>>>|Global performance of the syntax corrector on various documents.> It is interesting to study the relative correction rates of the techniques described in the previous section. More precisely, we have implemented the following algorithms: <\description> Correction of invisible structured markup. 
Since all documents were written with older versions of in which bracket matching was not enforced (see also section below), this algorithm implements a collection of heuristics to make all brackets match. In the case of absolute values, which cannot be parsed before this step, this typically decreases the number of errors. The algorithm also determines the scopes of big operators. This algorithm detects and moves incorrectly placed brackets with respect to invisible structured markup. For instance, the formula $y=f(x$).> would be corrected into Removal of superfluous invisible symbols. Heuristic insertion of missing invisible symbols, either function applications or multiplications. Replacing symbols by appropriate homoglyphs. Various other useful corrections. The results are shown in table. In a sense, among the original errors, one should not really count the errors which are corrected by the bracket matching algorithm. Indeed, the versions of which were used for authoring the documents contained no standard support for, say, absolute values or big operator scopes. Hence, the author had no real possibility to avoid parsing errors caused by such constructs. Fortunately, our bracket matching algorithm is highly effective at finding the correct matches. One rare example where the algorithm ``fails'' is the formula,|\|>|)>> from COL, where > was tagged to be amultiplication. \ The overwhelming source of remaining error corrections is due to missing, wrong or superfluous invisible multiplications and function applications. This fact is not surprising for documents such as BPR, which were partially imported from . Curiously, the situation seems not much better for MT, COL and IT, which were all written using . Even in our own documents LN and HAB, there are still a considerable number of ``invisible errors''. This fact illustrates that provides poor incitation to correctly enter meaningful invisible content. In COL, the author actually took the habit to systematically replace multiplications by function applications. Consequently, many of the ``correct'' formulas do not have the intended meaning. |||||>|>||>|||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>|||||||||>>>>|Numbers of corrections due to individual algorithms.> In section , we have described many common ambiguities and syntactical errors, and provided heuristics for correcting them. Of course, a more satisfactory solution would be to develop mathematical editors in such a way that these errors are ruled out right from the start, or at least, that it is harder for the user to enter obviously incorrect formulas. In the process of making mathematical editors more sophisticated, one should nevertheless keep in mind that most authors are only interested in the presentation of the final document, and not in its correctness. For most users, presentation oriented interfaces are therefore the most intuitive ones. In particular, every time that we deviate from being presentation oriented, it should be clear for the user that it buys him more than he looses. There are three basic techniques for inciting the user to write syntactically correct documents: <\itemize> Enforcing correctness the introduction of suitable markup. Providing visual hints about the semantics of a formula. Writing documentation. Of course, the main problem with the first method is that it may violate the above principle of insisting on presentation oriented editing. The problem with the second method is that hints are either irritating or easily ignored. 
The third option is well-intentioned, but who reads and follows the documentation? One may also wish to introduce one or more user preferences for the desired degree of syntactic correctness. In some cases, correctness can also be ensured by running appropriate syntax correction methods from section. For instance, the algorithms for the correction of invisible markup can be applied by default. Alternatively, one may redesign the most basic editing routines so as to ensure correctness all the way along. Matching brackets and well-scoped big operators are particularly important for structuring mathematical formulas. It is therefore reasonable to adapt markup so as to enforce these properties. In a formula with many brackets, this is usually helpful for the user as well, since it can be tedious to count opening and closing brackets. Highlighting matching brackets consists an alternative solution, which is used in many editors for programming languages. In 1.0.7.7, we have introduced a special ternary markup element (similar to in ), which takes two brackets and a body as its arguments. A possible future extension is to allow for additional pairs of middle brackets and extra arguments. The typical behaviour of a ``matching bracket mode'' is as follows: <\itemize> When typing an opening bracket, such as , the corresponding closing bracket is inserted automatically and the cursor is placed in between |)>>. The brackets are removed on pressing or inside a pair of matching brackets with nothing in between. The tag has ``no border'', in the sense that, in the formula +y>, there is only one accessible cursor position between and and >. In addition, one has to decide on ways to replace the opening and closing brackets, if necessary, thereby enabling the user to type intervals such as > or >, or other kinds of ``matching brackets''. In , this is currently done as follows: <\itemize> When removing an opening or closing bracket, it is actually replaced by an invisible bracket. In the special case that there was nothing between the brackets, or when both brackets become invisible, we remove the brackets. Notice that the cursor can be positioned before or after an invisible bracket. If we type an opening or closing bracket just before or after an invisible bracket, then we replace the invisible bracket by the new bracket. For instance, we may enter > by typing . In addition, we use the following convenient rule: <\itemize> When typing a closing bracket just before the closing bracket of an tag, we replace the existing closing bracket by the new one. An alternative way for entering > is therefore to type . Notice that the last rule also makes the ``matching bracket mode'' more compatible with the naive presentation oriented editing mode. Indeed, the following table shows in detail what happens when typing > in both modes: <\with|par-mode|center> ||>|>|>>|>>>|>|>>||)>>>>|>|>>||)>>>>|>|>>|>>>>>> Notice that the shortcut >> is equivalent to in this example. The above rules also apply for bar-like brackets, except that the bar is either regarded as an opening or closing bracket, depending on the context. For instance, when typing in an empty formula, we start an absolute value |\|>>. On the other hand, when the cursor is just before a closing bracket, as in |A|\>>, then typing results in the replacement of the closing bracket by a bar: A\|>. As will be explained below, other bar-like symbols can be obtained using the key. The matching bracket mode in also enforces all big operators to be scoped. 
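Before turning to big operators, the above keystroke rules can be summarized as follows. The sketch below is purely illustrative: the typed key and the cursor context are represented abstractly, and the returned symbols merely name the editor actions described in the text; it is not the actual TeXmacs code.

<\scm-code>
;; Illustrative summary of the matching bracket rules: given the typed
;; key and an abstract description of the cursor context, name the
;; action performed by the editor.
(define (bracket-action key context)
  (cond ((member key '("(" "[" "{"))
         'insert-matching-pair)            ; "(" yields "()" with the cursor in between
        ((eq? context 'before-invisible-bracket)
         'replace-invisible-bracket)       ; a visible bracket replaces an invisible one
        ((eq? context 'before-closing-bracket)
         'replace-closing-bracket)         ; "(1,2" followed by "]" yields (1,2]
        ((equal? key "|")
         'start-absolute-value)            ; a bar in an empty formula opens |...|
        (else 'insert-symbol)))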
As in the case of an invisible bracket, there are two cursor positions next to the invisible righthand border of a scoped big operator: inside and outside the scope of the operator. We use light cyan boxes in order to indicate the precise position of the cursor: <\with|par-mode|center> ||||||>a>>>>>>>>>=\>>|||>a>>>=\>>>||>|>>>> Currently, the implementation of subscripts and superscripts is still presentation oriented. The problem here is the following: when entering an expression such as>, there are two options for detecting that is the base for superscript: <\itemize> By parsing the complete expression. By forcing the user to specify the base by means of special markup. The first option is out of scope for ``cheap semantic editing'' and only realistic in combination with the tools described in the second part of this paper. As in the case of big operators for the right border, the second option requires the introduction of an invisible left border with two adjacent cursor positions. However, such invisible borders turn out to be less natural for left borders than for right borders, partly because we have to start entering the script entering the base. <\remark> In fact, it turns out that our heuristic for the determination of scopes of big operators works ``so well'' that it is not yet clear to is whether it is really necessary to keep the scopes of big operators inside the document. Indeed, in the case of a formula such as|a+b>>, for which our heuristic fails, the formula would actually be more readable if we put brackets around +b>. If we want to consider big operators just as prefixes, then this does require some modifications in our universal grammar, which will be specified in section below. For instance, the non-terminal symbol would admit additional rules for |>>>, |>>>, etc. provides a simple ``variant'' mechanism, based on using the key. For instance, Greek letters >, >, >, can be entered by typing , , , etc. Similarly, the variants >>, >> and >> of >> can be obtained by typing => and pressing the tab key several times. Using the variant mechanism, there are also simple shortcuts , , and for the most frequent mathematical constants >, >, > and >. Similarly, can be used for typing the operator > in |f*\ x>>. It is natural to use the variant mechanism for disambiguating homoglyphs as well. In table, we have displayed the current keyboard shortcuts for some of the most common homoglyphs. <\big-table|||>|>|>|>|>|>|>|>|>>|>|>|aj>>>|>|>|+x>>|>|>|>>|>|>|14159\>>|>|>|R:x\0|}>>>|>|>|>|>|>|>>|>|>|R\|x\0|}>>>|>|>|1001>>|>>>>> shortcuts for common homoglyphs. \; Regarding invisible operators, there are a few additional rules. First of all, letters which are not separated by invisible operators are considered as operators. For instance, typing yields and not , as in . This provides some incitation for the user to explicitly enter invisible multiplications or function applications. Also, the editor forbids entering sequences of more than one invisible multiplication or function application. In other words, typing has the same effect as typing . Besides making it easy for the user to disambiguate homoglyphs, the editor also provides visual hints. For instance, the difference between the vertical bar as a separator or adivision is made clear through the added spacing in the case of divisions. The status bar also provides information about the last symbol which has been entered. 
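Under the hood, the variant mechanism may be thought of as a table which, for each glyph, lists the semantically distinct symbols that the tab key cycles through. The excerpt below is purely illustrative; the names on the right are invented for this sketch and are not the actual TeXmacs symbol names.

<\scm-code>
;; Purely illustrative table of homoglyph variants reachable through
;; the variant key; the symbol names are made up for this sketch.
(define homoglyph-variants
  '(("|"        separator such-that divides restriction)
    (","        list-comma decimal-comma digit-grouping)
    ("\\"       backslash set-minus)
    ("."        decimal-dot dot-multiplication dot-wildcard)
    ("<wedge>"  logical-and wedge-product)))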
Despite the above efforts, many users don't care about syntactic correctness and disambiguating homoglyphs, or are not even aware of the possibilities offered by the editor. Unfortunately, it is hard to improve this situation: a very prominent balloon with relevant hints would quickly irritate authors. Yet, for users who do not notice spacing subtleties or discrete messages on the status bar, more agressive messages are necessary in order to draw their attention. One possible remedy would be to display more agressive help balloons only if the user's negligence leads to genuine errors. For instance, whenever the user types , the is clearly superfluous, thereby providing us a good occasion for pointing the user to the documentation. In the case of invisible operators, we might also display the corresponding visible operator in a somewhat dimmed color, while reducing its size, so as to leave the spacing invariant. Optionally, these hints might only be displayed if the cursor is inside the formula. The grammars of most traditional computer languages are LALR-1, which makes it possible to generate parsers using standard tools such as or . However, the design and customization of LALR-1 grammars is usually quite non trivial. Recently, packrat grammars were introduced as a remedy to these drawbacks. A packrat grammar consists of: <\itemize> A finite alphabet > of . A finite alphabet > of (disjoint from >). One >\\,\>> for each non-terminal symbol > in >. The set ,\>> stands for the set of . There are various possibilities for the precise definition of ,\>>. In what follows, each parsing expression is either of one of the following forms: <\itemize> A non-terminal or terminal symbol in \\>. A (possibly empty) concatenation *\*\> of parsing expressions ,\,\> in ,\>>. A (possibly empty) disjunction /\/\> of ,\,\\\,\>>. A repetition > or possibly empty repetition >> of \\,\>>. And-predicate >, with \\,\>>. Not-predicate >, with \\,\>>. The meanings of the and-predicate and not-predicate will be made clear below. The semantics of a packrat grammar and the implementation of a parser are very simple. Given an input string , an input position and a parsing expression >, a packrat parser will attempt to parse at according to > as far as possible. The result is either ``false'' or a new position . More precisely, the parser works as follows: <\itemize> If \\>, then we try to read > from the string at position . If \\>, then we parse >> at . If =\*\*\>, then we parse ,\,\> in sequence while updating with the new returned positions. If one of the > fails, then so does >. If =\/\/\>, then we try to parse ,\,\> at in sequence. As soon as one of the > succeeds at (and yields a position ), then so does > (and we return ). In case of repetitions, say =\>>, we keep parsing > at until failure, while updating with the new positions. If =&\>, then we try to parse > at . In case of success, then we return the original ( we do not consume any input). In case of failure, we return ``false''. If =!\>, then we try to parse > at . In case of success, we return ``false''. In case of failure, we return the original position . For efficiency reasons, parse results are systematically cached. Consequently, the the running time is always bounded by the length of string times the number of parsing expressions occurring in the grammar. The main difference between packrat grammars and LALR-1 grammars is due to the introduction of sequential disjunction. 
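Before developing this comparison, let us summarize the parsing process just described by a small memoizing recognizer. The sketch below is written in plain Scheme with an ad hoc list representation for parsing expressions; it is only meant as an illustration and does not correspond to the actual implementation, which operates on strings of 32-bit integers and caches all intermediate results.

<\scm-code>
;; Minimal packrat-style recognizer (illustrative only). A grammar is an
;; association list mapping non-terminal symbols to parsing expressions;
;; expressions are strings (terminals), symbols (non-terminals), or lists
;; of the forms (seq ...), (or ...), (* e), (& e), (! e). The recognizer
;; returns the new input position, or #f on failure.
(define (parse grammar s)
  (let ((cache (make-hash-table)))      ; memoizes (non-terminal . position)
    (define (literal t i)               ; try to read the terminal t at i
      (let ((j (+ i (string-length t))))
        (and (<= j (string-length s)) (string=? t (substring s i j)) j)))
    (define (run e i)
      (cond ((string? e) (literal e i))
            ((symbol? e)                ; non-terminal: consult the cache
             (let ((key (cons e i)))
               (cond ((hash-ref cache key #f) => cdr)
                     (else
                      (hash-set! cache key (cons 'done #f)) ; blocks left recursion
                      (let ((r (run (cdr (assq e grammar)) i)))
                        (hash-set! cache key (cons 'done r))
                        r)))))
            ((eq? (car e) 'seq)         ; concatenation
             (let loop ((l (cdr e)) (i i))
               (cond ((null? l) i)
                     ((run (car l) i) => (lambda (j) (loop (cdr l) j)))
                     (else #f))))
            ((eq? (car e) 'or)          ; sequential disjunction: first fit
             (let loop ((l (cdr e)))
               (and (pair? l) (or (run (car l) i) (loop (cdr l))))))
            ((eq? (car e) '*)           ; possibly empty repetition
             (let loop ((i i))
               (let ((j (run (cadr e) i)))
                 (if (and j (> j i)) (loop j) i))))
            ((eq? (car e) '&) (and (run (cadr e) i) i))   ; and-predicate
            ((eq? (car e) '!) (if (run (cadr e) i) #f i)) ; not-predicate
            (else #f)))
    (run (caar grammar) 0)))            ; parse from the first rule at position 0

;; Example:
;;   (parse '((Sum   . (seq Digit (* (seq "+" Digit))))
;;            (Digit . (or "0" "1" "2" "3" "4" "5" "6" "7" "8" "9")))
;;          "1+2+3")  =>  5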
In the case of LALR-1 grammars, disjunctions may lead to ambiguous grammars, in which case the generation of a parser aborts with an error message. Packrat grammars are never ambiguous, since all disjunctions are always parsed in a ``first fit'' manner. Of course, this also implies that disjunction is non commutative. For instance, for terminal symbols and , the parsing expression >> parses the string, unlike the expression >/x*y>. More precisely, parsing the string at position according to the expression >/x*y> succeeds, but returns as the end position. In practice, writing a packrat grammar is usually easier than writing an LALR-1 grammar. First of all, thanks to sequential disjunctions, we do not need to spend any effort in making the grammars non ambiguous \ (of course, the disjunctions have to be ordered with more care). Secondly, there is no limitation in the size of the ``look-ahead'' for packrat grammars. In particular, there is no need for a separate ``lexer''. Another important advantage is the ability to parse a string on the fly according to a given grammar. In contrast, LALR-1 grammars first apply formal language techniques in order to transform the grammar into an automaton. Although parsers based on such automata are usually an order of magnitude faster than packrat parsers, the generation of the automaton is a rather expensive precomputation. Consequently, packrat grammars are more suitable if we want to locally modify grammars inside documents. On the negative side, packrat grammars do not support left recursion in a direct way. For instance, the grammar <\eqnarray*> |>|>|||>|||>||>|>>>> leads to an infinite recursion when parsing the string . Fortunately, this is not a very serious drawback, because most left-recursive rules are ``directly'' left-recursive in the sense that the left-recursion is internal to a single rule. Directly left-recursive grammars can easily be rewritten into right-recursive grammars. For instance, the above grammar can be rewritten as <\eqnarray*> |>|*|)>>>>|>|>|>|>|>|>|||>||>|>>>> Furthermore, it is easy to avoid infinite loops as follows. Before parsing an expression > at the position , we first put ``false'' in the cache table for the pair ,i|)>>. Any recursive attempt to parse > at the same position will therefore fail. A negative side effect of the elimination of lexers is that whitespace is no longer ignored. The treatment of whitespace can therefore be more problematic for packrat parsers. For various reasons, it is probably best to manually parse whitespace. For instance, every infix operator such as > is replaced by a non-terminal symbol Spc*+*Spc>, which automatically ``eats'' the whitespace around the operator. An alternative solution, which provides less control, is to implement a special construct for removing all whitespace. A more elegant solution is to introduce a second packrat grammar with associated productions, whose only task is to eliminate whitespace, comments, etc. In fact, documents are really trees. As we will see in section below, we will serialize these trees into strings before applying a packrat parser. This serialization step can be used as an analogue for the lexer and an occasion to remove whitespace. > An efficient generic parser for packrat grammars, which operators on strings of 32-bit integers, has been implemented in the C++ part of . Moreover, we provide a interface, which makes it easy to define grammars and use the parser. 
For instance, the grammar for a simple pocket calculator would be specified as follows: <\scm-code> (define-language pocket-calculator \ \ (define Sum \ \ \ \ (Sum "+" Product) \ \ \ \ (Sum "-" Product) \ \ \ \ Product) \ \ \ \ (define Product \ \ \ \ (Product "*" Number) \ \ \ \ (Product "/" Number) \ \ \ \ Number) \ \ \ \ (define Number \ \ \ \ ((+ (- "0" "9")) (or ("." (+ (- "0" "9"))) "")))) For top-level definitions of non-terminal symbols, we notice the presence of an implicit for sequential disjunctions. The notation is aconvenient abbreviation for the grammar . also allows grammars to inherit from other grammars. This makes it for instance possible to put the counterpart of a lexer in a separate grammar: <\scm-code> (define-language pocket-calculator-lexer \ \ (define Space (* " ")) \ \ (define Plus (Space "+" Space)) \ \ (define Minus (Space "-" Space)) \ \ (define Times (Space "*" Space)) \ \ (define Over (Space "/" Space))) \; (define-language pocket-calculator \ \ (inherit pocket-calculator-lexer) \ \ (define Sum \ \ \ \ (Sum Plus Product) \ \ \ \ (Sum Minus Product) \ \ \ \ Product) \ \ ...) In a similar manner, common definitions can be shared between several grammars. In traditional parser generators, such as and , it is also possible to specify productions for grammar rules. For the moment, this is not yet implemented in , but a straightforward adaptation of this idea to our context would be to specify productions for translation into expressions. For instance, a specification of with productions might be <\scm-code> (define Sum \ \ (-\ (Sum Plus Product) ("(" 2 " " 1 " " 3 ")")) \ \ (-\ (Sum Minus Product) ("(" 2 " " 1 " " 3 ")")) \ \ Product) For the last case , we understand that the default production is identity. In the case of left-recursive grammars, we also have to adapt the productions, which can be achieved the use of suitable lambda expressions. Notice that the productions actually define a second packrat grammar. In principle, we might therefore translate into other languages than . , the rules could also be reverted, which provides a way to pretty print expressions (see also section). In this context, we observe one advantage of our manual control of whitespace: we might tweak the converter so as to pretty print as a single space. Reversion may also be used for syntax correction. For instance, the direct translation could accept the notation1>> and the reverse translation might insert an invisible wildcard. Finally, we might compose ``packrat converters'' so as to define new converters or separate certain tasks which are traditionally carried out by a lexer from the main parsing task. We intend to return to these possible extensions in a forthcoming paper. Apart from the not yet implemented productions, there are various other kinds of annotations which have been implemented in . First of all, we may provide short descriptions of grammars using the keyword: <\scm-code> (define-language pocket-calculator \ \ (:synopsis "grammar demo for a simple pocket calculator") \ \ ...) Secondly, one may specify a physical or logical ``color'' for syntax highlighting: <\scm-code> (define Number \ \ (:highlight constant_number) \ \ ((+ (- "0" "9")) (or "" ("." (+ (- "0" "9")))))) We may also associate types and typesetting properties to symbols: <\scm-code> (define And-symbol \ \ (:type infix) \ \ (:penalty 10) \ \ (:spacing default default) \ \ "\wedge\" "\curlywedge\") These types were for instance used in our heuristics for syntax correction (see section). 
Finally, we implemented the additional properties and , which will be detailed in section on grammar assisted editing. More kinds of annotations will probably be introduced in the future. We recall that documents are internally represented by trees. Therefore, we have two options for applying packrat parsing techniques to documents: <\itemize> Generalizing packrat grammars to ``tree grammars''. Flattening trees as strings before parsing them. For the moment, we have opted for the second solution. More precisely, a document snippet is either a string leaf or a compound tree which consists of a string label and an arbitrary number of children, which are again trees. String leafs are represented in ``enriched ASCII'', using the convention that special characters can be represented by alphanumeric strings between the special symbols > and>. For instance, alpha\> represents >, whereas > and > are represented by less\> and gtr\>. Although this is currently not exactly the case, one may think of enriched ASCII strings as unicode strings. A straightforward way to serialize a compound tree with a given and children,\,t>> is as follows. Assuming that ,\,t> are recursively serialized as , >, , we serialize as \\label\u1\\|\...\\|\un\/\>. For instance, the expression <\equation*> \+ would be serialized as <\equation*> alpha\\\\rsub\k\/\+\\\frac\x\\|\y\/\> An interesting possibility is to serialize special kinds of trees in alternative ways. For instance, whitespace may be ignored and a document with several paragraphs may be serialized by inserting newline characters instead of the special ``characters'' \\document\>,\|\> and /\>. In this respect, the serialization step is somewhat analogous to the lexing step for traditional parsers. For user defined macros, it is also convenient to make the serialization customizable, as will be discussed in section below. Recall that our generic packrat parser operates on 32bit integer strings. Internally, part of the 32bit integer range is reserved for characters in our enriched alphabet, including the additional ``characters'' \\label\>, \|\> and /\>. Another part is reserved for non terminal symbols. The remaining part is reserved for constructs of parsing expressions. In order to simplify the task of writing grammars for structured documents, we introduced a few additional patterns for building parsing expressions: <\itemize> The expressions label>, and > stand for the strings \\label\>, \|\> and/\>. Furthermore, > stands for any character of the form label>. , for any well formed tree. as a shorthand for . , for any well formed leaf. , for any enriched ASCII character, such as or alpha\>. For instance, the rule in our pocket calculator might be replaced by <\scm-code> (define Product \ \ (Product Times Number) \ \ (Product Over Number) \ \ Fraction) \; (define Fraction \ \ (:\frac Sum :/ Sum :\) \ \ Number) Having implemented a good mechanism for the construction of parsers, we are confronted to the next question: which grammar should we use for general purpose mathematical formulas? In fact, we first have to decide whether we should provide a universal well-designed grammar or make it easy for users to define their own customized grammars. One important argument in favour of the first option is that a standard well-designed grammar makes communication either, since the reader or coauthors do not have to learn different notational conventions for each document. 
In the same way as presentation attempts to capture the presentation aspects of any (non exotic) mathematical document, we hope that a well-designed grammar could capture the syntactical semantics of mathematical formulas. Another argument in favour of a universal grammar is the fact that the design of new grammars might not be so easy for non experts. Consequently, there is a high risk that customized grammars will be ill-designed and lead to errors which are completely unintelligible for other users. Stated differently: if we choose the second option, then customizations should be made really easy. For instance, we might only allow users to introduce new operators of standard types (prefix, postfix, infix) with a given priority. The main argument in favour of customizable grammars is that we might not be able to anticipate the kind of notation an author wishes to use, so we prefer to keep a door open for customizing the default settings. External systems, such as proof assistants or computer algebra systems, may also have built-in ways to define new notations, which we may wish to reflect inside . At any rate, before adding facilities for customization, it is an interesting challenge to design a grammar which recognizes as much of the traditional mathematical notation as possible. For the moment, we have concentrated ourselves on this task, while validating the proposed grammars on a wide variety of existing documents. Besides, as we will see in section, although we do not provide easy ways for the user to customize the grammar, we do allow users to customize the tree serialization procedure. This is both a very natural extension of the existing macro system and probably sufficient for most situations when users need customized notations. Having opted for a universal grammar, a second major design issue concerns potential ambiguities. Consider the expression b\c>. In analogy with , this expression is traditionally parsed as c|)>>. However, this may be non obvious for some readers. Should we punish authors who enter such expressions, or re-lax and leave it as a task to the reader to look up the notational conventions? In other words, since language of a matter of communication, we have to decide whether the burden of disambiguating formulas should rather incomb to the author or to the reader. For an authoring tool such as , it is natural to privilege the author in this respect. We also notice that the reader might have access to the document in electronic or form. In that case, we will discuss in section how the reading tool may incorporate facilities for providing details on the notational conventions to the reader. mathematical grammar> The current grammar for mathematics is subdivided into three subgrammars: <\itemize> A lowest level grammar which mainly contains all supported mathematical symbols, grouped by category. An intermediate grammar which mainly describes the mathematical operators occurring in the grammar. The actual grammar . > The purpose of the lowest level grammar is to group all supported mathematical symbols by category and describe the purpose and typesetting properties of each category. A typical non terminal symbol defined in this grammar is the following: <\scm-code> (define And-symbol \ \ (:type infix) \ \ (:penalty 10) \ \ (:spacing default default) \ \ "\wedge\" "\curlywedge\") The category contains two actual symbols wedge\> (>>) and curlywedge\> (>>). The annotation specifies the purpose of the symbol, that it will be used as an infix operator. 
The and annotations specify a penalty for the symbols during line-breaking and the usual spacing around the symbols. For some symbols and operators, subscripts and superscripts are placed below above: <\scm-code> (define N-ary-operator-symbol \ \ (:type n-ary) \ \ (:penalty panic) \ \ (:spacing none default) \ \ (:limits >) \ \ "inf" "lim" "liminf" "limsup" "max" "min" "sup") The symbol grammar also describes those tags which play a special role in the higher level grammars: <\scm-code> (define Reserved-symbol \ \ :\frac :\sqrt :\wide ...) > The purpose of is to describe all mathematical operators which will be used in the top-level grammar . Roughly speaking, the operators correspond to the symbols defined in , with this difference that they may be ``decorated'', typically by primes or scripts. For instance, the counterpart of is given by <\scm-code> (define And-infix \ \ (:operator) \ \ (And-infix Post) \ \ (Pre And-infix) \ \ And-symbol) where and are given by <\scm-code> (define Pre \ \ (:operator) \ \ (:\lsub Script :\) \ \ (:\lsup Script :\) \ \ (:\lprime (* Prime-symbol) :\)) \; (define Post \ \ (:operator) \ \ (:\rsub Script :\) \ \ (:\rsup Script :\) \ \ (:\rprime (* Prime-symbol) :\)) The annotation states that should be considered as an operator during selections and other structured editing operations (see section). > Roughly speaking, the main grammar for mathematics defines the following kinds of non terminal symbols: <\itemize> The main symbol for mathematical expressions. A relaxed variant of , for which operators > and formulas such as 0>> are valid expressions. Relaxed expressions typically occur inside scripts: -a>, \0>>, etc. The symbol for the ``public interface'', which is a relaxed expression with possible trailing ponctuation symbols and whitespace. Symbols corresponding to each of the operator types, ordered by priority. Some special mathematical notations, such as quantifier notation. Prefix-expressions postfix-expressions and radicals. For instance, the ``arithmetic part'' of the grammar is given by <\scm-code> (define Sum \ \ (Sum Plus-infix Product) \ \ (Sum Minus-infix Product) \ \ Sum-prefix) \; (define Sum-prefix \ \ (Plus-prefix Sum-prefix) \ \ (Minus-prefix Sum-prefix) \ \ Product) \; (define Product \ \ (Product Times-infix Power) \ \ (Product Over-infix Power) \ \ Power) The grammatical specification of relations >, >>, >>, >>, is similar to the specification of infix operators, but relations carry the following special semantics: <\equation*> a\a\\a\a\a\a\\\a\a. Quantified expressions are defined by <\scm-code> (define Quantified \ \ (Quantifier-prefixes Ponctuation-infix Quantified)\ \ \ (Quantifier-fenced Quantified) \ \ (Quantifier-fenced Space-infix Quantified) \ \ Implication) where <\scm-code> (define Quantifier-prefix \ \ (Quantifier-prefix-symbol Relation)) \; (define Quantifier-prefixes \ \ (+ Quantifier-prefix)) \; (define Quantifier-fenced \ \ (Open Quantifier-prefixes Close) \ \ (:\around :any :/ Quantifier-prefixes :/ :any :\)) Hence, the most common forms of quantifier notation are all supported: <\eqnarray*> ||x\A,\y\B,x\y>>|||x\A\y\B:x\y>>|||x\A|)>y\B|)>\=\.>>>> It would also be natural to replace the quantifiers > and > by big operators, as in the formula <\equation*> >|\\0>>|\\0> \\\-f|\|>\\. Unfortunately these big symbols are not supported by . 
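For concreteness, a chained relation level might be written in the same style as the arithmetic fragment above. The following rule is only an illustrative sketch: the names and the exact shape of the production need not coincide with the actual grammar, which additionally has to deal with decorated relation symbols and scripts.

<\scm-code>
;; Illustrative sketch only: left-recursive chaining of relations,
;; so that a < b <= c first parses as ((a < b) <= c) and is then
;; re-interpreted with the special semantics (a < b) and (b <= c)
(define Relation
  (Relation Relation-infix Sum)
  Sum)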
In the treatment of postfixed expressions: <\scm-code> (define Postfixed \ \ (Postfixed Postfix-postfix) \ \ (Postfixed Post) \ \ (Postfixed Fenced) \ \ (Postfixed Restrict) \ \ Radical) we consider function application as a postfix: <\scm-code> (define Fenced \ \ (Open Close) \ \ (Open Expressions Close) \ \ (:\around :any :/ (* Post) (* Pre) :/ :any :\) \ \ (:\around :any :/ (* Post) Expressions (* Pre) :/ :any :\)) Similarly, explicit function application using the invisible apply symbol is handled inside the rule for prefixed expressions. Radical expressions are either identifiers, numbers, special symbols, fenced expressions, or special markup: <\scm-code> (define Radical \ \ Fenced \ \ Identifier \ \ Number \ \ (:\sqrt Expression :\) \ \ (:\frac Expression :/ Expression :\) \ \ (:\wide Expression :/ :args :\) \ \ ((except :\ Reserved-symbol) :args :\) \ \ ...) macros> Although we have argued that the design or modification of packrat grammars may be difficult for the average user, it still remains desirable to provide a way to introduce new kinds of notation. One traditional approach is to allow for the introduction of new mathematical operators of a limited number of types (prefix, postfix, infix, bracket, etc.), but with arbitrary binding forces. Such operator grammars are still easy to unterstand and modify for most users. Instead of using numerical binding forces, a nice variant of this idea is to allow for arbitrary partial orders specified by rules such as Sum>. <\remark> As suggested in a discussion with , one might even allow for ambiguous grammars, in which case the user would be invited to manually disambiguate expressions, whenever appropriate. This can be achieved by introducing a non sequential ``or'' primitive into the packrat grammar building constructs. Alternatively, one could make the grammar more strict ( disallow for expressions such as b\c>) and provide hints on how to release the grammar. In practice, these complications might not be worthit, since the current focus (see section) already gives a lot of visual feedback to the author on how formulas are interpreted. Inside , a more straightforward alternative approach is to capitalize on the existing macro system. On the one hand (and before the recent introduction of semantic editing facilities), this system was already used for the introduction of new notations and abbreviations. On the other hand, the system can be extended with a user friendly ``behaves as'' construct, after which it should be powerful enough for the definition of most practically useful notations. We recall that macros behave in a quite similar way as macros, except that clean names can be used for the arguments. For instance, the macro <\tm-fragment> |3>>>> may be used for defining a macro for producing cubic roots >. Our serialization procedure from section has been designed in such a way that all user defined macros are simply expanded by default. In particular, from the semantic point of view, there is no noticeable difference between the newly introduced cubic roots and the built-in -th roots with , except that the in the cubic root is not editable. The next idea is to allow users to override the default serialization of a macro. For instance, assume that we defined a macro <\tm-fragment> *\>>> By default, the line *\*|f*\ z>> is interpreted as *\*\*|f*\ z>>. By specifying aserialization macro for <\tm-fragment> *\)>>> and after entering *\> using our macro, the formula *\*|f*\ z>> will be interpreted in the intended way *\|)>*|f*\ z>>. 
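To make the mechanism a little more concrete, the following sketch shows how the default serialization might take such overrides into account. This is not the actual TeXmacs code: tree-label, tree-children, has-serialization-override? and apply-serialization-override are hypothetical helpers, and the ASCII strings "<\f>", "<|>" and "</>" merely approximate the special label, separator and closing characters of the real serialization.

<\scm-code>
;; A minimal sketch of tree serialization with per-macro overrides:
;; trees labeled by a user macro for which a serialization macro was
;; declared are first rewritten by that macro; all other user macros
;; are simply expanded before serialization.
(define (serialize t)
  (cond ((string? t) t)                          ; enriched ASCII leaf
        ((has-serialization-override? (tree-label t))
         (serialize (apply-serialization-override t)))
        (else
         (string-append "<\\" (symbol->string (tree-label t)) ">"
                        (string-join (map serialize (tree-children t)) "<|>")
                        "</>"))))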
This very simple idea is both very user friendly (we basically specify how the macro behaves from a semantic point of view) and quite powerful. For instance, if we define a> operator with the binding force of the logical or >>, we simply invent a suitable name, such as ``logical-plus'', use a > for the rendering and a >> for the serialization. Of course, this scheme is limited to reduce to those operators which are already present in our predefined universal grammar. Nevertheless, after some years of maturing, we expect that the universal grammar will be sufficiently rich for covering most notational patterns used in practice. As a side note, it is interesting to observe once more that the serialization procedure plays the role of a ``structured'' lexer, except that we are rather destroying or transforming \ tree structure in a customizable way than detecting lexical structure inside strings. In section, we have described the performance of the syntax corrector on several large sample documents. In this section, we will analyze in more detail the causes of the remaining ``errors''. We will also investigate up to which extent ``correct'' formulas are interpreted in the way the author probably had in mind. Throughout the section, one should bear in mind that our test documents were not written using the newly incorporated semantic editing features. On the one hand, our analysis will therefore put the finger on some of the more involved difficulties if we want to make a visually authored document semantically correct. On the other hand, we will get indications on what can be further improved in the syntax corrector and our universal grammar. Of course, our study is limited in the sense that our document base is relatively small. Nevertheless, we think that the most general trends can already be detected. We have analyzed in detail the remaining errors in BPR, BPR, CH and HAB, which can be classified according to several criteria, such as severeness, eligibility for automatic correction, and the extent to which the author was to blame for the error. The results of our analysis have been summarized in table, where we have distinguished the following few main categories for sorting out the errors: Modulo additional efforts, our syntax corrector might be further improved, so as to take into account some additional common notational patterns. Some of these patterns are the following: <\description> Sometimes, large whitespaces are used in a semantically meaningful way. For instance, in BPR and BPR, the notation <\equation> \xP is frequently used as a substitute for x,P>. Although potentially correctable, it should be noticed that the (currently supported) notation x|)>P> is also usedinBPR. Hence, we detected ``at least'' an inconsistency in the notational system ofBPR. We have already mentioned the fact that the universal grammar should be sufficiently friendly so as to recognize the formula >, where the operator occurs in a script. More generally, in the formula \,\|}>> fromBPR, the operators > and > are sometimes used ``as such''. Although it does not seem a good idea to extend the grammar indefinitely so as to incorporate such notations, we might extend the corrector so as to detect these cases and automatically insert ``invisible operator'' annotations. In tables or equation arrays, large formulas are sometimes manually split across cells. 
One example from BPR is given by <\eqnarray*> |)>>||0|)>|}>,>>|\|)>>|||)>,\|)>\(Q,\posgcd(>>|||LTRems}.>>>> In this example, the curly opening bracket in the right-hand cell of the second line is only closed on the third line. Another (truncated) example from BPR is 0\s\0|)>>>|0\s\0\\\0|)>>>|0\s\0\\\0|)>>>|>>>>>> The problem here is that formulas of the form x> are incorrect. The first example is more severe, and obliges the checker to try concatenations of various cells. The second example could also be solved by relaxing the grammar for formulas insidecells. This was already done for formulas of the form . Since the process of syntax correction is very heuristic by nature, it sometimes fails. Failure typically occurs when one formula contains at least two difficulties, so that none of the corresponding heuristic criteria holds. The following kinds of failures occurred in our test documents: <\description> This problem is by far the most frequent one. Especially in CH the job is made hard for the corrector by the author's habit to replace multiplications by function applications. As a consequence, many symbols which would usually be detected as being variables might also be functions; this confusion pushes the corrector not to touch any of such formulas. Indeed, we consider it better to leave an incorrect formula than to make a wrong ``correction''. Although the juxtaposition of two formulas is simplified into , such simplifications can be confused by the presence of whitespace, as in . One example from BPR is the formula <\equation*> =\\*a|}>, in which the absolute value > was not recognized as such. We already mentioned another similar example from COL before: ,|\|>|)>>. There are various situations in which the user resorts to informal, purely visual notations with no obvious semantics. Sometimes, the user is just lazy or sloppy, but sometimes it is hard to ``do things right''. Here are some typical examples from our test documents: <\description> Sometimes, authors may use apparently standard notation in a non standard way, by making non conventional assumptions on operator precedence. Consider for instance the following fragment from BPR: <\quote-env> Given a formula =X|)>>,where> is> In this case, the binding force of > is higher than the binding force of >, which makes the formula incorrect. Such formulas are on the limit of being typos: it would be better to insert (possibly invisible) brackets. This kind of error is rare though, since non standard operator precedences usually do not prevent formulas from being parsed, but rather result in non intended interpretations of the formulas. Consider the formula <\equation*> |j,k\\>>|\|)>>>>>>>>>+Z*\>|)>> from BPR. The authors intended to typeset the subscript of the big product using an extra small font size, and inserted a double subscript to achieve this visual effect. In , they should have used the menu entry instead. It often occurs that one needs to insert some accompanying text inside formulas, as in the formula <\equation*> a\a\|\\,\NforN>>,-a|\|>\\|\> occurring in COL. The problem here is that the text ``s.t. for'' was entered in math mode, so it will not be ignored by the parser. Other similar examples from BPR are <\quote-env> There are indeed 4 real roots: 1,2,>\2>. It is obvious since the ->th entry of |)>> is >>. 
In each of these examples, we notice that the obtained visual result is actually incorrect: the errors are easily committed by users who do not pay attention to the font slant, the shapes of characters and the spacing between words. For instance, the corrected versions of the last examples read <\quote-env> There are indeed 4 real roots: 1,2>, and 2>. It is obvious since the >-th entry of |)>> is >>. Notice also that correctly entered text inside mathematical formulas is ignored by the parser. Consequently, the use of meaningful text inside formulas may raise the same problems as the before-mentioned use of meaningful whitespace. In BPR, there is a general tendency of typesetting implications between substatements in math-mode, as in the example ``)>)''. Clearly, this is an abuse of notation, although it is true that convenient ways to achieve the same result in text mode do not necessarily exist. In some less frequent cases, it is convenient to use certain kinds of suggestive notations with semantics. Three examples from BPR, COL and HAB respectively are <\enumerate> The signs of the polynomials in the Sturm sequence are No other code word is of the form \z>\\> +\+\+\\+log log log x>+log log x>+log x>> Clearly, there is no hope to conceive a ``universal grammar'' which automatically recognizes this kind of notations. The only reasonable way to incorporate such constructs is to provide a few basic annotation tags which allow the user to manually specify the semantics. All analyzed texts contained at least one syntax error for which the author has no excuse. Roughly speaking, half of them were missing brackets. Another typical typo in BPR which was detected by our grammar is <\equation*> R=\\>*P>*|)>=+*Y+\\*Y. Indeed, there is a > missing at the end of the formula. The document COL contained one other strange error, which is very specific to : the personal style package of the author contained a semantically incorrect macro, which induced an error each time this macro was used. <\big-table||||||>|>|>|>|||||>|||||>|||||>|||||>|||||>|||||>>>>> Sources of the remaining errors in our sample documents, sorted out by category. The above analysis of the remaining errors takes advantage of the possibility to highlight all syntax errors in a document. Unfortunately, there is no automatic way to decide whether a correct formula is interpreted in the way intended by the author. Therefore, it requires far more work to give a complete qualitative and quantitative analysis of the ``parse results''. Nevertheless, it is possible to manually check the interpretations of the formulas, by careful examination of the semantic focus (see below for a definition and more explanations on this concept) at the accessible cursor positions in a formula. We have carefully examined the documents BPR, COL, HAB and HAB in this way. The number of misinterpretations in BPR and HAB are of the same order of magnitude as the number of remaining errors. The number of misinterpretations in COL is very high, due to the fact that most multiplications were replaced by function applications. This paper also makes quite extensive use of semantically confusing text inside formulas. When ignoring these two major sources of misinterpretation, the universal grammar correctly parses most of the formulas. The results of our analysis have been summarized in table. Let us now describe in more details the various sources of misinterpretation: <\description> Again, the multiply/apply ambiguity constitutes amajor source of confusion. 
Several kinds of misinterpretation are frequent: <\enumerate> Replacing a multiplication by a function application or rarely modifies the parsability of a formula, but it completely alters its sense. For instance might either be interpreted as b\c> or |)>>. Fortunately, this kind of misinterpretations only occurs if the author explicitly entered the wrong operators. A more subtle problem occurs when the author forgets to enter certain multiplications (or if the syntax corrector failed to insert them). One example is the subformula |)>>, occurring in BPR. In this formula, is interpreted as a function applied to >. An even more subtle variant of the above misinterpretation occurs at the end of HAB in the formula > *s>. In order to avoid trailing spaces, the>> was explicitly marked to be an operator and an explicit space was inserted manually after it. Consequently, the formula is interpreted as >|)>\s> instead of the intended > \s|)>>. In recent versions of the precaution to mark >> as an operator has become superfluous. The informal, but widely accepted notation X> is used quite often in order to signify X\b\X>. However, the ``natural'' interpretation of this formula isX|)>>>. This is an interesting case where it might be useful to insert an additional rule into the grammar (the general form being ,\,a\b>>, for a relation infix >). This rule should only apply for top-level formulas and not for subformulas ( X|)>>). Furthermore, it should be tested before the default interpretation ,\,\b|)>>>>. Here we point out that this is easy to do for packrat grammars, but not for more conventional LALR-1 grammars, where such a test would introduce various kinds of ambiguities. In BPR, one may also find variant of the above notation: ``we define the signed remainder sequence of and , <\equation*> SRemS=SRemS,SRemS,\,SRemS. In this case, it would have been appropriate to insert invisible brackets around the sequence. Yet another blend of list notation occurs frequently in: <\quote-env> The subsets of affine spaces > for > that are > Here one might have used the more verbose notation |}>> instead. However, it seems preferable to investigate whether this example justifies another exceptional rule in the universal grammar. Let us finally consider <\equation*> H\|Y,X|)>, which occurs as a subformula in COL. Currently, the separators and , have the same precedence in our universal grammar. In view of the above example, we might want to give , a higher binding force. Notice that ``informal list notation'' is a special instance of a potentially more general situation in which certain precedence rules need to be overridden under ``certain circumstances''. If informal text is not being marked as such inside formulas, then it gives rise to a wide variety of errors and potential misinterpretations. Consider for instance the formula <\equation*> \|x|)>=1 if =0> from COL. Since the word ``if'' was entered in math mode, the formula is parsed as <\equation*> \|x|)>=1|)>|)>=0>. , it sometimes happens that amathematical formula (or part of it) was entered in text mode. Indeed, consider the following fragment from BPR: <\quote-env> In the above, each >(P,Q) is the negative of the remainder in the euclidean division of >(P,Q) by > for i\k+1>, > At two occasions, the subformula > was entered in text mode. This might have been noticed by the authors through the use of an upright font shape. At several places in HAB we have used the following notation: <\equation*> P=0\|)>. 
Contrary to (), this formula can be parsed, but its meaning <\equation*> P=\|)>|]> is not what the author intended. This example provides yet more rationale for the design of a good criterion for when a large whitespace really stands for a comma. One final misinterpreted formula from the document COL is <\equation*> max-1-p*log p-q*log q-r*log r-*log. Indeed, the subformula > is interpreted as a summand instead of an n-ary operator. Despite the existence of the above misinterpretations, we have to conclude that our universal grammar is surprisingly good at interpreting formulas in the intended way. On the one hand (and contrary to the remaining errors from the previous subsection), most misinterpretations tend to fall into a small number of categories, which are relatively well understood. A large number of misinterpretations can probably be eliminated by further tweaking of the universal grammar and the syntax corrector. Most of the remaining misinterpretations can easily be eliminated by the author, if two basic rules are followed: <\enumerate> Correct manual resolution of the multiply/apply ambiguity. Correct annotation of all exceptional non mathematical markup inside formulas, such as text and meaningful whitespace. However, TeXmacs can probably do a better job at inciting authors to follow these rules. We will come back to this point in section. Of course, our conclusions are drawn on the basis of an ever shrinking number of documents. Rapid investigation of a larger number of documents shows that certain areas, such as >-calculus or quantum physics, use various specialized notations. Nevertheless, instead of designing a completely new grammar, it seems sufficient to add a few rules to the universal grammar, which can be further customized using macros in style files dedicated to such areas. For instance, certain notational extensions involving > and > signs may both be useful in real algebraic geometry and quantum physics. Such notations might be activated via certain macros and keybindings defined in dedicated style packages. <\big-table> Manual determination of common sources of misinterpretation. Traditionally, a formal introduction of mathematical formulas starts with the selection of a finite or countable alphabet >. The alphabet is partitioned into a subset of logical symbols ,\,\,\,\>, a subset of relation symbols ,>,\>, a subset of function symbols ,,>,\> and a subset of constant symbols. Each non-constant symbol \\> also carries a set>\\> of possible arities. An abstract syntax tree is a finite >-labeled tree, such that the leaves are labeled by constant symbols and such that the number of children of each inner >-labeled node is in >>. From now on, a syntactically correct formula is understood to be a formula which corresponds to an abstract syntax tree. For instance, the abstract syntax tree <\equation*> ||> corresponds to the formula y=x>. For concrete document formats, our definition of syntactic correctness assumes a way to make this correspondence more precise. The simplest content oriented formats are those for which the correspondence is straightforward. For instance, in Scheme, a constant symbol would be represented by itself and a compound tree <\equation*> |T|\|T> by > T1 ... Tn)>, where , >, are the representations of ,\,T>. Content MathML is another format in which the representation of abstract syntax trees is rather straightforward. Of course, content oriented formats usually do not specify how to render formulas in a human readable form.
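As a toy illustration of such a representation (with operator names chosen purely for the example), a formula like x + y = z would simply read

<\scm-code>
;; abstract syntax tree of  x + y = z  in S-expression form;
;; the operator names "=" and "+" are illustrative
(= (+ x y) z)

Nothing in such a representation indicates how the formula should be laid out on screen or on paper.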
This makes them less convenient for the purpose of wysiwyg editing. Instead of selecting a format for which the correspondence between abstract syntax trees and syntactically correct formulas is straightforward, we therefore propose to focus on convenient ways to specify non trivial correspondances, using the theory developed in sections and. In principle, as soon as a reliable correspondance between content and presentation markup is established, the precise internal document format is not that important: at any stage, a reliable correspondance would mean that we can switch at will between the abstract syntax tree and a more visual representation. Unfortunately, a perfectly bijective correspondence is hard to achieve. Technically speaking, this would require an invertible grammar with productions, which cannot only parse and transform presentation markup into content markup, but also pretty print content markup as presentation markup. Obviously, this is impossible as soon as certain formulas, such as and >, are parsed in the same way. Apart from this trivial obstacle, presentation markup usually also allows for various annotations with undefined mathematical semantics. For instance: what would be the abstract syntax tree for a blinking formula, or a formula in which the positions of some subformulas were adjusted so as to ``look nice''? These examples show that presentation markup is usually richer than content markup. Consequently, we have opted for presentation markup as our native format and concentrated ourselves on the one-way correspondence from presentation markup to content markup, using a suitable packrat grammar. In fact, for the semantic editing purposes in the remainder of this section, the grammar does not even need to admit production for the generation of the actual abstract syntax tree. Of course, productions should be implemented if we want actual conversions into or . As we will detail below, it should also be noticed that we do not only need acorrespondance between presentation and content markup, but also between several editor-related derivatives, such as cursor positions and selections. We finally have to decide what kind of action to undertake in cases of error. For instance, do we accept documents to be temporarily incorrect, or do we require parsability at all stages? <\remark> It remains an interesting challenge to design grammars which are ``as invertible'' as possible. For instance, which grammar-based transformations :L\L> admit a ``Galois''-inverse :L\L> with the property that =\\\\\> and =\\\\\>? Also, how to define a mathematical semantics for annotated formulas? In particular, how would they behave during computations? For instance, we might have a data type ``annotated number'' which joins the annotation for any operation. When annotating a subset of the input data, this semantics might be used to visualize the part of the output whose computation depends on this annotated subset. In order to benefit from semantic editing facilities, we first have to ensure that the formulas in our document are correct. For similar reasons as in section, we have two main options: <\enumerate> Enforce correctness at every stage. Incite the user to type correct formulas by providing suitable visual hints. It depends very much on personal taste of the user which option should be preferred. For the moment, we have opted for mere incitation in our implementation, but we will now address the issues which arise for both points of view. 
If we decide to enforce correctness at every stage, then we are immediately confronted with the following problem: what happens if we type in the example <\equation> x, while taking into account that the formula is incorrect? Obviously, we will have to introduce some kind of markup which will replace it by a correct formula, such as <\equation*> x+>>>>. The automatically inserted dummy box >>>>>> is parsed as an ordinary variable, but is special in the sense that it will be removed as soon as we type something new at its position. We will call such boxes slots in what follows. Several existing mathematical editors use slots in order to keep formulas correct and indicate missing subformulas. Of course, we need an automated way to decide when to insert and remove slots. This can be done by writing wrappers around the event handlers. On any document modification event, the following actions should be undertaken: <\enumerate> Pass the event to the usual, non-wrapped handler. For each modified formula, apply a post-correction routine, which inserts or removes slots when appropriate. If the post-correction routine does not succeed in making the formula correct, then undo the operation. These wrappers have to be designed with care: for instance, the usual undo handling has to be done outside the wrapper, in order to keep the tab-variant system working (see section). The post-correction routine should start by trying to make local changes at the old and the new cursor positions. For instance, if a slot is still present around the old cursor position, then attempt to remove it; after that, if the formula is not correct, attempt to make the formula correct by inserting a slot at the new cursor position. Only when the local corrections fail to make the entire formula correct do we launch a more global and expensive correction routine (which corrects outwards, starting at the cursor position). Such more expensive correction routines might for instance insert slots in the numerator and denominator of a newly created fraction. <\remark> For editors which are designed from scratch, an alternative point of view is to design all editing operations to correctly handle slots in the first place. For instance, when creating a fraction, one might automatically insert the slots for the numerator and the denominator. However, this strategy puts additional burden on the programmer for the implementation of all fundamental editing operations. In order to guarantee correctness, our decoupled approach therefore seems more robust and we recommend the implementation of an independent syntax checker/corrector at least as a complement to any other approach. Assume now that we no longer insist on correctness at all stages. In that case, an interesting question is how to provide maximal correctness without making modifications such as the insertion and removal of slots. Following this requirement, the idea in example() would be to consider the formula <\equation*> x+ as being correct, as long as the cursor is positioned after the >. In other words, if the current formula is incorrect, then we check whether the current formula with a dummy variable substituted for the cursor is correct. If this check again fails, then we notify the user that he made a mistake, for instance by highlighting the formula in red. The second approach usually requires a grammar which is more tolerant towards empty subformulas in special constructs such as fractions, matrices, etc. For instance, it is reasonable to consider the empty fraction > as being correct.
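In terms of the grammar fragments shown earlier, such tolerance might be expressed by admitting empty subformulas in the special two-dimensional constructs. The following rule is only meant as an illustration and does not reproduce the actual grammar:

<\scm-code>
;; Illustration only: a relaxed fraction rule which also accepts an
;; empty numerator or denominator
(define Relaxed-operand
  Expression
  "")                         ; the empty subformula

(define Relaxed-fraction
  (:\frac Relaxed-operand :/ Relaxed-operand :\))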
Again, it really comes down to personal taste whether one is willing to accept this: the fraction >>>>|>>>>>> definitely makes it clear that some information is missing, but the document also gets clobbered, at least momentaneously by dummy boxes, which also give rise to ``flickering'' when editing formulas. Moreover, in the case of a matrix <\equation*> 1>>||>||>|>|||n>>>>>>, it is customary to leave certain entries blank. Semantically speaking, such blank entries correspond to invisible zeros or ellipses. Another thing should be noticed at this point. Recall that we defined correctness in terms of our ability to maintain a correspondance between our presentation markup and the intended abstract syntax tree. Another view on the above approach is to consider it as a way to build some degree of auto-correction right into the correspondance. For instance, the formula > can be considered to be ``correct'', because we still have a way to it correct ( by inserting a slot at the cursor position). The correction might even involve the history of the formula (so that the formula remains correct when moving the cursor). Of course, the same visual behaviour could be achieved by inserting invisible slots, or, if we insist on not modifying the document, by annotating the formula by its current correction. In order to describe the first grammar assisted editing feature inside , we first have to explain the concept of . Recall that a document is represented by a tree. Cursor positions are represented by alist ,\,i|]>> of integers: the integers ,\,i>> encode a path to a subtree in the document. If this subtree is a string leaf, then > encodes a position inside the string. Otherwise > is either or , depending whether the cursor is before or behind the subtree. For instance, in the formula <\equation*> +1 which is represented by the tree <\equation*> ||sin x|cos x>|+1>, the list > encodes the cursor position right after > and > the cursor position right after the fraction and before the >. Notice that we prefer the leftmost position>> over the equivalent position > in the second example. At a certain cursor position, we may consider all subtrees of the document which contain the cursor. Each of these subtrees are visualized by a light cyan bounding box inside . For instance, at the cursor position >, the above formula would typically be displayed as <\equation*> > x|cos x>>+1> The innermost subtree (in this case the fraction) is called the . In absence of selections, several so called operate on the current focus. It is natural to generalize the above behaviour to semantic subformulas according to the current grammar. For instance, from a semantic point of view, the above formula corresponds to another tree: <\equation> |>|1> For the cursor position just after >, it would thus be natural to consider as the and also surround this subexpression by a cyan box. Let us describe how this can be done. The tree serializer for our packrat parser, which has been described in section, also maintains a correspondance between the cursor positions. For instance, the cursor position > is automatically transformed into the cursor position inside <\code> \\\frac\sin x\\|\cos x\/f\+1 Now a successful parse of a formula naturally induces a parse tree, which is not necessarily the same as the intended abstract syntax tree, but usually very close. More precisely, consider the set > of all successful parses of substrings (represented by pairs of start and end positions) involved in the complete parse. 
Ordering this set by inclusion defines a tree, which is called the parse tree. In our example, the parse tree coincides with the abstract syntax tree. Each of the substrings in > which contains the cursor position can be converted back into a selection (determined by a start and an end cursor position) inside the original tree. In order to find the current focus, we first determine the innermost substring in > (represented by the pair >) which contains the cursor. This substring is next converted back into the selection ,|)>>. In general, the parse tree does not necessarily coincide with the abstract syntax tree. In this case, some additional tweaking may be necessary. One source of potential discrepancies is the high granularity of packrat parsers, which are also used for the lexical analysis. For instance, floating point numbers might be defined using the following grammar: <\scm-code> (define Integer (+ (- "0" "9"))) (define Number \ \ Integer \ \ (Integer "." Integer) \ \ (Integer "." Integer "e" Integer) \ \ (Integer "." Integer "e-" Integer)) However, in the example <\equation*> 12345.67890 the substring would be regarded as the current focus. We thus have to add an annotation to the packrat grammar and (at least for the purpose of determining the current focus) either specify that expressions should not be considered as valid subexpressions, or specify that expressions should be considered as atomic ( without any subexpressions). Other expressions for which parse trees and abstract syntax trees may differ are formulas which involve an associative operator. For instance, we might want to associate the abstract syntax tree <\equation*> to the expression <\equation*> a+b+c+d. This would make the parse-subexpression ``ineligible'' for use as the current focus. In our current implementation, this behaviour is the default. From the graphical point of view, it should noticed that many mathematical formulas do not require any two-dimensional layout: <\equation*> f|)>=3. In such examples, the usual convention of surrounding subexpressions containing the current cursor may lead to an excessive number of boxes: <\equation*> *b>|)>>>|)>>=3>.> For this reason, we rather decided to surround only the current semantic focus, in addition to the subexpression which correspond to native markup. In fact, maybe the usual convention was not a good idea after all: we should at least implement a user preference to only surround the current focus. The semantic focus is a key ingredient for semantic mathematical editors, since it provides instant visual feedback on the meaning of formulas while they are being entered. For instance, when putting the cursor at the right hand side of an operator, the semantic focus contains all arguments of the operator. In particular, at the current cursor position, operator precedence is made transparent to the user the cyan bounding boxes. Onthe one hand this makes it possible to check the semantics of a formula by traversing all accessible cursor positions (as we did in section). On the other hand, in cases of doubt on the precedence of operators, the semantic focus informs us about the default interpretation. Although the default interpretation of a formula such as *\> might be wrong, the editor's interpretation is at least made transparent, and it could be overridden through the insertion of invisible brackets. In other words, we advocate an approach to the resolution of common mathematical precedence ambiguities, rather than the approach mentioned in remark. 
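Before turning to selections, the focus computation just described can be summarized by a small sketch. The set of successful parse intervals and the eligibility test are represented here by the argument intervals and the hypothetical helper eligible?; the actual implementation differs in its details.

<\scm-code>
;; Sketch only: given the intervals (start . end) of all successful
;; parses on the serialized string and the serialized cursor position,
;; return the innermost eligible interval containing the cursor; the
;; result is then converted back into a selection in the document.
(define (semantic-focus intervals cursor)
  (let loop ((l intervals) (best #f))
    (cond ((null? l) best)
          ((and (<= (caar l) cursor) (<= cursor (cdar l))
                (eligible? (car l))
                (or (not best)
                    (< (- (cdar l) (caar l))
                       (- (cdr best) (car best)))))
           (loop (cdr l) (car l)))
          (else (loop (cdr l) best)))))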
Semantic selections can be implemented along similar lines as the semantic focus, but it is worth looking a bit more closely at the details. We recall that a selection is represented by a pair consisting of a cursor start position and a cursor end position. The natural way to implement semantic selections is as follows. We first convert the selection into a substring (encoded by a start and an end position) of the string serialization of the tree. We next extend the substring to the smallest ``eligible'' substring in >, for a suitable customizable definition of ``eligibility''. We finally convert the larger substring back into a selection inside the original document, which is returned as the semantic selection. In the above algorithm, the definition of an eligible substring for selection may differ from the one which was used for the current focus. For instance, for an associative operator >, it is natural to consider both and as eligible substrings of . Actually, we might even want to select as a substring of . In the presence of subtractions, things get even more involved, since we need a somewhat adapted notion of abstract syntax tree. For instance, <\equation*> a-b+c+d would typically be encoded as <\equation*> |c|d> after which b+c> becomes an eligible substring for selection, but not . Furthermore, it is sometimes useful to select the subexpression c+d> of . This occurs for instance if we want to replace by . However, when adopting a naive point of view on semantic selections, which might even integrate the above remarks, c+d> remains ineligible as a substring of . Nevertheless, such substrings admit operator semantics. More precisely, we may consider the selection c+d> as an operator <\equation*> \+c+d:x\x+c+d, which will be applied to a suitable subexpression at the position where we perform a ``paste''. A less trivial example would be the ``selection operator'' <\equation*> >:x\. Although it is harder to see how we would select such an operator, this generalization might be useful for rapidly typing the continued fraction <\equation*> >>>> Notice that this kind of semantics naturally complements the idea of using the paste operation for substituting expressions (usually variables) by other expressions. We heard this idea in a conference talk by , but we do not know to whom it is originally due. It should be noticed that there are situations in which non semantic selections are useful. This typically occurs when the default semantics of the formula is not the intended one. For instance, the default interpretation of the formula b\c> is b|)>\c>. Imagine a context in which the desired meaning is rather c|)>>. In that case, the user might want to select the formula c> and put (possibly invisible) brackets around it. An easy way to achieve this might be to hold a modifier key while making the selection. Alternatively, we might introduce shortcuts for modifying the binding forces of operators. <\remark> In order to complete the discussion on the current focus and selections, we should mention the fact that, in the presence of a selection, the current focus is rather defined as being the smallest subexpression which contains the current selection (instead of the cursor). In particular, structured editing operations which are defined for the cursor focus (e.g., inside a fraction, swapping the numerator and denominator, or moving to the next fraction) naturally extend to (semantic) selections.
Moreover, repeated mouse clicks or repeatedly applying the keyboard shortcut allows for quick (starting with the current focus, select the next larger subexpression at each subsequent iteration). Mathematical grammars can also be useful at the typesetting phase in several ways. First of all, the spacing around operators may depend on the way the operator is used. Consider for instance the case of the subtraction operator >. There should be some space around it in a usual subtraction , but not in a unary negation x>. We also should omit the space when using the subtraction as a pure operation: \,,>|}>>. Usually, no parsing is necessary in order to determine the right amount of spacing: only the general symbol types of the operator itself and its neighbours should be known for this task. We have seen in section how to annotate packrat grammars (using and ) in order to provide this information. Notice that this kind of refined spacing algorithms are also present in . In a similar vein, we may provide penalties for line breaking algorithms. In fact, grammar based line breaking algorithms might go much further than that. For instance, we may decide to replace fractions > with very wide numerators or denominators by>, thereby making them easier to hyphenate. Such replacements can only be done in a reliable way when semantic information is available about when to use brackets. For instance, > should be transformed into /c>, but > into . In order to avoid expensive parsing during the typesetting phase, we may store this bracketing information in the document or suitable annotations during the editing phase. Notice that semantic hyphenation strategies should be applied with some caution: ifone decides to transform a fraction occurring in a long power series with rational coefficients, then it is usually better to transform all fractions: a mixture of transformed and untransformed fractions will look a bit messy to the human eye. For similar reasons, one should systematically transform fractions when a certain width is exceeded, rather than trying to mix the usual hyphenation algorithm with semantic transformations. Finally, in formulas with more nesting levels, one should carefully choose the level where transformations are done. For instance, if a large polynomial matrix <\equation*> *b+a*b+7*b+a*b+3*a*b-8>|-a+a-a+a+a-a+1>>|-a+a-a+a+a-a+1>|*b+a*b+7*b+a*b+3*a*b-8>>>>> does not fit on a line, then we may transform the matrix into a double list: <\equation*> *b+a*b+7*b+a*b+3*a*b-8,a-a+a-a+a+a-a+1|]>,-a+a-a+a+a-a+1,a*b+a*b+7*b+a*b+3*a*b-8|]>|]>. Nevertheless, a more elegant solution might be to break the polynomials inside the matrix: <\equation*> |||*b+a*b+7*b+>>|*b+3*a*b-8>>>>>>||-a+a-a+>>|+a-a+1>>>>>>>||-a+a-a+>>|+a-a+1>>>>>>||*b+a*b+7*b+>>|*b+3*a*b-8>>>>>>>>>> For the moment, our implementation is not yet at this level of sophistication; it would already be nice if the user could easily perform this kind of transformations by hand. In section, we have seen that homoglyphs are a major source of ambiguity and one of the bottlenecks for the design of semantic editors. This means that we should carefully design input methods which incite the user to enter the appropriate symbols, without being overly pedantic. Furthermore, there should be more visual feedback on the current interpretation of symbols, while avoiding unnecessary clobbering of the screen. 
We are investigating the following ideas for implementation in TeXmacs: <\itemize> An easy way to disambiguate homoglyphs, while they are being entered, is to introduce the following new keyboard shortcut convention: if we want to override the default rendering of a symbol or operator, then type the appropriate key for the intended rendering just afterwards. For instance, we might use to enter an invisible space or to enter the wedge product. This convention benefits from the fact that many homoglyphs are infix operators, so that the above kind of shortcuts has no other natural meaning. An alternative approach would be to introduce a new shortcut for cycling through ``homoglyph variants''. The usual variant key might also be used for this purpose, although the list of possible variants would become a bit too long for certain symbols. Slightly visible graphical hints might be provided to users in order to inform them of the current interpretation of a symbol. For instance, multiplication might be rendered as >y>, function application as x>, and the wedge product as y>, without changing any of the spacing properties. The rendering might also be altered more locally, when the cursor is inside the formula, or next to the operator itself. Besides, we are in the process of drawing up a list of the most common homoglyphs. Unfortunately, at the time of writing, the Unicode standard does not provide a lot of support on this issue, which we regard as a major design flaw. For non mathematical scripts, homoglyphs are usually treated in a clean way. For instance, Unicode distinguishes between the Latin, the Cyrillic and the Greek capital O. However, this convention is dropped when it comes to mathematics: <\quote-env> Mathematical operators often have more than one meaning in different subdisciplines or different contexts. For example, the ``+'' symbol normally denotes addition in a mathematical context, but might refer to concatenation in a computer science context dealing with strings, or incrementation, or have any number of other functions in given contexts. Therefore the Unicode Standard only encodes a single character for a single symbolic form. There are numerous other instances in which several semantic values can be attributed to the same Unicode value. For example, U+2218 RING OPERATOR may be the equivalent of white small circle or composite function or apl jot. The Unicode Standard does not attempt to distinguish all possible semantic values that may be applied to mathematical operators or relational symbols. It is up to the application or user to distinguish such meanings according to the appropriate context. Where information is available about the usage (or usages) of particular symbols, it has been indicated in the character annotations in the code charts printed in [Unicode] and in the online code charts [Charts]. It is interesting to examine the reason which is advanced here: as a binary operator, the + sign might both be used for addition and string concatenation. In a similar way, the Latin O might be pronounced differently in English, Hungarian or Dutch. From the syntactical point of view, these distinctions are completely irrelevant: both when used for addition and for string concatenation, the + sign remains an associative infix operator with a fixed precedence. For genuine homoglyphs, such as the logical or and the wedge product, the syntactical status usually changes as a function of the desired interpretation. For instance, the binding force of a wedge product is the same as the binding force of a product.
Furthermore, such operators would usually be placed in different categories. For instance, the logical or can be found in the category ``isotech'' of general technical symbols. It would have been natural to add the wedge product in the category ``isoamsb'' of binary operators. In fact, breaks its own rules in various subtle ways. For instance, invisible mathematical operators can also be considered as homoglyphs. For this special case, does provide an incomplete set of unambiguous operators U+2061 until U+2064. Another, more perverse example is the suggested use of double stroke letters for frequent mathematical constants. For instance, a suggested symbol for the mathematical constant =2.71828\> is the ugly-looking > (U+2147). Yet another example concerns the added code charts for sans serif and monospaced characters. For newly developed mathematical fonts, such as the proprietary STIX fonts, these characters tend to be misused as a replacement for typesetting formulas or text in special fonts. For instance, monospaced characters are suggested for typesetting identifiers in computer programs, which would thereby become invisible for standard search tools. We don't know of any classical mathematical notation which uses monospaced symbols and would have justified the introduction of such curiousities. When taking alook at the existing mathematical charts, we may also wonder whether there are really many common mathematical homoglyphs. Indeed an abundance of such homoglyphs might have justified the lack of support for ``all possible semantic values'' of symbols. However, this hypothetic situation would have been very confusing for mathematicians at a first place. Mathematicians rather tend make suggestive use of existing symbols: they will use a symbol for operations which behave like additions in many respects; why would they use it to denote, say, existential quantification? Currently, the mainstream mathematical editing tools are presentation oriented and unsuitable for the production of documents in which all formulas are at least syntactically correct. In the present paper, we have described a series of techniques which might ultimately make the production of such documents as user friendly as the existing presentation oriented programs. The general philosophy behind our approach is that the behaviour of the editor should remain as close as possible to the behaviour of a presentation oriented editor, except, of course, for those editing actions which may benefit from the added semantics. One major consequence is that the editor naturally adapts itself to the editing habits of the user, rather than the converse. Indeed, it is tempting to ``teach'' the user how to write content markup, but the extra learning efforts should be rewarded by aproportional number of immediate benefits for the user. This holds especially for those users whose primary interest is a paper which looks nice, and who do not necessarily care about the ``added value'' of content markup. Technically speaking, our approach relies on decoupling the visual editing facilities from the semantic checking and correction part. This is achieved through the development of a suitable collection of packrat grammar tools which allow us to maintain a robust correspondance between presentation and content markup. This correspondance comprises cursor positions, selections, and anything else needed for editing purposes. 
So far, our collection of grammar tools has allowed us to naturally extend the most basic editing features of TeXmacs to the semantic context. Although we found it convenient to use packrat grammars, it should be noticed that other types of grammars might be used instead, modulo an additional implementation effort. One main difficulty for the development of a general purpose semantic editor is the wide variety of users. Although we should avoid the pitfall of introducing a large number of user preferences (which would quickly become unintelligible to most people), we will probably need to introduce at least a few of them, so as to address the wide range of potential usage patterns. It will take a few years of user feedback before we are able to propose a sufficiently stable selection of user preferences. In this paper, we have contented ourselves with investigating various possible solutions to problems in which personal taste plays an important role. Another aspect of customization concerns the underlying mathematical grammar. We have argued that the design of grammars should be left to experts. Indeed, designing grammars is non trivial, and non standard grammars will mostly confuse potential readers. We are thus left with two main questions: the design of a ``universal'' grammar which is suitable for most mathematical texts, and an alternative way for the user to introduce customized notation. In section, we both introduced such a grammar and described our approach of using macros for introducing new notations. Such macros can be grouped in style packages dedicated to specific areas. Of course, our proposal for a ``universal'' grammar will need to be tested and worked out further. In some cases though, it is important to allow for customizable grammars. One good example is the use of TeXmacs as an interface to an automatic theorem prover. Such systems often come with their own grammar tools for customizing notations. In such cases, it should not be hard to convert the underlying grammars into our packrat format. Moreover, packrat grammars are very well adapted to this situation, since they are incremental by nature, so it is easy and efficient to use different grammars at different places inside the document. An alternative approach, which might actually be even more user friendly, would be to automatically convert standard notation into the notation used by the prover. In sections and, we have tested the newly implemented semantic tools on some large existing documents from various sources. These experiments made us reasonably optimistic about our ultimate goal of building a user friendly semantic editor for the working mathematician. We have identified a few main bottlenecks, which suggest several improvements and enhancements for the current editor: <\itemize> We should provide better visual feedback to the user on homoglyphs. In particular, the difference between invisible multiplication and function application should be made more apparent. The concept of the current semantic focus should be thought out very carefully. We might want to suppress the traditional behaviour of TeXmacs to surround all ancestors of the current focus. TeXmacs traditionally does not use slots for displaying missing subformulas. Along the lines described in section, we plan to experiment with an editing mode which will use them in a systematic way so as to keep formulas correct at all stages. The ``universal grammar'' should be tested on a wider range of documents and carefully enhanced to cover most standard notations.
We should work out a comprehensible set of annotation primitives, which users will find natural when they need to put text inside their formulas or resort to notations. Since the current implementation is very recent, many additional tweaking will be necessary in response to user feedback. We also have various other plans for future extensions: <\itemize> We plan to add productions to our packrat grammars, which will enable us to automatically generate, say, content markup for all formulas. It should also be possible to translate between distinct grammars. We intend to implement at least some rudimentary support for grammar based typesetting, starting with line breaking, as outlined in section. Systematic generalization of all currently supported editing operations (such as search and replace, tab-completion, structured navigation, etc.) to the semantic context. Using the packrat grammar tools for programming languages as well, so as to support syntax highlighting, automatic indentation, etc. After dealing with purely textual programs, we will next investigate the possibility to insert mathematical formulas directly into the source code. Exploit the extra semantics in our interfaces with external systems. <\bibliography|bib|tm-plain|all.bib> <\bib-list|35> OlivierArsac, StéphaneDalmasMarcGaëtano. The design of a customizable component to display and edit formulas. , 283--290. 1999. PhilippeAudebaudLaurenceRideau. Texmacs as authoring tool for formal developments. , 103:27--48, 2004. S.Basu, R.PollackM.-F.Roy. , 10. Springer-Verlag, 2006. 2-nd edition to appear. S.Basu, R.PollackM.F.Roy. , 10. Springer-Verlag, Berlin, Heidelberg, New York, 2003. T.Braun, H.Danielsson, M.Ludwig, J.PechtaF.Zenith. Kile: an integrated latex environment. Http://kile.sourceforge.net/, 2003. E.Chou. Collected course material. , 2010. F.DeRemer. . , Massachusetts Institute of Technology, 1969. FSF. Gnu bison. , 1998. B.Ford. Packrat parsing: a practical linear-time algorithm with backtracking. , Massachusetts Institute of Technology, September 2002. BryanFord. Packrat parsing: simple, powerful, lazy, linear time. , ICFP '02, 36--47. New York, NY, USA, 2002. ACM. K.Geddes, G.GonnetMaplesoft. Maple. , 1980. A.G.Grozin. TeXmacs interfaces to Maxima, MuPAD and Reduce. V.P.Gerdt, , JINR E5, 149. Dubna, June 2001. Arxiv cs.SC/0107036. A.G.Grozin. TeXmacs-Maxima interface. Arxiv cs.SC/0506226, June 2005. S.C.Johnson. Yacc: yet another compiler-compiler. 1979. D.E.Knuth. . Addison Wesley, 1984. M.Kohlhase. OMDoc. , 2000. L.Lamport. . Addison Wesley, 1994. L.MamaneH.Geuvers. A document-oriented Coq plugin for TeXmacs. . St Anne's Manor, Workingham, UK, 2006. T.Nipkow, L.PaulsonM.Wenzel. Isabelle/Hol., 1993. S.Owre, N.ShankarJ.Rushby. Pvs specification and verification system. , 1992. MacKichanSoftware. Scientific workplace., 1998. G.J.SussmanG.L. SteeleJr. Scheme. , 1975. A.Tejero-Cantero. Interference and entanglement of two massive particles. , Ludwig-Maximilians Universität München, September 2005. . The OpenMath Society. OpenMath. , 2003. Unicode Consortium. Unicode. , 1991. W3C. MathML. , 1999. Wolfram Research. Mathematica. , 1988. N.G.deBruijn. The mathematical language automath, its usage, and some of its extensions. , 125, 29--61. Versailles, December 1968, 1970. Springer Verlag. M. Ettrichetal. The LyX document processor. , 1995. T. Coquandetal. The Coq proof assistant. , 1984. L.van denDries. , 248. Cambridge university press, 1998. J.van derHoeven. GNU. , 1998. J.van derHoeven. 