Preserving syntactic correctnesswhile editing mathematical formulas

GNU TeXmacs is a free software for editing scientific documents with mathematical formulas, which can also be used as an interface for many computer algebra systems. We present the design of a new experimental mathematical editing mode which preserves the syntactic correctness of formulas during the editing process (i.e. all formulas can be parsed using a suitable, sufficiently rich grammar). The main constraint is to remain as closely as possible to the existing presentation-oriented formula editor, which has the advantage of being very user friendly.

Keywords: mathematical editing, syntactic correctness, packrat parsing, TeXmacs

A.M.S. subject classification: 68U15, 68U35, 68N99

1.Introduction

Most mathematical formulas in current scientific papers only carry very poor semantics. For instance, consider the two formulas

and

. People typically enter these formulas using the LaTeX pseudo-code $f(x+y)$ and $a(b+c)$ . Doing so, we do not transmit the important information that we probably meant to apply

in the first formula and to multiply

with

in the second one. The problem to automatically recover such information is very hard in general. For this reason, it would be desirable to have mathematical authoring tools in which it is easy to write formulas which systematically carry this type of information.

One important application where semantics matters is computer algebra. Popular computer algebra systems such as Mathematica and Maple contain formula editors in which it is only possible to input formulas which can at least be understood from a syntactical point of view by the system. However, these systems were not really designed for writing scientific papers: they only offer a suboptimal typesetting quality, no advanced document preparation features, and no support for more informal authoring styles which are typical for scientific papers.

The GNU TeXmacs editor was designed to be a fully fledged wysiwyg alternative for TeX/LaTeX, as well as an interface for many computer algebra systems. The software is free and can be downloaded from http://www.texmacs.org. Although formulas only carried barely more semantics than LaTeX in old versions of TeXmacs, we have recently started to integrate more and more semantic editing features. Let us briefly discuss some of the main ideas behind these developments; we refer to [7] for more details and historical references to related work.

First of all, we are only interested in what we like to call “syntactical semantics”. In the formula

, this means that we wish to capture the fact that

is an infix operator with arguments

and

, but that we are uninterested in the fact that

stands for addition on integers. Such syntactical semantics can be modeled adequately using a formal grammar. Several other mathematical formula editors are grammar-based [1–3, 8, 10, 11], and they make use of various kinds of formal grammars. In TeXmacs, we have opted for so called packrat grammars [4, 5], which are particularly easy to implement and customize.

A second question concerns the precise grammar that we should use to parse formulas in scientific documents. Instead of using different grammars for various areas with different notations, we were surprised to emperically find out that a well-designed “universal” mathematical grammar is actually sufficient for most purposes; new notations can still be introduced using a suitable macro mechanism.

The last main point concerns the interaction between the editor and the grammar. So far, we implemented a packrat parser for checking the correctness of a formula. While editing a formula, its correctness is indicated using colored boxes. It is also possible to detect and visualize the scopes of operators through the grammar. In addition to the parser, we implemented a series of tools which are able to detect and correct the most common syntactical mistakes and enhance existing documents with more semantics.

In the present paper, we wish to go one step further and enforce syntactic correctness throughout the editing process. Ideally speaking, the following requirements should be met:

The main technique that we will use for sticking as close as possible to the old, presentation oriented editing behaviour is to automatically insert “transient” markup for enforcing correctness during the editing process. For instance, when typing X+, TeXmacs will display

The transient box is used to indicate a missing symbol or subexpression and will be removed as soon as the user enters the missing part.

The use of transient boxes for missing symbols or subexpressions is common in other editors [9]. The question which interests us here is how to automatically insert such markup when needed in a way that is essentially independent from specific grammars. In this paper, we work out the following approach which was suggested in [7]: before and after each editing operation, subject the formula to suitable “correction” procedures that are only allowed to add or remove transient markup. Correcting all errors in a general formula is a very difficult problem, but the power of our approach comes from the fact that the editing process is incremental: while typing, the user only introduces small errors—mostly incomplete formulas—, which are highly localized; we may thus hope to deal with all possible problems using a small number of “kinds of corrections”.

Obviously, the simplest kinds of corrections are adding or removing a transient box at the current cursor position. This is indeed sufficient when typing simple formulas such as

, but additional mechanisms are needed in other situations. For instance, in the formula

(with the cursor between the “

” and the “

”), entering another

results in

(instead of

). Hitting backspace in the same formula

yields

; in this case, the transient “

” should be parsed as an infix addition, and not as an ordinary symbol (as was the case for a transient box).

The appropriate corrections are not always so simple. For instance, consider the quantified expression

. Just after we entered the existential quantifier “

”, the formula will read

, i.e. it was necessary to add three transient symbols in order to make the expression syntactically correct. The fact that our approach should apply to general scientific documents with mathematical formulas raises several further problems. For instance, in the formula

the trailing ponctuation “,” is incorrect from a mathematical point of view, but needed inside the surrounding English sentence. Similarly, more work remains to be done on the most convenient way to include English text inside formulas while maintaining syntactic correctness.

Yet another difficulty stems from the implementation: one needs to make sure that the necessary corrections take place after any kind of editing operation. However, for efficiency reasons, it is important to only run the correction procedures on small parts of the document. Inside an existing editor such as TeXmacs, these requirements turn out to be quite strong, so some trade-offs may be necessary.

In what follows, we report on our first implementation of these ideas inside TeXmacs. We describe and motivate the current design, discuss remaining problems, and outline directions for future improvements. Of course, more user feedback will be necessary in order to make the new mechanisms suitable for wide-spread use.

2.Survey of formula editing with TeXmacs

In this section, we briefly recall the main design philosophy behind the TeXmacs formula editor. We start with the description of the original, purely presentation-oriented mathematical editing mode. We pursue with the more recent grammar-based editing features, which are presented in more detail in [7].

2.1.Presentation oriented editing

The original goal behind TeXmacs was to provide a user friendly editor for mathematical papers with a similar typesetting quality as TeX. The challenge was to design a real time wysiwyg editor for complex, structured documents. Some early inspiration came from the idea [1] that graphically oriented math editors achieve the highest level of user friendliness. For instance, when pressing the right arrow key, the cursor should move to the right if possible (instead of moving forward in some abstract document tree, as was the case in some other existing editors). Early versions of TeXmacs used algorithms for the cursor movement which achieved this in a systematic way [6], while still making sure that all possible cursor positions in the corresponding document tree could be reached.

Another aspect of user friendliness concerned the efficiency of mathematical input methods. We designed highly efficient (and easy to memorize) keyboard shortcuts for entering common mathematical symbols, such as -> for

, <= for

, <⇥/ for

, ⇧R⇧R for

, etc. TeXmacs also implements many “structured editing operations”, so as to fully exploit the structure of documents. For instance, adding a row or column to a matrix can be done by pressing a single key or keyboard combination. Similarly, it is easy to change a matrix into a determinant or vice versa.

2.2.Grammar-based editing

The next challenge for TeXmacs is to ensure that we can only enter syntactically correct formulas, while keeping a presentation-oriented interface, which proved to be most user friendly. The first steps of this program were made in [7]. Now syntactic correctness is usually modeled as “parsability against a suitable grammar”. Before anything else, one should decide on the grammar. In particular, does a single “universal grammar” suffice, or do we need many different grammars, depending on the preferred notations of authors?

For reasons that are explained in detail in [7], we opted for the development of a universal packrat grammar [4, 5] for parsing all our mathematical formulas. In order to conserve a sufficient degree of flexibility for the introduction of new notations, we rely on a combination of two techniques: on the one hand, TeXmacs comes with a powerful macro language for introducing new markup elements. On the other hand, we introduced a special construct which allows a symbol or expression to be behave (i.e. be parsed) as an arbitrary other symbol or expression. This allows you for instance to annotate the symbol

to behave as

, which implies that

will be parsed as

instead of

One of the major difficulties of semantic editing is a clean treatment of homoglyphs, i.e. symbols with the same graphical shape, but a different syntactical meaning. The most annoying homoglyph is the multiplication/function-application ambiguity mentioned in the introduction. Another good example concerns the wedge product

and logical conjunction

, which admit different binding forces. Fortunately, there are not that many mathematical homoglyphs; for this reason, we advocate the introduction of separate symbols for them into the Unicode standard.

3.Preservation of correctness

In this section, we describe several strategies that can be used to preserve the syntactic correctness of formulas under editing operations. TeXmacs currently implements the “multiple correction schemes” strategy from Sections 3.2 and 3.3. The reader may try this implementation by downloading version

1.99.3 or SVN revision

9718. The new editing mode is still experimental and can be enabled inside math mode by clicking on the

icon and checking Semantic correctness.

3.1.The ideal strategy for preserving correctness

Ideally speaking, maintaining the syntactic correctness of mathematical formulas throughout the editing process can be done by

This ideal strategy is simple and robust; it trivially guarantees the correctness of all formulas throughout the editing process. However, it does not take into account the specific nature of certain editing operations. In particular, it does not exploit the locality of many editing actions.

Example 1. Consider the strict application of the ideal strategy to the creation of a subscript in the formula

. Since

is a valid symbol, the main editing action would create an empty subscript for it. We next launch the correction procedure, which replaces the empty subscript by a transient box, yielding

. However, the

being transient, the user would rather expect to endow the “

” operator with a subscript: this is indeed what happens in the old presentation-oriented editing mode when ignoring all transient markup. In other words, we rather expect to obtain

The above example shows that an indiscriminate global correction procedure does not provide enough control. In fact, there are usually many ways to correct a formula by adding or removing transient markup. In order to determine the “best” solution, one typically needs to take into account the precise editing operation and the current cursor position.

Another constraint is that we would like the editor to behave as closely as possible as the old presentation-oriented editing mode when ignoring all transient markup. The above example shows that a global correction procedure does not necessarily respect this constraint. One theoretic solution to this problem is to remove all transient markup before performing the editing action and then put it back in when running the correction procedure. However, this approach may lead to non local changes in the document for every editing action, which is obviously not desirable.

Remark 2. For the above reasons, we have not implemented the correction strategy from this section yet. The idea nevertheless remains interesting for future research. Indeed, on the one hand side it raises the interesting theoretical question of correcting a string so as to make it parsable by a given (packrat) grammar. From the practical point of view, the ideal strategy has the important advantage of trivially guaranteeing syntactic correctness all along. In cases where this is hard to achieve using other means, it thereby remains a good fallback strategy.

3.2.Multiple correction schemes

Instead of implementing one global correction procedure, our current TeXmacs implementation relies on multiple “correction schemes”. Each correction scheme is allowed to add or remove transient markup both before and after the actual editing operation. In other words, it really encapsulates the editing action into a semantically enhanced editing action. Furthermore, the correction scheme is allowed to fail (i.e. to produce an incorrect formula at the end). For this reason, we try multiple correction schemes in a row (the set of “eligible” schemes depends on the specific editing action), and stop as soon as we managed to obtain a correct formula.

For the actual implementation, it is clearly crucial to be able to undo editing actions whenever necessary, and in a way that is orthogonal to the usual undo/redo operations in TeXmacs.

Example 3. When inserting a mathematical symbol, the first correction scheme we try is the following: first remove all transient markup around the cursor, then insert the symbol, and finally insert a transient box at the cursor position (if needed). For instance, typing A+B in an empty mathematical formula successively yields

, and

Example 4. The basic correction scheme from the previous example sometimes fails. For instance, assume that we are in the situation

, and that we add a second “

”. When applying the basic correction scheme, we need to correct

through the insertion of a single transient box. However, the formula

is still incorrect. For this particular case, we therefore use the following correction scheme: first add a transient box (

), then perform the editing action (

), and finally correct (nothing needs to be done at this step).

In Step 3, we simply canceled the editing action if all correction schemes failed. Several other fallback strategies can be considered. If we do not aim to maintain correctness at all costs, then we may apply the editing action without any corrections, and temporarily tolerate incorrect formulas. We might also implement an unconditionally successful fallback strategy as in Remark 2; by always adding such a strategy at the end of our list of eligible correction schemes, we will never reach Step 3. Yet another idea is to introduce a correction scheme which annotates subexpressions with exotic notations in such a way that they become correct.

3.3.Quick survey of some of the implemented correction schemes

Our approach of using multiple correction schemes allows for fine-grained control, but also requires an increased amount of manual labour. Indeed, we both have to cover the complete set of editing actions, and for each editing action, we have to implement at least one correction scheme that will succeed in all possible situations.

Fortunately, the most common editing operations fall into four main categories: insertions and deletions that operate either on selections or not. Some other operations such as “search and replace” have not yet been adapted (see also the next section). Ultimately, the idea would be to provide manual support for the most common operations and to implement a suitable fallback strategy for the other ones.

3.3.1.Correction schemes for insertions

Let us briefly list how we perform the most prominent correction schemes for insertions, in absence of active selections. For each of the schemes, we show the successive states of the formula for a simple example.

The last three schemes also show that it is sometimes necessary to insert transient markup with different semantics as an ordinary symbol in order to make the formula correct.

3.3.2.Correction schemes for deletions

For completion, we continue our list of examples with the most prominent correction schemes for deletions.

These examples show that the correction schemes have to be implemented with quite a lot of care. This is due to the fact that it is convenient to design the schemes to apply with the right level of generality (e.g. not only to the deletion of symbols for the basic schemes, but also to the deletion of more complex structures, such as subscripts, fractions, etc.).

4.Problematic cases and challenges

Several problems arose during the implementation of the new semantic mathematical editing mode which preserves syntactic correctness. Some of them were more or less expected and have been solved; others require more work and further experimentation. So far, all problematic cases that we encountered fall into two categories:

In this section, we will survey the most interesting issues that came up and highlight some of the remaining challenges.

4.1.Informal content inside formulas

One difficulty with mathematical formulas in scientific papers with respect to formulas in, say, computer algebra systems, is that they may contain ponctuation, decorations, typesetting directives, or explicative text. For instance, consider the following formula:

The best approach to these problems is to introduce suitable annotation markup which describes the semantics of informal content of this kind. For instance, we might introduce a tag “punctuation” for annotating the trailing period, and which would be ignored by the parser. Alternatively, one might use a special symbol “punctuation period in math mode”. In a similar spirit, AMS-LaTeX provides special environments (split, align, gather, etc.) for typesetting large formulas while preserving some of the intended semantics. TeXmacs also contains a general purpose tag “syntax”, which may be used to parse an expression according to the rules of another specified expression. This allows us for instance to parse the word “and” in the same way as the infix operator “

”. However, we have no “postfix quantification” rule in our grammar yet. More generally, the design of a complete DTD for informal annotations is an interesting challenge.

Assuming suitable markup, the design of user friendly ways to perform the necessary annotations is another matter. Trailing periods are so common that we actually would like to enter them simply by pressing .. There are two approaches to this problem. Our current solution is to adapt the grammar for displayed formulas so as to accept trailing punctuation (which also means that we do not need any special annotation semantics). A better solution would be to “requalify” symbols whenever needed. For instance, in the formula

the trailing comma would be interpreted by default as a “punctuation symbol”. However, as soon as we add a new character

to the line, we remove the annotation markup and requalify the comma to become a separator.

Of course, for arbitrarily complex informal text (such as the “almost everywhere” example), it will be hard to completely avoid user feedback on how to insert the necessary annotations. Nevertheless, some of the most common words (“and”, “or”, “iff”, etc.) might be annotated automatically.

4.2.Special syntactic constructs

One obvious drawback of our strategy to manually design the necessary correction schemes is completeness: every additional mathematical notation potentially requires one or more new correction schemes. Fortunately, most mathematical notations are quite simple, so this disadvantage is not as bad as it might seem. General purpose scientific papers nevertheless involve far more special syntactic constructs than, say, computer algebra input. Let us illustrate some typical issues that occur on the hand of a few somewhat unorthodox constructs.

4.3.Special editing operations and markup

Let us finally investigate to which extent existing editing operations have to be adapted to the new, more semantic editing mode. We will start with a few issues that are already dealt with and then turn our attention to the remaining challenges.

Bibliography

[1]: O. Arsac, S. Dalmas, and M. Gaëtano. The design of a customizable component to display and edit formulas. In ACM Proceedings of the 1999 International Symposium on Symbolic and Algebraic Computation, July 28–31, pages 283–290. 1999.
[2]: Y. Bertot. The CtCoq system: design and architecture. Formal Aspects of Computing, 11(3):225–243, 1999.
[3]: P. Borras, D. Clement, Th. Despeyroux, J. Incerpi, G. Kahn, B. Lang, and V. Pascual. Centaur: the system. SIGSOFT Softw. Eng. Notes, 13(5):14–24, 1988.
[4]: B. Ford. Packrat parsing: a practical linear-time algorithm with backtracking. Master's thesis, Massachusetts Institute of Technology, September 2002.
[5]: B. Ford. Packrat parsing: simple, powerful, lazy, linear time. In Proceedings of the seventh ACM SIGPLAN international conference on Functional programming, ICFP '02, pages 36–47. New York, NY, USA, 2002. ACM Press.
[6]: J. van der Hoeven. GNU TeXmacs: a free, structured, wysiwyg and technical text editor. In Daniel Filipo, editor, Le document au XXI-ième siècle, volume 39–40, pages 39–50. Metz, 14–17 mai 2001. Actes du congrès GUTenberg.
[7]: J. van der Hoeven. Towards semantic mathematical editing. J. Symbolic Comput., 71:1–46, 2015.
[8]: N. Kajler. Environnement graphique distribué pour le calcul formel. PhD thesis, Université de Nice-Sophia Antipolis, 1993.
[9]: L. Padovani and R. Solmi. An investigation on the dynamics of direct-manipulation editors for mathematics. In A. Asperti, G. Bancerek, and A. Trybulec, editors, Mathematical Knowledge Management, volume 3119 of Lect. Notes Comp. Sci., pages 302–316. Springer Berlin Heidelberg, 2004.
[10]: N. M. Soiffer. The Design of a User Interface for Computer Algebra Systems. PhD thesis, University of California at Berkeley, 1991.
[11]: L. Théry, Y. Bertot, and G. Kahn. Real theorem provers deserve real user-interfaces. SIGSOFT Softw. Eng. Notes, 17(5):120–129, nov 1992.