Converters to other data formats

Currently, we have implemented imperfect converters between TeXmacs and LaTeX and from HTML to TeXmacs. We hope that someone else will be willing to write better converters from scratch. This chapter gives some recommendations in that direction, based on our experience with implementing the current conversion programs. We also recommend taking a look at the current implementations in the directory Convert.

1.Parsing external data formats

In order to write a converter from LaTeX, HTML, XML, etc. to TeXmacs, a good first step is to write a parser for the external data format. For HTML, XML, etc. this should be rather easy, but for LaTeX you will probably need to be a real LaTeX guru (which I am not). We recommend that the result of the parsing step be a Scheme expression (which is regrettably not the case for our current converters), because this language is very well suited to implementing the actual converter.
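As an illustration of a parsing step whose result is a Scheme expression, here is a minimal sketch in Python for HTML. It is not how the converters in Convert work: the names SexpBuilder, to_scheme and parse_html are hypothetical, attributes are simply dropped, and the list of void elements is an assumption that is far from complete.

```python
from html.parser import HTMLParser

# Assumed (incomplete) set of void elements: they never get a closing tag,
# so they must not stay open on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class SexpBuilder(HTMLParser):
    # Builds a nested-list tree of the form [tag, child, child, ...].
    # Attributes are ignored in this sketch.

    def __init__(self):
        super().__init__()
        self.root = ["document"]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag]
        self.stack[-1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Whitespace-only text between tags is discarded.
        if data.strip():
            self.stack[-1].append(data)


def to_scheme(node):
    # Serialise the nested-list tree as the text of a Scheme expression.
    if isinstance(node, str):
        escaped = node.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return "(" + " ".join([node[0]] + [to_scheme(c) for c in node[1:]]) + ")"


def parse_html(source):
    builder = SexpBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

For example, to_scheme(parse_html("<p>Hello <b>world</b></p>")) yields the text (document (p "Hello " (b "world"))), which a Scheme-based converter could then read directly.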

This first step should be able to process any correct file in the external data format; possible incompatibilities should only come into play during the actual conversion. In the case of LaTeX, one should not expand the macros, and one should keep all macro definitions, because TeXmacs will be able to take advantage of them.
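The following Python sketch illustrates what "not expanding the macros" can mean at the tokenizing level: macro uses stay in the token stream as they are, and the names of user-defined macros are merely recorded. It is a toy under stated assumptions, not a LaTeX parser: TOKEN, DEFINERS, tokenize and macro_names are hypothetical names, category codes and verbatim text are ignored, and a lone trailing backslash is silently skipped.

```python
import re

# One token per match: control sequence, brace, comment, or a run of text.
TOKEN = re.compile(r"""
    (?P<command>\\[A-Za-z]+\*?|\\.)   # control word (optionally starred) or symbol
  | (?P<open>\{)
  | (?P<close>\})
  | (?P<comment>%[^\n]*)
  | (?P<text>[^\\{}%]+)
""", re.VERBOSE | re.DOTALL)

# Assumed set of definition commands that this sketch recognises.
DEFINERS = {"\\newcommand", "\\renewcommand", "\\providecommand", "\\def"}


def tokenize(source):
    # Return (kind, value) pairs; comments are dropped, nothing is expanded.
    tokens = []
    for match in TOKEN.finditer(source):
        if match.lastgroup != "comment":
            tokens.append((match.lastgroup, match.group()))
    return tokens


def macro_names(tokens):
    # Collect the names introduced by definition commands, in order.
    names = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "command" and value.rstrip("*") in DEFINERS:
            for next_kind, next_value in tokens[i + 1:]:
                if next_kind == "command":
                    names.append(next_value)
                    break
                # Skip an opening brace or blank text before the macro name.
                if next_kind == "open" or (next_kind == "text" and not next_value.strip()):
                    continue
                break
    return names
```

The definition itself remains in the token list, so a later conversion step can turn it into a corresponding definition on the TeXmacs side instead of losing it.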

2.The actual converter

We recommend that the actual converter proceed in several steps. It is often convenient to start with a rough structural conversion step, which is then “polished” by a number of additional steps. These additional steps may take care of very particular layout issues which cannot be treated conveniently in the main step.
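A converter organised in this way might be sketched as follows in Python, with trees represented as nested lists of the form [tag, child, ...]. Everything here is an assumption made for illustration: the tag mapping TAG_MAP, the two polishing passes and the function names are hypothetical, and a real converter would need many more passes.

```python
# Hypothetical mapping from source tag names to target tag names.
TAG_MAP = {"b": "strong", "i": "em"}


def rough_convert(node):
    # Main step: a purely structural, tag-by-tag translation.
    if isinstance(node, str):
        return node
    head = TAG_MAP.get(node[0], node[0])
    return [head] + [rough_convert(child) for child in node[1:]]


def drop_empty_markup(node):
    # Polishing pass: remove "strong" and "em" nodes that have no content.
    if isinstance(node, str):
        return node
    children = [drop_empty_markup(child) for child in node[1:]]
    kept = [c for c in children
            if isinstance(c, str) or len(c) > 1 or c[0] not in ("strong", "em")]
    return [node[0]] + kept


def merge_strings(node):
    # Polishing pass: join adjacent text children into a single string.
    if isinstance(node, str):
        return node
    out = [node[0]]
    for child in node[1:]:
        child = merge_strings(child)
        if isinstance(child, str) and len(out) > 1 and isinstance(out[-1], str):
            out[-1] += child
        else:
            out.append(child)
    return out


# Order matters: dropping empty nodes can make two strings adjacent.
POLISHING_PASSES = [drop_empty_markup, merge_strings]


def convert(tree):
    result = rough_convert(tree)
    for polish in POLISHING_PASSES:
        result = polish(result)
    return result
```

Keeping each polishing step as a separate function makes it easy to add, remove or reorder passes without touching the main structural step.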

In practice, the main difficulties usually come from exceptional text, like verbatim, and from layout issues which are handled differently in the external data format and in TeXmacs. A good example of such a difference between LaTeX and TeXmacs is the way equations or lists are handled. Consider for instance the following paragraph:

Text before.

a^2 + b^2 = c^2.

Text after.

In LaTeX, the equation is really seen as a part of the paragraph. Indeed, there will not be any blank line between “Text before” and the equation. However, for efficiency reasons, it is better to see the paragraph as three paragraphs in TeXmacs, because the lines can then be typeset independently. Nevertheless, the equation environment will disable the indentation of “Text after”.

As a result of this anomaly, converted texts have to be postprocessed, so as to insert paragraph breaks at strategic places. It should be noted that this step may be independent of the format which is actually being converted, and that a similar reverse step may be implemented for backward conversions. We also note that this postprocessing step needs an exhaustive list of all similar exceptional environments. A future version of TeXmacs might come with an additional feature which permits the automatic detection of such environments. This is also important from a semantic point of view, because one should be able to detect that the above example logically forms only one paragraph and not three.
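The postprocessing step could look roughly like the Python sketch below. It assumes that a converted document is a list of paragraphs, each paragraph being a list of strings and nested-list nodes of the form [tag, child, ...]. The set DISPLAY_ENVIRONMENTS is an assumed and deliberately incomplete stand-in for the exhaustive list mentioned above, and the function names are hypothetical. The sketch only inserts the breaks; it does not deal with the indentation of the text that follows, nor with the reverse step.

```python
# Assumed, incomplete list of environments that are displayed on their own lines.
DISPLAY_ENVIRONMENTS = {"equation", "equation*", "itemize", "enumerate", "verbatim"}


def is_display(item):
    # True for a node such as ["equation", ...] whose tag is in the list above.
    return (isinstance(item, list) and len(item) > 0
            and item[0] in DISPLAY_ENVIRONMENTS)


def split_paragraph(items):
    # Split one logical paragraph into pieces that can be typeset independently.
    pieces, current = [], []
    for item in items:
        if is_display(item):
            if current:
                pieces.append(current)
            pieces.append([item])
            current = []
        else:
            current.append(item)
    if current:
        pieces.append(current)
    return pieces


def insert_paragraph_breaks(paragraphs):
    # Apply the splitting to every paragraph of the document.
    result = []
    for items in paragraphs:
        result.extend(split_paragraph(items))
    return result
```

Applied to a single paragraph consisting of some text, an equation node and some more text, this returns three paragraphs. Since nothing in it depends on LaTeX, the same pass could be shared by converters from other formats.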

3.Backward conversions

Conversions from TeXmacs to an external data format are usually easier to implement, because the TeXmacs data format is semantically rich. However, conversions to an external data format without a TeX-like macro facility give rise to the problem of macro expansion of unsupported TeXmacs functions or environments. We plan to write a facility for this, which you will be able to use when writing a converter from TeXmacs to something else.
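Until such a facility exists, the idea of expanding unsupported functions before export can be sketched in Python as follows. This is a simplified model and not the planned TeXmacs facility: trees are nested lists of the form [tag, child, ...], a macro definition is a hypothetical pair (parameter names, body), parameters are referenced in the body as ["arg", name], and recursive macro definitions are not guarded against.

```python
def substitute(body, bindings):
    # Replace every ["arg", name] in the macro body by the bound argument.
    if isinstance(body, str):
        return body
    if body[0] == "arg" and len(body) == 2 and body[1] in bindings:
        return bindings[body[1]]
    return [body[0]] + [substitute(child, bindings) for child in body[1:]]


def expand(node, macros, supported):
    # Expand every node whose tag the target format does not support,
    # provided a macro definition with the right number of parameters is known.
    if isinstance(node, str):
        return node
    head = node[0]
    args = [expand(child, macros, supported) for child in node[1:]]
    if head not in supported and head in macros:
        params, body = macros[head]
        if len(params) == len(args):
            expanded = substitute(body, dict(zip(params, args)))
            # The body may itself use unsupported tags, so expand again.
            return expand(expanded, macros, supported)
    # Supported tags, and unsupported ones without a definition, are kept.
    return [head] + args
```

For instance, with a hypothetical macro emph-strong defined as (["x"], ["strong", ["em", ["arg", "x"]]]) and a target format that supports only strong and em, the node ["emph-strong", "hi"] becomes ["strong", ["em", "hi"]].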

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".