\documentclass[fleqn]{article} \newcommand{\OMEGA}{$\Omega$} \newcommand{\mymathtt}[1]{\mbox{\texttt{#1}}} \newcommand{\mymathit}[1]{\mbox{\emph{#1}}} \begin{document} \title{Draft documentation for the \OMEGA\ system} \author{John Plaice\thanks{D\'epartement d'informatique, Universit\'e Laval, Ste-Foy (Qu\'ebec) Canada~G1K~7P4. \texttt{John.Plaice@ift.ulaval.ca}} \and Yannis Haralambous\thanks{187,~rue Nationale, F-59000~Lille, France. \texttt{Yannis.Haralambous@univ-lille1.fr}}} \date{26~February~1995} \maketitle \section{Introduction} This document is version~0.0 of the documentation for the \OMEGA~typesetting system, designed and developed by the authors. This draft document accompanies the 1.1~release of~\OMEGA, which is available at \begin{verbatim} ftp://ftp.ift.ulaval.ca/cours/omega \end{verbatim} or at \begin{verbatim} ftp://ftp.ens.fr/pub/tex/yannis/omega \end{verbatim} This documentation should be considered cursory, the bare minimum for those who wish to do $\alpha$- and $\beta$-testing. \section{Sixteen-bit fonts, registers, etc.} One of the fundamental limitations of \TeX3 is that most quantities can only range between 0~and~255. Fonts are limited to 256~characters each, only 256~fonts are allowed simultaneously, only 256~of any given kind of register can be used simultaneously, etc. \OMEGA\ loosens these restrictions, allowing up to 65~536~of each of these entities to be used. \subsection{Changes to \TeX\ to produce \OMEGA} \paragraph{Characters.} Each font can allow up to 65~536 characters, ranging between 0~and 65~535. Unless other means are provided, using \OMEGA\ Translation Processes (see section~\ref{otp}), the input and output mechanisms for characters between 256 (hex~\texttt{100}) and 65~535~(hex~\texttt{ffff}) use four circumflexes. For example, \verb|^^^^cab0| means hex value~\verb|cab0| and \verb|^^^^0020| is the space character. 
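As a worked instance of the four-circumflex notation, the four hexadecimal digits are read as a single base-16 number, so the first example above denotes
\[ \mathtt{cab0}_{16} = 12\cdot 4096 + 10\cdot 256 + 11\cdot 16 + 0 = 51\,888, \]
which lies between 256 and 65~535, while \verb|^^^^0020| denotes $2\cdot 16 = 32$, the code of the space character.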
\paragraph{Fonts.}
The number of possible fonts can be changed through the compile-time constant \texttt{NUMBERofFONTS}, which must lie between 256~and~65536. One can access fonts numbered between 0~and~$\mathtt{NUMBERofFONTS}-1$.

\paragraph{Registers.}
The number of possible registers of each kind can be changed through the compile-time constant \texttt{NUMBERofREGISTERS}, which must also lie between 256~and~65536. For each kind of register, one can access registers numbered between 0~and~$\mathtt{NUMBERofREGISTERS}-1$.

\paragraph{Font metric files.}
The \verb|.tfm| files used by \TeX3 only allow 256~characters each. Like \TeX, \OMEGA\ uses \verb|.tfm| files, but it also uses \emph{extended font metric} (\verb|.xfm|) files, which are generalizations of \verb|.tfm| files for fonts of up to 65~536~characters each. The description below focuses on the differences between \verb|.tfm| files and \verb|.xfm| files. The standard definition of \verb|.tfm| files is in the second volume of Knuth's \emph{Computers and Typesetting} series.

The first 52 bytes (13 words) of an \verb|.xfm| file contain thirteen 32-bit integers that give the lengths of the various subsequent portions of the file.
These thirteen integers are, in order: \begin{tabular}{ll} $0$ &empty word to designate \verb|.xfm| file;\\ \emph{lf}&length of the entire file, in words;\\ \emph{lh}&length of the header data, in words;\\ \emph{bc}&smallest character code in the font;\\ \emph{ec}&largest character code in the font;\\ \emph{nw}&number of words in the width table;\\ \emph{nh}&number of words in the height table;\\ \emph{nd}&number of words in the depth table;\\ \emph{ni}&number of words in the italic correction table;\\ \emph{nl}&number of words in the lig/kern table;\\ \emph{nk}&number of words in the kern table;\\ \emph{ne}&number of words in the extensible character table;\\ \emph{np}&number of font parameter words.\\ \end{tabular} The first word is~0 (future versions of \verb|.xfm| files could have different values; what is important is that the first two bytes be~0 to differentiate \verb|.tfm| and \verb|.xfm| files). The next twelve integers are as above, all non-negative and less than~$2^{31}$. The inequality $\mathit{bc}-1\leq\mathit{ec}\leq65535$ must hold, as must the equality \[\mathit{lf}=13+ \mathit{lh}+ 2(\mathit{ec}\!-\!\mathit{bc}\!+\!1)+ \mathit{nw}+ \mathit{nh}+ \mathit{nd}+ \mathit{ni}+ \mathit{nl}+ \mathit{nk}+ \mathit{ne}+ \mathit{np}.\] Note that an \verb|.xfm| font may contain as many as 65~536 characters (if $\mathit{bc}=0$ and $\mathit{ec}=65535$), and as few as 0~characters (if $\mathit{bc}=\mathit{ec}+1$). The rest of the \verb|.xfm| file is, like in \verb|.tfm| files, a sequence of ten data arrays. Three of the arrays are different: \emph{char\_info}, \emph{lig\_kern} and \emph{exten}. The \emph{char\_info} array contains one \emph{char\_info\_word} entry per character. 
Each \emph{char\_info\_word} in an \verb|.xfm| file takes 2~words (8~octets), packed as follows: \begin{description} \item[octets 0--1:] \emph{width\_index} (16~bits); \item[octet 2:] \emph{height\_index} (8~bits); \item[octet 3:] \emph{depth\_index} (8~bits); \item[octets 4--5:] \emph{italic\_index} (14 bits) times 4, plus \emph{tag} (2~bits); \item[octets 6--7:] \emph{remainder} (16 bits). \end{description} Therefore the \verb|.xfm| format imposes a limit of 256~different heights, 256~different depths, and 16~384~different italic corrections. The \emph{lig\_kern} array consists of a sequence of \emph{lig\_kern\_command} entries. Each \emph{lig\_kern\_command} in an \verb|.xfm| file takes 2~words (8~octets), packed as follows: \begin{description} \item[octets 0--1:] \emph{skip\_byte}, indicates that this is the final program step if the byte is 128 or more, otherwise the next step is obtained by skipping this number of intervening steps. \item[octets 2--3:] \emph{next\_char}, ``if \emph{next\_char} follows the current character, then perform the operation and stop, otherwise continue.'' \item[octets 4--5:] \emph{op\_byte}, indicates a ligature step if less than~128, a kern step otherwise. \item[octets 6--7:] \emph{remainder}. \end{description} For \verb|.tfm| files, if the very first instruction of a character's \emph{lig\_kern} program has $\mathit{skip\_byte}>128$, the program actually begins in location $256*\mathit{op\_byte}+\mathit{remainder}$. This feature allows access to large \emph{lig\_kern} arrays, because the first instruction must otherwise appear in a location $\leq255$. For \verb|.xfm| files, the latter value is $\leq65535$. Extensible characters are specified by an \emph{extensible\_recipe}, which consists of four 2-octet words called \emph{top}, \emph{mid}, \emph{bot}, and \emph{rep} (in this order). These bytes are the character codes of individual pieces used to build up a large symbol. 
If \emph{top}, \emph{mid}, or \emph{bot} are zero, they are not present in the built-up result. For example, an extensible vertical line is like an extensible bracket, except that the top and bottom pieces are missing. \paragraph{Font offsets.} When switching from one alphabet to another in Unicode, one passes from one Unicode page to another. However, the corresponding fonts will normally all be numbered from~0. To deal with this situation, a new keyword, \texttt{offset}, is introduced. In the \verb|\font| command, $\mathtt{offset}\;n$ states that character~$c$ in the font is referred to in \OMEGA\ by $n+c$. For example, \begin{verbatim} \font\ARfont=oar10 scaled 1728 offset 256 %% an X-font \end{verbatim} states that the font \texttt{oar10} is to be loaded, using a scaling factor of~1728, and that character~$c$ in the font will be referred to in \OMEGA\ as $c+256$ or, equivalently, that character~$C$ in \OMEGA\ refers to character $C-256$ in the font. \paragraph{Implementation.} The implementation of the changes presented in this section can be found in file \texttt{om16bit.ch}. The only known problem is that the current implementation creates very large formats. This can be alleviated by reducing the \texttt{NUMBERofFONTS} and \texttt{NUMBERofREGISTERS} compile-time constants. \subsection{Changes to \texttt{vptovf} to produce \texttt{xvptoxvf}} For the moment, \textsc{metafont} continues to produce 8-bit fonts. Given that most \texttt{.dvi} drivers can only handle 8-bit fonts, we took the soft approach of providing the means to develop 16-bit \emph{virtual} fonts that use 8-bit \emph{real} fonts. To do this, two new file formats, in addition to the \texttt{.xfm} files, had to be introduced: extended virtual property (\texttt{.xvp}) files and extended virtual font (\texttt{.xvf}) files. \paragraph{Extended virtual property files.} The \texttt{.xvp} files are the same as \texttt{.vpl} files, except that characters are no longer limited to 8~bits, but to 16~bits. 
\paragraph{Extended virtual font files.}
The \texttt{.vf} file format already supports fonts with large numbers of characters. However, not all drivers that read \texttt{.vf} files properly support large fonts. Therefore, the files generated from \texttt{.xvp} files are labeled \texttt{.xvf} rather than~\texttt{.vf}.

\paragraph{The \texttt{xvptoxvf} program.}
The \texttt{vptovf} program reads in a virtual property (\texttt{.vpl}) file and generates a font metric (\texttt{.tfm}) file and a virtual font (\texttt{.vf}) file. In so doing, it reads in the \texttt{.tfm} files of all the fonts that it uses.

The \texttt{xvptoxvf} program is the extended version of \texttt{vptovf}. It reads in an extended virtual property (\texttt{.xvp}) file and generates an extended font metric (\texttt{.xfm}) file and an extended virtual font (\texttt{.xvf}) file. In so doing, it reads in the \texttt{.tfm} and \texttt{.xfm} files of all the fonts that it uses.

\paragraph{Implementation.}
The changes to the \texttt{vptovf} program are all in \texttt{xvpvf.ed} in directory \texttt{fontutil}. There are currently no known problems.

\subsection{Changes to \texttt{dvicopy} to produce \texttt{xdvicopy}}
The \texttt{dvicopy} program is used to \emph{de-virtualize} a \texttt{.dvi} file: it reads in a \texttt{.dvi} file and replaces all references to virtual fonts with references to the appropriate real fonts.

The \texttt{xdvicopy} program does the same as \texttt{dvicopy}, except that it is also capable of reading \texttt{.xvf} and \texttt{.xfm} files.

\paragraph{Implementation.}
The changes to the \texttt{dvicopy} program are all in \texttt{xdvicp.ed} in directory \texttt{dviutil}. The current implementation is not optimal, in that it requires access to all \texttt{.tfm} files referred to in a virtual file, even if no characters in a given \texttt{.tfm} file are needed to print a document. A demand-driven mechanism would work better and save users' disk space.
\section{Bi-directional typesetting}
\OMEGA\ currently includes Peter Breitenlohner's TeX--XeT, which is a modified form of Knuth and Mackay's TeX-XeT. There are two primitives (\verb|\beginR| and \verb|\endR|) to bracket right-to-left text in left-to-right text and two primitives (\verb|\beginL| and \verb|\endL|) to bracket left-to-right text in right-to-left text. See Knuth and Mackay's paper for more details.

These primitives were essentially created for inserting bits of right-to-left text into left-to-right documents. They are not really suitable for real mixed-direction typesetting. This topic is still under research, as is mixing horizontal and vertical typesetting.

\paragraph{Implementation.}
The implementation of the changes presented in this section can be found in file \texttt{tex--xet1415.ch}. There is one known problem, which also shows up if \TeX\ has been modified with the change file. The following file will cause the system to hang:
\begin{verbatim}
\documentclass{article}
\begin{document}
\tableofcontents
\section{Ceci: ``\beginR titre\endR'' est \`a l'envers}
\end{document}
\end{verbatim}
The problem seems to be the interaction between the \verb|``| ligature and the \verb|\beginR| primitive.

\section{Character dimensions}
To simplify the acrobatics necessary for diacritic placement for certain alphabets, four new primitives (\verb|\charwd|, \verb|\chardp|, \verb|\charht|, and \verb|\charit|) are provided. When followed by an integer designating a character, they respectively provide the width, the depth, the height and the italic correction of the character. For example,
\begin{verbatim}
\charwd120
\end{verbatim}
can be considered to be an abbreviation of
\begin{verbatim}
\setbox250=\hbox{x}\wd250
\end{verbatim}
(character~120 being the letter~x in an \textsc{ascii}-based font), but without the side effect of creating a box and putting something inside it.

\paragraph{Implementation.}
These changes are implemented in the \texttt{omchar.ch} file.\\
There are currently no known problems.
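
The following lines are a minimal sketch of how the character-dimension primitives might be used. It assumes that they may appear wherever an internal dimension such as \verb|\wd250| is allowed, and that they refer to the current font; the abbreviation given above suggests both, but this draft does not state either explicitly. The register numbers are arbitrary.
\begin{verbatim}
\dimen0=\charwd120             % width of character 120
\dimen2=\charht120             % height of character 120
\advance\dimen0 by \charit120  % width plus italic correction
\end{verbatim}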
\section{New infinity}
To allow for inter-letter stretching in calligraphic scripts, such as Arabic, without having to rewrite macro packages, a new infinity level, \texttt{fi}, has been added. It is smaller than \texttt{fil} but bigger than any finite quantity. There is therefore a new keyword, \texttt{fi}, and there are two new primitives, \verb|\hfi| and \verb|\vfi|.

\paragraph{Implementation.}
These changes are implemented in the \texttt{omfi.ch} file. There are currently no known problems.

\section{\OMEGA\ Translation Processes}
\label{otp}
The changes described above are very useful, and allow the resolution of several problems. However, they do not radically alter the structure of \TeX. This is not the case for the \OMEGA\ Translation Processes, which allow text to be passed through any number of finite state automata, in order to impose the required effects.

These processes are necessary for translating one character set to another. They are also used to choose the various forms of letters in Arabic, to create consonantal clusters in Khmer, or to rearrange letter order in Indic scripts. They could also offer alternative means of changing texts to upper or lower case or of hyphenating texts.

Each translation process is placed in a file with the suffix \verb|.otp|. Its syntax is similar but not identical to that of a \texttt{lex} or \texttt{flex} file on Unix. Examples of translation processes can be found in the \texttt{otpexs} directory of the \OMEGA\ distribution.

An \verb|.otp| file defines a finite state automaton that transforms an input character stream into an output character stream. It consists of six parts:

\begin{tabular}{l}
\emph{Input}\\
\emph{Output}\\
\emph{Tables}\\
\emph{States}\\
\emph{Aliases}\\
\emph{Expressions}\\
\end{tabular}

\noindent where the \emph{Expressions} actually state what translations take place and in what situation.

In what follows, $n$ refers to an integer between 0~and 65~535.
It can be given in decimal form, octal form (preceded by \texttt{@'}) or hexadecimal form (preceded by \texttt{@"}). Hexadecimal numbers can use both minuscule and majuscule letters to express the digits~\emph{a--f}. Numbers can also be given in character form: a printable \textsc{ascii} character, when placed inside a pair of quotes, generates the \textsc{ascii} code for that character. For example, \verb|`a'| is equivalent to~\verb|@"61|. The \emph{Input} part states how many octets are in each input character. If the section is empty, then the default value is~2, since we hope that Unicode will become the standard means of communication in the future. If the section is not empty, it must be of the form \[ \mymathtt{input:}\;\mymathit{in}\mymathtt{;} \] where \emph{in} states how many octets are in each input character. The \emph{Output} part states how many octets are in each output character. If the section is empty, then the default value is~2, since we hope that Unicode will become the standard means of communication in the future. If the section is not empty, it must be of the form \[ \mymathtt{output:}\;\mymathit{out}\mymathtt{;} \] where \emph{out} states how many octets are in each output character. The \emph{Tables} part is used for defining tables that will be referred to later in the expressions. Often, translations from one character set to another are most efficiently presented through table lookup. This section can be empty, in which case no tables have been defined. If it is not empty, it is of the form \[ \mymathtt{tables:}\; \mymathit{table}^+ \] where each \emph{table} is of the form \[ \mymathit{id}\mymathtt{[}n\mymathtt{]=\{}n^+\mymathtt{\};} \] where the numbers in $n^+$ are comma-separated. The \emph{States} part is used to separate out the expressions. Not all expressions will necessarily be applicable in all situations. 
To do this, the user can name states and identify expressions with state names, in order to express what expressions apply when. This section can be empty, in which case there is only one state. If it is not empty, it is of the form \[ \mymathtt{states:}\; \mymathit{id}^+\mymathtt{;} \] where the identifiers in $\mymathit{id}^+$ are comma-separated. The \emph{Aliases} part is used to simplify the definition of the left hand sides of the expressions. Each expression consists of a left-hand side, in the form of a simplified regular expression, and of a right-hand side, which states what should be done with a recognized string. To simplify the definitions of the left-hand sides, aliases can be used. This section can be empty, in which case there are no aliases. If it is not empty, it is of the form \[ \mymathtt{aliases:}\; \mymathit{alias}^+ \] where each \emph{alias} is of the form \[ \mymathit{id}\;\mymathtt{=}\;\mymathit{left}\mymathtt{;}\] and \emph{left} is defined below. The \emph{Expressions} part is the very reason for an \verb|.otp| file. It states what translations must take place, and when. It cannot be empty, and its syntax is \[ \mymathtt{expressions:}\; \mymathit{expr}^+ \] Each \emph{expr} is of the form \[ \mymathit{leftState}\; \mymathit{totalLeft}\; \mymathit{right} \; \mymathit{pushBack} \; \mymathit{rightState} \mymathtt{;} \] where \emph{leftState} defines the state for which this expression is applicable, \emph{totalLeft} defines the left-hand-side regular expression, \emph{right} defines the characters to be output, \emph{pushBack} states what characters must be added to the input stream and \emph{rightState} gives the new state. 
Intuitively, if the automaton is in macro-state \emph{leftState} and the regular expression \emph{totalLeft} corresponds to a prefix of the current input stream, then (1)~the input stream is advanced to the end of the recognized prefix, (2)~the characters generated by the \emph{right} expression are put onto the output stream, (3)~the characters generated by the \emph{pushBack} stream are placed at the beginning of the input stream and (4)~the system changes to the macro-state defined by \emph{rightState}. The \emph{leftState} field can be empty. If it is not, its syntax is \[ \mymathtt{<} \mymathit{id} \mymathtt{>} \] The syntax for \emph{totalLeft} is \[ \mymathtt{beg:}? \; \mymathit{left}^+ \; \mymathtt{end:}? \] The \texttt{beg:}, if present, will only match the string if it is at the beginning of the input. The \texttt{end:}, if present, will only match the string if it is at the end of the input. The syntax for \emph{left} is given by \begin{eqnarray*} \mymathit{left} & ::= & n\\ & \mid & n\mymathtt{-}n\\ & \mid & \mymathtt{.}\\ & \mid & \mymathtt{(}\mymathit{left}^+\mymathtt{)}\\ & \mid & \mymathtt{\char94(}\mymathit{left}^+\mymathtt{)}\\ & \mid & \{\mymathit{id}\}\\ & \mid & \mymathit{left}\mymathtt{<}n\mymathtt{,}n?\mymathtt{>}\\ \end{eqnarray*} where the $\mymathit{left}^+$ means a series of \emph{left} separated by vertical bars. Therefore, $n$ means a single number, $n\mymathtt{-}n$ is a range, $\mymathtt{.}$~is a wildcard character, $\mymathtt{(}\mymathit{left}^+\mymathtt{)}$ is a choice, $\mymathtt{\char94(}\mymathit{left}^+\mymathtt{)}$ is the negation of a choice, $\mymathtt{\{}\mymathit{id}\mymathtt{\}}$ is the use of an alias and $\mymathit{left}\mymathtt{<}n\mymathtt{,}n?\mymathtt{>}$ means between $n$~and $n'$~occurrences of \emph{left}. Should there be no~$n'$, then the expression means at least $n$~occurrences. 
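
The following \emph{Aliases} fragment is a sketch of the forms of \emph{left} listed above. The alias names are hypothetical, and the fragment is written from the grammar as stated here; it has not been run through the \OMEGA\ tools.
\begin{verbatim}
aliases:
upper    = (@"41-@"5a);         % a range: one character from A to Z
blank    = (@"20 | @"09);       % a choice: space or tab
nonblank = ^(@"20 | @"09);      % negation of that choice
word     = ^(@"20 | @"09)<1,>;  % at least one non-blank character
pair     = .<2,2>;              % exactly two characters of any kind
\end{verbatim}
An expression could then use \verb|{upper}| or \verb|{word}| on its left-hand side in place of the full pattern.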
The syntax for \emph{right} is \[ \mymathtt{=>}\; \mymathit{stringExpr}^+ \] while that for \emph{pushBack}, if it is not empty, is \[ \mymathtt{<=}\; \mymathit{stringExpr}^+ \] The \emph{right} expression corresponds to the characters that are to be output. The \emph{pushBack} expression corresponds to the characters that are put back onto the input stream. A \emph{stringExpr} defines a string of characters, using the characters in the recognized input stream as arguments. It is of the form \begin{tabular}{ll} & $s$\\ $\mid$ & $n$\\ $\mid$ & \verb|\|$n$\\ $\mid$ & \verb|\$|\\ $\mid$ & \verb|\($-|$n$\verb|)|\\ $\mid$ & \verb|\*|\\ $\mid$ & \verb|\(*-|$n$\verb|)|\\ $\mid$ & \verb|\(*+|$n$\verb|)|\\ $\mid$ & \verb|\(*+|$n$\verb|-|$n'$\verb|)|\\ $\mid$ & \verb|#|\emph{arithExpr}\\ \end{tabular} \noindent where $s$~is an \textsc{ascii} character string enclosed in double quotation marks. The \verb|\|$n$ means the $n$-th character (starting from 1) in the recognized prefix; the \verb|\$| means the last character in the prefix; \verb|\($-|$n$\verb|)| the $n$-th, counting from the end. The \verb|\*| means the entire recognized prefix; \verb|\(*-|$n$\verb|)| the prefix without the last $n$~characters; \verb|\(*+|$n$\verb|)| without the first $n$~characters; \verb|\(*+|$n$\verb|-|$n'$\verb|)| removes the first~$n$ and last~$n'$ characters. For example, Indic scripts are encoded with vowels at the end of a syllable, but the vowel is actually printed first on the page. Up to six consonants can precede a vowel, yielding the following transliteration: \begin{verbatim} {consonant}<1,6> {vowel} => \$ \(*-1); \end{verbatim} The \emph{arithExpr} entry allows for calculations to actually be effected on the characters in the prefix. 
Their syntax is as follows: \begin{tabular}{ll} & $n$\\ $\mid$ & \verb|\|$n$\\ $\mid$ & \verb|\$|\\ $\mid$ & \verb|\($-|$n$\verb|)|\\ $\mid$ & \emph{arithExpr}\verb| + |\emph{arithExpr}\\ $\mid$ & \emph{arithExpr}\verb| - |\emph{arithExpr}\\ $\mid$ & \emph{arithExpr}\verb| * |\emph{arithExpr}\\ $\mid$ & \emph{arithExpr}\verb| div: |\emph{arithExpr}\\ $\mid$ & \emph{arithExpr}\verb| mod: |\emph{arithExpr}\\ $\mid$ & \emph{id}\verb|[|\emph{arithExpr}\verb|]|\\ $\mid$ & \verb|(|\emph{arithExpr}\verb|)|\\ \end{tabular} \noindent where \emph{id}\verb|[|\emph{arithExpr}\verb|]| means a table lookup: the \emph{id} must be a table defined in the \emph{Tables} section. The other operations should be clear. The following example shows the use of tables. \label{gb:unicode} \begin{verbatim} % File inbig5.otp % Conversion to Unicode from Chinese Big 5 (HKU) % Copyright (c) 1995 John Plaice and Yannis Haralambous % This file is part of the Omega project. % % This file was derived from data in the tcs program % ftp://plan9.att.com/plan9/unixsrc/tcs.shar.Z, 16 November 1994 % input: 1; output: 2; tables: in_big5_a1[@"9d] = { @"20, @"2c, @"2ce, @"2e, @"2219, @"2219, @"3b, @"3a, ... @"2199, @"2198, @"2225, @"2223, @"2215 }; in_big5[@"3695] = { @"3000, @"ff0c, @"3001, @"3002, @"ff0e, @"30fb, @"ff1b, @"ff1a, ... @"fffd, @"fffd, @"fffd, @"fffd, @"fffd }; expressions: @"1a => @"0a; @"00-@"a0 => \1; @"a1(@"40-@"7e) => #(in_big5_a1[\2-@"40]); @"a1(@"a1-@"fe) => #(in_big5_a1[\2-@"62]); (@"a2-@"fe)(@"40-@"7e) => #(in_big5[(\1-@"a2)*@"9d + \2-@"40]); (@"a2-@"fe)(@"a1-@"fe) => #(in_big5[(\1-@"a2)*@"9d + \2-@"62]); . . => @"fffd; \end{verbatim} In the future, more operations may well be added. Research is still under way for such things as providing means for defining functions, local variables, error handling and other functionality. The \emph{pushBack} part, which serves to put characters back onto the input stream, uses the same syntax as the \emph{right} part. 
When characters are placed back onto the input stream, they will be looked at upon the next iteration of the automaton.

Finally, the \emph{rightState} can be empty or one of the following three forms:

\begin{tabular}{ll}
& \verb|<|\emph{id}\verb|>|\\
$\mid$ & \verb|<push: |\emph{id}\verb|>|\\
$\mid$ & \verb|<pop:>|\\
\end{tabular}

\noindent If it is empty, the automaton stays in the same state. If it is of the form \verb|<|\emph{id}\verb|>|, then the automaton changes to state~\emph{id}. The \verb|<push: |\emph{id}\verb|>| means change to state~\emph{id}, but remembering the current state. The \verb|<pop:>| means return to the previously saved state.

There are a number of example \texttt{.otp} files in the \texttt{otpexs} directory in the \OMEGA~distribution. Most of them serve to convert national character sets to Unicode and back.

\section{Compiled Translation Processes.}
\OMEGA\ does not know anything about \OMEGA\ Translation Processes. It actually reads a compiled form of these filters, known as Compiled Translation Processes (file suffix \texttt{.ctp}). Essentially, the CTPs can be considered to be portable assembler programs, and \OMEGA\ includes an interpreter for the generated instructions.

The command for reading in a CTP file is similar to a font declaration. The example
\begin{verbatim}
\ctp\TexUni=TeXArabicToUnicode
\end{verbatim}
means that the file \verb|TeXArabicToUnicode.ctp| is read in by~\OMEGA\ and that internally the translation process is referred to as \verb|\TexUni|.

The CTPs consist of a sequence of 4-octet words.
The first seven words have the following form:

\begin{tabular}{ll}
\emph{lf}&length of the entire file, in words;\\
\emph{in}&number of octets in an input character;\\
\emph{ot}&number of octets in an output character;\\
\emph{nt}&number of tables;\\
\emph{lt}&number of words allocated for tables;\\
\emph{ns}&number of states;\\
\emph{ls}&number of words allocated for states;\\
\end{tabular}

\noindent The header words are followed by four arrays:
\begin{eqnarray*}
\mathit{table\_length} & : & \mathbf{array} \; [0..\mathit{nt}-1] \; \mathbf{of} \; \mathit{word}\\
\mathit{tables} & : & \mathbf{array} \; [0..\mathit{lt}-1] \; \mathbf{of} \; \mathit{word}\\
\mathit{state\_length} & : & \mathbf{array} \; [0..\mathit{ns}-1] \; \mathbf{of} \; \mathit{word}\\
\mathit{states} & : & \mathbf{array} \; [0..\mathit{ls}-1] \; \mathbf{of} \; \mathit{word}
\end{eqnarray*}

The \emph{table\_length} array states how many words are used for each of the tables in the~CTP. For the Big~5~$\rightarrow$~Unicode example on page~\pageref{gb:unicode}, the \emph{table\_length} would have two entries: hex values \texttt{9d} and~\texttt{3695}. The \emph{tables} array is simply the concatenation of the tables in the OTP file.

The \emph{state\_length} array states how many words are used for each of the states in the~CTP. For the Big~5~$\rightarrow$~Unicode example on page~\pageref{gb:unicode}, the \emph{state\_length} would have one entry. The \emph{states} array is simply the concatenation of the sequence of instructions for each state in the OTP file.

Each instruction takes one or two 4-octet words. Zero- and one-argument instructions use one word; two-argument instructions use two. If the instruction consists of one word, then the actual instruction is in the first two octets and the argument is in the last two octets. If the instruction consists of two words, then the actual instruction is in the first two octets, the first argument is in the next two octets and the last argument is in the last two octets.
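
As a sketch of the one-word layout just described, take \verb|OTP_RIGHT_NUM| (instruction number~2 in the list that follows) with argument hex~\texttt{fffd}. Assuming that octets are stored most significant first, as in \texttt{.tfm} files (this draft does not state the byte order), the instruction would occupy the four octets
\[ \underbrace{\mathtt{00}\;\mathtt{02}}_{\mbox{\scriptsize instruction}}\; \underbrace{\mathtt{ff}\;\mathtt{fd}}_{\mbox{\scriptsize argument}} \]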
The instructions are as follows: \begin{tabbing} \makebox[1cm][r]{99} \= \quad \verb|OTP_GOTO_NO_ADVANCE| \= \quad 2 arguments\kill \makebox[1cm][r]{1} \> \quad \verb|OTP_RIGHT_OUTPUT| \> \quad 0 arguments\\ \makebox[1cm][r]{2} \> \quad \verb|OTP_RIGHT_NUM| \> \quad 1 argument\\ \makebox[1cm][r]{3} \> \quad \verb|OTP_RIGHT_CHAR| \> \quad 1 argument\\ \makebox[1cm][r]{4} \> \quad \verb|OTP_RIGHT_LCHAR| \> \quad 1 argument\\ \makebox[1cm][r]{5} \> \quad \verb|OTP_RIGHT_SOME| \> \quad 2 arguments\\ \\ \makebox[1cm][r]{6} \> \quad \verb|OTP_PBACK_OUTPUT| \> \quad 0 arguments\\ \makebox[1cm][r]{7} \> \quad \verb|OTP_PBACK_NUM| \> \quad 1 argument\\ \makebox[1cm][r]{8} \> \quad \verb|OTP_PBACK_CHAR| \> \quad 1 argument\\ \makebox[1cm][r]{9} \> \quad \verb|OTP_PBACK_LCHAR| \> \quad 1 argument\\ \makebox[1cm][r]{10} \> \quad \verb|OTP_PBACK_SOME| \> \quad 2 arguments\\ \\ \makebox[1cm][r]{11} \> \quad \verb|OTP_ADD| \> \quad 0 arguments\\ \makebox[1cm][r]{12} \> \quad \verb|OTP_SUB| \> \quad 0 arguments\\ \makebox[1cm][r]{13} \> \quad \verb|OTP_MULT| \> \quad 0 arguments\\ \makebox[1cm][r]{14} \> \quad \verb|OTP_DIV| \> \quad 0 arguments\\ \makebox[1cm][r]{15} \> \quad \verb|OTP_MOD| \> \quad 0 arguments\\ \makebox[1cm][r]{16} \> \quad \verb|OTP_LOOKUP| \> \quad 0 arguments\\ \makebox[1cm][r]{17} \> \quad \verb|OTP_PUSH_NUM| \> \quad 1 argument\\ \makebox[1cm][r]{18} \> \quad \verb|OTP_PUSH_CHAR| \> \quad 1 argument\\ \makebox[1cm][r]{19} \> \quad \verb|OTP_PUSH_LCHAR| \> \quad 1 argument\\ \\ \makebox[1cm][r]{20} \> \quad \verb|OTP_STATE_CHANGE| \> \quad 1 argument\\ \makebox[1cm][r]{21} \> \quad \verb|OTP_STATE_PUSH| \> \quad 1 argument\\ \makebox[1cm][r]{22} \> \quad \verb|OTP_STATE_POP| \> \quad 1 argument\\ \\ \makebox[1cm][r]{23} \> \quad \verb|OTP_LEFT_START| \> \quad 0 arguments\\ \makebox[1cm][r]{24} \> \quad \verb|OTP_LEFT_RETURN| \> \quad 0 arguments\\ \makebox[1cm][r]{25} \> \quad \verb|OTP_LEFT_BACKUP| \> \quad 0 arguments\\ \\ \makebox[1cm][r]{26} \> \quad 
\verb|OTP_GOTO| \> \quad 1 argument\\ \makebox[1cm][r]{27} \> \quad \verb|OTP_GOTO_NE| \> \quad 2 arguments\\ \makebox[1cm][r]{28} \> \quad \verb|OTP_GOTO_EQ| \> \quad 2 arguments\\ \makebox[1cm][r]{29} \> \quad \verb|OTP_GOTO_LT| \> \quad 2 arguments\\ \makebox[1cm][r]{30} \> \quad \verb|OTP_GOTO_LE| \> \quad 2 arguments\\ \makebox[1cm][r]{31} \> \quad \verb|OTP_GOTO_GT| \> \quad 2 arguments\\ \makebox[1cm][r]{32} \> \quad \verb|OTP_GOTO_GE| \> \quad 2 arguments\\ \makebox[1cm][r]{33} \> \quad \verb|OTP_GOTO_NO_ADVANCE| \> \quad 1 argument\\ \makebox[1cm][r]{34} \> \quad \verb|OTP_GOTO_BEG| \> \quad 1 argument\\ \makebox[1cm][r]{35} \> \quad \verb|OTP_GOTO_END| \> \quad 1 argument\\ \makebox[1cm][r]{36} \> \quad \verb|OTP_STOP| \> \quad 0 arguments\\ \end{tabbing} The \verb|OTP_LEFT|, \verb|OTP_GOTO| and \verb|OTP_STOP| instructions are used for recognizing prefixes in an input stream. The \verb|OTP_RIGHT| instructions place characters on the output stream, while the \verb|OTP_PBACK| instructions place characters back onto the input stream. The instructions \verb|OTP_ADD| through to \verb|OTP_PUSH_LCHAR| are used for internal computations in preparation for \verb|OTP_RIGHT| or \verb|OTP_PBACK| instructions. Finally, the \verb|OTP_STATE| instructions are for changing macro-states. The system that reads from the input stream uses two pointers, which we will call \emph{first} and \emph{last}. The \emph{first} value points to the beginning of the input prefix that is currently being identified. The \emph{last} value points to the end of the input prefix that has been read. When a prefix has been recognized, then \emph{first} points to~\verb|\1| and \emph{last} points to~\verb|\$|. 
The \verb|OTP_LEFT_START| instruction, called at the beginning of the parsing of a prefix, advances \emph{first} to $\mathit{last}+1$; \verb|OTP_LEFT_RETURN| resets the \emph{last} value to $\mathit{first}-1$ (it is called when a particular \emph{left} pattern does not correspond to the prefix); \verb|OTP_LEFT_BACKUP| backs up the \emph{last} pointer by~1.

Internally, a CTP program uses a program counter (PC), which is simply an index into the appropriate state array. As in all assembler programs, this counter is normally incremented by 1 or~2, depending on the size of the instruction, but it can be abruptly changed through an \verb|OTP_GOTO| instruction. The argument in single-argument \verb|OTP_GOTO| instructions is the new~PC. For the two-argument instructions, the first is the comparand and the second is the new~PC should the test succeed. The \verb|OTP_GOTO| instruction itself is an unconditional branch; \verb|OTP_GOTO_NO_ADVANCE| advances \emph{last} by~1, and branches if it has reached the end of input; \verb|OTP_GOTO_BEG| branches at the beginning of input and \verb|OTP_GOTO_END| branches at the end of input. As for \verb|OTP_GOTO_|\emph{cond}, it succeeds if the character pointed to by \emph{last} (we'll call it \verb|*|\emph{last}) satisfies the test \emph{cond}(\verb|*|\emph{last}, \emph{firstArg}).

The \verb|OTP_STOP| instruction stops processing of the currently recognized prefix. Normally the automaton will be restarted with an \verb|OTP_LEFT_START| instruction.

When computations are undertaken for the \verb|OTP_RIGHT| and \verb|OTP_PBACK| instructions, a computation stack is used. This stack is accessed through instructions \verb|OTP_ADD| through to \verb|OTP_PUSH_LCHAR|, as well as through the instructions \verb|OTP_RIGHT_OUTPUT| and \verb|OTP_PBACK_OUTPUT|.

Since the \verb|OTP_RIGHT| and \verb|OTP_PBACK| instructions are analogous, only the former are described.
The \verb|OTP_RIGHT_OUTPUT| instruction pops a value off the top of the stack and outputs it; \verb|OTP_RIGHT_NUM|$(n)$ simply places $n$ on the output stream; \verb|OTP_RIGHT_CHAR|$(n)$ places the $n$-th input character on the output stream; \verb|OTP_RIGHT_LCHAR| does the same, but from the back; finally, \verb|OTP_RIGHT_SOME| places a substring onto the output stream.

Three instructions are used for placing values on the stack: \verb|OTP_PUSH_NUM|$(n)$ pushes $n$ onto the stack, \verb|OTP_PUSH_CHAR|$(n)$ pushes the $n$-th character and \verb|OTP_PUSH_LCHAR|$(n)$ does the same from the end.

The arithmetic operations of the form \verb|OTP_|\emph{op} apply the operation
\begin{eqnarray*}
\mathit{stack}[\mathit{top}-1] & := & \mathit{stack}[\mathit{top}-1] \; \mathit{op} \; \mathit{stack}[\mathit{top}]
\end{eqnarray*}
where \emph{top} is the stack pointer, and then decrement the stack pointer. Finally, the \verb|OTP_LOOKUP| instruction applies the operation
\begin{eqnarray*}
\mathit{stack}[\mathit{top}-1] & := & \mathit{stack}[\mathit{top}-1][\mathit{stack}[\mathit{top}]]
\end{eqnarray*}
and then decrements the pointer.

Last, but not least, are the \verb|OTP_STATE| instructions, which manipulate a stack of macro-states. The initial state is always~0. The \verb|OTP_STATE_CHANGE|$(n)$ instruction changes the current state to state~$n$; \verb|OTP_STATE_PUSH|$(n)$ pushes the current state onto the state stack before changing the current state to~$n$; \verb|OTP_STATE_POP| pops the state at the top of the state stack into the current state.

\section{Translation process lists}

Translation processes can be used for a number of different purposes. Since not all uses can be foreseen, we have decided to offer a means to dynamically reconfigure the set of translation processes that are passing over the input text. This is done using stacks of translation process lists.

For any single purpose, for example to process a given language, several CTPs might be required.
If one makes a context switch, such as processing a different language, then one would like to be able to quickly replace \emph{all} of the CTPs that are currently being used. This is done using CTP lists.

A CTP list is actually a list of pairs. Each pair consists of a positive scaled value and a doubly ended queue of CTPs. For example,
\begin{verbatim}
\ctplist\ArabicCTP=[(1.0 : \TexUni,\UniUniTwo,\UniTwoFont)]
\end{verbatim}
the output from \OMEGA\ once the CTP list \verb|\ArabicCTP| has been typed, shows that that list has one element, namely the pair with the scaled value~1.0 and the doubly ended queue with three CTPs, \verb|\TexUni|, \verb|\UniUniTwo| and \verb|\UniTwoFont|.

CTP lists are built up using the five operators \verb|\nullctplist|, \verb|\addbefore|\-\verb|ctp|\-\verb|list|, \verb|\addafterctplist|, \verb|\removebeforectplist| and \verb|\removeafter|\-\verb|ctp|\-\verb|list|. For example, the above output was generated by the following sequence of \OMEGA\ statements:
\begin{verbatim}
\ctp\TexUni=TeXArabicToUnicode
\ctp\UniUniTwo=UnicodeToContUnicode
\ctp\UniTwoFont=ContUnicodeToTeXArabicOut
\ctplist\ArabicCTP=
  \addbeforectplist 1 \TexUni
  \addbeforectplist 1 \UniUniTwo
  \addbeforectplist 1 \UniTwoFont
  \nullctplist
\end{verbatim}

The \verb|\ctplist| command is similar to the \verb|\ctp| command:\\ \verb|\ctplist|~\emph{listName}~\verb|=|~\emph{ctpListExpr}. All \emph{ctpListExpr} are built up from either the empty CTP list, \verb|\nullctplist|, or from an already existing CTP list. In the latter case, the list is completely copied, to ensure that the named list is not itself modified.

Given a list~$l$, the instruction \verb|\addbeforectplist|~$n$~\emph{ctp}~$l$ states that the CTP \emph{ctp} is added at the head of the doubly ended queue for value~$n$ in list~$l$. If that queue does not exist, it is created and inserted in the list so that the scaled values are all in increasing order.
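As a small illustration of the ordering and copying rules, here is a sketch that assumes the semantics just described. The names \verb|\ctpW|, \verb|\ctpX|, \verb|\ctpY|, \verb|\ctpZ|, \verb|\DemoCTP| and \verb|\DemoTwoCTP| are hypothetical; the four CTPs are taken to have been defined already with \verb|\ctp|. The scaled values are deliberately given out of order, and the second list is derived from the first.
\begin{verbatim}
%% Hypothetical names; values 2, 3, 1 are given out of order.
\ctplist\DemoCTP=
  \addbeforectplist 2 \ctpY
  \addbeforectplist 3 \ctpZ
  \addbeforectplist 1 \ctpX
  \nullctplist
%% Built from a copy of \DemoCTP.
\ctplist\DemoTwoCTP=
  \addbeforectplist 2 \ctpW
  \DemoCTP
\end{verbatim}
If the rules are read as above, the queues should come out sorted by scaled value, \verb|\ctpW| should land at the head of the existing queue for value~2, and \verb|\DemoCTP| itself should be left unchanged because it is copied. In the notation of the output shown earlier, one would therefore expect:
\begin{verbatim}
\ctplist\DemoCTP=[(1.0:\ctpX), (2.0:\ctpY), (3.0:\ctpZ)]
\ctplist\DemoTwoCTP=[(1.0:\ctpX), (2.0:\ctpW,\ctpY), (3.0:\ctpZ)]
\end{verbatim}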
The instruction \verb|\addafterctplist|~$n$~\emph{ctp}~$l$ does the same, except the addition takes place at the tail of the doubly ended queue. The instruction \verb|\removebeforectplist|~$n$~$l$ removes the CTP at the head of the doubly ended queue numbered~$n$. The instruction \verb|\removeafterctplist|~$n$~$l$ does the same at the tail of the doubly ended queue. See the next section for more examples. \section{Input Filters} Here we come to the crucial parts of \OMEGA. What happens to the input stream as it passes through translation processes? What is the interaction between \TeX's macro-expansion and \OMEGA's translation processes? When \OMEGA\ is in horizontal mode and encounters a \emph{letter}, \emph{other\_char}, \emph{char\_given} or \emph{char\_num}, that character and all the successive characters in those categories are read into a buffer. The currently active CTP is applied to the buffer, and the result is placed back onto the input, to be reread by the standard \TeX\ input routines, including macro expansion. The currently active CTP is designated by a pair $(v,i)$, where $v$~is a scaled value and $i$~is an integer. If all the enabled CTPs are in a CTP list, then the~$v$ designates the index into the CTP list and the~$i$ designates which element in the $v$-queue is currently active. Once a CTP has been used, the~$i$ is incremented; if it points to the end of the current queue, then $v$~is set to the next queue, and $i$~is reset to~1. When the last enabled CTP has been used, then the standard techniques for treating letters and other characters are used, namely generating paragraphs, etc. What this means is that it is now possible to apply a filter on the \emph{text} of a file without macro-expansion, generate a new text, possibly with macros to be expanded, macro-expand, re-apply filters, etc. All this without active characters, and without breaking macro packages. How are CTP lists enabled? 
CTP lists are placed on a stack, each numbered queue in a given list masking the queues with the same number for the lists below that one on the stack. There are three commands, which all respect the grouping mechanism. The \verb|\clearctplists| command disables all CTP lists. The \verb|\pushctplist|~\emph{CTPlist} command pushes \emph{CTPlist} onto the stack. The \verb|\popctplist| command pops the last list from the stack.

For example, consider the following purely hypothetical situations:
\begin{verbatim}
\ctplist\FrenchCTP =
  \addbeforectplist 1 \ctpA
  \addbeforectplist 2 \ctpB
  \addbeforectplist 3 \ctpC
  \nullctplist
\end{verbatim}
\begin{verbatim}
\ctplist\GermanCTP =
  \addbeforectplist 1 \ctpD
  \addbeforectplist 2 \ctpE
  \addbeforectplist 3 \ctpF
  \nullctplist
\end{verbatim}
\begin{verbatim}
\ctplist\ArabicCTP =
  \addbeforectplist 1 \ctpG
  \addbeforectplist 2 \ctpH
  \addbeforectplist 2 \ctpI
  \addbeforectplist 3 \ctpJ
  \nullctplist
\end{verbatim}
\begin{verbatim}
\ctplist\SpecialArabicCTP =
  \addafterctplist 3 \ctpK
  \ArabicCTP
\end{verbatim}
\begin{verbatim}
\ctplist\UpperCaseCTP =
  \addbeforectplist 2.5 \ctpL
  \nullctplist
\end{verbatim}
There are now 5 CTP lists \emph{defined}, but none of them are \emph{enabled}.
The defined lists are:
\begin{verbatim}
\ctplist\FrenchCTP = [(1.0:\ctpA), (2.0:\ctpB), (3.0:\ctpC)]
\ctplist\GermanCTP = [(1.0:\ctpD), (2.0:\ctpE), (3.0:\ctpF)]
\ctplist\ArabicCTP = [(1.0:\ctpG), (2.0:\ctpH,\ctpI), (3.0:\ctpJ)]
\ctplist\SpecialArabicCTP = [(1.0:\ctpG), (2.0:\ctpH,\ctpI), (3.0:\ctpJ,\ctpK)]
\ctplist\UpperCaseCTP = [(2.5:\ctpL)]
\end{verbatim}
Consider now the sequence of instructions
\begin{verbatim}
\clearctplists
\pushctplist\FrenchCTP
\pushctplist\UpperCaseCTP
\pushctplist\GermanCTP
\popctplist
\popctplist
\pushctplist\ArabicCTP
\pushctplist\SpecialArabicCTP
\pushctplist\GermanCTP
\end{verbatim}
The effective enabled CTP list is, in turn:
\begin{verbatim}
[]
[(1.0:\ctpA), (2.0:\ctpB), (3.0:\ctpC)]
[(1.0:\ctpA), (2.0:\ctpB), (2.5:\ctpL), (3.0:\ctpC)]
[(1.0:\ctpD), (2.0:\ctpE), (2.5:\ctpL), (3.0:\ctpF)]
[(1.0:\ctpA), (2.0:\ctpB), (2.5:\ctpL), (3.0:\ctpC)]
[(1.0:\ctpA), (2.0:\ctpB), (3.0:\ctpC)]
[(1.0:\ctpG), (2.0:\ctpH,\ctpI), (3.0:\ctpJ)]
[(1.0:\ctpG), (2.0:\ctpH,\ctpI), (3.0:\ctpJ,\ctpK)]
[(1.0:\ctpD), (2.0:\ctpE), (3.0:\ctpF)]
\end{verbatim}

The first test of the CTP lists was for Arabic. The text was typed in \textsc{ascii}, using a Latin transliteration. This text was first transformed into Unicode, the official 16-bit encoding for the world's character sets. These letters were then translated into their appropriate visual forms (isolated, initial, medial or final) and then the text was translated into the font encoding. During the second translation, inter-letter black spacing is inserted, since Arabic typesetting calls for word expansion to fill out a line.
Here is the input:
\begin{verbatim}
\font\ARfont=oar10 scaled 1728 offset 256 %% an X-font
\def\keshideh{%
  \begingroup\penalty10000%
  \clearctplists\xleaders\hbox{\char'767}\hskip0ptplus1fi%
  \endgroup}
\ctp\TexUni=TeXArabicToUnicode
\ctp\UniUniTwo=UnicodeToContUnicode
\ctp\UniTwoFont=ContUnicodeToTeXArabicOut
\ctplist\ArabicCTP=%
  \addbeforectplist 1 \TexUni
  \addbeforectplist 1 \UniUniTwo
  \addbeforectplist 1 \UniTwoFont
  \nullctplist
\def\AR#1{\begingroup\noindent\pushctplist \ArabicCTP%
  \ARfont\language=255\beginR\quad #1\hfill\endR\endgroup}
\end{verbatim}
Notice that the \verb|\keshideh|, which is dynamically inserted between letters by the \verb|\UniUniTwo| CTP, uses the \verb|fi| infinity. It also disables all of the CTPs, within a group.

\section{Automatic detection of character sets}

Most character sets belong to one of three groups:
\begin{enumerate}
\item 8-bit character sets (including shift character sets) that include \textsc{ascii};
\item 8-bit character sets (including shift character sets) that include \textsc{ebcdic}; and
\item 16-bit character sets that include \textsc{ascii} as the first 128~characters, such as Unicode.
\end{enumerate}
In a multilingual, heterogeneous environment, it is inevitable that different files will be written using different character sets. It is even possible that the same file might have different parts that use different character sets. How is it possible to tag these files internally so that \OMEGA\ can apply the right translations?

\OMEGA\ has two basic modes of input: the old \TeX\ style, or the automatic \OMEGA\ style. The old \TeX\ style is turned on when the \verb|\noInputMode| command is read. The default mechanism is to use the automatic \OMEGA\ style.

If the \OMEGA\ style is being used, there are three modes, \texttt{ascii}, \texttt{ebcdic} and \texttt{unicode}, which correspond to the three situations above. Upon opening a file, \OMEGA\ reads the first two characters.
If the first character is hex~\texttt{25} (\texttt{ascii}~\verb|%|), \OMEGA\ assumes that the input character set is \texttt{ascii}. If the first character is hex~\texttt{6c} (\texttt{ebcdic}~\verb|%|), \OMEGA\ assumes that the input character set is \texttt{ebcdic}. Finally, if the first two characters form hex~\texttt{0025} (Unicode~\verb|%|), \OMEGA\ assumes that the input character set is Unicode. If none of these three situations occurs, then the default input mode is assumed.

Here are the instructions for specifying modes. All of these instructions apply only after the carriage return terminating the current input line. The \verb|\inputMode|~\emph{mode} command, where \emph{mode} is one of \texttt{ascii}, \texttt{ebcdic} or \texttt{unicode}, states that after the carriage return, the input mode is~\emph{mode}. The \verb|\noInput|\-\verb|Mode| command states that the old \TeX\ style should be used. The \verb|\default|\-\verb|Input|\-\verb|Mode|~\emph{mode} instruction states that the default mode, used when there is no comment character at the beginning of a file, should be \emph{mode}. As for \verb|\noDefaultInputMode|, it states that there is no default mode, and that whatever settings existed when opening the file should remain.

The default mode when the system begins is \OMEGA\ style, assuming \texttt{ascii}. This is sufficient for all the \texttt{iso-8859} character sets, many national character sets, and most mixed-length character sets used in East Asia.

Once the basic family of character sets has been determined, \OMEGA\ can read the files, and actually interpret control sequences. It is then possible to be more specific and to specify exactly what translation process must be applied to the entire file to convert the input to Unicode.

For the moment, input translations are simply single CTPs, which differ from input filters in that they apply to \emph{all} characters in a file, not simply the letters and other characters in horizontal mode.
For each kind of mode, there can be a default input translation. As with the mode instructions, each instruction only applies after the carriage return terminating the current line. The \verb|\inputTranslation|~\emph{ctp} command states that after this line, all input will be passed through translation process~\emph{ctp}. The \verb|\noInputTranslation| command states that no input will be translated. The \verb|\defaultAsciiInputTranslation|~\emph{ctp}, \verb|\defaultEbcdicInputTrans|\-\verb|lation|~\emph{ctp} and \verb|\defaultUnicodeInputTranslation|~\emph{ctp} commands state what the default translations will be for each of the modes. Finally, the \verb|\noDefault|\-\verb|Ascii|\-\verb|Input|\-\verb|Translation|, \verb|\noDefault|\-\verb|Ebcdic|\-\verb|Input|\-\verb|Translation| and \verb|\noDef|\-\verb|ault|\-\verb|Unicode|\-\verb|Input|\-\verb|Translation| commands remove default translations.

Upon startup, there is no default translation for \texttt{ascii} or \texttt{unicode} modes, but there is one for \texttt{ebcdic}, namely
\begin{verbatim}
\ctp\InputEBCDIC=inebcdic
\defaultEbcdicInputTranslation\InputEBCDIC
\end{verbatim}

\section{Further work}

Translations should be applied to output and to \verb|\special| sequences as well. This has not yet been implemented, but will be soon.

Furthermore, the standard \verb|^^| and \verb|^^^^| forms used by \OMEGA\ will soon be implemented as CTPs. Doing so, however, requires that input translations be CTP lists rather than single CTPs, and the implementation of this requires more thought.

We hope that this cursory documentation suffices to experiment with~\OMEGA. More detailed documentation will follow.

\end{document}