Microsoft RTF Specification Nightmare

Submitted by Hannes Schmidt on Tue, 06/08/2004 - 13:55.

Have you ever seen a word processor other than Microsoft's own Word that can import an RTF (Rich Text Format) file properly? I have not. The reason lies in RTF's inherent complexity and its strong dependency on Microsoft's internal implementation of Word documents. The RTF format is basically a 7-bit-safe, serialized version of a Word document's in-memory representation, plus some tweaks that ensure backward compatibility with older programs that read RTF files. For every version of Microsoft's office suite there is a corresponding RTF specification.

This indicates the tight coupling between Word internals and RTF. Why is this coupling bad? A proper file format specification should be based on an abstraction of the actual format implementation, so that inevitable implementation irregularities do not leak into the file format. An abstraction provides a clean, consistent and comprehensible model of a file format’s syntax and the corresponding semantics. The RTF spec lacks many of the semantic definitions necessary to read, write and modify documents. This is intentional. Microsoft is not interested in making it easy to process documents native to its office suite. RTF makes it easy for Microsoft applications to read and write documents because the in-memory representation of a document (as messed up as it is) maps directly to its RTF form. For a programmer who doesn’t have access to Word’s source code, the semantics of a document remain nebulous.

Example: With Word 2000, Microsoft introduced nested tables, i.e. tables within cells of a table. To stay compatible with older readers, various hacks were incorporated into version 1.6 of the RTF specification (mind you, this is my unofficial interpretation of the spec):

  • The syntax of traditional pre-1.6 tables does not allow for unambiguous nesting of tables using the traditional control words \row and \cell. For example, “\cell \row \cell \row” can be interpreted as a simple table with two rows and one column or as a one-by-one table containing another one-by-one table. Consequently, new keywords were added: \nestrow and \nestcell in conjunction with \nonesttables and \nesttableprops.
  • The row definition \trowd of a nested row is wrapped in an ignorable destination (a group that starts with \*, here \nesttableprops), so that older readers skip it.
  • For reasons not further explained in the spec, the row definition of an outer table is repeated after every occurrence of an inner table in a cell of the outer table. The result looks totally messed up. A 1.6-compliant reader should skip these redundant \trowds and only read the \trowd that occurs after a \row. To make things worse, these redundant \trowds are optional. This must be parsing hell.
Submitted by Anonymous on Sat, 07/23/2005 - 22:33.

I spent many months writing an RTF-to-XML script.

http://rtf2xml.sourceforge.net/

I thought I had everything worked out. Now it turns out that my encoding is wrong. For some characters, RTF uses \u800, where the number represents the Unicode value. But for others it uses \u-8000. I tried to download the specs to figure out why, but you need a Windows operating system.
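
For what it is worth, the mix of \u800 and \u-8000 matches how the RTF specification defines \uN: the parameter is a signed 16-bit number, so code points above 32767 are written as negative values, and adding 65536 recovers the code point. Below is a minimal Python sketch of that conversion. The function names are made up, and the regular expression assumes the default of one fallback character after each \uN (that is, \uc1); surrogate pairs and escaped backslashes are not handled.

```python
import re

# \uN, an optional delimiting space, then one fallback character:
# either a \'hh hex escape or a single literal character. Assumes \uc1.
_U_ESCAPE = re.compile(r"\\u(-?\d+) ?(?:\\'[0-9a-fA-F]{2}|[^\\{}])?")


def u_param_to_char(n):
    r"""Map the signed 16-bit parameter of \uN to the character it encodes."""
    if n < 0:
        n += 0x10000  # e.g. -8000 becomes 57536 (U+E0C0)
    return chr(n)


def decode_u_escapes(text):
    r"""Replace each \uN (plus its single fallback character) in `text`."""
    return _U_ESCAPE.sub(lambda m: u_param_to_char(int(m.group(1))), text)


print(hex(ord(u_param_to_char(800))))    # prints 0x320
print(hex(ord(u_param_to_char(-8000))))  # prints 0xe0c0
```

A real converter would also have to track \ucN as it changes from group to group instead of assuming a single fallback character.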

It is such a nightmare!

Submitted by Anonymous on Fri, 05/20/2005 - 05:38.

AbiWord (http://www.abisource.com) does a pretty good job on RTF import (as well as doc import), but yes, RTF is one messy hack built on top of Word's internal data structures (if it is any consolation, that makes it quite a useful source of information on the Word binary format, which has been undocumented since Word 8).