02 August

Microsoft has fun at my expense! (RTF Specification Version 1.9.1)

The beauty of Rich Text files

On the 19th of March 2008 Microsoft released the latest incarnation of the Rich Text Format specification, 1.9.1, to go with their new Office 2007... Interesting, because Office '08 is available on the Apple Mac but the document doesn't mention that, and only shows images from Word 2003 on the PC.

Anyway... I'm working on a program to replace WordPad, not because I don't like WordPad but because I do. I just think that since it hasn't changed since Windows 95, and then not much from Write on Windows 3.x, it could do with a little more... especially in the way of internet appliance... read and edit html, blog posts etc. So I'm writing my own, keeping the UI as much like WordPad (or Write) in its default configuration as possible, keeping the code size small and the startup time fast, and ensuring that my replacement can do everything that the original can.

Access to the Rich Edit component (riched32.dll or riched20.dll) is a really quick way of maintaining a simple Word Processor, and it's mostly what has allowed WordPad to remain as current as it is. Every time Microsoft updates Rich Edit, WordPad gets that update automatically, because it's really just a user interface to that library. Of course, as the Rich Edit component gets new features (not just improvements to existing ones) WordPad falls behind... and Rich Edit is notoriously bug ridden from an API point of view. Interfaces that are documented as fixed and working fine don't actually do anything, and ones which are documented as having bugs under specific documented circumstances don't present those bugs under those exact same circumstances, but it's more a documentation issue than a code one... and, as previously stated, the library is now very very old.

So I'm using Rich Edit to get my program up and running. I'm passing Rich Text between it and my filters and modifying it at quite a low level.
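For anyone who hasn't looked inside an RTF file, here is a rough idea of what "low level" means. This is only a minimal sketch in Python, not code from my program: the names `rtf_escape` and `make_rtf` are made up for illustration, and it only covers the basics (escaping backslashes and braces, `\par` for line breaks, `\uN?` for non-ASCII characters, and a one-entry font table).

```python
def rtf_escape(text: str) -> str:
    """Escape plain text so it can sit inside an RTF group."""
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax, so they need a backslash.
            out.append("\\" + ch)
        elif ch == "\n":
            out.append("\\par\n")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit decimal per UTF-16 code unit;
            # the '?' is the fallback for readers without Unicode support.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                n = int.from_bytes(units[i:i + 2], "little", signed=True)
                out.append(f"\\u{n}?")
    return "".join(out)


def make_rtf(text: str, font: str = "Times New Roman", size_pt: int = 12) -> str:
    """Wrap escaped text in a minimal RTF document (one font, one size)."""
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 " + font + ";}}"
    # \fs is measured in half-points, hence size_pt * 2.
    return header + f"\\f0\\fs{size_pt * 2} " + rtf_escape(text) + "}"


if __name__ == "__main__":
    print(make_rtf("Braces {like these} and a backslash \\ need escaping."))
```

Anything a real filter does (tables, styles, embedded objects) is built on top of the same groups-and-control-words idea.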
I'm testing my program with versions of riched32.dll that came with Windows 95 and ones that come with Vista, and ones that are in the latest SP for NT4, and the compatibility libraries that are packaged with Wine. (Just to be safe.) I'm getting Rich Text files from Macs, from NeXT, from Linux and from Word Processors of every caliber.

"I really ought to know about the state of play with the format itself, and have some idea how to re-implement or replace this library should the need arise in my program," I thinks... So I asked the inventor of the standard (Microsoft) what their present documentation on this open standard for information interchange is. I received a docx file... (though it seems they have a .doc up now too).

Okay... I've had docx files before, I've got translation filters for O2k-2k3 and for OOo. They work, not great, but they work. Oh no! Not on this they don't. The tables are a complete mess and, though I can read the words... making sense of the document is a nightmare. It's rather like a complex scientific journal with all the diagrams thrown away, run through a cheap 1980s OCR program and turned into a UNIX ASCII text file without any formatting, only worse. There is formatting, it just doesn't resemble the original formatting of the document in any way, shape or form.

I don't want Office 2007, and I don't use the copies of Office 97, 2000 or 2003 that I legally own. I much prefer to use Open Office, or WordPerfect Office, or anything other than Microsoft Office. I've said it before, I'll say it again: it's not that MS Office is bad, I just don't like it. I know they spend a lot of time and effort on getting their UI right, but I'm happy with WordPad, I'm happy with a DOS command prompt and bash scripts, it's just who I am.
I'll spend hours replacing Explorer with little third party desktops, icon widgets, launch bars and file browsers. I like things to look good, but I need to be able to make them look and act the way I want, not the way your panel of testers say is ergonomically correct for the majority. There is no "I" in democracy, and "I" want "My PC" to work how "I" want, not how the majority vote it should. It's not their PC, they can get their own.

So now that I've used Microsoft's Live Writer software to write and upload my whinge about their best seller onto their servers, they can shoot me for stealing their software (for about a day). I installed the 90 day trial of Office 2007 in a virtual machine, read this file and promptly removed it again by going back 1 snapshot. About as legal as I could get away with, and way too much effort just to read a document that is supposed to enable free transfer of information between diverse systems. (It's only not really legal because it wasn't my copy of the trial, and because I rolled back the drive rather than uninstalling, so I could theoretically install it again sometime down the road.)

So how Open is Microsoft's Open XML file format? Not very, it seems. I can read an ODT file anywhere I can read Google, which is even on a Phoney Praystation! Yet I can't read a docx even on my Microsoft Vista PC with Office 03... and if reports are to be believed, it will be hard to read them on versions of Office yet to come if MS are to implement their own ISO standard format, which isn't compatible with the existing docx at all. Sigh.

Office 2k7 Hate (my hate, you can love it all you like)

While on the MSO chat, and yes I know I was going to bring you W2.0 Spreadsheets next, I will get there I promise, I have to add my tuppence worth on Clippy and the Ribbon. Yea, I hated Mr. Clippit (otherwise known as Clippy the paperclip Office Assistant), though I will miss Paws (the Cat) and Albert (the Genius). More importantly, I miss menus and toolbars.
I only wanted to load this file and save it as something I could use... I ended up spending the day loading XP, Office 07 and boxing with a constantly changing blue ribbon... Every time I found an option I wanted to use, I'd move the cursor to where I wanted to apply that tool or effect and... hey! Where'd it go? Everything's changed! Blue Ribbons should remain fiendishly tasty and reasonably priced chocolate wafer snacks and stay the hell off my PC is all I can say.

You want a revolutionary new design? Try putting the toolbars down the sides instead of up the top... have you tried using O2k7 on a widescreen display? Very popular these days... not very practical for word processing, but with OOo I can pin all my toolbars and property pages, document navigation and defined paragraph formats on the sides, maximizing the vertical document editing space, and making practical use of the extra screen width. OOo wins!!! The Office Suite of the future! Hooray!

Okay... so I managed to get O2k7 running in my VM (ugly as it is: at 1024x576 I could get about 3 lines of 8 point text in at page width before the ribbon, and had 3 pixel high text at 80% document view where I could at least see a whole paragraph), and loaded the docx. Hooray! The page numbers matched the pages, and the tables had columns below them that actually related to the column headers. So I saved my document out as a .doc, and a .odt, and an rtf, and a PDF and an XPS. "That, I should be able to do something with," I thought, and ditched 2k7 like the shallow painted tart it is.

Getting Something Useful

Reading any of these documents in anything else was quite a trial however. WordPad seemed to do the best job with the Rich Text file. But of course it doesn't support document links, page breaks and a myriad of other features that are actually quite useful in a 278 page document. The PDF and XPS are fine for reading, but the document was locked from editing or copy and paste.
So copying the source code would be a matter of printing and re-typing. That's not very practical either. The .doc file read back about as well as the docx via translation filters, and it turns out (after much re-working of the internals of the file, trying to maintain the layout and feel) that most of what is wrong with it is that it has been written by someone who has no idea how to use a professional document editing tool like Word. (Or rather, it appeared to have been worked on by several someones, at least one of whom had a very good idea how to manage a large document in a decent word processor, but sadly they weren't in charge of managing the consistency of the document.)

The tables were messed up because they were full of 0 width columns that had been created part way down the table by splitting cells; rather than removing unnecessary columns, someone had just shifted them along until they met the boundary of the next and/or previous one. Fields had (at some stage) been used to create the page numbers in the contents section, but then they were converted to constants, and links were made to _toc1354375138 named bookmarks which resided at the same point as a decently named and perfectly linkable heading.

I know I've taken word processing courses, and am IT literate enough to get around these things... I know that many of my colleagues in programming and system maintenance haven't and/or aren't, but surely Microsoft could get a secretarially trained document specialist to collate the information from the techies?

Anyway. I reworked this document in OOo Writer, and in AbiWord, and in a little gem known as Jarte (which reads both the .doc and .docx formats as well as .rtf, with the right filters, but sadly goes the way of 2k7 in UI design) and now I have the document in a form that is instantly useable by almost anybody.
One Document to Rule them all...

So, before I upset Microsoft again by republishing their hard work in an edited form, here are some interesting details about this document.

Size (one of the reasons Microsoft cite for the switch to docx):-
Okay... so docx is a lot smaller than a .doc... but not all that much smaller than the zipped .doc, and .docx won't zip because it's already in a .zip file, just like an .odt. MSOffice makes smaller PDFs, but it used JPEG compression on images even against my wishes, and made a horrible PDF which compressed worse than the originally larger OOo version. By horrible, I mean the navigation is just every possible link to a location listed down the left hand side with no levels whatsoever. OOo made a PDF with a pull out navigation tree that mimicked the contents of the document.

RTF actually zips quite nicely. I'd say a PDF in a Zip is a pretty good binary distribution form. XPS files are pretty big, and not so easy to navigate as PDFs. I don't really see what Microsoft is trying to achieve here... other than that it is a plain text XML format in a Zip, just like odts and docxs, so it doesn't need decompiling to edit the way a PDF does, a simple unzip will do. The formats which aren't zipped (or compiled binary) already actually pack down smaller than most of the zipped XML formats... so we're really not saving any space at all. In fact, we're losing it: you can't usefully RAR, ACE or 7ZIP a zip, it just doesn't get any smaller. (Most modern PDFs should be Flate compressed, the same as a zip, though how thoroughly is up to the creator.)

Also of interest, I have discovered that if I unzip a .docx, .odt or .xps and pack it back up with ALZip (which isn't the best Zip program by any stretch, but it's cute, small, fast and very easy to use) the files become smaller... changing the resultant zip's extension back to .docx, .odt or .xps makes them still perfectly readable in their new smaller size.

I've tried to get this document to open legibly in as many readily available packages as possible. I've tried Atlantis Nova, Angel Writer, QJot etc etc, all of which I consider in some way to be competitors for my upcoming WordPad replacement. Most struggle with the tables.
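The unzip-and-repack trick mentioned above can be tried without ALZip. This is only a rough sketch using Python's standard `zipfile` module, assuming the container is an ordinary zip; the function name `repack` is mine, and I'm assuming that entries stored uncompressed (such as the `mimetype` entry at the front of an .odt) should stay uncompressed and in the same order.

```python
import zipfile


def repack(src, dst, level=9):
    """Copy every entry of a zip-based document into a new archive,
    recompressing with Deflate at the given level."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED, compresslevel=level) as zout:
        # infolist() preserves the original entry order.
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.compress_type == zipfile.ZIP_STORED:
                # Leave deliberately uncompressed entries alone.
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            zout.writestr(info, data, compress_type=method, compresslevel=level)


if __name__ == "__main__":
    import os
    import sys

    source, target = sys.argv[1], sys.argv[2]
    repack(source, target)
    print(os.path.getsize(source), "->", os.path.getsize(target), "bytes")
```

Run it as `python repack.py in.docx out.docx` (the script name is whatever you saved it as). Whether the result is actually smaller depends on how hard the original writer tried, so treat any saving as a bonus rather than a promise.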
Some, most notably AbiWord, struggle with the sheer size of the document. Jarte reads the whole file, but stops counting the pages when they reach 59, and only saves that many pages. Angel Writer copes with all the formatting best, but doesn't implement pages or wrap to ruler, so you can't really treat it like a mark up for a paper document. WordPad copes the best overall, but again, page breaks just don't happen as it has no idea what a "page" is till you hit print preview... but at least it knows what the ruler is.

The Math functions are very new in Rich Text, and most programs either ignore them, or turn them into WMF objects inserted into the document. Saving from MSO to an odt file removes them altogether, replacing them with the plain text of the variables and little or no math symbols.

The main editing I did was in OOo, my favourite of all. It took considerable effort to take full advantage of the package and its different (broader, ISO standard) open XML document features. Open Document Text files implement Math based largely on the older open XML math standard, MathML, whereas Microsoft's Open XML documents are based on their own markup. Apparently, Word (prior to 2007) couldn't include math layouts natively at all. So I'm guessing the Math Markup tool that I used to use in Word 97 was simply embedding an OLE object. I'm sure I used to do something like this in Lotus AmiPro too back in the 90s, but I know it's something that the LaTeX people have whinged and whined about for years, so I guess I'm not all that surprised.

From what I can see, OOo's Math injection works out in such a way that you could almost execute it, though you might have to strip a few fluffy formatting bits out here and there that make no difference to the function of the formula at all, and just make it look neater. Microsoft's is much more like laying out a User Interface or a Web Page.
It would never run, as code, but the presentation description is quite exact, giving exact measurements in twips and the like. This smells of fluff to me, and doesn't make for a very transportable language at all. Nobody (that I know of) other than Microsoft uses a twip as a measurement... and when you're looking at a hard copy document, surely a point or fraction of an inch would be more helpful.

Anyway, what I can agree on is some of the fantastic ways to align formula elements in Microsoft's format. In OOo, the best means of doing this (according to the help) seems to be to align to some edge or other, and pad with one of two relative width white space items, or a phantom object. Microsoft use phantoms too: you can give them no width, or no height, while their other dimension stays the size it would be if the code / function they stand in for were actually displayed. That doesn't make much sense in the abstract, but if I have a word "fourtytwo" and I want to line up the word "ant" to one side of it, and the word "dog" to the other, but don't want to see the word "fourtytwo" just yet, I can use a phantom of it to measure how long that word would be in the present font and style, and align "ant" and "dog" to that phantom without displaying "fourtytwo".

The Win32 API has a similar function to this in its repertoire. When arranging user interface components that appear and disappear as they become relevant (like a ribbon) but must line up regardless of the user's preferred font and screen DPI, whether they are visible or not (so they don't move around like the ribbon), it is essential to know how long or tall a string will render in a given font at a given DPI without having to draw it just so you can measure its bounding boxes.
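To make the phantom idea concrete, here is the "ant / fourtytwo / dog" example written in LaTeX. This is a stand-in, not the markup that Word 2007 or OOo Math actually use; I'm using it only because its `\phantom`, `\hphantom` and `\vphantom` commands behave the way described above (full box, width only, and height only).

```latex
\documentclass{article}
\begin{document}

% Reference line: the word is visible.
ant fourtytwo dog

% Same spacing, word hidden: \phantom reserves its width and height.
ant \phantom{fourtytwo} dog

% \hphantom keeps only the width (zero height and depth).
ant \hphantom{fourtytwo} dog

% \vphantom keeps only the height and depth (zero width), which makes
% the left pair of brackets grow as tall as the right pair.
\[
  \left( \vphantom{\frac{a}{b}} x \right) = \left( \frac{a}{b} \right)
\]

\end{document}
```

In the first three lines "ant" and "dog" should sit in the same horizontal positions whether or not the middle word is drawn.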
In Math it's more useful to have the brackets from one side of a formula line up with ones on the other side, even though the balance of glyphs within them may vary greatly, so it's clear that you are balancing an equal or equivalent equation, regardless of any variance in glyph ink weight. When you write a mathematical formula, your artist's eye automatically does this (even if you're a mathematician and not an artist), but for a computer it's not instinctively clear, and since it's logically irrelevant, it can get it seriously wrong.

Regardless, I couldn't find anything listed in the possibilities for 2007 Math Markup that I couldn't do in OOo Writer. Except knowing and setting exactly how many twips might be between one glyph and another. Many things that had different methods for different results in MSO used different parameters to the same method in OOo, and some needed cheaty workarounds like manually shifting the size of individual symbols relative to the whole formula to get the same basic look.

Some features Microsoft considers part of the Math, OOo treats as object decoration. Boxes around a formula, for example. Once a formula is composed in OOo, it is a graphical object on the page, just like a graph or a photo. So just like a graph or a photo it can have a border, and you can control its justification and its position relative to the anchor point and the way words and paragraphs wrap to it. Microsoft seems to take a formula as a paragraph, not an object on the page, and so you define its distance from things, its alignment and its borders within the mathematical paragraph. So maybe all the talk about not having millions of ways to do the same basic thing any more in Office 2007 was all just smoke and mirrors after all.
(Don't get upset, I know that's out of context and they were talking about UI design, not file formatting and underlying code.)

Once I had made the alterations necessary to make the formulae work in OOo correctly, and display as they did (plus or minus the odd twip), I had already put considerable effort into making a maintainable odt file. So, I went ahead and saved the source for the example RTF reader to a folder and zipped it up, applied a common font to the code (because it was irregular and all over the place from various edits) and took the liberty of applying standard Scintilla syntax highlighting to it. I know most of this won't print, but it makes it easier to read on screen. I re-aligned some of the comments here and there too.

Between fixing the empty half columns and broken tables, messing with formulae and this, that and the other, the page numbers were now skewed, and as I say the TOC was no longer linked to page numbers via fields (though you can see it was at some time), so I re-built the Contents page using OOo's locked contents object, and configured it to maintain the same formatting as Microsoft had used.

Sharing the Fruits of my Labours

Then I moved on to creating a clean PDF from all this. I had the one Word created, and hotlinks did work, but as I say, the side index (or bookmarks) it exported were a complete mess, and the XPS doesn't even seem to maintain a document navigator. The file size of my new PDF was quite a bit bigger, but I know that OOo creates complete and clean PDFs not optimized for downloading, so I ran it through a compressor, and was amazed at the difference. The document loaded in a flash compared even to the MS export, and had its beautiful TOC at the side. So I tried the compressor on the MS document too (which I will keep as a reference to the original formatting of the document). The result was less impressive, but that may be because JPEGs don't re-compress as well as lossless images.
Even so, the time taken to load is a great improvement, and the size decrease is not inconsiderable.

I'm not sure if you can do this with XPS files, but a PDF can have other files attached to it, just like an Email can. I wanted to use this to attach the zip I made of the example source file, so I attached the zip to the page where the source code starts with another little PDF tool I downloaded.

Now, you can have the choice of reading this document in two flavors of PDF (I recommend my OOo reconditioned version, unless you are a stickler for authenticity), as an XPS or as a Rich Text file, and the PDFs will have a zip containing all the source and a makefile for building your very own Rich Text Format file reader.

Links to these files in my public SkyDrive can be found here, and will remain here until Microsoft take them down, or ask me to do so for them. Personally, I hope they don't take offense to my redistribution... In fact, they can give me a job. ;)

PS. The Zip contains the Rich Text export from Word 2007.