
How to output PDF, DOC, and RTF formats: The solution

A while back, I asked your advice on how to produce PDF, DOC, and RTF output from a web app, something I needed for isabont.com, which we launched a few weeks ago.



Thanks to your help, I figured it out, and now I want to share what I learned.



First, I ended up using OpenOffice 2 to do it. We produce 3 formats natively: HTML, Text, and OpenOffice. Since the OpenOffice format is just a zip of a bunch of XML files, that’s easy to create programmatically.



To produce the XML files, I use Rails’ ERb templates, which are like ASP or PHP files, just plain HTML (or in this case XML) with inline <% ... %> tags that let you do conditionals, loops, insert variables, etc.



What I do in practice is create the document in OpenOffice, save it, unzip it, and pull out the XML files. Some of them I can include verbatim; others, like content.xml, I need to make dynamic.
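To make the templating step concrete, here is a minimal sketch using Ruby's standard ERB library outside of Rails. The template is a heavily trimmed, hypothetical stand-in for a real content.xml (a saved document carries many more namespaces and style definitions), and the method name render_content_xml is made up for illustration.

```ruby
require 'erb'

# Hypothetical, trimmed-down content.xml template. A real one, pulled out of
# a saved document, would carry many more namespaces and automatic styles.
CONTENT_TEMPLATE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <office:document-content
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body>
      <office:text>
        <text:h text:outline-level="1"><%= ERB::Util.html_escape(title) %></text:h>
        <% paragraphs.each do |para| %>
        <text:p><%= ERB::Util.html_escape(para) %></text:p>
        <% end %>
      </office:text>
    </office:body>
  </office:document-content>
XML

# Renders the template with the given title and paragraph strings.
# Values are escaped so that &, < and > cannot break the XML.
def render_content_xml(title, paragraphs)
  ERB.new(CONTENT_TEMPLATE).result_with_hash(title: title, paragraphs: paragraphs)
end

puts render_content_xml("Q&A", ["First paragraph.", "a < b"])
```

The rendered string would then be written out as content.xml and zipped together with the files copied verbatim from the original document.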



Next, I invoke OpenOffice, calling a macro that converts the document I’ve produced. The macro is adapted from an article on xml.com:



' Convert a document using the named export filter.
Sub SaveAs(inFile, outFile, FilterName)
   oDoc = StarDesktop.loadComponentFromURL(ConvertToURL(inFile), "_blank", 0, Array(MakePropertyValue("Hidden", True)))
   oDoc.storeToURL(ConvertToURL(outFile), Array(MakePropertyValue("FilterName", FilterName)))
   oDoc.close(True)
End Sub

' Save document as an Acrobat PDF file.
' Note: outFile is ignored; the output name is derived from inFile.
Sub SaveAsPDF(inFile, outFile)
   SaveAs(inFile, Left(inFile, Len(inFile) - 4) + ".pdf", "writer_pdf_Export")
End Sub

' Save document as a Microsoft Word file.
' Note: outFile is ignored; the output name is derived from inFile.
Sub SaveAsDoc(inFile, outFile)
   SaveAs(inFile, Left(inFile, Len(inFile) - 4) + ".doc", "MS WinWord 6.0")
End Sub

' Build a PropertyValue struct from a name and a value.
Function MakePropertyValue(Optional cName As String, Optional uValue) As com.sun.star.beans.PropertyValue
   Dim oPropertyValue As New com.sun.star.beans.PropertyValue
   If Not IsMissing(cName) Then
      oPropertyValue.Name = cName
   EndIf
   If Not IsMissing(uValue) Then
      oPropertyValue.Value = uValue
   EndIf
   MakePropertyValue() = oPropertyValue
End Function


This is where it starts to get annoying. You would hope that OpenOffice would be reasonably scriptable, having come from Sun, but no. In order to get OpenOffice to suck up this macro, you have to start it, select Tools, Macros, and then enter this macro somewhere. That’ll give it a special path, in my case macro:///Standard.Isabont.



To invoke it, I execute this shell command:



/usr/lib/openoffice/program/soffice.bin -display :10 -headless "macro:///Standard.Isabont.SaveAs(/path/to/ooo/file.odt,/path/to/output/file.doc,Name of Filter)"


The filter name can be one of:



  • writer_pdf_Export
  • MS WinWord 6.0
  • Rich Text Format
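From Ruby, the command can be assembled as an argument list rather than a single string. The following is only a sketch: the symbol keys (:pdf, :doc, :rtf) and the method name conversion_command are made up here, while the binary path, the display number, the macro path, and the filter names are the ones quoted above. It does not handle file paths that themselves contain commas or parentheses.

```ruby
# Filter names as listed above; the symbol keys are invented for this sketch.
FILTERS = {
  pdf: "writer_pdf_Export",
  doc: "MS WinWord 6.0",
  rtf: "Rich Text Format"
}.freeze

SOFFICE = "/usr/lib/openoffice/program/soffice.bin"

# Returns the argument list for one conversion.
# Raises ArgumentError for a format that has no known filter.
def conversion_command(in_file, out_file, format)
  filter = FILTERS.fetch(format) do
    raise ArgumentError, "unknown format: #{format.inspect}"
  end
  macro = "macro:///Standard.Isabont.SaveAs(#{in_file},#{out_file},#{filter})"
  [SOFFICE, "-display", ":10", "-headless", macro]
end

cmd = conversion_command("/tmp/report.odt", "/tmp/report.rtf", :rtf)
puts cmd.inspect
# system(*cmd) would run it. Passing separate arguments bypasses the shell,
# so the parentheses and spaces in the macro URL need no quoting.
```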


I managed to find a macro that would spit out all the names of the filters in your installation into an OpenOffice document, but I forget where. It’s an ugly long list of filters, and it’s hard to figure out how well they’d work.



What the above command does is actually load up OpenOffice, which is big, so it takes a little while, invoke that macro, and then shut it down again. If another OpenOffice process is already running, it’ll just stop. If it crashes, it leaves a little lock file around in /home/youruser/.openoffice.org2/.lock, which you need to delete to get it to start again.
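Cleaning up after a crash can be scripted too. This is a small sketch, not something lifted from the running app: the method name clear_stale_lock is made up, "youruser" is a placeholder, and it assumes the caller has already made sure no OpenOffice process is running, since deleting the lock under a live process would defeat its purpose.

```ruby
# Lock path as described above; "youruser" is a placeholder.
DEFAULT_LOCK = "/home/youruser/.openoffice.org2/.lock"

# Deletes the lock file if it exists. Returns true if a file was removed,
# false if there was nothing to remove.
# Assumption: the caller knows that no OpenOffice process is running.
def clear_stale_lock(lock_path = DEFAULT_LOCK)
  return false unless File.exist?(lock_path)
  File.delete(lock_path)
  true
end
```

Calling it right before each conversion keeps one crash from blocking every later run.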



In order to make sure it only runs once, and in order to not tie up resources waiting for it, I use Ezra Zygmuntowicz’s excellent BackgrounDRb in Rails. This makes it run in a background thread, with the browser just polling every second to see if it’s done. It works.
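The idea can be illustrated without BackgrounDRb. The sketch below is not its API (BackgrounDRb runs as a separate server with its own interface); it is a hypothetical in-process stand-in using a plain Ruby Thread, with a Mutex so only one conversion runs at a time and a status lookup that a polling request could call. The class and method names are invented.

```ruby
require 'securerandom'

# Hypothetical in-process job runner, for illustration only.
class ConversionQueue
  def initialize
    @run_lock = Mutex.new      # lets only one conversion run at a time
    @status_lock = Mutex.new   # guards the status table
    @status = {}               # job id => :working, :done or :failed
  end

  # Runs the block in a background thread and returns a job id right away.
  # :working covers both "waiting for its turn" and "running".
  def submit(&work)
    id = SecureRandom.hex(8)
    set_status(id, :working)
    Thread.new do
      begin
        @run_lock.synchronize { work.call }
        set_status(id, :done)
      rescue StandardError
        set_status(id, :failed)
      end
    end
    id
  end

  # Cheap lookup, suitable for a request that polls once a second.
  def status(id)
    @status_lock.synchronize { @status[id] }
  end

  private

  def set_status(id, value)
    @status_lock.synchronize { @status[id] = value }
  end
end

queue = ConversionQueue.new
job = queue.submit { sleep 0.1 }   # stand-in for the OpenOffice call
sleep 0.05 while queue.status(job) == :working
puts queue.status(job)
```

The request that kicks off the conversion returns as soon as it has the job id, and each poll only does a hash lookup.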



What would be better, of course, is to just keep one OpenOffice process running and send messages to it. There’s something called an UNO bridge, but it seems to be only half working and maybe a quarter documented.



The good thing about this setup is that even though there’s the odd problem, like bullets showing up weird in the Word DOC output, it’s entirely reproducible: you can go from Word back to OpenOffice, unzip, look inside, and actually get to the bottom of things.



An equally annoying thing about this setup is how poorly OpenOffice works on the Mac, and how slow it is. That means the development cycle goes from the usual 2-second process of editing and running unit tests, or reloading the browser, to something an order of magnitude slower, which means you have to change your style of working. Frustrating, and less productive.



Let me take a moment for an unsolicited rant …



All in all, my experience with OpenOffice is that I like the format, but hate the software. The format is fairly clear and easy to work with, although at 706 pages, the specification is still a bit too verbose for my taste. Also, as Dave Winer is fond of saying about RSS and OPML, it works because there’s software that implements it. It would be impossible to work with it based on the spec alone, without OpenOffice.



But boy, is OpenOffice the software a bloated piece of crap! Microsoft need have no fear of it. Instead of trying to do something new, it’s just a really poorly executed clone of MS Office.



Microsoft is widely criticized for featuritis, something they get into because of their need to sell upgrades. OpenOffice doesn’t have that problem, so they could’ve differentiated themselves by having fewer, but better, features.



Another obvious differentiation would be scriptability. Microsoft is no fan of the command line, so they tend to take the big and clumsy approach of building a whole language and IDE into the application itself. It would be great if OpenOffice had found a way to leverage existing scripting languages, like Python, or, heck, just build in a simple webserver and offer a RESTful API and let people use any language at all.



Alas, they haven’t, and now we’ll have to put our faith in things like Writely to solve the desktop office problem going forward. That doesn’t solve the problem of outputting documents from the server, of course.



Ok, back on track.



There are other options to consider. If you’re only looking to do PDF output, there’s Prince XML. It converts directly from HTML and understands CSS reasonably well, so you don’t need to produce the OpenOffice format at all. That should save some time. The downside is that it costs real money for commercial use ($3800). Or you could produce the OpenOffice format but skip the conversion, and rely on the latest version of Word being able to read that format. That way you can avoid polluting your servers with OpenOffice.



So there you have it. Bottom line: It’s doable, it works, but it isn’t much fun. Good luck!

4 comments

I can only relate to this experience. We are using OO to produce PDF documents out of OpenACS and it takes around 5 seconds to spit it out. What I don't see, though, is the need for a background process. And in my experience, if an OpenOffice process is running already, the new execution actually takes the resources of the running process and prints. At least converting three documents at the same time only takes 5+2+2 seconds, so the second invocation seems to be faster.
The reason I do a background thread is so I don't have to tie up a process on the server for the whole duration. I only have 6 concurrent connections (I could up that number, but don't need to), so holding on to such a connection for more than a fraction of a second is undesirable. By running it in the background, the browser can just check once per half second or so whether it's done, and that request takes less than 1/100th of a second to process. Also, by running it in a background process, it's trivial to ensure it doesn't try to run concurrently. Remember, Rails isn't threaded like AOLserver.
What did you do about OO's reliance upon X?
Who's B?