04 April 2006


Hansard: HTML versus PDF

Is HTML or pdf the better format for distributing text online? The short answer is that it depends on whether the articles are to be viewed and used mainly on the computer screen, or mainly as hardcopy printed on paper.

HTML is much the better choice when the file is destined to be read from a monitor screen. With HTML, the viewer can choose a font size and a window width to match his/her preference, so the text is easy to read.

When the file is destined for printing on paper, and the precise printed page layout is important, pdf is better — pdf is designed for printing, not browsing.

With pdf the user has no choice of line length for on-screen reading. At a nice font size the page often doesn't fit on the screen. Reading two-column output requires either a tiny font or lots of sideways scrolling. Page breaks get in the way, often forcing the viewer to scroll up and down several times to get the last few words on the previous page and the first few words on the following page; HTML documents never deliver that kind of needless hassle (always experienced at the receiving end and never at the sending end).

In pdf, hyperlinks work badly, and often not at all.

Searching a document for a word or phrase is much easier in HTML than in pdf.

The arguments in favour of pdf are advanced by people at the sending end (website designers), and the arguments against pdf are advanced by people at the receiving end (viewers).

The arguments in favour of HTML are advanced by people at the receiving end (viewers), and the arguments against HTML are advanced by people at the sending end (website designers).

Hansard documents are written reports of spoken words, thus are purely text with no graphics, no maps, no tables — straight text only.

Good, standards-compliant HTML is always better than pdf for use on the web, for delivering straight-text documents.

Hansard reports of Legislature debates should always be presented online in HTML. A second-choice version in pdf may be made available online, if necessary to placate sending-end control freaks.

Where Hansard is available in both forms, indexed side-by-side for equal accessibility, it would be very interesting to see the server logs to compare how often one format or the other is chosen by the citizens.

It is noteworthy that these server logs are controlled, not by citizens (who favour HTML) but by webmasters (who favour pdf). I've never seen Hansard download statistics made public in any situation where both formats are equally available; this drought of statistical information is compatible with the theory that those who oppose HTML are the same people who control access to the download statistics and would be reluctant to release statistics showing that citizens strongly prefer HTML.
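For anyone who does have access to such logs, the comparison takes only a few lines of script. What follows is a minimal sketch in Python, not a description of how any legislature's server is actually set up. It assumes the server writes Common Log Format lines and that the two versions of each day's Hansard differ only in their file extension. The `/hansard/` path, the sample log lines and the function name `count_formats` are all invented for illustration.

```python
import re
from collections import Counter

# Assumption: Common Log Format lines. The paths and addresses are invented.
SAMPLE_LOG = '''\
192.0.2.1 - - [04/Apr/2006:10:00:00 -0300] "GET /hansard/2006-04-03.html HTTP/1.1" 200 181234
192.0.2.2 - - [04/Apr/2006:10:01:10 -0300] "GET /hansard/2006-04-03.pdf HTTP/1.1" 200 402118
192.0.2.3 - - [04/Apr/2006:10:02:45 -0300] "GET /hansard/2006-04-03.html HTTP/1.1" 200 181234
192.0.2.4 - - [04/Apr/2006:10:03:00 -0300] "GET /images/logo.gif HTTP/1.1" 200 2048
192.0.2.5 - - [04/Apr/2006:10:04:30 -0300] "GET /hansard/2006-04-03.pdf HTTP/1.1" 404 512
'''

# Match a successful (status 200) GET of a Hansard file and capture its extension.
REQUEST = re.compile(r'"GET /hansard/\S+\.(html|pdf) HTTP/[\d.]+" 200 ')

def count_formats(log_text):
    """Return a Counter of successful Hansard downloads keyed by format."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REQUEST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for fmt, n in sorted(count_formats(SAMPLE_LOG).items()):
        print(fmt, n)
```

A real comparison would also have to discount search-engine crawlers and repeat requests, which this sketch ignores.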

A thought: why are blogs never in pdf?

This blog, and all the blogs I've ever seen, is presented in HTML, giving the viewer complete control of the font size. The viewer also can control the line length by adjusting his/her window width. Bloggers are hungry for an audience — no sane (or otherwise) blogger will drive his/her audience away by forcing them to endure the hassles of pdf. Blogs are never even offered with an alternative choice of pdf. Blogs are destined from the beginning to be read on-screen, and for a good reason are never delivered in pdf. The same reason applies with equal force to online Hansards, but Hansard decision-makers are usually people who have spent most of their working days in a printing environment, and who are uncomfortable with the freedom that HTML gives to the consumer.

Hansard available in HTML?
Yes   Ottawa House of Commons
Yes   Ottawa Senate

Yes   Yukon Legislature
Yes   British Columbia Legislature
Yes   Alberta Legislature
Yes   Manitoba Legislature
Yes   Ontario Legislature
Yes   Quebec Legislature
Yes   Nova Scotia Legislature
Yes   Newfoundland and Labrador Legislature

 No   Saskatchewan Legislature
 No   Prince Edward Island Legislature
 No   Northwest Territories Legislature
 No   Nunavut Legislature

For any particular session, New Brunswick's Hansard is partly in HTML but mostly in pdf only. The two formats are mixed together with no discernible system, and I'm unable to find an explanation of which New Brunswick Hansard records are presented in HTML and which in pdf.

No need to limit the choice to two high-level outputs.

It's much better to XML encode the data and then decide on output with an XSL transformation dynamically. The data can then be displayed, saved, or streamed based on document type.

It's always better to separate data from meta-data, and talking about strict output format based on data is wrong.

XSL - http://www.w3.org/Style/XSL/
HTML - http://www.w3.org/Style/XSL/WhatIsXSL.html
PDF modules: http://www.xmlpdf.com/

Lance seems to know a lot more than I do about this. I have no idea what "XML encoding" or "XSL transformation" might mean.
I agree that "talking about strict output format based on data is wrong". I'd prefer to talk about what the important characteristics are. The characteristics associated with HTML and not with PDF, such as giving control of text size and line length to the viewer, are the important points. Whatever means are available that might do the job even better are fine with me. Thank you for your comment.
I don't mean to sound techno-elite or anything, but XML is just another markup language . . . exactly like HTML. The difference is that you can define what the tags do.

In fact, HTML is pretty much dead; most modern webpages (including yours) use XHTML, which is an XML document type based on W3C (World Wide Web Consortium) standards.

XSLT (XSL Transformations, where XSL is the Extensible Stylesheet Language) works from just another text file, a stylesheet that defines how those tags are output (screen, file, printer).

So from a simple XML document that contains data and minimal markup you can create HTML, PDF, or PostScript.
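
As a rough sketch of the idea, here is one small XML source rendered two ways. Note that Python's standard library stands in for a real XSLT processor here, so this shows the principle of one source with several outputs and not XSLT itself. The tag names (`debate`, `speech`), the `speaker` attribute and the sample text are invented for illustration and are not taken from any actual Hansard markup.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical markup: the tag and attribute names are invented for this example.
SOURCE = """\
<debate date="2006-04-03">
  <speech speaker="The Speaker">Order, please.</speech>
  <speech speaker="Hon. Member">I rise on a point of order &amp; privilege.</speech>
</debate>
"""

def to_html(root):
    """Render the debate as an HTML fragment for on-screen reading."""
    parts = ["<h1>Debates of %s</h1>" % escape(root.get("date"))]
    for speech in root.iter("speech"):
        parts.append("<p><b>%s:</b> %s</p>" % (
            escape(speech.get("speaker")), escape(speech.text)))
    return "\n".join(parts)

def to_text(root):
    """Render the same debate as plain text, e.g. as input to a print pipeline."""
    lines = ["DEBATES OF " + root.get("date"), ""]
    for speech in root.iter("speech"):
        lines.append("%s: %s" % (speech.get("speaker").upper(), speech.text))
    return "\n".join(lines)

if __name__ == "__main__":
    root = ET.fromstring(SOURCE)
    print(to_html(root))
    print()
    print(to_text(root))
```

The source records only who said what and when. Each output function decides the presentation, which is the job an XSLT stylesheet would do in a real setup.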

Your worry about line-length and the like are not valid regarding PDF. PDF is a static data type, it generally isn't changed by the user. The _screen display_ of the PDF is different, that's client side.

I guess my point is that Internet supported technology is so far ahead of the vast majority of gov't institutions it invalidates your post about which is better.

The answer is neither, the answer is a non-specific format that has had the capability server-side to present the data in a ubiquitous way for three years or more.

IMO, that's what you get when IT staff are unionized and restricted.

Document (meaning data) management is a well known issue in IT, not so in the realms that deal specifically with information but not necessarily IT.

Thanks for the clarification. I'm not concerned about anyone sounding "techno-elite," if he/she adds something useful to the conversation. That's a new idea to me — defining your own markup tags — opens new vistas.

You wrote: "I guess my point is that Internet supported technology is so far ahead of the vast majority of gov't institutions it invalidates your post about which is better. The answer is neither..."

Here, our signals seem to be getting crossed, or maybe I need to state my position more clearly. I know that governments are a long way behind in presenting information on the Internet. The problem is that the governments don't know it, or, if they do, they're keeping it very secret. No doubt there are people deep inside government who know all about this, but they have not yet made it into the ranks of the decision-makers. The decision-makers are still spending their time thinking about how to improve the design of the buggy whips.

What I'm trying to do, with this blog, is to open a conversation about the way governments are now presenting information on the Internet, and what should be improved. This is a job that has to be taken in small steps, and it has to begin with where we are now.

My chosen first small step is to talk about the online Hansards. That's a tiny, relatively simple but important part of government's information stash. Straight text, no images, no graphics, no tables, no forms, and everything cut naturally into reasonably small chunks of data, one day (a few hundred kilobytes) at a time — you can't get much simpler than that.

We have fifteen Hansards to discuss, two federal, ten provincial and three territorial. These Hansard texts are now presented online in one of two formats: HTML and/or PDF. That's why I wrote "Hansard: HTML versus PDF." That's where we are now.

Your position is: "...which is better. The answer is neither..."

My position is: Of the two formats now in use, which is better?

By all means, let's keep our eye on the horizon, but before we get there we have to step forward from our current position. Of the two formats currently in use, one is clearly superior from the point of view of the citizen. Let's try to deep-six the inferior format, or at least to get everyone on board with the superior choice now widely but not universally used.

Is there a case for presenting Hansard online in pdf? If so, can someone describe it?

You wrote: "Your worry about line-length and the like are not valid regarding PDF. PDF is a static data type, it generally isn't changed by the user. The _screen display_ of the PDF is different, that's client side."

I think my concern about line-length and text size is valid, given the way Hansard is now presented online. Those in charge, some of them that is, have not yet grasped this point. From the point of view of the citizen, trying to find and use information in Hansard, text size and line length are often a problem now, thus are valid concerns. How do we, the citizens, get this across to those in charge?

Thank you for your comments.