How to write a web book in the 21st century

It all started because I wasn't feeling well and had to stay at home for a few days. Since that can get pretty boring, I decided to sort out my limited knowledge of Hebrew by putting it into a kind of web book or something like that. Maybe, just maybe, if I really learn Hebrew someday, it will turn into a nice language course for nerds like me. And if I just get bored of it before I actually finish, then at least I'll have wasted enough time on it not to be bored for a while. It's a win-win idea.

One would think that writing a web book is something easily done in the 21st century, provided, of course, that you know what to actually write there. I mean, there is Unicode, there is appropriate markup for bi-directional text in HTML, and it all seems to be well supported by the mainstream browsers. Just pick the right tools and write! Or so I thought.

The first issue is finding the right format. Writing directly in HTML isn't the best idea (although it can be done too) because HTML lacks internal structure. You'd have to struggle with chapter numbering, the table of contents, and splitting the thing into the right number of HTML pages. HTML is the best output format for web publishing, but it's not quite the right thing to actually write in.

The first format I thought of was LyX, as its WYSIWYM (what you see is what you mean) idiom gives the most content for the least formatting effort. I had already used it for my "Endgame: Singularity" Impossible Guide and found it quite nice, especially when there is math involved. I knew there were some problems with Unicode support in TeX/LaTeX, but I thought that such a nice piece of mature software would surely have all of that solved by now… Boy, was I wrong!

There is a whole bunch of ways to feed UTF-8 to TeX, using XeTeX or whatever-TeX, but none of them seemed to actually work. After struggling with it for a while, I realized that using TeX in any form cripples the very idea of using Unicode to get rid of all multi-language troubles before they even appear.

OK, so I thought I needed some decent format with native Unicode support. Something like XML. Of course, raw XML is kind of useless unless I wanted to write my own XSLT stylesheet, which I didn't. So what I needed was an XML-based format designed for writing structured documents… Looks like Docbook is the way to go! It is designed primarily for technical documentation, but nothing stops you from using it for anything else, as long as nothing special is required of it, and my requirements were quite simple indeed.
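
To give an idea, this is roughly what Docbook source looks like (a minimal sketch with made-up titles, not an excerpt from the actual book):

<?xml version="1.0" encoding="UTF-8"?>
<book lang="en">
  <title>Hebrew for Nerds</title>
  <part>
    <title>Reading</title>
    <chapter>
      <title>The Alphabet</title>
      <para>The greeting <foreignphrase lang="he" dir="rtl">שלום</foreignphrase>
        literally means “peace”.</para>
    </chapter>
  </part>
</book>

Chapter numbering, the table of contents and splitting into pages all follow from this structure automatically.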

Now the only thing I needed was some sort of WYSIWYG/WYSIWYM XML editor with Docbook support, because I didn't want to write raw XML in Vim or something like that. Given Docbook's popularity, there must be plenty of them, right?

The first thing I found out is that most of these editors are pretty expensive, usually in the range of $300 to $350. I should have expected that from software that isn't used by every housewife, but I hoped there would be at least one or two freeware, or better yet, open source editors. Turned out there were none. At least no WYSIWYG ones.

The next thing I figured out was that the RTL support needed for Hebrew was pretty scarce too. I was already thinking about evaluating some commercial editor or looking for another format or way to accomplish my goal, but then I found an old version of Serna, which was open source at the time (and later turned proprietary, which is kind of popular amongst XML editors for some reason).

Serna Free 4.4 was some two years old, but pretty good otherwise. I almost thought I had solved my last problem. In reality, the horror was just beginning.

After typing a few paragraphs in English, Russian and Hebrew, I tried to convert the document to HTML. For some reason, that was the only format supported out of the box (no PDF or whatever), but I was happy enough with that. The problem was that all direction attributes were ignored by the converter! That is, Hebrew wasn't marked as RTL, which led to all sorts of minor but unforgivable problems, like misplaced punctuation. I was quite surprised by that, as both Docbook and HTML have decent RTL support.
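
To see why this matters, here is what the converter should produce for a Hebrew phrase inside an English paragraph (a hand-written sketch, not actual converter output):

<p>The greeting <span lang="he" dir="rtl">שלום!</span> takes an exclamation mark.</p>

Without dir="rtl" on the span, the browser has only the Unicode bidi algorithm to go by, and direction-neutral characters like that exclamation mark can end up rendered on the wrong side of the Hebrew phrase.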

At first I thought it was a faulty editor. But then it turned out it just calls xsltproc to do the actual conversion. And xsltproc is a perfectly generic processor which doesn't know a thing about Docbook or HTML – it just converts one XML into another according to an XSLT stylesheet. So it had to be an XSLT problem then. I was right! While Docbook itself has the necessary attributes to specify direction, the Docbook XSL stylesheets just ignore them in most cases! So I had to customize the XSLT: I copied chunk.xsl to rtl.xsl and replaced its docbook.xsl import with an import of base.xsl, a new file created by me, which looked like this:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:exsl="http://exslt.org/common"
                version="1.0"
                exclude-result-prefixes="exsl">

<xsl:import href="docbook.xsl"/>

<!-- Copied from the stock Docbook stylesheets; only the @dir block below is new. -->
<xsl:template match="phrase">
  <span>
    <xsl:if test="@lang or @xml:lang">
      <xsl:call-template name="language.attribute"/>
    </xsl:if>
    <xsl:if test="@role and $phrase.propagates.style != 0">
      <xsl:attribute name="class">
        <xsl:value-of select="@role"/>
      </xsl:attribute>
    </xsl:if>
    <xsl:if test="@dir">
      <xsl:attribute name="dir"><xsl:value-of select="@dir" /></xsl:attribute>
    </xsl:if>
    <xsl:call-template name="anchor"/>
    <xsl:call-template name="simple.xlink">
      <xsl:with-param name="content">
        <xsl:apply-templates/>
      </xsl:with-param>
    </xsl:call-template>
  </span>
</xsl:template>

<!-- Again the stock template, plus the @dir block. -->
<xsl:template name="paragraph">
  <xsl:param name="class" select="''"/>
  <xsl:param name="content"/>

  <xsl:variable name="p">
    <p>
      <xsl:if test="@dir">
        <xsl:attribute name="dir"><xsl:value-of select="@dir" /></xsl:attribute>
      </xsl:if>
      <xsl:if test="$class != ''">
        <xsl:attribute name="class">
          <xsl:value-of select="$class"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:copy-of select="$content"/>
    </p>
  </xsl:variable>

  <xsl:choose>
    <xsl:when test="$html.cleanup != 0">
      <xsl:call-template name="unwrap.p">
        <xsl:with-param name="p" select="$p"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:copy-of select="$p"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

</xsl:stylesheet>
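
And rtl.xsl itself differs from the stock chunk.xsl by a single line, roughly like this (a sketch; all the surrounding chunk.xsl boilerplate is omitted):

<!-- rtl.xsl: a verbatim copy of chunk.xsl, except that -->
<!--   <xsl:import href="docbook.xsl"/>                 -->
<!-- is replaced with the customization layer:          -->
<xsl:import href="base.xsl"/>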

No, I didn't write all that stuff myself, thankfully! I just found the templates in the existing stylesheets, copied them into my base.xsl and added those 'if test="@dir"' parts. On top of that, I had to create a startup script for xsltproc with these options:

"e:\programs\Serna Free 4.4\bin\xsltproc.exe" –stringparam chunker.output.encoding UTF-8 –stringparam html.stylesheet hebrew.css –stringparam label.from.part 1 –stringparam component.label.includes.part.label 1 %*

What this does is set the output encoding to UTF-8 (by default it's ISO-8859-1, in 2013!), link the resulting HTML to my hebrew.css, and make chapter numbers start from 1 in each part of the book with the part number included, so a chapter label looks like "Chapter II.3". Then I created hebrew.css:

/* Docbook's foreignphrase comes out in HTML as <em class="foreignphrase">;
   render the Hebrew upright and in a larger font instead. */
em.foreignphrase {
    font-style: normal;
    font-size: 16pt;
}

Now that's surprisingly reasonable. None of that incomprehensible nonsense. The next part was to get the spell checker working. I needed it to check both Hebrew and English (and Russian, just for the hell of it). The spell checker used by Serna is hunspell, which does support multiple languages. It's just that Serna doesn't realize this, so I had to make a combined dictionary using a program named hunspell-merge. It didn't work right away for some reason, but when I compiled it from sources, it suddenly worked. Maybe it would have worked anyway if I had simply tried a second time, though.
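
For context, a hunspell dictionary is a pair of files: a .dic file listing one word per line, with an approximate word count on the first line and optional affix flags after a slash, and an .aff file defining what those flags mean. An English excerpt would look something like this:

3
hello
isn't/M
world/S

Merging dictionaries means concatenating the word lists while renaming the affix flags so the different .aff rule sets don't collide, which, as far as I can tell, is exactly the fiddly part hunspell-merge automates.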

After all that adventure I thought I was finally ready to type the actual text, but then came the next challenge. The problem with modern computers is that a lot of necessary characters, like the en dash, quotation marks and the proper apostrophe, aren't on the keyboard. Office software like MS Word solves this by providing autocorrection. In fact, they should have implemented it in Windows, not in Office, because people need to be able to type texts outside Word too! Or maybe that's exactly why they didn't do it. Anyway, I had just the right program to solve this problem: AutoHotkey. I was already using it for a few things, so I "only" had to add this to the script:

; The ^ anchors in the window titles below rely on RegEx title matching.
SetTitleMatchMode, RegEx

; Grab the keyboard layout handles for Russian, English and Hebrew.
ru := DllCall("LoadKeyboardLayout", "Str", "00000419", "Int", 1)
en := DllCall("LoadKeyboardLayout", "Str", "00000409", "Int", 1)
he := DllCall("LoadKeyboardLayout", "Str", "0000040D", "Int", 1)

; In Serna, < and > produce typographic double quotation marks.
#IfWinActive, ^Syntext Serna
<::
Send “
return

#IfWinActive, ^Syntext Serna
>::
Send ”
return

; The apostrophe key depends on the active layout: э for Russian and a comma
; for Hebrew (what the key normally types on those layouts), and a typographic
; apostrophe otherwise.
#IfWinActive, ^Syntext Serna
'::
w := DllCall("GetForegroundWindow")
; GetWindowThreadProcessId gives the thread id that GetKeyboardLayout expects
tid := DllCall("GetWindowThreadProcessId", "UInt", w, "Ptr", 0)
l := DllCall("GetKeyboardLayout", "UInt", tid)
if (l = ru)
{
    Send э
}
else if (l = he)
{
    Send {U+002C}
}
else
{
    Send ’
}
return

; Numpad minus produces an en dash.
#IfWinActive, ^Syntext Serna
NumpadSub::
Send –
return

After all that XSLT stuff, this didn't even seem too hard. Now I could type quotation marks with < and >, characters which I hope I won't actually need in Serna (or I'll have to think of some other hotkeys for them).

Now the question is: was that the last challenge? I think you guessed it: it was not. The most ridiculous challenge still lay ahead.

After all that AutoHotkey scripting with the apostrophe, the spell checker stopped working for words with apostrophes, like "aren't" and "doesn't". At first I thought it was simply a matter of changing the ugly ASCII symbol to the real apostrophe in the dictionary. But after I did that, it stopped recognizing both of them! Now both "isn't" and "isn’t" were marked as incorrect. That wasn't so nice, so I started looking for a solution.

It turns out that hunspell actually supports such things, and I only had to add the apostrophe to its list of valid word characters. But that didn't work. At this point a terrible thought occurred to me: what if, instead of letting hunspell do all the job, they parsed the text into separate words themselves and fed them to the spell checker one by one? Even though that version of Serna was supposed to be open source, the sources seemed to have been lost! At least I wasn't able to find them anywhere.
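
For reference, the hunspell-side change that should have been enough is the WORDCHARS line in the .aff file, which lists extra characters allowed inside words. Something like this (the digits are what the stock English file keeps there; the typographic apostrophe is the addition):

SET UTF-8
WORDCHARS 0123456789’

But with Serna splitting the text into words itself, hunspell never even saw the apostrophes.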

At this moment I should have really given up. But I had a lot of time to kill, remember? So I dug up good ol' IDA Pro and started disassembling. Thankfully, Serna used my favorite – Qt, so I immediately looked for QChar::isLetter(). After a while, I found a piece of code that was essentially doing something along the lines of

if (c.isLetter() || c == '\'') …
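
In the compiled code, that second condition shows up as a comparison of the character's UTF-16 code unit against the ASCII apostrophe, presumably something like:

cmp ax, 0x27 ; c == '\'' (ASCII apostrophe, U+0027)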

Aha! Now I only had to replace the second part with

cmp ax, 0x2019 ; the typographic apostrophe (U+2019)
nop ; just to fill unused space

Of course, it didn't work right away. It turned out there were some other places in the library where the same check was used. Whether it was an inline function or they really duplicated their code like that is anyone's guess. After I had replaced all of them, it still didn't work, so I had to use a debugger to confirm that my patched code was working all right, and that the word was then passed to a function in the main library called addMarkedWord(). So I had to disassemble the main library too, find its references to QChar::isLetter(), and patch the apostrophe there as well.

Much to my surprise, it actually worked then! Here is a screen shot of it working fine:


To sum it up, here is what one needs to do in order to comfortably write a book for publishing on the web: download an editor; customize the XSLT stylesheets by writing a couple of new files and rewriting some templates; download spell checker dictionaries and a program to combine them into a single one; fix the dictionaries by replacing the wrong characters with the right ones; then disassemble and patch the editor so it accepts your new dictionary. Oh, and I almost forgot: write some AutoHotkey scripts to be able to actually type the right characters. Isn't that easy?
