Requirements

For small notes, Google Drive documents, web or wiki pages and blogs like this one work fine. What about bigger pieces? What do I want? And which pieces are big enough to use something other than the blog?

  • Formats: publish in PDF, HTML and EPUB (which is mostly the same as HTML)
  • Styling: style the piece easily, for instance, change the font of all headings, leaving the font of the paragraphs unchanged - and do it in constant, not linear time :)
  • Fonts: use specific fonts and various scripts (English, Hebrew, Russian)
  • Editing: edit the document easily
  • Hosting: publish at places like GitHub, where installing MediaWiki is not an option; the results need to be static
  • Repeatable Builds: have repeatable builds, with no pre-requisites to install and no scripts to run
  • Generated Content: include content generated by code in the document
  • Equations: include mathematical equations in the document
  • Graphics: include images and diagrams in the document

Choices

Formats and Styling: DocBook

To support multiple output formats and easy styling, I need some XML-based source format, with associated stylesheets.

There are three XML formats that can be used: TEI, DocBook and DITA.

For a project that uses TEI to publish Jewish texts, I tried (in 2007) to use TEI for the paper about the project, too. Sebastian Rahtz fixed some bugs in the TEI stylesheets - the same day I reported them! But in the end, DocBook turned out to be easier (I didn’t find a Maven plugin for TEI). This is understandable: TEI targets the humanities, with their “critical apparatus”, “phrase structure” etc., while DocBook deals with publishing technical papers :) It is quite possible that the TEI stylesheets have improved drastically since then. After all, the TEI/DocBook unification paper “A unified model for text markup: TEI, Docbook, and beyond” by Rahtz, Walsh and Burnard dates from 2004 :)

I do not know much about DITA - but probably should :)

Actually, there is another alternative: HTML5. There is a lot of excitement about the use of HTML5 in publishing as both source and output format; see, for example, “HTML5 is the Future of Book Authorship” by Sanders Kleinfeld. It was always possible to use HTML in such a way, so I am not sure why the excitement centers on HTML5. I can see why new CSS capabilities are important for publishing - for instance, the “Paged Media Module” - but this is work in progress. To transform CSS-styled HTML into PDF, a formatter like the ones available from Antenna House or Prince is needed. They cost a lot of money :( PDFReactor does offer free personal licenses…

For the rest of this post I use DocBook - just as I do in practice (for now?).

Repeatable Builds and Generated Content: Maven

Maven is all about repeatable builds, and has plugins for everything, including multiple DocBook plugins. The right one is docbkx-tools. The rest of this post explains how I process DocBook documents using Maven and the docbkx plugin. (I do not want to install Publican - but if it can do all I need more easily, I might think about it :))

The plugin integrates with DocBook XSLT stylesheets and Apache FOP (for PDF generation), brings them in as Maven dependencies, and runs the whole thing.

It also provides a nice way to reference DocBook XSLT from the customization layer: “urn:docbkx:stylesheet” in the “href” attribute of an “import” references the appropriate stylesheet. This is better than referencing Oxygen’s copy of the stylesheets, which would make the build require Oxygen to be installed. It is also better than using explicitly installed DocBook stylesheets - I don’t want that as a pre-requisite for the build. Of course, the use of such references makes the build non-reproducible from the command line or Oxygen.

Stylesheet parameters can be set in three places: Oxygen (the most convenient), Maven POM or DocBook customization layer. Setting parameters in Oxygen makes the build non-reproducible in Maven or the command line. Setting them in the POM makes it non-reproducible in Oxygen. I am setting them in the customization layer.
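
A minimal sketch of what setting parameters in the customization layer might look like. The two parameters shown (section.autolabel and toc.section.depth) are standard DocBook XSL parameters picked only for illustration; their values are assumptions, not taken from the actual setup:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Resolved by the docbkx plugin to the stock DocBook stylesheet -->
  <xsl:import href="urn:docbkx:stylesheet"/>

  <!-- Illustrative values only: number the sections, limit the TOC depth -->
  <xsl:param name="section.autolabel" select="1"/>
  <xsl:param name="toc.section.depth" select="2"/>
</xsl:stylesheet>

Because the values live in a file that Maven, Oxygen and a command-line XSLT run all read, they do not have to be repeated per tool; only the import URN remains specific to the plugin.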

I also use other features of Maven (and Maven plugin) that make it more difficult to reproduce the build from the command line or from Oxygen:

  • entities for file references
  • filtering of the CSS files
  • file copying
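
As a sketch of the first item: an entity configured in the POM (such as the version entity in the POM below) is referenced from the DocBook source like any other XML entity. The surrounding article is hypothetical, and the fragment assumes that the plugin supplies the entity declaration at build time; exactly how it is injected may depend on the plugin version:

<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Example paper</title>
  <!-- The version entity is expected to be declared by the docbkx plugin -->
  <para>This is version &version; of the paper.</para>
</article>

Opened outside Maven, this fragment is not well-formed on its own, since nothing declares the entity there. That is the reproducibility cost in concrete form.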

PDF: FOP

DocBook XSLT generates XSL-FO. To generate PDF, some FO processor is needed.

RenderX XEP costs a lot of money, and their support didn’t answer my questions.

xmlroff handles fonts very well (it uses Pango), and I got a very fast response from the maintainer - but it is written in C (eew!) and seems to be no longer developed.

The docbkx plugin uses Apache FOP. It is not very actively developed, but it is slowly becoming better. For instance, since 2010 there has been no need to deal with font metrics; FOP can auto-detect installed fonts (and embed them in the PDF document). It has handled bi-directional documents since 2012.

Equations: MathML with MathJax and JEuclid

There is an XML format designed for equations - MathML. It has two flavors: presentation and content. Content MathML seems cleaner - but less supported :(
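
To make the difference concrete, here is the same expression, x squared plus one, in both flavors (the expression itself is just an illustration). Presentation MathML describes the layout:

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <!-- x with a superscript 2, then the plus sign, then 1 -->
    <msup><mi>x</mi><mn>2</mn></msup>
    <mo>+</mo>
    <mn>1</mn>
  </mrow>
</math>

Content MathML describes the meaning, as nested function applications:

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <!-- plus(power(x, 2), 1) -->
  <apply>
    <plus/>
    <apply><power/><ci>x</ci><cn>2</cn></apply>
    <cn>1</cn>
  </apply>
</math>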

Browser support for MathML varies. Chrome, for instance, dropped MathML support in 2013. To make sure that MathML displays properly in the browser, a JavaScript polyfill has to be used. MathJax seems to be the most actively developed one (and the one Chrome developers recommend).

For PDF, I need something that will render MathML. The only non-commercial candidate that I am aware of is the JEuclid plugin for FOP. JEuclid seems to be dead (again). Fedora ships jeuclid-fop, but it has been completely broken for years. Although the latest version of jeuclid-core (3.1.9) is available from Maven Central, jeuclid-fop is not - it must be made available locally. JEuclid only supports presentation MathML :)

EPUB3 mandates support for presentation MathML (support for content MathML is optional), so for EPUB3 e-readers at least presentation MathML shouldn’t be a problem. Indeed, my Kobo e-reader displays MathML properly (when EPUB3 support is triggered by giving the file the extension “kepub.epub”).

Graphics: SVG

Browsers and EPUB readers should support SVG, and so does FOP (via Batik), so diagrams should not be a problem.
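
A sketch of how an SVG diagram might be referenced from a DocBook 5 document. The mediaobject, imageobject and imagedata elements are standard DocBook; the file name diagram.svg is hypothetical, and the namespace is assumed to be declared on an enclosing element:

<mediaobject>
  <imageobject>
    <!-- fileref is combined with the img.src.path stylesheet parameter -->
    <imagedata fileref="diagram.svg" format="SVG"/>
  </imageobject>
</mediaobject>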

Editing: Oxygen

It is a pity that the ease of online editing (wiki-style) will be lost. I am using Oxygen (the best XML editor :)) to edit the documents. Oxygen supports all three XML formats mentioned. (At some point, XML editing will move to the cloud. Oxygen is already moving).

Oxygen’s MathML editing capabilities are limited; for SVG, I’ll use some other application.

My Setup

Maven POM

Relevant parts of the POM:

<properties>
  <!-- Generated content -->
  <tables.directory>${project.build.directory}/tables</tables.directory>

  <!-- Input directories -->
  <xsl.directory>src/main/xsl</xsl.directory><!-- XSLT customization layer -->
  <css.directory>src/main/css</css.directory><!-- HTML stylesheets -->
  <images.directory>src/main/images</images.directory>  <!-- Static content -->

  <!-- Output directories -->
  <docbkx.output.directory>
    ${project.build.directory}/docbkx
  </docbkx.output.directory>
  <docbkx.pdf.output.directory>
    ${docbkx.output.directory}/pdf
  </docbkx.pdf.output.directory>
  <docbkx.html.output.directory>
    ${docbkx.output.directory}/html
  </docbkx.html.output.directory>
  <!-- EPUB has two directories configured, since docbkx plugin puts
   expanded EPUB where it is told, and resulting EPUB file into the
   parent directory :) -->
  <docbkx.epub.output.directory>
    ${docbkx.output.directory}/epub
  </docbkx.epub.output.directory>
  <docbkx.epub.expanded.output.directory>
    ${docbkx.epub.output.directory}/expanded
  </docbkx.epub.expanded.output.directory>

  <!-- Fonts -->
  <body.font.family>Noto Sans, Noto Sans Hebrew</body.font.family>
  <body.font.master>14</body.font.master>
  <title.font.family>Noto Sans, Noto Sans Hebrew</title.font.family>
  <!-- In addition, symbol.font.family can be configured with
     catch-all symbol fonts. -->
</properties>

<build>
  <plugins>
    <!-- Generated content that needs to be included in DocBook is produced by
      executing the generating code. This has to happen before docbkx plugin is
      executed, so this plugin configuration has to precede that of docbkx -
      or it has to be bound to an earlier lifecycle phase. -->
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>exec-maven-plugin</artifactId>
      <version>1.3.2</version>

      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>java</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <mainClass>org.podval.calendar.paper.Tables</mainClass>
        <arguments>
          <argument>${tables.directory}</argument>
        </arguments>
      </configuration>
    </plugin>

    <plugin>
      <groupId>com.agilejava.docbkx</groupId>
      <artifactId>docbkx-maven-plugin</artifactId>
      <version>2.0.15</version>
    
      <dependencies>
        <dependency>
          <groupId>net.sf.docbook</groupId>
          <artifactId>docbook-xml</artifactId>
          <version>5.0-all</version>
          <classifier>resources</classifier>
          <type>zip</type>
          <scope>runtime</scope>
        </dependency>
        
        <!-- By default, docbkx plugin brings in FOP 1.0, but I want 1.1
          since it is newer and because of the Hebrew support. -->
        <dependency>
          <groupId>org.apache.xmlgraphics</groupId>
          <artifactId>fop</artifactId>
          <version>1.1</version>
          <scope>runtime</scope>
        </dependency>

        <!-- FOP does not handle MathML; it needs jEuclid plugin -->
        <dependency>
          <groupId>net.sourceforge.jeuclid</groupId>
          <artifactId>jeuclid-core</artifactId>
          <version>3.1.9</version>
        </dependency>

        <!-- jeuclid-fop is not available in Maven central repository and
          must be made available locally -->
        <dependency>
          <groupId>net.sourceforge.jeuclid</groupId>
          <artifactId>jeuclid-fop</artifactId>
          <version>3.1.9</version>
        </dependency>
      </dependencies>

      <configuration>
        <sourceDirectory>src/main/docbook</sourceDirectory>
        <includes>*.xml</includes>

        <xincludeSupported>true</xincludeSupported>
        <generatedSourceDirectory>
          ${tables.directory}
        </generatedSourceDirectory>

        <entities>
          <entity>
            <name>version</name>
            <value>${project.version}</value>
          </entity>
          <entity>
            <name>tables-directory</name>
            <value>${tables.directory}</value>
          </entity>
        </entities>
      </configuration>

      <!-- Some configuration that could be placed in the XSL files was placed
           in the POM:
           - img.src.path: to be near the related copying of the images
           - font configuration: to centralize configuration and avoid
             duplication (filtering is used to patch font configuration into
             CSS files)
      -->

      <executions>
        <execution>
          <id>pdf</id>
          <phase>package</phase>
          <goals>
            <goal>generate-pdf</goal>
          </goals>
          <configuration>
            <targetDirectory>${docbkx.pdf.output.directory}</targetDirectory>
            <foCustomization>${xsl.directory}/pdf.xsl</foCustomization>

            <!-- Fonts -->
            <bodyFontFamily>${body.font.family}</bodyFontFamily>
            <bodyFontMaster>${body.font.master}</bodyFontMaster>
            <titleFontFamily>${title.font.family}</titleFontFamily>

            <!-- Images -->
            <imgSrcPath>${images.directory}</imgSrcPath>

            <!-- FOP -->
            <externalFOPConfiguration>
              ${basedir}/src/main/fop/fop.xconf
            </externalFOPConfiguration>
          </configuration>
        </execution>

        <execution>
          <id>html</id>
          <phase>package</phase>
          <goals>
            <goal>generate-html</goal>
          </goals>
          <configuration>
            <targetDirectory>${docbkx.html.output.directory}</targetDirectory>
            <htmlCustomization>${xsl.directory}/html.xsl</htmlCustomization>

            <!-- CSS -->
            <htmlStylesheet>css/docbook.css</htmlStylesheet>

            <!-- Images -->
            <imgSrcPath>images/</imgSrcPath>

            <preProcess>
              <copy todir="${docbkx.html.output.directory}/css"
                filtering="true">
                <fileset dir="${css.directory}"/>
                <!-- Fonts (via filtering) -->
                <filterset>
                  <filter token="body.font.family"
                    value="${body.font.family}"/>
                  <filter token="title.font.family"
                    value="${title.font.family}"/>
                </filterset>
              </copy>
              <copy todir="${docbkx.html.output.directory}/images">
                <fileset dir="${images.directory}"/>
              </copy>
            </preProcess>
          </configuration>
        </execution>

        <execution>
          <id>epub</id>
          <phase>package</phase>
          <goals>
            <goal>generate-epub</goal>
          </goals>

          <configuration>
            <targetDirectory>
              ${docbkx.epub.expanded.output.directory}
            </targetDirectory>
            <epubCustomization>${xsl.directory}/epub.xsl</epubCustomization>

            <!-- CSS -->
            <htmlStylesheet>css/docbook.css</htmlStylesheet>

            <!-- Images -->
            <imgSrcPath>images/</imgSrcPath>

            <preProcess>
              <copy todir="${docbkx.epub.expanded.output.directory}/css"
                filtering="true">
                <fileset dir="${css.directory}"/>
                <!-- Fonts (via filtering) -->
                <filterset>
                  <filter token="body.font.family"
                    value="${body.font.family}"/>
                  <filter token="title.font.family"
                    value="${title.font.family}"/>
                </filterset>
              </copy>
              <copy todir="${docbkx.epub.expanded.output.directory}/images">
                <fileset dir="${images.directory}"/>
              </copy>
            </preProcess>
          </configuration>
        </execution>
      </executions>
    </plugin>

    <!-- Results are packaged using assembly plugin.
      This way there is no need to align the directories between the "paper"
      project and "web site" project: resulting artifact can be retrieved
      and unpacked using dependency plugin. -->
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>2.5.3</version>
      <configuration>
        <descriptors>
          <descriptor>src/main/assembly/assembly.xml</descriptor>
        </descriptors>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

DocBook XSLT Customization

I need to set various DocBook XSLT parameters and customize some of the templates. My customization layer has an XSLT file for each of the three formats I need: pdf.xsl, html.xsl and epub.xsl. Each of them imports the original DocBook XSLT stylesheet for the appropriate format, that is, the stylesheet being customized. I use the nice way that the docbkx plugin provides:

<xsl:import href="urn:docbkx:stylesheet"/>

or, for chunked HTML:

<xsl:import href="urn:docbkx:stylesheet/chunk.xsl"/>

To avoid repeating common customizations, I use two more XSLT files: common-html.xsl and common.xsl. I import common-html.xsl in both html.xsl and epub.xsl, immediately after the import of the DocBook stylesheet and before other customization. Both common-html.xsl and pdf.xsl import common.xsl.
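
The resulting import chain for html.xsl can be sketched as follows. This is an outline of the structure described above, not the actual file. In XSLT 1.0 all imports have to come before any other top-level element, and a later import takes precedence over an earlier one, which is why the shared customizations come second:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Stock DocBook HTML stylesheet: lowest import precedence -->
  <xsl:import href="urn:docbkx:stylesheet"/>
  <!-- Shared HTML/EPUB customizations; this file in turn imports common.xsl -->
  <xsl:import href="common-html.xsl"/>

  <!-- HTML-only parameters and templates follow the imports -->
</xsl:stylesheet>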

I do not use Oxygen’s UI to set the transformation parameters, so that the build works even where Oxygen is not installed. The price: build from Oxygen is different from the “official” Maven-based build; but I can still use Oxygen to edit the documents.

This is the customization that I use in html.xsl to enable MathJax:

<xsl:template name="user.head.content">
  <script type="text/javascript">
    window.MathJax = {
      MathML: {
        extensions: [ "content-mathml.js", "mml3.js" ]
      }
    };
  </script>
  <script type="text/javascript"
    src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=MML_HTMLorMML"/>
</xsl:template>

Fonts

The current approach uses Maven filtering of the CSS files to ensure that the fonts configured in the POM are used for both PDF and HTML. This will change:

  • it is unlikely that the same fonts are optimal for both
  • in CSS, setting the fonts is not enough; they have to be installed or - better - imported
  • fonts used for PDF have to be installed on the system
  • for EPUB, fonts need to be embedded?

FOP Configuration

Under src/main/fop, I have a fop.xconf file with the following:

<?xml version="1.0" encoding="UTF-8"?>
<fop version="1.0">
  <renderers>
    <renderer mime="application/pdf">
      <fonts>
        <!-- FOP will look for fonts installed in the operating system. -->
        <auto-detect/>
      </fonts>
    </renderer>
  </renderers>
</fop>
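
If depending on system-installed fonts turns out to be a problem, the FOP font configuration also accepts a directory element next to auto-detect. A sketch of that variant, which I have not adopted; the path is a hypothetical placeholder:

<?xml version="1.0" encoding="UTF-8"?>
<fop version="1.0">
  <renderers>
    <renderer mime="application/pdf">
      <fonts>
        <!-- Hypothetical location; point this at wherever the font files live -->
        <directory recursive="true">/path/to/project/fonts</directory>
        <!-- Still fall back to fonts installed in the operating system -->
        <auto-detect/>
      </fonts>
    </renderer>
  </renderers>
</fop>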

Editing

HTML and (HTML-based) EPUB builds are more difficult to reproduce in Oxygen than the PDF:

  • Images and stylesheets need to be copied to the output; the XSLT processor doesn’t do it. In Maven, I use the preProcess element of the docbkx configuration. In Oxygen, I’ll need to write scripts…
  • For HTML builds, font family and size need to be inserted into the CSS stylesheet (for PDF, setting appropriate parameters is enough). In Maven, I use filtering when copying the stylesheets. In Oxygen, I do not know how to do this without duplicating the information.

Tables (generated content)

An included document has to be a valid DocBook section or article; otherwise it isn’t recognized as DocBook at all and is ignored. Strictly speaking, it is recognized when the DocBook namespace is used, but Oxygen, of course, complains about the included file. To shut everybody up, I have to wrap the table into:

<section>
    <title>dummy title</title>
    <informaltable>...</informaltable>
</section>

When I include this document, only the table itself has to be included. Oxygen uses (?) Xerces to validate DocBook. Xerces does not support the xpointer() XPointer scheme, so to select what I want I cannot say xpointer="xpointer(//informaltable)". Whatever DocBook Maven plugin I use, it does not understand the xpointer() scheme either. Both understand element(/1/2), and it works, but it is ugly and fragile. Interestingly, Oxygen (Xerces?) understands xpointer="ID", but the Maven plugin does not!
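
A sketch of such an include, using the element() scheme with the child sequence /1/2: /1 selects the root element of the included file (the wrapper section) and /2 its second child (the informaltable, after the dummy title). The href and the surrounding article are hypothetical:

<article xmlns="http://docbook.org/ns/docbook"
         xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">
  <title>Example paper</title>
  <section>
    <title>Tables</title>
    <!-- Pulls in only the informaltable from the wrapped, generated file -->
    <xi:include href="tables/example.xml" xpointer="element(/1/2)"/>
  </section>
</article>

The fragility is visible here: if the generating code ever emits anything between the title and the table, /1/2 silently selects the wrong element.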

Paper and Site

I package the results using the assembly plugin. This way I do not need to align the directories between the “paper” project and the “web site” project: the resulting artifact can be retrieved and unpacked using the dependency plugin.

The assembly descriptor is in src/main/assembly/assembly.xml and looks like this:

<assembly
  xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2
  http://maven.apache.org/xsd/assembly-1.1.2.xsd">
  <id>bin</id>
  <formats>
    <format>tar.gz</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <fileSets>
    <fileSet>
      <directory>${docbkx.pdf.output.directory}</directory>
      <outputDirectory>pdf</outputDirectory>
      <includes>
        <include>*.pdf</include>
      </includes>
    </fileSet>
    <fileSet>
      <directory>${docbkx.epub.output.directory}</directory>
      <outputDirectory>epub</outputDirectory>
      <includes>
        <include>*.epub</include>
      </includes>
    </fileSet>
    <fileSet>
      <directory>${docbkx.html.output.directory}</directory>
      <outputDirectory>html</outputDirectory>
      <includes>
        <include>**</include>
      </includes>
    </fileSet>
  </fileSets>
</assembly>

Finally, this is how the “site” project incorporates the paper:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <version>2.10</version>
  <executions>
    <execution>
      <id>unpack</id>
      <phase>package</phase>
      <goals>
        <goal>unpack</goal>
      </goals>
      <configuration>
        <artifactItems>
          <artifactItem>
            <groupId>...</groupId>
            <artifactId>...</artifactId>
            <version>...</version>
            <type>tar.gz</type>
            <classifier>bin</classifier>
            <overWrite>true</overWrite>
            <outputDirectory>.</outputDirectory>
          </artifactItem>
        </artifactItems>
      </configuration>
    </execution>
  </executions>
</plugin>