Azure DevOps Wiki Export

Lately I needed to export an Azure DevOps wiki as one PDF. There is a plugin that claims it can do this, and of course you can export each page in the browser and concatenate the results with tools like pdftk. Unfortunately, the plugin is at a very early stage and I did not have any control over the Azure DevOps instance. The latter option felt like losing…

Hence, I searched for a “computer scientist” solution: I downloaded the wiki repo and installed pandoc.

$ brew install pandoc
$ brew install --cask basictex

Next I took a look at the format of the wiki repo. Every folder contains a file named .order, which lists the file names (without extension) of the articles in that folder, in the order they are shown in the sidebar. Other tools (like GitLab) handle this differently (e.g., alphabetically or via a manually maintained “_sidebar” page). In the Azure DevOps wiki, an article with sub-articles is realized as a .md file for the article itself plus a folder named like the article (but without the extension) that holds the sub-articles.
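
To illustrate the layout, a made-up excerpt of such a wiki checkout could look like this:

.order              # contains: Home, Setup
Home.md
Setup.md
Setup/              # folder with the sub-articles of Setup.md
  .order            # contains: Installation, Configuration
  Installation.md
  Configuration.md
.attachments/
  diagram.png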

The file name of an article is at the same time its title, i.e., the title is not part of the article content. Hyphens in the file name stand for spaces, while special characters (including literal hyphens) are percent-encoded, so the hyphen replacement has to happen before the URL decoding. In pseudo code, the title is derived from the file name like this:

filename.removeEnding().replace("-", " ")
  .urldecode().prepend("# ")

Additionally, the attachments (e.g., images) are referenced with a leading /, which has to be removed. So let’s prepare the markdown files, i.e., fix the references to the attachments and add the title to each article (please ensure that you have Python installed):

$ alias urldecode='python3 -c "import sys, urllib.parse as ul; \
    print(ul.unquote_plus(sys.argv[1]))"'
$ find . -iname "*.md" -exec \
  sed -i '' 's@/\.attachments@.attachments@g' {} \;
$ find . -iname "*.md" | while read -r file; do \
  name=`basename "$file" .md | sed 's/-/ /g'`; \
  name=`urldecode "$name"`; \
  { echo "# $name"; echo; cat "$file"; } > "$file.new" && \
  mv "$file.new" "$file"; done
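
To spot-check the result on a single page (file name made up), the first line should now contain the decoded title:

$ head -n 1 "Getting-Started%21.md"
# Getting Started!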

Since one PDF document should be created from the complete wiki, we need to recursively find the files in the right order using the information from the .order files. I created a script for that; it should be on the PATH later on. Since I downloaded the repo as a ZIP file instead of cloning the repository, the files still have Windows (CRLF) line endings. If you cloned the repo, this should work without the tr -d '\r' line of the script, and you will not need to add a line ending to the last line of each .order file (not shown here). The script is named find-wiki-files-ordered:

#!/bin/bash
# Print the wiki's .md files recursively, in the order given by the .order files.
workdir="$(cd "$1" && pwd)"   # make the path absolute so we can always cd back
cd "$workdir"
file=".order"
while IFS= read -r line; do
  # strip Windows line endings (only needed when the repo was downloaded as a ZIP)
  line="$(echo "$line" | tr -d '\r')"
  echo "${workdir}/${line}.md"
  if [ -d "$line" ]; then
    cd "$line"
    find-wiki-files-ordered "$(pwd)"
    cd "$workdir"
  fi
done < "$file"
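
To make the script available on the PATH and give it a quick try, something like this should do (the install location is just an example):

$ chmod +x find-wiki-files-ordered
$ mv find-wiki-files-ordered /usr/local/bin/
$ find-wiki-files-ordered "$(pwd)" | head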

This script can then be used to iterate through all the files, concatenate them with a blank line between files (so that the main titles of the articles are recognized correctly by the Markdown parser), and feed everything to pandoc to create a PDF.

$ for file in $(find-wiki-files-ordered .); do \
  cat "$file"; printf '\n\n'; \
  done | pandoc -o output.pdf

So, this was fun and the result looks like a paper 🥸. Why is that? Because pandoc uses LaTeX as an intermediate format. Hence, a really good typesetting engine is used, with much better hyphenation than Microsoft Word or your browser. Unfortunately, not all links are resolved (e.g., links to issues or persons in Azure DevOps), not all features are supported (like [[TOC]]), and images are realized via floats, which is sometimes a good thing but sometimes leads to odd positioning of figures.
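
Some of this can be tuned directly on the pandoc call. As a sketch, a generated table of contents can stand in for the unsupported [[TOC]], and the page margins can be set via the geometry variable of pandoc’s LaTeX template (the margin value is just an example):

$ for file in $(find-wiki-files-ordered .); do \
  cat "$file"; printf '\n\n'; \
  done | pandoc --toc -V geometry:margin=2.5cm -o output.pdf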

A way to mitigate some of these issues is to let pandoc output .tex instead of .pdf, adapt the LaTeX code, and then use pandoc again (or add a self-made preamble and use pdflatex directly) to create a PDF from it. The major downside is that this introduces round-trip engineering: every change applied to the generated .tex has to be redone after each generation. Therefore, each change should either be made via a static LaTeX preamble, by parametrizing pandoc, or automatically via a generator.
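
A minimal sketch of that two-step flow, assuming the same concatenation as above (a custom preamble could additionally be injected via pandoc’s -H/--include-in-header option):

$ for file in $(find-wiki-files-ordered .); do \
  cat "$file"; printf '\n\n'; \
  done | pandoc -s -o output.tex
$ # adapt output.tex by hand or via a generator, then:
$ pdflatex output.tex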

Exciting 🤓.
