Confluence to WikiJS
Over at Refactor we have a client using on-prem Atlassian Confluence which was prohibitively expensive to move to the cloud version (Atlassian sadly recently stopped supporting their en-prem server edition). After showing them WikiJS we were tasked with the challenge of getting all of their content from the existing system into a WikiJS server. Primary objectives were to ensure better navigation, more helpful search and a generally more pleasant look and feel.
We started with an export of the existing confluence. This resulted in a folder which a whole bunch of html files, along with a few extra directories of attachments/css/etc. Our next step was to try to get them into a WikiJS installation.
We did not want to simply load the files into an existing wikijs service. If it did not work the way we wanted it we would of needed to restore from previous backups, or tried to delete files and fix up existing pages. Messy at best! To solve this we welcomed our friend Docker Compose. The WikiJS project provides a docker-compose.yml file so this made it super simple. git pull, modify the docker-compose.yml file to mount a directory we can use for transferring files into the container, docker compose up -d, navigate to http://localhost and go through the setup.
From there we dumped the confluence export directly into the mounted folder, told WikiJS to import all the files as waited. The results were… less than ideal. Navigation was non-existent. Confluence added a bunch of “This page was last modified by”, and “Most recently edited pages”, and non-content related items to each of the exported pages. We investigated any tools that were available to fix the exported files to something WIkiJS might find more suitable. We tried a few but did not find a tool that seemed to work “out of the box”, at least not with the exported site we were working with. Time for a custom solution.
This was not a project we could spend a lot of time working on. The result was a “quick-n-dirty” console program written in the go programming language, which does the following…
- Compiles a list of all the files in the export. Saves them in a data structure with a “From” and a “To” key. Both set to the same value, as it appears in the original export file structure.
- Loops through this list, and if it is a html file parses the file and tries to find if it has any “breadcrumb” section. If it does, works out what the hierarchy is for this page, and updates the “To” key to represent this new location
- Loops through the list again. This time if its not a html file, it copies the file into a target directory. However if it is a html file, it will parse the file again and remove the breadcrumb section. At this point we want to drop as many parts of the document as we can, to cleanup any extra elements that Confluence added that we are not interested with. Then we needed to do two things. If the html file has been moved to a different hierarchy, every link in the file will need to be modified to point a point relative to where this file has been moved. Also if there are any other links to other files that have been moved, they will also need to be updated. The eventual solution was a horribly inefficient series of loops over the mapped file locations and the html document structure, but even for a reasonably large set of files go seems to scream through them at a fantastic pace. “quick-n-dirty” is the active word here! Finally the modified html file is rendered into its new position in the destination folder.
Now running the import into WikiJS results in a nice hierarchy, with images and attachments that are all located correctly, the site looks nice and the WikiJS search feature seems to work much better for our client than the Confluence site. Although the code is not exactly production quality, it fits its purpose well and is available on our github page.