How to Convert a Wordpress Blog into Markdown

wordpress
markdown
dataset
Author

Christian Wittmann

Published

March 8, 2024

In this blog post, I will guide you through the steps to convert a Wordpress blog into markdown. While this may seem unrelated to the subject of this blog, it is preparation for writing a Retrieval Augmented Generation (RAG) blog post / notebook.

Why did I turn this conversion into a blog post of its own? First of all, the conversion process was more difficult and extensive than expected, so I felt it was worth sharing. Additionally, it highlights (again) that data is key in any machine learning project, and that data preparation can be a project of its own.

By now, the Wittmann-Tours blog is available in my Wittmann-Tours GitHub repo under a CC BY-NC license.

Dalle: The Conversion from Wordpress to Markdown - with some edits in Photoshop

Before we Start

Please treat this blog post as the personal documentation of how I did the conversion. I was somewhat surprised that there were so few resources on the internet covering the topic of converting Wordpress to markdown. I am definitely no expert on this subject, but by following the steps documented in this blog post, I got the job done.

After a bit of research I ended up working with this repo from Swizec. Thanks for putting this repo out there!

Step 1: Export the XML from Wordpress

The first step of the conversion process is to export your Wordpress blog content as an XML file. Here’s how to do it:

Navigate to the export function of your Wordpress blog by entering your site’s URL followed by /wp-admin/export.php, for example, https://wittmann-tours.de/wp-admin/export.php. Alternatively, you can navigate like this:

  • Log into your Wordpress Dashboard. Navigate to the admin area of your Wordpress blog by entering your site’s URL followed by /wp-admin. Use your credentials to log in.
  • Access the Tools section. Once logged in, look for the Tools option in the left-hand sidebar. Hover over it, and you will see a dropdown menu.
  • Select Export. In the dropdown menu under Tools, click on Export. This will take you to a page where you can choose what content you want to export. For a complete backup of your site, select All content.

Finally, you can download the export file: After selecting All content, click on the Download Export File button. Wordpress will generate an XML file containing all your selected data. Save this file to your computer.
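Before moving on, it can help to confirm that the downloaded file really contains your content. The following is a minimal sketch, assuming the export is a standard Wordpress WXR file in which every post, page, or attachment is wrapped in an `<item>` element. The function name `count_items` is made up for this illustration.

```bash
#!/usr/bin/env bash
# Sketch: sanity-check a Wordpress export before converting it.
# Assumption: the export is a WXR file where each post, page, or
# attachment is wrapped in an <item> element.

# Print the number of <item> elements found in the given XML file.
count_items() {
  local xml_file="$1"
  # grep -o prints every match on its own line, so wc -l counts
  # occurrences even if several items share one line.
  grep -o '<item>' "$xml_file" | wc -l | tr -d ' '
}
```

Calling `count_items my-export.xml` (the file name is a placeholder for your own export) prints a single number. Because pages and attachments are counted as well, the number will usually be higher than the number of blog posts, but a result of 0 suggests that the export did not work.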

Step 2: Check Software Requirements

Depending on your setup, you might need to install some software first. Here is what we need:

  1. Node.js: Node.js is a runtime environment that allows you to run JavaScript code outside of a web browser. It’s commonly used for server-side scripting and building backend services (like APIs), but it’s also used in tooling for front-end development, automation tasks, and more. In this case, Node.js is used to run the wordpress-to-markdown conversion script.

  2. npm (Node Package Manager): npm is the default package manager for Node.js. It is used to install and manage dependencies (libraries, frameworks, tools, etc.) required by Node.js applications. npm facilitates easy sharing and reuse of code. When you install Node.js, npm should be included in the installation. In this case, we need npm to install Yarn.

  3. Yarn: Yarn is an alternative package manager to npm. It performs the same basic function as npm (managing dependencies for Node.js applications) but often with some differences in performance, features, and the way dependencies are handled. In this case, Yarn was used to manage the dependencies of the wordpress-to-markdown script.

Step 3: Install Node.js and Yarn

If your system already fulfills these software requirements, feel free to skip this section.

Installing Node.js

  1. Download Node.js. Visit the official Node.js website to download the latest version of Node.js. Choose the version that is compatible with your operating system.

  2. Install Node.js. Follow the installation prompts to install Node.js on your system. The installer will guide you through the process.

  3. Verify the installation. To ensure that Node.js was installed correctly, open a terminal or command prompt and type the following commands:

         node -v
         npm -v

     These commands will display the versions of Node.js and npm installed on your system. Seeing the version numbers confirms that the installation was successful.

Installing Yarn

  1. Open your terminal or command prompt.

  2. Install Yarn globally using npm. Type the following command:

         npm install -g yarn

     If you encounter permission errors, it might be necessary to run the command as an administrator or with superuser rights. In such cases, use:

         sudo npm install -g yarn

     This will prompt you for your password to grant the necessary permissions.

  3. Verify the installation. To check if Yarn has been installed correctly, run:

         yarn -v

     This command will display the version of Yarn installed, indicating that the installation was successful.

Final Checks

  • Check the PATH. It’s important to ensure that the installation paths for Node.js and Yarn are correctly added to your system’s PATH environment variable. This allows you to run these tools from any directory in your terminal. To check your PATH, type:

         echo $PATH

    Verify that the paths to Node.js and Yarn are included in the output.

After completing these steps, your system will be equipped with Node.js and Yarn, ready for the next phase of converting your Wordpress blog into Markdown.
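The individual version checks above can also be combined into one small helper. This is only a sketch: the function name `check_tools` is made up for this illustration, and it relies on the shell built-in `command -v`, which reports where a command is found on the PATH.

```bash
#!/usr/bin/env bash
# Sketch: report which of the given command-line tools are on the PATH.

# Print one line per tool; return non-zero if at least one is missing.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool ($(command -v "$tool"))"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_tools node npm yarn git` covers everything this guide uses. git is included because the cloning step in Step 4 needs it, even though it is not part of the list in Step 2.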

Step 4: Clone the Repository and Run the Conversion Script

In this step we clone the GitHub repository and run the conversion script:

  1. Open your terminal or command prompt: Ensure you’re in the directory where you want to clone the repository.

  2. Clone the repository: Execute the following command to clone the wordpress-to-markdown repository created by Swizec:

         git clone https://github.com/Swizec/wordpress-to-markdown

     This command downloads the repository to your local machine in a folder named wordpress-to-markdown.

  3. Navigate to the repository directory: Change into the newly cloned directory to run the conversion commands:

         cd wordpress-to-markdown

  4. Install dependencies: Before running the conversion script, you must install its dependencies. Use Yarn to install them by executing:

         yarn install

     This command reads the package.json file in the repository and installs all the necessary packages and dependencies required to run the conversion script.

  5. Copy the XML file into the wordpress-to-markdown directory: Copy the XML file you downloaded in Step 1 into the wordpress-to-markdown directory.

  6. Adjust script or rename XML-file: Either rename your XML-file to test-wordpress-dump.xml or change line 25 of convert.js to the file name of your XML.

  7. Run the conversion script: After installing the dependencies, you can now run the conversion script with Yarn:

         yarn convert

     This command initiates the conversion process, which reads your exported Wordpress XML file and converts its contents into Markdown files.

Once this step is completed, you have successfully converted your Wordpress blog content into Markdown mdx-files. The files are stored in a new out-directory, containing one sub-directory per blog post.
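For reference, the steps of this section can be sketched as one shell function. It assumes that the script still expects an input file named test-wordpress-dump.xml, as described above, so it copies the export under that name instead of editing convert.js. The function name `run_conversion` is made up for this illustration.

```bash
#!/usr/bin/env bash
# Sketch: clone, install, copy, and convert in one go.
# Assumption: convert.js reads test-wordpress-dump.xml from the repository
# directory, as described in the steps above.

run_conversion() {
  local xml_file="$1"

  # Stop early if the export from Step 1 cannot be found.
  if [ ! -f "$xml_file" ]; then
    echo "export file not found: $xml_file" >&2
    return 1
  fi

  git clone https://github.com/Swizec/wordpress-to-markdown || return 1

  # Copy the export under the file name the script expects.
  cp "$xml_file" wordpress-to-markdown/test-wordpress-dump.xml || return 1

  # Run the remaining commands in a subshell so that the caller's
  # working directory stays unchanged.
  (
    cd wordpress-to-markdown || exit 1
    yarn install && yarn convert
  )
}
```

A call like `run_conversion ~/Downloads/my-export.xml` (the path is a placeholder) should leave the results in wordpress-to-markdown/out.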

Step 5: Convert mdx-files to md-files

So far so good, but I was not yet 100% happy, because the mdx-files did not contain a proper level-1-heading, and Obsidian ignored the files.

To convert the mdx-files to md-files, I created a quick conversion notebook which made these final adjustments.
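The notebook itself is not reproduced here. As a rough idea of what such an adjustment can look like, here is a minimal shell sketch, which is not the code from the notebook. It assumes that every mdx-file starts with YAML front matter between two `---` lines and that the front matter contains a `title:` field. Both points are assumptions, so check one of your own files first. The function names `mdx_to_md` and `convert_all` are made up for this illustration.

```bash
#!/usr/bin/env bash
# Sketch only: this is NOT the author's notebook, just one possible approach.
# Assumptions: each .mdx file starts with YAML front matter between two
# "---" lines, and the front matter has a "title:" field.

# Write a .md copy of one .mdx file, adding "# <title>" after the front matter.
mdx_to_md() {
  local src="$1"
  local dst="${src%.mdx}.md"
  local title

  # Take the first "title:" line and drop the key.
  title=$(sed -n 's/^title:[[:space:]]*//p' "$src" | head -n 1)
  # Strip one pair of surrounding quotes, if present.
  title=${title#\"}; title=${title%\"}
  title=${title#\'}; title=${title%\'}

  # Copy every line; right after the second "---" line (the end of the
  # front matter), emit a blank line and the level-1 heading.
  TITLE="$title" awk '
    { print }
    /^---[ \t]*$/ { n++; if (n == 2) { print ""; print "# " ENVIRON["TITLE"] } }
  ' "$src" > "$dst"
}

# Convert every .mdx file below a directory; the originals are left in place.
convert_all() {
  local dir="$1"
  find "$dir" -type f -name '*.mdx' -print0 |
    while IFS= read -r -d '' file; do
      mdx_to_md "$file"
    done
}
```

Running `convert_all out` inside the wordpress-to-markdown directory writes an md-file next to each mdx-file. The front matter is kept, since it may contain metadata such as the publication date.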

Conclusion

Converting a Wordpress blog into Markdown turned out to be more complex than anticipated. Somehow I had anticipated there would be a simple, straightforward Wordpress plugin to get this done quickly, but no…

In the process of doing the conversion, I decided to document each step in detail within this blog post. Not only did I want a reference for myself, knowing that revisiting the process even after a few weeks could be challenging without detailed notes, but I also hope this guide is useful for you reading this blog post.

With that, the data source for my RAG project is finally prepared, which turned out to be a project of its own.