In this blog post, I will guide you through the steps to convert a Wordpress blog into markdown. While this may seem unrelated to the usual subject of this blog, it is preparation for writing a Retrieval Augmented Generation (RAG) blog post / notebook.
Why did I turn this conversion into a blog post of its own? First of all, the conversion process was more difficult and extensive than expected, so I felt it was worth sharing. Additionally, it highlights (again) that data is key in any machine learning project, and that data preparation can be a project of its own.
By now, the Wittmann-Tours blog is available in my Wittmann-Tours GitHub repo under a CC BY-NC license.
Before we Start
Please treat this blog post as the personal documentation of how I did the conversion. I was somewhat surprised that there were so few resources on the internet covering the topic of converting Wordpress to markdown. I am definitely no expert on this subject, but by following the steps documented in this blog post, I got the job done.
After a bit of research I ended up working with this repo from Swizec. Thanks for putting this repo out there!
Step 1: Export the XML from Wordpress
The first step of the conversion process is to export your Wordpress blog content as an XML file. Here’s how to do it:
Navigate to the export function of your Wordpress blog by entering your site’s URL followed by /wp-admin/export.php, for example, https://wittmann-tours.de/wp-admin/export.php. Alternatively, you can navigate like this:
- Log into your Wordpress Dashboard. Navigate to the admin area of your Wordpress blog by entering your site’s URL followed by /wp-admin. Use your credentials to log in.
- Access the Tools section. Once logged in, look for the Tools option in the left-hand sidebar. Hover over it, and you will see a dropdown menu.
- Select Export. In the dropdown menu under Tools, click on Export. This will take you to a page where you can choose what content you want to export. For a complete backup of your site, select All content.
Finally, you can download the export file: After selecting All content, click on the Download Export File button. Wordpress will generate an XML file containing all your selected data. Save this file to your computer.
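Before moving on, it can be worth checking that the export really contains your posts. The following is a minimal sketch and not part of my original workflow. It assumes the export follows the usual WXR layout, in which every entry is an item element with a wp:post_type child. The function name count_post_types and the inline sample are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_post_types(source):
    """Count exported items per post type (e.g. post, page, attachment).

    `source` is a file path or a binary file object holding the export XML.
    """
    root = ET.parse(source).getroot()
    counts = Counter()
    for item in root.iter("item"):
        for child in item:
            # Match on the local name only, because the namespace URL behind
            # the wp: prefix can differ between export format versions.
            if child.tag.endswith("}post_type"):
                counts[(child.text or "").strip()] += 1
    return counts


# Tiny made-up sample standing in for a real export file (assumption).
SAMPLE = b"""<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item><title>Hello</title><wp:post_type>post</wp:post_type></item>
    <item><title>About</title><wp:post_type>page</wp:post_type></item>
    <item><title>World</title><wp:post_type>post</wp:post_type></item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for post_type, n in sorted(count_post_types(io.BytesIO(SAMPLE)).items()):
        print(f"{post_type}: {n}")
```

For a real export, pass the path of the downloaded file instead of the in-memory sample. If the number of posts is far below what you expect, the export selection in Wordpress is the first thing to re-check.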
Step 2: Check Software Requirements
Depending on your setup, you might need to install some software first. Here is what we need:
- Node.js: Node.js is a runtime environment that allows you to run JavaScript code outside of a web browser. It’s commonly used for server-side scripting and building backend services (like APIs), but it’s also used in tooling for front-end development, automation tasks, and more. In this case, Node.js is used to run the wordpress-to-markdown conversion script.
- npm (Node Package Manager): npm is the default package manager for Node.js. It is used to install and manage dependencies (libraries, frameworks, tools, etc.) required by Node.js applications. npm facilitates easy sharing and reuse of code. When you install Node.js, npm should be included in the installation. In this case, we need npm to install Yarn.
- Yarn: Yarn is an alternative package manager to npm. It performs the same basic function as npm (managing dependencies for Node.js applications) but often with some differences in performance, features, and the way dependencies are handled. In this case, Yarn was used to manage the dependencies of the wordpress-to-markdown script.
Step 3: Install Node.js and Yarn
If your system already fulfills these software requirements, feel free to skip this section.
Installing Node.js
Download Node.js. Visit the official Node.js website to download the latest version of Node.js. Choose the version that is compatible with your operating system.
Install Node.js. Follow the installation prompts to install Node.js on your system. The installer will guide you through the process.
Verify the installation. To ensure that Node.js was installed correctly, open a terminal or command prompt and type the following commands:
    node -v
    npm -v
These commands will display the versions of Node.js and npm installed on your system. Seeing the version numbers confirms that the installation was successful.
Installing Yarn
Open your terminal or command prompt.
Install Yarn globally using npm. Type the following command:
    npm install -g yarn
If you encounter permission errors, it might be necessary to run the command as an administrator or with superuser rights. In such cases, use:
    sudo npm install -g yarn
This will prompt you for your password to grant the necessary permissions.
Verify the installation. To check if Yarn has been installed correctly, run:
    yarn -v
This command will display the version of Yarn installed, indicating that the installation was successful.
Final Checks
- Check the PATH. It’s important to ensure that the installation paths for Node.js and Yarn are correctly added to your system’s PATH environment variable. This allows you to run these tools from any directory in your terminal. To check your PATH, type:
    echo $PATH
Verify that the paths to Node.js and Yarn are included in the output.
After completing these steps, your system will be equipped with Node.js and Yarn, ready for the next phase of converting your Wordpress blog into Markdown.
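The PATH check above can also be scripted. This is a small sketch using only the Python standard library. The helper name find_tools is hypothetical, and git is included on the assumption that you will clone the repository in the next step.

```python
import shutil


def find_tools(names):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # git is not listed in the requirements above, but cloning needs it.
    for name, path in find_tools(["node", "npm", "yarn", "git"]).items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

A tool reported as not found is either not installed or installed in a directory that is missing from your PATH.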
Step 4: Clone the Repository and Run the Conversion Script
In this step we clone the GitHub repository and run the conversion script:
Open your terminal or command prompt: Ensure you’re in the directory where you want to clone the repository.
Clone the repository: Execute the following command to clone the wordpress-to-markdown repository created by Swizec:
    git clone https://github.com/Swizec/wordpress-to-markdown
This command downloads the repository to your local machine in a folder named wordpress-to-markdown.
Navigate to the repository directory: Change into the newly cloned directory to run the conversion commands:
    cd wordpress-to-markdown
Install dependencies: Before running the conversion script, you must install its dependencies. Use Yarn to install them by executing:
    yarn install
This command reads the package.json file in the repository and installs all the packages required to run the conversion script.
Copy the XML file into the wordpress-to-markdown directory: Copy the XML file you downloaded in Step 1 into the wordpress-to-markdown directory.
Adjust the script or rename the XML file: Either rename your XML file to test-wordpress-dump.xml or change line 25 of convert.js to the file name of your XML.
Run the conversion script: After installing the dependencies, you can now run the conversion script with Yarn:
    yarn convert
This command initiates the conversion process, which reads your exported Wordpress XML file and converts its contents into Markdown files.
Once this step is completed, you have successfully converted your Wordpress blog content into Markdown mdx-files. The files are stored in a new out-directory, containing one sub-directory per blog post.
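To get a quick overview of what the conversion produced, a few lines of Python can count the sub-directories and flag any that contain no mdx-file. This is only a sketch under the layout described above (one sub-directory per post). The function name summarize_output and the file names in the demo are hypothetical, since the exact file names written by the script are not covered here.

```python
from pathlib import Path
import tempfile


def summarize_output(out_dir):
    """Return (number of post sub-directories, names of those without any .mdx file)."""
    post_dirs = sorted(p for p in Path(out_dir).iterdir() if p.is_dir())
    without_mdx = [p.name for p in post_dirs if not any(p.glob("*.mdx"))]
    return len(post_dirs), without_mdx


if __name__ == "__main__":
    # Build a throwaway directory tree that mimics the described layout.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out"
        (out / "first-post").mkdir(parents=True)
        (out / "first-post" / "index.mdx").write_text("---\ntitle: First\n---\n", encoding="utf-8")
        (out / "second-post").mkdir()
        total, without_mdx = summarize_output(out)
        print(f"{total} post directories, without an mdx-file: {without_mdx}")
```

On a real run, point summarize_output at the out-directory inside the cloned repository and compare the total with the number of posts in your blog.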
Step 5: Convert mdx-files to md-files
So far so good, but I was not yet 100% happy, because the mdx-files did not contain a proper level-1-heading, and Obsidian ignored the files.
To convert the mdx-files to md-files, I created a quick conversion notebook which made these final adjustments.
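The notebook itself is not reproduced here. As a rough idea of what such an adjustment can look like, here is a minimal sketch, not the actual notebook code. It assumes each mdx-file starts with a frontmatter block delimited by --- lines that contains a title field; that assumption, the function names and the sample text are all made up for illustration.

```python
from pathlib import Path
import tempfile


def mdx_to_md_text(text):
    """Insert a '# <title>' heading after the frontmatter, if a title is found."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return text  # no frontmatter: leave the text unchanged
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return text  # frontmatter never closed: leave the text unchanged
    title = None
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip() == "title":
            title = value.strip().strip("'\"")
            break
    if not title:
        return text
    body = lines[end + 1:]
    while body and not body[0].strip():
        body = body[1:]  # drop blank lines directly after the frontmatter
    return "\n".join(lines[: end + 1] + ["", f"# {title}", ""] + body) + "\n"


def convert_tree(out_dir):
    """Write an .md file next to every .mdx file below out_dir; return the count."""
    count = 0
    for mdx in sorted(Path(out_dir).rglob("*.mdx")):
        converted = mdx_to_md_text(mdx.read_text(encoding="utf-8"))
        mdx.with_suffix(".md").write_text(converted, encoding="utf-8")
        count += 1
    return count


# Made-up sample; real frontmatter may contain more fields.
SAMPLE = '---\ntitle: "Hello World"\ndate: 2018-01-01\n---\n\nFirst paragraph.\n'

if __name__ == "__main__":
    print(mdx_to_md_text(SAMPLE))
    with tempfile.TemporaryDirectory() as tmp:
        post = Path(tmp) / "hello-world"
        post.mkdir()
        (post / "index.mdx").write_text(SAMPLE, encoding="utf-8")
        print(convert_tree(tmp), "file(s) converted")
```

The sketch keeps the original mdx-files and writes the md-files next to them, so the conversion can be repeated without running the export again.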
Conclusion
Converting a Wordpress blog into Markdown turned out to be more complex than anticipated. Somehow I had expected that there would be a simple, straightforward Wordpress plugin to get this done quickly, but no…
In the process of doing the conversion, I decided to document each step in detail within this blog post. Not only did I want a reference for myself, knowing that revisiting the process even after a few weeks could be challenging without detailed notes, but I also hope this guide is useful for you reading this blog post.
Finally, preparing the data source for my RAG project is done, which turned out to be a project of its own.