How to Convert DOCX to XML Using Python

Converting a .docx file to .xml format is a common requirement for data processing, document automation, or integration with XML-based systems. Python makes this task simple with powerful libraries like python-docx and xml.etree.ElementTree. In this guide, we'll walk through the process step by step.

Prerequisites

Before we begin, ensure you have the following installed:

Python 3.6 or later
The python-docx library (for reading DOCX files)
Basic knowledge of XML structure

Install python-docx using pip:

pip install python-docx

Step 1: Reading the DOCX File

The python-docx library allows us to extract text and formatting from a .docx file. Here's how to load a document:

from docx import Document

def read_docx(file_path):
    doc = Document(file_path)
    content = []
    for paragraph in doc.paragraphs:
        content.append(paragraph.text)
    return "\n".join(content)

docx_content = read_docx("sample.docx")
print(docx_content)

Understanding the Code

Document(file_path) loads the DOCX file.
doc.paragraphs iterates through each paragraph in the document.
The text from each paragraph is stored in a list and joined into a single string.

Step 2: Converting Text to XML

Once we have the text, we can structure it into XML using Python's built-in xml.etree.ElementTree module.

import xml.etree.ElementTree as ET

def convert_to_xml(text, output_file):
    root = ET.Element("Document")
    paragraphs = text.split("\n")
    
    for idx, para in enumerate(paragraphs, start=1):
        para_element = ET.SubElement(root, "Paragraph", id=str(idx))
        para_element.text = para
    
    tree = ET.ElementTree(root)
    tree.write(output_file, encoding="utf-8", xml_declaration=True)

convert_to_xml(docx_content, "output.xml")

Explanation

ET.Element("Document") creates the root XML element.
Each paragraph is added as a child element under <Paragraph> with a unique ID.
tree.write() saves the XML structure to a file.

Step 3: Enhancing the XML Structure

For more complex documents (e.g., tables, images, or styled text), you can extend the XML structure:

def enhanced_docx_to_xml(docx_path, output_file):
    doc = Document(docx_path)
    root = ET.Element("Document")
    
    for para in doc.paragraphs:
        para_element = ET.SubElement(root, "Paragraph")
        para_element.text = para.text
        
        # Add formatting if needed (bold, italic, etc.)
        for run in para.runs:
            if run.bold:
                bold_element = ET.SubElement(para_element, "Bold")
                bold_element.text = run.text
    
    tree = ET.ElementTree(root)
    tree.write(output_file, encoding="utf-8", xml_declaration=True)

This example includes basic text formatting like bold text in the XML output.

Final Thoughts

Converting DOCX to XML in Python is straightforward with the right tools. The python-docx library simplifies reading DOCX content, while xml.etree.ElementTree helps structure the data into XML. Customise the XML schema based on your needs—whether for data exchange, archiving, or further processing.

Incoming search terms
- How to convert DOCX to XML using Python
- Best Python library for DOCX to XML conversion
- Extract text from Word document and save as XML
- Python script to convert DOCX to structured XML
- Automate Word to XML conversion with python-docx
- How to parse DOCX files in Python for XML output
- Convert Microsoft Word documents to XML format
- Python code for DOCX to XML transformation
- Handling DOCX files with Python for XML export
- Step-by-step guide for DOCX to XML conversion

Hexa Python - Solve problems using python