How to Convert DOCX to XML Using Python

How to Convert DOCX to XML Using Python

Converting a .docx file to .xml format is a common requirement for data processing, document automation, or integration with XML-based systems. Python makes this task simple with powerful libraries like python-docx and xml.etree.ElementTree. In this guide, we'll walk through the process step by step.

Prerequisites

Before we begin, ensure you have the following installed:

  • Python 3.6 or later
  • The python-docx library (for reading DOCX files)
  • Basic knowledge of XML structure

Install python-docx using pip:

pip install python-docx

Step 1: Reading the DOCX File

The python-docx library allows us to extract text and formatting from a .docx file. Here's how to load a document:

from docx import Document

def read_docx(file_path):
    doc = Document(file_path)
    content = []
    for paragraph in doc.paragraphs:
        content.append(paragraph.text)
    return "\n".join(content)

docx_content = read_docx("sample.docx")
print(docx_content)

Understanding the Code

  • Document(file_path) loads the DOCX file.
  • doc.paragraphs iterates through each paragraph in the document.
  • The text from each paragraph is stored in a list and joined into a single string.

Step 2: Converting Text to XML

Once we have the text, we can structure it into XML using Python's built-in xml.etree.ElementTree module.

import xml.etree.ElementTree as ET

def convert_to_xml(text, output_file):
    root = ET.Element("Document")
    paragraphs = text.split("\n")
    
    for idx, para in enumerate(paragraphs, start=1):
        para_element = ET.SubElement(root, "Paragraph", id=str(idx))
        para_element.text = para
    
    tree = ET.ElementTree(root)
    tree.write(output_file, encoding="utf-8", xml_declaration=True)

convert_to_xml(docx_content, "output.xml")

Explanation

  • ET.Element("Document") creates the root XML element.
  • Each paragraph is added as a child element under <Paragraph> with a unique ID.
  • tree.write() saves the XML structure to a file.

Step 3: Enhancing the XML Structure

For more complex documents (e.g., tables, images, or styled text), you can extend the XML structure:

def enhanced_docx_to_xml(docx_path, output_file):
    doc = Document(docx_path)
    root = ET.Element("Document")
    
    for para in doc.paragraphs:
        para_element = ET.SubElement(root, "Paragraph")
        para_element.text = para.text
        
        # Add formatting if needed (bold, italic, etc.)
        for run in para.runs:
            if run.bold:
                bold_element = ET.SubElement(para_element, "Bold")
                bold_element.text = run.text
    
    tree = ET.ElementTree(root)
    tree.write(output_file, encoding="utf-8", xml_declaration=True)

This example includes basic text formatting like bold text in the XML output.


Final Thoughts

Converting DOCX to XML in Python is straightforward with the right tools. The python-docx library simplifies reading DOCX content, while xml.etree.ElementTree helps structure the data into XML. Customise the XML schema based on your needs—whether for data exchange, archiving, or further processing.

Keywords: DOCX to XML Python, convert Word to XML, python-docx library, XML parsing in Python, document automation, Python XML conversion, extract text from DOCX, structured data conversion.

Incoming search terms
- How to convert DOCX to XML using Python
- Best Python library for DOCX to XML conversion
- Extract text from Word document and save as XML
- Python script to convert DOCX to structured XML
- Automate Word to XML conversion with python-docx
- How to parse DOCX files in Python for XML output
- Convert Microsoft Word documents to XML format
- Python code for DOCX to XML transformation
- Handling DOCX files with Python for XML export
- Step-by-step guide for DOCX to XML conversion

No comments:

Post a Comment