How to Convert DOCX to XML Using Python
Converting a .docx
file to .xml
format is a common requirement for data processing, document automation, or integration with XML-based systems. Python makes this task simple with powerful libraries like python-docx
and xml.etree.ElementTree
. In this guide, we'll walk through the process step by step.
Prerequisites
Before we begin, ensure you have the following installed:
- Python 3.6 or later
- The
python-docx
library (for reading DOCX files) - Basic knowledge of XML structure
Install python-docx
using pip:
pip install python-docx
Step 1: Reading the DOCX File
The python-docx
library allows us to extract text and formatting from a .docx
file. Here's how to load a document:
from docx import Document
def read_docx(file_path):
doc = Document(file_path)
content = []
for paragraph in doc.paragraphs:
content.append(paragraph.text)
return "\n".join(content)
docx_content = read_docx("sample.docx")
print(docx_content)
Understanding the Code
Document(file_path)
loads the DOCX file.doc.paragraphs
iterates through each paragraph in the document.- The text from each paragraph is stored in a list and joined into a single string.
Step 2: Converting Text to XML
Once we have the text, we can structure it into XML using Python's built-in xml.etree.ElementTree
module.
import xml.etree.ElementTree as ET
def convert_to_xml(text, output_file):
root = ET.Element("Document")
paragraphs = text.split("\n")
for idx, para in enumerate(paragraphs, start=1):
para_element = ET.SubElement(root, "Paragraph", id=str(idx))
para_element.text = para
tree = ET.ElementTree(root)
tree.write(output_file, encoding="utf-8", xml_declaration=True)
convert_to_xml(docx_content, "output.xml")
Explanation
ET.Element("Document")
creates the root XML element.- Each paragraph is added as a child element under
<Paragraph>
with a unique ID. tree.write()
saves the XML structure to a file.
Step 3: Enhancing the XML Structure
For more complex documents (e.g., tables, images, or styled text), you can extend the XML structure:
def enhanced_docx_to_xml(docx_path, output_file):
doc = Document(docx_path)
root = ET.Element("Document")
for para in doc.paragraphs:
para_element = ET.SubElement(root, "Paragraph")
para_element.text = para.text
# Add formatting if needed (bold, italic, etc.)
for run in para.runs:
if run.bold:
bold_element = ET.SubElement(para_element, "Bold")
bold_element.text = run.text
tree = ET.ElementTree(root)
tree.write(output_file, encoding="utf-8", xml_declaration=True)
This example includes basic text formatting like bold text in the XML output.
Final Thoughts
Converting DOCX to XML in Python is straightforward with the right tools. The python-docx
library simplifies reading DOCX content, while xml.etree.ElementTree
helps structure the data into XML. Customise the XML schema based on your needs—whether for data exchange, archiving, or further processing.
- How to convert DOCX to XML using Python
- Best Python library for DOCX to XML conversion
- Extract text from Word document and save as XML
- Python script to convert DOCX to structured XML
- Automate Word to XML conversion with python-docx
- How to parse DOCX files in Python for XML output
- Convert Microsoft Word documents to XML format
- Python code for DOCX to XML transformation
- Handling DOCX files with Python for XML export
- Step-by-step guide for DOCX to XML conversion
No comments:
Post a Comment