How to Convert a DOC File to DOCX Using Python
Microsoft Word documents come in two primary formats: .doc
(older format) and .docx
(modern XML-based format). If you're working with Python and need to convert legacy DOC files to DOCX programmatically, this guide will walk you through the process using the most popular Python libraries.
Why Convert DOC to DOCX?
The DOCX format offers several advantages:
- Smaller file size due to XML compression.
- Better compatibility with modern applications.
- Easier parsing since DOCX is structured as a ZIP archive.
Prerequisites
Before proceeding, ensure you have the following installed:
- Python 3.6 or later
- The
python-docx
library (for DOCX handling) - The
pywin32
library (for Windows-based DOC conversion)
Install the required packages using pip:
pip install python-docx pywin32
Method 1: Using pywin32
(Windows Only)
This method leverages Microsoft Word's built-in conversion capabilities via COM automation.
Step-by-Step Code
import win32com.client as win32
def convert_doc_to_docx(input_path, output_path):
word = win32.gencache.EnsureDispatch('Word.Application')
doc = word.Documents.Open(input_path)
doc.SaveAs(output_path, FileFormat=16) # 16 represents DOCX format
doc.Close()
word.Quit()
# Example usage
convert_doc_to_docx("input.doc", "output.docx")
Important Notes
- This method requires Microsoft Word to be installed.
- Works only on Windows systems.
- Ensure no Word instance is running before executing the script.
Method 2: Using LibreOffice (Cross-Platform)
For non-Windows systems, you can use LibreOffice in headless mode.
Installation
First, install LibreOffice:
- Ubuntu/Debian:
sudo apt install libreoffice
- MacOS:
brew install libreoffice
Conversion Code
import subprocess
def convert_doc_to_docx(input_path, output_path):
subprocess.run([
"libreoffice",
"--headless",
"--convert-to",
"docx",
input_path,
"--outdir",
output_path
])
# Example usage
convert_doc_to_docx("input.doc", ".") # Outputs to current directory
Handling Batch Conversions
To convert multiple files at once:
import os
def batch_convert_doc_to_docx(input_folder, output_folder):
for filename in os.listdir(input_folder):
if filename.endswith(".doc"):
input_path = os.path.join(input_folder, filename)
output_path = os.path.join(output_folder, f"{os.path.splitext(filename)[0]}.docx")
convert_doc_to_docx(input_path, output_path)
Troubleshooting
- Permission Errors: Ensure the script has write access to the output directory.
- File Corruption: Verify the input DOC file isn't damaged.
- COM Errors: Restart your system if Word automation fails.
- How to convert DOC to DOCX using Python script
- Best Python library for DOCX conversion
- Convert legacy Word files to DOCX programmatically
- Python code for batch DOC to DOCX conversion
- Windows COM automation for Word file conversion
- Cross-platform DOC to DOCX conversion in Python
- How to use LibreOffice with Python for file conversion
- Automate Microsoft Word file conversion using Python
- Convert multiple DOC files to DOCX at once
- Troubleshooting DOC to DOCX conversion errors
- Python script to update old Word documents to new format
- How to handle DOCX files in Python projects
No comments:
Post a Comment