How to Convert DOCX to TXT Using Python?
Converting a .docx file to a plain text (.txt) format is a common task in document processing. Python makes this easy with the help of powerful libraries. In this guide, we'll use the python-docx module, one of the most popular tools for working with Word documents in Python.
Prerequisites
Before we begin, ensure you have Python installed on your system. You'll also need to install the python-docx library. Run the following command in your terminal or command prompt:
pip install python-docx
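Note that the package is installed as python-docx but imported as docx, which is easy to mix up. A minimal sketch for checking that the install worked, using only the standard library (the helper name is_docx_available is hypothetical, not part of python-docx):

```python
import importlib.util


def is_docx_available():
    """Return True if the `docx` import name (provided by python-docx) can be found."""
    # find_spec returns None when no top-level module called "docx" is installed.
    return importlib.util.find_spec("docx") is not None


if is_docx_available():
    print("python-docx is installed.")
else:
    print("python-docx is missing; run: pip install python-docx")
```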
Step-by-Step Conversion Process
Step 1: Import the Required Module
First, we need to import the Document class from the docx module to read the DOCX file.
from docx import Document
Step 2: Load the DOCX File
Use the Document class to load the Word file. Replace "input.docx" with your file path.
doc = Document("input.docx")
Step 3: Extract Text from the Document
Iterate through each paragraph in the document and collect the text.
text = []
for paragraph in doc.paragraphs:
    text.append(paragraph.text)
full_text = "\n".join(text)
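One limitation worth knowing: doc.paragraphs covers the body paragraphs only, so text inside tables is not collected by the loop above. The following is a rough sketch of one way to include it, assuming the document has already been loaded with Document(...). The helper name collect_text is hypothetical, and table text is appended after the paragraphs rather than in its original position.

```python
def collect_text(doc):
    """Return paragraph text followed by table cell text from a loaded document.

    `doc` is expected to be the object returned by docx.Document(path).
    """
    lines = [paragraph.text for paragraph in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Join the cells of one row with tabs so columns stay distinguishable.
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)


# Usage, assuming python-docx is installed and input.docx exists:
#   from docx import Document
#   full_text = collect_text(Document("input.docx"))
```

Joining cells with tabs is only one possible choice; a different separator may suit documents whose cells contain tabs themselves.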
Step 4: Save the Text to a TXT File
Finally, write the extracted text to a .txt file.
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(full_text)
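If you would rather not hard-code "output.txt", the output name can be derived from the input name. Below is a small sketch using pathlib from the standard library; the helper names txt_path_for and save_text are made up for illustration.

```python
from pathlib import Path


def txt_path_for(docx_path):
    """Return the path of a .txt file next to the given .docx file."""
    return Path(docx_path).with_suffix(".txt")


def save_text(full_text, docx_path):
    """Write full_text as UTF-8 next to the source document and return the path."""
    out_path = txt_path_for(docx_path)
    out_path.write_text(full_text, encoding="utf-8")
    return out_path
```

For example, save_text(full_text, "reports/input.docx") would write to reports/input.txt, assuming the reports folder exists.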
Complete Code Example
Here’s the full script to convert a DOCX file to TXT:
from docx import Document

def docx_to_txt(input_file, output_file):
    doc = Document(input_file)
    text = [paragraph.text for paragraph in doc.paragraphs]
    full_text = "\n".join(text)
    with open(output_file, "w", encoding="utf-8") as file:
        file.write(full_text)
    print(f"Successfully converted {input_file} to {output_file}")

# Usage
docx_to_txt("input.docx", "output.txt")
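The same function can be reused for a whole folder of documents. Here is a possible sketch, assuming every .docx file in the folder should get a .txt file with the same name beside it. The helper convert_folder is hypothetical and takes the conversion function as a parameter, so it does not depend on python-docx itself.

```python
from pathlib import Path


def convert_folder(folder, convert):
    """Call convert(input_path, output_path) for every .docx file in folder.

    `convert` is any function with the same signature as docx_to_txt above.
    Returns the list of output paths, in sorted order.
    """
    outputs = []
    for docx_path in sorted(Path(folder).glob("*.docx")):
        # Word creates temporary lock files such as "~$report.docx"; skip them.
        if docx_path.name.startswith("~$"):
            continue
        txt_path = docx_path.with_suffix(".txt")
        convert(str(docx_path), str(txt_path))
        outputs.append(txt_path)
    return outputs


# Usage: convert_folder("documents", docx_to_txt)
```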
Handling Errors
To make the script more robust, add error handling for cases where the file doesn’t exist or isn’t a valid DOCX file.
import os
from docx import Document
from docx.opc.exceptions import PackageNotFoundError

def docx_to_txt(input_file, output_file):
    # python-docx raises PackageNotFoundError for a missing path as well as
    # for a file that is not a valid package, so check for existence first.
    if not os.path.isfile(input_file):
        print("Error: The input file was not found.")
        return
    try:
        doc = Document(input_file)
    except PackageNotFoundError:
        print("Error: The file is not a valid DOCX document.")
        return
    text = [paragraph.text for paragraph in doc.paragraphs]
    full_text = "\n".join(text)
    with open(output_file, "w", encoding="utf-8") as file:
        file.write(full_text)
    print(f"Successfully converted {input_file} to {output_file}")
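A .docx file is a ZIP package, so a cheap pre-check is possible with the standard library before handing the file to python-docx. The sketch below is a heuristic, not a full validation: it assumes the main document part uses its conventional name, word/document.xml, and the helper name looks_like_docx is hypothetical.

```python
import os
import zipfile


def looks_like_docx(path):
    """Return True if path is a ZIP archive containing word/document.xml."""
    if not os.path.isfile(path):
        return False
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main document part uses its conventional name.
        return "word/document.xml" in archive.namelist()
```

A check like this can give a clearer message for files that merely carry a .docx extension, such as a renamed text file.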