How to Convert DOCX to TXT Using Python

Converting a .docx file to plain text (.txt) is a common task in document processing. In this guide, we'll use python-docx, a widely used library for reading and writing Word documents in Python. It works with the .docx format only, not the older binary .doc format.

Prerequisites

Before we begin, ensure you have Python installed on your system. You'll also need to install the python-docx library. Run the following command in your terminal or command prompt:

pip install python-docx
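
The package is installed as python-docx but imported as docx, which is easy to mix up. As a quick check after installing, the sketch below uses only the standard library to see whether a module named docx can be found; the helper name python_docx_available is hypothetical and used here only for illustration.

```python
import importlib.util


def python_docx_available():
    """Return True if a module named 'docx' can be found by the import system."""
    return importlib.util.find_spec("docx") is not None


# Prints True when the install step above succeeded in this environment.
print("docx module importable:", python_docx_available())
```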

Step-by-Step Conversion Process

Step 1: Import the Required Module

First, we need to import the Document class from the docx module to read the DOCX file.

from docx import Document

Step 2: Load the DOCX File

Use the Document class to load the Word file. Replace "input.docx" with your file path.

doc = Document("input.docx")

Step 3: Extract Text from the Document

Iterate through each paragraph in the document body and collect its text. Note that doc.paragraphs does not include text inside tables, headers, or footers.

text = []
for paragraph in doc.paragraphs:
    text.append(paragraph.text)
full_text = "\n".join(text)
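
If the document also contains tables, their cell text can be gathered through doc.tables. The following is a minimal sketch, assuming that tab-separated cells are acceptable in the output and that it is fine for table rows to appear after all paragraphs rather than in their original position. The helper name collect_text is hypothetical and not part of python-docx.

```python
def collect_text(doc):
    """Return body paragraphs followed by table rows as one string.

    `doc` is expected to be a loaded python-docx Document.
    """
    lines = [paragraph.text for paragraph in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            # Join the cells of each row with tabs so columns stay distinguishable.
            lines.append("\t".join(cell.text for cell in row.cells))
    return "\n".join(lines)


# Usage (requires python-docx):
#   from docx import Document
#   full_text = collect_text(Document("input.docx"))
```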

Step 4: Save the Text to a TXT File

Finally, write the extracted text to a .txt file.

with open("output.txt", "w", encoding="utf-8") as file:
    file.write(full_text)
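
Word documents often use empty paragraphs for spacing, and each of them becomes an empty line in the output. If that is not wanted, the text can be tidied before it is written. This is an optional sketch, assuming that collapsing repeated blank lines and trimming trailing whitespace suits your data; tidy_text is a hypothetical helper name.

```python
def tidy_text(paragraph_texts):
    """Strip trailing whitespace and collapse runs of empty paragraphs."""
    cleaned = []
    for text in paragraph_texts:
        line = text.rstrip()
        # Skip an empty line at the start or directly after another empty line.
        if not line and (not cleaned or not cleaned[-1]):
            continue
        cleaned.append(line)
    # Remove a trailing empty line left over from the end of the document.
    while cleaned and not cleaned[-1]:
        cleaned.pop()
    return "\n".join(cleaned)


# Usage with the list built in Step 3:
#   full_text = tidy_text(text)
```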

Complete Code Example

Here’s the full script to convert a DOCX file to TXT:

from docx import Document

def docx_to_txt(input_file, output_file):
    doc = Document(input_file)
    text = [paragraph.text for paragraph in doc.paragraphs]
    full_text = "\n".join(text)
    
    with open(output_file, "w", encoding="utf-8") as file:
        file.write(full_text)
    print(f"Successfully converted {input_file} to {output_file}")

# Usage
docx_to_txt("input.docx", "output.txt")
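
To convert every document in a folder, the function above can be called in a loop. The sketch below is one possible way to do this, assuming the output files should sit next to the inputs with a .txt extension. The name convert_folder and the "reports" folder are hypothetical; the converter is passed in as an argument so that any function taking an input path and an output path can be used.

```python
from pathlib import Path


def convert_folder(folder, convert):
    """Call convert(src, dst) for each .docx file in folder; return the output paths."""
    outputs = []
    for src in sorted(Path(folder).glob("*.docx")):
        # Word keeps temporary lock files named "~$..." while a document is open.
        if src.name.startswith("~$"):
            continue
        dst = src.with_suffix(".txt")
        convert(str(src), str(dst))
        outputs.append(dst)
    return outputs


# Usage with the function defined above:
#   convert_folder("reports", docx_to_txt)
```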

Handling Errors

To make the script more robust, add error handling for cases where the file doesn’t exist or isn’t a valid DOCX file.

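import os

from docx import Document
from docx.opc.exceptions import PackageNotFoundError

def docx_to_txt(input_file, output_file):
    # python-docx reports a missing path with PackageNotFoundError as well,
    # so check that the input file exists before trying to open it.
    if not os.path.isfile(input_file):
        print("Error: The input file was not found.")
        return

    try:
        doc = Document(input_file)
        text = [paragraph.text for paragraph in doc.paragraphs]
        full_text = "\n".join(text)

        with open(output_file, "w", encoding="utf-8") as file:
            file.write(full_text)
        print(f"Successfully converted {input_file} to {output_file}")

    except PackageNotFoundError:
        # Raised when the file is not a ZIP-based package at all.
        print("Error: The file is not a valid DOCX document.")
    except OSError as exc:
        # Covers problems such as a missing output directory or denied access.
        print(f"Error: File access failed: {exc}")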

Summary: This guide explains how to convert DOCX files to TXT using Python and the python-docx library. The process involves reading paragraphs from the Word document and saving them into a plain text file.
