# Document to Plain Text

### Overview & The "Smart" Mechanism

The **Document to Plain Text** node turns document files into readable text. Think of it like an assistant who opens every attachment, reads it, and gives you the important words in one clean note.

Use it when the next step needs document content, not the file itself. It works well for PDFs, Word files, spreadsheets, CSVs, JSON files, text files, web pages saved as HTML, and subtitle files.

#### Smart features

* **Works with one file or many**: send a single file or a batch from an earlier step.
* **Finds the file link automatically**: it can read common file outputs without extra cleanup.
* **Handles scanned PDFs**: if a PDF is image-based, it still tries to read the visible text.
* **Keeps multi-file output organized**: each file is labeled and separated in the final result.
* **Can reuse recent results**: turn on cache when the same files run often.

<figure><img src="/files/FWVAiSfmx50S5vECqALp" alt=""><figcaption></figcaption></figure>

### Common Use Cases

* Turn customer contracts into text before sending them to an AI summary step.
* Extract policy, invoice, or report content from mixed document uploads.
* Read spreadsheet or CSV files as text before routing, checking, or comparing data.

### How to Configure

* **Input**: Select the file field from an earlier step with **@**. You can pass one file or a list of files.
* Use supported file types such as `.pdf`, `.docx`, `.xlsx`, `.xls`, `.csv`, `.json`, `.txt`, `.html`, `.htm`, and `.srt`.
* **Enable caching**: Turn this on when the same document is used often and does not change.
* **Caching time**: Enter how long, in seconds, Diaflow should reuse the saved result. `60` is a reasonable starting value.
* Run the workflow and review the extracted text before sending it to the next node.

<figure><img src="/files/caHK2wUHKTy7uTO84fVj" alt=""><figcaption></figcaption></figure>
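If the files come from an upload step you do not control, it may help to check their extensions before they reach this node. The sketch below is not part of Diaflow: `split_by_support` is a hypothetical helper you might run in an earlier code step, the extension set is copied from the list above, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import PurePosixPath

# Extensions named in the supported file types list on this page.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".xls", ".csv",
    ".json", ".txt", ".html", ".htm", ".srt",
}


def split_by_support(filenames):
    """Return (supported, unsupported) lists, preserving input order."""
    supported, unsupported = [], []
    for name in filenames:
        # Assumption: extension matching is case-insensitive.
        extension = PurePosixPath(name).suffix.lower()
        if extension in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


ok, rejected = split_by_support(
    ["signed_contract.pdf", "pricing_sheet.XLSX", "logo.png"]
)
print(ok)        # ['signed_contract.pdf', 'pricing_sheet.XLSX']
print(rejected)  # ['logo.png']
```

Passing only the `ok` list to the node keeps a single stray file from affecting the rest of the batch.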

### Before & After Example

**Before**

```
Input files:
- signed_contract.pdf
- pricing_sheet.xlsx
```

**After**

```
filename:
signed_contract.pdf
content:
This agreement starts on 1 May 2026 and renews annually.

---

filename:
pricing_sheet.xlsx
content:
Plan | Monthly fee | Seats
Pro  | 299         | 25
```
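If a later step needs each file separately, the labeled output can be split back into records. The following is a minimal sketch that assumes the output always follows the `filename:` / `content:` / `---` layout shown in the example above; `parse_multi_file_output` is a hypothetical helper, not a Diaflow function, and the exact separator may differ in practice.

```python
import re

SAMPLE_OUTPUT = """filename:
signed_contract.pdf
content:
This agreement starts on 1 May 2026 and renews annually.

---

filename:
pricing_sheet.xlsx
content:
Plan | Monthly fee | Seats
Pro  | 299         | 25
"""


def parse_multi_file_output(text):
    """Split labeled output into a list of {'filename', 'content'} dicts."""
    # Split only on a '---' line that is followed by a new 'filename:' label,
    # so a '---' inside a document body is less likely to break a record.
    blocks = re.split(r"\n\s*---\s*\n(?=\s*filename:\s*\n)", text.strip())
    records = []
    for block in blocks:
        match = re.match(
            r"filename:\s*\n(.*?)\ncontent:\s*(.*)", block.strip(), re.DOTALL
        )
        if match:
            records.append(
                {
                    "filename": match.group(1).strip(),
                    "content": match.group(2).strip(),
                }
            )
    return records


records = parse_multi_file_output(SAMPLE_OUTPUT)
print(len(records))            # 2
print(records[1]["filename"])  # pricing_sheet.xlsx
```

Blocks that do not match the expected layout are skipped rather than raising an error, so check the record count against the number of files you sent.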

### Important Warnings & Best Practices

* Very large files can slow the workflow or fail during processing. Split large PDFs or spreadsheets first.
* If one file in a batch is not supported, the whole batch stops. Test mixed file lists early.
* Older `.xls` files may lose some layout or unusual spreadsheet features. Use `.xlsx` when possible.
* `.srt` subtitle files keep their timestamps. Add a cleanup step if you only need spoken text.
* Rare text encodings in CSV or JSON files can drop special characters. Check accents, symbols, and non-English text after import.
* If a filename contains a double underscore (`__`), the displayed filename may appear shortened. Rename such files first if the next step relies on the full name.
* When cache is on for multiple files, any file change refreshes the whole batch result.
* Usage reporting for scanned PDFs may look simplified right now.
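For the `.srt` cleanup step mentioned above, a sketch along these lines could work. It assumes the extracted text keeps the standard SubRip layout (a cue number, a `start --> end` timestamp line, then the spoken text); `strip_srt_timing` is a hypothetical helper, not something the node provides.

```python
import re

# Matches a SubRip timing line such as "00:00:01,000 --> 00:00:04,000".
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def strip_srt_timing(text):
    """Return only the spoken lines from SubRip-style text."""
    lines = text.splitlines()
    kept = []
    for index, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or TIMESTAMP.match(stripped):
            continue
        # A cue number is a bare integer directly followed by a timing line.
        following = lines[index + 1].strip() if index + 1 < len(lines) else ""
        if stripped.isdigit() and TIMESTAMP.match(following):
            continue
        kept.append(stripped)
    return "\n".join(kept)


sample = """1
00:00:01,000 --> 00:00:04,000
Hello there.

2
00:00:05,000 --> 00:00:07,500
Welcome back.
"""
print(strip_srt_timing(sample))
# Hello there.
# Welcome back.
```

A spoken line that happens to be a number is kept, because it is only dropped when a timing line follows it directly.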

### Need help?

* Learn the basics in [How a node works](/getting-started/lets-start-with-the-basics/how-a-node-works.md)
* Build the full flow in [Create a workflow](/workflow-builder/create-a-workflow.md)
* Browse related nodes in [Component List](/workflow-builder/component-list.md)
