# Web Scraper

### Overview & The "Smart" Mechanism

The **Web Scraper** pulls content from one or more web pages and turns it into a format your workflow can use. Think of it as a research assistant that visits a page, copies what matters, and hands it back as plain text, HTML, Markdown, or a link to an audio or PDF file.

It is especially useful when you want fresh website content inside Diaflow without copying and pasting by hand. You can feed the result into AI, reports, alerts, summaries, or downstream actions.

#### Smart features

* **Multiple output formats** : extract content as plain text, HTML, Markdown, audio, or PDF depending on what the next step needs.
* **Automatic fallback when a page is hard to read** : the node first tries a fast page-reading method, then switches to browser-style rendering if needed.
* **Better handling for dynamic websites** : if a page relies on scripts or special rendering, the node can still try to capture the visible page content.
* **Custom headers support** : add authorization or cookie headers when a page requires them.
* **Multi-page scraping for text modes** : collect content from several URLs in one step when working with text-based output.

<figure><img src="/files/o8Az5U8rLv8SsJuT4Ljm" alt=""><figcaption></figcaption></figure>
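
The automatic fallback listed above can be pictured as a simple try-then-retry pattern. The sketch below is illustrative only: `scrape_with_fallback`, `fast_fetch`, and `rendered_fetch` are hypothetical names rather than Diaflow internals, and the rule used to decide that a page is "hard to read" (an error or empty content) is an assumption.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]


def scrape_with_fallback(url: str, fast_fetch: Fetcher, rendered_fetch: Fetcher) -> str:
    """Try a fast reader first, then fall back to browser-style rendering."""
    try:
        content = fast_fetch(url)
    except Exception:
        # Assumption: any failure of the fast reader triggers the fallback.
        content = None

    if content and content.strip():
        return content

    # Hypothetical slower path that renders the page before reading it.
    rendered = rendered_fetch(url)
    return rendered or ""


# Stub fetchers so the example runs without network access.
result = scrape_with_fallback(
    "https://example.com/app",
    fast_fetch=lambda url: "",  # the fast reader found nothing
    rendered_fetch=lambda url: "Visible page text",
)
print(result)  # prints: Visible page text
```

Passing the two fetchers in as arguments keeps the sketch independent of any particular HTTP or browser library.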

### Common Use Cases

* Pull article content from industry websites and send it to an AI node for summarization.
* Capture product pages, help center articles, or policy pages and save the content into reports or knowledge workflows.
* Extract audio from a public video or download a public PDF for transcription, review, or document processing.

### How to Configure

* **URL(s)** : Enter one website address or several.
  * Use one URL when you need a single page.
  * Use multiple URLs when you want to combine text content from several pages.
  * You can also pull the URL from an earlier step using **@**.
* **Content output format** : Select the result type you want:
  * **Plaintext** for summaries, search, AI prompts, and simple analysis.
  * **HTML** if you need the page structure.
  * **Markdown** if you want cleaner formatting for notes, docs, or AI inputs.
  * **Audio** if you want an audio file link from a supported media page.
  * **PDF** if you want a downloadable PDF link.
* **Custom Headers** : Add headers only if the website requires access details, such as authorization or cookies.
* **Enable Cache** : Leave this on if the same page is used often and does not change frequently.
* **Caching time** : Set how long Diaflow should reuse the saved result before checking the page again.
* Run the workflow and inspect the returned content or file link before sending it to the next node.

<figure><img src="/files/ubmi4SyAqrj2QxwUmGZe" alt=""><figcaption></figcaption></figure>
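
To make **Enable Cache** and **Caching time** more concrete, here is a minimal sketch of a time-based cache: a saved result is reused until it is older than the configured time, then the page is fetched again. `ScrapeCache`, `get_or_fetch`, and keying the cache by URL alone are assumptions made for illustration; this page does not describe how Diaflow stores cached results.

```python
import time
from typing import Callable, Dict, Tuple


class ScrapeCache:
    """Reuse a saved result until it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        # Maps URL -> (time the result was saved, saved content).
        self._store: Dict[str, Tuple[float, str]] = {}

    def get_or_fetch(self, url: str, fetch: Callable[[str], str]) -> str:
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]  # still fresh: reuse the saved result
        content = fetch(url)  # missing or expired: read the page again
        self._store[url] = (now, content)
        return content


# Hypothetical usage: reuse results for 10 minutes.
cache = ScrapeCache(ttl_seconds=600)
page = cache.get_or_fetch("https://example.com/pricing", lambda url: "Pricing page text")
```

A short caching time suits pages that change often, and a longer one suits stable pages such as policies or help articles.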

### Before & After Example

**Before** (node settings)

```
URL: https://example.com/blog/quarterly-market-update
Output format: Plaintext
```

**After** (returned content)

```
Quarterly Market Update

Demand increased across enterprise software categories in Q2...
Key regions included North America, Europe, and Southeast Asia...
```

### Important Warnings & Best Practices

* **Audio** and **PDF** modes use only the first URL you provide. Any extra URLs in the list are ignored.
* When you scrape multiple URLs in text mode, the node combines everything into one long result. If you need page-by-page separation, scrape them one at a time.
* Very large pages can slow down the workflow or return more content than you need. Start with focused pages instead of full sites.
* Slow or unresponsive websites can hold up the workflow for longer than expected.
* If you try to extract plain text from a PDF link, the result may come back empty. Use **PDF** mode for PDF files.
* Use custom headers carefully. Authorization tokens and cookies are credentials, so add them only when the site requires access details.
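
The first two warnings can be summarized in a short sketch: file modes use only the first URL, text modes use all of them, and scraping one URL at a time keeps results separated by page. The function names and lowercase mode strings below are hypothetical and chosen for illustration; they are not part of Diaflow.

```python
from typing import Callable, Dict, List

TEXT_MODES = {"plaintext", "html", "markdown"}
FILE_MODES = {"audio", "pdf"}


def urls_used(mode: str, urls: List[str]) -> List[str]:
    """Return the URLs a given output mode would actually read."""
    mode = mode.lower()
    if mode in FILE_MODES:
        return urls[:1]  # only the first URL; the rest are ignored
    if mode in TEXT_MODES:
        return list(urls)  # every URL contributes to one combined result
    raise ValueError(f"Unknown output mode: {mode}")


def scrape_separately(urls: List[str], scrape_one: Callable[[str], str]) -> Dict[str, str]:
    """Scrape one URL at a time so each page keeps its own result."""
    return {url: scrape_one(url) for url in urls}


urls = ["https://example.com/a", "https://example.com/b"]
print(urls_used("pdf", urls))       # prints: ['https://example.com/a']
print(urls_used("markdown", urls))  # prints: ['https://example.com/a', 'https://example.com/b']
```

In a workflow, the equivalent of `scrape_separately` is to run the node once per URL instead of passing the whole list in one step.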

### Need help?

* Learn the basics in [How a node works](/getting-started/lets-start-with-the-basics/how-a-node-works.md)
* Build the full flow in [Create a workflow](/workflow-builder/create-a-workflow.md)
* Browse related nodes in [Component List](/workflow-builder/component-list.md)
