Working with URL documents

When you add a URL document to the Knowledge Hub, Louis fetches the page, extracts its content, and indexes it for retrieval. This page covers the four levers you can use to get a clean, complete, and fresh extraction.

| Lever | What it does | Where it lives |
| --- | --- | --- |
| Base scraping | Pulls the page HTML; works out of the box for static pages | Automatic |
| CSS selector exclusion | Strips noise (headers, footers, cookie banners) before indexing | Advanced > Web Scraping (global) + per-URL override |
| JavaScript rendering | Loads the page in a real browser so dynamic content is captured | Per-document, in Add document |
| Refresh | Keeps URL content up to date | Per-document (Synchronize) + scheduled (Advanced) |

How URL scraping works (default)

By default, Louis fetches the page's raw HTML response and extracts the textual content from it. This works well for:

  • Static sites (server-rendered HTML, CMS-published pages)
  • Help-center articles, FAQ pages, blog posts
  • Public documentation

After scraping, the extracted text is split into chunks and indexed. You can see the chunks Louis stored from any document in the document Details modal — see Verify what was scraped below.
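To build intuition for what base scraping can and cannot see, here is a minimal sketch using only Python's standard library. It is not Louis's actual extractor; the `TextExtractor` class, `extract_text` helper, and both sample pages are made up for illustration. It shows that a server-rendered page yields text from the raw HTML, while an empty application shell yields nothing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def extract_text(html: str) -> str:
    """Return the text found in the raw HTML, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical sample pages, for illustration only.
static_page = "<html><body><h1>Returns</h1><p>Refunds take 5 days.</p></body></html>"
spa_shell = "<html><body><div id='root'></div><script>loadApp()</script></body></html>"

print(extract_text(static_page))      # text is present in the raw HTML
print(repr(extract_text(spa_shell)))  # nothing to extract without running the script
```

The second page is the kind of case that the JavaScript rendering section of this page addresses.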

Cleaning up scraped content

Most web pages contain elements you don't want in your knowledge base: navigation menus, headers, footers, cookie banners, ad slots, "related articles" widgets. If you don't strip them out, Louis will index them and may quote them in answers.

You strip them out by listing CSS selectors that should be excluded before content is captured.

📘

Example

You scrape a help-center article. Without exclusions, Louis ingests the sidebar menu, cookie banner, and footer links alongside the article body. With .sidebar, footer, and .cookie-banner selectors, only the article content is stored.
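The effect described in this example can be sketched in a few lines of standard-library Python. This is only an illustration under simplifying assumptions: it handles bare tag, `.class`, and `#id` selectors, expects well-formed HTML, and is not how Louis implements exclusion. The `ExcludingExtractor` class, `extract` helper, and sample page are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def matches(selector, tag, attrs):
    """Match a simple selector: 'tag', '.class', or '#id' only."""
    attrs = dict(attrs)
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return tag == selector


class ExcludingExtractor(HTMLParser):
    """Collects text, ignoring everything inside excluded elements."""

    def __init__(self, excluded):
        super().__init__()
        self.excluded = excluded
        self.parts = []
        self._skip_depth = 0  # > 0 while inside an excluded element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag
        if self._skip_depth or any(matches(s, tag, attrs) for s in self.excluded):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def extract(html, excluded=()):
    parser = ExcludingExtractor(list(excluded))
    parser.feed(html)
    parser.close()
    return parser.parts


# Hypothetical help-center page, for illustration only.
page = """
<body>
  <div class="sidebar"><a href="/">Home</a><a href="/faq">FAQ</a></div>
  <article><h1>Reset your password</h1><p>Open Settings, then Security.</p></article>
  <div class="cookie-banner">We use cookies.</div>
  <footer>Legal notice</footer>
</body>
"""

print(extract(page))                                            # noise included
print(extract(page, [".sidebar", "footer", ".cookie-banner"]))  # article only
```

With the three selectors applied, only the two article fragments remain.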

Global defaults — Advanced > Web Scraping

Global defaults apply to every URL you add to the Knowledge Hub. Set them once, get clean scraping for free on every new URL.


To configure:

  1. Go to Advanced > Web Scraping.
  2. Add selectors in the Default CSS selectors to exclude field.
  3. Separate multiple selectors with commas or spaces.

Start with common noise elements: header, footer, .sidebar, .navigation, .cookie-banner.
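If you maintain your selector list outside the product (for example in a shared document), a small helper can normalize it before you paste it into the field. The sketch below assumes, as step 3 states, that commas and spaces are both treated as separators; `parse_selectors` is a hypothetical helper, not part of Louis.

```python
import re


def parse_selectors(field: str) -> list[str]:
    """Split on commas and/or whitespace, drop empties, keep first occurrence."""
    seen = []
    for token in re.split(r"[,\s]+", field.strip()):
        if token and token not in seen:
            seen.append(token)
    return seen


raw = "header, footer .sidebar,.navigation  .cookie-banner, footer"
print(parse_selectors(raw))
print(", ".join(parse_selectors(raw)))  # tidy, comma-separated form to paste
```

One consequence of this assumption: a descendant selector written with a space, such as `nav ul`, would be read as two separate selectors, so single class, ID, or tag selectors are the safer choice.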

Per-URL overrides

Some pages need different rules. A product page might have a useful sidebar that your help-center pages don't have. Override the global defaults at the time you add the URL.


To customize:

  1. Click Add Documents > Public URL(s).
  2. Expand the Options section.
  3. Edit the CSS selectors field.
  4. Paste your URLs.

These overrides only affect the current import. Your global defaults remain unchanged.

Common selectors to start with

Start with these patterns and refine based on your target sites:

| Selector | What it strips |
| --- | --- |
| header, footer | Site-wide chrome |
| .sidebar, #sidebar | Auxiliary content panels |
| .nav, .navigation, .menu | Navigation elements |
| .cookie-banner, .modal, .popup | Overlays |
| .ad, .advertisement, .promo | Marketing content |
| .social-share, .comments | Engagement widgets |

🔅

Finding selectors

Right-click an element in your browser → Inspect, then note its class name or ID. Prefix a class name with a dot (.sidebar) and an ID with a hash (#sidebar) to turn it into a selector.

Verify what was scraped

Once a URL document is scraped, you can review exactly which selectors were applied and how the content was chunked.


To access:

  1. Open the document from the Knowledge Hub.

  2. The Details modal shows two key sections:

    • Excluded CSS Selectors — the exact selectors used during scraping. Copy them to reuse on similar documents, or use them to troubleshoot.
    • Chunk preview — how the document was split for AI search.

⚠️

Chunk quality matters

  • Chunks too large: mixed topics create noisy embeddings → reduced retrieval precision.
  • Chunks too small: context is fragmented → incomplete answers.

Use the chunk preview to validate that content boundaries make semantic sense.
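If you copy chunk text out of the preview, a quick word count can help spot outliers before you read each chunk in detail. The sketch below is illustrative only: `review_chunks` is a hypothetical helper, and the 30 and 400 word thresholds are arbitrary assumptions, not values used by Louis.

```python
def review_chunks(chunks, min_words=30, max_words=400):
    """Label each chunk by word count; thresholds are illustrative only."""
    report = []
    for index, chunk in enumerate(chunks):
        words = len(chunk.split())
        if words < min_words:
            label = "too small"
        elif words > max_words:
            label = "too large"
        else:
            label = "ok"
        report.append((index, words, label))
    return report


# Placeholder chunks standing in for text copied from the chunk preview.
chunks = ["Contact us.", "word " * 120, "word " * 900]
for row in review_chunks(chunks):
    print(row)
```

Word count is only a first filter. A chunk of reasonable length can still mix unrelated topics, so reading the flagged and borderline chunks remains the real check.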

JavaScript rendering — for dynamic pages

By default, Louis extracts content directly from the page's HTML response. For modern web apps where content is loaded after the initial HTML response, base scraping returns an empty or incomplete page.

Enable JavaScript rendering when:

  • The page is a single-page application (React, Vue, Angular).
  • Content is lazy-loaded as the user scrolls.
  • Content is hidden behind user interactions (click to expand, "Show more" buttons).
  • A specific element must be visible before the page is meaningful (e.g. a price loaded from an API).

⚠️

Performance impact

JavaScript rendering is slower than direct HTML extraction. Only enable it for documents that actually need it.

Configuration

In the Add document modal, toggle Enable JS rendering to reveal the options.


Rendering delay

Time the headless browser waits before extracting content, in milliseconds.

  • Range: 0 to 25,000 ms.
  • Use for: pages that need a fixed delay to load content (e.g. analytics scripts, third-party widgets).

Wait for selector

A CSS or XPath selector that must be present in the DOM before extraction starts.

  • Use for: pages that load content asynchronously, where a specific element signals that the content is ready.
  • Example: .product-price to wait until the price element appears.

Auto scroll

When enabled, the headless browser scrolls to the bottom of the page before extracting content.

  • Use for: pages with lazy-loaded content that only renders as the user scrolls (image grids, infinite-scroll lists).
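The three options above (rendering delay, wait for selector, auto scroll) can be thought of as one small settings record. The sketch below is a hypothetical model for checking values before you enter them; `JsRenderingOptions` and its field names are assumptions made for illustration and do not correspond to a Louis API. Only the 0 to 25,000 ms range comes from this page.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DELAY_MS = 25_000  # upper bound of the rendering delay range


@dataclass
class JsRenderingOptions:  # hypothetical container, not a Louis API
    rendering_delay_ms: int = 0
    wait_for_selector: Optional[str] = None
    auto_scroll: bool = False

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the values look valid."""
        issues = []
        if not 0 <= self.rendering_delay_ms <= MAX_DELAY_MS:
            issues.append(f"rendering delay must be between 0 and {MAX_DELAY_MS} ms")
        if self.wait_for_selector is not None and not self.wait_for_selector.strip():
            issues.append("wait-for selector is empty")
        return issues


print(JsRenderingOptions(rendering_delay_ms=2000, wait_for_selector=".product-price").problems())
print(JsRenderingOptions(rendering_delay_ms=30000).problems())
```

When a reliable readiness element exists, waiting for it is usually preferable to a long fixed delay, because extraction can start as soon as the element appears.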

JavaScript scenario (advanced)

Define a sequence of actions to execute before extraction. Provide a JSON array of actions:

| Action | Description |
| --- | --- |
| wait | Wait for a fixed duration (ms) |
| execute | Run arbitrary JavaScript in the page context |
| click | Click an element matching a selector |
| scroll | Scroll to a specific position or element |
| fill | Fill a form field |

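Example: click a "Show more" button every 500 ms until it disappears. The script only starts the timer and returns immediately, so the 5-second wait that follows gives the clicks time to run: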

[
  {
    "execute": {
      "script": "const clickUntilGone = setInterval(() => { const btn = document.querySelector('button[data-cy=\"more\"]'); if (btn) { btn.click(); } else { clearInterval(clickUntilGone); } }, 500);",
      "timeout": 20000
    }
  },
  {
    "wait": 5000
  }
]

Refer to the Scrapfly documentation for the complete action reference.
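Hand-escaping quotes inside the script string is the most common way to end up with invalid JSON. One way around it, sketched here in Python, is to write the JavaScript as a normal multi-line string and let `json.dumps` do the escaping. The scenario content is the same as the example above; the variable names are arbitrary.

```python
import json

# JavaScript run in the page: click the button every 500 ms until it is gone.
script = """
const clickUntilGone = setInterval(() => {
  const btn = document.querySelector('button[data-cy="more"]');
  if (btn) { btn.click(); } else { clearInterval(clickUntilGone); }
}, 500);
"""

scenario = [
    # Collapse the script to one line; json.dumps escapes the inner double quotes.
    {"execute": {"script": " ".join(script.split()), "timeout": 20000}},
    {"wait": 5000},
]

scenario_json = json.dumps(scenario, indent=2)
print(scenario_json)  # paste this output into the JavaScript scenario field
```

Collapsing whitespace with `split()` is safe for this script because none of its string literals contain runs of spaces; a script that relies on exact whitespace inside strings should be passed through unchanged.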

📘

Read-only after creation

JavaScript rendering settings are captured at the time the document is scraped. To change them, re-add the document with the new configuration.

Keeping URL content fresh

Web pages change. Refresh is how you keep Louis's knowledge in sync with the source.

One document — Synchronize

Open a URL document in the Knowledge Hub and click Synchronize in the detail drawer. Louis re-fetches the source page and re-indexes the content.

Use this when you just published a change to one specific page and want it reflected immediately.

Many documents — bulk Synchronize

Select multiple URL documents in the Knowledge Hub, then click Synchronize in the floating action bar. Up to 30 documents are re-fetched at once.

See Bulk operations for the full bulk workflow.
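Because a bulk Synchronize covers at most 30 documents, a larger set has to be worked through in several passes. The arithmetic is simple, but a sketch makes the planning concrete; the `batches` helper and the example URLs are hypothetical.

```python
def batches(items, size=30):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical list of 75 URL documents that need re-syncing.
urls = [f"https://example.com/help/article-{n}" for n in range(1, 76)]
print([len(batch) for batch in batches(urls)])  # three passes: 30, 30, 15
```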

All URLs — Refresh All Webpages


Refresh All Webpages (in Advanced settings) manually refreshes every URL-based document in the Knowledge Hub in one click. Use it after a major site-wide change.

Automatic — Refresh Rate


Set an automatic synchronization schedule for URL-based documents in Advanced > Knowledge Base > Refresh Rate. Options:

  • Daily — Louis refreshes every URL once a day.
  • Weekly — Louis refreshes every URL once a week.
  • No Scheduled Refresh — manual only.

🗒️

Pick a rhythm that matches your content

If your site updates frequently (daily news, prices that change often), pick Daily. For mostly-stable content (help center, policies), Weekly is enough. Excessive refreshing creates unnecessary load and can trigger rate limits on your source.
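As a mental model for the three Refresh Rate options, the sketch below maps each option to an interval and checks whether a document would be due. This is an assumption-laden illustration, not a description of how Louis schedules refreshes internally; `INTERVALS` and `is_due` are hypothetical names.

```python
from datetime import datetime, timedelta

# Option labels as shown in the Refresh Rate setting; None means manual only.
INTERVALS = {
    "Daily": timedelta(days=1),
    "Weekly": timedelta(weeks=1),
    "No Scheduled Refresh": None,
}


def is_due(last_refreshed, refresh_rate, now):
    """Return True when the chosen interval has elapsed since the last refresh."""
    interval = INTERVALS[refresh_rate]
    return interval is not None and now - last_refreshed >= interval


last = datetime(2024, 1, 1, 9, 0)
now = datetime(2024, 1, 3, 9, 0)  # two days later
for rate in INTERVALS:
    print(rate, is_due(last, rate, now))
```

In this model a page that changed two days ago is already refreshed under Daily, waits up to five more days under Weekly, and is never picked up without a manual Synchronize under No Scheduled Refresh.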

Selector configurations persist across refreshes

Your CSS selector exclusions stay active across refreshes. Whether you manually sync or rely on automatic updates, your cleaning rules are reapplied — no need to redo them after a refresh.