Vulnerability-Lookup

GHSA-3R75-XC34-5F44

Vulnerability from github – Published: 2026-05-21 19:28 – Updated: 2026-06-10 18:41

Summary

Crawlee for Python: SSRF via sitemap-derived URLs

Details

Overview

Vulnerability type: Blind SSRF
Affected components: src/crawlee/_utils/sitemap.py, src/crawlee/_utils/robots.py, src/crawlee/request_loaders/_sitemap_request_loader.py, and all built-in HTTP clients.
Trigger: an attacker-controlled sitemap or robots.txt containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).

Two-layer SSRF via sitemap-derived URLs:

1) Cross-host HTTP SSRF

Base case, affects every HTTP client. Sitemap entries and robots.txt Sitemap: directives were accepted regardless of the host they pointed to. A sitemap on example.com could push http://internal.corp/admin into the crawler's queue, and the configured HTTP client would dispatch the request.

2) Non-HTTP scheme SSRF

Escalation, only CurlImpersonateHttpClient. Nested-sitemap fetching dispatches the URL straight to the HTTP client, bypassing the Request construction step where Pydantic enforces http(s). Combined with the libcurl-backed CurlImpersonateHttpClient, this lets gopher://, file://, dict://, ftp://, etc., through.

Root cause

Crawlee already validates URL schemes through Pydantic's AnyHttpUrl (via validate_http_url in src/crawlee/_utils/urls.py) wherever a crawl target is materialised as a Request: the Request.url field is declared as Annotated[str, BeforeValidator(validate_http_url), Field(frozen=True)]. Anything that becomes a Request is therefore guaranteed to be http(s).

Two parts of the sitemap pipeline sidestepped this property in different ways:

1) Sitemap-derived URLs were enqueued without any host policy

SitemapRequestLoader took every <urlset><url><loc> entry, wrapped it in Request.from_url (which accepts any valid http(s) URL), and pushed the result into the request queue. RobotsTxtFile.get_sitemaps() returned every Sitemap: directive verbatim. Neither imposed any host check against the parent sitemap or robots.txt URL, so an attacker controlling that content could push internal-network HTTP URLs into the queue and have them crawled by whichever HTTP client was configured.

2) Nested sitemap fetching bypassed the `Request` chokepoint entirely

When _XmlSitemapParser encountered <sitemapindex><sitemap><loc>…</loc></sitemap></sitemapindex>, or when RobotsTxtFile.parse_sitemaps forwarded Sitemap: directives into the same pipeline, _fetch_and_process_sitemap dispatched the URL directly to the HTTP client:

async with http_client.stream(
    sitemap_url, 
    method='GET', 
    headers=SITEMAP_HEADERS, 
    proxy_info=proxy_info, 
    timeout=timeout,
) as response:
    ...

No Request was constructed, so the Pydantic validator never ran. Before the fix, the HTTP clients' own send_request() and stream() methods did not call validate_http_url either, so a non-http(s) scheme could pass straight through to the backend client.

The non-HTTP escalation in layer 2 is specific to CurlImpersonateHttpClient, which is backed by curl-cffi / libcurl and speaks gopher, file, dict, ftp, and other non-HTTP protocols. The other clients shipped with Crawlee (HttpxHttpClient, ImpitHttpClient, PlaywrightHttpClient) reject non-http(s) schemes at their own backend layer, regardless of what Crawlee passes in, so they were only affected by layer 1.
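The check that was missing on the direct-dispatch path can be illustrated with a small standard-library sketch. It is not Crawlee's `validate_http_url` (which, per the advisory, relies on Pydantic's `AnyHttpUrl`); the names `require_http_url` and `guarded_fetch` are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlsplit

# Only these schemes may reach the backend client.
ALLOWED_SCHEMES = frozenset({"http", "https"})


def require_http_url(url: str) -> str:
    """Return `url` unchanged if it is an absolute http(s) URL, else raise ValueError.

    Hypothetical stand-in for a scheme validator; not Crawlee's implementation.
    """
    parts = urlsplit(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    return url


def guarded_fetch(url: str, fetch):
    """Validate first, then pass the URL to the backend callable `fetch`.

    `fetch` is any callable taking a URL string (an assumption of this sketch).
    """
    return fetch(require_http_url(url))
```

With a guard of this shape at the client boundary, a `gopher://` or `file://` URL raises before any backend call, even when no `Request` object was ever constructed.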

Vulnerable paths

Layer 1 — cross-host HTTP (all HTTP clients)

Source: an attacker-controlled sitemap that lists internal URLs under <urlset><url><loc> or <sitemapindex><sitemap><loc>, or an attacker-controlled robots.txt that lists internal URLs under Sitemap:.
Sink: the configured HTTP client issues GET requests against those URLs — either via client.request(url=request.url, …) inside crawl() for regular sitemap URLs, or via client.stream(url, …) inside the nested-sitemap fetch.

Layer 2 — non-HTTP schemes (`CurlImpersonateHttpClient` only)

Source: a nested <sitemap><loc> entry or a robots.txt Sitemap: directive pointing to a non-http(s) URL.
Sink: CurlImpersonateHttpClient.stream(...) hands the URL string verbatim to client.request(url=…, …), which dispatches via libcurl.

Hardening in 1.7.0 was added at both producer and consumer ends — see Remediation.
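The two sources can be made concrete with a small audit sketch that sorts the `<loc>` entries of a sitemap into same-host, cross-host (layer 1), and non-http(s) (layer 2) buckets. It uses only the standard library; the function name `classify_locs`, the sample document, and the bucket names are assumptions made for illustration and are not part of Crawlee.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def classify_locs(sitemap_xml: str, parent_url: str) -> dict[str, list[str]]:
    """Sort every <loc> in a sitemap by how it relates to the parent URL.

    Hypothetical helper. ElementTree is not hardened against malicious XML,
    so a hardened parser is preferable for untrusted input.
    """
    parent_host = urlsplit(parent_url).hostname
    report: dict[str, list[str]] = {"ok": [], "cross_host": [], "non_http": []}
    root = ET.fromstring(sitemap_xml)
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            report["non_http"].append(url)    # layer 2 candidate
        elif parts.hostname != parent_host:
            report["cross_host"].append(url)  # layer 1 candidate
        else:
            report["ok"].append(url)
    return report


# Illustrative sitemap index, as if served from example.com.
SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>http://internal.corp/admin</loc></sitemap>
  <sitemap><loc>gopher://internal.corp:6379/_INFO</loc></sitemap>
</sitemapindex>"""

report = classify_locs(SAMPLE, "https://example.com/sitemap.xml")
```

Here only the first entry lands in `ok`; the second is a cross-host HTTP entry and the third a non-HTTP entry.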

Exploitation preconditions

The crawler uses sitemap loading: any of SitemapRequestLoader, Sitemap.load / parse_sitemap, discover_valid_sitemaps, or RobotsTxtFile.parse_sitemaps.
The attacker controls the body of a sitemap or robots.txt that the crawler fetches — typically by being the target site, or by getting a target site to publish a malicious sitemap.
The crawler's network egress can reach the attacker-chosen destination (e.g., internal services on the same network).
The targeted endpoint accepts unauthenticated requests. Crawlee does not supply credentials to the forged destination, so authenticated services (IMDSv2 with token, password-protected Redis, protected admin panels) are not reachable through this path.

For layer 2 (non-HTTP), the configured HTTP client must additionally be CurlImpersonateHttpClient.

Impact

Layer 1 — cross-host HTTP (any client)

The crawler can be coerced into issuing GET requests against internal HTTP services on its own network: admin panels, unauthenticated internal APIs, cloud metadata endpoints, etc. Read-back is blind — Crawlee surfaces fetched content only through its local Dataset / KeyValueStore (push_data() etc.) and does not natively forward scraped bodies anywhere external — so direct impact is mostly existence/timing probing and occasional state changes via side-effecting GET endpoints. Read-side leakage of internal content is only exploitable end-to-end if the deployer's own application separately exposes scraped data (for example, a public summariser or aggregator built on top of Crawlee).

Layer 2 — non-HTTP escalation (only `CurlImpersonateHttpClient`)

Under the affected client, attackers gain the libcurl scheme set:

gopher:// is the canonical RESP-injection vector: pipeline FLUSHALL, CONFIG SET dir, CONFIG SET dbfilename, SAVE to an unauthenticated Redis on the crawler's network — enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.
file:// allows the crawler to read local files (application secrets, configuration) on the crawler host.
dict:// and ftp:// permit fingerprinting and limited interaction with text-protocol services.

In both layers, the SSRF is blind in the default configuration. Write-side impact (gopher:// → Redis) and timing-based internal probing do not depend on read-back and remain viable regardless of whether the deployer surfaces scraped content.

Remediation

Both layers are fixed in crawlee==1.7.0. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:

Producer-side filtering: sitemap and robots.txt loaders (PR #1864). SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now run every nested-sitemap entry, every regular sitemap URL, and every Sitemap: directive through crawlee._utils.urls.filter_url. The filter applies an EnqueueStrategy (default 'same-hostname') against the parent sitemap / robots.txt URL, so cross-host entries are dropped, and it rejects non-http(s) schemes. The strategy is stamped onto the emitted Requests, so BasicCrawler._check_url_after_redirects continues to enforce the policy across redirects.
Consumer-side validation: HTTP-client boundary (PR #1862). validate_http_url(url) is now called at the top of send_request() and stream() in ImpitHttpClient, HttpxHttpClient, CurlImpersonateHttpClient, and PlaywrightHttpClient. Non-http(s) schemes raise pydantic.ValidationError before any backend call. crawl() was already covered, because Request.url is validated by Pydantic on construction.

After these changes, validation is enforced both where sitemap-derived HTTP requests are produced (sitemap and robots.txt loaders) and where they are consumed (HTTP clients). A regression at either layer is caught by the other.
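The producer-side policy can be sketched roughly as follows. This is not the code of `crawlee._utils.urls.filter_url`; `keep_url` and `registrable_domain` are hypothetical names, the strategy strings are the ones named in this advisory, and the domain comparison is deliberately naive (it takes the last two labels, whereas real code would consult the Public Suffix List).

```python
from urllib.parse import urlsplit


def registrable_domain(hostname: str) -> str:
    """Naive approximation: the last two DNS labels (wrong for e.g. co.uk)."""
    return ".".join(hostname.split(".")[-2:])


def keep_url(url: str, parent_url: str, strategy: str = "same-hostname") -> bool:
    """Decide whether a sitemap-derived URL may be enqueued.

    Hypothetical sketch: non-http(s) schemes are always rejected, then the
    host policy is applied relative to the parent sitemap / robots.txt URL.
    """
    target, parent = urlsplit(url), urlsplit(parent_url)
    if target.scheme not in ("http", "https") or not target.hostname:
        return False
    if strategy == "all":
        return True
    if strategy == "same-hostname":
        return target.hostname == parent.hostname
    if strategy == "same-domain":
        return registrable_domain(target.hostname) == registrable_domain(parent.hostname or "")
    raise ValueError(f"unknown strategy: {strategy!r}")


PARENT = "https://sitemaps.example.com/index.xml"
CANDIDATES = [
    "https://sitemaps.example.com/a.xml",
    "https://www.example.com/page",
    "http://internal.corp/admin",
    "gopher://sitemaps.example.com/x",
]
kept_default = [u for u in CANDIDATES if keep_url(u, PARENT)]
kept_same_domain = [u for u in CANDIDATES if keep_url(u, PARENT, "same-domain")]
```

Under the default, only the entry on the parent's own hostname survives; `'same-domain'` additionally admits `www.example.com`, while the internal host and the non-HTTP scheme are dropped in both cases.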

Behaviour change for upgraders

SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now default to enqueue_strategy='same-hostname'. Deployers that legitimately relied on cross-host sitemap entries (e.g., a sitemap index on sitemaps.example.com that points to content on www.example.com) must opt in explicitly with enqueue_strategy='same-domain' or enqueue_strategy='all'.

Finder credits

@r0otsu
@Yuremin (Zhengmin Yu)
@FORIMOC
@invoke1442 (Ethan Carter)
@Arturo0x90 (Arturo Melgarejo)

Severity

2.3 (Low)

CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:P/VC:L/VI:N/VA:N/SC:L/SI:N/SA:N

JSON

{
  "affected": [
    {
      "package": {
        "ecosystem": "PyPI",
        "name": "crawlee"
      },
      "ranges": [
        {
          "events": [
            {
              "introduced": "1.0.0"
            },
            {
              "fixed": "1.7.0"
            }
          ],
          "type": "ECOSYSTEM"
        }
      ]
    }
  ],
  "aliases": [
    "CVE-2026-46497"
  ],
  "database_specific": {
    "cwe_ids": [
      "CWE-918"
    ],
    "github_reviewed": true,
    "github_reviewed_at": "2026-05-21T19:28:10Z",
    "nvd_published_at": "2026-06-10T16:17:08Z",
    "severity": "LOW"
  },
  "details": "## Overview\n\n- **Vulnerability type:** Blind SSRF\n- **Affected components:** `src/crawlee/_utils/sitemap.py`, `src/crawlee/_utils/robots.py`, `src/crawlee/request_loaders/_sitemap_request_loader.py`, and all built-in HTTP clients.\n- **Trigger:** an attacker-controlled sitemap or `robots.txt` containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).\n\nTwo-layer SSRF via sitemap-derived URLs:\n\n### 1) Cross-host HTTP SSRF\n\nBase case, affects every HTTP client.** Sitemap entries and `robots.txt` `Sitemap:` directives were accepted regardless of the host they pointed to. A sitemap on `example.com` could push `http://internal.corp/admin` into the crawler\u0027s queue, and the configured HTTP client would dispatch the request.\n\n### 2) Non-HTTP scheme SSRF\n\nEscalation, only `CurlImpersonateHttpClient`.** Nested-sitemap fetching dispatches the URL straight to the HTTP client, bypassing the `Request` construction step where Pydantic enforces `http(s)`. Combined with the libcurl-backed `CurlImpersonateHttpClient`, this lets `gopher://`, `file://`, `dict://`, `ftp://`, etc., through.\n\n\n\n## Root cause\n\nCrawlee already validates URL schemes through Pydantic\u0027s `AnyHttpUrl` (via `validate_http_url` in `src/crawlee/_utils/urls.py`) wherever a crawl target is materialised as a `Request`: the `Request.url` field is declared as `Annotated[str, BeforeValidator(validate_http_url), Field(frozen=True)]`. Anything that becomes a `Request` is therefore guaranteed to be `http(s)`.\n\nTwo parts of the sitemap pipeline sidestepped this property in different ways:\n\n### 1) Sitemap-derived URLs were enqueued without any host policy\n\n`SitemapRequestLoader` took every `\u003curlset\u003e\u003curl\u003e\u003cloc\u003e` entry, wrapped it in `Request.from_url` (which accepts any valid `http(s)` URL), and pushed the result into the request queue. `RobotsTxtFile.get_sitemaps()` returned every `Sitemap:` directive verbatim. 
Neither imposed any host check against the parent sitemap or `robots.txt` URL, so an attacker controlling that content could push internal-network HTTP URLs into the queue and have them crawled by whichever HTTP client was configured.\n\n### 2) Nested sitemap fetching bypassed the `Request` chokepoint entirely\n\nWhen `_XmlSitemapParser` encountered `\u003csitemapindex\u003e\u003csitemap\u003e\u003cloc\u003e\u2026\u003c/loc\u003e\u003c/sitemap\u003e\u003c/sitemapindex\u003e`, or when `RobotsTxtFile.parse_sitemaps` forwarded `Sitemap:` directives into the same pipeline, `_fetch_and_process_sitemap` dispatched the URL directly to the HTTP client:\n\n```python\nasync with http_client.stream(\n    sitemap_url, \n    method=\u0027GET\u0027, \n    headers=SITEMAP_HEADERS, \n    proxy_info=proxy_info, \n    timeout=timeout,\n) as response:\n    ...\n```\n\nNo `Request` was constructed, so the Pydantic validator never ran. Before the fix, the HTTP clients\u0027 own `send_request()` and `stream()` methods did not call `validate_http_url` either, so a non-`http(s)` scheme could pass straight through to the backend client.\n\nThe non-HTTP escalation in layer 2 is **specific to** `CurlImpersonateHttpClient`, which is backed by `curl-cffi` / libcurl and speaks `gopher`, `file`, `dict`, `ftp`, and other non-HTTP protocols. 
The other clients shipped with Crawlee (`HttpxHttpClient`, `ImpitHttpClient`, `PlaywrightHttpClient`) reject non-`http(s)` schemes at their own backend layer, regardless of what Crawlee passes in, so they were only affected by layer 1.\n\n## Vulnerable paths\n\n### Layer 1 \u2014 cross-host HTTP (all HTTP clients)\n\n- *Source:* an attacker-controlled sitemap that lists internal URLs under `\u003curlset\u003e\u003curl\u003e\u003cloc\u003e` or `\u003csitemapindex\u003e\u003csitemap\u003e\u003cloc\u003e`, or an attacker-controlled `robots.txt` that lists internal URLs under `Sitemap:`.\n- *Sink:* the configured HTTP client issues `GET` requests against those URLs \u2014 either via `client.request(url=request.url, \u2026)` inside `crawl()` for regular sitemap URLs, or via `client.stream(url, \u2026)` inside the nested-sitemap fetch.\n\n### Layer 2 \u2014 non-HTTP schemes (`CurlImpersonateHttpClient` only)\n\n- *Source:* a nested `\u003csitemap\u003e\u003cloc\u003e` entry or a `robots.txt` `Sitemap:` directive pointing to a non-`http(s)` URL.\n- *Sink:* `CurlImpersonateHttpClient.stream(...)` hands the URL string verbatim to `client.request(url=\u2026, \u2026)`, which dispatches via libcurl.\n\nHardening in 1.7.0 was added at both producer and consumer ends \u2014 see *Remediation*.\n\n## Exploitation preconditions\n\n1. The crawler uses sitemap loading: any of `SitemapRequestLoader`, `Sitemap.load` / `parse_sitemap`, `discover_valid_sitemaps`, or `RobotsTxtFile.parse_sitemaps`.\n2. The attacker controls the body of a sitemap or `robots.txt` that the crawler fetches \u2014 typically by being the target site, or by getting a target site to publish a malicious sitemap.\n3. The crawler\u0027s network egress can reach the attacker-chosen destination (e.g., internal services on the same network).\n4. The targeted endpoint accepts unauthenticated requests. 
Crawlee does not supply credentials to the forged destination, so authenticated services (IMDSv2 with token, password-protected Redis, protected admin panels) are not reachable through this path.\n\nFor layer 2 (non-HTTP), the configured HTTP client must additionally be `CurlImpersonateHttpClient`.\n\n## Impact\n\n### Layer 1 \u2014 cross-host HTTP (any client)\n\nThe crawler can be coerced into issuing `GET` requests against internal HTTP services on its own network: admin panels, unauthenticated internal APIs, cloud metadata endpoints, etc. Read-back is blind \u2014 Crawlee surfaces fetched content only through its local `Dataset` / `KeyValueStore` (`push_data()` etc.) and does not natively forward scraped bodies anywhere external \u2014 so direct impact is mostly existence/timing probing and occasional state changes via side-effecting `GET` endpoints. Read-side leakage of internal content is only exploitable end-to-end if the deployer\u0027s own application separately exposes scraped data (for example, a public summariser or aggregator built on top of Crawlee).\n\n### Layer 2 \u2014 non-HTTP escalation (only `CurlImpersonateHttpClient`)\n\nUnder the affected client, attackers gain the libcurl scheme set:\n\n- `gopher://` is the canonical RESP-injection vector: pipeline `FLUSHALL`, `CONFIG SET dir`, `CONFIG SET dbfilename`, `SAVE` to an unauthenticated Redis on the crawler\u0027s network \u2014 enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.\n- `file://` allows the crawler to read local files (application secrets, configuration) on the crawler host.\n- `dict://` and `ftp://` permit fingerprinting and limited interaction with text-protocol services.\n\nIn both layers, the SSRF is blind in the default configuration. 
Write-side impact (`gopher://` \u2192 Redis) and timing-based internal probing do not depend on read-back and remain viable regardless of whether the deployer surfaces scraped content.\n\n## Remediation\n\nBoth layers are fixed in `crawlee==1.7.0`. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:\n\n1. **Producer-side filtering \u2014 sitemap and robots.txt loaders (PR #1864).** `SitemapRequestLoader` and `RobotsTxtFile.get_sitemaps()` now run every nested-sitemap entry, every regular sitemap URL, and every `Sitemap:` directive through `crawlee._utils.urls.filter_url`. This applies to an `EnqueueStrategy` (default `\u0027same-hostname\u0027`) against the parent sitemap / `robots.txt` URL \u2014 cross-host entries are dropped \u2014 and rejects non-`http(s)` schemes. The strategy is stamped onto the emitted `Request`s, so `BasicCrawler._check_url_after_redirects` continues policing the policy across redirects.\n2. **Consumer-side validation \u2014 HTTP-client boundary (PR #1862).** `validate_http_url(url)` is now called at the top of `send_request()` and `stream()` in `ImpitHttpClient`, `HttpxHttpClient`, `CurlImpersonateHttpClient`, and `PlaywrightHttpClient`. Non-`http(s)` schemes raise `pydantic.ValidationError` before any backend call. `crawl()` was already covered, because `Request.url` is validated by Pydantic on construction.\n\nAfter these changes, validation is enforced both where sitemap-derived HTTP requests are produced (sitemap and robots.txt loaders) and where they are consumed (HTTP clients). A regression at either layer is caught by the other.\n\n### Behaviour change for upgraders\n\n`SitemapRequestLoader` and `RobotsTxtFile.get_sitemaps()` now default to `enqueue_strategy=\u0027same-hostname\u0027`. 
Deployers that legitimately relied on cross-host sitemap entries (e.g., a sitemap index on `sitemaps.example.com` that points to content on `www.example.com`) must opt in explicitly with `enqueue_strategy=\u0027same-domain\u0027` or `enqueue_strategy=\u0027all\u0027`.\n\n## Finder credits\n\n- [@r0otsu](https://github.com/r0otsu)\n- [@Yuremin](https://github.com/Yuremin) (Zhengmin Yu)\n- [@FORIMOC](https://github.com/FORIMOC)\n- [@invoke1442](https://github.com/invoke1442) (Ethan Carter)\n- [@Arturo0x90](https://github.com/Arturo0x90) (Arturo Melgarejo)",
  "id": "GHSA-3r75-xc34-5f44",
  "modified": "2026-06-10T18:41:20Z",
  "published": "2026-05-21T19:28:10Z",
  "references": [
    {
      "type": "WEB",
      "url": "https://github.com/apify/crawlee-python/security/advisories/GHSA-3r75-xc34-5f44"
    },
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2026-46497"
    },
    {
      "type": "PACKAGE",
      "url": "https://github.com/apify/crawlee-python"
    },
    {
      "type": "WEB",
      "url": "https://github.com/apify/crawlee-python/releases/tag/v1.7.0"
    }
  ],
  "schema_version": "1.4.0",
  "severity": [
    {
      "score": "CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:P/VC:L/VI:N/VA:N/SC:L/SI:N/SA:N",
      "type": "CVSS_V4"
    }
  ],
  "summary": "Crawlee for Python: SSRF via sitemap-derived URLs"
}

CVE-2026-46497 (GCVE-0-2026-46497)

Vulnerability from cvelistv5 – Published: 2026-06-10 15:51 – Updated: 2026-06-10 18:19

Title

SSRF via sitemap-derived URLs in Crawlee for Python

Summary

Crawlee is a web scraping and browser automation library. From version 1.0.0 to before version 1.7.0, Crawlee is vulnerable to SSRF via sitemap-derived URLs. This issue has been patched in version 1.7.0.
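The affected range can be checked mechanically. The sketch below is a minimal illustration that assumes plain `X.Y.Z` release strings; pre-release tags such as `1.6.4b8` are rejected rather than interpreted, and the helper names are hypothetical.

```python
import re

# Range taken from this record: affected from 1.0.0 up to, not including, 1.7.0.
INTRODUCED = (1, 0, 0)
FIXED = (1, 7, 0)


def parse_release(version: str) -> tuple[int, int, int]:
    """Parse a plain X.Y.Z release string; raise ValueError for anything else."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a plain X.Y.Z release: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    return (major, minor, patch)


def is_affected(version: str) -> bool:
    """True when `version` lies in the half-open range [INTRODUCED, FIXED)."""
    return INTRODUCED <= parse_release(version) < FIXED
```

In a real deployment the installed version could be read with `importlib.metadata.version("crawlee")`, and a PEP 440 aware parser such as `packaging.version` would be the safer choice for pre-release tags.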

Severity

2.3 (Low)

CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:P/VC:L/VI:N/VA:N/SC:L/SI:N/SA:N

SSVC

Exploitation: none Automatable: no Technical Impact: partial

CISA Coordinator (v2.0.3)

CWE

CWE-918 - Server-Side Request Forgery (SSRF)

Assigner

GitHub_M

References

2 references

URL	Tags
https://github.com/apify/crawlee-python/security/…	x_refsource_CONFIRM
https://github.com/apify/crawlee-python/releases/…	x_refsource_MISC

Impacted products

1 product

Vendor	Product	Version
apify	crawlee-python	Affected: >= 1.0.0, < 1.7.0

JSON

{
  "containers": {
    "adp": [
      {
        "metrics": [
          {
            "other": {
              "content": {
                "id": "CVE-2026-46497",
                "options": [
                  {
                    "Exploitation": "none"
                  },
                  {
                    "Automatable": "no"
                  },
                  {
                    "Technical Impact": "partial"
                  }
                ],
                "role": "CISA Coordinator",
                "timestamp": "2026-06-10T18:19:28.457818Z",
                "version": "2.0.3"
              },
              "type": "ssvc"
            }
          }
        ],
        "providerMetadata": {
          "dateUpdated": "2026-06-10T18:19:35.807Z",
          "orgId": "134c704f-9b21-4f2e-91b3-4a467353bcc0",
          "shortName": "CISA-ADP"
        },
        "title": "CISA ADP Vulnrichment"
      }
    ],
    "cna": {
      "affected": [
        {
          "product": "crawlee-python",
          "vendor": "apify",
          "versions": [
            {
              "status": "affected",
              "version": "\u003e= 1.0.0, \u003c 1.7.0"
            }
          ]
        }
      ],
      "descriptions": [
        {
          "lang": "en",
          "value": "Crawlee is a web scraping and browser automation library. From version 1.0.0 to before version 1.7.0, Crawlee is vulnerable to SSRF via sitemap-derived URLs. This issue has been patched in version 1.7.0."
        }
      ],
      "metrics": [
        {
          "cvssV4_0": {
            "attackComplexity": "LOW",
            "attackRequirements": "PRESENT",
            "attackVector": "NETWORK",
            "baseScore": 2.3,
            "baseSeverity": "LOW",
            "privilegesRequired": "NONE",
            "subAvailabilityImpact": "NONE",
            "subConfidentialityImpact": "LOW",
            "subIntegrityImpact": "NONE",
            "userInteraction": "PASSIVE",
            "vectorString": "CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:P/VC:L/VI:N/VA:N/SC:L/SI:N/SA:N",
            "version": "4.0",
            "vulnAvailabilityImpact": "NONE",
            "vulnConfidentialityImpact": "LOW",
            "vulnIntegrityImpact": "NONE"
          }
        }
      ],
      "problemTypes": [
        {
          "descriptions": [
            {
              "cweId": "CWE-918",
              "description": "CWE-918: Server-Side Request Forgery (SSRF)",
              "lang": "en",
              "type": "CWE"
            }
          ]
        }
      ],
      "providerMetadata": {
        "dateUpdated": "2026-06-10T15:51:15.394Z",
        "orgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
        "shortName": "GitHub_M"
      },
      "references": [
        {
          "name": "https://github.com/apify/crawlee-python/security/advisories/GHSA-3r75-xc34-5f44",
          "tags": [
            "x_refsource_CONFIRM"
          ],
          "url": "https://github.com/apify/crawlee-python/security/advisories/GHSA-3r75-xc34-5f44"
        },
        {
          "name": "https://github.com/apify/crawlee-python/releases/tag/v1.7.0",
          "tags": [
            "x_refsource_MISC"
          ],
          "url": "https://github.com/apify/crawlee-python/releases/tag/v1.7.0"
        }
      ],
      "source": {
        "advisory": "GHSA-3r75-xc34-5f44",
        "discovery": "UNKNOWN"
      },
      "title": "SSRF via sitemap-derived URLs in Crawlee for Python"
    }
  },
  "cveMetadata": {
    "assignerOrgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
    "assignerShortName": "GitHub_M",
    "cveId": "CVE-2026-46497",
    "datePublished": "2026-06-10T15:51:15.394Z",
    "dateReserved": "2026-05-14T18:06:06.812Z",
    "dateUpdated": "2026-06-10T18:19:35.807Z",
    "state": "PUBLISHED"
  },
  "dataType": "CVE_RECORD",
  "dataVersion": "5.2"
}

PYSEC-2026-2430

Vulnerability from pysec - Published: 2026-07-13 15:19 - Updated: 2026-07-13 16:03

Details

Overview

Vulnerability type: Blind SSRF
Affected components: src/crawlee/_utils/sitemap.py, src/crawlee/_utils/robots.py, src/crawlee/request_loaders/_sitemap_request_loader.py, and all built-in HTTP clients.
Trigger: an attacker-controlled sitemap or robots.txt containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).

Two-layer SSRF via sitemap-derived URLs:

1) Cross-host HTTP SSRF

2) Non-HTTP scheme SSRF

Root cause

Two parts of the sitemap pipeline sidestepped this property in different ways:

1) Sitemap-derived URLs were enqueued without any host policy

2) Nested sitemap fetching bypassed the `Request` chokepoint entirely

async with http_client.stream(
    sitemap_url, 
    method='GET', 
    headers=SITEMAP_HEADERS, 
    proxy_info=proxy_info, 
    timeout=timeout,
) as response:
    ...

Vulnerable paths

Layer 1 — cross-host HTTP (all HTTP clients)

Source: an attacker-controlled sitemap that lists internal URLs under <urlset><url><loc> or <sitemapindex><sitemap><loc>, or an attacker-controlled robots.txt that lists internal URLs under Sitemap:.
Sink: the configured HTTP client issues GET requests against those URLs — either via client.request(url=request.url, …) inside crawl() for regular sitemap URLs, or via client.stream(url, …) inside the nested-sitemap fetch.

Layer 2 — non-HTTP schemes (`CurlImpersonateHttpClient` only)

Source: a nested <sitemap><loc> entry or a robots.txt Sitemap: directive pointing to a non-http(s) URL.
Sink: CurlImpersonateHttpClient.stream(...) hands the URL string verbatim to client.request(url=…, …), which dispatches via libcurl.

Hardening in 1.7.0 was added at both producer and consumer ends — see Remediation.

Exploitation preconditions

The crawler uses sitemap loading: any of SitemapRequestLoader, Sitemap.load / parse_sitemap, discover_valid_sitemaps, or RobotsTxtFile.parse_sitemaps.
The attacker controls the body of a sitemap or robots.txt that the crawler fetches — typically by being the target site, or by getting a target site to publish a malicious sitemap.
The crawler's network egress can reach the attacker-chosen destination (e.g., internal services on the same network).
The targeted endpoint accepts unauthenticated requests. Crawlee does not supply credentials to the forged destination, so authenticated services (IMDSv2 with token, password-protected Redis, protected admin panels) are not reachable through this path.

For layer 2 (non-HTTP), the configured HTTP client must additionally be CurlImpersonateHttpClient.

Impact

Layer 1 — cross-host HTTP (any client)

Layer 2 — non-HTTP escalation (only `CurlImpersonateHttpClient`)

Under the affected client, attackers gain the libcurl scheme set:

gopher:// is the canonical RESP-injection vector: pipeline FLUSHALL, CONFIG SET dir, CONFIG SET dbfilename, SAVE to an unauthenticated Redis on the crawler's network — enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.
file:// allows the crawler to read local files (application secrets, configuration) on the crawler host.
dict:// and ftp:// permit fingerprinting and limited interaction with text-protocol services.

Remediation

Both layers are fixed in crawlee==1.7.0. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:

Producer-side filtering: sitemap and robots.txt loaders (PR #1864). SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now run every nested-sitemap entry, every regular sitemap URL, and every Sitemap: directive through crawlee._utils.urls.filter_url. The filter applies an EnqueueStrategy (default 'same-hostname') against the parent sitemap / robots.txt URL, so cross-host entries are dropped, and it rejects non-http(s) schemes. The strategy is stamped onto the emitted Requests, so BasicCrawler._check_url_after_redirects continues to enforce the policy across redirects.
Consumer-side validation: HTTP-client boundary (PR #1862). validate_http_url(url) is now called at the top of send_request() and stream() in ImpitHttpClient, HttpxHttpClient, CurlImpersonateHttpClient, and PlaywrightHttpClient. Non-http(s) schemes raise pydantic.ValidationError before any backend call. crawl() was already covered, because Request.url is validated by Pydantic on construction.

Behaviour change for upgraders

Finder credits

@r0otsu
@Yuremin (Zhengmin Yu)
@FORIMOC
@invoke1442 (Ethan Carter)
@Arturo0x90 (Arturo Melgarejo)

Severity

2.3 (Low)

CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:P/VC:L/VI:N/VA:N/SC:L/SI:N/SA:N

Impacted products

Name	purl
crawlee	pkg:pypi/crawlee

Aliases

CVE-2026-46497
GHSA-3r75-xc34-5f44

JSON

{
  "affected": [
    {
      "package": {
        "ecosystem": "PyPI",
        "name": "crawlee",
        "purl": "pkg:pypi/crawlee"
      },
      "ranges": [
        {
          "events": [
            {
              "introduced": "1.0.0"
            },
            {
              "fixed": "1.7.0"
            }
          ],
          "type": "ECOSYSTEM"
        }
      ],
      "versions": [
        "1.0.0",
        "1.0.1",
        "1.0.2",
        "1.0.3",
        "1.0.4",
        "1.1.0",
        "1.1.1",
        "1.2.0",
        "1.2.1",
        "1.3.0",
        "1.3.1",
        "1.3.2",
        "1.4.0",
        "1.5.0",
        "1.5.1b1",
        "1.5.1b2",
        "1.5.1b3",
        "1.5.1b4",
        "1.5.1b5",
        "1.6.0",
        "1.6.1",
        "1.6.1b1",
        "1.6.1b2",
        "1.6.1b3",
        "1.6.1b4",
        "1.6.1b5",
        "1.6.2",
        "1.6.2b1",
        "1.6.2b2",
        "1.6.2b3",
        "1.6.2b4",
        "1.6.3",
        "1.6.3b1",
        "1.6.3b2",
        "1.6.3b3",
        "1.6.3b4",
        "1.6.3b5",
        "1.6.3b6",
        "1.6.4b1",
        "1.6.4b2",
        "1.6.4b3",
        "1.6.4b4",
        "1.6.4b5",
        "1.6.4b6",
        "1.6.4b7",
        "1.6.4b8"
      ]
    }
  ],
  "aliases": [
    "CVE-2026-46497",
    "GHSA-3r75-xc34-5f44"
  ],
  "details": "## Overview\n\n- **Vulnerability type:** Blind SSRF\n- **Affected components:** `src/crawlee/_utils/sitemap.py`, `src/crawlee/_utils/robots.py`, `src/crawlee/request_loaders/_sitemap_request_loader.py`, and all built-in HTTP clients.\n- **Trigger:** an attacker-controlled sitemap or `robots.txt` containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).\n\nTwo-layer SSRF via sitemap-derived URLs:\n\n### 1) Cross-host HTTP SSRF\n\nBase case, affects every HTTP client.** Sitemap entries and `robots.txt` `Sitemap:` directives were accepted regardless of the host they pointed to. A sitemap on `example.com` could push `http://internal.corp/admin` into the crawler\u0027s queue, and the configured HTTP client would dispatch the request.\n\n### 2) Non-HTTP scheme SSRF\n\nEscalation, only `CurlImpersonateHttpClient`.** Nested-sitemap fetching dispatches the URL straight to the HTTP client, bypassing the `Request` construction step where Pydantic enforces `http(s)`. Combined with the libcurl-backed `CurlImpersonateHttpClient`, this lets `gopher://`, `file://`, `dict://`, `ftp://`, etc., through.\n\n\n\n## Root cause\n\nCrawlee already validates URL schemes through Pydantic\u0027s `AnyHttpUrl` (via `validate_http_url` in `src/crawlee/_utils/urls.py`) wherever a crawl target is materialised as a `Request`: the `Request.url` field is declared as `Annotated[str, BeforeValidator(validate_http_url), Field(frozen=True)]`. Anything that becomes a `Request` is therefore guaranteed to be `http(s)`.\n\nTwo parts of the sitemap pipeline sidestepped this property in different ways:\n\n### 1) Sitemap-derived URLs were enqueued without any host policy\n\n`SitemapRequestLoader` took every `\u003curlset\u003e\u003curl\u003e\u003cloc\u003e` entry, wrapped it in `Request.from_url` (which accepts any valid `http(s)` URL), and pushed the result into the request queue. `RobotsTxtFile.get_sitemaps()` returned every `Sitemap:` directive verbatim. 
Neither imposed any host check against the parent sitemap or `robots.txt` URL, so an attacker controlling that content could push internal-network HTTP URLs into the queue and have them crawled by whichever HTTP client was configured.\n\n### 2) Nested sitemap fetching bypassed the `Request` chokepoint entirely\n\nWhen `_XmlSitemapParser` encountered `\u003csitemapindex\u003e\u003csitemap\u003e\u003cloc\u003e\u2026\u003c/loc\u003e\u003c/sitemap\u003e\u003c/sitemapindex\u003e`, or when `RobotsTxtFile.parse_sitemaps` forwarded `Sitemap:` directives into the same pipeline, `_fetch_and_process_sitemap` dispatched the URL directly to the HTTP client:\n\n```python\nasync with http_client.stream(\n    sitemap_url, \n    method=\u0027GET\u0027, \n    headers=SITEMAP_HEADERS, \n    proxy_info=proxy_info, \n    timeout=timeout,\n) as response:\n    ...\n```\n\nNo `Request` was constructed, so the Pydantic validator never ran. Before the fix, the HTTP clients\u0027 own `send_request()` and `stream()` methods did not call `validate_http_url` either, so a non-`http(s)` scheme could pass straight through to the backend client.\n\nThe non-HTTP escalation in layer 2 is **specific to** `CurlImpersonateHttpClient`, which is backed by `curl-cffi` / libcurl and speaks `gopher`, `file`, `dict`, `ftp`, and other non-HTTP protocols. 
The other clients shipped with Crawlee (`HttpxHttpClient`, `ImpitHttpClient`, `PlaywrightHttpClient`) reject non-`http(s)` schemes at their own backend layer, regardless of what Crawlee passes in, so they were only affected by layer 1.\n\n## Vulnerable paths\n\n### Layer 1 \u2014 cross-host HTTP (all HTTP clients)\n\n- *Source:* an attacker-controlled sitemap that lists internal URLs under `\u003curlset\u003e\u003curl\u003e\u003cloc\u003e` or `\u003csitemapindex\u003e\u003csitemap\u003e\u003cloc\u003e`, or an attacker-controlled `robots.txt` that lists internal URLs under `Sitemap:`.\n- *Sink:* the configured HTTP client issues `GET` requests against those URLs \u2014 either via `client.request(url=request.url, \u2026)` inside `crawl()` for regular sitemap URLs, or via `client.stream(url, \u2026)` inside the nested-sitemap fetch.\n\n### Layer 2 \u2014 non-HTTP schemes (`CurlImpersonateHttpClient` only)\n\n- *Source:* a nested `\u003csitemap\u003e\u003cloc\u003e` entry or a `robots.txt` `Sitemap:` directive pointing to a non-`http(s)` URL.\n- *Sink:* `CurlImpersonateHttpClient.stream(...)` hands the URL string verbatim to `client.request(url=\u2026, \u2026)`, which dispatches via libcurl.\n\nHardening in 1.7.0 was added at both producer and consumer ends \u2014 see *Remediation*.\n\n## Exploitation preconditions\n\n1. The crawler uses sitemap loading: any of `SitemapRequestLoader`, `Sitemap.load` / `parse_sitemap`, `discover_valid_sitemaps`, or `RobotsTxtFile.parse_sitemaps`.\n2. The attacker controls the body of a sitemap or `robots.txt` that the crawler fetches \u2014 typically by being the target site, or by getting a target site to publish a malicious sitemap.\n3. The crawler\u0027s network egress can reach the attacker-chosen destination (e.g., internal services on the same network).\n4. The targeted endpoint accepts unauthenticated requests. 
Crawlee does not supply credentials to the forged destination, so authenticated services (IMDSv2 with token, password-protected Redis, protected admin panels) are not reachable through this path.\n\nFor layer 2 (non-HTTP), the configured HTTP client must additionally be `CurlImpersonateHttpClient`.\n\n## Impact\n\n### Layer 1 \u2014 cross-host HTTP (any client)\n\nThe crawler can be coerced into issuing `GET` requests against internal HTTP services on its own network: admin panels, unauthenticated internal APIs, cloud metadata endpoints, etc. Read-back is blind \u2014 Crawlee surfaces fetched content only through its local `Dataset` / `KeyValueStore` (`push_data()` etc.) and does not natively forward scraped bodies anywhere external \u2014 so direct impact is mostly existence/timing probing and occasional state changes via side-effecting `GET` endpoints. Read-side leakage of internal content is only exploitable end-to-end if the deployer\u0027s own application separately exposes scraped data (for example, a public summariser or aggregator built on top of Crawlee).\n\n### Layer 2 \u2014 non-HTTP escalation (only `CurlImpersonateHttpClient`)\n\nUnder the affected client, attackers gain the libcurl scheme set:\n\n- `gopher://` is the canonical RESP-injection vector: pipeline `FLUSHALL`, `CONFIG SET dir`, `CONFIG SET dbfilename`, `SAVE` to an unauthenticated Redis on the crawler\u0027s network \u2014 enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.\n- `file://` allows the crawler to read local files (application secrets, configuration) on the crawler host.\n- `dict://` and `ftp://` permit fingerprinting and limited interaction with text-protocol services.\n\nIn both layers, the SSRF is blind in the default configuration. 
Write-side impact (`gopher://` \u2192 Redis) and timing-based internal probing do not depend on read-back and remain viable regardless of whether the deployer surfaces scraped content.\n\n## Remediation\n\nBoth layers are fixed in `crawlee==1.7.0`. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:\n\n1. **Producer-side filtering \u2014 sitemap and robots.txt loaders (PR #1864).** `SitemapRequestLoader` and `RobotsTxtFile.get_sitemaps()` now run every nested-sitemap entry, every regular sitemap URL, and every `Sitemap:` directive through `crawlee._utils.urls.filter_url`. This applies an `EnqueueStrategy` (default `\u0027same-hostname\u0027`) against the parent sitemap / `robots.txt` URL, so cross-host entries are dropped, and it rejects non-`http(s)` schemes. The strategy is stamped onto the emitted `Request`s, so `BasicCrawler._check_url_after_redirects` continues to enforce the policy across redirects.\n2. **Consumer-side validation \u2014 HTTP-client boundary (PR #1862).** `validate_http_url(url)` is now called at the top of `send_request()` and `stream()` in `ImpitHttpClient`, `HttpxHttpClient`, `CurlImpersonateHttpClient`, and `PlaywrightHttpClient`. Non-`http(s)` schemes raise `pydantic.ValidationError` before any backend call. `crawl()` was already covered, because `Request.url` is validated by Pydantic on construction.\n\nAfter these changes, validation is enforced both where sitemap-derived HTTP requests are produced (sitemap and robots.txt loaders) and where they are consumed (HTTP clients). A regression at either layer is caught by the other.\n\n### Behaviour change for upgraders\n\n`SitemapRequestLoader` and `RobotsTxtFile.get_sitemaps()` now default to `enqueue_strategy=\u0027same-hostname\u0027`. 
Deployers that legitimately relied on cross-host sitemap entries (e.g., a sitemap index on `sitemaps.example.com` that points to content on `www.example.com`) must opt in explicitly with `enqueue_strategy=\u0027same-domain\u0027` or `enqueue_strategy=\u0027all\u0027`.\n\n## Finder credits\n\n- [@r0otsu](https://github.com/r0otsu)\n- [@Yuremin](https://github.com/Yuremin) (Zhengmin Yu)\n- [@FORIMOC](https://github.com/FORIMOC)\n- [@invoke1442](https://github.com/invoke1442) (Ethan Carter)\n- [@Arturo0x90](https://github.com/Arturo0x90) (Arturo Melgarejo)",
  "id": "PYSEC-2026-2430",
  "modified": "2026-07-13T16:03:48.448162Z",
  "published": "2026-07-13T15:19:10.515048Z",
  "references": [
    {
      "type": "WEB",
      "url": "https://github.com/apify/crawlee-python/security/advisories/GHSA-3r75-xc34-5f44"
    },
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2026-46497"
    },
    {
      "type": "PACKAGE",
      "url": "https://github.com/apify/crawlee-python"
    },
    {
      "type": "WEB",
      "url": "https://github.com/apify/crawlee-python/releases/tag/v1.7.0"
    },
    {
      "type": "PACKAGE",
      "url": "https://pypi.org/project/crawlee"
    },
    {
      "type": "ADVISORY",
      "url": "https://github.com/advisories/GHSA-3r75-xc34-5f44"
    }
  ],
  "severity": [
    {
      "score": "CVSS:4.0/AV:N/AC:L/AT:P/PR:N/UI:P/VC:L/VI:N/VA:N/SC:L/SI:N/SA:N",
      "type": "CVSS_V4"
    }
  ],
  "summary": "Crawlee for Python: SSRF via sitemap-derived URLs"
}
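
The two boundaries described under Remediation in the record above can be illustrated with a minimal, standard-library-only sketch. This is not Crawlee's implementation: `require_http_url`, `same_hostname`, and `filter_sitemap_urls` are hypothetical names, `ValueError` stands in for the `pydantic.ValidationError` the advisory mentions, and only a same-hostname policy is modelled (port and scheme differences between parent and entry are ignored).

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only these schemes may ever reach an HTTP client.
ALLOWED_SCHEMES = {"http", "https"}


def require_http_url(url: str) -> str:
    """Consumer-side check (hypothetical): reject anything that is not http(s)."""
    try:
        parts = urlsplit(url)
    except ValueError as exc:  # e.g. a malformed IPv6 literal
        raise ValueError(f"unparseable URL: {url!r}") from exc
    if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.hostname:
        raise ValueError(f"refusing non-http(s) URL: {url!r}")
    return url


def same_hostname(candidate: str, parent: str) -> bool:
    """Producer-side check (hypothetical): http(s) and same host as the parent."""
    try:
        require_http_url(candidate)
        return urlsplit(candidate).hostname == urlsplit(parent).hostname
    except ValueError:
        return False


def filter_sitemap_urls(candidates: list[str], parent: str) -> list[str]:
    """Keep only entries that pass the same-hostname policy for `parent`."""
    return [url for url in candidates if same_hostname(url, parent)]
```

Under these assumptions, a sitemap fetched from `https://example.com/sitemap.xml` that lists `http://internal.corp/admin` or `gopher://127.0.0.1:6379/_FLUSHALL` would have both entries dropped by `filter_sitemap_urls`, and a direct call to `require_http_url` with the `gopher://` URL would raise before any backend client is invoked.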
