GHSA-83VM-P52W-F9PW

Vulnerability from github – Published: 2026-05-06 21:45 – Updated: 2026-05-13 16:29
Summary
vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters
Details

Summary

The extract_hidden_states speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a RuntimeError that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (repetition_penalty, frequency_penalty, or presence_penalty).

A single request with a penalty parameter (e.g., "repetition_penalty": 1.1) is sufficient to crash the server. The crash is deterministic and immediate: no concurrency, race condition, or special workload is required.

Details

In vLLM v0.17.0, the extract_hidden_states proposer's propose() method returned sampled_token_ids.unsqueeze(-1), producing a tensor of shape (batch_size, 1).

In PR #37013 (first released in v0.18.0), the KV connector interface was refactored out of propose(). The return type changed from tuple[Tensor, KVConnectorOutput | None] to Tensor, and the .unsqueeze(-1) call was removed along with the KV connector output:

# Before (v0.17.0):
return sampled_token_ids.unsqueeze(-1), kv_connector_output  # shape (batch_size, 1)

# After (v0.18.0+):
return sampled_token_ids  # shape (batch_size, 2) after first decode step

The refactor overlooked that the shape of sampled_token_ids differs between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as (batch_size, max_spec_len + 1). With num_speculative_tokens=1, this yields shape (batch_size, 2) rather than the expected (batch_size, 1), and penalty application then fails with a broadcast shape mismatch.
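
The shape arithmetic above can be sketched in plain Python, without PyTorch. The helper name rejection_sampler_output_shape is hypothetical and only mirrors the (batch_size, max_spec_len + 1) allocation described here; the sketch also assumes max_spec_len equals num_speculative_tokens. It is not vLLM code.

```python
def rejection_sampler_output_shape(batch_size: int, max_spec_len: int) -> tuple[int, int]:
    """Shape the advisory describes for the rejection sampler's output.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    return (batch_size, max_spec_len + 1)


batch_size = 4
num_speculative_tokens = 1  # assumed equal to max_spec_len in this sketch

# Shape the caller of propose() expects, per the advisory.
expected_shape = (batch_size, 1)
# Shape actually returned after the first decode step on affected versions.
actual_shape = rejection_sampler_output_shape(batch_size, num_speculative_tokens)

# The extra column is what the penalty code path does not expect.
shape_matches = actual_shape == expected_shape
print(actual_shape, expected_shape, shape_matches)
```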

Impact

Any vLLM deployment running v0.18.0 through v0.19.1 (inclusive) with extract_hidden_states speculative decoding configured is affected. A single API request containing any penalty parameter crashes the EngineCore process immediately and permanently, resulting in complete loss of service availability.
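
As a rough aid for triage, the affected range can be checked with a few lines of Python. This is a minimal sketch: parse_version and is_affected are hypothetical names, the parser assumes plain three-component "X.Y.Z" strings and does not handle pre-release or local suffixes (packaging.version.Version would be more robust), and a version in range is only exposed when extract_hidden_states speculative decoding is actually configured.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Parse a plain 'X.Y.Z' version string, tolerating a leading 'v'."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


AFFECTED_FROM = (0, 18, 0)  # first release containing the regression
FIXED_IN = (0, 20, 0)       # first release containing the fix


def is_affected(version: str) -> bool:
    """True if the version lies in the half-open range [0.18.0, 0.20.0)."""
    return AFFECTED_FROM <= parse_version(version) < FIXED_IN


for candidate in ("0.17.0", "0.18.0", "0.19.1", "0.20.0"):
    print(candidate, is_affected(candidate))
```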

Patches

Fixed in PR #38610, first included in vLLM v0.20.0. The fix slices the return value to sampled_token_ids[:, :1], ensuring the correct (batch_size, 1) shape regardless of the rejection sampler's output dimensions.
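
The effect of the slice can be illustrated on nested lists standing in for a 2-D tensor. This is a plain-Python analogue, not the patched vLLM code; first_column and shape are hypothetical helpers, and the token values are made up.

```python
def first_column(sampled_token_ids: list[list[int]]) -> list[list[int]]:
    """Plain-Python analogue of `sampled_token_ids[:, :1]` on a 2-D tensor."""
    return [row[:1] for row in sampled_token_ids]


def shape(rows: list[list[int]]) -> tuple[int, int]:
    """Return (number of rows, number of columns) of a rectangular nested list."""
    return (len(rows), len(rows[0]) if rows else 0)


# Made-up token ids: one column per row, then two columns per row.
one_column = [[11], [22], [33]]
two_columns = [[11, 12], [22, 23], [33, 34]]

# Either input width ends up as (batch_size, 1) after slicing.
print(shape(first_column(one_column)))
print(shape(first_column(two_columns)))
```

Because the slice keeps the leading column and the 2-D layout, the result has the same shape whether the input has one column or two.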

Workarounds

  • Upgrade to vLLM v0.20.0 or later.
  • If upgrading is not possible, avoid using extract_hidden_states as the speculative decoding method on affected versions.
  • Alternatively, reject or strip penalty parameters (repetition_penalty, frequency_penalty, presence_penalty) from incoming requests at an API gateway before they reach vLLM.
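
The gateway workaround could look roughly like the following. This is a minimal sketch, assuming the penalty parameters arrive as top-level keys of a JSON object request body; strip_penalties is a hypothetical helper, and wiring it into a specific proxy or gateway is left out.

```python
import json

# Parameter names taken from the advisory.
PENALTY_KEYS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def strip_penalties(body: bytes) -> tuple[bytes, list[str]]:
    """Remove top-level penalty parameters from a JSON request body.

    Returns the cleaned body and the names of the keys that were removed.
    Bodies that are not JSON objects are returned unchanged.
    """
    payload = json.loads(body)
    if not isinstance(payload, dict):
        return body, []
    removed = [key for key in PENALTY_KEYS if key in payload]
    for key in removed:
        del payload[key]
    return json.dumps(payload).encode("utf-8"), removed


request_body = b'{"model": "m", "prompt": "hi", "repetition_penalty": 1.1}'
cleaned_body, removed_keys = strip_penalties(request_body)
print(cleaned_body, removed_keys)
```

A gateway that prefers to reject such requests can instead return an error whenever the list of removed keys is non-empty.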

{
  "affected": [
    {
      "package": {
        "ecosystem": "PyPI",
        "name": "vllm"
      },
      "ranges": [
        {
          "events": [
            {
              "introduced": "0.18.0"
            },
            {
              "fixed": "0.20.0"
            }
          ],
          "type": "ECOSYSTEM"
        }
      ]
    }
  ],
  "aliases": [
    "CVE-2026-44223"
  ],
  "database_specific": {
    "cwe_ids": [
      "CWE-131",
      "CWE-704"
    ],
    "github_reviewed": true,
    "github_reviewed_at": "2026-05-06T21:45:51Z",
    "nvd_published_at": "2026-05-12T20:16:43Z",
    "severity": "MODERATE"
  },
  "details": "### Summary\n\nThe `extract_hidden_states` speculative decoding proposer in vLLM returns a tensor with an incorrect shape after the first decode step, causing a `RuntimeError` that crashes the EngineCore process. The crash is triggered when any request in the batch uses sampling penalty parameters (`repetition_penalty`, `frequency_penalty`, or `presence_penalty`).\n\nA single request with a penalty parameter (e.g., `\"repetition_penalty\": 1.1`) is sufficient to crash the server. The crash is deterministic and immediate \u2014 no concurrency, race condition, or special workload is required.\n\n### Details\n\nIn vLLM v0.17.0, the `extract_hidden_states` proposer\u0027s `propose()` method returned `sampled_token_ids.unsqueeze(-1)`, producing a tensor of shape `(batch_size, 1)`.\n\nIn [PR #37013](https://github.com/vllm-project/vllm/pull/37013) (first released in v0.18.0), the KV connector interface was refactored out of `propose()`. The return type changed from `tuple[Tensor, KVConnectorOutput | None]` to `Tensor`, and the `.unsqueeze(-1)` call was removed along with the KV connector output:\n\n```python\n# Before (v0.17.0):\nreturn sampled_token_ids.unsqueeze(-1), kv_connector_output  # shape (batch_size, 1)\n\n# After (v0.18.0+):\nreturn sampled_token_ids  # shape (batch_size, 2) after first decode step\n```\n\nThe refactor missed that `sampled_token_ids` changed semantics between the first and subsequent decode steps. After the first decode step, the rejection sampler allocates its output as `(batch_size, max_spec_len + 1)`. With `num_speculative_tokens=1`, this produces shape `(batch_size, 2)` instead of the expected `(batch_size, 1)`, causing a broadcast shape mismatch during penalty application.\n\n### Impact\n\nAny vLLM deployment between v0.18.0 and v0.19.1 (inclusive) configured with `extract_hidden_states` speculative decoding is affected. A single API request containing any penalty parameter immediately and permanently crashes the EngineCore process, resulting in complete loss of service availability.\n\n### Patches\n\nFixed in [PR #38610](https://github.com/vllm-project/vllm/pull/38610), first included in vLLM v0.20.0. The fix slices the return value to `sampled_token_ids[:, :1]`, ensuring the correct `(batch_size, 1)` shape regardless of the rejection sampler\u0027s output dimensions.\n\n### Workarounds\n\n- Upgrade to vLLM v0.20.0 or later.\n- If upgrading is not possible, avoid using `extract_hidden_states` as the speculative decoding method on affected versions.\n- Alternatively, reject or strip penalty parameters (`repetition_penalty`, `frequency_penalty`, `presence_penalty`) from incoming requests at an API gateway before they reach vLLM.",
  "id": "GHSA-83vm-p52w-f9pw",
  "modified": "2026-05-13T16:29:07Z",
  "published": "2026-05-06T21:45:51Z",
  "references": [
    {
      "type": "WEB",
      "url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-83vm-p52w-f9pw"
    },
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2026-44223"
    },
    {
      "type": "WEB",
      "url": "https://github.com/vllm-project/vllm/pull/38610"
    },
    {
      "type": "PACKAGE",
      "url": "https://github.com/vllm-project/vllm"
    }
  ],
  "schema_version": "1.4.0",
  "severity": [
    {
      "score": "CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
      "type": "CVSS_V3"
    }
  ],
  "summary": "vLLM: extract_hidden_states speculative decoding crashes server on any request with penalty parameters"
}

