Vulnerability-Lookup

GHSA-5JV2-G5WQ-CMR4

Vulnerability from github – Published: 2026-06-17 14:03 – Updated: 2026-06-17 14:03

Summary

vLLM: GGUF dequantize kernel int truncation exposes uninitialized GPU memory in multi-tenant serving

Details

Summary

Integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes partial tensor processing. The output tensor is allocated at full size via torch::empty (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, constituting information disclosure.

Root Cause

The to_cuda_ggml_t function pointer type at ggml-common.h:1067 declares its element count parameter as int (32-bit):

using to_cuda_ggml_t = void (*)(const void * __restrict__ x,
                                dst_t * __restrict__ y,
                                int k,              // 32-bit
                                cudaStream_t stream);

All dequantize kernel functions (dequantize_block_cuda, dequantize_row_q2_K_cuda, etc. in dequantize.cuh) inherit this int k parameter and use it to compute the kernel launch grid size:

static void dequantize_block_cuda(..., const int k, cudaStream_t stream) {
    const int num_blocks = (k + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE);
    dequantize_block<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k);
}

In ggml_dequantize() at gguf_kernel.cu:85, the caller passes m * n (an int64_t product) to this int k parameter:

at::Tensor DW = torch::empty({m, n}, options);    // line 80: full-size, UNINITIALIZED
// ...
to_cuda((void*)W.data_ptr(), (scalar_t*)DW.data_ptr(), m * n, stream);  // line 85: m*n truncated to int

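When m * n > INT_MAX, the value that reaches k is in practice the low 32 bits of the product reinterpreted as a signed int, so it is always smaller than the actual tensor size and can be zero or negative. The kernel processes at most k elements. The remaining elements in DW, (m * n) - k of them when k is positive, are never written and contain stale GPU memory.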
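
The narrowing can be reproduced on the host with a minimal sketch. The helper names below are hypothetical and not part of vLLM. The conversion result is defined as the value modulo 2^32 since C++20; earlier standards leave it implementation-defined, although mainstream compilers behave the same way. The count of unwritten elements assumes the kernel writes exactly k elements when k is positive and none otherwise.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: models passing an int64_t element count to an
// `int k` parameter, as the call at gguf_kernel.cu:85 does.
inline int narrowed_count(std::int64_t m, std::int64_t n) {
    assert(m >= 0 && n >= 0);  // the product is assumed to fit in int64_t
    return static_cast<int>(m * n);
}

// Elements left unwritten, ASSUMING the kernel writes exactly k elements
// when k is positive and nothing when k is zero or negative.
inline std::int64_t unwritten_elements(std::int64_t m, std::int64_t n) {
    const std::int64_t total = m * n;
    const int k = narrowed_count(m, n);
    return k > 0 ? total - k : total;
}
```

For a [65536, 65536] tensor, m * n is exactly 2^32, so narrowed_count returns 0 and, under the stated assumption, the whole output buffer is left unwritten.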

For the dequantize path, this is a single root cause -- the int type on the k parameter in to_cuda_ggml_t -- with a single fix: change int k to int64_t k. All dequantize functions inherit this type through the same type alias.
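
A minimal sketch of the grid-size computation with a 64-bit element count follows. It is not the patch from the linked pull request: the function name is hypothetical, and the block size of 256 is an assumption, since the advisory does not state the value of CUDA_DEQUANTIZE_BLOCK_SIZE. A complete fix would presumably also need the kernel's own index arithmetic to be 64-bit.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMED value for illustration; the advisory does not state it.
constexpr std::int64_t kDequantizeBlockSize = 256;

// Hypothetical host-side helper mirroring the num_blocks expression in
// dequantize_block_cuda, but carrying the element count as int64_t so that
// counts above INT_MAX are not truncated before the division.
inline std::int64_t num_blocks_for(std::int64_t k) {
    assert(k >= 0);
    return (k + 2 * kDequantizeBlockSize - 1) / (2 * kDequantizeBlockSize);
}
```

Under these assumptions, a count of 2^32 elements maps to 2^23 blocks, which is within the CUDA limit of 2^31 - 1 blocks in the x dimension of a grid.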

Affected Functions

All in csrc/quantization/gguf/gguf_kernel.cu:

Function	Line	Allocation	Info Disclosure?
ggml_dequantize	74	torch::empty({m, n}) at line 80	Yes -- m*n truncated to int k at line 85
ggml_mul_mat_vec_a8	91	torch::empty({vecs, row}) at line 99	Yes -- int col = X.sizes()[1] at line 94
ggml_mul_mat_a8	207	torch::empty({batch, row}) at line 215	Yes -- int col = X.sizes()[1] at line 210
ggml_moe_a8	279	torch::empty({tokens*top_k, row}) at line 289	Yes -- int col = X.sizes()[1] at line 285

All four functions allocate output tensors with torch::empty (uninitialized) and then run CUDA kernels that use truncated dimension values as loop bounds. The unfilled portion of each output tensor retains stale GPU memory.

ggml_moe_a8_vec (line 382) uses torch::zeros instead of torch::empty, so it is not affected by the info disclosure variant.
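
The int col = X.sizes()[1] pattern in the table narrows a 64-bit dimension without a range check. A defensive alternative can be sketched as below; checked_dim_to_int is a hypothetical helper written in plain C++, not an existing vLLM or PyTorch function, and it is not taken from the actual fix.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical guard: convert a 64-bit tensor dimension to int, throwing
// instead of silently truncating when the value does not fit.
inline int checked_dim_to_int(std::int64_t dim) {
    if (dim < 0 || dim > std::numeric_limits<int>::max()) {
        throw std::overflow_error("tensor dimension does not fit in int");
    }
    const int narrowed = static_cast<int>(dim);
    assert(narrowed == dim);  // sanity check: the conversion was lossless
    return narrowed;
}
```

With a guard of this kind, an oversized dimension produces an error at the call site rather than a partially filled output tensor.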

Impact: Information Disclosure in Multi-Tenant Serving

vLLM is designed for multi-tenant inference serving. GPU memory is reused across requests from different users. When the dequantize kernel partially fills an output tensor:

1. The output tensor DW is allocated with torch::empty -- the buffer contains whatever was previously in that GPU memory region.
2. The dequantize kernel fills only a truncated portion of the buffer.
3. The unfilled portion retains residual data from prior GPU operations, which may include tensor data from other users' inference requests.
4. The contaminated tensor proceeds through the model computation.
5. No error or warning is generated -- the partial fill is silent.

This is a confidentiality violation. In shared inference deployments (the primary vLLM use case), one user's inference data can leak into another user's model computation through residual GPU memory.
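
The partial fill described in the steps above can be modelled on the host at a reduced scale. The sketch below is an illustration only: it narrows the element count to int16_t instead of int so that a small buffer shows the effect, a std::vector pre-filled with a marker value stands in for the uninitialized torch::empty allocation, and a plain loop stands in for the dequantize kernel. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker standing in for residual data left by an earlier request.
constexpr float kStale = -1.0f;

// Scaled-down model of the bug. Returns how many elements still hold the
// stale marker after the partial fill.
inline std::size_t stale_after_partial_fill(std::int32_t total) {
    assert(total >= 0);
    // Stand-in for torch::empty: the buffer already holds old contents.
    std::vector<float> out(static_cast<std::size_t>(total), kStale);

    // Narrowing step, analogous to passing m * n to `int k`
    // (modular since C++20, implementation-defined before).
    const std::int16_t k = static_cast<std::int16_t>(total);

    // Stand-in for the dequantize kernel: writes only the first k elements,
    // and nothing at all when k is zero or negative.
    for (std::int32_t i = 0; i < k; ++i) {
        out[static_cast<std::size_t>(i)] = 0.5f;
    }

    std::size_t stale = 0;
    for (float v : out) {
        if (v == kStale) ++stale;
    }
    return stale;
}
```

With total = 70000 the narrowed count is 4464, so 65536 elements keep the stale marker, and no error is raised anywhere in the model.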

Attacker Control

The attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds INT_MAX (e.g., a matrix with shape [65536, 65536] gives m * n = 4,294,967,296 = 2^32, whose low 32 bits are all zero, so k truncates to 0). The model is hosted on Hugging Face or any model hub. The victim loads the model with vLLM for inference serving. The truncation happens automatically during model weight dequantization.

Fix

A fix for this vulnerability was added here: https://github.com/vllm-project/vllm/pull/44971

Severity

5.3 (Medium)


                  
CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:P/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N

JSON

{
  "affected": [
    {
      "package": {
        "ecosystem": "PyPI",
        "name": "vllm"
      },
      "ranges": [
        {
          "events": [
            {
              "introduced": "0.5.5"
            },
            {
              "last_affected": "0.23.0"
            }
          ],
          "type": "ECOSYSTEM"
        }
      ]
    }
  ],
  "aliases": [
    "CVE-2026-53923"
  ],
  "database_specific": {
    "cwe_ids": [
      "CWE-200",
      "CWE-681"
    ],
    "github_reviewed": true,
    "github_reviewed_at": "2026-06-17T14:03:11Z",
    "nvd_published_at": null,
    "severity": "MODERATE"
  },
  "details": "## Summary\n\nInteger truncation of tensor dimensions in vLLM\u0027s GGUF dequantize kernels (`csrc/quantization/gguf/gguf_kernel.cu`) causes partial tensor processing. The output tensor is allocated at full size via `torch::empty` (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users\u0027 inference requests, constituting information disclosure.\n\n## Root Cause\n\nThe `to_cuda_ggml_t` function pointer type at `ggml-common.h:1067` declares its element count parameter as `int` (32-bit):\n\n```cpp\nusing to_cuda_ggml_t = void (*)(const void * __restrict__ x,\n                                dst_t * __restrict__ y,\n                                int k,              // 32-bit\n                                cudaStream_t stream);\n```\n\nAll dequantize kernel functions (`dequantize_block_cuda`, `dequantize_row_q2_K_cuda`, etc. in `dequantize.cuh`) inherit this `int k` parameter and use it as the kernel launch grid size:\n\n```cpp\nstatic void dequantize_block_cuda(..., const int k, cudaStream_t stream) {\n    const int num_blocks = (k + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE);\n    dequantize_block\u003c\u003c\u003cnum_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream\u003e\u003e\u003e(vx, y, k);\n}\n```\n\nIn `ggml_dequantize()` at `gguf_kernel.cu:85`, the caller passes `m * n` (an `int64_t` product) to this `int k` parameter:\n\n```cpp\nat::Tensor DW = torch::empty({m, n}, options);    // line 80: full-size, UNINITIALIZED\n// ...\nto_cuda((void*)W.data_ptr(), (scalar_t*)DW.data_ptr(), m * n, stream);  // line 85: m*n truncated to int\n```\n\nWhen `m * n \u003e INT_MAX`, the truncated `k` is smaller than the actual tensor size. The kernel processes `k` elements. 
The remaining `(m * n) - k` elements in `DW` are never written and contain stale GPU memory.\n\nThis is a single root cause -- the `int` type on the `k` parameter in `to_cuda_ggml_t` -- with a single fix: change `int k` to `int64_t k`. All dequantize functions inherit this type through the same typedef.\n\n## Affected Functions\n\nAll in `csrc/quantization/gguf/gguf_kernel.cu`:\n\n| Function | Line | Allocation | Info Disclosure? |\n|----------|------|-----------|-----------------|\n| `ggml_dequantize` | 74 | `torch::empty({m, n})` at line 80 | Yes -- `m*n` truncated to `int k` at line 85 |\n| `ggml_mul_mat_vec_a8` | 91 | `torch::empty({vecs, row})` at line 99 | Yes -- `int col = X.sizes()[1]` at line 94 |\n| `ggml_mul_mat_a8` | 207 | `torch::empty({batch, row})` at line 215 | Yes -- `int col = X.sizes()[1]` at line 210 |\n| `ggml_moe_a8` | 279 | `torch::empty({tokens*top_k, row})` at line 289 | Yes -- `int col = X.sizes()[1]` at line 285 |\n\nAll four functions allocate output tensors with `torch::empty` (uninitialized) and then run CUDA kernels that use truncated dimension values as loop bounds. The unfilled portion of each output tensor retains stale GPU memory.\n\n`ggml_moe_a8_vec` (line 382) uses `torch::zeros` instead of `torch::empty`, so it is not affected by the info disclosure variant.\n\n## Impact: Information Disclosure in Multi-Tenant Serving\n\nvLLM is designed for multi-tenant inference serving. GPU memory is reused across requests from different users. When the dequantize kernel partially fills an output tensor:\n\n1. The output tensor `DW` is allocated with `torch::empty` -- the buffer contains whatever was previously in that GPU memory region\n2. The dequantize kernel fills only a truncated portion of the buffer\n3. The unfilled portion retains residual data from prior GPU operations, which may include tensor data from other users\u0027 inference requests\n4. The contaminated tensor proceeds through the model computation\n5. 
No error or warning is generated -- the partial fill is silent\n\nThis is a confidentiality violation. In shared inference deployments (the primary vLLM use case), one user\u0027s inference data can leak into another user\u0027s model computation through residual GPU memory.\n\n## Attacker Control\n\nThe attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds `INT_MAX` (e.g., a matrix with shape `[65536, 65536]` gives `m * n = 4,294,967,296`). The model is hosted on HuggingFace or any model hub. The victim loads the model with vLLM for inference serving. The truncation happens automatically during model weight dequantization.\n\n## Fix\n\nA fix for this vulnerability was added here: https://github.com/vllm-project/vllm/pull/44971",
  "id": "GHSA-5jv2-g5wq-cmr4",
  "modified": "2026-06-17T14:03:11Z",
  "published": "2026-06-17T14:03:11Z",
  "references": [
    {
      "type": "WEB",
      "url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-5jv2-g5wq-cmr4"
    },
    {
      "type": "WEB",
      "url": "https://github.com/vllm-project/vllm/pull/44971"
    },
    {
      "type": "WEB",
      "url": "https://github.com/vllm-project/vllm/commit/f219788f91952827132fa4fdf916427cd20d225e"
    },
    {
      "type": "PACKAGE",
      "url": "https://github.com/vllm-project/vllm"
    }
  ],
  "schema_version": "1.4.0",
  "severity": [
    {
      "score": "CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:P/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N",
      "type": "CVSS_V4"
    }
  ],
  "summary": "vLLM: GGUF dequantize kernel int truncation exposes uninitialized GPU memory in multi-tenant serving"
}

CVE-2026-53923 (GCVE-0-2026-53923)

Vulnerability from cvelistv5 – Published: 2026-06-22 21:55 – Updated: 2026-06-23 15:05

Title

vLLM GGUF Kernels: int64_t to int truncation of tensor dimensions causes GPU buffer overflow

Summary

vLLM is an inference and serving engine for large language models (LLMs). From 0.5.5 until 0.23.1rc0, integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes partial tensor processing. The output tensor is allocated at full size via torch::empty (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, constituting information disclosure. This vulnerability is fixed in 0.23.1rc0.

Severity

5.3 (Medium)


                        
CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:P/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N

SSVC

Exploitation: none; Automatable: no; Technical Impact: partial

CISA Coordinator (v2.0.3)

CWE

CWE-681 - Incorrect Conversion between Numeric Types
CWE-200 - Exposure of Sensitive Information to an Unauthorized Actor

Assigner

GitHub_M

References

3 references

URL	Tags
https://github.com/vllm-project/vllm/security/adv…	x_refsource_CONFIRM
https://github.com/vllm-project/vllm/pull/44971	x_refsource_MISC
https://github.com/vllm-project/vllm/commit/f2197…	x_refsource_MISC

Impacted products

1 product

Vendor	Product	Version
vllm-project	vllm	Affected: >= 0.5.5, < 0.23.1rc0

JSON

{
  "containers": {
    "adp": [
      {
        "metrics": [
          {
            "other": {
              "content": {
                "id": "CVE-2026-53923",
                "options": [
                  {
                    "Exploitation": "none"
                  },
                  {
                    "Automatable": "no"
                  },
                  {
                    "Technical Impact": "partial"
                  }
                ],
                "role": "CISA Coordinator",
                "timestamp": "2026-06-23T15:04:15.555317Z",
                "version": "2.0.3"
              },
              "type": "ssvc"
            }
          }
        ],
        "providerMetadata": {
          "dateUpdated": "2026-06-23T15:05:21.711Z",
          "orgId": "134c704f-9b21-4f2e-91b3-4a467353bcc0",
          "shortName": "CISA-ADP"
        },
        "title": "CISA ADP Vulnrichment"
      }
    ],
    "cna": {
      "affected": [
        {
          "product": "vllm",
          "vendor": "vllm-project",
          "versions": [
            {
              "status": "affected",
              "version": "\u003e= 0.5.5, \u003c 0.23.1rc0"
            }
          ]
        }
      ],
      "descriptions": [
        {
          "lang": "en",
          "value": "vLLM is an inference and serving engine for large language models (LLMs). From 0.5.5 until 0.23.1rc0, integer truncation of tensor dimensions in vLLM\u0027s GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes partial tensor processing. The output tensor is allocated at full size via torch::empty (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users\u0027 inference requests, constituting information disclosure. This vulnerability is fixed in 0.23.1rc0."
        }
      ],
      "metrics": [
        {
          "cvssV4_0": {
            "attackComplexity": "LOW",
            "attackRequirements": "NONE",
            "attackVector": "NETWORK",
            "baseScore": 5.3,
            "baseSeverity": "MEDIUM",
            "privilegesRequired": "NONE",
            "subAvailabilityImpact": "NONE",
            "subConfidentialityImpact": "NONE",
            "subIntegrityImpact": "NONE",
            "userInteraction": "PASSIVE",
            "vectorString": "CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:P/VC:L/VI:L/VA:N/SC:N/SI:N/SA:N",
            "version": "4.0",
            "vulnAvailabilityImpact": "NONE",
            "vulnConfidentialityImpact": "LOW",
            "vulnIntegrityImpact": "LOW"
          }
        }
      ],
      "problemTypes": [
        {
          "descriptions": [
            {
              "cweId": "CWE-681",
              "description": "CWE-681: Incorrect Conversion between Numeric Types",
              "lang": "en",
              "type": "CWE"
            }
          ]
        },
        {
          "descriptions": [
            {
              "cweId": "CWE-200",
              "description": "CWE-200: Exposure of Sensitive Information to an Unauthorized Actor",
              "lang": "en",
              "type": "CWE"
            }
          ]
        }
      ],
      "providerMetadata": {
        "dateUpdated": "2026-06-22T21:55:42.001Z",
        "orgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
        "shortName": "GitHub_M"
      },
      "references": [
        {
          "name": "https://github.com/vllm-project/vllm/security/advisories/GHSA-5jv2-g5wq-cmr4",
          "tags": [
            "x_refsource_CONFIRM"
          ],
          "url": "https://github.com/vllm-project/vllm/security/advisories/GHSA-5jv2-g5wq-cmr4"
        },
        {
          "name": "https://github.com/vllm-project/vllm/pull/44971",
          "tags": [
            "x_refsource_MISC"
          ],
          "url": "https://github.com/vllm-project/vllm/pull/44971"
        },
        {
          "name": "https://github.com/vllm-project/vllm/commit/f219788f91952827132fa4fdf916427cd20d225e",
          "tags": [
            "x_refsource_MISC"
          ],
          "url": "https://github.com/vllm-project/vllm/commit/f219788f91952827132fa4fdf916427cd20d225e"
        }
      ],
      "source": {
        "advisory": "GHSA-5jv2-g5wq-cmr4",
        "discovery": "UNKNOWN"
      },
      "title": "vLLM GGUF Kernels: int64_t to int truncation of tensor dimensions causes GPU buffer overflow"
    }
  },
  "cveMetadata": {
    "assignerOrgId": "a0819718-46f1-4df5-94e2-005712e83aaa",
    "assignerShortName": "GitHub_M",
    "cveId": "CVE-2026-53923",
    "datePublished": "2026-06-22T21:55:42.001Z",
    "dateReserved": "2026-06-11T15:46:12.316Z",
    "dateUpdated": "2026-06-23T15:05:21.711Z",
    "state": "PUBLISHED"
  },
  "dataType": "CVE_RECORD",
  "dataVersion": "5.2"
}

