Vulnerability-Lookup

GHSA-R8HV-J9G9-64G6

Vulnerability from github – Published: 2026-05-28 12:30 – Updated: 2026-06-11 18:31

Details

In the Linux kernel, the following vulnerability has been resolved:

cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

A chain of commits going back to v7.0 reworked rmdir to satisfy the controller invariant that a subsystem's ->css_offline() must not run while tasks are still doing kernel-side work in the cgroup.

[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") [2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") [3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir") [4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition") [5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")

[1] moved task cset unlink from do_exit() to finish_task_switch() so a task's cset link drops only after the task has fully stopped scheduling. That made tasks past exit_signals() linger on cset->tasks until their final context switch, which led to a series of problems as what userspace expected to see after rmdir diverged from what the kernel needs to wait for. [2]-[5] tried to bridge that divergence: [2] filtered the exiting tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible synchronously.

The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g. host PID 1 systemd reaping orphan pids that were re-parented to it during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those pids to free, the pids can't free because PID 1 is the reaper and it's stuck in rmdir, and the system A-A deadlocks. No internal lock ordering breaks this; the wait itself is the bug.

The css killing side that drove the original reorder, however, can be made cleanly asynchronous: ->css_offline() is already async, run from css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to make that chain start only after all tasks have left the cgroup. rmdir's user-visible side then returns as soon as cgroup.procs and friends are empty, while ->css_offline() still runs only after the cgroup is fully drained.

Verified by the original reproducer (pidns teardown + zombie reaper, runs under vng) which hangs vanilla and succeeds here, and by per-commit deterministic repros for [2], [3], [4], [5] with a boot parameter that widens the post-exit_signals() window so each state is reliably reachable. Some stress tests on top of that.

cgroup_apply_control_disable() has the same shape of pre-existing race: when a controller is disabled via subtree_control, kill_css() ran synchronously while tasks past exit_signals() could still be linked to the cgroup's csets, and ->css_offline() could fire before they drained. This patch preserves the existing synchronous behavior at that call site (kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch will defer kill_css_finish() there using a per-css trigger.

This seems like the right approach and I don't see problems with it. The changes are somewhat invasive but not excessively so, so backporting to -stable should be okay. If something does turn out to be wrong, the fallback is to revert the entire chain ([1]-[5]) and rework in the development branch instead.

v2: Pin cgrp across the deferred destroy work with explicit cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1 wasn't actually broken (ordered cgroup_offline_wq + queue_work order in cgroup_task_dead() saved it) but the explicit ref removes the dependency on those non-obvious invariants. Also note the pre-existing cgroup_apply_control_disable() race in the description; a follow-up will defer kill_css_finish() there.

Severity

5.5 (Medium)


                  
                    CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H

Show details on source website

JSON

To clipboard

{
  "affected": [],
  "aliases": [
    "CVE-2026-46223"
  ],
  "database_specific": {
    "cwe_ids": [
      "CWE-667"
    ],
    "github_reviewed": false,
    "github_reviewed_at": null,
    "nvd_published_at": "2026-05-28T10:16:37Z",
    "severity": "MODERATE"
  },
  "details": "In the Linux kernel, the following vulnerability has been resolved:\n\ncgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated\n\nA chain of commits going back to v7.0 reworked rmdir to satisfy the\ncontroller invariant that a subsystem\u0027s -\u003ecss_offline() must not run while\ntasks are still doing kernel-side work in the cgroup.\n\n[1] d245698d727a (\"cgroup: Defer task cgroup unlink until after the task is done switching out\")\n[2] a72f73c4dd9b (\"cgroup: Don\u0027t expose dead tasks in cgroup\")\n[3] 1b164b876c36 (\"cgroup: Wait for dying tasks to leave on rmdir\")\n[4] 4c56a8ac6869 (\"cgroup: Fix cgroup_drain_dying() testing the wrong condition\")\n[5] 13e786b64bd3 (\"cgroup: Increment nr_dying_subsys_* from rmdir context\")\n\n[1] moved task cset unlink from do_exit() to finish_task_switch() so a\ntask\u0027s cset link drops only after the task has fully stopped scheduling.\nThat made tasks past exit_signals() linger on cset-\u003etasks until their final\ncontext switch, which led to a series of problems as what userspace expected\nto see after rmdir diverged from what the kernel needs to wait for. [2]-[5]\ntried to bridge that divergence: [2] filtered the exiting tasks from\ncgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]\nfixed the wait\u0027s condition; [5] made nr_dying_subsys_* visible\nsynchronously.\n\nThe cgroup_drain_dying() wait in [3] turned out to be a dead end. When the\nrmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.\nhost PID 1 systemd reaping orphan pids that were re-parented to it during\nthe same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those\npids to free, the pids can\u0027t free because PID 1 is the reaper and it\u0027s stuck\nin rmdir, and the system A-A deadlocks. No internal lock ordering breaks\nthis; the wait itself is the bug.\n\nThe css killing side that drove the original reorder, however, can be made\ncleanly asynchronous: -\u003ecss_offline() is already async, run from\ncss_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to\nmake that chain start only after all tasks have left the cgroup. rmdir\u0027s\nuser-visible side then returns as soon as cgroup.procs and friends are\nempty, while -\u003ecss_offline() still runs only after the cgroup is fully\ndrained.\n\nVerified by the original reproducer (pidns teardown + zombie reaper, runs\nunder vng) which hangs vanilla and succeeds here, and by per-commit\ndeterministic repros for [2], [3], [4], [5] with a boot parameter that\nwidens the post-exit_signals() window so each state is reliably reachable.\nSome stress tests on top of that.\n\ncgroup_apply_control_disable() has the same shape of pre-existing race:\nwhen a controller is disabled via subtree_control, kill_css() ran\nsynchronously while tasks past exit_signals() could still be linked to\nthe cgroup\u0027s csets, and -\u003ecss_offline() could fire before they drained.\nThis patch preserves the existing synchronous behavior at that call site\n(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch\nwill defer kill_css_finish() there using a per-css trigger.\n\nThis seems like the right approach and I don\u0027t see problems with it. The\nchanges are somewhat invasive but not excessively so, so backporting to\n-stable should be okay. If something does turn out to be wrong, the fallback\nis to revert the entire chain ([1]-[5]) and rework in the development branch\ninstead.\n\nv2: Pin cgrp across the deferred destroy work with explicit\n    cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1\n    wasn\u0027t actually broken (ordered cgroup_offline_wq + queue_work order\n    in cgroup_task_dead() saved it) but the explicit ref removes the\n    dependency on those non-obvious invariants. Also note the\n    pre-existing cgroup_apply_control_disable() race in the description;\n    a follow-up will defer kill_css_finish() there.",
  "id": "GHSA-r8hv-j9g9-64g6",
  "modified": "2026-06-11T18:31:31Z",
  "published": "2026-05-28T12:30:33Z",
  "references": [
    {
      "type": "ADVISORY",
      "url": "https://nvd.nist.gov/vuln/detail/CVE-2026-46223"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0"
    },
    {
      "type": "WEB",
      "url": "https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b"
    }
  ],
  "schema_version": "1.4.0",
  "severity": [
    {
      "score": "CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:N/I:N/A:H",
      "type": "CVSS_V3"
    }
  ]
}

CVE-2026-46223 (GCVE-0-2026-46223)

Vulnerability from cvelistv5 – Published: 2026-05-28 09:40 – Updated: 2026-06-14 18:03

Title

cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated

Summary

In the Linux kernel, the following vulnerability has been resolved: cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated A chain of commits going back to v7.0 reworked rmdir to satisfy the controller invariant that a subsystem's ->css_offline() must not run while tasks are still doing kernel-side work in the cgroup. [1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") [2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") [3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir") [4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition") [5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context") [1] moved task cset unlink from do_exit() to finish_task_switch() so a task's cset link drops only after the task has fully stopped scheduling. That made tasks past exit_signals() linger on cset->tasks until their final context switch, which led to a series of problems as what userspace expected to see after rmdir diverged from what the kernel needs to wait for. [2]-[5] tried to bridge that divergence: [2] filtered the exiting tasks from cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4] fixed the wait's condition; [5] made nr_dying_subsys_* visible synchronously. The cgroup_drain_dying() wait in [3] turned out to be a dead end. When the rmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g. host PID 1 systemd reaping orphan pids that were re-parented to it during the same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those pids to free, the pids can't free because PID 1 is the reaper and it's stuck in rmdir, and the system A-A deadlocks. No internal lock ordering breaks this; the wait itself is the bug. The css killing side that drove the original reorder, however, can be made cleanly asynchronous: ->css_offline() is already async, run from css_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to make that chain start only after all tasks have left the cgroup. rmdir's user-visible side then returns as soon as cgroup.procs and friends are empty, while ->css_offline() still runs only after the cgroup is fully drained. Verified by the original reproducer (pidns teardown + zombie reaper, runs under vng) which hangs vanilla and succeeds here, and by per-commit deterministic repros for [2], [3], [4], [5] with a boot parameter that widens the post-exit_signals() window so each state is reliably reachable. Some stress tests on top of that. cgroup_apply_control_disable() has the same shape of pre-existing race: when a controller is disabled via subtree_control, kill_css() ran synchronously while tasks past exit_signals() could still be linked to the cgroup's csets, and ->css_offline() could fire before they drained. This patch preserves the existing synchronous behavior at that call site (kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch will defer kill_css_finish() there using a per-css trigger. This seems like the right approach and I don't see problems with it. The changes are somewhat invasive but not excessively so, so backporting to -stable should be okay. If something does turn out to be wrong, the fallback is to revert the entire chain ([1]-[5]) and rework in the development branch instead. v2: Pin cgrp across the deferred destroy work with explicit cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1 wasn't actually broken (ordered cgroup_offline_wq + queue_work order in cgroup_task_dead() saved it) but the explicit ref removes the dependency on those non-obvious invariants. Also note the pre-existing cgroup_apply_control_disable() race in the description; a follow-up will defer kill_css_finish() there.

Severity

No CVSS data available.

Assigner

Linux

References

2 references

URL	Tags
https://git.kernel.org/stable/c/33fa2e6b1507a0a37…
https://git.kernel.org/stable/c/93618edf753838a72…

Impacted products

2 products

Vendor	Product	Version
Linux	Linux	Affected: 1b164b876c36c3eb5561dd9b37702b04401b0166 , < 33fa2e6b1507a0a377a151a8826438bedad1d0b0 (git) Affected: 1b164b876c36c3eb5561dd9b37702b04401b0166 , < 93618edf753838a727dbff63c7c291dee22d656b (git) Affected: 78c72bce4a87819126211c0d24e18350010604fb (git) Affected: 6.19.12 , < 6.20 (semver)
Linux	Linux	Affected: 7.0 Unaffected: 0 , < 7.0 (semver) Unaffected: 7.0.9 , ≤ 7.0.* (semver) Unaffected: 7.1 , ≤ * (original_commit_for_fix)

Show details on NVD website

JSON

To clipboard

{
  "containers": {
    "cna": {
      "affected": [
        {
          "defaultStatus": "unaffected",
          "product": "Linux",
          "programFiles": [
            "include/linux/cgroup-defs.h",
            "kernel/cgroup/cgroup.c"
          ],
          "repo": "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
          "vendor": "Linux",
          "versions": [
            {
              "lessThan": "33fa2e6b1507a0a377a151a8826438bedad1d0b0",
              "status": "affected",
              "version": "1b164b876c36c3eb5561dd9b37702b04401b0166",
              "versionType": "git"
            },
            {
              "lessThan": "93618edf753838a727dbff63c7c291dee22d656b",
              "status": "affected",
              "version": "1b164b876c36c3eb5561dd9b37702b04401b0166",
              "versionType": "git"
            },
            {
              "status": "affected",
              "version": "78c72bce4a87819126211c0d24e18350010604fb",
              "versionType": "git"
            },
            {
              "lessThan": "6.20",
              "status": "affected",
              "version": "6.19.12",
              "versionType": "semver"
            }
          ]
        },
        {
          "defaultStatus": "affected",
          "product": "Linux",
          "programFiles": [
            "include/linux/cgroup-defs.h",
            "kernel/cgroup/cgroup.c"
          ],
          "repo": "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git",
          "vendor": "Linux",
          "versions": [
            {
              "status": "affected",
              "version": "7.0"
            },
            {
              "lessThan": "7.0",
              "status": "unaffected",
              "version": "0",
              "versionType": "semver"
            },
            {
              "lessThanOrEqual": "7.0.*",
              "status": "unaffected",
              "version": "7.0.9",
              "versionType": "semver"
            },
            {
              "lessThanOrEqual": "*",
              "status": "unaffected",
              "version": "7.1",
              "versionType": "original_commit_for_fix"
            }
          ]
        }
      ],
      "cpeApplicability": [
        {
          "nodes": [
            {
              "cpeMatch": [
                {
                  "criteria": "cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*",
                  "versionEndExcluding": "7.0.9",
                  "versionStartIncluding": "7.0",
                  "vulnerable": true
                },
                {
                  "criteria": "cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*",
                  "versionEndExcluding": "7.1",
                  "versionStartIncluding": "7.0",
                  "vulnerable": true
                },
                {
                  "criteria": "cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*",
                  "versionStartIncluding": "6.19.12",
                  "vulnerable": true
                }
              ],
              "negate": false,
              "operator": "OR"
            }
          ]
        }
      ],
      "descriptions": [
        {
          "lang": "en",
          "value": "In the Linux kernel, the following vulnerability has been resolved:\n\ncgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated\n\nA chain of commits going back to v7.0 reworked rmdir to satisfy the\ncontroller invariant that a subsystem\u0027s -\u003ecss_offline() must not run while\ntasks are still doing kernel-side work in the cgroup.\n\n[1] d245698d727a (\"cgroup: Defer task cgroup unlink until after the task is done switching out\")\n[2] a72f73c4dd9b (\"cgroup: Don\u0027t expose dead tasks in cgroup\")\n[3] 1b164b876c36 (\"cgroup: Wait for dying tasks to leave on rmdir\")\n[4] 4c56a8ac6869 (\"cgroup: Fix cgroup_drain_dying() testing the wrong condition\")\n[5] 13e786b64bd3 (\"cgroup: Increment nr_dying_subsys_* from rmdir context\")\n\n[1] moved task cset unlink from do_exit() to finish_task_switch() so a\ntask\u0027s cset link drops only after the task has fully stopped scheduling.\nThat made tasks past exit_signals() linger on cset-\u003etasks until their final\ncontext switch, which led to a series of problems as what userspace expected\nto see after rmdir diverged from what the kernel needs to wait for. [2]-[5]\ntried to bridge that divergence: [2] filtered the exiting tasks from\ncgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]\nfixed the wait\u0027s condition; [5] made nr_dying_subsys_* visible\nsynchronously.\n\nThe cgroup_drain_dying() wait in [3] turned out to be a dead end. When the\nrmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.\nhost PID 1 systemd reaping orphan pids that were re-parented to it during\nthe same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those\npids to free, the pids can\u0027t free because PID 1 is the reaper and it\u0027s stuck\nin rmdir, and the system A-A deadlocks. No internal lock ordering breaks\nthis; the wait itself is the bug.\n\nThe css killing side that drove the original reorder, however, can be made\ncleanly asynchronous: -\u003ecss_offline() is already async, run from\ncss_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to\nmake that chain start only after all tasks have left the cgroup. rmdir\u0027s\nuser-visible side then returns as soon as cgroup.procs and friends are\nempty, while -\u003ecss_offline() still runs only after the cgroup is fully\ndrained.\n\nVerified by the original reproducer (pidns teardown + zombie reaper, runs\nunder vng) which hangs vanilla and succeeds here, and by per-commit\ndeterministic repros for [2], [3], [4], [5] with a boot parameter that\nwidens the post-exit_signals() window so each state is reliably reachable.\nSome stress tests on top of that.\n\ncgroup_apply_control_disable() has the same shape of pre-existing race:\nwhen a controller is disabled via subtree_control, kill_css() ran\nsynchronously while tasks past exit_signals() could still be linked to\nthe cgroup\u0027s csets, and -\u003ecss_offline() could fire before they drained.\nThis patch preserves the existing synchronous behavior at that call site\n(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch\nwill defer kill_css_finish() there using a per-css trigger.\n\nThis seems like the right approach and I don\u0027t see problems with it. The\nchanges are somewhat invasive but not excessively so, so backporting to\n-stable should be okay. If something does turn out to be wrong, the fallback\nis to revert the entire chain ([1]-[5]) and rework in the development branch\ninstead.\n\nv2: Pin cgrp across the deferred destroy work with explicit\n    cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1\n    wasn\u0027t actually broken (ordered cgroup_offline_wq + queue_work order\n    in cgroup_task_dead() saved it) but the explicit ref removes the\n    dependency on those non-obvious invariants. Also note the\n    pre-existing cgroup_apply_control_disable() race in the description;\n    a follow-up will defer kill_css_finish() there."
        }
      ],
      "providerMetadata": {
        "dateUpdated": "2026-06-14T18:03:52.610Z",
        "orgId": "416baaa9-dc9f-4396-8d5f-8c081fb06d67",
        "shortName": "Linux"
      },
      "references": [
        {
          "url": "https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0"
        },
        {
          "url": "https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b"
        }
      ],
      "title": "cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated",
      "x_generator": {
        "engine": "bippy-1.2.0"
      }
    }
  },
  "cveMetadata": {
    "assignerOrgId": "416baaa9-dc9f-4396-8d5f-8c081fb06d67",
    "assignerShortName": "Linux",
    "cveId": "CVE-2026-46223",
    "datePublished": "2026-05-28T09:40:40.791Z",
    "dateReserved": "2026-05-13T15:03:33.106Z",
    "dateUpdated": "2026-06-14T18:03:52.610Z",
    "state": "PUBLISHED"
  },
  "dataType": "CVE_RECORD",
  "dataVersion": "5.2"
}

Sightings

Author	Source	Type	Date	Other

Nomenclature

Seen: The vulnerability was mentioned, discussed, or observed by the user.
Confirmed: The vulnerability has been validated from an analyst's perspective.
Published Proof of Concept: A public proof of concept is available for this vulnerability.
Exploited: The vulnerability was observed as exploited by the user who reported the sighting.
Patched: The vulnerability was observed as successfully patched by the user who reported the sighting.
Not exploited: The vulnerability was not observed as exploited by the user who reported the sighting.
Not confirmed: The user expressed doubt about the validity of the vulnerability.
Not patched: The vulnerability was not observed as successfully patched by the user who reported the sighting.

Detection rules are retrieved from Rulezet.

Action not permitted

GHSA-R8HV-J9G9-64G6

CVE-2026-46223 (GCVE-0-2026-46223)

Tags

Sightings

Nomenclature