Back to index

4.19.0-0.okd-scos-2024-12-24-102529

Jump to: Complete Features | Incomplete Features | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.18.0-okd-scos.ec.0

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Goal:
Track Insights Operator Data Enhancements epic in 2024

 

 

 

 

Description of problem:

    Context
OpenShift Logging is migrating from Elasticsearch to Loki. While the option to use Loki has existed for quite a while, information about the end of Elasticsearch support has not been available until recently. With the information available now, we can expect more and more customers to migrate and hit the issue described in INSIGHTOCP-1927.
P.S. Note the bar chart in INSIGHTOCP-1927, which shows how frequently the related KCS is linked in customer cases.
Data to gather
LokiStack custom resources (any name, any namespace)
Backports
The option to use Loki has been available since Logging 5.5, whose compatibility started at OCP 4.9. Considering the OCP life cycle, backports up to OCP 4.14 would be nice.
Unknowns
Since Logging 5.7, Logging supports installation of multiple instances in customer namespaces. The Insights Operator would have to look for the CRs in all namespaces, which poses the following questions:

What is the expected number of the LokiStack CRs in a cluster?
Should the Insights operator look for the resource in all namespaces? Is there a way to narrow down the scope?

The CR will contain the name of a customer namespace, which is sensitive information.
What is the API group of the CR? Is there a risk of LokiStack CRs in customer namespaces that would NOT be related to OpenShift Logging?
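
For illustration only, a minimal sketch of how a gatherer could list LokiStack CRs across all namespaces using the Kubernetes dynamic client; the loki.grafana.com/v1 coordinates and the anonymization note are assumptions rather than the Insights Operator's actual implementation:

    package gather

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/rest"
    )

    // listLokiStacks lists LokiStack custom resources in every namespace.
    // The group/version/resource is assumed from the Loki Operator API
    // (loki.grafana.com/v1, lokistacks) and should be verified.
    func listLokiStacks(ctx context.Context, cfg *rest.Config) error {
        client, err := dynamic.NewForConfig(cfg)
        if err != nil {
            return err
        }
        gvr := schema.GroupVersionResource{
            Group:    "loki.grafana.com",
            Version:  "v1",
            Resource: "lokistacks",
        }
        // An empty namespace (metav1.NamespaceAll) lists the CR cluster-wide.
        list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }
        for _, item := range list.Items {
            // Namespace names can identify customers, so a real gatherer
            // would need to anonymize them before upload.
            fmt.Printf("found LokiStack %s in namespace %s\n", item.GetName(), item.GetNamespace())
        }
        return nil
    }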



SME
Oscar Arribas Arribas

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

N/A    

Actual results:

    

Expected results:

    

Additional info:

    

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.

Goal

Primary user-defined networks can be managed from the UI and the user flow is seamless.

User Stories

  • As a cluster admin,
    I want to use the UI to define a ClusterUserDefinedNetwork, assigned with a namespace selector.
  • As a project admin,
    I want to use the UI to define a UserDefinedNetwork in my namespace.
  • As a project admin,
    I want to be prompted to create a UserDefinedNetwork before I create any Pods/VMs in my new project.
  • As a project admin running VMs in a namespace with UDN defined,
    I expect the "pod network" to be called "user-defined primary network",
    and I expect that when using it, the proper network binding is used.
  • As a project admin,
    I want to use the UI to request a specific IP for my VM connected to UDN.

UX doc

https://docs.google.com/document/d/1WqkTPvpWMNEGlUIETiqPIt6ZEXnfWKRElBsmAs9OVE0/edit?tab=t.0#heading=h.yn2cvj2pci1l

Non-Requirements

  • <List of things not included in this epic, to alleviate any doubt raised during the grooming process.>

Notes

Placeholder feature for ccx-ocp-core maintenance tasks.

This epic tracks "business as usual" requirements, enhancements, and bug fixes for the Insights Operator.

It looks like the insights-operator doesn't work with IPv6; there are log errors like this:

E1209 12:20:27.648684   37952 run.go:72] "command failed" err="failed to run groups: failed to listen on secure address: listen tcp: address fd01:0:0:5::6:8000: too many colons in address" 

It's showing up in metal techpreview jobs.

The listen address isn't being constructed correctly; use net.JoinHostPort instead of fmt.Sprintf. Some more details here: https://github.com/stbenjam/no-sprintf-host-port. There's a non-default linter in golangci-lint for this.
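
A minimal example of the difference: fmt.Sprintf concatenates the raw IPv6 literal, while net.JoinHostPort brackets it so net.Listen can parse it.

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        host := "fd01:0:0:5::6" // IPv6 address from the error above
        port := "8000"

        // Broken: yields "fd01:0:0:5::6:8000", which net.Listen rejects
        // with "too many colons in address".
        fmt.Println(fmt.Sprintf("%s:%s", host, port))

        // Correct: brackets the IPv6 literal -> "[fd01:0:0:5::6]:8000".
        fmt.Println(net.JoinHostPort(host, port))
    }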

 

Component Readiness has found a potential regression in the following test:

[sig-architecture] platform pods in ns/openshift-insights should not exit an excessive amount of times

Test has a 56.36% pass rate, but 95.00% is required.

Sample (being evaluated) Release: 4.18
Start Time: 2024-12-02T00:00:00Z
End Time: 2024-12-09T16:00:00Z
Success Rate: 56.36%
Successes: 31
Failures: 24
Flakes: 0

View the test details report for additional context.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Background

The admin console's alert rule details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

Outcomes

That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.

 

Ensure removal of deprecated patternfly components from kebab-dropdown.tsx and alerting.tsx once this story and OU-257 are completed.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

“In order to allow internal teams to define their dashboards in Perses, we as the Observability UI Team need to add support on the console to display Perses dashboards.”

Goals & Outcomes

Product Requirements:

  • The console dashboards plugin is able to render dashboards coming from Perses

 

Background

In order to allow customers and internal teams to see dashboards created using Perses, we must add them as new elements on the current dashboard list

Outcomes

  • When navigating to Monitoring / Dashboards, Perses dashboards are listed alongside the current console dashboards. The extension point is backported to 4.14

Steps

  • COO (monitoring-console-plugin)
    • Add the Perses dashboards feature called "perses-dashboards" in the monitoring plugin.
    • Create a function to fetch dashboards from the Perses API
  • CMO (monitoring-plugin)
    • An extension point is added to inject the function that fetches dashboards from the Perses API and merges the results with the current console dashboards (a sketch of such a fetch function follows)
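
The console plugins themselves are written in TypeScript; purely to illustrate the fetch step named above, here is a Go sketch against a hypothetical /api/v1/dashboards endpoint (the real Perses API paths and payload schema may differ):

    package dashboards

    import (
        "context"
        "encoding/json"
        "fmt"
        "net/http"
    )

    // persesDashboard captures only the fields a dashboard list would need;
    // the real Perses schema is richer.
    type persesDashboard struct {
        Metadata struct {
            Name    string `json:"name"`
            Project string `json:"project"`
        } `json:"metadata"`
    }

    // fetchPersesDashboards retrieves dashboards from a Perses endpoint so
    // they can be merged with the existing console dashboards.
    func fetchPersesDashboards(ctx context.Context, baseURL string) ([]persesDashboard, error) {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/api/v1/dashboards", nil)
        if err != nil {
            return nil, err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return nil, fmt.Errorf("perses API returned %s", resp.Status)
        }
        var list []persesDashboard
        if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
            return nil, err
        }
        return list, nil
    }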
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but it doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs. quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

Feature Overview

As a cluster-admin, I want to run updates in discrete steps and update the control plane and worker nodes independently.
I also want to back up and restore in case of a problematic upgrade.

 

Background:

This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of completed tasks.

  1. OTA-700 Reduce False Positives (such as Degraded) 
  2. OTA-922 - Better able to show the progress made in each discrete step 
  3. [Covered by status command] Better visibility into any errors during the upgrades, and documentation of what the error means and how to recover. 

Goals

  1. Have an option to do upgrades in more discrete steps under admin control. Specifically, these steps are: 
    • Control plane upgrade
    • Worker nodes upgrade
    • Workload enabling upgrade (i.e. Router, other components) or infra nodes
  2. A user experience around an end-to-end back-up and restore after a failed upgrade 
  3. MCO-530 - Support in Telemetry for the discrete steps of upgrades 

References

Epic Goal

  • Eliminate the gap between measured availability and Available=true

Why is this important?

  • Today it's not uncommon, even for CI jobs, to have multiple operators which blip through either Degraded=True or Available=False conditions
  • We should assume that if our CI jobs do this then when operating in customer environments with higher levels of chaos things will be even worse
  • We have had multiple customers express that they've pursued rolling back upgrades because the cluster is telling them that portions of the cluster are Degraded or Unavailable when they're actually not
  • Since our product is self-hosted, we can reasonably expect that the instability that we experience on our platform workloads (kube-apiserver, console, authentication, service availability) will also impact customer workloads that run exactly the same way: we're just better at detecting it.

Scenarios

  1. In all of the following, assume standard 3 master 0 worker or 3 master 2+ worker topologies
  2. Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of upgrade
  3. Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (ex: metal's DHCP server is singleton)
  4. Address all identified issues
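
As a sketch of what such a check could look like (using the openshift/client-go config clientset; not the actual CI implementation), the snippet below lists every ClusterOperator that is currently Available=False or Degraded=True:

    package availability

    import (
        "context"
        "fmt"

        configv1 "github.com/openshift/api/config/v1"
        configclient "github.com/openshift/client-go/config/clientset/versioned"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/rest"
    )

    // violatingOperators returns every operator currently reporting
    // Available!=True or Degraded=True, i.e. the conditions the jobs above
    // would flag during an upgrade.
    func violatingOperators(ctx context.Context, cfg *rest.Config) ([]string, error) {
        client, err := configclient.NewForConfig(cfg)
        if err != nil {
            return nil, err
        }
        operators, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
        if err != nil {
            return nil, err
        }
        var violations []string
        for _, co := range operators.Items {
            for _, cond := range co.Status.Conditions {
                unavailable := cond.Type == configv1.OperatorAvailable && cond.Status != configv1.ConditionTrue
                degraded := cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue
                if unavailable || degraded {
                    violations = append(violations, fmt.Sprintf("%s: %s=%s (%s)", co.Name, cond.Type, cond.Status, cond.Reason))
                }
            }
        }
        return violations, nil
    }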

Acceptance Criteria

  • openshift/enhancements CONVENTIONS outlines these requirements
  • CI - Release blocking jobs include these new/updated tests
  • Release Technical Enablement - N/A; if we do this we should need no docs
  • No outstanding identified issues

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure all teams addressed them. That list is in this query; teams will be asked to address everything on it as a 4.9 blocker+ bug, and we will re-evaluate status closer to 4.9 code freeze to see which items may be deferred to 4.10
    https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Tests in place
  • DEV - No outstanding failing tests
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:

Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".

Definition of done:

  • Same as OTA-362
  • File bugs or link the existing issues
  • If a bug exists, add the tests to the exception list.
  • Unless tests are in the exception list, they should fail if we see Degraded != False.

Feature Overview

This feature aims to enable customers of OCP to integrate 3rd party KMS solutions for encrypting etcd values at rest in accordance with:

https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/

Goals

  • Bring KMS v2 API to beta|stable level
  • Create/expose mechanisms for customers to plug in containers/operators which can serve the API server's needs (can it be an operator, something provided via CoreOS layering, vanilla container spec provided to API server operator?)
  • Provide similar UX experience for all of self-hosted, hypershift, SNO scenarios
  • Provide example container/operator for the mechanism
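
To make the integration surface concrete, here is a minimal sketch of the shape of the KMS v2 plugin contract (Encrypt, Decrypt, Status). The types are simplified stand-ins for illustration; the real interface and generated gRPC types live in k8s.io/kms and may differ in detail:

    package kmsplugin

    import "context"

    // Simplified stand-ins for the KMS v2 request/response messages.
    type EncryptRequest struct {
        Plaintext []byte
        UID       string
    }

    type EncryptResponse struct {
        Ciphertext  []byte
        KeyID       string
        Annotations map[string][]byte
    }

    type DecryptRequest struct {
        Ciphertext  []byte
        UID         string
        KeyID       string
        Annotations map[string][]byte
    }

    type StatusResponse struct {
        Version string // expected to report "v2"
        Healthz string // "ok" when the external KMS is reachable
        KeyID   string // current key; lets the API server detect rotation
    }

    // Service is the contract a KMS v2 plugin fulfils for the kube-apiserver:
    // envelope-encrypt and decrypt DEKs, and report plugin/KMS health.
    type Service interface {
        Encrypt(ctx context.Context, req *EncryptRequest) (*EncryptResponse, error)
        Decrypt(ctx context.Context, req *DecryptRequest) ([]byte, error)
        Status(ctx context.Context) (*StatusResponse, error)
    }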

General Prioritization for the Feature

  1. Approved design for detection & actuation for stand-alone OCP clusters.
    1. How to detect a problem like an expired/lost key and no contact with the KMS provider?
    2. How to inform/notify about the situation, even at the node level
  2. Tech Preview (Feature gated) enabling Kube-KMS v2 for partners to start working on KMS plugin provider integrations:
    1. Cloud: (priority Azure > AWS > Google)
      1. Azure KMS
      2. Azure Dedicated HSM
      3. AWS KMS
      4. AWS CloudHSM
      5. Google Cloud HSM
    2. On-premise:
      1. HashiCorp Vault
      2. EU FSI & EU Telco KMS/HSM top-2 providers
  3. GA after at least one stable KMS plugin provider

Scenario:

For an OCP cluster with external KMS enabled:

  • The customer loses the key to the external KMS 
  • The external KMS service is degraded or unavailable

How do the above scenario(s) impact the cluster? The API may be unavailable

 

Goal:

  • Detection: The ability to detect these failure condition(s) and make it visible to the cluster admin.
  • Actuation: To what extent can we restore the cluster? (API availability, control plane operators). Recovering customer data is out of scope

 

Investigation Steps:

Detection:

  • How do we detect issues with the external KMS?
  • How do we detect issues with the KMS plugins?
  • How do we surface the information that an issue happened with KMS?
    • Metrics / Alerts? Will not work with SNO
    • ClusterOperatorStatus?

Actuation:

  • Is the control-plane self-recovering?
  • What actions are required for the user to recover the cluster partially/completely?
  • Complete: kube-apiserver? KMS plugin?
  • Partial: kube-apiserver? Etcd? KMS plugin?

User stories that might result in KCS:

  • KMS / KMS plugin unavailable
    • Is there any degradation? (most likely not with kms v2)
  • KMS unavailable and DEK not in cache anymore
    • Degradation will most likely occur, but what happens when the KMS becomes available again? Is the cluster self-recovering?
  • Key has been deleted and later recovered
    • Is the cluster self-recovering?
  • KMS / KMS plugin misconfigured
    • Is the apiserver rolled-back to the previous healthy revision?
    • Is the misconfiguration properly surfaced?

Plugins research:

  • What are the pros and cons of managing the plugins ourselves vs leaving that responsibility to the customer?
  • What is the list of KMS we need to support?
  • Do all the KMS plugins we need to use support KMS v2? If not reach out to the provider
  • HSMs?

POCs:

Acceptance Criteria:

  • Document the detection and actuation process in a KEP.
  • Generate new Jira work items based on the new findings.

Feature Overview (aka. Goal Summary)

Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

This is also a key requirement for backup and DR solutions.

https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot
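
To illustrate the user-facing API, a hedged sketch that creates a VolumeGroupSnapshot selecting PVCs by label via the dynamic client; the groupsnapshot.storage.k8s.io/v1beta1 coordinates and field names follow the upstream KEP and should be verified against the shipped release, and the class and label names are hypothetical:

    package groupsnapshot

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
    )

    // createGroupSnapshot snapshots all PVCs matching a label selector as a
    // single, consistent group.
    func createGroupSnapshot(ctx context.Context, client dynamic.Interface, namespace string) error {
        gvr := schema.GroupVersionResource{
            Group:    "groupsnapshot.storage.k8s.io",
            Version:  "v1beta1",
            Resource: "volumegroupsnapshots",
        }
        vgs := &unstructured.Unstructured{Object: map[string]interface{}{
            "apiVersion": "groupsnapshot.storage.k8s.io/v1beta1",
            "kind":       "VolumeGroupSnapshot",
            "metadata":   map[string]interface{}{"name": "app-group-snap"},
            "spec": map[string]interface{}{
                "volumeGroupSnapshotClassName": "example-group-snap-class", // hypothetical class
                "source": map[string]interface{}{
                    "selector": map[string]interface{}{
                        "matchLabels": map[string]interface{}{"app": "my-database"}, // hypothetical label
                    },
                },
            },
        }}
        _, err := client.Resource(gvr).Namespace(namespace).Create(ctx, vgs, metav1.CreateOptions{})
        return err
    }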

Goals (aka. expected user outcomes)

Productise the volume group snapshots feature as Tech Preview: provide docs and testing, as well as a feature gate to enable it, so that customers and partners can test it in advance.

Requirements (aka. Acceptance Criteria):

The feature should be graduated to beta upstream to become TP in OCP. Tests and CI must pass, and a feature gate should allow customers and partners to easily enable it. We should identify all OCP-shipped CSI drivers that support this feature and configure them accordingly.

Use Cases (Optional):

 

  1. As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
  2. As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.
  3. As a customer I want early access to test the VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs.

Out of Scope

CSI drivers development/support of this feature.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.

Documentation Considerations

Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.

Interoperability Considerations

Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Add Volume Group Snapshots as Tech Preview. This is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

We will rely on the newly beta promoted feature. This feature is driver dependent.

This will need a new external-snapshotter rebase plus removal of the feature gate check in csi-snapshot-controller-operator. Clusters that are freshly installed or upgraded from an older release will have the group snapshot v1beta1 API enabled, support for it enabled in the snapshot-controller, and the corresponding external-snapshotter sidecar shipped.

No opt-in, no opt-out.

OCP itself will not ship any CSI driver that supports it.

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

This is also a key requirement for backup and DR solutions, especially for OCP Virt.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
  2. As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.
  3. As a customer I want early access to test the VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

External snapshotter rebase to the upstream version that includes the beta API.

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR / ODF
  • Documentation - STOR
  • QE - STOR / ODF
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Since we don't ship any driver with OCP that supports the feature, we need to have testing with ODF

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

We're looking at enabling it by default, which could introduce risk. Since the feature has only recently landed upstream, we will need to rebase on a newer external snapshotter than we initially targeted. 

When moving to v1 there may be non-backward-compatible changes.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

Crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.

Benefits of Crun are covered here: https://github.com/containers/crun 

 

FAQ.:  https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit

***Note -> making Crun the default does not mean we will remove support for runc, nor do we have any plans in the foreseeable future to do so  

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment, or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed. If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

 

As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
 
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.

 
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.

 

To test:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.

As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.

As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up 
(MCO-770, MCO-578, MCO-574 )

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.

Maybe:

Entitlements: MCO-1097, MCO-1099

Not Likely:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.

With OCL GA'ing soon, we'll need a blocking path within our e2e test suite that must pass before a PR can be merged. Since e2e-gcp-op-techpreview is a non-blocking job, we should do both of the following:

  1. Migrate the tests from e2e-gcp-op-techpreview into e2e-gcp-op. This can be done by moving the tests in the MCO repo from the test/e2e-techpreview folder to the test/e2e folder. There might be some minor cleanups such as fixing duplicate function names, etc. but it should be fairly straightforward to do.
  2. Make e2e-gcp-op-techpreview a blocking job. A PR to the openshift/release repo to set optional: false for that job in both the 4.18 and 4.19 configs will be needed. This should be a pretty straightforward config change.

Feature Overview

As a cluster admin for standalone OpenShift, I want to customize the prefix of the machine names created by CPMS due to company policies related to nomenclature. Implement the Control Plane Machine Set (CPMS) feature in OpenShift to support machine names where user can set custom names prefixes. Note the prefix will always be suffixed by "<5-chars>-<index>" as this is part of the CPMS internal design.

Acceptance Criteria

A new field called machineNamePrefix has been added to the CPMS CR.
This field would allow the customer to specify a custom prefix for the machine names.
The machine names would then be generated using the format: <machineNamePrefix>-<5-chars>-<index> (see the sketch after this block)
Where:
<machineNamePrefix> is the custom prefix provided by the customer
<5-chars> is a random 5 character string (this is required and cannot be changed)
<index> represents the index of the machine (0, 1, 2, etc.)
Ensure that if the machineNamePrefix is changed, the operator reconciles and succeeds in rolling out the changes.
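
A small sketch of the naming rule described above (not the CPMS implementation itself); utilrand is the apimachinery random-string helper, and the example output is illustrative:

    package cpms

    import (
        "fmt"

        utilrand "k8s.io/apimachinery/pkg/util/rand"
    )

    // machineName appends a random five-character string and the machine
    // index to the user-supplied prefix, matching the documented format.
    func machineName(prefix string, index int) string {
        return fmt.Sprintf("%s-%s-%d", prefix, utilrand.String(5), index)
    }

    // Example: machineName("prod-ctrl", 0) might return "prod-ctrl-x7k2m-0".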

Epic Goal

  • Provide a new field to the CPMS that allows defining a Machine name prefix
  • This prefix will supersede the current usage of the control plane label and role combination we use today
  • The names must still continue to be suffixed with <chars>-<idx> as this is important to the operation of CPMS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Downstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Edge customers requiring computing on-site to serve business applications (e.g., point of sale, security & control applications, AI inference) are asking for a 2-node HA solution for their environments. They want only two nodes at the edge because a third node adds too much cost, but they still need HA for critical workloads. To address this need, a 2+1 topology is introduced. It supports a small, cheap arbiter node that can optionally be remote/virtual to reduce on-site HW cost. 

Goals (aka. expected user outcomes)

Support OpenShift on a 2+1 topology, meaning two primary nodes with large capacity to run workloads and the control plane, and a third, small “arbiter” node which ensures quorum. See requirements for more details.

Requirements (aka. Acceptance Criteria):

  1. Co-located arbiter node - third node in the same network/location with low-latency network access, but the arbiter node is much smaller than the two main nodes. Target resource requirements for the arbiter node: 4 cores / 8 vCPUs, 16 GB RAM, 120 GB disk (non-spinning), 1x1 GbE network port, no BMC
  2. OCP Virt fully functional, incl. live migration of VMs (assuming an RWX CSI driver is available)
  3. Single Node outage is handled seamlessly
  4. In case the arbiter node is down, a reboot/restart of the two remaining nodes has to work, i.e. the two remaining nodes regain quorum and spin up the workload. 
  5. Scaling out the cluster by adding additional worker nodes should be possible
  6. Transitioning the cluster into a regular 3-node compact cluster, e.g. by adding a new node as a control plane node and then removing the arbiter node, should be possible
  7. Regular workloads should not be scheduled to the arbiter node (e.g. by making it unschedulable, or by introducing a new node role “arbiter”). Only essential control plane workload (etcd components) should run on the arbiter node. Non-essential control plane workload (i.e. router, registry, console, monitoring, etc.) should also not be scheduled to the arbiter node.
  8. It must be possible to explicitly schedule additional workload to the arbiter node. That is important for third-party solutions (e.g. storage providers) which also have quorum-based mechanisms.
  9. Must seamlessly integrate into existing installation/update mechanisms, esp. zero-touch provisioning etc.
  10. Added: ability to track OLA usage in the fleet of connected clusters via OCP telemetry data

 

 

Deployment considerations (N/A = not applicable)
Self-managed, managed, or both: self-managed
Classic (standalone cluster): yes
Hosted control planes: no
Multi node, Compact (three node), or Single node (SNO), or all: Multi node and Compact (three node)
Connected / Restricted Network: both
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): x86_64 and ARM
Operator compatibility: full
Backport needed (list applicable versions): no
UI need (e.g. OpenShift Console, dynamic plugin, OCM): no
Other (please specify): n/a

 

Questions to Answer (Optional):

  1. How to implement the scheduling restrictions to the arbiter node? New node role “arbiter”?
  2. Can this be delivered in one release, or do we need to split, e.g. TechPreview + GA?

Out of Scope

  1. Storage driver providing RWX shared storage

 

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

  • Two node support is in high demand by telco, industrial and retail customers.
  • VMWare supports a two node VSan solution: https://core.vmware.com/resource/vsan-2-node-cluster-guide
  • Example edge hardware frequently used for edge deployments with a co-located small arbiter node: Dell PowerEdge XR4000z Server is an edge computing device that allows restaurants, retailers, and other small to medium businesses to set up local computing for data-intensive workloads. 

 

Customer Considerations

See requirements - there are two main groups of customers: co-located arbiter node, and remote arbiter node.

 

Documentation Considerations

  1. Topology needs to be documented, esp. the requirements of the arbiter node.

 

Interoperability Considerations

  1. OCP Virt needs to be explicitly tested on this scenario to support VM HA (live migration, restart on other node)

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Once the HighlyAvailableArbiter has been added to the ocp/api, we need to update the cluster-config-operator dependencies to reference the new change, so that it propagates to cluster installs in our payloads.

Update the dependencies for CEO for library-go and ocp/api to support the Arbiter additions, doing this in a separate PR to keep things clean and easier to test.

We need to update CEO (cluster etcd operator) to understand what an arbiter/witness node is so it can properly assign an etcd member on our less powerful node.
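
As a hedged sketch only: one way the operator could distinguish the arbiter node is by a node-role label; the exact label name used by the product is an assumption here.

    package ceo

    import corev1 "k8s.io/api/core/v1"

    // arbiterLabel is the node-role label assumed here to mark the arbiter
    // node; the label actually used by the product may differ.
    const arbiterLabel = "node-role.kubernetes.io/arbiter"

    // isArbiterNode lets the operator treat the smaller arbiter node
    // differently, e.g. when deciding where its etcd member runs.
    func isArbiterNode(node *corev1.Node) bool {
        _, ok := node.Labels[arbiterLabel]
        return ok
    }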

Feature Overview

Ability to install OpenShift on Nutanix with nodes having multiple NICs (multiple subnets) from IPI and for autoscaling with MachineSets.

 

Feature Overview

Ability to install OpenShift on Nutanix with nodes having multiple NICs (multiple subnets) from IPI and for autoscaling with MachineSets.

Feature Overview

Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.

When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.

There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts but the administrators want to divide the ESXi hosts in host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across to physical sites, and grouped by site in vSphere host groups.

In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.

Requirements

  • Users can define OpenShift zones mapping them to host groups at installation time (day 1)
  • Users can use host groups as OpenShift zones post-installation (day 2)

Epic Goal

Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.

When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.

There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites, and grouped by site in vSphere host groups.

In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.

Requirements

  • Users can define OpenShift zones mapping them to host groups at installation time (day 1)
  • Users can use host groups as OpenShift zones post-installation (day 2)

As an OpenShift engineer, I want to enable host VM group zonal support in MAO (machine-api-operator) so that compute nodes are properly deployed

Acceptance Criteria:

  • Modify the workspace to include the vmgroup
  • Properly configure the vSphere cluster to add the VM into the vmgroup

Description of problem:

When we set multiple networks on LRP:

port rtoe-GR_227br_tenant.red_ovn-control-plane
        mac: "02:42:ac:12:00:07"
        ipv6-lla: "fe80::42:acff:fe12:7"
        networks: ["169.254.0.15/17", "172.18.0.7/16", "fc00:f853:ccd:e793::7/64", "fd69::f/112"]

and also use lb_force_snat_ip=routerip, it picks the lexicographically first item from the set of networks - there is no deliberate ordering for this

This breaks Services implementation on L2 UDNs
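
A small standalone reproduction of the ordering problem: sorting the networks from the LRP above and taking the first entry returns whichever string sorts lowest, not a deliberately chosen SNAT IP.

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        // Networks from the LRP above; OVN stores them as a set, so any
        // consumer that sorts and takes the first entry gets an arbitrary
        // address rather than the intended one.
        networks := []string{
            "169.254.0.15/17",
            "172.18.0.7/16",
            "fc00:f853:ccd:e793::7/64",
            "fd69::f/112",
        }
        sort.Strings(networks)
        fmt.Println("picked:", networks[0]) // picked: 169.254.0.15/17
    }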

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.

2.

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Based on user analytics, many customers switch back and forth between perspectives, averaging 15 times per session. 
  • The following steps will be needed:
    • Surface all Dev specific Nav items in the Admin Console
    • Disable the Dev perspective by default but allow admins to enable via console setting

Why is this important?

  • We need to alleviate this pain point and improve the overall user experience for our users.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

K8s 1.31 introduces VolumeAttributesClass as beta (code in external provisioner). We should make it available to customers as tech preview.

VolumeAttributesClass allows PVCs to be modified after their creation and while attached. There is a vast number of parameters that can be updated, but the most popular use is changing the QoS values. The parameters that can be changed depend on the driver used.
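
As a hedged sketch of the workflow (assuming the Kubernetes 1.31 client-go beta API; the class name and parameter keys are driver-specific and illustrative): create a VolumeAttributesClass and re-point an existing PVC at it.

    package vac

    import (
        "context"
        "encoding/json"
        "fmt"

        storagev1beta1 "k8s.io/api/storage/v1beta1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    // updatePVCQoS creates a VolumeAttributesClass and points an existing PVC
    // at it so the driver can apply the new attributes to the attached volume.
    func updatePVCQoS(ctx context.Context, cs kubernetes.Interface, namespace, pvcName string) error {
        class := &storagev1beta1.VolumeAttributesClass{
            ObjectMeta: metav1.ObjectMeta{Name: "fast-iops"},
            DriverName: "ebs.csi.aws.com",
            Parameters: map[string]string{"iops": "6000", "throughput": "300"}, // driver-specific keys
        }
        if _, err := cs.StorageV1beta1().VolumeAttributesClasses().Create(ctx, class, metav1.CreateOptions{}); err != nil {
            return err
        }
        // Re-point the PVC at the new class; the external-resizer then asks
        // the driver to modify the volume in place.
        patch, err := json.Marshal(map[string]interface{}{
            "spec": map[string]interface{}{"volumeAttributesClassName": "fast-iops"},
        })
        if err != nil {
            return err
        }
        if _, err := cs.CoreV1().PersistentVolumeClaims(namespace).Patch(ctx, pvcName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
            return fmt.Errorf("patching PVC %s: %w", pvcName, err)
        }
        return nil
    }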

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Productise VolumeAttributesClass as TP in anticipation of GA. Customers can start testing VolumeAttributesClass.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Disabled by default
  • put it under TechPreviewNoUpgrade
  • make sure the VolumeAttributesClass object is available in beta APIs
  • enable the feature in external-provisioner and external-resizer at least in AWS EBS CSI driver, check the other drivers.
    • Add RBAC rules for these objects
  • make sure we run its tests in one of TechPreviewNoUpgrade CI jobs (with hostpath CSI driver)
  • reuse / add a job with AWS EBS CSI driver + tech preview.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (N/A = not applicable)
Self-managed, managed, or both: both
Classic (standalone cluster): yes
Hosted control planes: yes
Multi node, Compact (three node), or Single node (SNO), or all: all
Connected / Restricted Network: both
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): all
Operator compatibility: N/A (core storage)
Backport needed (list applicable versions): none
UI need (e.g. OpenShift Console, dynamic plugin, OCM): TBD for TP
Other (please specify): n/a

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP user, I want to change parameters of my existing PVC such as the QoS attributes.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

UI for TP

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

There have been some limitations and complaints about the fact that PVC attributes are sealed after their creation, preventing customers from updating them. This is particularly impactful when a specific QoS is set and the volume requirements change.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Customers should not use it in production at the moment.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Document VolumeAttributesClass creation and how to update a PVC. Mention any limitations. Mention it's Tech Preview (no upgrade). Add driver support information if needed.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Check which drivers support it for which parameters.

Epic Goal

Support upstream feature "VolumeAttributesClass" in OCP as Beta, i.e. test it and have docs for it.

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:

  •  Have a hotfix process that is customer/support-exception targeted rather than fleet targeted
  • Can take weeks to be available for Managed OpenShift

This feature seeks to provide mechanisms that put an upper time boundary on delivering such fixes, matching the current HyperShift Operator <24h expectation.

Goals (aka. expected user outcomes)

  • Hosted Control Plane fixes are delivered through Konflux builds
  • No additional upgrade edges
  • Release specific
  • Adequate, fleet representative, automated testing coverage
  • Reduced human interaction

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Overriding Hosted Control Plane components can be done automatically once the PRs are ready and the affected versions have been properly identified
  • Managed OpenShift Hosted Clusters have their Control Planes fix applied without requiring customer intervention and without workload disruption beyond what might already be incurred because of the incident it is solving
  • Fix can be promoted through integration, stage and production canary with a good degree of observability

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both managed (ROSA and ARO)
Classic (standalone cluster) No
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all All supported ROSA/HCP topologies
Connected / Restricted Network All supported ROSA/HCP topologies
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All supported ROSA/HCP topologies
Operator compatibility CPO and Operators depending on it
Backport needed (list applicable versions) TBD
UI need (e.g. OpenShift Console, dynamic plugin, OCM) No
Other (please specify) No

Use Cases (Optional):

  • Incident response when the engineering solution is partially or completely on the Hosted Control Plane side rather than in the HyperShift Operator

Out of Scope

  • HyperShift Operator binary bundling

Background

Discussed previously during incident calls. Design discussion document

Customer Considerations

  • Because the Managed Control Plane version is overridden rather than changed, customer visibility and impact should be limited as much as possible.

Documentation Considerations

SOP needs to be defined for:

  • Requesting and approving the fleet wide fixes described above
  • Building and delivering them
  • Identifying clusters with deployed fleet wide fixes

Goal

  • Have a Konflux build for every supported branch on every pull request / merge that modifies the Control Plane Operator

Why is this important?

  • In order to build the Control Plane Operator images to be used for management-cluster-wide overrides.
  • To be able to deliver managed Hosted Control Plane fixes to managed OpenShift with a similar SLO as the fixes for the HyperShift Operator.

Scenarios

  1. A PR that modifies the control plane in a supported branch is posted for a fix affecting managed OpenShift

Acceptance Criteria

  • Dev - Konflux application and component per supported release
  • Dev - SOPs for managing/troubleshooting the Konflux Application
  • Dev - Release Plan that delivers to the appropriate AppSre production registry
  • QE - HyperShift Operator versions that encode an override must be tested with the CPO Konflux builds that they make

Dependencies (internal and external)

  1. Konflux

Previous Work (Optional):

  1. HOSTEDCP-2027

Open questions:

  1. Antoni Segura Puimedon  How long or how many times should the CPO override be tested?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Konflux App link: <link to Konflux App for CPO>
  • DEV - SOP: <link to meaningful PR or GitHub Issue>
  • QE - Test plan in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments. 

Goals

The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but has also been useful for a specific set of other customer use cases outside of that context.  As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most-demanding customer deployments.  

Key enhancements include observability, and blocking traffic across paths where IPsec encryption is not functioning properly.

Requirements

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Questions to answer…

  •  

Out of Scope

  • Configuration of external-to-cluster IPsec endpoints for N-S IPsec. 

Background, and strategic fit

The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default.  This encryption must scale to the largest of deployments. 

Assumptions

  •  

Customer Considerations

  • Customers require the option to use their own certificates or CA for IPsec.
  • Customers require observability of the configuration (e.g. is the IPsec tunnel up and passing traffic); see the sketch after this list.
  • If the IPsec tunnel is not up or otherwise functioning, traffic across the intended-to-be-encrypted network path should be blocked.
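
As an illustration of the observability ask, today this information has to be pulled manually from each node; a sketch using the same libreswan command that appears in the bug report further down this page:

$ oc debug node/worker-0 -- chroot /host ipsec whack --trafficstatus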

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

While running IPsec e2e tests in CI, the data plane traffic is not flowing with the desired traffic type (ESP or UDP). For example, in IPsec mode External, the traffic type is seen as ESP for EW traffic, but it is supposed to be Geneve (UDP) traffic.

Example CI run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/50687/rehearse-50687-pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-ipsec-serial/1789527351734833152

This issue was reproducible on a local cluster after many attempts, and we noticed that IPsec states are not cleaned up on the node, which is residue from a previous test run with IPsec Full mode.
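
For reference, a couple of commands that can be run on the node to verify the EW traffic type and to flush residual IPsec state by hand (the interface name and peer IP are taken from the capture below; flushing is only a manual clean-up sketch, not the actual fix):

# With ipsecConfig mode External, EW traffic should be plain Geneve (UDP/6081):
tcpdump -i enp2s0 -c 5 udp port 6081 and dst 192.168.111.24

# Remove leftover SAs and policies from a previous Full-mode run:
ip xfrm state flush
ip xfrm policy flush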
 
[peri@sdn-09 origin]$ kubectl get networks.operator.openshift.io cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2024-05-13T18:55:57Z"
  generation: 1362
  name: cluster
  resourceVersion: "593827"
  uid: 10f804c9-da46-41ee-91d5-37aff920bee4
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipv4: {}
        ipv6: {}
        routingViaHost: false
      genevePort: 6081
      ipsecConfig:
        mode: External
      mtu: 1400
      policyAuditConfig:
        destination: "null"
        maxFileSize: 50
        maxLogFiles: 5
        rateLimit: 20
        syslogFacility: local0
    type: OVNKubernetes
  deployKubeProxy: false
  disableMultiNetwork: false
  disableNetworkDiagnostics: false
  logLevel: Normal
  managementState: Managed
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 172.30.0.0/16
  unsupportedConfigOverrides: null
  useMultiNetworkPolicy: false
status:
  conditions:
  - lastTransitionTime: "2024-05-13T18:55:57Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2024-05-14T10:13:12Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-05-13T18:55:57Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2024-05-14T11:50:26Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2024-05-13T18:57:13Z"
    status: "True"
    type: Available
  readyReplicas: 0
  version: 4.16.0-0.nightly-2024-05-08-222442
[peri@sdn-09 origin]$ oc debug node/worker-0
Starting pod/worker-0-debug-k6nlm ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.23
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# toolbox
Checking if there is a newer version of registry.redhat.io/rhel9/support-tools available...
Container 'toolbox-root' already exists. Trying to start...
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-root')
toolbox-root
Container started successfully. To exit, type 'exit'.
[root@worker-0 /]# tcpdump -i enp2s0 -c 1 -v --direction=out esp and src 192.168.111.23 and dst 192.168.111.24
dropped privs to tcpdump
tcpdump: listening on enp2s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:07:01.854214 IP (tos 0x0, ttl 64, id 20451, offset 0, flags [DF], proto ESP (50), length 152)
    worker-0 > worker-1: ESP(spi=0x52cc9c8d,seq=0xe1c5c), length 132
1 packet captured
6 packets received by filter
0 packets dropped by kernel
[root@worker-0 /]# exit
exit
 
sh-5.1# ipsec whack --trafficstatus
006 #20: "ovn-1184d9-0-in-1", type=ESP, add_time=1715687134, inBytes=206148172, outBytes=0, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #19: "ovn-1184d9-0-out-1", type=ESP, add_time=1715687112, inBytes=0, outBytes=40269835, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #27: "ovn-185198-0-in-1", type=ESP, add_time=1715687419, inBytes=71406656, outBytes=0, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #26: "ovn-185198-0-out-1", type=ESP, add_time=1715687401, inBytes=0, outBytes=17201159, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #14: "ovn-922aca-0-in-1", type=ESP, add_time=1715687004, inBytes=116384250, outBytes=0, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #13: "ovn-922aca-0-out-1", type=ESP, add_time=1715686986, inBytes=0, outBytes=986900228, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #6: "ovn-f72f26-0-in-1", type=ESP, add_time=1715686855, inBytes=115781441, outBytes=98, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
006 #5: "ovn-f72f26-0-out-1", type=ESP, add_time=1715686833, inBytes=9320, outBytes=29002449, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
sh-5.1# ip xfrm state; echo ' '; ip xfrm policy
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0x7f7ddcf5 reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6158d9a0f4a28598500e15f81a40ef715502b37ecf979feb11bbc488479c8804598011ee 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x18564, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0xda57e42e reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x810bebecef77951ae8bb9a46cf53a348a24266df8b57bf2c88d4f23244eb3875e88cc796 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081 
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0xf84f2fcf reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x0f242efb072699a0f061d4c941d1bb9d4eb7357b136db85a0165c3b3979e27b00ff20ac7 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0x9523c6ca reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xe075d39b6e53c033f5225f8be48efe537c3ba605cee2f5f5f3bb1cf16b6c53182ecf35f7 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x10fb2
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081 
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0x459d8516 reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xee778e6db2ce83fa24da3b18e028451bbfcf4259513bca21db832c3023e238a6b55fdacc 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x3ec45, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x3142f53a reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6238fea6dffdd36cbb909f6aab48425ba6e38f9d32edfa0c1e0fc6af8d4e3a5c11b5dfd1 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081 
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0xeda1ccb9 reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xef84a90993bd71df9c97db940803ad31c6f7d2e72a367a1ec55b4798879818a6341c38b6 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x02c3c0dd reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x858ab7326e54b6d888825118724de5f0c0ad772be2b39133c272920c2cceb2f716d02754 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x26f8e
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081 
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0xc9535b47 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xd7a83ff4bd6e7704562c597810d509c3cdd4e208daabf2ec074d109748fd1647ab2eff9d 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x53d4c, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0xb66203c8 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xc207001a7f1ed7f114b3e327308ddbddc36de5272a11fe0661d03eaecc84b6761c7ec9c4 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081 
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0x2e4d4deb reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x91e399d83aa1c2626424b502d4b8dae07d4a170f7ef39f8d1baca8e92b8a1dee210e2502 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0x52cc9c8d reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xb605451f32f5dd7a113cae16e6f1509270c286d67265da2ad14634abccf6c90f907e5c00 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0xe2735
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081 
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0x973119c3 reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x87d13e67b948454671fb8463ec0cd4d9c38e5e2dd7f97cbb8f88b50d4965fb1f21b36199 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x2af9a, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0x4c3580ff reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x2c09750f51e86d60647a60e15606f8b312036639f8de2d7e49e733cda105b920baade029 128
lastused 2024-05-14 14:36:43
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081 
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0xa3e469dc reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x1d5c5c232e6fd4b72f3dad68e8a4d523cbd297f463c53602fad429d12c0211d97ae26f47 128
lastused 2024-05-14 14:18:42
anti-replay esn context:
seq-hi 0x0, seq 0xb, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 000007ff 
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0xdee8476f reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x5895025ce5b192a7854091841c73c8e29e7e302f61becfa3feb44d071ac5c64ce54f5083 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1f1a3
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081 
 
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src ::/0 dst ::/0 
socket out priority 0 ptype main 
src ::/0 dst ::/0 
socket in priority 0 ptype main 
src ::/0 dst ::/0 
socket out priority 0 ptype main 
src ::/0 dst ::/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir out priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir fwd priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir in priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir out priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir fwd priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir in priority 1 ptype main 
sh-5.1# cat /etc/ipsec.conf 
# /etc/ipsec.conf - Libreswan 4.0 configuration file
#
# see 'man ipsec.conf' and 'man pluto' for more information
#
# For example configurations and documentation, see https://libreswan.org/wiki/
 
config setup
# If logfile= is unset, syslog is used to send log messages too.
# Note that on busy VPN servers, the amount of logging can trigger
# syslogd (or journald) to rate limit messages.
#logfile=/var/log/pluto.log
#
# Debugging should only be used to find bugs, not configuration issues!
# "base" regular debug, "tmi" is excessive and "private" will log
# sensitive key material (not available in FIPS mode). The "cpu-usage"
# value logs timing information and should not be used with other
# debug options as it will defeat getting accurate timing information.
# Default is "none"
# plutodebug="base"
# plutodebug="tmi"
#plutodebug="none"
#
# Some machines use a DNS resolver on localhost with broken DNSSEC
# support. This can be tested using the command:
# dig +dnssec DNSnameOfRemoteServer
# If that fails but omitting '+dnssec' works, the system's resolver is
# broken and you might need to disable DNSSEC.
# dnssec-enable=no
#
# To enable IKE and IPsec over TCP for VPN server. Requires at least
# Linux 5.7 kernel or a kernel with TCP backport (like RHEL8 4.18.0-291)
# listen-tcp=yes
# To enable IKE and IPsec over TCP for VPN client, also specify
# tcp-remote-port=4500 in the client's conn section.
 
# if it exists, include system wide crypto-policy defaults
include /etc/crypto-policies/back-ends/libreswan.config
 
# It is best to add your IPsec connections as separate files
# in /etc/ipsec.d/
include /etc/ipsec.d/*.conf
sh-5.1# cat /etc/ipsec.d/openshift.conf 
# Generated by ovs-monitor-ipsec...do not modify by hand!
 
 
config setup
    uniqueids=yes
 
conn %default
    keyingtries=%forever
    type=transport
    auto=route
    ike=aes_gcm256-sha2_256
    esp=aes_gcm256
    ikev2=insist
 
conn ovn-f72f26-0-in-1
    left=192.168.111.23
    right=192.168.111.22
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-f72f26-0-out-1
    left=192.168.111.23
    right=192.168.111.22
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-1184d9-0-in-1
    left=192.168.111.23
    right=192.168.111.20
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@1184d960-3211-45c4-a482-d7b6fe995446
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-1184d9-0-out-1
    left=192.168.111.23
    right=192.168.111.20
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@1184d960-3211-45c4-a482-d7b6fe995446
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-922aca-0-in-1
    left=192.168.111.23
    right=192.168.111.24
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-922aca-0-out-1
    left=192.168.111.23
    right=192.168.111.24
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-185198-0-in-1
    left=192.168.111.23
    right=192.168.111.21
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-185198-0-out-1
    left=192.168.111.23
    right=192.168.111.21
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
sh-5.1# 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

link back to OCPSTRAT-1644 somehow

 

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?  Does it improve security, performance, supportability, etc.?  Why is this work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Epic Goal*

Drive the technical part of the Kubernetes 1.32 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.19 cannot be released without Kubernetes 1.32

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

 

Slack Discussion Channel - https://redhat.enterprise.slack.com/archives/C07V32J0YKF

As a customer of self-managed OpenShift, or as an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make updates problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking more time.
    Customers often have to dig deeper to find these nodes for further debugging.
    The ask has been to bubble this up in the update progress window.
  2. oc update status?
    From the UI we can see the progress of the update. From the oc CLI we can see this with "oc get clusterversion",
     but the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

     

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new `oc adm upgrade status` command which is backed by an API.  Please find the mock output of the command attached in this card.
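
While the command is still gated behind an environment variable, it can be invoked as shown in the sample output further down this page:

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status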

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information with "oc get clusterversion", but the output is not easily readable and there is a lot of extra information to process.
  • Customers are asking us to show more details in a human-readable format, as well as to provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

After OTA-960 is fixed, ClusterVersion/version and oc adm upgrade can be used to monitor the process of migrating a cluster to multi-arch.

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

But oc adm upgrade status reports COMPLETION 100% while the migration/upgrade is still ongoing.

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Completed
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      100% (33 operators updated, 0 updating, 0 waiting)
Duration:        15m
Operator Status: 33 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-95-224.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-33-81.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
ip-10-0-45-170.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100%         3 Total, 2 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-72-40.us-east-2.compute.internal    Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-22-179.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Update Health =
SINCE   LEVEL     IMPACT         MESSAGE
-       Warning   Update Speed   Node ip-10-0-95-224.us-east-2.compute.internal is unavailable
-       Warning   Update Speed   Node ip-10-0-72-40.us-east-2.compute.internal is unavailable

Run with --details=health for additional description and links to related online documentation

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        True          14m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.18.0-ec.3   True        True          False      63m     Working towards 4.18.0-ec.3

The reason is that PROGRESSING=True is not detected for co/machine-config, because the status command checks only ClusterOperator.Status.Versions[name=="operator"]; it needs to check ClusterOperator.Status.Versions[name=="operator-image"] as well.
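
For example, both version entries can be inspected directly (a sketch of the additional check the status command needs):

$ oc get clusteroperator machine-config -o jsonpath='{.status.versions}{"\n"}'
# during the migration this is expected to contain both a name=="operator" entry
# (the version string) and a name=="operator-image" entry (the MCO image pull spec)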

 

For grooming:

It will be challenging for the status command to check the operator image's pull spec because it does not know the expected value. CVO knows it because CVO holds the manifests (containing the expected value) from the multi-arch payload.

One "hacky" workaround is that the status command gets the pull spec from the MCO deployment:

oc get deployment -n openshift-machine-config-operator machine-config-operator -o json | jq -r '.spec.template.spec.containers[]|select(.name=="machine-config-operator")|.image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5 

Note this co/machine-config -> deployment/machine-config-operator trick may not be feasible if we want to extend it to all cluster operators. But it should work as a hacky workaround to check only MCO.

We may claim that the status command is not designed for monitoring the multi-arch migration and suggest using oc adm upgrade instead. In that case, we can close this card as Obsolete/Won'tDo.

 

manifests.zip has the mockData/manifests for the status cmd that were taken during the migration.

 

oc#1920 started the work for the status command to recognize the migration, and we need to extend that work to cover the following comments from Petr's review:

  • "Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)": confusing. We should tell "multi-arch" migration somehow. Or even better: from the current arch to multi-arch, for example "Target Version: 4.18.0-ec.3 multi (from x86_64)" if we could get the origin arch from CV or somewhere else.
    • We have spec.desiredUpdate.architecture since forever, and can use that being Multi as a partial hint.  MULTIARCH-4559 is adding tech-preview status properties around architecture in 4.18, but tech-preview, so may not be worth bothering with in oc code.  Two history entries with the same version string but different digests is probably a reliable-enough heuristic, coupled with the spec-side hint.
  • "Duration: 6m55s (Est. Time Remaining: 1h4m)": We will see if we could find a simple way to hand this special case. I do not understand "the 97% completion will be reached so fast." as I am not familiar with the algorithm. But it seems acceptable to Petr that we show N/A for the migration.
    • I think I get "the 97% completion will be reached so fast." now as only MCO has the operator-image pull spec. Other COs claim the completeness immaturely. With that said, "N/A" sounds like the most possible way for now.
  • Node status like "All control plane nodes successfully updated to 4.18.0-ec.3" for control planes and "ip-10-0-17-117.us-east-2.compute.internal Completed". It is technically hard to detect the transaction during migration as MCO annotates only the version. This may become a separate card if it is too big to finish with the current one.
  • "targetImagePullSpec := getMCOImagePullSpec(mcoDeployment)" should be computed just once. Now it is in the each iteration of the for loop. We should also comment about why we do it with this hacky way.

Feature Overview (aka. Goal Summary)  

We need to maintain our dependencies across all the libraries we use in order to stay in compliance. 

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but that tend to fall by the wayside.

Currently the console is using TypeScript 4, which is preventing us from upgrading to NodeJS 22. Due to that, we need to update to TypeScript 5 (not necessarily the latest version).

 

AC:

  • Update TypeScript to version 5
  • Update ES build target to ES-2021

 

Note: In case of higher complexity we should be splitting the story into multiple stories, per console package.
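
A minimal sketch of the corresponding dependency and build-target bump (the version range and tsconfig snippet are illustrative; the exact minor version is to be determined per the AC above):

# bump the compiler
yarn add --dev typescript@^5

# and raise the emit target in tsconfig.json:
#   "compilerOptions": { "target": "ES2021", ... }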

As a developer I want to make sure we are running the latest version of webpack in order to take advantage of the latest benefits and also keep current so that future updates are as painless as possible.

We are currently on v4.47.0.

Changelog: https://webpack.js.org/blog/2020-10-10-webpack-5-release/

By updating to version 5 we will need to update the following packages as well:

  • html-webpack-plugin
  • webpack-bundle-analyzer
  • copy-webpack-plugin
  • fork-ts-checker-webpack-plugin

AC: Update webpack to version 5 and determine what should be the ideal minor version.
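
A sketch of the upgrade described in the AC above, assuming yarn and illustrative major versions for the plugins that have to move in lockstep with webpack 5:

yarn add --dev webpack@^5 \
  html-webpack-plugin@^5 \
  copy-webpack-plugin@^12 \
  fork-ts-checker-webpack-plugin@^9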

Feature Overview (aka. Goal Summary)  

The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.

BYO Identity will help facilitate CLI-only workflows and capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD), similar to upstream Kubernetes.

Goals (aka. expected user outcomes)

Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customers/users can then configure their IDPs to support the OIDC protocols and workflows they desire, such as the client credentials flow.

The OpenShift OAuth server is still available as the default option, with the ability to bring in the external OIDC provider as a Day-2 configuration.
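
As a sketch of what that Day-2 configuration could look like (the issuer URL, audience, and client ID are placeholders, and the field names follow the openshift/api Authentication type, so they may differ in the final API):

$ oc patch authentication.config/cluster --type=merge -p='
spec:
  type: OIDC
  oidcProviders:
  - name: my-oidc-provider
    issuer:
      issuerURL: https://idp.example.com
      audiences:
      - openshift-example-audience
    oidcClients:
    - componentName: console
      componentNamespace: openshift-console
      clientID: console-client-id
'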

Requirements (aka. Acceptance Criteria):

  1. The customer should be able to tie into RBAC functionality, similar to how it is closely aligned with OpenShift OAuth 
  2.  

Use Cases (Optional):

  1. As a customer, I would like to integrate my OIDC Identity Provider directly with the OpenShift API server.
  2. As a customer in a multi-cluster cloud environment, I have both K8s and non-K8s clusters using my IDP, and hence I need seamless authentication directly to the OpenShift/K8s API using my Identity Provider
  3.  

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

The ability to provide a direct authentication workflow such that OpenShift can consume bearer tokens issued by external OIDC identity providers, replacing the built-in OAuth stack by deactivating/removing its components as necessary.

 
Why is this important? (mandatory)

OpenShift has its own built-in OAuth server which can be used to obtain OAuth access tokens for authentication to the API. The server can be configured with an external identity provider (including support for OIDC); however, it is still the built-in server that issues tokens, and thus authentication is limited to the capabilities of the oauth-server.

 
Scenarios (mandatory) 

  • As a customer, I want to integrate my OIDC Identity Provider directly with OpenShift so that I can fully use its capabilities in machine-to-machine workflows.
  • As a customer in a hybrid cloud environment, I want to seamlessly use my OIDC Identity Provider across all of my fleet.

 
Dependencies (internal and external) (mandatory)

  • Support in the console/console-operator (already completed)
  • Support in the OpenShift CLI `oc` (already completed)

Contributing Teams(and contacts) (mandatory) 

  • Development - OCP Auth
  • Documentation - OCP Auth
  • QE - OCP Auth
  • PX - 
  • Others -

Acceptance Criteria (optional)

  • an external OIDC provider can be configured so that the kube-apiserver directly consumes the tokens it issues
  • built-in oauth stack no longer operational in the cluster; respective APIs, resources and components deactivated
  • changing back to the built-in oauth stack possible

Drawbacks or Risk (optional)

  • Enabling an external OIDC provider on an OCP cluster will result in the oauth-apiserver being removed from the system; this inherently means that the two API Services it is serving (v1.oauth.openshift.io, v1.user.openshift.io) will be gone from the cluster, and therefore any related data will be lost. It is the user's responsibility to create backups of any required data.
  • Configuring an external OIDC identity provider for authentication by definition means that any security updates or patches must be managed independently from the cluster itself, i.e. cluster updates will not resolve security issues relevant to the provider itself; the provider will have to be updated separately. Additionally, new functionality or features on the provider's side might need integration work in OpenShift (depending on their nature).

Done - Checklist (mandatory)

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Description of problem:
This is a bug found during pre-merge testing of the 4.18 epic AUTH-528 PRs, filed for better tracking per the existing "OpenShift - Testing Before PR Merges - Left-Shift Testing" Google doc workflow.

co/console becomes degraded with AuthStatusHandlerDegraded after BYO external OIDC is configured on OCP and then removed (i.e. reverted back to the OAuth IDP).

Version-Release number of selected component (if applicable):

Cluster-bot build which is built at 2024-11-25 09:39 CST (UTC+800)
build 4.18,openshift/cluster-authentication-operator#713,openshift/cluster-authentication-operator#740,openshift/cluster-kube-apiserver-operator#1760,openshift/console-operator#940

How reproducible:

Always (tried twice, both hit it)

Steps to Reproduce:

1. Launch a TechPreviewNoUpgrade standalone OCP cluster with the above build. Configure an htpasswd IDP. Test users can log in successfully.

2. Configure BYO external OIDC in this OCP cluster using Microsoft Entra ID. KAS and console pods can roll out successfully. oc login and console login to Microsoft Entra ID can succeed.

3. Remove the BYO external OIDC configuration, i.e. go back to the original htpasswd OAuth IDP:
[xxia@2024-11-25 21:10:17 CST my]$ oc patch authentication.config/cluster --type=merge -p='
spec: 
  type: ""
  oidcProviders: null
'
authentication.config.openshift.io/cluster patched

[xxia@2024-11-25 21:15:24 CST my]$ oc get authentication.config  cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2024-11-25T04:11:59Z"
  generation: 5
  name: cluster
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: e814f1dc-0b51-4b87-8f04-6bd99594bf47
  resourceVersion: "284724"
  uid: 2de77b67-7de4-4883-8ceb-f1020b277210
spec:
  oauthMetadata:
    name: ""
  serviceAccountIssuer: ""
  type: ""
  webhookTokenAuthenticator:
    kubeConfig:
      name: webhook-authentication-integrated-oauth
status:
  integratedOAuthMetadata:
    name: oauth-openshift
  oidcClients:
  - componentName: cli
    componentNamespace: openshift-console
  - componentName: console
    componentNamespace: openshift-console
    conditions:
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Degraded
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "True"
      type: Available
    currentOIDCClients:
    - clientID: 95fbae1d-69a7-4206-86bd-00ea9e0bb778
      issuerURL: https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/v2.0
      oidcProviderName: microsoft-entra-id


KAS and console pods indeed can roll out successfully; and now oc login and console login indeed can succeed using the htpasswd user and password:
[xxia@2024-11-25 21:49:32 CST my]$ oc login -u testuser-1 -p xxxxxx
Login successful.
...

But co/console degraded, which is weird:
[xxia@2024-11-25 21:56:07 CST my]$ oc get co | grep -v 'True *False *False'
NAME                                       VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.18.0-0.test-2024-11-25-020414-ci-ln-71cvsj2-latest   True        False         True       9h      AuthStatusHandlerDegraded: Authentication.config.openshift.io "cluster" is invalid: [status.oidcClients[1].currentOIDCClients[0].issuerURL: Invalid value: "": oidcClients[1].currentOIDCClients[0].issuerURL in body should match '^https:\/\/[^\s]', status.oidcClients[1].currentOIDCClients[0].oidcProviderName: Invalid value: "": oidcClients[1].currentOIDCClients[0].oidcProviderName in body should be at least 1 chars long]

Actual results:

co/console degraded, as above.

Expected results:

co/console is normal.

Additional info:

    

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 
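
For reference, the per-namespace opt-in mentioned above uses the standard Pod Security Admission labels on the Namespace object; a minimal sketch (the namespace name is illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: example-app                              # illustrative
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted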

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSA, might result in rejecting the workload once we start enforcing the "restricted" profile), we must pin the required SCC to all workloads in platform namespaces (openshift-*, kube-*, default).

Each workload should pin the least-privileged SCC that fits it, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should still pin an SCC for tracking purposes).
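
As an illustration of what "pinning" means in practice, a workload can request a specific SCC through the openshift.io/required-scc pod annotation; a minimal sketch (names and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator                         # illustrative
  namespace: openshift-example                   # illustrative platform namespace
spec:
  selector:
    matchLabels:
      app: example-operator
  template:
    metadata:
      labels:
        app: example-operator
      annotations:
        openshift.io/required-scc: restricted-v2 # pin the least-privileged SCC that fits the workload
    spec:
      containers:
      - name: operator
        image: example.com/operator:latest       # illustrative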

The following tables track progress.

Progress summary

# namespaces 4.19 4.18 4.17 4.16 4.15 4.14
monitored 82 82 82 82 82 82
fix needed 68 68 68 68 68 68
fixed 39 39 35 32 39 1
remaining 29 29 33 36 29 67
~ remaining non-runlevel 8 8 12 15 8 46
~ remaining runlevel (low-prio) 21 21 21 21 21 21
~ untested 5 2 2 2 82 82

Progress breakdown

# namespace 4.19 4.18 4.17 4.16 4.15 4.14
1 oc debug node pods #1763 #1816 #1818  
2 openshift-apiserver-operator #573 #581  
3 openshift-authentication #656 #675  
4 openshift-authentication-operator #656 #675  
5 openshift-catalogd #50 #58  
6 openshift-cloud-credential-operator #681 #736  
7 openshift-cloud-network-config-controller #2282 #2490 #2496    
8 openshift-cluster-csi-drivers #118 #5310 #135 #524 #131 #306 #265 #75   #170 #459 #484  
9 openshift-cluster-node-tuning-operator #968 #1117  
10 openshift-cluster-olm-operator #54 n/a n/a
11 openshift-cluster-samples-operator #535 #548  
12 openshift-cluster-storage-operator #516   #459 #196 #484 #211  
13 openshift-cluster-version       #1038 #1068  
14 openshift-config-operator #410 #420  
15 openshift-console #871 #908 #924  
16 openshift-console-operator #871 #908 #924  
17 openshift-controller-manager #336 #361  
18 openshift-controller-manager-operator #336 #361  
19 openshift-e2e-loki #56579 #56579 #56579 #56579  
20 openshift-image-registry       #1008 #1067  
21 openshift-ingress   #1032        
22 openshift-ingress-canary   #1031        
23 openshift-ingress-operator   #1031        
24 openshift-insights #1033 #1041 #1049 #915 #967  
25 openshift-kni-infra #4504 #4542 #4539 #4540  
26 openshift-kube-storage-version-migrator #107 #112  
27 openshift-kube-storage-version-migrator-operator #107 #112  
28 openshift-machine-api #1308 #1317 #1311 #407 #315 #282 #1220 #73 #50 #433 #332 #326 #1288 #81 #57 #443  
29 openshift-machine-config-operator #4636 #4219 #4384 #4393  
30 openshift-manila-csi-driver #234 #235 #236  
31 openshift-marketplace #578 #561 #570
32 openshift-metallb-system #238 #240 #241    
33 openshift-monitoring #2298 #366 #2498   #2335 #2420  
34 openshift-network-console #2545        
35 openshift-network-diagnostics #2282 #2490 #2496    
36 openshift-network-node-identity #2282 #2490 #2496    
37 openshift-nutanix-infra #4504 #4539 #4540  
38 openshift-oauth-apiserver #656 #675  
39 openshift-openstack-infra #4504   #4539 #4540  
40 openshift-operator-controller #100 #120  
41 openshift-operator-lifecycle-manager #703 #828  
42 openshift-route-controller-manager #336 #361  
43 openshift-service-ca #235 #243  
44 openshift-service-ca-operator #235 #243  
45 openshift-sriov-network-operator #995 #999 #1003  
46 openshift-user-workload-monitoring #2335 #2420  
47 openshift-vsphere-infra #4504 #4542 #4539 #4540  
48 (runlevel) kube-system            
49 (runlevel) openshift-cloud-controller-manager            
50 (runlevel) openshift-cloud-controller-manager-operator            
51 (runlevel) openshift-cluster-api            
52 (runlevel) openshift-cluster-machine-approver            
53 (runlevel) openshift-dns            
54 (runlevel) openshift-dns-operator            
55 (runlevel) openshift-etcd            
56 (runlevel) openshift-etcd-operator            
57 (runlevel) openshift-kube-apiserver            
58 (runlevel) openshift-kube-apiserver-operator            
59 (runlevel) openshift-kube-controller-manager            
60 (runlevel) openshift-kube-controller-manager-operator            
61 (runlevel) openshift-kube-proxy            
62 (runlevel) openshift-kube-scheduler            
63 (runlevel) openshift-kube-scheduler-operator            
64 (runlevel) openshift-multus            
65 (runlevel) openshift-network-operator            
66 (runlevel) openshift-ovn-kubernetes            
67 (runlevel) openshift-sdn            
68 (runlevel) openshift-storage            

Feature Overview (aka. Goal Summary)  

Implement Migration core for MAPI to CAPI for AWS

  • This feature covers the design and implementation of converting from using the Machine API (MAPI) to Cluster API (CAPI) for AWS
  • This Design investigates possible solutions for AWS
  • Once AWS shim/sync layer is implemented use the architecture for other clouds in phase-2 & phase 3

Acceptance Criteria

When customers switch over to using CAPI, there must be no negative effect: Machine resources must migrate seamlessly, and the fields in MAPI/CAPI should reconcile from both CRDs.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Why is this important?

  • We need to build out the core so that development of the migration for individual providers can then happen in parallel
  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As QE have tried to test upstream CAPI pausing, we've hit a few issues with running the migration controller and the cluster CAPI operator on a real cluster vs envtest.

This card captures the work required to iron out these kinks and get things running (i.e. not crashing).

I also think we want an e2e or some sort of automated testing to ensure we don't break things again.

 

Goal: Stop the CAPI operator crashing on startup in a real cluster.

 

Non-goals: get the entire conversion flow running from CAPI -> MAPI and MAPI -> CAPI. We still need significant feature work before we're there.

Feature Overview (aka. Goal Summary)  

As a cluster administrator, I want to use Karpenter on an OpenShift cluster running in AWS to scale nodes instead of the Cluster Autoscaler (CAS). I want to automatically manage heterogeneous compute resources in my OpenShift cluster without the additional manual task of managing node pools. Additional features I want are:

  • Reducing cloud costs through instance selection and scaling/descaling
  • Support GPUs, spot instances, mixed compute types and other compute types.
  • Automatic node lifecycle management and upgrades

This feature covers the work done to integrate upstream Karpenter 1.x with ROSA HCP. This eliminates the need for manual node pool management while ensuring cost-effective compute selection for workloads. Red Hat manages the node lifecycle and upgrades.

The feature will be rolled out with ROSA (AWS) first, since it has the more mature Karpenter ecosystem, followed by the ARO (Azure) implementation (see OCPSTRAT-1498).

Goals (aka. expected user outcomes)

  1. Run Karpenter in the management cluster and disable CAS
  2. Automate node provisioning in the workload cluster
  3. Automate lifecycle management in the workload cluster
  4. Reduce cost for heterogeneous compute workloads

https://docs.google.com/document/d/1ID_IhXPpYY4K3G_wa1MYJxOb3yz5FYoOj3ONSkEDsZs/edit?tab=t.0#heading=h.yvv1wy2g0utk

Requirements (aka. Acceptance Criteria):

As a cluster-admin or SRE, I should be able to configure Karpenter with OCP on AWS. Both the CLI and the UI should enable users to configure Karpenter and disable CAS.

  1. Run Karpenter in management cluster and disable CAS
  2. OCM API 
    • Enable/Disable Cluster autoscaler
    • Enable/disable AutoNode feature
    • New ARN role configuration for Karpenter
    • Optional: New managed policy or integration with existing nodepool role permissions
  3. Expose NodeClass/Nodepool resources to users. 
  4. secure node provisioning and management, machine approval system for Karpenter instances
  5. HCP Karpenter cleanup/deletion support
  6. ROSA CAPI fields to enable/disable/configure Karpenter
  7. Write end-to-end tests for karpenter running on ROSA HCP

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both managed ROSA HCP
Classic (standalone cluster)  
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all MNO
Connected / Restricted Network Connected
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86_64, ARM (aarch64)
Operator compatibility  
Backport needed (list applicable versions) No
UI need (e.g. OpenShift Console, dynamic plugin, OCM) yes - console
Other (please specify) rosa-cli

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

  • Supporting this feature in Standalone OCP/self-hosted HCP/ROSA classic
  • Creating a multi-provider cost/pricing operator compatible with CAPI is beyond the scope of this Feature. That may take more time.
  •  

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

  • Karpenter.sh is an open-source node provisioning project built for Kubernetes. It is designed to simplify Kubernetes infrastructure by automatically launching and terminating nodes based on the needs of your workloads. Karpenter can help you to reduce costs, improve performance, and simplify operations.
  • Karpenter works by observing the unscheduled pods in your cluster and launching new nodes to accommodate them. Karpenter can also terminate nodes that are no longer needed, which can help you save money on infrastructure costs.
  • Karpenter's architecture splits into a cloud-agnostic core (karpenter-core) and a cloud-specific provider (e.g. karpenter-provider-aws). The core observes unschedulable pods and computes which nodes to provision or consolidate to reduce cost, while the provider implements the cloud-specific calls needed to launch and terminate those nodes. See the sketch after this list.
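
For illustration, node provisioning with upstream Karpenter on AWS is driven by NodePool and EC2NodeClass resources; a minimal sketch using the upstream APIs (the names, requirements, role, and discovery tags below are assumptions, and the exact resources exposed to ROSA HCP users may differ):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]            # mix spot and on-demand to reduce cost
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]               # heterogeneous compute
  limits:
    cpu: "100"                                   # illustrative cap on total provisioned CPU
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2023@latest                         # assumption; the managed offering may pin a different image
  role: KarpenterNodeRole-example                # illustrative IAM role
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: example-cluster    # illustrative discovery tag
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: example-cluster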

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

  • Migration guides from using CAS to Karpenter
  • Performance testing to compare CAS vs Karpenter on ROSA HCP
  • API documentation for NodePool and EC2NodeClass configuration

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Goal

  • Codify and enable usage of a prototype for HCP working with Karpenter on the management side.

Why is this important?

  • A first usable version is critical to democratize knowledge and develop internal feedback.

Acceptance Criteria

  • Deploying a cluster with --auto-node results in Karpenter running on the management side, with the CRDs and a default EC2NodeClass installed within the guest cluster
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal Summary

This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in the customer's account (system components) would be scoped with Azure workload identities.

Epic Goal

The Cluster Network Operator can authenticate with a Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets Store CSI driver will be used to mount the certificate as a volume on the Cluster Network Operator deployment in a hosted control plane.
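
As a rough illustration of the mechanism (not the final ARO HCP wiring), the Azure provider for the Secrets Store CSI driver can surface a Key Vault certificate as a mounted volume via a SecretProviderClass; the names, vault, tenant, and identity values below are assumptions:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: cno-service-principal-cert               # illustrative
  namespace: clusters-example                    # illustrative hosted control plane namespace
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<identity-client-id>"             # illustrative; depends on the identity mode in use
    keyvaultName: example-keyvault               # illustrative
    tenantId: "<tenant-id>"                      # illustrative
    objects: |
      array:
        - |
          objectName: cno-sp-cert
          objectType: secret                     # fetching as "secret" returns the cert plus private key

# ...and the corresponding volume on the operator deployment:
volumes:
- name: sp-cert
  csi:
    driver: secrets-store.csi.k8s.io
    readOnly: true
    volumeAttributes:
      secretProviderClass: cno-service-principal-cert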

Why is this important?

  • This is needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Cluster Network Operator is able to authenticate with Azure in ARO HCP using Service Principal with a backing certificate.
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Previous Work (Optional):

SDN-4450

Open questions:

Which degree of coverage should run on AKS e2e vs on existing e2es

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Problem

Today, Azure installation requires a manually created service principal, which involves creation, permission granting, credential setting, credential storage, credential rotation, credential clean-up, and service principal deletion. This is not only mundane and time-consuming but also less secure, risking access to resources by adversaries due to the lack of credential rotation.

Goal

Employ Azure managed credentials, which drastically reduce the required steps to just managed identity creation, permission granting, and resource deletion.

Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.  

Feature Overview

  • An assistant to help developers in ODC edit configuration YML files

Goals

  • Perform an architectural spike to better assess feasibility and value of pursuing this further

Requirements

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • Is there overlap with what other teams at RH are already planning? 

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • More details in the outcome parent RHDP-985

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem: As a user of OpenShift Lightspeed, I would like to import a YAML generated in the Lightspeed window into the OpenShift console YAML editor.

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  • Add an Import in YAML Editor action inside the OLS chat popup.
  • Along with the copy button we can add another button that imports the generated YAML data inside the YAML editor.
  • This action should also be able to redirect users to the YAML editor and then paste the generated YAML inside the editor.
  • We will need to create a new extension point that can help trigger the action from the OLS chat popup and a way to listen to any such triggers inside the YAML editor.
  • We also need to consider certain edge cases, like:
  • What happens if the user has already added something in the editor and then triggers the import action from OLS?
  • What happens when the user imports a YAML from OLS and then regenerates it to modify something?

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so we don't forget to improve it in a safer and more maintainable way.

Why is this important?

Maintainability and debuggability, and in general fighting technical debt, are critical to keep velocity and ensure overall high quality.

Scenarios

  1. N/A

Acceptance Criteria

  • depends on the specific card

Dependencies (internal and external)

  • depends on the specific card

Previous Work (Optional):

https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479 
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
https://issues.redhat.com/browse/CNF-9566 

 Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Capture the necessary accidental work to get CI / Konflux unstuck during the 4.19 cycle

Due to capacity problems on the s390x environment, the Konflux team recommended disabling the s390x platform from the PR pipeline.

 

Slack thread

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

When running ./build-frontend.sh, I am getting the following warnings in the build log:

warning " > cypress-axe@0.12.0" has unmet peer dependency "axe-core@^3 || ^4".
warning " > cypress-axe@0.12.0" has incorrect peer dependency "cypress@^3 || ^4 || ^5 || ^6".

To fix:

  • upgrade cypress-axe to a version which supports our current cypress version (13.10.0) and install axe-core to resolve the warnings
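
A possible way to apply that fix (assuming a cypress-axe release that lists Cypress 13 in its peer range):

yarn add --dev cypress-axe@latest axe-core@latest   # run in the package that declares cypress
./build-frontend.sh                                 # re-run to confirm the peer-dependency warnings are gone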

Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

When adding assignContributorRole to assign contributor roles for the appropriate scopes to existing SPs, we missed assigning the role over the DNS resource group (RG) scope.

We are constantly bumping up against quotas when trying to create new ServicePrincipals per test. Example:

=== NAME  TestCreateClusterV2
    hypershift_framework.go:291: failed to create cluster, tearing down: failed to create infra: ERROR: The directory object quota limit for the Tenant has been exceeded. Please ask your administrator to increase the quota limit or delete objects to reduce the used quota. 

We need to create a set of ServicePrincipals to use during testing, and we need to reuse them while executing the e2e-aks.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

These are items that the team has prioritized to address in 4.18.

In https://issues.redhat.com/browse/MCO-1469, we are migrating my helper binaries into the MCO repository. I had to make changes to several of my helpers in the original repository to address bugs and other issues in order to unblock https://github.com/openshift/release/pull/58241. Because of the changes I requested during the PR review to make the integration easier, it may be a little tricky to incorporate all of my changes into the MCO repository, but it is still doable.

Done When:

  • The latest changes to zacks-openshift-helpers are incorporated into the MCO repository versions of the relevant helper binaries.

In OCP 4.7 and before, you were able to see the MCD logs of the previous container post-upgrade. In newer versions it seems that we no longer do. I am not sure if this is a change in kube pod logging behaviour, in how the pod gets shut down and brought up, or something in the MCO.

 

This, however, makes it relatively hard to debug newer versions of the MCO, and in numerous bugs we could not pinpoint the source of the issue since we no longer have the necessary logs. We should find a way to properly save the previous-boot MCD logs if possible.

This epic has been repurposed for handling bugs and issues related to the DataImage API (see comments by Zane and the Slack discussion below). Some issues have already been added; more will be added to improve the stability and reliability of this feature.

Reference links :
Issue opened for IBIO : https://issues.redhat.com/browse/OCPBUGS-43330
Slack discussion threads :
https://redhat-internal.slack.com/archives/CFP6ST0A3/p1729081044547689?thread_ts=1728928990.795199&cid=CFP6ST0A3
https://redhat-internal.slack.com/archives/C0523LQCQG1/p1732110124833909?thread_ts=1731660639.803949&cid=C0523LQCQG1

Description of problem:

After deleting a BareMetalHost which has a related DataImage, the DataImage is still present. I'd expect the DataImage to be deleted together with the BMH.

Version-Release number of selected component (if applicable):

4.17.0-rc.0    

How reproducible:

100%    

Steps to Reproduce:

    1. Create BaremetalHost object as part of the installation process using Image Based Install operator

    2. Image Based Install operator will create a dataimage as part of the install process
     
    3. Delete the BaremetalHost object 

    4. Check the DataImage assigned to the BareMetalHost     

Actual results:

While the BaremetalHost was deleted the DataImage is still present:

oc -n kni-qe-1 get bmh
No resources found in kni-qe-1 namespace.

 oc -n kni-qe-1 get dataimage -o yaml
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: DataImage
  metadata:
    creationTimestamp: "2024-09-24T11:58:10Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-09-24T14:06:15Z"
    finalizers:
    - dataimage.metal3.io
    generation: 2
    name: sno.kni-qe-1.lab.eng.rdu2.redhat.com
    namespace: kni-qe-1
    ownerReferences:
    - apiVersion: metal3.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: BareMetalHost
      name: sno.kni-qe-1.lab.eng.rdu2.redhat.com
      uid: 0a8bb033-5483-4fe8-8e44-06bf43ae395f
    resourceVersion: "156761793"
    uid: 2358cae9-b660-40e6-9095-7daabb4d9e48
  spec:
    url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso
  status:
    attachedImage:
      url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso
    error:
      count: 0
      message: ""
    lastReconciled: "2024-09-24T12:03:28Z"
kind: List
metadata:
  resourceVersion: ""
    

Expected results:

    The DataImage gets deleted when the BaremetalHost owner gets deleted.

Additional info:

This is impacting automated test pipelines which use the Image Based Install operator, as the cleanup stage gets stuck waiting for the namespace deletion, which still holds the DataImage. Also, deleting the DataImage gets stuck, and it can only be deleted by removing the finalizer.
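
For reference, the workaround mentioned above amounts to clearing the finalizer on the stuck DataImage, roughly:

oc -n kni-qe-1 patch dataimage sno.kni-qe-1.lab.eng.rdu2.redhat.com \
  --type merge -p '{"metadata":{"finalizers":null}}'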

oc  get namespace kni-qe-1 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c33,c2
    openshift.io/sa.scc.supplemental-groups: 1001060000/10000
    openshift.io/sa.scc.uid-range: 1001060000/10000
  creationTimestamp: "2024-09-24T11:40:03Z"
  deletionTimestamp: "2024-09-24T14:06:14Z"
  labels:
    app.kubernetes.io/instance: clusters
    cluster.open-cluster-management.io/managedCluster: kni-qe-1
    kubernetes.io/metadata.name: kni-qe-1
    name: kni-qe-1-namespace
    open-cluster-management.io/cluster-name: kni-qe-1
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.24
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.24
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.24
  name: kni-qe-1
  resourceVersion: "156764765"
  uid: ee984850-665a-4f5e-8f17-0c44b57eb925
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: 'Some resources are remaining: dataimages.metal3.io has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: 'Some content in the namespace has finalizers remaining: dataimage.metal3.io
      in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating
     

Tracking all things Konflux related for the Metal Platform Team
Full enablement should happen during OCP 4.19 development cycle

Description of problem:


The host that gets used in production builds to download the iso will change soon.

It would be good to allow this host to be set through configuration from the release team / ocp-build-data

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

List of component:

  • assisted-image-service
  • assisted-installer
  • assisted-installer-controller
  • assisted-installer-agent
  • assisted-service-rhel9
  • assisted-service-rhel8

Adapt the current Dockerfiles and add them to the upstream repos.

List of components:

  • assisted-installer
  • assisted-installer-reporter (aka assisted-installer-controller)
  • assisted-installer-agent

A lot of the time, our pipelines, as well as other teams' pipelines, are stuck because they are unable to provision hosts with different architectures to build the images.

Because we currently don't use the multi-arch images we build with Konflux, we will stop building multi-arch for now and re-add those architectures when we need them.

Currently, the monitoring stack is configured using a configmap. In OpenShift though the best practice is to configure operators using custom resources.

Why this matters

  • We can add [cross]validation rules to CRD fields to avoid misconfigurations
  • End users get a much faster feedback loop. No more applying the config and scanning logs if things don't look right. The API server will give immediate feedback
  • Organizational users (such as ACM) can manage a single resource and observe its status

To start the effort we should create a feature gate behind which we can start implementing a CRD config approach. This allows us to iterate in smaller increments without having to support full feature parity with the config map from the start. We can start small and add features as they evolve.

One proposal for a minimal DoD was:

  • We have a feature gate
  • We have outlined our idea and approach in an enhancement proposal. This does not have to be complete, just outline how we intend to implement this. OpenShift members have reviewed this and given their general approval. The OEP does not need to be complete or merged.
  • We have CRD scaffolding that CVO creates and CMO watches
  • We have a clear idea for a migration path. Even with a feature gate in place we may not simply switch config mechanisms, i.e. we must have a mechanism to merge settings from the config maps and CR, with the CR taking precedence.
  • We have at least one or more fields, CMO can act upon. For example
    • a bool field telling CMO to use the config map for configuration
    • ...

Feature parity should be planned in one or more separate epics.

This story covers the implementation of our initial CRD in CMO. When the feature gate is enabled, CMO watches a singleton CR (name TBD) and acts on changes. The initial feature could be a boolean flag (defaulting to true) that tells CMO to merge the config map settings. If a user sets this flag to false, the config map is ignored and default settings are applied.
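
Purely as a hypothetical sketch of what such a singleton CR could look like (the group, version, kind, and field names below are invented for illustration and are not a committed API):

apiVersion: monitoring.openshift.io/v1alpha1     # hypothetical group/version
kind: ClusterMonitoring                          # hypothetical kind; the real name is still TBD
metadata:
  name: cluster                                  # singleton
spec:
  useConfigMap: true                             # hypothetical flag; true = keep merging openshift-monitoring/cluster-monitoring-config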

The history of this epic starts with this PR, which triggered a lengthy conversation around the workings of the image API with respect to importing imagestream images as single vs manifest-listed. Imagestreams today default the `importMode` flag to `Legacy` to avoid breaking the behavior of existing clusters in the field. This makes sense for single-arch clusters deployed with a single-arch payload, but when users migrate to the multi payload, more often than not their intent is to add nodes of other architecture types. When this happens, problems arise when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality for existing users who just want to create an imagestream and use it with existing commands.

There was a discussion with David Eads and other staff engineers, and it was decided that the approach to take is to default imagestreams' importMode to `PreserveOriginal` if the cluster is installed with, or upgraded to, a multi payload. A few things need to happen to achieve this:

  • CVO would need to expose a field in the status section indicative of the type of payload in the cluster (single vs multi)
  • cluster-openshift-apiserver-operator would read this field and add it to the apiserver configmap. openshift-apiserver would use this value to determine the setting of importMode value.
  • Document clearly that the behavior of imagestreams in a cluster with multi payload is different from the traditional single payload
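
For context, the import mode is already surfaced on the ImageStream API and the oc client; for example, a tag can request the full manifest list roughly like this (stream and image names are illustrative):

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: example                                  # illustrative
spec:
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: quay.io/example/app:latest           # illustrative
    importPolicy:
      importMode: PreserveOriginal               # keep the whole manifest list instead of a single sub-manifest

# or via the CLI:
# oc import-image example:latest --from=quay.io/example/app:latest --confirm --import-mode=PreserveOriginal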

Some open questions:

  • What happens to existing imagestreams on upgrades?
  • How do we handle CVO-managed imagestreams? (IMO, CVO-managed imagestreams should always set importMode to PreserveOriginal as the images are associated with the payload.)

 

This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.

This task focuses on ensuring that all OpenStack resources automatically created by Hypershift for Hosted Control Planes are tagged with a unique identifier, such as the HostedCluster ID. These resources include, but are not limited to, servers, ports, and security groups. Proper tagging will enable administrators to clearly identify and manage resources associated with specific OpenShift clusters.

Acceptance Criteria:

  1. Tagging Mechanism: All OpenStack resources created by Hypershift (e.g., servers, ports, security groups) should be tagged with the relevant Cluster ID or other unique identifiers.
  2. Automated Tagging: The tagging should occur automatically when resources are provisioned by Hypershift.
  3. Consistency: Tags should follow a standardized naming convention for easy identification (e.g., cluster-id: <ClusterID>).
  4. Compatibility: Ensure the solution is compatible with the current Hypershift setup for OpenShift Hosted Control Planes, without disrupting functionality.
  5. Testing: Create automated tests or manual test procedures to verify that the resources are properly tagged when created.
  6. Documentation: Update relevant documentation to inform administrators about the new tagging system, how to filter resources by tags, and any related configurations.
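
As an example of how an administrator might later filter by such tags (the CLI invocations and the tag string are illustrative and assume the standard OpenStack client tag filters):

openstack server list --tags cluster-id=<ClusterID>
openstack port list --tags cluster-id=<ClusterID>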

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Before implementing any changes in the API, we need to send an enhancement proposal containing the design changes we suggest for openshift/api/config/v1/types_cluster_version.go to allow changing the log level of the CVO through an API configuration.
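
Purely as a hypothetical illustration of the kind of change being proposed (the field is invented for this sketch and mirrors the logLevel convention used by operator.openshift.io operators; it does not exist until the enhancement and API change merge):

oc patch clusterversion version --type merge -p '{"spec":{"logLevel":"Debug"}}'   # hypothetical field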

Definition of Done:

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • We want to remove official UPI and IPI support for the Alibaba Cloud provider. Going forward, we recommend installations on Alibaba Cloud with either the external platform or the agnostic platform installation method.

Why is this important?

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:

(1) Low customer interest of using Openshift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Scenarios

Impacted areas based on CI:

alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI jobs are removed
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Acceptance Criteria

  • Since api and library-go are the last projects for removal, remove only alibaba specific code and vendoring

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

In the ASH ARM template 06_workers.json[1], there is an unused variable "identityName" defined. This is harmless, but a little odd to be present in the official UPI installation doc[2], and it might confuse users when installing a UPI cluster on ASH.

[1] https://github.com/openshift/installer/blob/master/upi/azurestack/06_workers.json#L52
[2]  https://docs.openshift.com/container-platform/4.17/installing/installing_azure_stack_hub/upi/installing-azure-stack-hub-user-infra.html#installation-arm-worker_installing-azure-stack-hub-user-infra

We suggest removing it from the ARM template.
    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Initially, the clusters at version 4.16.9 were having issues with reconciling the IDP. The error which was found in Dynatrace was

 

  "error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: Service Unavailable",  

 

Initially it was assumed that the IDP service was unavailable, but the customer confirmed that they also have the GroupSync operator running inside all clusters, which can successfully connect to the customer IDP and sync User + Group information from the IDP into the cluster.

The customer was advised to upgrade to 4.16.18, keeping in mind a few other OCPBUGS which were related to the proxy and would be resolved by upgrading to 4.16.15+.

However, after the upgrade, the IDP still seems to be failing to apply. It looks like the IDP reconciler isn't considering the Additional Trust Bundle for the customer proxy.

Checking DT Logs, it seems to fail to verify the certificate

"error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority",

  "error": "failed to update control plane: [failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority, failed to update status: Operation cannot be fulfilled on hostedcontrolplanes.hypershift.openshift.io \"rosa-staging\": the object has been modified; please apply your changes to the latest version and try again]", 

Version-Release number of selected component (if applicable):

4.16.18

How reproducible:

Customer has a few clusters deployed and each of them has the same issue.    

Steps to Reproduce:

    1. Create a HostedCluster with a proxy configuration that specifies an additionalTrustBundle, and an OpenID idp that can be publicly verified (ie. EntraID or Keycloak with LetsEncrypt certs)
    2. Wait for the cluster to come up and try to use the IDP
    3.
    

Actual results:

IDP is failing to work for HCP

Expected results:

IDP should be working for the clusters

Additional info:

    The issue will happen only if the IDP does not require a custom trust bundle to be verified.

Description of problem:

The initial set of default endpoint overrides we specified in the installer is missing a v1 at the end of the DNS services override.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Created a service for the DNS server for secondary networks in OpenShift Virtualization using MetalLB, but the IP is still pending; when accessing the service from the UI, it crashes.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

    

Steps to Reproduce:

    1. Create an IP pool (for example, with 1 IP) for MetalLB and fully utilize the IP range (with another service)
    2. Allocate a new IP using the oc expose command as below
    3. Check the service status in the UI
    

Actual results:

UI crash

Expected results:

Should show the service status

Additional info:

oc expose -n openshift-cnv deployment/secondary-dns --name=dns-lb --type=LoadBalancer --port=53 --target-port=5353 --protocol='UDP'

Description of problem:

Improve tests to remove the issue in the following helm test case:
HR-08-TC02: Helm Release - Perform the helm chart upgrade for an already upgraded helm chart

The following error originated from your application code, not from Cypress. It was caused by an unhandled promise rejection.

  > Cannot read properties of undefined (reading 'repoName')

When Cypress detects uncaught errors originating from your application it will automatically fail the current test.

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

When the master MCP is paused, the below alert is triggered:
Failed to resync 4.12.35 because: Required MachineConfigPool 'master' is paused

The nodes have been rebooted to make sure there is no pending MC rollout.

Affects version

  4.12
    

How reproducible:

Steps to Reproduce:

    1. Create a MC and apply it to master
    2. use below mc
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-cgroupsv2
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=1
    3. Wait until the nodes are rebooted and running
    4. Pause the MCP
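
Step 4 above can be done with a merge patch on the pool, for example:

oc patch mcp master --type merge -p '{"spec":{"paused":true}}'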

Actual results:

Pausing the MCP causes the alert to fire.
    

Expected results:


Alerts should not be fired

Additional info:

    

Description of problem:

The story is to track the i18n upload/download routine tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In CAPI, we use a random machineNetwork instead of using the one passed in by the user. 

Description of problem:

Due to the recent changes, using oc 4.17 adm node-image commands on a 4.18 OCP cluster doesn't work.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. oc adm node-image create / monitor
    2.
    3.
    

Actual results:

    The commands fail

Expected results:

    The commands should work as expected

Additional info:

    

Description of problem:

Currently, both the nodepool controller and the CAPI controller set the updatingConfig condition on nodepool upgrades. We should only use one to set the condition, to avoid constant switching between conditions and to ensure the logic used for setting this condition is the same.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    CAPI and Nodepool controller set a different status because their logic is not consistent.

Expected results:

    CAPI and Nodepool controllers set the same status because their logic is consolidated.

Additional info:

    

Description of problem:

    We are currently using node 18, but our types are for node 10

Version-Release number of selected component (if applicable):

    4.19.0

How reproducible:

    Always

Steps to Reproduce:

    1. Open frontend/package.json
    2. Observe @types/node and engine version
    3.
    

Actual results:

    They are different 

Expected results:

    They are the same

Additional info:

    

Description of problem:

checked in 4.18.0-0.nightly-2024-12-05-103644/4.19.0-0.nightly-2024-12-04-031229, OCPBUGS-34533 is reproduced on 4.18+, no such issue with 4.17 and below.

Steps: log in to the admin console or developer console (admin console: go to the "Observe -> Alerting -> Silences" tab; developer console: go to the "Observe -> Silences" tab), create a silence, and edit the "Until" option. Even with a valid timestamp (or an invalid one), you will get the error "[object Object]" in the "Until" field. See screen recording: https://drive.google.com/file/d/14JYcNyslSVYP10jFmsTaOvPFZSky1eg_/view?usp=drive_link

The 4.17 fix for OCPBUGS-34533 is already in the 4.18+ code.

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always

Steps to Reproduce:

1. see the descriptions

Actual results:

Unable to edit "until" filed in silences

Expected results:

able to edit "until" filed in silences

Description of problem:

v1alpha1 schema is still present in the v1 ConsolePlugin CRD and should be removed manually since the generator is re-adding it automatically.    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

To do

  • Change Over Dockerfile base images
  • Double-check image versions in new e2e configs, e.g. initial-4.17, n1minor, n2minor, etc.
  • Do we still need hypershift-aws-e2e-4.17 on newer branches (Seth)
  • MCE config file in release repo
  • Add n-1 e2e test on e2e test file change

Description of problem:

    "destroy cluster" doesn't delete the PVC disks which have the label "kubernetes-io-cluster-<infra-id>: owned"

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-11-27-162629

How reproducible:

    Always

Steps to Reproduce:

1. include the step which sets the cluster default storageclass to the hyperdisk one before ipi-install (see my debug PR https://github.com/openshift/release/pull/59306)
2. "create cluster", and make sure it succeeds
3. "destroy cluster"

Note: although we confirmed the issue with disk type "hyperdisk-balanced", we believe other disk types have the same issue.
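
To illustrate how such orphaned disks can be located after a destroy, the label the uninstaller keys on can be used as a gcloud filter (substitute the infra ID; the invocation is illustrative):

gcloud compute disks list --filter="labels.kubernetes-io-cluster-<infra-id>=owned"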

Actual results:

    The 2 PVC disks of hyperdisk-balanced type are not deleted during "destroy cluster", although the disks have the label "kubernetes-io-cluster-<infra-id>: owned".

Expected results:

    The 2 PVC disks should be deleted during "destroy cluster", because they have the correct/expected labels according to which the uninstaller should be able to detect them. 

Additional info:

    FYI the PROW CI debug job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/59306/rehearse-59306-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.18-installer-rehearse-debug/1861958752689721344

Description of problem:

When the TechPreviewNoUpgrade feature gate is enabled, the console shows a customized 'Create Project' modal to all users.
In the customized modal, the 'Display name' and 'Description' values the user typed are not taking effect.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-16-065305    

How reproducible:

Always when TechPreviewNoUpgrade feature gate is enabled    

Steps to Reproduce:

1. Enable TechPreviewNoUpgrade feature gate
$ oc patch  featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type merge
2. As any user, log in to the console and create a project from the web: set 'Display name' and 'Description', then click 'Create'
3. Check created project YAML
$ oc get project ku-5 -o json | jq .metadata.annotations
{
  "openshift.io/description": "",
  "openshift.io/display-name": "",
  "openshift.io/requester": "kube:admin",
  "openshift.io/sa.scc.mcs": "s0:c28,c17",
  "openshift.io/sa.scc.supplemental-groups": "1000790000/10000",
  "openshift.io/sa.scc.uid-range": "1000790000/10000"
} 

Actual results:

display-name and description are all empty    

Expected results:

display-name and description should be set to the values user had configured    

Additional info:

once TP is enabled, the customized create project modal looks like https://drive.google.com/file/d/1HmIlm0u_Ia_TPsa0ZAGyTloRmpfD0WYk/view?usp=drive_link

Description of problem:

When attempting to install a specific version of an operator from the web console, the install plan of the latest version of that operator is created if the operator version had a + in it. 

Version-Release number of selected component (if applicable):

4.17.6 (Tested version)    

How reproducible:

Easily reproducible   

Steps to Reproduce:

1. Under Operators > Operator Hub, install an operator with a + character in the version.
2. On the next screen, note that the + in the version text box is missing.
3. Make no changes to the default options and proceed to install the operator.
4. An install plan is created to install the operator with the latest version from the channel. 

Actual results:

The install plan is created for the latest version from the channel. 

Expected results:

The install plan is created for the requested version. 

Additional info:

Notes on the reproducer:
- For step 1: the selected version shouldn't be the latest version from the channel for the purposes of this bug. 
- For step 1: The version will need to be selected from the version dropdown to reproduce the bug. If the default version that appears in the dropdown is used, then the bug won't reproduce. 
 
Other Notes: 
- This might also happen with other special characters in the version string other than +, but this is not something that I tested.    

Description of problem:

The StaticPodOperatorStatus API validations permit:
- nodeStatuses[].currentRevision can be cleared and can decrease
- more than one entry in nodeStatuses can have a targetRevision > 0
But both of these signal a bug in one or more of the static pod controllers that write to them.

Version-Release number of selected component (if applicable):

This has been the case ~forever but we are aware of bugs in 4.18+ that are resulting in controllers trying to make these invalid writes. We also have more expressive validation mechanisms today that make it possible to plug the holes.
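
For illustration, a minimal sketch of the kind of transition rule that could close the first gap (a decreasing currentRevision), assuming the standard x-kubernetes-validations CEL mechanism; this is illustrative, not the actual openshift/api change:

~~~yaml
# hypothetical CRD schema fragment for nodeStatuses[].currentRevision
properties:
  currentRevision:
    type: integer
    x-kubernetes-validations:
    - rule: "self >= oldSelf"            # transition rule: new value may not be lower than the old one
      message: "currentRevision must not decrease"
~~~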

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After the upgrade to OpenShift Container Platform 4.17, it's being observed that aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics is reporting a target down state. When checking the newly created container, one can find the logs below, which may explain the reported effect.

$ oc logs aws-efs-csi-driver-controller-5b8d5cfdf4-zwh67 -c kube-rbac-proxy-8211
W1119 07:53:10.249934       1 deprecated.go:66] 
==== Removed Flag Warning ======================

logtostderr is removed in the k8s upstream and has no effect any more.

===============================================
		
I1119 07:53:10.250382       1 kube-rbac-proxy.go:233] Valid token audiences: 
I1119 07:53:10.250431       1 kube-rbac-proxy.go:347] Reading certificate files
I1119 07:53:10.250645       1 kube-rbac-proxy.go:395] Starting TCP socket on 0.0.0.0:9211
I1119 07:53:10.250944       1 kube-rbac-proxy.go:402] Listening securely on 0.0.0.0:9211
I1119 07:54:01.440714       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:19.860038       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:31.432943       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:49.852801       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:01.433635       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:19.853259       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:31.432722       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:49.852606       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:01.432707       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:19.853137       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:31.440223       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:49.856349       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:01.432528       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:19.853132       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:31.433104       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:49.852859       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:58:01.433321       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:58:19.853612       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.17

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.17
2. Install aws-efs-csi-driver-operator
3. Create efs.csi.aws.com CSIDriver object and wait for the aws-efs-csi-driver-controller to roll out.

Actual results:

The below Target Down Alert is being raised

50% of the aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics targets in Namespace openshift-cluster-csi-drivers namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.

Expected results:

The ServiceMonitor endpoint should be reachable and properly responding with the desired information to monitor the health of the component.

Additional info:


When deploying with an endpoint override for the resourceController, the Power VS machine API provider will ignore the override.

Creating clusters in which machines are created in a public subnet and use a public IP makes it possible to avoid creating NAT gateways (or proxies) for AWS clusters. While not applicable for every test, this configuration will save us money and cloud resources.

Description of problem:

    If the install is performed with an AWS user missing the `ec2:DescribeInstanceTypeOfferings` permission, the installer will use a hardcoded instance type from the set of non-edge machine pools. This can potentially cause the edge node to fail during provisioning, since the instance type doesn't take into account edge/wavelength zones support.

Because edge nodes are not needed for the installation to complete, the issue is not noticed by the installer, only by inspecting the status of the edge nodes.

Version-Release number of selected component (if applicable):

    4.16+ (since edge nodes support was added)

How reproducible:

    always

Steps to Reproduce:

    1. Specify an edge machine pool in the install-config without an instance type
    2. Run the install with a user without `ec2:DescribeInstanceTypeOfferings`
    3.
    

Actual results:

    In CI the `node-readiness` test step will fail and the edge nodes will show

                    errorMessage: 'error launching instance: The requested configuration is currently not supported. Please check the documentation for supported configurations.'         
                    errorReason: InvalidConfiguration
              

Expected results:

    Either
1. the permission is always required when instance type is not set for an edge pool; or
2.  a better instance type default is used

Additional info:

    Example CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1862140149505200128
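
As a possible workaround until a better default (or a hard permission requirement) lands, explicitly setting an instance type on the edge pool bypasses the hardcoded fallback. A minimal install-config sketch, where the instance type and zone name are illustrative and must actually be offered in the target Local/Wavelength Zone:

~~~yaml
compute:
- name: edge
  replicas: 1
  platform:
    aws:
      type: m5.2xlarge          # illustrative; must be available in the selected edge zone
      zones:
      - us-east-1-nyc-1a        # illustrative Local Zone name
~~~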

Description of problem:

    When deploying with endpoint overrides, the block CSI driver will try to use the default endpoints rather than the ones specified.

Description of problem:


In order to test OCL we run e2e automated test cases in a cluster that has OCL enabled in master and worker pools.

We have seen that rarely a new machineconfig is rendered but no MOSB resource is created.




    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Rare
    

Steps to Reproduce:

We don't have any steps to reproduce it. It happens eventually when we run a regression in a cluster with OCL enabled in master and worker pools.


    

Actual results:

We see that in some scenarios a new MC is created, then a new rendered MC is created too, but no MOSB is created and the pool is stuck forever.

    

Expected results:

Whenever a new rendered MC is created, a new MOSB should be created too to build the new image.

    

Additional info:

In the comments section we will add all the must-gather files that are related to this issue.


In some scenarios we can see this error reported by the os-builder pod:


2024-12-03T16:44:14.874310241Z I1203 16:44:14.874268       1 request.go:632] Waited for 596.269343ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/secrets?labelSelector=machineconfiguration.openshift.io%2Fephemeral-build-object%2Cmachineconfiguration.openshift.io%2Fmachine-os-build%3Dmosc-worker-5fc70e666518756a629ac4823fc35690%2Cmachineconfiguration.openshift.io%2Fon-cluster-layering%2Cmachineconfiguration.openshift.io%2Frendered-machine-config%3Drendered-worker-7c0a57dfe9cd7674b26bc5c030732b35%2Cmachineconfiguration.openshift.io%2Ftarget-machine-config-pool%3Dworker


Nevertheless, we only see this error in some of them, not in all of them.

    

Description of problem:

checked on 4.18.0-0.nightly-2024-12-07-130635/4.19.0-0.nightly-2024-12-07-115816, admin console, go to alert details page, "No datapoints found." on alert details graph. see picture for CannotRetrieveUpdates alert: https://drive.google.com/file/d/1RJCxUZg7Z8uQaekt39ux1jQH_kW9KYXd/view?usp=drive_link

issue exists in 4.18+, no such issue with 4.17

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always on 4.18+    

Steps to Reproduce:

1. see the description
    

Actual results:

"No datapoints found." on alert details graph

Expected results:

show correct graph

Description of problem:

AlertmanagerConfig with missing options causes Alertmanager to crash

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

A cluster administrator has enabled monitoring for user-defined projects.
CMO 

~~~
 config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 7d
~~~

A cluster administrator has enabled alert routing for user-defined projects. 

UWM cm / CMO cm 

~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true 
      enableAlertmanagerConfig: true
~~~

verify existing config: 

~~~
$ oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093  
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
route:
  receiver: Default
  group_by:
  - namespace
  continue: false
receivers:
- name: Default
templates: []
~~~

create alertmanager config without options "smtp_from:" and "smtp_smarthost"

~~~
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: example-namespace
spec:
  receivers:
    - emailConfigs:
        - to: some.username@example.com
      name: custom-rules1
  route:
    matchers:
      - name: alertname
    receiver: custom-rules1
    repeatInterval: 1m
~~~

check logs for alertmanager: the following error is seen. 

~~~
ts=2023-09-05T12:07:33.449Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="no global SMTP smarthost set"
~~~ 

Actual results:

Alertmanager fails to restart.

Expected results:

The AlertmanagerConfig should be validated up front so that an invalid config cannot crash Alertmanager.

Additional info:

Reproducible with and without user workload Alertmanager.
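
As a workaround sketch (using the from and smarthost fields of the monitoring.coreos.com/v1alpha1 EmailConfig schema; the sender address and SMTP relay below are illustrative), setting the SMTP options per receiver avoids relying on the missing global settings:

~~~yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: example-namespace
spec:
  receivers:
    - name: custom-rules1
      emailConfigs:
        - to: some.username@example.com
          from: alerts@example.com           # illustrative sender address
          smarthost: smtp.example.com:587    # illustrative SMTP relay
  route:
    matchers:
      - name: alertname
    receiver: custom-rules1
    repeatInterval: 1m
~~~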

Description of problem

When updating a 4.13 cluster to 4.14, the new-in-4.14 ImageRegistry capability will always be enabled, because capabilities cannot be uninstalled.

Version-Release number of selected component (if applicable)

4.14 oc should learn about this, so it will appropriately extract registry CredentialsRequests when connecting to 4.13 clusters for 4.14 manifests. 4.15 oc will get OTA-1010 to handle this kind of issue automatically, but there's no problem with getting an ImageRegistry hack into 4.15 engineering candidates in the meantime.

How reproducible

100%

Steps to Reproduce

1. Connect your oc to a 4.13 cluster.
2. Extract manifests for a 4.14 release.
3. Check for ImageRegistry CredentialsRequests.

Actual results

$ oc adm upgrade | head -n1
Cluster version is 4.13.12
$ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64
$ grep -r ImageRegistry credentials-requests
...no hits...

Expected results

$ oc adm upgrade | head -n1
Cluster version is 4.13.12
$ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64
$ grep -r ImageRegistry credentials-requests
credentials-requests/0000_50_cluster-image-registry-operator_01-registry-credentials-request.yaml:    capability.openshift.io/name: ImageRegistry

Additional info

We already do this for MachineAPI. The ImageRegistry capability landed later, and this is us catching the oc-extract hack up with that change.

The cluster-baremetal-operator sets up a number of watches for resources using Owns() that have no effect because the Provisioning CR does not (and should not) own any resources of the given type or using EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace are different from that of the Provisioning CR.

The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.

The correct way to trigger a reconcile of the provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form.

See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.

 

Description of problem:

    Some bundles in the catalog have been given the property in the FBC (and not in the bundle's CSV), and it does not get propagated through to the Helm chart annotations.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Install elasticsearch 5.8.13

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    cluster is upgradeable

Expected results:

    cluster is not upgradeable

Additional info:

    

Description of problem:

For various reasons, Pods may get evicted. Once they are evicted, the owner of the Pod should recreate the Pod so it is scheduled again.

With OLM, we can see that evicted Pods owned by CatalogSources are not rescheduled. The outcome is that all Subscriptions have a "ResolutionFailed=True" condition, which hinders an upgrade of the operator. Specifically, the customer is seeing that the affected CatalogSource is "multicluster-engine-CENSORED_NAME-redhat-operator-index" in the openshift-marketplace namespace, pod name: "multicluster-engine-CENSORED_NAME-redhat-operator-index-5ng9j"

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.16.21

How reproducible:

Sometimes, when Pods are evicted on the cluster

Steps to Reproduce:

1. Set up an OpenShift Container Platform 4.16 cluster, install various Operators
2. Create a condition that a Node will evict Pods (for example by creating DiskPressure on the Node)
3. Observe if any Pods owned by CatalogSources are being evicted

Actual results:

If Pods owned by CatalogSources are being evicted, they are not recreated / rescheduled.

Expected results:

When Pods owned by CatalogSources are evicted, they are recreated / rescheduled.

Additional info:

Description of problem:

    A similar testing scenario to OCPBUGS-38719, but with the pre-existing dns private zone is not a peering zone, instead it is a normal dns zone which binds to another VPC network. And the installation will fail finally, because the dns record-set "*.apps.<cluster name>.<base domain>" is added to the above dns private zone, rather than the cluster's dns private zone. 

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-10-24-093933

How reproducible:

    Always

Steps to Reproduce:

    Please refer to the steps told in https://issues.redhat.com/browse/OCPBUGS-38719?focusedId=25944076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25944076

Actual results:

    The installation failed, due to the cluster operator "ingress" degraded

Expected results:

    The installation should succeed.

Additional info:

    

Description of problem

From our docs:

Due to fundamental Kubernetes design, all OpenShift Container Platform updates between minor versions must be serialized. You must update from OpenShift Container Platform <4.y> to <4.y+1>, and then to <4.y+2>. You cannot update from OpenShift Container Platform <4.y> to <4.y+2> directly. However, administrators who want to update between two even-numbered minor versions can do so incurring only a single reboot of non-control plane hosts.

We should add a new precondition that enforces that policy, so cluster admins who run --to-image ... don't hop straight from 4.y.z to 4.(y+2).z' or similar without realizing that they were outpacing testing and policy.

Version-Release number of selected component

The policy and current lack-of guard both date back to all OCP 4 releases, and since they're Kube-side constraints, they may date back to the start of Kube.

How reproducible

Every time.

Steps to Reproduce

1. Install a 4.y.z cluster.
2. Use --to-image to request an update to a 4.(y+2).z release.
3. Wait a few minutes for the cluster-version operator to consider the request.
4. Check with oc adm upgrade.

Actual results

Update accepted.

Expected results

Update rejected (unless it was forced), complaining about the excessively long hop.

Description of problem:


When setting up the "webhookTokenAuthenticator", the OAuth config "type" is set to "None". 
The controller then sets the console ConfigMap with "authType=disabled", which causes the console pod to go into CrashLoopBackOff due to the disallowed type:

Error:
validate.go:76] invalid flag: user-auth, error: value must be one of [oidc openshift], not disabled.

This worked before on 4.14, stopped working on 4.15.
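
For reference, a minimal sketch of the configuration that triggers this, assuming the webhook token authenticator is set on the cluster Authentication resource (the secret name is illustrative):

~~~yaml
apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  name: cluster
spec:
  type: None
  webhookTokenAuthenticator:
    kubeConfig:
      name: my-webhook-kubeconfig   # illustrative kubeconfig secret in openshift-config
~~~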

    

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.15
    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

The console can't start; it seems this configuration is no longer allowed.
    

Expected results:


    

Additional info:


    

Description of problem:

    In an effort to ensure all HA components are not degraded by design during normal e2e test or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run.

This card captures machine-config operator that blips Degraded=True during some ci job runs.


Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1843561357304139776
  
Reasons associated with the blip: MachineConfigDaemonFailed or MachineConfigurationFailed

For now, we put an exception in the test. But it is expected that teams take action to fix those and remove the exceptions after the fix goes in.

Exception is defined here: https://github.com/openshift/origin/blob/e5e76d7ca739b5699639dd4c500f6c076c697da6/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L109


See linked issue for more explanation on the effort.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

During bootstrapping we're running into the following scenario:

4 members: masters 0, 1 and 2 (full voting) and bootstrap (torn down/dead member); a revision rollout causes master 0 to restart and leaves you with 2/4 healthy, which means quorum loss.

This causes apiserver unavailability during the installation and should be avoided.
    

Version-Release number of selected component (if applicable):

4.17, 4.18 but is likely a longer standing issue

How reproducible:

rarely    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

apiserver should not return any errors

Additional info:

    

The following test is failing more than expected:

Undiagnosed panic detected in pod

See the sippy test details for additional context.

Observed in 4.18-e2e-vsphere-ovn-upi-serial/1861922894817267712

Undiagnosed panic detected in pod
{  pods/openshift-machine-config-operator_machine-config-daemon-4mzxf_machine-config-daemon_previous.log.gz:E1128 00:28:30.700325    4480 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<}

Description of problem:

    Pull support from upstream kubernetes (see KEP 4800: https://github.com/kubernetes/enhancements/issues/4800) for LLC alignment support in cpumanager

Version-Release number of selected component (if applicable):

    4.19

How reproducible:

    100%

Steps to Reproduce:

    1. try to schedule a pod which requires exclusive CPU allocation and whose CPUs should be affine to the same LLC block
    2. observe random and likely wrong (not LLC-aligned) allocation
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

DEBUG Creating ServiceAccount for control plane nodes 
DEBUG Service account created for XXXXX-gcp-r4ncs-m 
DEBUG Getting policy for openshift-dev-installer   
DEBUG adding roles/compute.instanceAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/compute.networkAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/compute.securityAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/storage.admin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: failed to add master roles: failed to set IAM policy, unexpected error: googleapi: Error 400: Service account XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com does not exist., badRequest

It appears that the Service account was created correctly. The roles are assigned to the service account. It is possible that there needs to be a "wait for action to complete" on the server side to ensure that this will all be ok.

Version-Release number of selected component (if applicable):

    

How reproducible:

Random. Appears to be a sync issue    

Steps to Reproduce:

    1. Run the installer for a normal GCP basic install
    2.
    3.
    

Actual results:

    Installer fails saying that the Service Account that the installer created does not have the permissions to perform an action. Sometimes it takes numerous tries for this to happen (very intermittent). 

Expected results:

    Successful install

Additional info:

    

Description of problem:

    When creating a kubevirt hosted cluster with the following apiserver publishing configuration

- service: APIServer
    servicePublishingStrategy:
      type: NodePort
      nodePort:
        address: my.hostna.me
        port: 305030

Shows following error:

"failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address"

And network policies are not properly deployed in the virtual machine namespaces.

Version-Release number of selected component (if applicable):

 4.17

How reproducible:

    Always

Steps to Reproduce:

    1. Create a kubevirt hosted cluster with APIServer NodePort publishing using a hostname
    2. Wait for hosted cluster creation.
    

Actual results:

Following error pops up and network policies are not created

"failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address"    

Expected results:

    No error pops up and network policies are created.

Additional info:

    This is where the error originates -> https://github.com/openshift/hypershift/blob/ef8596d4d69a53eb60838ae45ffce2bca0bfa3b2/hypershift-operator/controllers/hostedcluster/network_policies.go#L644

    That error prevents the network policies from being created.

Description of problem:
machine-approver logs

E0221 20:29:52.377443       1 controller.go:182] csr-dm7zr: Pending CSRs: 1871; Max pending allowed: 604. Difference between pending CSRs and machines > 100. Ignoring all CSRs as too many recent pending CSRs seen

.

oc get csr |wc -l
3818
oc get csr |grep "node-bootstrapper" |wc -l
2152

By approving the pending CSRs manually I can get the cluster to scale up.

We can increase the maxPending to a higher number https://github.com/openshift/cluster-machine-approver/blob/2d68698410d7e6239dafa6749cc454272508db19/pkg/controller/controller.go#L330 

 

Description of problem:

"Cannot read properties of undefined (reading 'state')" Error in search tool when filtering Subscriptions while adding new Subscriptions

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. As an Administrator, go to Home -> Search and filter by Subscription component
    2. Start creating subscriptions (bulk)
 
    

Actual results:

    The filtered results turn into the "Oh no! Something went wrong" view

Expected results:

    Get updated results every few seconds

Additional info:

If the view is reloaded, the error goes away.    

 

Stack Trace:

TypeError: Cannot read properties of undefined (reading 'state')
    at L (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/subscriptions-chunk-89fe3c19814d1f6cdc84.min.js:1:3915)
    at na (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:58879)
    at Hs (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:111315)
    at Sc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98327)
    at Cc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98255)
    at _c (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98118)
    at pc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:95105)
    at https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44774
    at t.unstable_runWithPriority (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:289:3768)
    at Uo (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44551) 

 

Description of problem:

    4.18 HyperShift operator's NodePool controller fails to serialize NodePool ConfigMaps that contain ImageDigestMirrorSet. Inspecting the code, it fails on NTO reconciliation logic, where only machineconfiguration API schemas are loaded into the YAML serializer: https://github.com/openshift/hypershift/blob/f7ba5a14e5d0cf658cf83a13a10917bee1168011/hypershift-operator/controllers/nodepool/nto.go#L415-L421

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    100%

Steps to Reproduce:

    1. Install 4.18 HyperShift operator
    2. Create NodePool with configuration ConfigMap that includes ImageDigestMirrorSet
    3. HyperShift operator fails to reconcile NodePool

Actual results:

    HyperShift operator fails to reconcile NodePool

Expected results:

    HyperShift operator to successfully reconcile NodePool

Additional info:

    Regression introduced by https://github.com/openshift/hypershift/pull/4717
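
For step 2, a minimal reproduction sketch of such a configuration ConfigMap (names, namespace, and registry hosts are illustrative; the ConfigMap is assumed to be referenced from the NodePool's spec.config):

~~~yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-idms
  namespace: clusters
data:
  config: |
    apiVersion: config.openshift.io/v1
    kind: ImageDigestMirrorSet
    metadata:
      name: example
    spec:
      imageDigestMirrors:
      - source: registry.example.com/team/app
        mirrors:
        - mirror.example.com/team/app
~~~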

Description of problem:

    Currently check-patternfly-modules.sh checks the PatternFly module versions serially, which could be improved by checking them in parallel. 

Since `yarn why` does not write to anything, this should be easily parallelizable as there is no race condition with writing back to the yarn.lock

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Missing metrics - example: cluster_autoscaler_failed_scale_ups_total 

Version-Release number of selected component (if applicable):

    

How reproducible:

Always 

Steps to Reproduce:

#curl the autoscalers metrics endpoint: 

$ oc exec deployment/cluster-autoscaler-default -- curl -s http://localhost:8085/metrics | grep cluster_autoscaler_failed_scale_ups_total 
    

Actual results:

the metric does not return a value until an event has happened   

Expected results:

The metric counter should be initialized at startup, providing a zero value

Additional info:

I have been through the file: 

https://raw.githubusercontent.com/openshift/kubernetes-autoscaler/master/cluster-autoscaler/metrics/metrics.go 

and checked off the metrics that do not appear when scraping the metrics endpoint straight after deployment. 

the following metrics are in metrics.go but are missing from the scrape

~~~
node_group_min_count
node_group_max_count
pending_node_deletions
errors_total
scaled_up_gpu_nodes_total
failed_scale_ups_total
failed_gpu_scale_ups_total
scaled_down_nodes_total
scaled_down_gpu_nodes_total
unremovable_nodes_count 
skipped_scale_events_count
~~~

 

Description of problem:

CVO manifests contain some feature-gated ones:

  • since at least 4.16, there are feature-gated ClusterVersion CRDs
  • UpdateStatus API feature is delivered through DevPreview (now) and TechPreview (later) feature set

We observed HyperShift CI jobs to fail when adding DevPreview-gated deployment manifests to CVO, which was unexpected. Investigating further, we discovered that HyperShift applies them:

error: error parsing /var/payload/manifests/0000_00_update-status-controller_03_deployment-DevPreviewNoUpgrade.yaml: error converting YAML to JSON: yaml: invalid map key: map[interface {}]interface {}{".ReleaseImage":interface {}(nil)}

But even without these added manifests, this happens for existing ClusterVersion CRD manifests present in the payload:

$ ls -1 manifests/*clusterversions*crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml

In a passing HyperShift CI job, the same log shows that all four manifests are applied instead of just one:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured

Version-Release number of selected component (if applicable):

4.18

How reproducible:

Always

Steps to Reproduce:

1. inspect the cluster-version-operator-*-bootstrap.log of a HyperShift CI job

Actual results:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured

= all four ClusterVersion CRD manifests are applied

Expected results:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created

= ClusterVersion CRD manifest is applied just once

Additional info

I'm filing this card so that I can link it to the "easy" fix https://github.com/openshift/hypershift/pull/5093 which is not the perfect fix, but allows us to add featureset-gated manifests to CVO without breaking HyperShift. It is desirable to improve this even further and actually correctly select the manifests to be applied for CVO bootstrap, but that involves non-trivial logic similar to one used by CVO and it seems to be better approached as a feature to be properly assessed and implemented, rather than a bugfix, so I'll file a separate HOSTEDCP card for that.

Description of problem:

    Some permissions are missing when edge zones are specified in the install-config.yaml, probably those related to Carrier Gateways (but maybe more)

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always with minimal permissions

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    time="2024-11-20T22:40:58Z" level=debug msg="\tfailed to describe carrier gateways in vpc \"vpc-0bdb2ab5d111dfe52\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-girt7h2j-4515a-minimal-perm is not authorized to perform: ec2:DescribeCarrierGateways because no identity-based policy allows the ec2:DescribeCarrierGateways action"

Expected results:

    All required permissions are listed in pkg/asset/installconfig/aws/permissions.go

Additional info:

    See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9222/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1859351015715770368 for a failed min-perms install

Description of problem:

    When using PublicIPv4Pool, CAPA will try to allocate IP address in the supplied pool which requires the `ec2:AllocateAddress` permission

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always

Steps to Reproduce:

    1. Minimal permissions and publicIpv4Pool set
    2.
    3.
    

Actual results:

    time="2024-11-21T05:39:49Z" level=debug msg="E1121 05:39:49.352606     327 awscluster_controller.go:279] \"failed to reconcile load balancer\" err=<"
time="2024-11-21T05:39:49Z" level=debug msg="\tfailed to allocate addresses to load balancer: failed to allocate address from Public IPv4 Pool \"ipv4pool-ec2-0768267342e327ea9\" to role lb-apiserver: failed to allocate Elastic IP for \"lb-apiserver\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-2cr41ill-663fd-minimal-perm is not authorized to perform: ec2:AllocateAddress on resource: arn:aws:ec2:us-east-1:460538899914:ipv4pool-ec2/ipv4pool-ec2-0768267342e327ea9 because no identity-based policy allows the ec2:AllocateAddress action. Encoded authorization failure message: Iy1gCtvfPxZ2uqo-SHei1yJQvNwaOBl5F_8BnfeEYCLMczeDJDdS4fZ_AesPLdEQgK7ahuOffqIr--PWphjOUbL2BXKZSBFhn3iN9tZrDCnQQPKZxf9WaQmSkoGNWKNUGn6rvEZS5KvlHV5vf5mCz5Bk2lk3w-O6bfHK0q_dphLpJjU-sTGvB6bWAinukxSYZ3xbirOzxfkRfCFdr7nDfX8G4uD4ncA7_D-XriDvaIyvevWSnus5AI5RIlrCuFGsr1_3yEvrC_AsLENZHyE13fA83F5-Abpm6-jwKQ5vvK1WuD3sqpT5gfTxccEqkqqZycQl6nsxSDP2vDqFyFGKLAmPne8RBRbEV-TOdDJphaJtesf6mMPtyMquBKI769GW9zTYE7nQzSYUoiBOafxz6K1FiYFoc1y6v6YoosxT8bcSFT3gWZWNh2upRJtagRI_9IRyj7MpyiXJfcqQXZzXkAfqV4nsJP8wRXS2vWvtjOm0i7C82P0ys3RVkQVcSByTW6yFyxh8Scoy0HA4hTYKFrCAWA1N0SROJsS1sbfctpykdCntmp9M_gd7YkSN882Fy5FanA"
time="2024-11-21T05:39:49Z" level=debug msg="\t\tstatus code: 403, request id: 27752e3c-596e-43f7-8044-72246dbca486"

Expected results:

    

Additional info:

Seems to happen consistently with shared-vpc-edge-zones CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-edge-zones/1860015198224519168    
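
For context, a minimal sketch of the install-config fragment that exercises this path (the pool ID is illustrative):

~~~yaml
platform:
  aws:
    region: us-east-1
    publicIpv4Pool: ipv4pool-ec2-0123456789abcdef0   # illustrative BYO public IPv4 pool ID
~~~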

Description of problem:

The LB name should be yunjiang-ap55-sk6jl-ext-a6aae262b13b0580, rather than ending with ELB service endpoint (elb.ap-southeast-5.amazonaws.com):

	failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed provisioning resources after infrastructure ready: failed to find HostedZone ID for NLB: failed to list load balancers: ValidationError: The load balancer name 'yunjiang-ap55-sk6jl-ext-a6aae262b13b0580.elb.ap-southeast-5.amazonaws.com' cannot be longer than '32' characters\n\tstatus code: 400, request id: f8adce67-d844-4088-9289-4950ce4d0c83

Checking the tag value, the value of Name key is correct: yunjiang-ap55-sk6jl-ext


    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-30-141716
    

How reproducible:

always
    

Steps to Reproduce:

    1. Deploy a cluster on ap-southeast-5
    2.
    3.
    

Actual results:

The LB can not be created
    

Expected results:

Create a cluster successfully.
    

Additional info:

No such issues on other AWS regions.
    

Description of problem:

oc adm node-image create --pxe does not generate only PXE artifacts, but copies everything from the node-joiner pod. Also, the names of the PXE artifacts are not correct (prefixed with agent instead of node).

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

    1. oc adm node-image create --pxe

Actual results:

    All the files from the node-joiner pod are copied. The PXE artifact names are wrong.

Expected results:

    In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img
* node.x86_64-rootfs.img
* node.x86_64-vmlinuz

Additional info:

    

Description of problem:

    As more systems have been added to Power VS, the assumption that every zone in a region has the same set of systypes has been broken. To properly represent what system types are available, the powervs_regions struct needed to be altered and parts of the installer referencing it needed to be updated.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Try to deploy with s1022 in dal10
    2. SysType not available, even though it is a valid option in Power VS.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When running the delete command on oc-mirror after a mirrorToMirror, the graph-image is not being deleted.
    

Version-Release number of selected component (if applicable):

    

How reproducible:
With the following ImageSetConfiguration (use the same for the DeleteImageSetConfiguration only changing the kind and the mirror to delete)

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.13
      minVersion: 4.13.10
      maxVersion: 4.13.10
    graph: true
    

Steps to Reproduce:

    1. Run mirror to mirror
./bin/oc-mirror -c ./alex-tests/alex-isc/isc.yaml --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 docker://localhost:6000 --v2 --dest-tls-verify=false

    2. Run the delete --generate
./bin/oc-mirror delete -c ./alex-tests/alex-isc/isc-delete.yaml --generate --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 --delete-id clid-230-delete-test docker://localhost:6000 --v2 --dest-tls-verify=false

    3. Run the delete
./bin/oc-mirror delete --delete-yaml-file /home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230/working-dir/delete/delete-images-clid-230-delete-test.yaml docker://localhost:6000 --v2 --dest-tls-verify=false
    

Actual results:

During the delete --generate the graph-image is not being included in the delete file 

2024/10/25 09:44:21  [WARN]   : unable to find graph image in local cache: SKIPPING. %!v(MISSING)
2024/10/25 09:44:21  [WARN]   : reading manifest latest in localhost:55000/openshift/graph-image: manifest unknown

Because of that the graph-image is not being deleted from the target registry

[aguidi@fedora oc-mirror]$ curl http://localhost:6000/v2/openshift/graph-image/tags/list | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    51  100    51    0     0  15577      0 --:--:-- --:--:-- --:--:-- 17000
{
  "name": "openshift/graph-image",
  "tags": [
    "latest"
  ]
}
    

Expected results:

graph-image should be deleted even after mirrorToMirror
    

Additional info:

    

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/5200/pull-ci-openshift-hypershift-main-e2e-openstack/1862228917390151680

{Failed  === RUN   TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations
    util.go:1232: 
        the pod  openstack-manila-csi-controllerplugin-676cc65ffc-tnnkb is not in the audited list for safe-eviction and should not contain the safe-to-evict-local-volume annotation
        Expected
            <string>: socket-dir
        to be empty
        --- FAIL: TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations (0.02s)
} 

Description of problem:

We have an OKD 4.12 cluster which has persistent and increasing ingresswithoutclassname alerts with no ingresses normally present in the cluster. I believe the ingresswithoutclassname being counted is created as part of the ACME validation process managed by the cert-manager operator with its openshift route addon, which is torn down once the ACME validation is complete.

Version-Release number of selected component (if applicable):

 4.12.0-0.okd-2023-04-16-041331

How reproducible:

seems very consistent. went away during an update but came back shortly after and continues to increase.

Steps to Reproduce:

1. create ingress w/o classname
2. see counter increase
3. delete classless ingress
4. counter does not decrease.

Additional info:

https://github.com/openshift/cluster-ingress-operator/issues/912

Description of problem:

Observed in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn/1866088107347021824/artifacts/e2e-gcp-ovn/ipi-install-install/artifacts/.openshift_install-1733747884.log

Distinct issues occurring in this job caused the "etcd bootstrap member to be removed from cluster" gate to take longer than its 5 minute timeout, but there was plenty of time left to complete bootstrapping successfully. It doesn't make sense to have a narrow timeout here because progress toward removal of the etcd bootstrap member begins the moment the etcd cluster starts for the first time, not when the installer starts waiting to observe it.

Version-Release number of selected component (if applicable):

4.19.0

How reproducible:

Sometimes

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

 

Incorrect capitalization: `Lightspeed` appears as `LightSpeed` in the ja and zh translations.

 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

This is part of the plan to improve stability of ipsec in ocp releases.

There are several regressions identified in libreswan-4.9 (default in 4.14.z and 4.15.z) which need to be addressed in an incremental approach. The first step is to introduce libreswan-4.6-3.el9_0.3, which is the oldest major version (4.6) that can still be released in rhel9. It includes a libreswan crash fix and some CVE backports that are present in libreswan-4.9 but not in libreswan-4.5 (so that it can pass the internal CVE scanner check).

This pinning of libreswan-4.6-3.el9_0.3 is only needed for 4.14.z since containerized ipsec is used in 4.14. Starting 4.15, ipsec is moved to host and this CNO PR (about to merge as of writing) will allow ovnk to use host ipsec execs which only requires libreswan pkg update in rhcos extension.

 

Description of problem:

bump ovs version to openvswitch3.4-3.4.0-18.el9fdp for ocp 4.19 to include the ovs-monitor-ipsec improvement https://issues.redhat.com/browse/FDP-846

Description of problem:

   This bug is filed as a result of https://access.redhat.com/support/cases/#/case/03977446
Although both node topologies are equivalent, PPC reported a false negative:

  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    always

Steps to Reproduce:

    1.TBD
    2.
    3.
    

Actual results:

    Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

Expected results:

    The topologies match; the PPC should work fine

Additional info:

    

Description of problem:

`sts:AssumeRole` is required when creating a Shared-VPC cluster [1]; otherwise the following error occurs:

 level=fatal msg=failed to fetch Cluster Infrastructure Variables: failed to fetch dependency of "Cluster Infrastructure Variables": failed to generate asset "Platform Provisioning Check": aws.hostedZone: Invalid value: "Z01991651G3UXC4ZFDNDU": unable to retrieve hosted zone: could not get hosted zone: Z01991651G3UXC4ZFDNDU: AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-1c2w7jv2-ef4fe-minimal-perm-installer is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::641733028092:role/ci-op-1c2w7jv2-ef4fe-shared-role
level=fatal msg=	status code: 403, request id: ab7160fa-ade9-4afe-aacd-782495dc9978
Installer exit with code 1

[1]https://docs.openshift.com/container-platform/4.17/installing/installing_aws/installing-aws-account.html

    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-03-174639
    

How reproducible:

Always
    

Steps to Reproduce:

1. Create install-config for Shared-VPC cluster
2. Run openshift-install create permissions-policy
3. Create cluster by using the above installer-required policy.

    

Actual results:

See description
    

Expected results:

sts:AssumeRole is included in the policy file when Shared VPC is configured.
    

Additional info:

The configuration of Shared-VPC is like:
platform:
  aws:
    hostedZone:
    hostedZoneRole:

    

Description of problem:

    In an effort to ensure all HA components are not degraded by design during normal e2e test or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run.

This card captures machine-config operator that blips Degraded=True during upgrade runs.


Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-azure-ovn-upgrade/1843023092004163584   

Reasons associated with the blip: RenderConfigFailed   

For now, we put an exception in the test. But it is expected that teams take action to fix those and remove the exceptions after the fix goes in.

Exceptions are defined here: 


See linked issue for more explanation on the effort.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

OSD-26887: managed services taints several nodes as infrastructure. This taint appears to be applied after some of the platform DS are scheduled there, causing this alert to fire.  Managed services rebalances the DS after the taint is added, and the alert clears, but origin fails this test. Allowing this alert to fire while we investigate why the taint is not added at node birth.

Description of problem:

Missing translations for "PodDisruptionBudget violated" string

Code:

"count PodDisruptionBudget violated_one": "count PodDisruptionBudget violated_one", "count PodDisruptionBudget violated_other": "count PodDisruptionBudget violated_other",
   

Code: 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

HyperShift CEL validation blocks ARM64 NodePool creation for non-AWS/Azure platforms
Can't add a Bare Metal worker node to the hosted cluster. 
This was discussed on #project-hypershift Slack channel.

Version-Release number of selected component (if applicable):

MultiClusterEngine v2.7.2 
HyperShift Operator image: 
registry.redhat.io/multicluster-engine/hypershift-rhel9-operator@sha256:56bd0210fa2a6b9494697dc7e2322952cd3d1500abc9f1f0bbf49964005a7c3a   

How reproducible:

Always

Steps to Reproduce:

1. Create a HyperShift HostedCluster on a non-AWS/non-Azure platform
2. Try to create a NodePool with ARM64 architecture specification

Actual results:

- CEL validation blocks creating NodePool with arch: arm64 on non-AWS/Azure platforms
- Receive error: "The NodePool is invalid: spec: Invalid value: "object": Setting Arch to arm64 is only supported for AWS and Azure"
- Additional validation in NodePool spec also blocks arm64 architecture

Expected results:

- Allow ARM64 architecture specification for NodePools on BareMetal platform 
- Remove or update the CEL validation to support this use case

Additional info:

NodePool YAML:
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: nodepool-doca5-1
  namespace: doca5
spec:
  arch: arm64
  clusterName: doca5
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: InPlace
  platform:
    agent:
      agentLabelSelector: {}
    type: Agent
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.21-multi
  replicas: 1    

Description of problem:

    ingress-to-route controller does not provide any information about failed conversions from ingress to route. This is a big issue in environments heavily dependent on the ingress objects. The only way to find why a route is not created is guess and try as the only information one can get is that the route is not created. 

Version-Release number of selected component (if applicable):

    OCP 4.14

How reproducible:

    100%

Steps to Reproduce:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    route.openshift.io/termination: passthrough
  name: hello-openshift-class
  namespace: test
spec:
  ingressClassName: openshift-default
  rules:
  - host: ingress01-rhodain-test01.apps.rhodain03.sbr-virt.gsslab.brq2.redhat.com
    http:
      paths:
      - backend:
          service:
            name: myapp02
            port:
              number: 8080
        path: /
        pathType: Prefix
  tls:
  - {}  

Actual results:

    Route is not created and no error is logged

Expected results:

    An error is provided in the events or at least in the controller's logs. The events are preferred, as the Ingress objects are mainly created by users without cluster-admin privileges.

Additional info:

    

 

It looks like OLMv1 doesn't handle proxies correctly, aws-ovn-proxy job is permafailing https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-proxy/1861444783696777216

I suspect it's on the OLM operator side, are you looking at the cluster-wide proxy object and wiring it into your containers if set?
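
For illustration, "wiring it into your containers" usually means projecting the cluster-wide Proxy settings as environment variables on the operator's workloads; a minimal sketch of such a container fragment (image and proxy values are illustrative, not the actual OLM deployment):

~~~yaml
containers:
- name: operator
  image: example.com/operator:latest   # illustrative
  env:
  - name: HTTP_PROXY
    value: http://proxy.example.com:3128
  - name: HTTPS_PROXY
    value: http://proxy.example.com:3128
  - name: NO_PROXY
    value: .cluster.local,.svc,10.0.0.0/16
~~~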

Description of problem:

The HorizontalNav component of @openshift-console/dynamic-plugin-sdk does not have the customData prop, which is available in the console repo. 

This prop is needed to pass values between tabs on a details page

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    


Description of problem:

Applying a PerformanceProfile with an invalid cpuset in one of the reserved/isolated/shared/offlined cpu fields causes the webhook validation to panic instead of returning an informative error.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-22-231049

How reproducible:

Apply a PerformanceProfile with invalid cpu values

Steps to Reproduce:

Apply the following PerformanceProfile with invalid cpu values:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: pp
spec:
  cpu:
    isolated: 'garbage'
    reserved: 0-3
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker-cnf: ""
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""     

Actual results:

On OCP >= 4.18 the error is:
Error from server: error when creating "pp.yaml": admission webhook "vwb.performance.openshift.io" denied the request: panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  
On OCP <= 4.17 the behavior is:
The validation webhook passes without any errors. The invalid configuration propagates to the cluster and breaks it.

Expected results:

We expect an informative error to be returned when an invalid cpuset is entered, without panicking or accepting it.
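
For comparison, a well-formed cpuset is a comma-separated list of CPU IDs and ranges; a minimal sketch of valid cpu fields (the exact IDs are illustrative):

~~~yaml
spec:
  cpu:
    isolated: "4-47"    # illustrative; e.g. "4-23,28-47" is also valid cpuset syntax
    reserved: "0-3"
~~~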

The "oc adm pod-network" command for working with openshift-sdn multitenant mode is now totally useless in OCP 4.17 and newer clusters (since it's only useful with openshift-sdn, and openshift-sdn no longer exists as of OCP 4.17). Of course, people might use a new oc binary to talk to an older cluster, but probably the built-in documentation should make it clearer that this is not a command that will be useful to 99% of users.

If it's possible to make "pod-network" not show up as a subcommand in "oc adm -h" that would probably be good. If not, it should probably have a description like "Manage OpenShift-SDN Multitenant mode networking [DEPRECATED]", and likewise, the longer descriptions of the pod-network subcommands should talk about "OpenShift-SDN Multitenant mode" rather than "the redhat/openshift-ovs-multitenant network plugin" (which is OCP 3 terminology), and maybe should explicitly say something like "this has no effect when using the default OpenShift Networking plugin (OVN-Kubernetes)".

Description of problem:

The release signature configmap file is invalid with no name defined

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410011141.p0.g227a9c4.assembly.stream.el9-227a9c4", GitCommit:"227a9c499b6fd94e189a71776c83057149ee06c2", GitTreeState:"clean", BuildDate:"2024-10-01T20:07:43Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

1) with isc :
cat /test/yinzhou/config.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.16
2) do mirror2disk + disk2mirror 
3) use the signature configmap  to create resource 

Actual results:

3) failed to create resource with error: 
oc create -f signature-configmap.json 
The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required

oc create -f signature-configmap.yaml 
The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required  

 

 

Expected results:

No error 
 

 

Description of problem:

    If the serverless function is not running, clicking the Test Serverless function button does nothing.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1.Install serverless operator
    2.Create serverless function and make sure the status is false
    3.Click on Test Serverless function
    

Actual results:

    No response

Expected results:

    Maybe show an alert, or hide that option when the function is not ready.

Additional info:

    

Description of problem:

A node was created today with the worker label. It was then labeled as a loadbalancer to match the MCP selector. The MCP saw the selector and moved to Updating, but the machine-config-daemon pod isn't responding. We tried deleting the pod and it still didn't pick up that it needed a new config. Manually editing the desired config (see the sketch after the node annotations below) appears to work around the issue, but shouldn't be necessary.

Node created today:

[dasmall@supportshell-1 03803880]$ oc get nodes worker-048.kub3.sttlwazu.vzwops.com -o yaml | yq .metadata.creationTimestamp
'2024-04-30T17:17:56Z'

Node has worker and loadbalancer roles:

[dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com
NAME                                  STATUS   ROLES                 AGE   VERSION
worker-048.kub3.sttlwazu.vzwops.com   Ready    loadbalancer,worker   1h    v1.25.14+a52e8df


MCP shows a loadbalancer needing Update and 0 nodes in worker pool:

[dasmall@supportshell-1 03803880]$ oc get mcp
NAME           CONFIG                                                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
loadbalancer   rendered-loadbalancer-1486d925cac5a9366d6345552af26c89   False     True       False      4              3                   3                     0                      87d
master         rendered-master-47f6fa5afe8ce8f156d80a104f8bacae         True      False      False      3              3                   3                     0                      87d
worker         rendered-worker-a6be9fb3f667b76a611ce51811434cf9         True      False      False      0              0                   0                     0                      87d
workerperf     rendered-workerperf-477d3621fe19f1f980d1557a02276b4e     True      False      False      38             38                  38                    0                      87d


Status shows mcp updating:

[dasmall@supportshell-1 03803880]$ oc get mcp loadbalancer -o yaml | yq .status.conditions[4]
lastTransitionTime: '2024-04-30T17:33:21Z'
message: All nodes are updating to rendered-loadbalancer-1486d925cac5a9366d6345552af26c89
reason: ''
status: 'True'
type: Updating


Node still appears happy with worker MC:

[dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com -o yaml | grep rendered-
    machineconfiguration.openshift.io/currentConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9
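
For context, the manual workaround mentioned above amounts to pointing this annotation on the Node object at the loadbalancer rendered config (a sketch using the names from the outputs above):

# Node metadata fragment; normally the machine-config controller sets this itself
metadata:
  annotations:
    machineconfiguration.openshift.io/desiredConfig: rendered-loadbalancer-1486d925cac5a9366d6345552af26c89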


machine-config-daemon pod appears idle:

[dasmall@supportshell-1 03803880]$ oc logs -n openshift-machine-config-operator machine-config-daemon-wx2b8 -c machine-config-daemon
2024-04-30T17:48:29.868191425Z I0430 17:48:29.868156   19112 start.go:112] Version: v4.12.0-202311220908.p0.gef25c81.assembly.stream-dirty (ef25c81205a65d5361cfc464e16fd5d47c0c6f17)
2024-04-30T17:48:29.871340319Z I0430 17:48:29.871328   19112 start.go:125] Calling chroot("/rootfs")
2024-04-30T17:48:29.871602466Z I0430 17:48:29.871593   19112 update.go:2110] Running: systemctl daemon-reload
2024-04-30T17:48:30.066554346Z I0430 17:48:30.066006   19112 rpm-ostree.go:85] Enabled workaround for bug 2111817
2024-04-30T17:48:30.297743470Z I0430 17:48:30.297706   19112 daemon.go:241] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 (412.86.202311271639-0) 828584d351fcb58e4d799cebf271094d5d9b5c1a515d491ee5607b1dcf6ebf6b
2024-04-30T17:48:30.324852197Z I0430 17:48:30.324543   19112 start.go:101] Copied self to /run/bin/machine-config-daemon on host
2024-04-30T17:48:30.325677959Z I0430 17:48:30.325666   19112 start.go:188] overriding kubernetes api to https://api-int.kub3.sttlwazu.vzwops.com:6443
2024-04-30T17:48:30.326381479Z I0430 17:48:30.326368   19112 metrics.go:106] Registering Prometheus metrics
2024-04-30T17:48:30.326447815Z I0430 17:48:30.326440   19112 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
2024-04-30T17:48:30.327835814Z I0430 17:48:30.327811   19112 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
2024-04-30T17:48:30.327932144Z I0430 17:48:30.327923   19112 update.go:2125] Starting to manage node: worker-048.kub3.sttlwazu.vzwops.com
2024-04-30T17:48:30.332123862Z I0430 17:48:30.332097   19112 rpm-ostree.go:394] Running captured: rpm-ostree status
2024-04-30T17:48:30.332928272Z I0430 17:48:30.332909   19112 daemon.go:1049] Detected a new login session: New session 1 of user core.
2024-04-30T17:48:30.332935796Z I0430 17:48:30.332926   19112 daemon.go:1050] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
2024-04-30T17:48:30.368619942Z I0430 17:48:30.368598   19112 daemon.go:1298] State: idle
2024-04-30T17:48:30.368619942Z Deployments:
2024-04-30T17:48:30.368619942Z * ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                    Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                   Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z)
2024-04-30T17:48:30.368619942Z           LayeredPackages: kernel-devel kernel-headers
2024-04-30T17:48:30.368619942Z
2024-04-30T17:48:30.368619942Z   ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                    Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                   Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z)
2024-04-30T17:48:30.368619942Z           LayeredPackages: kernel-devel kernel-headers
2024-04-30T17:48:30.368907860Z I0430 17:48:30.368884   19112 coreos.go:54] CoreOS aleph version: mtime=2023-08-08 11:20:41.285 +0000 UTC build=412.86.202308081039-0 imgid=rhcos-412.86.202308081039-0-metal.x86_64.raw
2024-04-30T17:48:30.368932886Z I0430 17:48:30.368926   19112 coreos.go:71] Ignition provisioning: time=2024-04-30T17:03:44Z
2024-04-30T17:48:30.368938120Z I0430 17:48:30.368931   19112 rpm-ostree.go:394] Running captured: journalctl --list-boots
2024-04-30T17:48:30.372893750Z I0430 17:48:30.372884   19112 daemon.go:1307] journalctl --list-boots:
2024-04-30T17:48:30.372893750Z -2 847e119666d9498da2ae1bd89aa4c4d0 Tue 2024-04-30 17:03:13 UTC—Tue 2024-04-30 17:06:32 UTC
2024-04-30T17:48:30.372893750Z -1 9617b204b8b8412fb31438787f56a62f Tue 2024-04-30 17:09:06 UTC—Tue 2024-04-30 17:36:39 UTC
2024-04-30T17:48:30.372893750Z  0 3cbf6edcacde408b8979692c16e3d01b Tue 2024-04-30 17:39:20 UTC—Tue 2024-04-30 17:48:30 UTC
2024-04-30T17:48:30.372912686Z I0430 17:48:30.372891   19112 rpm-ostree.go:394] Running captured: systemctl list-units --state=failed --no-legend
2024-04-30T17:48:30.378069332Z I0430 17:48:30.378059   19112 daemon.go:1322] systemd service state: OK
2024-04-30T17:48:30.378069332Z I0430 17:48:30.378066   19112 daemon.go:987] Starting MachineConfigDaemon
2024-04-30T17:48:30.378121340Z I0430 17:48:30.378106   19112 daemon.go:994] Enabling Kubelet Healthz Monitor
2024-04-30T17:48:31.486786667Z I0430 17:48:31.486747   19112 daemon.go:457] Node worker-048.kub3.sttlwazu.vzwops.com is not labeled node-role.kubernetes.io/master
2024-04-30T17:48:31.491674986Z I0430 17:48:31.491594   19112 daemon.go:1243] Current+desired config: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:31.491674986Z I0430 17:48:31.491603   19112 daemon.go:1253] state: Done
2024-04-30T17:48:31.495704843Z I0430 17:48:31.495617   19112 daemon.go:617] Detected a login session before the daemon took over on first boot
2024-04-30T17:48:31.495704843Z I0430 17:48:31.495624   19112 daemon.go:618] Applying annotation: machineconfiguration.openshift.io/ssh
2024-04-30T17:48:31.503165515Z I0430 17:48:31.503052   19112 update.go:2110] Running: rpm-ostree cleanup -r
2024-04-30T17:48:32.232728843Z Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1
2024-04-30T17:48:35.755815139Z Freed: 92.3 MB (pkgcache branches: 0)
2024-04-30T17:48:35.764568364Z I0430 17:48:35.764548   19112 daemon.go:1563] Validating against current config rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:36.120148982Z I0430 17:48:36.120119   19112 rpm-ostree.go:394] Running captured: rpm-ostree kargs
2024-04-30T17:48:36.179660790Z I0430 17:48:36.179631   19112 update.go:2125] Validated on-disk state
2024-04-30T17:48:36.182434142Z I0430 17:48:36.182406   19112 daemon.go:1646] In desired config rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:36.196911084Z I0430 17:48:36.196879   19112 config_drift_monitor.go:246] Config Drift Monitor started

Version-Release number of selected component (if applicable):

    4.12.45

How reproducible:

    They can reproduce in multiple clusters

Actual results:

    Node stays with rendered-worker config

Expected results:

    The MachineConfigPool moving to Updating should prompt a change to the node's desired config, which the machine-config-daemon pod then updates the node to.

Additional info:

    Here is the latest must-gather where this issue is occurring:
https://attachments.access.redhat.com/hydra/rest/cases/03803880/attachments/3fd0cf52-a770-4525-aecd-3a437ea70c9b?usePresignedUrl=true

Description of problem:

    Destroying a private cluster doesn't delete the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-10-23-202329

How reproducible:

    Always

Steps to Reproduce:

1. pre-create vpc network/subnets/router and a bastion host
2. "create install-config", and then insert the network settings under platform.gcp, along with "publish: Internal" (see [1])
3. "create cluster" (use the above bastion host as http proxy)
4. "destroy cluster" (see [2])

Actual results:

    Although "destroy cluster" completes successfully, the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator are not deleted (see [3]), which leads to deleting the vpc network/subnets failure.

Expected results:

    The forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator should also be deleted during "destroy cluster".

Additional info:

FYI one history bug https://issues.redhat.com/browse/OCPBUGS-37683    

Managed services marks a couple of nodes as "infra" so user workloads don't get scheduled on them.  However, platform daemonsets like iptables-alerter should run there – and the typical toleration for that purpose should be:

 tolerations:
- operator: Exists

instead the toleration is

tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule" 

 

Examples from other platform DS:

 

$ for ns in openshift-cluster-csi-drivers openshift-cluster-node-tuning-operator openshift-dns openshift-image-registry openshift-machine-config-operator openshift-monitoring openshift-multus openshift-multus openshift-multus openshift-network-diagnostics openshift-network-operator openshift-ovn-kubernetes openshift-security; do echo "NS: $ns"; oc get ds -o json -n $ns|jq '.items.[0].spec.template.spec.tolerations'; done
NS: openshift-cluster-csi-drivers
[
  {
    "operator": "Exists"
  }
]
NS: openshift-cluster-node-tuning-operator
[
  {
    "operator": "Exists"
  }
]
NS: openshift-dns
[
  {
    "key": "node-role.kubernetes.io/master",
    "operator": "Exists"
  }
]
NS: openshift-image-registry
[
  {
    "operator": "Exists"
  }
]
NS: openshift-machine-config-operator
[
  {
    "operator": "Exists"
  }
]
NS: openshift-monitoring
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-network-diagnostics
[
  {
    "operator": "Exists"
  }
]
NS: openshift-network-operator
[
  {
    "effect": "NoSchedule",
    "key": "node-role.kubernetes.io/master",
    "operator": "Exists"
  }
]
NS: openshift-ovn-kubernetes
[
  {
    "operator": "Exists"
  }
]
NS: openshift-security
[
  {
    "operator": "Exists"
  }
] 

The helper doesn't have all the namespaces in it, and we're getting some flakes in CI like this:

 
 
batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources
does not have a cpu request (rule: "batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources/request[cpu]")

Description of problem:

The machine-os-builder deployment manifest does not set the openshift.io/required-scc annotation, which appears to be required for the upgrade conformance suite to pass. The rest of the MCO components currently set this annotation, and we can probably use the same setting as the Machine Config Controller (which is restricted-v2). What I'm unsure of is whether this also needs to be set on the builder pods, and what the appropriate setting would be in that case.
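
A minimal sketch of the missing annotation on the machine-os-builder deployment's pod template, assuming the same restricted-v2 value used by the other MCO components:

# Deployment pod-template fragment (not the full manifest)
spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: restricted-v2   # assumed value, matching the Machine Config Controller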

Version-Release number of selected component (if applicable):

 

How reproducible:

This always occurs in the new CI jobs e2e-aws-ovn-upgrade-ocb-techpreview and e2e-aws-ovn-upgrade-ocb-conformance-suite-techpreview. Here are two examples from rehearsal failures:

Steps to Reproduce:

Run either of the aforementioned CI jobs.

Actual results:

Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation fails.

Expected results:

Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation should pass.

 

Additional info:

    

Description of problem:

Under some circumstances (not clear exactly which ones), the OVN databases of 2 nodes ended up having 2 src-ip static routes in ovn_cluster_router instead of one: one of them points to the correct IP of the rtoj-GR_${NODE_NAME} LRP and one points to a wrong IP on the join subnet (that IP is not used in any other LRP or LSP).

Both static routes are taken into consideration while routing traffic out from the cluster, so packets that use the right route are able to egress while the packets that use the wrong route are dropped.

Version-Release number of selected component (if applicable):

Reproduced in 4.14.20

How reproducible:

Observed at least once, on only 2 nodes of the cluster.

Steps to Reproduce:

(Not sure, it was just found after investigation of strange packet drop)

Actual results:

Wrong static route to some non-existent IP in the join subnet. Intermittent packet drop.

Expected results:

No wrong static routes. No packet drop.

Additional info:

This can be worked around by wiping the OVN databases of the impacted node.

Our unit test runtime is slow. It seems to run anywhere from ~16-20 minutes locally. On CI it can take at least 30 minutes to run. Investigate whether or not any changes can be made to improve the unit test runtime.

This issue tracks updating k8s and related openshift APIs to a recent version, to keep in line with other MAPI providers.

Description of problem:

The ConsolePlugin details page throws an error for some specific YAML.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-30-141716    

How reproducible:

Always    

Steps to Reproduce:

1. Create a ConsolePlugin with minimum required fields  apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: console-demo-plugin-two
spec:
  backend:
    type: Service
  displayName: OpenShift Console Demo Plugin

2. Visit consoleplugin details page at /k8s/cluster/console.openshift.io~v1~ConsolePlugin/console-demo-plugin

Actual results:

2. We will see an error page    

Expected results:

2. We should not show an error page, since the ConsolePlugin YAML has every required field even though the optional fields are not set.
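
For comparison, a ConsolePlugin with the optional backend service fields populated looks roughly like this (the service name, namespace, and port are illustrative):

apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: console-demo-plugin-two
spec:
  displayName: OpenShift Console Demo Plugin
  backend:
    type: Service
    service:
      name: console-demo-plugin-two    # illustrative
      namespace: console-demo-plugin   # illustrative
      port: 9001                       # illustrative
      basePath: /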

Additional info:

    

This is a clone of issue OCPBUGS-45859. The following is the description of the original issue:

The following test is failing more than expected:

Undiagnosed panic detected in pod

See the sippy test details for additional context.

Observed in 4.18-e2e-azure-ovn/1864410356567248896 as well as pull-ci-openshift-installer-master-e2e-azure-ovn/1864312373058211840

: Undiagnosed panic detected in pod
{  pods/openshift-cloud-controller-manager_azure-cloud-controller-manager-5788c6f7f9-n2mnh_cloud-controller-manager_previous.log.gz:E1204 22:27:54.558549       1 iface.go:262] "Observed a panic" panic="interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.EndpointSlice" panicGoValue="&runtime.TypeAssertionError{_interface:(*abi.Type)(0x291daa0), concrete:(*abi.Type)(0x2b73880), asserted:(*abi.Type)(0x2f5cc20), missingMethod:\"\"}" stacktrace=<}

Description of problem:

Previously, failed task runs did not emit results; now they do, but the UI still shows "No TaskRun results available due to failure" even though the task run's status contains a result.
    

Version-Release number of selected component (if applicable):

4.14.3
    

How reproducible:

Always with a task run producing a result but failing afterwards
    

Steps to Reproduce:

    1. Create the pipelinerun below
    2. have a look on its task run
    
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: hello-pipeline
spec:
  tasks:
  - name: hello
    taskSpec:
      results:
      - name: greeting1
      steps:
      - name: greet
        image: registry.access.redhat.com/ubi8/ubi-minimal
        script: |
          #!/usr/bin/env bash
          set -e
          echo -n "Hello World!" | tee $(results.greeting1.path)
          exit 1
  results:
  - name: greeting2
    value: $(tasks.hello.results.greeting1)
    

Actual results:

No results in UI
    

Expected results:

One result should be displayed even though task run failed
    

Additional info:

Pipelines 1.13.0
    

Description of problem:

    While upgrading the Fusion operator, the IBM team is facing the following error in the operator's subscription:
error validating existing CRs against new CRD's schema for "fusionserviceinstances.service.isf.ibm.com": error validating service.isf.ibm.com/v1, Kind=FusionServiceInstance "ibm-spectrum-fusion-ns/odfmanager": updated validation is too restrictive: [].status.triggerCatSrcCreateStartTime: Invalid value: "number": status.triggerCatSrcCreateStartTime in body must be of type integer: "number"


The question here: "triggerCatSrcCreateStartTime" has been present in the operator for the past few releases and its datatype (integer) hasn't changed in the latest release either. There was one "FusionServiceInstance" CR present in the cluster when this issue was hit, and the value of its "triggerCatSrcCreateStartTime" field was "1726856593000774400".

Version-Release number of selected component (if applicable):

    It impacts upgrades between OCP 4.16.7 and OCP 4.16.14

How reproducible:

    Always

Steps to Reproduce:

    1.Upgrade the fusion operator ocp version 4.16.7 to ocp 4.16.14
    2.
    3.
    

Actual results:

    Upgrade fails with error in description

Expected results:

    The upgrade should not fail

Additional info:

    

The aks-e2e test keeps failing on the CreateClusterV2 test because the `ValidReleaseInfo` condition is not set. The patch that sets this status keeps failing. Investigate why & provide a fix.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    every time

Steps to Reproduce:

    1. Create the dashboard with a bar chart and sort query result asc.
    2. 
    3.
    

Actual results:

 bar goes outside of the border

Expected results:

The bar should not go outside of the border.

Additional info:

    

screenshot: https://drive.google.com/file/d/1xPRgenpyCxvUuWcGiWzmw5kz51qKLHyI/view?usp=drive_link

Description of problem:

Trying to set up a disconnected HCP cluster with a self-managed image registry.

After the cluster was installed, all imagestreams failed to import images.
With error:
```
Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client
```

Imagestream imports go through the openshift-apiserver, which fetches the image metadata from the target registry.

After logging in to the pods in the HCP namespace, I figured out that I cannot access any external network over HTTPS.

Version-Release number of selected component (if applicable):

4.14.35    

How reproducible:

    always

Steps to Reproduce:

    1. Install the hypershift hosted cluster with above setup
    2. The cluster can be created successfully and all the pods on the cluster can be running with the expected images pulled
    3. Check the internal image-registry
    4. Check the openshift-apiserver pod from management cluster
    

Actual results:

All the imagestreams failed to sync from the remote registry.
$ oc describe is cli -n openshift
Name:            cli
Namespace:        openshift
Created:        6 days ago
Labels:            <none>
Annotations:        include.release.openshift.io/ibm-cloud-managed=true
            include.release.openshift.io/self-managed-high-availability=true
            openshift.io/image.dockerRepositoryCheck=2024-11-06T22:12:32Z
Image Repository:    image-registry.openshift-image-registry.svc:5000/openshift/cli
Image Lookup:        local=false
Unique Images:        0
Tags:            1latest
  updates automatically from registry quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d  ! error: Import failed (InternalError): Internal error occurred: [122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-1@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-2@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-3@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-4@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-5@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://quay.io/v2/": http: server gave HTTP response to HTTPS client]


Access the external network from the openshift-apiserver pod:
sh-5.1$ curl --connect-timeout 5 https://quay.io/v2
curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received
sh-5.1$ curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/
curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received

sh-5.1$ env | grep -i http.*proxy
HTTPS_PROXY=http://127.0.0.1:8090
HTTP_PROXY=http://127.0.0.1:8090

Expected results:

The openshift-apiserver should be able to talk to the remote https services.

Additional info:

It works after adding the registry to no_proxy:

sh-5.1$ NO_PROXY=122610517469.dkr.ecr.us-west-2.amazonaws.com curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/
Not Authorized
 

 

Description of problem:

The additional network is not correctly configured on the secondary interface of the masters and the workers.

With install-config.yaml with this section:

# This file is autogenerated by infrared openshift plugin                                                                                                                                                                                                                                                                    
apiVersion: v1                                                                                                                                                                                                                                                                                                               
baseDomain: "shiftstack.local"
compute:
- name: worker
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9']
      type: "worker"
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9']
      type: "master"
  replicas: 3
metadata:
  name: "ostest"
networking:
  clusterNetworks:
  - cidr: fd01::/48
    hostPrefix: 64
  serviceNetwork:
    - fd02::/112
  machineNetwork:
    - cidr: "fd2e:6f44:5dd8:c956::/64"
  networkType: "OVNKubernetes"
platform:
  openstack:
    cloud:            "shiftstack"
    region:           "regionOne"
    defaultMachinePlatform:
      type: "master"
    apiVIPs: ["fd2e:6f44:5dd8:c956::5"]
    ingressVIPs: ["fd2e:6f44:5dd8:c956::7"]
    controlPlanePort:
      fixedIPs:
        - subnet:
            name: "subnet-ssipv6"
pullSecret: |
  {"auths": {"installer-host.example.com:8443": {"auth": "ZHVtbXkxMjM6ZHVtbXkxMjM="}}}
sshKey: <hidden>
additionalTrustBundle: <hidden>
imageContentSources:
- mirrors:
  - installer-host.example.com:8443/registry
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - installer-host.example.com:8443/registry
  source: registry.ci.openshift.org/ocp/release

The installation works. However, the additional network is not configured on the masters or the workers, which leads in our case to faulty manila integration.

In the journal of all OCP nodes, logs like the one below from master-0 are observed repeatedly:

Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9667] device (enp4s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <warn>  [1731590504.9672] device (enp4s0): Activation: failed for connection 'Wired connection 1'
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9674] device (enp4s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): canceled DHCP transaction
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): activation: beginning transaction (timeout in 45 seconds)
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): state changed no lease

That server specifically has an interface connected to the "StorageNFSSubnet" subnet:

$ openstack server list | grep master-0
| da23da4a-4af8-4e54-ac60-88d6db2627b6 | ostest-kmmtt-master-0       | ACTIVE | StorageNFS=fd00:fd00:fd00:5000::fb:d8; network-ssipv6=fd2e:6f44:5dd8:c956::2e4            | ostest-kmmtt-rhcos                            | master    |

That subnet is defined in openstack as dhcpv6-stateful:

$ openstack subnet show StorageNFSSubnet
+----------------------+-------------------------------------------------------+
| Field                | Value                                                 |
+----------------------+-------------------------------------------------------+
| allocation_pools     | fd00:fd00:fd00:5000::fb:10-fd00:fd00:fd00:5000::fb:fe |
| cidr                 | fd00:fd00:fd00:5000::/64                              |
| created_at           | 2024-11-13T12:34:41Z                                  |
| description          |                                                       |
| dns_nameservers      |                                                       |
| dns_publish_fixed_ip | None                                                  |
| enable_dhcp          | True                                                  |
| gateway_ip           | None                                                  |
| host_routes          |                                                       |
| id                   | 480d7b2a-915f-4f0c-9717-90c55b48f912                  |
| ip_version           | 6                                                     |
| ipv6_address_mode    | dhcpv6-stateful                                       |
| ipv6_ra_mode         | dhcpv6-stateful                                       |
| name                 | StorageNFSSubnet                                      |
| network_id           | 26a751c3-c316-483c-91ed-615702bcbba9                  |
| prefix_length        | None                                                  |
| project_id           | 4566c393806c43b9b4e9455ebae1cbb6                      |
| revision_number      | 0                                                     |
| segment_id           | None                                                  |
| service_types        | None                                                  |
| subnetpool_id        | None                                                  |
| tags                 |                                                       |
| updated_at           | 2024-11-13T12:34:41Z                                  |
+----------------------+-------------------------------------------------------+

I also compared with an IPv4 installation, where the StorageNFSSubnet IP is successfully configured on enp4s0.

Version-Release number of selected component (if applicable):

  • 4.18.0-0.nightly-2024-11-12-201730,
  • RHOS-17.1-RHEL-9-20240701.n.1

How reproducible: Always
Additional info: must-gather and journal of the OCP nodes provided in private comment.

Description of problem:

The 'Plus' button in the 'Edit Pod Count' popup window overlaps the input field, which is incorrect.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-12-05-103644

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Workloads -> ReplicaSets page, choose one resource, click the kebab menu button, and choose 'Edit Pod count'
    2.
    3.
    

Actual results:

    The Layout is incorrect

Expected results:

    The 'Plus' button in the 'Edit Pod Count' popup window should not overlap the input field.

Additional info:

 Snapshot: https://drive.google.com/file/d/1mL7xeT7FzkdsM1TZlqGdgCP5BG6XA8uh/view?usp=drive_link
https://drive.google.com/file/d/1qmcal_4hypEPjmG6PTG11AJPwdgt65py/view?usp=drive_link

Description of problem:

The console shows 'View release notes' links in several places, but the current link only points to the main release notes of the Y-stream release.

Version-Release number of selected component (if applicable):

4.17.2    

How reproducible:

Always    

Steps to Reproduce:

1. set up 4.17.2 cluster
2. navigate to Cluster Settings page, check 'View release note' link in 'Update history' table 

Actual results:

The link only points users to the main release notes of the Y-stream release.

Expected results:

The link should point to the release notes of the specific version.
the correct link should be 
https://access.redhat.com/documentation/en-us/openshift_container_platform/${major}.${minor}/html/release_notes/ocp-${major}-${minor}-release-notes#ocp-${major}-${minor}-${patch}_release_notes   

Additional info:

    

Description of problem:

Sippy complains about pathological events in ns/openshift-cluster-csi-drivers in vsphere-ovn-serial jobs. See this job as one example.

Jan noticed that the DaemonSet generation is 10-12, while in 4.17 it is 2. Why is our operator updating the DaemonSet so often?

I wrote a quick "one-liner" to generate json diffs from the vmware-vsphere-csi-driver-operator logs:

prev=''; grep 'DaemonSet "openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node" changes' openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-operator-5b79c58f6f-hpr6g_vmware-vsphere-csi-driver-operator.log | sed 's/^.*changes: //' | while read -r line; do diff <(echo $prev | jq .) <(echo $line | jq .); prev=$line; echo "####"; done 

It really seems to be only operator.openshift.io/spec-hash and operator.openshift.io/dep-* fields changing in the json diffs:

####
4,5c4,5
<       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
<       "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
---
>       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
>       "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
13c13
<           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
---
>           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
####
4,5c4,5
<       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
<       "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
---
>       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
>       "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
13c13
<           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
---
>           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
#### 

The deployment is also changing in the same way. We need to find what is causing the spec-hash and dep-* fields to change and avoid the unnecessary churn that causes new daemonset / deployment rollouts.

 

Version-Release number of selected component (if applicable):

4.18.0

How reproducible:

~20% failure rate in 4.18 vsphere-ovn-serial jobs

Steps to Reproduce:

    

Actual results:

operator rolls out unnecessary daemonset / deployment changes

Expected results:

don't roll out changes unless there is a spec change

Additional info:

    

Please review the following PR: https://github.com/openshift/coredns/pull/130

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

A customer is trying to install a self-managed OCP cluster in AWS. The customer uses an AWS VPC DHCP option set that has a trailing dot (.) at the end of the domain name. Due to this setting the master node hostnames also get a trailing dot, which causes the OpenShift installation to fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

1. Create an AWS VPC with a DHCP option set whose domain name has a trailing dot.
2. Try an IPI cluster installation.

Actual results:

    The installation fails because the master node hostnames inherit the trailing dot from the DHCP option set domain name.

Expected results:

    The OpenShift installer should allow creating AWS master nodes whose domain name has a trailing dot (.), so the installation succeeds.

Additional info:

    

Description of problem:

Unit tests for openshift/builder are permanently failing for v4.18.
    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Run PR against openshift/builder
    

Actual results:

Test fails: 
--- FAIL: TestUnqualifiedClone (0.20s)
    source_test.go:171: unable to add submodule: "Cloning into '/tmp/test-unqualified335202210/sub'...\nfatal: transport 'file' not allowed\nfatal: clone of 'file:///tmp/test-submodule643317239' into submodule path '/tmp/test-unqualified335202210/sub' failed\n"
    source_test.go:195: unable to find submodule dir
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
    

Expected results:

Tests pass
    

Additional info:

Example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_builder/401/pull-ci-openshift-builder-master-unit/1853816128913018880
    

Description of problem:

    We have recently enabled a few endpoint overrides, but ResourceManager was accidentally excluded.

Description of problem:

Installing a 4.17 agent-based hosted cluster on bare metal with an IPv6 stack in a disconnected environment. We cannot install the MetalLB operator on the hosted cluster to expose the openshift router and handle ingress, because the openshift-marketplace pods that extract the operator bundle (and the related pods) are in Error state. They try to execute the following command but cannot reach the cluster apiserver:

opm alpha bundle extract -m /bundle/ -n openshift-marketplace -c b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1 -z

INFO[0000] Using in-cluster kube client config          
Error: error loading manifests from directory: Get "https://[fd02::1]:443/api/v1/namespaces/openshift-marketplace/configmaps/b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1": dial tcp [fd02::1]:443: connect: connection refused



In our hosted cluster fd02::1 is the clusterIP of the kubernetes service and the endpoint associated with the service is [fd00::1]:6443. By debugging the pods we see that the connection to the clusterIP is refused, but if we try to connect to its endpoint the connection is established and we get 403 Forbidden:

sh-5.1$ curl -k https://[fd02::1]:443
curl: (7) Failed to connect to fd02::1 port 443: Connection refused


sh-5.1$ curl -k https://[fd00::1]:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403

This issue also happens in other pods in the hosted cluster that are in Error or CrashLoopBackOff; we see a similar error in their logs, e.g.:

F1011 09:11:54.129077       1 cmd.go:162] failed checking apiserver connectivity: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-service-ca-operator/leases/service-ca-operator-lock": dial tcp [fd02::1]:443: connect: connection refused


An IPv6 disconnected 4.16 hosted cluster with the same configuration was installed successfully and didn't show this issue, and neither did an IPv4 disconnected 4.17 cluster. So the issue is with the IPv6 stack only.

Version-Release number of selected component (if applicable):

Hub cluster: 4.17.0-0.nightly-2024-10-10-004834

MCE 2.7.0-DOWNANDBACK-2024-09-27-14-52-56

Hosted cluster: version 4.17.1
image: registry.ci.openshift.org/ocp/release@sha256:e16ac60ac6971e5b6f89c1d818f5ae711c0d63ad6a6a26ffe795c738e8cc4dde

How reproducible:

100%

Steps to Reproduce:

    1. Install MCE 2.7 on 4.17 IPv6 disconnected BM hub cluster
    2. Install 4.17 agent-based hosted cluster and scale up the nodepool 
    3. After worker nodes are installed, attempt to install the MetalLB operator to handle ingress
    

Actual results:

MetalLB operator cannot be installed because pods cannot connect to the cluster apiserver.

Expected results:

Pods in the cluster can connect to apiserver. 

Additional info:

 

 

Description of problem:

    During the EUS to EUS upgrade of an MNO cluster from 4.14.16 to 4.16.11 on baremetal, we have seen that, depending on the custom configuration (like a performance profile or container runtime config), one or more control plane nodes are rebooted multiple times.

This seems to be a race condition. When the first rendered MachineConfig is generated, the first control plane node starts rebooting (maxUnavailable is set to 1 on the master MCP), and at that moment a new MachineConfig render is generated, which means a second reboot. Once this first node has rebooted the second time, the rest of the control plane nodes are rebooted just once, because no more MachineConfig renders are generated.

Version-Release number of selected component (if applicable):

    OCP 4.14.16 > 4.15.31  > 4.16.11

How reproducible:

    Perform the upgrade of a Multi Node OCP with a custom configuration like a performance profile or container runtime configuration (like force cgroups v1, or update runc to crun)

Steps to Reproduce:

    1. Deploy on baremetal a MNO OCP 4.14 with a custom manifest, like the below:

---
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: v1

    2. Upgrade the cluster to the next minor version available, for instance 4.15.31, make a partial upgrade pausing the worker Machine Config Pool.

    3. Monitoring the upgrade process (cluster operators, Machine Configs, Machine Config Pools and nodes)
    

Actual results:

    Once almost all the cluster operators are at version 4.15.31, except the Machine Config Operator, review the MachineConfig renders generated for the master Machine Config Pool and monitor the nodes: you will see that a new MachineConfig render is generated after the first control plane node has already been rebooted.

Expected results:

  What is expected is that in an upgrade only one MachineConfig render is generated per Machine Config Pool, and only one reboot per node is needed to finish the upgrade.

Additional info:

    

Description of problem:

    1. Clients cannot connect to the kube-apiserver via the kubernetes svc, because the kubernetes svc IP is not in the cert SANs.
    2. The kube-apiserver-operator generates the apiserver certs and inserts the kubernetes svc IP taken from the network CR status.ServiceNetwork (see the sketch below).
    3. When the temporary control plane is down and the network CR is not ready yet, clients will never be able to connect to the apiserver.
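
For reference, the value read in point 2 comes from the cluster Network config status; a sketch (the CIDR is illustrative):

apiVersion: config.openshift.io/v1
kind: Network
metadata:
  name: cluster
status:
  serviceNetwork:
  - 172.30.0.0/16   # illustrative; the first IP of this range is the kubernetes svc clusterIP that must appear in the cert SANs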

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. I have only hit this under very rare conditions, especially when machine performance is poor
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When deploying a disconnected cluster, creating the ISO with "openshift-install agent create image" fails with "authentication required" when the release image resides in a secured local registry.
Actually the issue is this:
openshift-install generates the registry-config out of the install-config.yaml, and it contains only the local registry credentials (disconnected deploy), but it does not create an ICSP file to pull the image from the local registry.
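
For context, the kind of digest-mirror mapping that would need to be generated alongside the registry-config looks roughly like this (the mirror host and source are illustrative):

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: release-mirror   # illustrative
spec:
  repositoryDigestMirrors:
  - mirrors:
    - local-registry.example.com:8443/registry   # illustrative local mirror
    source: registry.ci.openshift.org/ocp/release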

Version-Release number of selected component (if applicable):

    

How reproducible:

    Run an agent-based ISO image creation for a disconnected cluster. Choose a version (nightly) whose image is in a secured registry (such as registry.ci). It will fail with "authentication required".

Steps to Reproduce:

    1. openshift-install agent create image
    2.
    3.
    

Actual results:

failing on authentication required    

Expected results:

    The ISO should be created.

Additional info:

    

Description of problem:

A dynamic plugin in Pending status blocks the Console plugins tab page from loading.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-27-162407    

How reproducible:

Always    

Steps to Reproduce:

1. Create a dynamic plugin which will be in Pending status, we can create from file https://github.com/openshift/openshift-tests-private/blob/master/frontend/fixtures/plugin/pending-console-demo-plugin-1.yaml 

2. Enable the 'console-demo-plugin-1' plugin and navigate to Console plugins tab at /k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins

Actual results:

2. The page keeps loading forever.

Expected results:

2. console plugins list table should be displayed    

Additional info:

    

Description of problem:

'Channel' and 'Version' dropdowns do not collapse if the user does not select an option    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-04-113014    

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Operator Installation page OR the Operator Install details page
       eg: /operatorhub/ns/openshift-console?source=["Red+Hat"]&details-item=datagrid-redhat-operators-openshift-marketplace&channel=stable&version=8.5.4
       /operatorhub/subscribe?pkg=datagrid&catalog=redhat-operators&catalogNamespace=openshift-marketplace&targetNamespace=openshift-console&channel=stable&version=8.5.4&tokenizedAuth=     
    2. Click the Channel/Update channel OR 'Version' dropdown list
    3. Click the dropdown again
    

Actual results:

The dropdown list does not collapse; it only closes if the user selects an option or clicks another area.

Expected results:

 The dropdown should collapse when clicked again.

Additional info:

    

Description of problem:

    The --report and --pxe flags were introduced in 4.18. They should be marked as experimental until 4.19.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

We should expand upon our current pre-commit hooks:

  • all hooks will run in either the pre-commit stage or the pre-push stage
  • add a pre-push hook to run make verify
  • add a pre-push hook to run make test

This will help prevent errors before code makes it on GitHub and CI.
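
A minimal sketch of what these local hooks could look like in a .pre-commit-config.yaml, assuming the repo already uses the pre-commit framework (a recent version, which uses the pre-push stage name) and has make verify / make test targets:

repos:
  - repo: local
    hooks:
      - id: make-verify
        name: make verify
        entry: make verify      # runs the repo's verify target before pushing
        language: system
        pass_filenames: false
        stages: [pre-push]
      - id: make-test
        name: make test
        entry: make test        # runs the unit tests before pushing
        language: system
        pass_filenames: false
        stages: [pre-push]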

This is a clone of issue OCPBUGS-41727. The following is the description of the original issue:

Original bug title:

cert-manager [v1.15 Regression] Failed to issue certs with ACME Route53 dns01 solver in AWS STS env

Description of problem:

    When using Route53 as the dns01 solver to create certificates, it fails in both automated and manual tests. For the full log, please refer to the "Actual results" section.

Version-Release number of selected component (if applicable):

    cert-manager operator v1.15.0 staging build

How reproducible:

    Always

Steps to Reproduce: also documented in gist

    1. Install the cert-manager operator 1.15.0
    2. Follow the doc to auth operator with AWS STS using ccoctl: https://docs.openshift.com/container-platform/4.16/security/cert_manager_operator/cert-manager-authenticate.html#cert-manager-configure-cloud-credentials-aws-sts_cert-manager-authenticate
     3. Create a ACME issuer with Route53 dns01 solver
     4. Create a cert using the created issuer

OR:

Refer by running `/pj-rehearse pull-ci-openshift-cert-manager-operator-master-e2e-operator-aws-sts` on https://github.com/openshift/release/pull/59568 

Actual results:

1. The certificate is not Ready.
2. The challenge of the cert is stuck in the pending status:

PresentError: Error presenting challenge: failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region  

Expected results:

The certificate should be Ready. The challenge should succeed.

Additional info:

The only way to get it working again seems to be injecting the "AWS_REGION" environment variable into the controller pod. See upstream discussion/change:

I couldn't find a way to inject the env var into our operator-managed operands, so I only verified this workaround using the upstream build v1.15.3. After applying the patch with the following command, the challenge succeeded and the certificate became Ready.

oc patch deployment cert-manager -n cert-manager \
--patch '{"spec": {"template": {"spec": {"containers": [{"name": "cert-manager-controller", "env": [{"name": "AWS_REGION", "value": "aws-global"}]}]}}}}' 

Description of problem:

The manila controller[1] defines labels that are not based on the asset prefix defined in the manila config[2]. Consequently, when assets that select this resource are generated, they use the asset prefix as a base to define the label, resulting in the controller not being selected, for example in the pod anti-affinity[3] and the controller PDB[4]. We need to change the labels used in the selectors to match the actual labels of the controller.

[1]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/standalone/controller.yaml#L45-L47

[2]https://github.com/openshift/csi-operator/blob/master/pkg/driver/openstack-manila/openstack_manila.go#L51

[3]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/standalone/controller.yaml#L55

[4]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/hypershift/controller_pdb.yaml#L16

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

[Azure disk/file CSI driver] on ARO HCP cannot provision volumes successfully.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-13-083421    

How reproducible:

Always    

Steps to Reproduce:

    1. Install an AKS cluster on Azure.
    2. Install the hypershift operator on the AKS cluster.
    3. Use the hypershift CLI to create a hosted cluster with the Client Certificate mode.
    4. Check that the Azure disk/file CSI drivers work well on the hosted cluster.

Actual results:

    In step 4: the Azure disk/file CSI driver fails to provision volumes on the hosted cluster

# azure disk pvc provision failed
$ oc describe pvc mypvc
...
  Normal   WaitForFirstConsumer  74m                    persistentvolume-controller                                                                                waiting for first consumer to be created before binding
  Normal   Provisioning          74m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073  External provisioner is provisioning volume for claim "default/mypvc"
  Warning  ProvisioningFailed    74m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073  failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF
  Warning  ProvisioningFailed    71m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8  failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF
  Normal   Provisioning          71m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8  External provisioner is provisioning volume for claim "default/mypvc"
...

$ oc logs azure-disk-csi-driver-controller-74d944bbcb-7zz89 -c csi-driver
W1216 08:07:04.282922       1 main.go:89] nodeid is empty
I1216 08:07:04.290689       1 main.go:165] set up prometheus server on 127.0.0.1:8201
I1216 08:07:04.291073       1 azuredisk.go:213]
DRIVER INFORMATION:
-------------------
Build Date: "2024-12-13T02:45:35Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.29.11
Git Commit: 4d21ae15d668d802ed5a35068b724f2e12f47d5c
Go Version: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

I1216 08:09:36.814776       1 utils.go:77] GRPC call: /csi.v1.Controller/CreateVolume
I1216 08:09:36.814803       1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":""}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":""}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","parameters":{"csi.storage.k8s.io/pv/name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","csi.storage.k8s.io/pvc/name":"mypvc","csi.storage.k8s.io/pvc/namespace":"default","skuname":"Premium_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":7}}]}
I1216 08:09:36.815338       1 controllerserver.go:208] begin to create azure disk(pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316) account type(Premium_LRS) rg(ci-op-zj9zc4gd-12c20-rg) location(centralus) size(1) diskZone() maxShares(0)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x190c61d]

goroutine 153 [running]:
sigs.k8s.io/cloud-provider-azure/pkg/provider.(*ManagedDiskController).CreateManagedDisk(0x0, {0x2265cf0, 0xc0001285a0}, 0xc0003f2640)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_managedDiskController.go:127 +0x39d
sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).CreateVolume(0xc000564540, {0x2265cf0, 0xc0001285a0}, 0xc000272460)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/controllerserver.go:297 +0x2c59
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1({0x2265cf0?, 0xc0001285a0?}, {0x1e5a260?, 0xc000272460?})
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6420 +0xcb
sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x2265cf0, 0xc0001285a0}, {0x1e5a260, 0xc000272460}, 0xc00017cb80, 0xc00014ea68)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x1f3e440, 0xc000564540}, {0x2265cf0, 0xc0001285a0}, 0xc00029a700, 0x2084458)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6422 +0x143
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00059cc00, {0x2265cf0, 0xc000128510}, {0x2270d60, 0xc0004f5980}, 0xc000308480, 0xc000226a20, 0x31c8f80, 0x0)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1379 +0xdf8
google.golang.org/grpc.(*Server).handleStream(0xc00059cc00, {0x2270d60, 0xc0004f5980}, 0xc000308480)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1790 +0xe8b
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1029 +0x7f
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 16
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1040 +0x125

# azure file pvc provision failed
$ oc describe pvc mypvc
Name:          mypvc
Namespace:     openshift-cluster-csi-drivers
StorageClass:  azurefile-csi
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: file.csi.azure.com
               volume.kubernetes.io/storage-provisioner: file.csi.azure.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type     Reason                Age                From                                                                                                      Message
  ----     ------                ----               ----                                                                                                      -------
  Normal   ExternalProvisioning  14s (x2 over 14s)  persistentvolume-controller                                                                               Waiting for a volume to be created either by the external provisioner 'file.csi.azure.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   Provisioning          7s (x4 over 14s)   file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0  External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/mypvc"
  Warning  ProvisioningFailed    7s (x4 over 14s)   file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0  failed to provision volume with StorageClass "azurefile-csi": rpc error: code = Internal desc = failed to ensure storage account: could not list storage accounts for account type Standard_LRS: StorageAccountClient is nil

Expected results:

    In step 4: the Azure disk/file CSI driver should provision the volume successfully on the hosted cluster.

Additional info:

    

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

Delete the openshift-monitoring/monitoring-plugin-cert secret; SCO will re-create a new one with different content.

Actual results:

- monitoring-plugin is still using the old cert content.
- If the cluster doesn’t show much activity, the hash may take time to be updated.

Expected results:

CMO should detect that exact change and run a sync to recompute and set the new hash.

Additional info:

- We shouldn't rely on another change to trigger the sync loop.
- CMO should probably watch that secret (its name isn't known in advance); see the sketch below.
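
A minimal sketch, assuming client-go informers and illustrative names (the namespace, the hash helper, and the enqueue step are not CMO's actual code): watch Secrets in the monitoring namespace and trigger a sync whenever a secret's content changes, instead of waiting for an unrelated event.

// Sketch only: watch Secrets and react when any secret's data changes.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// hashSecret returns a stable digest of the secret's data, suitable for a
// checksum annotation on the consuming deployment's pod template.
func hashSecret(s *corev1.Secret) string {
	keys := make([]string, 0, len(s.Data))
	for k := range s.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(s.Data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The serving-cert secret name is not known in advance, so watch the whole
	// namespace and react to content changes in the update handler.
	factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithNamespace("openshift-monitoring"))
	secretInformer := factory.Core().V1().Secrets().Informer()
	secretInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldSecret := oldObj.(*corev1.Secret)
			newSecret := newObj.(*corev1.Secret)
			if hashSecret(oldSecret) != hashSecret(newSecret) {
				klog.Infof("secret %s/%s changed, enqueueing a sync", newSecret.Namespace, newSecret.Name)
				// enqueue the monitoring-plugin sync here so the checksum on the
				// deployment is recomputed immediately
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}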

Description of problem:

When updating cypress-axe, new changes and bugfixes in the axe-core accessibility auditing package have surfaced various accessibility violations that have to be addressed     

Version-Release number of selected component (if applicable):

    OpenShift 4.18.0

How reproducible:

    always

Steps to Reproduce:

    1. Update axe-core and cypress-axe to the latest versions
    2. Run test-cypress-console and run a Cypress test; I used other-routes.cy.ts

Actual results:

    The tests fail with various accessibility violations

Expected results:

    The tests pass without accessibility violations

Additional info:

    

Context Thread

As a maintainer of the SNO CI lane, I would like to ensure that the following test doesn't fail regularly as part of SNO CI.

[sig-architecture] platform pods in ns/openshift-e2e-loki should not exit an excessive amount of times

This issue is a symptom of a greater problem with SNO: there is downtime in resolving DNS after the upgrade reboot, when the DNS operator has an outage while it is deploying the new DNS pods. During that time, Loki exits after hitting the following error:

2024/10/23 07:21:32 OIDC provider initialization failed: Get "https://sso.redhat.com/auth/realms/redhat-external/.well-known/openid-configuration": dial tcp: lookup sso.redhat.com on 172.30.0.10:53: read udp 10.128.0.4:53104->172.30.0.10:53: read: connection refused

This issue is important because it can contribute to payload rejection in our blocking CI jobs.

Acceptance Criteria:

  • Problem is discussed with the networking team to understand the best path to resolution and decision is documented
  • Either the DNS operator or test are adjusted to address or mitigate the issue.
  • CI is free from the issue in test results for an extended period. (Need to confirm how often we're seeing it first before this period can be defined with confidence).

Description of problem:

Bare Metal UPI cluster

Nodes lose communication with other nodes, and this affects pod communication on those nodes as well. The issue can be fixed with an OVN database rebuild on the affected nodes, but the nodes eventually degrade and lose communication again. Note that, despite an OVN rebuild fixing the issue temporarily, host networking is set to true, so the kernel routing table is in use.

Update: also observed on vSphere with routingViaHost: false, ipForwarding: global configuration.

Version-Release number of selected component (if applicable):

 4.14.7, 4.14.30

How reproducible:

Can't reproduce locally, but reproducible and repeatedly occurring in the customer environment.

Steps to Reproduce:

Identify a host node whose pods can't be reached from other hosts in default namespaces (tested via openshift-dns). Observe that curls to that peer pod consistently time out. Tcpdumps to the target pod show that packets arrive and are acknowledged, but never route back to the client pod successfully (SYN/ACK is seen at the pod network layer but not at the Geneve layer, so packets are dropped before hitting the Geneve tunnel).

Actual results:

Nodes will repeatedly degrade and lose communication despite fixing the issue with an OVN db rebuild (the rebuild only provides hours/days of respite, not a permanent resolution).

Expected results:

Nodes should not be losing communication, and even if they did, it should not happen repeatedly.

Additional info:

What's been tried so far
========================

- Multiple OVN rebuilds on different nodes (works but node will eventually hit issue again)

- Flushing the conntrack (Doesn't work)

- Restarting nodes (doesn't work)

Data gathered
=============

- Tcpdump from all interfaces for dns-pods going to port 7777 (to segregate traffic)

- ovnkube-trace

- SOSreports of two nodes having communication issues before an OVN rebuild

- SOSreports of two nodes having communication issues after an OVN rebuild 

- OVS trace dumps of br-int and br-ex 


====

More data in nested comments below. 

linking KCS: https://access.redhat.com/solutions/7091399 

Description of problem:

In 4.8's installer#4760, the installer began passing oc adm release new ... a manifest so the cluster-version operator would manage a coreos-bootimages ConfigMap in the openshift-machine-config-operator namespace. installer#4797 reported issues with the 0.0.1-snapshot placeholder not getting substituted, and installer#4814 attempted to fix that issue by converting the manifest from JSON to YAML to align with the replacement regexp. But for reasons I don't understand, that manifest still doesn't seem to be getting replaced.
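
For illustration only (not the actual `oc adm release new` code), a minimal sketch of the kind of placeholder substitution involved; it also shows why the manifest's serialization matters, since the regexp has to match the serialized form:

// Sketch only: replace the 0.0.1-snapshot placeholder in a manifest with the
// real release version. Whether this works depends on the regexp actually
// matching the manifest's serialization, which is what the JSON-to-YAML
// conversion in installer#4814 was trying to line up.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	manifest := []byte(`apiVersion: v1
kind: ConfigMap
metadata:
  name: coreos-bootimages
  namespace: openshift-machine-config-operator
data:
  releaseVersion: 0.0.1-snapshot
`)

	placeholder := regexp.MustCompile(`0\.0\.1-snapshot`)
	fmt.Print(string(placeholder.ReplaceAll(manifest, []byte("4.8.0"))))
}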

Version-Release number of selected component (if applicable):

From 4.8 through 4.15.

How reproducible:

100%

Steps to Reproduce:

With 4.8.0:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64
$ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml

Actual results:

  releaseVersion: 0.0.1-snapshot

Expected results:

  releaseVersion: 4.8.0

or other output that matches the extracted release. We just don't want the 0.0.1-snapshot placeholder.

Additional info:

Reproducing in the latest 4.14 RC:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64
$ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml
  releaseVersion: 0.0.1-snapshot

Description of problem:

    When applying a profile with an isolated field containing a huge CPU list, the profile doesn't apply and no error is reported.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-11-26-075648

How reproducible:

    Everytime.

Steps to Reproduce:

    1. Create a profile as specified below:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  annotations:
    kubeletconfig.experimental: '{"topologyManagerPolicy":"restricted"}'
  creationTimestamp: "2024-11-27T10:25:13Z"
  finalizers:
  - foreground-deletion
  generation: 61
  name: performance
  resourceVersion: "3001998"
  uid: 8534b3bf-7bf7-48e1-8413-6e728e89e745
spec:
  cpu:
    isolated: 25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,371,118,374,104,360,108,364,70,326,72,328,76,332,96,352,99,355,64,320,80,336,97,353,8,264,11,267,38,294,53,309,57,313,103,359,14,270,87,343,7,263,40,296,51,307,94,350,116,372,39,295,46,302,90,346,101,357,107,363,26,282,67,323,98,354,106,362,113,369,6,262,10,266,20,276,33,289,112,368,85,341,121,377,68,324,71,327,79,335,81,337,83,339,88,344,9,265,89,345,91,347,100,356,54,310,31,287,58,314,59,315,22,278,47,303,105,361,17,273,114,370,111,367,28,284,49,305,55,311,84,340,27,283,95,351,5,261,36,292,41,297,43,299,45,301,75,331,102,358,109,365,37,293,56,312,63,319,65,321,74,330,125,381,13,269,42,298,44,300,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,225,481,236,492,152,408,203,459,214,470,166,422,207,463,212,468,130,386,155,411,215,471,188,444,201,457,210,466,193,449,200,456,248,504,141,397,167,423,191,447,181,437,222,478,252,508,128,384,139,395,174,430,164,420,168,424,187,443,232,488,133,389,157,413,208,464,140,396,185,441,241,497,219,475,175,431,184,440,213,469,154,410,197,453,249,505,209,465,218,474,227,483,244,500,134,390,153,409,178,434,160,416,195,451,196,452,211,467,132,388,136,392,146,402,138,394,150,406,239,495,173,429,192,448,202,458,205,461,216,472,158,414,159,415,176,432,189,445,237,493,242,498,177,433,182,438,204,460,240,496,254,510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480
    reserved: 0,256,1,257
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: true
    perPodPowerManagement: false
    realTime: true

    2. The worker-cnf node doesn't contain any kernel args associated with the above profile.
    3.
    

Actual results:

    The system doesn't boot with the kernel args associated with the above profile.

Expected results:

    The system should boot with the kernel args from the PerformanceProfile.

Additional info:

We can see MCO gets the details and creates the mc:

Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: machine-config-daemon[9550]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1=\"all\" --delete=psi=0 --delete=skew_tick=1 --delete=tsc=reliable --delete=rcupda>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: cbs=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,3>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 4,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,2>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: systemd.cpu_affinity=0,1,256,257 --append=iommu=pt --append=amd_pstate=guided --append=tsc=reliable --append=nmi_watchdog=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=is>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480 --append=nohz_full=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,49>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ppend=nosoftlockup --append=skew_tick=1 --append=rcutree.kthread_prio=11 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=20]"
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: client(id:machine-config-operator dbus:1.336 unit:crio-36c845a9c9a58a79a0e09dab668f8b21b5e46e5734a527c269c6a5067faa423b.scope uid:0) added; new total=1
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: Loaded sysroot

Actual Kernel args:
BOOT_IMAGE=(hd1,gpt3)/boot/ostree/rhcos-854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/vmlinuz-5.14.0-427.44.1.el9_4.x86_64 rw ostree=/ostree/boot.0/rhcos/854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/0 ignition.platform.id=metal ip=dhcp root=UUID=0068e804-432c-409d-aabc-260aa71e3669 rw rootflags=prjquota boot=UUID=7797d927-876e-426b-9a30-d1e600c1a382 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on

    

Description of problem:

The create button on the MultiNetworkPolicies and NetworkPolicies list pages is in the wrong position; it should be at the top right.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

    See https://github.com/kubernetes/kubernetes/issues/127352

Version-Release number of selected component (if applicable):

    See https://github.com/kubernetes/kubernetes/issues/127352    

How reproducible:

    See https://github.com/kubernetes/kubernetes/issues/127352

Steps to Reproduce:

    See https://github.com/kubernetes/kubernetes/issues/127352

Actual results:

    See https://github.com/kubernetes/kubernetes/issues/127352

Expected results:

    See https://github.com/kubernetes/kubernetes/issues/127352

Additional info:

    See https://github.com/kubernetes/kubernetes/issues/127352

Description of problem:

Checked in 4.18.0-0.nightly-2024-12-05-103644 / 4.19.0-0.nightly-2024-12-04-03122. In the admin console, go to "Observe -> Metrics", execute a query that returns results (for example "cluster_version"), and click the kebab menu. "Show all series" is shown under the list, which is wrong; it should be "Hide all series". Clicking "Show all series" unselects all series, yet "Hide all series" keeps showing under the menu; clicking it toggles the series between selected and unselected, but the label always reads "Hide all series". See recording: https://drive.google.com/file/d/1kfwAH7FuhcloCFdRK--l01JYabtzcG6e/view?usp=drive_link

The same issue exists in the developer console for 4.18+; 4.17 and below do not have it.

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always with 4.18+

Steps to Reproduce:

see the description

Actual results:

The Hide/Show all series state under the "Observe -> Metrics" kebab menu is wrong.

Expected results:

The Hide/Show all series label should match the actual selection state.

Description of problem:

The "go to" arrow and the new-doc link icon are no longer aligned with their text.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-12-144418    

How reproducible:

Always    

Steps to Reproduce:

    1. Go to the Home -> Overview page
    2.
    3.
    

Actual results:

The "go to" arrow and the new-doc link icon are not horizontally aligned with their text any more.

Expected results:

The icons and text should be aligned.

Additional info:

    screenshot https://drive.google.com/file/d/1S61XY-lqmmJgGbwB5hcR2YU_O1JSJPtI/view?usp=drive_link 

Description of problem:

CAPI install fails with ImageReconciliationFailed when creating the VPC custom image.

Version-Release number of selected component (if applicable):

 4.19.0-0.nightly-2024-12-06-101930    

How reproducible:

always    

Steps to Reproduce:

1. Add the following to install-config.yaml:
featureSet: CustomNoUpgrade
featureGates: [ClusterAPIInstall=true]
2. Create an IBM Cloud cluster with IPI.

Actual results:

level=info msg=Done creating infra manifests
level=info msg=Creating kubeconfig entry for capi cluster ci-op-h3ykp5jn-32a54-xprzg
level=info msg=Waiting up to 30m0s (until 11:25AM UTC) for network infrastructure to become ready...
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 30m0s: client rate limiter Wait returned an error: context deadline exceeded  

in IBMVPCCluster-openshift-cluster-api-guests log

reason: ImageReconciliationFailed
    message: 'error failure trying to create vpc custom image: error unknown failure
      creating vpc custom image: The IAM token that was specified in the request has
      expired or is invalid. The request is not authorized to access the Cloud Object
      Storage resource.'  

Expected results:

Cluster creation succeeds.

Additional info:

The resources created when the install failed:
ci-op-h3ykp5jn-32a54-xprzg-cos  dff97f5c-bc5e-4455-b470-411c3edbe49c crn:v1:bluemix:public:cloud-object-storage:global:a/fdc2e14cf8bc4d53a67f972dc2e2c861:f648897a-2178-4f02-b948-b3cd53f07d85::
ci-op-h3ykp5jn-32a54-xprzg-vpc  is.vpc crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::vpc:r022-46c7932d-8f4d-4d53-a398-555405dfbf18
copier-resurrect-panzer-resistant  is.security-group crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::security-group:r022-2367a32b-41d1-4f07-b148-63485ca8437b
deceiving-unashamed-unwind-outward  is.network-acl crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::network-acl:r022-b50286f6-1052-479f-89bc-fc66cd9bf613
    

Description of problem:

node-joiner --pxe does not rename the PXE artifacts

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

    1. node-joiner --pxe

Actual results:

   agent*.* artifacts are generated in the working dir

Expected results:

    In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img
* node.x86_64-rootfs.img
* node.x86_64-vmlinuz
* node.x86_64.ipxe (if required)

Additional info:

    

Today, when source images are by digest only, oc-mirror applies a default tag:

  • for operators and additional images it is the digest
  • for helm images it is digestAlgorithm+"-"+digest

This should be unified.
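
A minimal sketch of one possible unified rule; the helper name and the exact tag format are assumptions rather than oc-mirror's current behaviour:

// Sketch only: derive a single default tag from a digest-only reference, the
// same way for operators, additional images, and helm images. The exact rule
// (algorithm prefix, truncation length) is an assumption for illustration.
package main

import (
	"fmt"
	"strings"
)

func defaultTag(digest string) string {
	// digest looks like "sha256:4d21ae15d668..."; turn it into a tag-safe
	// string such as "sha256-4d21ae15d668".
	algo, hex, found := strings.Cut(digest, ":")
	if !found {
		return digest
	}
	if len(hex) > 12 {
		hex = hex[:12]
	}
	return algo + "-" + hex
}

func main() {
	fmt.Println(defaultTag("sha256:4d21ae15d668d802ed5a35068b724f2e12f47d5c"))
}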

Component Readiness has found a potential regression in the following test:

install should succeed: infrastructure

installer fails with:

time="2024-10-20T04:34:57Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded" 

Significant regression detected.
Fishers Exact probability of a regression: 99.96%.
Test pass rate dropped from 98.94% to 89.29%.

Sample (being evaluated) Release: 4.18
Start Time: 2024-10-14T00:00:00Z
End Time: 2024-10-21T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 98.94%
Successes: 93
Failures: 1
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&FeatureSet=default&Installer=ipi&Network=ovn&NetworkAccess=default&Platform=azure&Scheduler=default&SecurityMode=default&Suite=serial&Topology=ha&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=Other&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Installer%20%2F%20openshift-installer&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20azure%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-10-21%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-10-14%2000%3A00%3A00&testId=cluster%20install%3A3e14279ba2c202608dd9a041e5023c4c&testName=install%20should%20succeed%3A%20infrastructure

Description of problem:

    The period is placed inside the quotes of the missingKeyHandler i18n error 

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always when there is a missingKeyHandler error

Steps to Reproduce:

    1. Check browser console
    2. Observe the period is placed inside the quotes
    3.
    

Actual results:

    It is placed inside the quotes

Expected results:

    It should be placed outside the quotes

Additional info:

    

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

One issue is that the ovnk build needed to test the PR takes close to an hour, and sometimes closer to 2 hours. One major problem is that the 'dnf install' done
in the ovnk Dockerfile(s) is not able to reach certain dnf repositories that are
default in the base openshift image. They may be behind the Red Hat VPN, which
our test clusters don't have access to. If the problem is a timeout, the default
dnf timeout is 30s, and that can cause 5+ minutes of delay. We do more than one
dnf install in our Dockerfiles too, so the problem is amplified.

Another issue is that the default job timeout is 4h0m; with a ~1h build time and other
time-consuming steps like must-gather, watchers, and long e2e runs, getting to that
4h is pretty easy.

User Story

As a developer looking to contribute to OCP BuildConfig, I want contribution guidelines that make it easy for me to build and test all the components.

Background

Much of the contributor documentation for openshift/builder is either extremely out of date or buggy. This hinders the ability for newcomers to contribute.

Approach

  1. Document dependencies needed to build openshift/builder from source.
  2. Update "dev" container image for openshift/builder so teams can experiment locally.
  3. Provide instructions on how to test
    1. "WIP Pull Request" process
    2. "Disable operators" mode.
    3. Red Hatter instructions: using cluster-bot

Acceptance Criteria

  • New contributors can compile openshift/builder from GitHub instructions
  • New contributors can test their code changes on an OpenShift instance
  • Red Hatters can test their code changes with cluster-bot.

Description of problem:

s2i conformance test appears to fail permanently on OCP 4.16.z
    

Version-Release number of selected component (if applicable):

4.16.z
    

How reproducible:

Since 2024-11-04 at least
    

Steps to Reproduce:

    Run OpenShift build test suite in PR
    

Actual results:

Test fails - root cause appears to be that a built/deployed pod crashloops
    

Expected results:

Test succeeds
    

Additional info:

Job history https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-openshift-controller-manager-release-4.16-e2e-gcp-ovn-builds
    

CMO should create and deploy a ConfigMap that contains the data for accelerators monitoring. When CMO creates the node-exporter DaemonSet, it should mount the ConfigMap into the node-exporter pods (a rough sketch follows below).
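
A minimal sketch, assuming illustrative names for the ConfigMap, mount path, and container (none of these are CMO's actual values):

// Sketch only: wire a ConfigMap with accelerators monitoring data into the
// node-exporter DaemonSet pod template.
package monitoring

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func mountAcceleratorsConfig(ds *appsv1.DaemonSet) {
	// Add the ConfigMap as a volume on the pod template.
	ds.Spec.Template.Spec.Volumes = append(ds.Spec.Template.Spec.Volumes, corev1.Volume{
		Name: "accelerators-config",
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: "node-exporter-accelerators"},
			},
		},
	})
	// Mount it read-only into the node-exporter container.
	for i := range ds.Spec.Template.Spec.Containers {
		c := &ds.Spec.Template.Spec.Containers[i]
		if c.Name != "node-exporter" {
			continue
		}
		c.VolumeMounts = append(c.VolumeMounts, corev1.VolumeMount{
			Name:      "accelerators-config",
			MountPath: "/etc/node-exporter/accelerators",
			ReadOnly:  true,
		})
	}
}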

Background 

The monitoring-plugin is still using Patternfly v4; it needs to be upgraded to Patternfly v5. This major version release deprecates components in the monitoring-plugin. These components will need to be replaced/removed to accommodate the version update. 

We need to remove the deprecated components from the monitoring plugin, extending the work from CONSOLE-4124

Work to be done: 

  • upgrade monitoring-plugin > package.json > Patternfly v5
  • Remove/replace any deprecated components after upgrading to Patternfly v5. 

Outcome 

  • The monitoring-plugin > package.json will be upgraded to use Patternfly v5
  • Any deprecated components from Patternfly v4 will be removed or replaced by similar Patternfly v5 components

One of our customers observed this issue. To reproduce, in my test cluster I intentionally increased the overall CPU limits to over 200% and monitored the cluster for more than 2 days. However, I did not see the KubeCPUOvercommit alert, which should ideally trigger after 10 minutes of overcommitment.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                2654m (75%)    8450m (241%)
  memory             5995Mi (87%)   12264Mi (179%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)

 

OCP console --> Observe --> Alerting --> alerting rules, and select the `KubeCPUOvercommit` alert.

Expression:

sum by (cluster) (namespace_cpu:kube_pod_container_resource_requests:sum{job="kube-state-metrics"})
  - (sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})
     - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}))
  > 0
and
(sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})
  - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}))
  > 0

The following Insights APIs use duration attributes:

  • insightsoperator.operator.openshift.io
  • datagathers.insights.openshift.io

The kubebuilder validation patterns are defined as

^0|([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$

and

^([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$

Unfortunately this is not enough, and it fails when updating the resource with, for example, the value "2m0s".

The validation pattern must allow these values.
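
An illustrative sketch of a relaxed kubebuilder pattern that also accepts normalized values such as "2m0s" (and plain "0"); the type name and the final pattern chosen for the Insights APIs are assumptions:

// Sketch only: a kubebuilder marker whose pattern accepts "2m0s". Allowing a
// leading zero in each unit segment is what lets normalized durations pass;
// the pattern actually adopted by the Insights APIs may differ.
package v1alpha1

type GathererConfig struct {
	// interval is a Go-style duration string, e.g. "2m", "2m0s", or "1h30m".
	// +kubebuilder:validation:Pattern=`^(0|([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+)$`
	Interval string `json:"interval,omitempty"`
}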

User Story:
As an OpenShift engineer, I want to create a PR for the machine-api refactoring of feature gate parameters, so that we can pull the logic out of Neil's PR that removes individual feature gate parameters in favor of the new FeatureGate mutable map (see the sketch after this card).

Description:
< Record any background information >

Acceptance Criteria:
< Record how we'll know we're done >

Other Information:
< Record anything else that may be helpful to someone else picking up the card >
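
A rough sketch of the general shape of a mutable feature-gate map; the package and type names are illustrative, not the machine-api code:

// Sketch only: replace individual feature-gate parameters with one mutable,
// concurrency-safe map keyed by gate name.
package featuregates

import "sync"

type Gate string

type MutableGates struct {
	mu    sync.RWMutex
	gates map[Gate]bool
}

func New(initial map[Gate]bool) *MutableGates {
	g := &MutableGates{gates: map[Gate]bool{}}
	for name, enabled := range initial {
		g.gates[name] = enabled
	}
	return g
}

// Enabled reports whether a gate is currently on.
func (g *MutableGates) Enabled(name Gate) bool {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return g.gates[name]
}

// Set flips a gate at runtime, e.g. in response to a FeatureGate status update.
func (g *MutableGates) Set(name Gate, enabled bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.gates[name] = enabled
}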

issue created by splat-bot

Description of problem:

The openstack-manila-csi-controllerplugin-csi-driver container is not functional on the first run; it needs to restart once and then it's good. This causes HCP e2e to fail on the EnsureNoCrashingPods test.

Version-Release number of selected component (if applicable):

4.19, 4.18

How reproducible:

Deploy Shift on Stack with Manila available in the cloud.

Actual results:

The openstack-manila-csi-controllerplugin pod will restart once and then it'll be functional.

Expected results:

No restart should be needed. This is likely an orchestration issue.

 

Issue present in Standalone clusters: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-storage-operator/544/pull-ci-openshift-cluster-storage-operator-master-e2e-openstack-manila-csi/1862115443632771072/artifacts/e2e-openstack-manila-csi/gather-extra/artifacts/pods/openshift-manila-csi-driver_openstack-manila-csi-nodeplugin-5cqcw_csi-driver_previous.log

Also present in HCP clusters: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_hypershift/5138/pull-ci-openshift-hypershift-main-e2e-openstack/1862138522127831040/artifacts/e2e-openstack/hypershift-openstack-e2e-execute/artifacts/TestCreateCluster/namespaces/e2e-clusters-kktnx-example-cq7xp/core/pods/logs/openstack-manila-csi-controllerplugin-675ff65bf5-gqf65-csi-driver-previous.log

  Security Tracking Issue

Do not make this issue public.

Flaw:


Non-linear parsing of case-insensitive content in golang.org/x/net/html
https://bugzilla.redhat.com/show_bug.cgi?id=2333122

An attacker can craft an input to the Parse functions that would be processed non-linearly with respect to its length, resulting in extremely slow parsing. This could cause a denial of service.


Description of problem:

We have two EAP application server clusters, and for each of them a service is created. We have a route configured to one of the services. When we update the route programmatically to point to the second service/cluster, the response shows it is still attached to the same service.

Steps to Reproduce:
1. Create two separate clusters of the EAP servers
2. Create one service for the first cluster (hsc1) and one for the second one (hsc2)
3. Create a route for the first service (hsc1)
4. Start both of the clusters and assure the replication works
5. Send a request to the first cluster using the route URL - response should contain identification of the first cluster (hsc-1-xxx)

[2024-08-29 11:30:44,544] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 11:30:44,654] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

6. Update the route programmatically to redirect to the second service (hsc2)

...
builder.editSpec().editTo().withName("hsc2").endTo().endSpec();
...

7. Send the request again using the same route - the response still contains the identification of the first cluster

[2024-08-29 11:31:45,098] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

although the service was updated in the route yaml:

...
kind: Service
    name: hsc2

When creating a new route hsc2 for the service hsc2 and using it for the third request, we can see the second cluster is targeted correctly, with its own separate replication working:

[2024-08-29 13:43:13,679] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:43:13,790] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:44:14,056] INFO - [ForkJoinPool-1-worker-1] responseString after second route for service hsc2 was used hsc-2-2-614582a9-3c71-4690-81d3-32a616ed8e54 1 with route hsc2-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

I also made a different attempt: I stopped the test in debug mode after the two requests were executed.

[2024-08-30 14:23:43,101] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-30 14:23:43,210] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

Then I manually changed the route yaml to use the hsc2 service and sent the request manually:

curl http://hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com/Counter
hsc-2-2-84fa1d7e-4045-4708-b89e-7d7f3cd48541 1

responded correctly with the second service/cluster.

Then I resumed the test run in debug mode and sent the request programmatically:

[2024-08-30 14:24:59,509] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

responded with the wrong first service/cluster.

Actual results: Route directs to the same service and EAP cluster

Expected results: After the update the route should direct to the second service and EAP cluster

Additional info:
This issue started to occur with OCP 4.16. Going through the 4.16 release notes and the suggested route configuration did not reveal any configuration changes that should have been applied.

The code of MultipleClustersTest.twoClustersTest, where this issue was discovered, is available here.

All the logs as well as services and route yamls are attached to the EAPQE jira.

CBO-installed Ironic unconditionally has TLS, even though we don't do proper host validation just yet (see bug OCPBUGS-20412). Ironic in the installer does not use TLS (mostly for historical reasons). Now that OCPBUGS-36283 added a TLS certificate for virtual media, we can use the same certificate for the Ironic API. At least initially, this will involve disabling host validation for IPA.

Description of problem:

OpenShift Virtualization allows hotplugging block volumes into its pods, which relies on the fact that changing the cgroup corresponding to the PID of the container suffices.

crun is test-driving some changes it integrated recently: it now configures two cgroups, `*.scope` and a sub-cgroup called `container`, whereas before the parent existed as a sort of no-op (it wasn't configured, so all devices were allowed, for example). This results in the volume hotplug breaking, since applying the device filter to the sub-cgroup is no longer enough.

Version-Release number of selected component (if applicable):

4.18.0 RC2

How reproducible:

100%    

Steps to Reproduce:

    1. Block volume hotplug to VM
    2.
    3.
    

Actual results:

    Failure

Expected results:

    Success

Additional info:

https://kubevirt.io/user-guide/storage/hotplug_volumes/