Back to index

4.19.0-0.okd-scos-2024-12-24-102529

Jump to: Complete Features | Incomplete Features | Incomplete Epics | Other Complete | Other Incomplete |

Changes from 4.18.0-okd-scos.ec.0

Note: this page shows the Feature-Based Change Log for a release

Complete Features

These features were completed when this image was assembled

Goal:
Track Insights Operator Data Enhancements epic in 2024

 

 

 

 

Description of problem:

    Context
OpenShift Logging is migrating from Elasticsearch to Loki. While the option to use Loki has existed for quite a while, information about the end of Elasticsearch support has not been available until recently. With the information available now, we can expect more and more customers to migrate and hit the issue described in INSIGHTOCP-1927.
P.S. Note the bar chart in INSIGHTOCP-1927, which shows how frequently the related KCS is linked in customer cases.
Data to gather
LokiStack custom resources (any name, any namespace)
Backports
The option to use Loki has been available since Logging 5.5, whose compatibility started at OCP 4.9. Considering the OCP life cycle, backports up to OCP 4.14 would be nice.
Unknowns
Since Logging 5.7, Logging supports installation of multiple instances in customer namespaces. The Insights Operator would have to look for the CRs in all namespaces, which poses the following questions:

What is the expected number of the LokiStack CRs in a cluster?
Should the Insights operator look for the resource in all namespaces? Is there a way to narrow down the scope?

The CR will contain the name of a customer namespace, which is sensitive information.
What is the API group of the CR? Is there a risk of LokiStack CRs in customer namespaces that would NOT be related to OpenShift Logging?
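
For illustration only, a minimal sketch of how a gatherer could list LokiStack CRs across all namespaces using the Kubernetes dynamic client; the loki.grafana.com/v1 coordinates and the anonymization note are assumptions rather than the Insights Operator's actual implementation:

    package gather

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/rest"
    )

    // listLokiStacks lists LokiStack custom resources in every namespace.
    // The group/version/resource is assumed from the Loki Operator API
    // (loki.grafana.com/v1, lokistacks) and should be verified.
    func listLokiStacks(ctx context.Context, cfg *rest.Config) error {
        client, err := dynamic.NewForConfig(cfg)
        if err != nil {
            return err
        }
        gvr := schema.GroupVersionResource{
            Group:    "loki.grafana.com",
            Version:  "v1",
            Resource: "lokistacks",
        }
        // An empty namespace (metav1.NamespaceAll) lists the CR cluster-wide.
        list, err := client.Resource(gvr).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }
        for _, item := range list.Items {
            // Namespace names can identify customers, so a real gatherer
            // would need to anonymize them before upload.
            fmt.Printf("found LokiStack %s in namespace %s\n", item.GetName(), item.GetNamespace())
        }
        return nil
    }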



SME
Oscar Arribas Arribas

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

N/A    

Actual results:

    

Expected results:

    

Additional info:

    

Incomplete Features

When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release

Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.

Goal

Primary user-defined networks can be managed from the UI and the user flow is seamless.

User Stories

  • As a cluster admin,
    I want to use the UI to define a ClusterUserDefinedNetwork, assigned with a namespace selector.
  • As a project admin,
    I want to use the UI to define a UserDefinedNetwork in my namespace.
  • As a project admin,
    I want to be prompted to create a UserDefinedNetwork before I create any Pods/VMs in my new project.
  • As a project admin running VMs in a namespace with UDN defined,
    I expect the "pod network" to be called "user-defined primary network",
    and I expect that when using it, the proper network binding is used.
  • As a project admin,
    I want to use the UI to request a specific IP for my VM connected to UDN.

UX doc

https://docs.google.com/document/d/1WqkTPvpWMNEGlUIETiqPIt6ZEXnfWKRElBsmAs9OVE0/edit?tab=t.0#heading=h.yn2cvj2pci1l

Non-Requirements

  • <List of things not included in this epic, to alleviate any doubt raised during the grooming process.>

Notes

Placeholder feature for ccx-ocp-core maintenance tasks.

This epic tracks "business as usual" requirements, enhancements, and bug fixes for the Insights Operator.

It looks like the insights-operator doesn't work with IPv6; there are log errors like this:

E1209 12:20:27.648684   37952 run.go:72] "command failed" err="failed to run groups: failed to listen on secure address: listen tcp: address fd01:0:0:5::6:8000: too many colons in address" 

It's showing up in metal techpreview jobs.

The listen address isn't being constructed correctly; use net.JoinHostPort instead of fmt.Sprintf. Some more details here: https://github.com/stbenjam/no-sprintf-host-port. There's a non-default linter in golangci-lint for this.
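
A minimal example of the difference: fmt.Sprintf concatenates the raw IPv6 literal, while net.JoinHostPort brackets it so net.Listen can parse it.

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        host := "fd01:0:0:5::6" // IPv6 address from the error above
        port := "8000"

        // Broken: yields "fd01:0:0:5::6:8000", which net.Listen rejects
        // with "too many colons in address".
        fmt.Println(fmt.Sprintf("%s:%s", host, port))

        // Correct: brackets the IPv6 literal -> "[fd01:0:0:5::6]:8000".
        fmt.Println(net.JoinHostPort(host, port))
    }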

 

Component Readiness has found a potential regression in the following test:

[sig-architecture] platform pods in ns/openshift-insights should not exit an excessive amount of times

Test has a 56.36% pass rate, but 95.00% is required.

Sample (being evaluated) Release: 4.18
Start Time: 2024-12-02T00:00:00Z
End Time: 2024-12-09T16:00:00Z
Success Rate: 56.36%
Successes: 31
Failures: 24
Flakes: 0

View the test details report for additional context.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Background

The admin console's alert rule details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.

Outcomes

That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.

 

Ensure removal of deprecated patternfly components from kebab-dropdown.tsx and alerting.tsx once this story and OU-257 are completed.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Description

“In order to allow internal teams to define their dashboards in Perses, we as the Observability UI Team need to add support on the console to display Perses dashboards.”

Goals & Outcomes

Product Requirements:

  • The console dashboards plugin is able to render dashboards coming from Perses

 

Background

In order to allow customers and internal teams to see dashboards created using Perses, we must add them as new elements on the current dashboard list

Outcomes

  • When navigating to Monitoring / Dashboards, Perses dashboards are listed alongside the current console dashboards. The extension point is backported to 4.14

Steps

  • COO (monitoring-console-plugin)
    • Add the Perses dashboards feature called "perses-dashboards" in the monitoring plugin.
    • Create a function to fetch dashboards from the Perses API
  • CMO (monitoring-plugin)
    • An extension point is added to inject the function that fetches dashboards from the Perses API and merges the results with the current console dashboards (a sketch of such a fetch function follows)
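
The console plugins themselves are written in TypeScript; purely to illustrate the fetch step named above, here is a Go sketch against a hypothetical /api/v1/dashboards endpoint (the real Perses API paths and payload schema may differ):

    package dashboards

    import (
        "context"
        "encoding/json"
        "fmt"
        "net/http"
    )

    // persesDashboard captures only the fields a dashboard list would need;
    // the real Perses schema is richer.
    type persesDashboard struct {
        Metadata struct {
            Name    string `json:"name"`
            Project string `json:"project"`
        } `json:"metadata"`
    }

    // fetchPersesDashboards retrieves dashboards from a Perses endpoint so
    // they can be merged with the existing console dashboards.
    func fetchPersesDashboards(ctx context.Context, baseURL string) ([]persesDashboard, error) {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/api/v1/dashboards", nil)
        if err != nil {
            return nil, err
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return nil, fmt.Errorf("perses API returned %s", resp.Status)
        }
        var list []persesDashboard
        if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
            return nil, err
        }
        return list, nil
    }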
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

tldr: three basic claims, the rest is explanation and one example

  1. We cannot improve long term maintainability solely by fixing bugs.
  2. Teams should be asked to produce designs for improving maintainability/debugability.
  3. Specific maintenance items (or investigation of maintenance items), should be placed into planning as peer to PM requests and explicitly prioritized against them.

While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but it doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of one such outcome of focusing on bugs vs. quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.

We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.


Relevant links:

Feature Overview

As a cluster-admin, I want to run updates in discrete steps and update the control plane and worker nodes independently.
I also want to back up and restore in case of a problematic upgrade.

 

Background:

This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of completed tasks.

  1. OTA-700 Reduce False Positives (such as Degraded) 
  2. OTA-922 - Better able to show the progress made in each discrete step 
  3. [Covered by status command] Better visibility into any errors during the upgrades, and documentation of what the error means and how to recover. 

Goals

  1. Have an option to do upgrades in more discrete steps under admin control. Specifically, these steps are: 
    • Control plane upgrade
    • Worker nodes upgrade
    • Workload enabling upgrade (i.e. Router, other components) or infra nodes
  2. A user experience around an end-to-end back-up and restore after a failed upgrade 
  3. MCO-530 - Support in Telemetry for the discrete steps of upgrades 

References

Epic Goal

  • Eliminate the gap between measured availability and Available=true

Why is this important?

  • Today it's not uncommon, even for CI jobs, to have multiple operators which blip through either Degraded=True or Available=False conditions
  • We should assume that if our CI jobs do this then when operating in customer environments with higher levels of chaos things will be even worse
  • We have had multiple customers express that they've pursued rolling back upgrades because the cluster is telling them that portions of the cluster are Degraded or Unavailable when they're actually not
  • Since our product is self-hosted, we can reasonably expect that the instability that we experience on our platform workloads (kube-apiserver, console, authentication, service availability) will also impact customer workloads that run exactly the same way: we're just better at detecting it.

Scenarios

  1. In all of the following, assume standard 3 master 0 worker or 3 master 2+ worker topologies
  2. Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of upgrade
  3. Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (ex: metal's DHCP server is singleton)
  4. Address all identified issues
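
As a sketch of what such a check could look like (using the openshift/client-go config clientset; not the actual CI implementation), the snippet below lists every ClusterOperator that is currently Available=False or Degraded=True:

    package availability

    import (
        "context"
        "fmt"

        configv1 "github.com/openshift/api/config/v1"
        configclient "github.com/openshift/client-go/config/clientset/versioned"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/rest"
    )

    // violatingOperators returns every operator currently reporting
    // Available!=True or Degraded=True, i.e. the conditions the jobs above
    // would flag during an upgrade.
    func violatingOperators(ctx context.Context, cfg *rest.Config) ([]string, error) {
        client, err := configclient.NewForConfig(cfg)
        if err != nil {
            return nil, err
        }
        operators, err := client.ConfigV1().ClusterOperators().List(ctx, metav1.ListOptions{})
        if err != nil {
            return nil, err
        }
        var violations []string
        for _, co := range operators.Items {
            for _, cond := range co.Status.Conditions {
                unavailable := cond.Type == configv1.OperatorAvailable && cond.Status != configv1.ConditionTrue
                degraded := cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue
                if unavailable || degraded {
                    violations = append(violations, fmt.Sprintf("%s: %s=%s (%s)", co.Name, cond.Type, cond.Status, cond.Reason))
                }
            }
        }
        return violations, nil
    }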

Acceptance Criteria

  • openshift/enhancements CONVENTIONS outlines these requirements
  • CI - Release blocking jobs include these new/updated tests
  • Release Technical Enablement - N/A; if we do this we should need no docs
  • No outstanding identified issues

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure all teams addressed them. That list is in this query; teams will be asked to address everything on it as a 4.9 blocker+ bug, and we will re-evaluate status closer to 4.9 code freeze to see which items may be deferred to 4.10
    https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Tests in place
  • DEV - No outstanding failing tests
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:

Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".

Definition of done:

  • Same as OTA-362
  • File bugs or link the existing issues
  • If a bug exists, add the tests to the exception list.
  • Unless tests are in the exception list, they should fail if we see Degraded != False.

Feature Overview

This feature aims to enable customers of OCP to integrate 3rd party KMS solutions for encrypting etcd values at rest in accordance with:

https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/

Goals

  • Bring KMS v2 API to beta|stable level
  • Create/expose mechanisms for customers to plug in containers/operators which can serve the API server's needs (can it be an operator, something provided via CoreOS layering, vanilla container spec provided to API server operator?)
  • Provide similar UX experience for all of self-hosted, hypershift, SNO scenarios
  • Provide example container/operator for the mechanism
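
To make the integration surface concrete, here is a minimal sketch of the shape of the KMS v2 plugin contract (Encrypt, Decrypt, Status). The types are simplified stand-ins for illustration; the real interface and generated gRPC types live in k8s.io/kms and may differ in detail:

    package kmsplugin

    import "context"

    // Simplified stand-ins for the KMS v2 request/response messages.
    type EncryptRequest struct {
        Plaintext []byte
        UID       string
    }

    type EncryptResponse struct {
        Ciphertext  []byte
        KeyID       string
        Annotations map[string][]byte
    }

    type DecryptRequest struct {
        Ciphertext  []byte
        UID         string
        KeyID       string
        Annotations map[string][]byte
    }

    type StatusResponse struct {
        Version string // expected to report "v2"
        Healthz string // "ok" when the external KMS is reachable
        KeyID   string // current key; lets the API server detect rotation
    }

    // Service is the contract a KMS v2 plugin fulfils for the kube-apiserver:
    // envelope-encrypt and decrypt DEKs, and report plugin/KMS health.
    type Service interface {
        Encrypt(ctx context.Context, req *EncryptRequest) (*EncryptResponse, error)
        Decrypt(ctx context.Context, req *DecryptRequest) ([]byte, error)
        Status(ctx context.Context) (*StatusResponse, error)
    }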

General Prioritization for the Feature

  1. Approved design for detection & actuation for stand-alone OCP clusters.
    1. How to detect a problem like an expired/lost key and no contact with the KMS provider?
    2. How to inform/notify about the situation, even at the node level
  2. Tech Preview (Feature gated) enabling Kube-KMS v2 for partners to start working on KMS plugin provider integrations:
    1. Cloud: (priority Azure > AWS > Google)
      1. Azure KMS
      2. Azure Dedicated HSM
      3. AWS KMS
      4. AWS CloudHSM
      5. Google Cloud HSM
    2. On-premise:
      1. HashiCorp Vault
      2. EU FSI & EU Telco KMS/HSM top-2 providers
  3. GA after at least one stable KMS plugin provider

Scenario:

For an OCP cluster with external KMS enabled:

  • The customer loses the key to the external KMS 
  • The external KMS service is degraded or unavailable

How do the above scenario(s) impact the cluster? The API may be unavailable

 

Goal:

  • Detection: The ability to detect these failure condition(s) and make it visible to the cluster admin.
  • Actuation: To what extent can we restore the cluster? (API availability, control plane operators). Recovering customer data is out of scope

 

Investigation Steps:

Detection:

  • How do we detect issues with the external KMS?
  • How do we detect issues with the KMS plugins?
  • How do we surface the information that an issue happened with KMS?
    • Metrics / Alerts? Will not work with SNO
    • ClusterOperatorStatus?

Actuation:

  • Is the control-plane self-recovering?
  • What actions are required for the user to recover the cluster partially/completely?
  • Complete: kube-apiserver? KMS plugin?
  • Partial: kube-apiserver? Etcd? KMS plugin?

User stories that might result in KCS:

  • KMS / KMS plugin unavailable
    • Is there any degradation? (most likely not with kms v2)
  • KMS unavailable and DEK not in cache anymore
    • Degradation will most likely occur, but what happens when the KMS becomes available again? Is the cluster self-recovering?
  • Key has been deleted and later recovered
    • Is the cluster self-recovering?
  • KMS / KMS plugin misconfigured
    • Is the apiserver rolled-back to the previous healthy revision?
    • Is the misconfiguration properly surfaced?

Plugins research:

  • What are the pros and cons of managing the plugins ourselves vs leaving that responsibility to the customer?
  • What is the list of KMS we need to support?
  • Do all the KMS plugins we need to use support KMS v2? If not reach out to the provider
  • HSMs?

POCs:

Acceptance Criteria:

  • Document the detection and actuation process in a KEP.
  • Generate new Jira work items based on the new findings.

Feature Overview (aka. Goal Summary)

Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

This is also a key requirement for backup and DR solutions.

https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot
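
To illustrate the user-facing API, a hedged sketch that creates a VolumeGroupSnapshot selecting PVCs by label via the dynamic client; the groupsnapshot.storage.k8s.io/v1beta1 coordinates and field names follow the upstream KEP and should be verified against the shipped release, and the class and label names are hypothetical:

    package groupsnapshot

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
    )

    // createGroupSnapshot snapshots all PVCs matching a label selector as a
    // single, consistent group.
    func createGroupSnapshot(ctx context.Context, client dynamic.Interface, namespace string) error {
        gvr := schema.GroupVersionResource{
            Group:    "groupsnapshot.storage.k8s.io",
            Version:  "v1beta1",
            Resource: "volumegroupsnapshots",
        }
        vgs := &unstructured.Unstructured{Object: map[string]interface{}{
            "apiVersion": "groupsnapshot.storage.k8s.io/v1beta1",
            "kind":       "VolumeGroupSnapshot",
            "metadata":   map[string]interface{}{"name": "app-group-snap"},
            "spec": map[string]interface{}{
                "volumeGroupSnapshotClassName": "example-group-snap-class", // hypothetical class
                "source": map[string]interface{}{
                    "selector": map[string]interface{}{
                        "matchLabels": map[string]interface{}{"app": "my-database"}, // hypothetical label
                    },
                },
            },
        }}
        _, err := client.Resource(gvr).Namespace(namespace).Create(ctx, vgs, metav1.CreateOptions{})
        return err
    }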

Goals (aka. expected user outcomes)

Productise the volume group snapshots feature as Tech Preview: provide docs and testing, as well as a feature gate to enable it, so that customers and partners can test it in advance.

Requirements (aka. Acceptance Criteria):

The feature should be graduated to beta upstream to become TP in OCP. Tests and CI must pass, and a feature gate should allow customers and partners to easily enable it. We should identify all OCP-shipped CSI drivers that support this feature and configure them accordingly.

Use Cases (Optional):

 

  1. As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
  2. As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.
  3. As a customer I want early access to test the VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs.

Out of Scope

CSI drivers development/support of this feature.

Background

Provide any additional context that is needed to frame the feature. Initial completion during Refinement status.

Customer Considerations

Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.

Documentation Considerations

Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.

Interoperability Considerations

Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Add Volume Group Snapshots as Tech Preview. This is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.

We will rely on the newly beta promoted feature. This feature is driver dependent.

This will need a new external-snapshotter rebase plus removal of the feature gate check in csi-snapshot-controller-operator. Clusters that are freshly installed or upgraded from an older release will have the group snapshot v1beta1 API enabled, support for it enabled in the snapshot-controller, and the corresponding external-snapshotter sidecar shipped.

No opt-in, no opt-out.

OCP itself will not ship any CSI driver that supports it.

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

This is also a key requirement for backup and DR solutions, especially for OCP Virt.

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1. As a storage vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my driver support.
  2. As a backup vendor I want to have early access to the VolumeGroupSnapshot feature to test and validate my backup solution.
  3. As a customer I want early access to test the VolumeGroupSnapshot feature in order to take consistent snapshots of my workloads that are relying on multiple PVs

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

External snapshotter rebase to the upstream version that includes the beta API.

 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - STOR / ODF
  • Documentation - STOR
  • QE - STOR / ODF
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Since we don't ship any driver with OCP that supports the feature, we need to have testing with ODF

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

We're looking at enabling it by default, which could introduce risk. Since the feature has only recently landed upstream, we will need to rebase on a newer external snapshotter than we initially targeted. 

When moving to v1 there may be non-backward-compatible changes.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Feature Overview (aka. Goal Summary)  

Crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.

Benefits of Crun are covered here: https://github.com/containers/crun 

 

FAQ.:  https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit

***Note -> making Crun the default does not mean we will remove support for runc, nor do we have any plans in the foreseeable future to do so  

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.

In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment, or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.

The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.

On-cluster, automated RHCOS Layering builds are important for multiple reasons:

  • One-click/one-command upgrades of OCP are very popular. Many customers may want to make one or just a few customizations but also want to keep that simplified upgrade experience. 
  • Customers who only need to customize RHCOS temporarily (hotfix, driver test package, etc) will find off-cluster builds to be too much friction for one driver.
  • One of OCP's virtues is that the platform and OS are developed, tested, and versioned together. Off-cluster building breaks that connection and leaves it up to the user to keep the OS up-to-date with the platform containers. We must make it easy for customers to add what they need and keep the OS image matched to the platform containers.

Goals & Requirements

  • The goal of this feature is primarily to bring the 4.14 progress (OCPSTRAT-35) to a Tech Preview or GA level of support.
  • Customers should be able to specify a Containerfile with their customizations and "forget it" as long as the automated builds succeed. If they fail, the admin should be alerted and pointed to the logs from the failed build.
    • The admin should then be able to correct the build and resume the upgrade.
  • Intersect with the Custom Boot Images such that a required custom software component can be present on every boot of every node throughout the installation process including the bootstrap node sequence (example: out-of-box storage driver needed for root disk).
  • Users can return a pool to an unmodified image easily.
  • RHEL entitlements should be wired in or at least simple to set up (once).
  • Parity with current features – including the current drain/reboot suppression list, CoreOS Extensions, and config drift monitoring.

This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.

 

As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
 
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.

 
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.

 

To test:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.

As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.

As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up 
(MCO-770, MCO-578, MCO-574 )

As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.

Maybe:

Entitlements: MCO-1097, MCO-1099

Not Likely:

As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.

With OCL GA'ing soon, we'll need a blocking path within our e2e test suite that must pass before a PR can be merged. Since e2e-gcp-op-techpreview is a non-blocking job, we should do both of the following:

  1. Migrate the tests from e2e-gcp-op-techpreview into e2e-gcp-op. This can be done by moving the tests in the MCO repo from the test/e2e-techpreview folder to the test/e2e folder. There might be some minor cleanups such as fixing duplicate function names, etc. but it should be fairly straightforward to do.
  2. Make e2e-gcp-op-techpreview a blocking job. A PR to the openshift/release repo to set optional: false for that job in both the 4.18 and 4.19 configs will be needed. This should be a pretty straightforward config change.

Feature Overview

As a cluster admin for standalone OpenShift, I want to customize the prefix of the machine names created by CPMS due to company policies related to nomenclature. Implement the Control Plane Machine Set (CPMS) feature in OpenShift to support machine names where user can set custom names prefixes. Note the prefix will always be suffixed by "<5-chars>-<index>" as this is part of the CPMS internal design.

Acceptance Criteria

A new field called machineNamePrefix has been added to the CPMS CR.
This field would allow the customer to specify a custom prefix for the machine names.
The machine names would then be generated using the format: <machineNamePrefix>-<5-chars>-<index> (see the sketch after this block)
Where:
<machineNamePrefix> is the custom prefix provided by the customer
<5-chars> is a random 5 character string (this is required and cannot be changed)
<index> represents the index of the machine (0, 1, 2, etc.)
Ensure that if the machineNamePrefix is changed, the operator reconciles and succeeds in rolling out the changes.
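
A small sketch of the naming rule described above (not the CPMS implementation itself); utilrand is the apimachinery random-string helper, and the example output is illustrative:

    package cpms

    import (
        "fmt"

        utilrand "k8s.io/apimachinery/pkg/util/rand"
    )

    // machineName appends a random five-character string and the machine
    // index to the user-supplied prefix, matching the documented format.
    func machineName(prefix string, index int) string {
        return fmt.Sprintf("%s-%s-%d", prefix, utilrand.String(5), index)
    }

    // Example: machineName("prod-ctrl", 0) might return "prod-ctrl-x7k2m-0".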

Epic Goal

  • Provide a new field to the CPMS that allows defining a Machine name prefix
  • This prefix will supersede the current usage of the control plane label and role combination we use today
  • The names must still continue to be suffixed with <chars>-<idx> as this is important to the operation of CPMS

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Downstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

Edge customers requiring computing on-site to serve business applications (e.g., point of sale, security & control applications, AI inference) are asking for a 2-node HA solution for their environments. They want only two nodes at the edge because a third node adds too much cost, but they still need HA for critical workloads. To address this need, a 2+1 topology is introduced. It supports a small, cheap arbiter node that can optionally be remote/virtual to reduce on-site HW cost. 

Goals (aka. expected user outcomes)

Support OpenShift on a 2+1 topology, meaning two primary nodes with large capacity to run workloads and the control plane, and a third, small “arbiter” node which ensures quorum. See requirements for more details.

Requirements (aka. Acceptance Criteria):

  1. Co-located arbiter node - third node in the same network/location with low-latency network access, but the arbiter node is much smaller than the two main nodes. Target resource requirements for the arbiter node: 4 cores / 8 vCPUs, 16 GB RAM, 120 GB disk (non-spinning), 1x1 GbE network port, no BMC
  2. OCP Virt fully functional, incl. live migration of VMs (assuming an RWX CSI driver is available)
  3. Single Node outage is handled seamlessly
  4. In case the arbiter node is down, a reboot/restart of the two remaining nodes has to work, i.e. the two remaining nodes regain quorum and spin up the workload. 
  5. Scaling out the cluster by adding additional worker nodes should be possible
  6. Transitioning the cluster into a regular 3-node compact cluster, e.g. by adding a new node as a control plane node and then removing the arbiter node, should be possible
  7. Regular workloads should not be scheduled to the arbiter node (e.g. by making it unschedulable, or by introducing a new node role “arbiter”). Only essential control plane workload (etcd components) should run on the arbiter node. Non-essential control plane workload (i.e. router, registry, console, monitoring, etc.) should also not be scheduled to the arbiter node.
  8. It must be possible to explicitly schedule additional workload to the arbiter node. That is important for third-party solutions (e.g. storage providers) which also have quorum-based mechanisms.
  9. Must seamlessly integrate into existing installation/update mechanisms, esp. zero-touch provisioning etc.
  10. Added: ability to track OLA usage in the fleet of connected clusters via OCP telemetry data

 

 

Deployment considerations (N/A = not applicable)
Self-managed, managed, or both: self-managed
Classic (standalone cluster): yes
Hosted control planes: no
Multi node, Compact (three node), or Single node (SNO), or all: Multi node and Compact (three node)
Connected / Restricted Network: both
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): x86_64 and ARM
Operator compatibility: full
Backport needed (list applicable versions): no
UI need (e.g. OpenShift Console, dynamic plugin, OCM): no
Other (please specify): n/a

 

Questions to Answer (Optional):

  1. How to implement the scheduling restrictions to the arbiter node? New node role “arbiter”?
  2. Can this be delivered in one release, or do we need to split, e.g. TechPreview + GA?

Out of Scope

  1. Storage driver providing RWX shared storage

 

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

  • Two node support is in high demand by telco, industrial and retail customers.
  • VMWare supports a two node VSan solution: https://core.vmware.com/resource/vsan-2-node-cluster-guide
  • Example edge hardware frequently used for edge deployments with a co-located small arbiter node: Dell PowerEdge XR4000z Server is an edge computing device that allows restaurants, retailers, and other small to medium businesses to set up local computing for data-intensive workloads. 

 

Customer Considerations

See requirements - there are two main groups of customers: co-located arbiter node, and remote arbiter node.

 

Documentation Considerations

  1. Topology needs to be documented, esp. the requirements of the arbiter node.

 

Interoperability Considerations

  1. OCP Virt needs to be explicitly tested on this scenario to support VM HA (live migration, restart on other node)

 

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Once the HighlyAvailableArbiter has been added to the ocp/api, we need to update the cluster-config-operator dependencies to reference the new change, so that it propagates to cluster installs in our payloads.

Update the dependencies for CEO for library-go and ocp/api to support the Arbiter additions, doing this in a separate PR to keep things clean and easier to test.

We need to update CEO (cluster etcd operator) to understand what an arbiter/witness node is so it can properly assign an etcd member on our less powerful node.
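
As a hedged sketch only: one way the operator could distinguish the arbiter node is by a node-role label; the exact label name used by the product is an assumption here.

    package ceo

    import corev1 "k8s.io/api/core/v1"

    // arbiterLabel is the node-role label assumed here to mark the arbiter
    // node; the label actually used by the product may differ.
    const arbiterLabel = "node-role.kubernetes.io/arbiter"

    // isArbiterNode lets the operator treat the smaller arbiter node
    // differently, e.g. when deciding where its etcd member runs.
    func isArbiterNode(node *corev1.Node) bool {
        _, ok := node.Labels[arbiterLabel]
        return ok
    }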

Feature Overview

Ability to install OpenShift on Nutanix with nodes having multiple NICs (multiple subnets) from IPI and for autoscaling with MachineSets.

 

Feature Overview

Ability to install OpenShift on Nutanix with nodes having multiple NICs (multiple subnets) from IPI and for autoscaling with MachineSets.

Feature Overview

Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.

When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.

There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts but the administrators want to divide the ESXi hosts in host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across to physical sites, and grouped by site in vSphere host groups.

In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.

Requirements

  • Users can define OpenShift zones mapping them to host groups at installation time (day 1)
  • Users can use host groups as OpenShift zones post-installation (day 2)

Epic Goal

Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.

When defining zones for vSphere administrators can map regions to vSphere datacenters and zones to vSphere clusters.

There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites, and grouped by site in vSphere host groups.

In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.

Requirements

  • Users can define OpenShift zones mapping them to host groups at installation time (day 1)
  • Users can use host groups as OpenShift zones post-installation (day 2)

As an OpenShift engineer, I want to enable host VM group zonal support in MAO (machine-api-operator) so that compute nodes are properly deployed

Acceptance Criteria:

  • Modify the workspace to include the vmgroup
  • Properly configure the vSphere cluster to add the VM into the vmgroup

Description of problem:

When we set multiple networks on LRP:

port rtoe-GR_227br_tenant.red_ovn-control-plane
        mac: "02:42:ac:12:00:07"
        ipv6-lla: "fe80::42:acff:fe12:7"
        networks: ["169.254.0.15/17", "172.18.0.7/16", "fc00:f853:ccd:e793::7/64", "fd69::f/112"]

and also use lb_force_snat_ip=routerip, it picks the lexicographically first item from the set of networks - there is no deliberate ordering for this

This breaks Services implementation on L2 UDNs
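
A small standalone reproduction of the ordering problem: sorting the networks from the LRP above and taking the first entry returns whichever string sorts lowest, not a deliberately chosen SNAT IP.

    package main

    import (
        "fmt"
        "sort"
    )

    func main() {
        // Networks from the LRP above; OVN stores them as a set, so any
        // consumer that sorts and takes the first entry gets an arbitrary
        // address rather than the intended one.
        networks := []string{
            "169.254.0.15/17",
            "172.18.0.7/16",
            "fc00:f853:ccd:e793::7/64",
            "fd69::f/112",
        }
        sort.Strings(networks)
        fmt.Println("picked:", networks[0]) // picked: 169.254.0.15/17
    }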

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1.

2.

3.

Actual results:

Expected results:

Additional info:

Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

Affected Platforms:

Is it an

  1. internal CI failure
  2. customer issue / SD
  3. internal RedHat testing failure

If it is an internal RedHat testing failure:

  • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

If it is a CI failure:

  • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
  • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
  • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
  • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
  • If it's a connectivity issue,
  • What is the srcNode, srcIP and srcNamespace and srcPodName?
  • What is the dstNode, dstIP and dstNamespace and dstPodName?
  • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

If it is a customer / SD issue:

  • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
  • Don’t presume that Engineering has access to Salesforce.
  • Do presume that Engineering will access attachments through supportshell.
  • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
  • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
    • If the issue is in a customer namespace then provide a namespace inspect.
    • If it is a connectivity issue:
      • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
      • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
      • Please provide the UTC timestamp networking outage window from must-gather
      • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
    • If it is not a connectivity issue:
      • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
  • When showing the results from commands, include the entire command in the output.  
  • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
  • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
  • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
  • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
  • For guidance on using this template please see
    OCPBUGS Template Training for Networking  components

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Epic Goal

  • Based on user analytics, many customers switch back and forth between perspectives, averaging 15 times per session. 
  • The following steps will be needed:
    • Surface all Dev specific Nav items in the Admin Console
    • Disable the Dev perspective by default but allow admins to enable via console setting

Why is this important?

  • We need to alleviate this pain point and improve the overall user experience for our users.

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

K8s 1.31 introduces VolumeAttributesClass as beta (code in external provisioner). We should make it available to customers as tech preview.

VolumeAttributesClass allows PVCs to be modified after their creation and while attached. There is a vast number of parameters that can be updated, but the most popular use is changing the QoS values. The parameters that can be changed depend on the driver used.
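
As a hedged sketch of the workflow (assuming the Kubernetes 1.31 client-go beta API; the class name and parameter keys are driver-specific and illustrative): create a VolumeAttributesClass and re-point an existing PVC at it.

    package vac

    import (
        "context"
        "encoding/json"
        "fmt"

        storagev1beta1 "k8s.io/api/storage/v1beta1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    // updatePVCQoS creates a VolumeAttributesClass and points an existing PVC
    // at it so the driver can apply the new attributes to the attached volume.
    func updatePVCQoS(ctx context.Context, cs kubernetes.Interface, namespace, pvcName string) error {
        class := &storagev1beta1.VolumeAttributesClass{
            ObjectMeta: metav1.ObjectMeta{Name: "fast-iops"},
            DriverName: "ebs.csi.aws.com",
            Parameters: map[string]string{"iops": "6000", "throughput": "300"}, // driver-specific keys
        }
        if _, err := cs.StorageV1beta1().VolumeAttributesClasses().Create(ctx, class, metav1.CreateOptions{}); err != nil {
            return err
        }
        // Re-point the PVC at the new class; the external-resizer then asks
        // the driver to modify the volume in place.
        patch, err := json.Marshal(map[string]interface{}{
            "spec": map[string]interface{}{"volumeAttributesClassName": "fast-iops"},
        })
        if err != nil {
            return err
        }
        if _, err := cs.CoreV1().PersistentVolumeClaims(namespace).Patch(ctx, pvcName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
            return fmt.Errorf("patching PVC %s: %w", pvcName, err)
        }
        return nil
    }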

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

Productise VolumeAttributesClass as TP in anticipation of GA. Customers can start testing VolumeAttributesClass.

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Disabled by default
  • put it under TechPreviewNoUpgrade
  • make sure the VolumeAttributesClass object is available in beta APIs
  • enable the feature in external-provisioner and external-resizer at least in AWS EBS CSI driver, check the other drivers.
    • Add RBAC rules for these objects
  • make sure we run its tests in one of TechPreviewNoUpgrade CI jobs (with hostpath CSI driver)
  • reuse / add a job with AWS EBS CSI driver + tech preview.

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations (N/A = not applicable)
Self-managed, managed, or both: both
Classic (standalone cluster): yes
Hosted control planes: yes
Multi node, Compact (three node), or Single node (SNO), or all: all
Connected / Restricted Network: both
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x): all
Operator compatibility: N/A (core storage)
Backport needed (list applicable versions): none
UI need (e.g. OpenShift Console, dynamic plugin, OCM): TBD for TP
Other (please specify): n/a

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

As an OCP user, I want to change parameters of my existing PVC such as the QoS attributes.

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

UI for TP

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

There have been some limitations and complaints about the fact that PVC attributes are sealed after their creation, preventing customers from updating them. This is particularly impactful when a specific QoS is set and the volume requirements change.

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

Customers should not use it in production at the moment.

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

Document VolumeAttributesClass creation and how to update a PVC. Mention any limitations. Mention it's Tech Preview (no upgrade). Add driver support information if needed.

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Check which drivers support it for which parameters.

Epic Goal

Support upstream feature "VolumeAttributesClass" in OCP as Beta, i.e. test it and have docs for it.

Why is this important?

  • We get this upstream feature through Kubernetes rebase. We should ensure it works well in OCP and we have docs for it.

Upstream links

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview (aka. Goal Summary)  

A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:

  •  Have a hotfix process that is customer/support-exception targeted rather than fleet targeted
  • Can take weeks to be available for Managed OpenShift

This feature seeks to provide mechanisms that put an upper time boundary on delivering such fixes, matching the current HyperShift Operator <24h expectation.

Goals (aka. expected user outcomes)

  • Hosted Control Plane fixes are delivered through Konflux builds
  • No additional upgrade edges
  • Release specific
  • Adequate, fleet representative, automated testing coverage
  • Reduced human interaction

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

  • Overriding Hosted Control Plane components can be done automatically once the PRs are ready and the affected versions have been properly identified
  • Managed OpenShift Hosted Clusters have their Control Planes fix applied without requiring customer intervention and without workload disruption beyond what might already be incurred because of the incident it is solving
  • Fix can be promoted through integration, stage and production canary with a good degree of observability

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both managed (ROSA and ARO)
Classic (standalone cluster) No
Hosted control planes Yes
Multi node, Compact (three node), or Single node (SNO), or all All supported ROSA/HCP topologies
Connected / Restricted Network All supported ROSA/HCP topologies
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) All supported ROSA/HCP topologies
Operator compatibility CPO and Operators depending on it
Backport needed (list applicable versions) TBD
UI need (e.g. OpenShift Console, dynamic plugin, OCM) No
Other (please specify) No

Use Cases (Optional):

  • Incident response when the engineering solution is partially or completely on the Hosted Control Plane side rather than in the HyperShift Operator

Out of Scope

  • HyperShift Operator binary bundling

Background

Discussed previously during incident calls. Design discussion document

Customer Considerations

  • Because the Managed Control Plane version is overridden rather than changed, customer visibility and impact should be limited as much as possible.

Documentation Considerations

SOP needs to be defined for:

  • Requesting and approving the fleet wide fixes described above
  • Building and delivering them
  • Identifying clusters with deployed fleet wide fixes

Goal

  • Have a Konflux build for every supported branch on every pull request / merge that modifies the Control Plane Operator

Why is this important?

  • In order to build the Control Plane Operator images to be used for management-cluster-wide overrides.
  • To be able to deliver managed Hosted Control Plane fixes to managed OpenShift with a similar SLO as the fixes for the HyperShift Operator.

Scenarios

  1. A PR that modifies the control plane in a supported branch is posted for a fix affecting managed OpenShift

Acceptance Criteria

  • Dev - Konflux application and component per supported release
  • Dev - SOPs for managing/troubleshooting the Konflux Application
  • Dev - Release Plan that delivers to the appropriate AppSre production registry
  • QE - HyperShift Operator versions that encode an override must be tested with the CPO Konflux builds that they make

Dependencies (internal and external)

  1. Konflux

Previous Work (Optional):

  1. HOSTEDCP-2027

Open questions:

  1. Antoni Segura Puimedon  How long or how many times should the CPO override be tested?

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • DEV - Konflux App link: <link to Konflux App for CPO>
  • DEV - SOP: <link to meaningful PR or GitHub Issue>
  • QE - Test plan in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Feature Overview

The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments. 

Goals

The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but has also been useful for a specific set of other customer use cases outside of that context.  As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most-demanding customer deployments.  

Key enhancements include observability, and blocking traffic across paths where IPsec encryption is not functioning properly.

Requirements

 

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

Questions to answer…

  •  

Out of Scope

  • Configuration of external-to-cluster IPsec endpoints for N-S IPsec. 

Background, and strategic fit

The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default.  This encryption must scale to the largest of deployments. 

Assumptions

  •  

Customer Considerations

  • Customers require the option to use their own certificates or CA for IPsec.
  • Customers require observability of the configuration (e.g. is the IPsec tunnel up and passing traffic); see the sketch after this list.
  • If the IPsec tunnel is not up or otherwise functioning, traffic across the intended-to-be-encrypted network path should be blocked.
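
As an illustration of the observability ask, today this information has to be pulled manually from each node; a sketch using the same libreswan command that appears in the bug report further down this page:

$ oc debug node/worker-0 -- chroot /host ipsec whack --trafficstatus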

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?
The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

While running IPsec e2e tests in CI, the data plane traffic is not flowing with the desired traffic type (ESP or UDP). For example, in IPsec mode External, the traffic type is seen as ESP for EW traffic, but it is supposed to be Geneve (UDP) traffic.

Example CI run: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/50687/rehearse-50687-pull-ci-openshift-cluster-network-operator-master-e2e-aws-ovn-ipsec-serial/1789527351734833152

This issue was reproducible on a local cluster after many attempts, and we noticed that IPsec states are not cleaned up on the node, which is residue from a previous test run with IPsec Full mode.
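
For reference, a couple of commands that can be run on the node to verify the EW traffic type and to flush residual IPsec state by hand (the interface name and peer IP are taken from the capture below; flushing is only a manual clean-up sketch, not the actual fix):

# With ipsecConfig mode External, EW traffic should be plain Geneve (UDP/6081):
tcpdump -i enp2s0 -c 5 udp port 6081 and dst 192.168.111.24

# Remove leftover SAs and policies from a previous Full-mode run:
ip xfrm state flush
ip xfrm policy flush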
 
[peri@sdn-09 origin]$ kubectl get networks.operator.openshift.io cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2024-05-13T18:55:57Z"
  generation: 1362
  name: cluster
  resourceVersion: "593827"
  uid: 10f804c9-da46-41ee-91d5-37aff920bee4
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipv4: {}
        ipv6: {}
        routingViaHost: false
      genevePort: 6081
      ipsecConfig:
        mode: External
      mtu: 1400
      policyAuditConfig:
        destination: "null"
        maxFileSize: 50
        maxLogFiles: 5
        rateLimit: 20
        syslogFacility: local0
    type: OVNKubernetes
  deployKubeProxy: false
  disableMultiNetwork: false
  disableNetworkDiagnostics: false
  logLevel: Normal
  managementState: Managed
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 172.30.0.0/16
  unsupportedConfigOverrides: null
  useMultiNetworkPolicy: false
status:
  conditions:
  - lastTransitionTime: "2024-05-13T18:55:57Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2024-05-14T10:13:12Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-05-13T18:55:57Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2024-05-14T11:50:26Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2024-05-13T18:57:13Z"
    status: "True"
    type: Available
  readyReplicas: 0
  version: 4.16.0-0.nightly-2024-05-08-222442
[peri@sdn-09 origin]$ oc debug node/worker-0
Starting pod/worker-0-debug-k6nlm ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.23
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# toolbox
Checking if there is a newer version of registry.redhat.io/rhel9/support-tools available...
Container 'toolbox-root' already exists. Trying to start...
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-root')
toolbox-root
Container started successfully. To exit, type 'exit'.
[root@worker-0 /]# tcpdump -i enp2s0 -c 1 -v --direction=out esp and src 192.168.111.23 and dst 192.168.111.24
dropped privs to tcpdump
tcpdump: listening on enp2s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:07:01.854214 IP (tos 0x0, ttl 64, id 20451, offset 0, flags [DF], proto ESP (50), length 152)
    worker-0 > worker-1: ESP(spi=0x52cc9c8d,seq=0xe1c5c), length 132
1 packet captured
6 packets received by filter
0 packets dropped by kernel
[root@worker-0 /]# exit
exit
 
sh-5.1# ipsec whack --trafficstatus
006 #20: "ovn-1184d9-0-in-1", type=ESP, add_time=1715687134, inBytes=206148172, outBytes=0, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #19: "ovn-1184d9-0-out-1", type=ESP, add_time=1715687112, inBytes=0, outBytes=40269835, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #27: "ovn-185198-0-in-1", type=ESP, add_time=1715687419, inBytes=71406656, outBytes=0, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #26: "ovn-185198-0-out-1", type=ESP, add_time=1715687401, inBytes=0, outBytes=17201159, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #14: "ovn-922aca-0-in-1", type=ESP, add_time=1715687004, inBytes=116384250, outBytes=0, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #13: "ovn-922aca-0-out-1", type=ESP, add_time=1715686986, inBytes=0, outBytes=986900228, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #6: "ovn-f72f26-0-in-1", type=ESP, add_time=1715686855, inBytes=115781441, outBytes=98, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
006 #5: "ovn-f72f26-0-out-1", type=ESP, add_time=1715686833, inBytes=9320, outBytes=29002449, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
sh-5.1# ip xfrm state; echo ' '; ip xfrm policy
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0x7f7ddcf5 reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6158d9a0f4a28598500e15f81a40ef715502b37ecf979feb11bbc488479c8804598011ee 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x18564, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0xda57e42e reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x810bebecef77951ae8bb9a46cf53a348a24266df8b57bf2c88d4f23244eb3875e88cc796 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081 
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0xf84f2fcf reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x0f242efb072699a0f061d4c941d1bb9d4eb7357b136db85a0165c3b3979e27b00ff20ac7 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0x9523c6ca reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xe075d39b6e53c033f5225f8be48efe537c3ba605cee2f5f5f3bb1cf16b6c53182ecf35f7 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x10fb2
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081 
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0x459d8516 reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xee778e6db2ce83fa24da3b18e028451bbfcf4259513bca21db832c3023e238a6b55fdacc 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x3ec45, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x3142f53a reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6238fea6dffdd36cbb909f6aab48425ba6e38f9d32edfa0c1e0fc6af8d4e3a5c11b5dfd1 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081 
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0xeda1ccb9 reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xef84a90993bd71df9c97db940803ad31c6f7d2e72a367a1ec55b4798879818a6341c38b6 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x02c3c0dd reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x858ab7326e54b6d888825118724de5f0c0ad772be2b39133c272920c2cceb2f716d02754 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x26f8e
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081 
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0xc9535b47 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xd7a83ff4bd6e7704562c597810d509c3cdd4e208daabf2ec074d109748fd1647ab2eff9d 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x53d4c, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0xb66203c8 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xc207001a7f1ed7f114b3e327308ddbddc36de5272a11fe0661d03eaecc84b6761c7ec9c4 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081 
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0x2e4d4deb reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x91e399d83aa1c2626424b502d4b8dae07d4a170f7ef39f8d1baca8e92b8a1dee210e2502 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0x52cc9c8d reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xb605451f32f5dd7a113cae16e6f1509270c286d67265da2ad14634abccf6c90f907e5c00 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0xe2735
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081 
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0x973119c3 reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x87d13e67b948454671fb8463ec0cd4d9c38e5e2dd7f97cbb8f88b50d4965fb1f21b36199 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x2af9a, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff 
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081 
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0x4c3580ff reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x2c09750f51e86d60647a60e15606f8b312036639f8de2d7e49e733cda105b920baade029 128
lastused 2024-05-14 14:36:43
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081 
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0xa3e469dc reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x1d5c5c232e6fd4b72f3dad68e8a4d523cbd297f463c53602fad429d12c0211d97ae26f47 128
lastused 2024-05-14 14:18:42
anti-replay esn context:
seq-hi 0x0, seq 0xb, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 000007ff 
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081 
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0xdee8476f reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x5895025ce5b192a7854091841c73c8e29e7e302f61becfa3feb44d071ac5c64ce54f5083 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1f1a3
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000 
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081 
 
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081 
dir out priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081 
dir in priority 1360065 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src ::/0 dst ::/0 
socket out priority 0 ptype main 
src ::/0 dst ::/0 
socket in priority 0 ptype main 
src ::/0 dst ::/0 
socket out priority 0 ptype main 
src ::/0 dst ::/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket out priority 0 ptype main 
src 0.0.0.0/0 dst 0.0.0.0/0 
socket in priority 0 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir out priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir fwd priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 135 
dir in priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir out priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir fwd priority 1 ptype main 
src ::/0 dst ::/0 proto ipv6-icmp type 136 
dir in priority 1 ptype main 
sh-5.1# cat /etc/ipsec.conf 
# /etc/ipsec.conf - Libreswan 4.0 configuration file
#
# see 'man ipsec.conf' and 'man pluto' for more information
#
# For example configurations and documentation, see https://libreswan.org/wiki/
 
config setup
# If logfile= is unset, syslog is used to send log messages too.
# Note that on busy VPN servers, the amount of logging can trigger
# syslogd (or journald) to rate limit messages.
#logfile=/var/log/pluto.log
#
# Debugging should only be used to find bugs, not configuration issues!
# "base" regular debug, "tmi" is excessive and "private" will log
# sensitive key material (not available in FIPS mode). The "cpu-usage"
# value logs timing information and should not be used with other
# debug options as it will defeat getting accurate timing information.
# Default is "none"
# plutodebug="base"
# plutodebug="tmi"
#plutodebug="none"
#
# Some machines use a DNS resolver on localhost with broken DNSSEC
# support. This can be tested using the command:
# dig +dnssec DNSnameOfRemoteServer
# If that fails but omitting '+dnssec' works, the system's resolver is
# broken and you might need to disable DNSSEC.
# dnssec-enable=no
#
# To enable IKE and IPsec over TCP for VPN server. Requires at least
# Linux 5.7 kernel or a kernel with TCP backport (like RHEL8 4.18.0-291)
# listen-tcp=yes
# To enable IKE and IPsec over TCP for VPN client, also specify
# tcp-remote-port=4500 in the client's conn section.
 
# if it exists, include system wide crypto-policy defaults
include /etc/crypto-policies/back-ends/libreswan.config
 
# It is best to add your IPsec connections as separate files
# in /etc/ipsec.d/
include /etc/ipsec.d/*.conf
sh-5.1# cat /etc/ipsec.d/openshift.conf 
# Generated by ovs-monitor-ipsec...do not modify by hand!
 
 
config setup
    uniqueids=yes
 
conn %default
    keyingtries=%forever
    type=transport
    auto=route
    ike=aes_gcm256-sha2_256
    esp=aes_gcm256
    ikev2=insist
 
conn ovn-f72f26-0-in-1
    left=192.168.111.23
    right=192.168.111.22
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-f72f26-0-out-1
    left=192.168.111.23
    right=192.168.111.22
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-1184d9-0-in-1
    left=192.168.111.23
    right=192.168.111.20
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@1184d960-3211-45c4-a482-d7b6fe995446
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-1184d9-0-out-1
    left=192.168.111.23
    right=192.168.111.20
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@1184d960-3211-45c4-a482-d7b6fe995446
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-922aca-0-in-1
    left=192.168.111.23
    right=192.168.111.24
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-922aca-0-out-1
    left=192.168.111.23
    right=192.168.111.24
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
conn ovn-185198-0-in-1
    left=192.168.111.23
    right=192.168.111.21
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp/6081
    rightprotoport=udp
 
conn ovn-185198-0-out-1
    left=192.168.111.23
    right=192.168.111.21
    leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
    rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
    leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
    leftrsasigkey=%cert
    rightca=%same
    leftprotoport=udp
    rightprotoport=udp/6081
 
sh-5.1# 

Feature Overview (aka. Goal Summary)  

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

<your text here>

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

link back to OCPSTRAT-1644 somehow

 

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

 
Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?  Does it improve security, performance, supportability, etc.?  Why is this work a priority?

 
Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Epic Goal*

Drive the technical part of the Kubernetes 1.32 upgrade, including rebasing the openshift/kubernetes repository and coordinating across the OpenShift organization to get e2e tests green for the OCP release.

 
Why is this important? (mandatory)

OpenShift 4.19 cannot be released without Kubernetes 1.32

 
Scenarios (mandatory) 

  1.  

 
Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

 

Slack Discussion Channel - https://redhat.enterprise.slack.com/archives/C07V32J0YKF

As a customer of self-managed OpenShift, or as an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which can be used by cluster admins to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make updates problematic.

Feature Overview (aka. Goal Summary)  

Here are common update improvements from customer interactions on Update experience

  1. Show nodes where pod draining is taking more time.
    Customers often have to dig deeper to find these nodes for further debugging.
    The ask has been to bubble this up in the update progress window.
  2. oc update status?
    From the UI we can see the progress of the update. From the oc CLI we can see this with "oc get clusterversion",
     but the ask is to show more details in a human-readable format.

    Know where the update has stopped. Consider adding at what run level it has stopped.
     
    oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    
    version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
    

     

Documentation Considerations

Update docs for UX and CLI changes

Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22

Epic Goal*

Add a new `oc adm upgrade status` command which is backed by an API.  Please find the mock output of the command attached in this card.
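
While the command is still gated behind an environment variable, it can be invoked as shown in the sample output further down this page:

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status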

Why is this important? (mandatory)

  • From the UI we can see the progress of the update. Using the oc CLI we can see some of the information with "oc get clusterversion", but the output is not easily readable and there is a lot of extra information to process.
  • Customers are asking us to show more details in a human-readable format, as well as to provide an API which they can use for automation.

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

After OTA-960 is fixed, ClusterVersion/version and oc adm upgrade can be used to monitor the process of migrating a cluster to multi-arch.

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.

But oc adm upgrade status reports COMPLETION 100% while the migration/upgrade is still ongoing.

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Completed
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      100% (33 operators updated, 0 updating, 0 waiting)
Duration:        15m
Operator Status: 33 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-95-224.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-33-81.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
ip-10-0-45-170.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100%         3 Total, 2 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-72-40.us-east-2.compute.internal    Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-22-179.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Update Health =
SINCE   LEVEL     IMPACT         MESSAGE
-       Warning   Update Speed   Node ip-10-0-95-224.us-east-2.compute.internal is unavailable
-       Warning   Update Speed   Node ip-10-0-72-40.us-east-2.compute.internal is unavailable

Run with --details=health for additional description and links to related online documentation

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        True          14m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.18.0-ec.3   True        True          False      63m     Working towards 4.18.0-ec.3

The reason is that PROGRESSING=True is not detected for co/machine-config, because the status command checks only ClusterOperator.Status.Versions[name=="operator"]; it needs to check ClusterOperator.Status.Versions[name=="operator-image"] as well.
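
For example, both version entries can be inspected directly (a sketch of the additional check the status command needs):

$ oc get clusteroperator machine-config -o jsonpath='{.status.versions}{"\n"}'
# during the migration this is expected to contain both a name=="operator" entry
# (the version string) and a name=="operator-image" entry (the MCO image pull spec)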

 

For grooming:

It will be challenging for the status command to check the operator image's pull spec because it does not know the expected value. CVO knows it because CVO holds the manifests (containing the expected value) from the multi-arch payload.

One "hacky" workaround is that the status command gets the pull spec from the MCO deployment:

oc get deployment -n openshift-machine-config-operator machine-config-operator -o json | jq -r '.spec.template.spec.containers[]|select(.name=="machine-config-operator")|.image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5 

Note this co/machine-config -> deployment/machine-config-operator trick may not be feasible if we want to extend it to all cluster operators. But it should work as a hacky workaround to check only MCO.

We may claim that the status command is not designed for monitoring the multi-arch migration and suggest using oc adm upgrade instead. In that case, we can close this card as Obsolete/Won'tDo.

 

manifests.zip has the mockData/manifests for the status cmd that were taken during the migration.

 

oc#1920 started the work for the status command to recognize the migration, and we need to extend that work to cover the following comments from Petr's review:

  • "Target Version: 4.18.0-ec.3 (from 4.18.0-ec.3)": confusing. We should tell "multi-arch" migration somehow. Or even better: from the current arch to multi-arch, for example "Target Version: 4.18.0-ec.3 multi (from x86_64)" if we could get the origin arch from CV or somewhere else.
    • We have spec.desiredUpdate.architecture since forever, and can use that being Multi as a partial hint.  MULTIARCH-4559 is adding tech-preview status properties around architecture in 4.18, but tech-preview, so may not be worth bothering with in oc code.  Two history entries with the same version string but different digests is probably a reliable-enough heuristic, coupled with the spec-side hint.
  • "Duration: 6m55s (Est. Time Remaining: 1h4m)": We will see if we could find a simple way to hand this special case. I do not understand "the 97% completion will be reached so fast." as I am not familiar with the algorithm. But it seems acceptable to Petr that we show N/A for the migration.
    • I think I get "the 97% completion will be reached so fast." now as only MCO has the operator-image pull spec. Other COs claim the completeness immaturely. With that said, "N/A" sounds like the most possible way for now.
  • Node status like "All control plane nodes successfully updated to 4.18.0-ec.3" for control planes and "ip-10-0-17-117.us-east-2.compute.internal Completed". It is technically hard to detect the transaction during migration as MCO annotates only the version. This may become a separate card if it is too big to finish with the current one.
  • "targetImagePullSpec := getMCOImagePullSpec(mcoDeployment)" should be computed just once. Now it is in the each iteration of the for loop. We should also comment about why we do it with this hacky way.

Feature Overview (aka. Goal Summary)  

We need to maintain our dependencies across all the libraries we use in order to stay in compliance. 

Goals (aka. expected user outcomes)

The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

<your text here>

Requirements (aka. Acceptance Criteria):

A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

<enter general Feature acceptance here>

 

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both  
Classic (standalone cluster)  
Hosted control planes  
Multi node, Compact (three node), or Single node (SNO), or all  
Connected / Restricted Network  
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
Operator compatibility  
Backport needed (list applicable versions)  
UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
Other (please specify)  

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

<your text here>

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

<your text here>

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

<your text here>

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but that tend to fall by the wayside.

Currently the console is using TypeScript 4, which is preventing us from upgrading to NodeJS 22. Due to that, we need to update to TypeScript 5 (not necessarily the latest version).

 

AC:

  • Update TypeScript to version 5
  • Update ES build target to ES-2021

 

Note: In case of higher complexity we should be splitting the story into multiple stories, per console package.
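
A minimal sketch of the corresponding dependency and build-target bump (the version range and tsconfig snippet are illustrative; the exact minor version is to be determined per the AC above):

# bump the compiler
yarn add --dev typescript@^5

# and raise the emit target in tsconfig.json:
#   "compilerOptions": { "target": "ES2021", ... }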

As a developer I want to make sure we are running the latest version of webpack in order to take advantage of the latest benefits and also keep current so that future updates are as painless as possible.

We are currently on v4.47.0.

Changelog: https://webpack.js.org/blog/2020-10-10-webpack-5-release/

By updating to version 5 we will need to update the following packages as well:

  • html-webpack-plugin
  • webpack-bundle-analyzer
  • copy-webpack-plugin
  • fork-ts-checker-webpack-plugin

AC: Update webpack to version 5 and determine what should be the ideal minor version.
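
A sketch of the upgrade described in the AC above, assuming yarn and illustrative major versions for the plugins that have to move in lockstep with webpack 5:

yarn add --dev webpack@^5 \
  html-webpack-plugin@^5 \
  copy-webpack-plugin@^12 \
  fork-ts-checker-webpack-plugin@^9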

Feature Overview (aka. Goal Summary)  

The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.

BYO Identity will help facilitate CLI-only workflows and capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD), similar to upstream Kubernetes.

Goals (aka. expected user outcomes)

Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customers/users can then configure their IDPs to support the OIDC protocols and workflows they desire, such as the client credentials flow.

The OpenShift OAuth server is still available as the default option, with the ability to bring in the external OIDC provider as a Day-2 configuration.
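
As a sketch of what that Day-2 configuration could look like (the issuer URL, audience, and client ID are placeholders, and the field names follow the openshift/api Authentication type, so they may differ in the final API):

$ oc patch authentication.config/cluster --type=merge -p='
spec:
  type: OIDC
  oidcProviders:
  - name: my-oidc-provider
    issuer:
      issuerURL: https://idp.example.com
      audiences:
      - openshift-example-audience
    oidcClients:
    - componentName: console
      componentNamespace: openshift-console
      clientID: console-client-id
'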

Requirements (aka. Acceptance Criteria):

  1. The customer should be able to tie into RBAC functionality, similar to how it is closely aligned with OpenShift OAuth 
  2.  

Use Cases (Optional):

  1. As a customer, I would like to integrate my OIDC Identity Provider directly with the OpenShift API server.
  2. As a customer in a multi-cluster cloud environment, I have both K8s and non-K8s clusters using my IDP, and hence I need seamless authentication directly to the OpenShift/K8s API using my Identity Provider
  3.  

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

 

Background

Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

 

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

 

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  Initial completion during Refinement status.

 

Interoperability Considerations

Which other projects and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

Epic Goal

The ability to provide a direct authentication workflow such that OpenShift can consume bearer tokens issued by external OIDC identity providers, replacing the built-in OAuth stack by deactivating/removing its components as necessary.

 
Why is this important? (mandatory)

OpenShift has its own built-in OAuth server which can be used to obtain OAuth access tokens for authentication to the API. The server can be configured with an external identity provider (including support for OIDC); however, it is still the built-in server that issues tokens, and thus authentication is limited to the capabilities of the oauth-server.

 
Scenarios (mandatory) 

  • As a customer, I want to integrate my OIDC Identity Provider directly with OpenShift so that I can fully use its capabilities in machine-to-machine workflows.
  • As a customer in a hybrid cloud environment, I want to seamlessly use my OIDC Identity Provider across all of my fleet.

 
Dependencies (internal and external) (mandatory)

  • Support in the console/console-operator (already completed)
  • Support in the OpenShift CLI `oc` (already completed)

Contributing Teams(and contacts) (mandatory) 

  • Development - OCP Auth
  • Documentation - OCP Auth
  • QE - OCP Auth
  • PX - 
  • Others -

Acceptance Criteria (optional)

  • an external OIDC provider can be configured so that the kube-apiserver directly consumes the tokens it issues
  • built-in oauth stack no longer operational in the cluster; respective APIs, resources and components deactivated
  • changing back to the built-in oauth stack possible

Drawbacks or Risk (optional)

  • Enabling an external OIDC provider on an OCP cluster will result in the oauth-apiserver being removed from the system; this inherently means that the two API Services it is serving (v1.oauth.openshift.io, v1.user.openshift.io) will be gone from the cluster, and therefore any related data will be lost. It is the user's responsibility to create backups of any required data.
  • Configuring an external OIDC identity provider for authentication by definition means that any security updates or patches must be managed independently from the cluster itself, i.e. cluster updates will not resolve security issues relevant to the provider itself; the provider will have to be updated separately. Additionally, new functionality or features on the provider's side might need integration work in OpenShift (depending on their nature).

Done - Checklist (mandatory)

  • CI Testing - Basic e2e automation tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Engineering Stories Merged
  • All associated work items with the Epic are closed
  • Epic status should be “Release Pending” 

Description of problem:
This is a bug found during pre-merge testing of the 4.18 epic AUTH-528 PRs, filed for better tracking per the existing "OpenShift - Testing Before PR Merges - Left-Shift Testing" Google doc workflow.

co/console becomes degraded with AuthStatusHandlerDegraded after BYO external OIDC is configured on OCP and then removed (i.e. reverted back to the OAuth IDP).

Version-Release number of selected component (if applicable):

Cluster-bot build which is built at 2024-11-25 09:39 CST (UTC+800)
build 4.18,openshift/cluster-authentication-operator#713,openshift/cluster-authentication-operator#740,openshift/cluster-kube-apiserver-operator#1760,openshift/console-operator#940

How reproducible:

Always (tried twice, both hit it)

Steps to Reproduce:

1. Launch a TechPreviewNoUpgrade standalone OCP cluster with the above build. Configure an htpasswd IDP. Test users can log in successfully.

2. Configure BYO external OIDC in this OCP cluster using Microsoft Entra ID. KAS and console pods can roll out successfully. oc login and console login to Microsoft Entra ID can succeed.

3. Remove the BYO external OIDC configuration, i.e. go back to the original htpasswd OAuth IDP:
[xxia@2024-11-25 21:10:17 CST my]$ oc patch authentication.config/cluster --type=merge -p='
spec: 
  type: ""
  oidcProviders: null
'
authentication.config.openshift.io/cluster patched

[xxia@2024-11-25 21:15:24 CST my]$ oc get authentication.config  cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2024-11-25T04:11:59Z"
  generation: 5
  name: cluster
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: e814f1dc-0b51-4b87-8f04-6bd99594bf47
  resourceVersion: "284724"
  uid: 2de77b67-7de4-4883-8ceb-f1020b277210
spec:
  oauthMetadata:
    name: ""
  serviceAccountIssuer: ""
  type: ""
  webhookTokenAuthenticator:
    kubeConfig:
      name: webhook-authentication-integrated-oauth
status:
  integratedOAuthMetadata:
    name: oauth-openshift
  oidcClients:
  - componentName: cli
    componentNamespace: openshift-console
  - componentName: console
    componentNamespace: openshift-console
    conditions:
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Degraded
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "True"
      type: Available
    currentOIDCClients:
    - clientID: 95fbae1d-69a7-4206-86bd-00ea9e0bb778
      issuerURL: https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/v2.0
      oidcProviderName: microsoft-entra-id


KAS and console pods indeed can roll out successfully; and now oc login and console login indeed can succeed using the htpasswd user and password:
[xxia@2024-11-25 21:49:32 CST my]$ oc login -u testuser-1 -p xxxxxx
Login successful.
...

But co/console degraded, which is weird:
[xxia@2024-11-25 21:56:07 CST my]$ oc get co | grep -v 'True *False *False'
NAME                                       VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.18.0-0.test-2024-11-25-020414-ci-ln-71cvsj2-latest   True        False         True       9h      AuthStatusHandlerDegraded: Authentication.config.openshift.io "cluster" is invalid: [status.oidcClients[1].currentOIDCClients[0].issuerURL: Invalid value: "": oidcClients[1].currentOIDCClients[0].issuerURL in body should match '^https:\/\/[^\s]', status.oidcClients[1].currentOIDCClients[0].oidcProviderName: Invalid value: "": oidcClients[1].currentOIDCClients[0].oidcProviderName in body should be at least 1 chars long]

Actual results:

co/console degraded, as above.

Expected results:

co/console is normal.

Additional info:

    

Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.

With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.

With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn". 
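
For reference, the per-namespace opt-in mentioned above uses the standard Pod Security Admission labels on the Namespace object; a minimal sketch (the namespace name is illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: example-app                              # illustrative
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted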

Epic Goal

Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.

When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.

To protect platform workloads from such an effect (which, combined with PSA, might result in rejecting the workload once we start enforcing the "restricted" profile), we must pin the required SCC to all workloads in platform namespaces (openshift-*, kube-*, default).

Each workload should pin the least-privileged SCC that fits it, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should still pin an SCC for tracking purposes).
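
As an illustration of what "pinning" means in practice, a workload can request a specific SCC through the openshift.io/required-scc pod annotation; a minimal sketch (names and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-operator                         # illustrative
  namespace: openshift-example                   # illustrative platform namespace
spec:
  selector:
    matchLabels:
      app: example-operator
  template:
    metadata:
      labels:
        app: example-operator
      annotations:
        openshift.io/required-scc: restricted-v2 # pin the least-privileged SCC that fits the workload
    spec:
      containers:
      - name: operator
        image: example.com/operator:latest       # illustrative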

The following tables track progress.

Progress summary

# namespaces 4.19 4.18 4.17 4.16 4.15 4.14
monitored 82 82 82 82 82 82
fix needed 68 68 68 68 68 68
fixed 39 39 35 32 39 1
remaining 29 29 33 36 29 67
~ remaining non-runlevel 8 8 12 15 8 46
~ remaining runlevel (low-prio) 21 21 21 21 21 21
~ untested 5 2 2 2 82 82

Progress breakdown

# namespace 4.19 4.18 4.17 4.16 4.15 4.14
1 oc debug node pods #1763 #1816 #1818  
2 openshift-apiserver-operator #573 #581  
3 openshift-authentication #656 #675  
4 openshift-authentication-operator #656 #675  
5 openshift-catalogd #50 #58  
6 openshift-cloud-credential-operator #681 #736  
7 openshift-cloud-network-config-controller #2282 #2490 #2496    
8 openshift-cluster-csi-drivers #118 #5310 #135 #524 #131 #306 #265 #75   #170 #459 #484  
9 openshift-cluster-node-tuning-operator #968 #1117  
10 openshift-cluster-olm-operator #54 n/a n/a
11 openshift-cluster-samples-operator #535 #548  
12 openshift-cluster-storage-operator #516   #459 #196 #484 #211  
13 openshift-cluster-version       #1038 #1068  
14 openshift-config-operator #410 #420  
15 openshift-console #871 #908 #924  
16 openshift-console-operator #871 #908 #924  
17 openshift-controller-manager #336 #361  
18 openshift-controller-manager-operator #336 #361  
19 openshift-e2e-loki #56579 #56579 #56579 #56579  
20 openshift-image-registry       #1008 #1067  
21 openshift-ingress   #1032        
22 openshift-ingress-canary   #1031        
23 openshift-ingress-operator   #1031        
24 openshift-insights #1033 #1041 #1049 #915 #967  
25 openshift-kni-infra #4504 #4542 #4539 #4540  
26 openshift-kube-storage-version-migrator #107 #112  
27 openshift-kube-storage-version-migrator-operator #107 #112  
28 openshift-machine-api #1308 #1317 #1311 #407 #315 #282 #1220 #73 #50 #433 #332 #326 #1288 #81 #57 #443  
29 openshift-machine-config-operator #4636 #4219 #4384 #4393  
30 openshift-manila-csi-driver #234 #235 #236  
31 openshift-marketplace #578 #561 #570
32 openshift-metallb-system #238 #240 #241    
33 openshift-monitoring #2298 #366 #2498   #2335 #2420  
34 openshift-network-console #2545        
35 openshift-network-diagnostics #2282 #2490 #2496    
36 openshift-network-node-identity #2282 #2490 #2496    
37 openshift-nutanix-infra #4504 #4539 #4540  
38 openshift-oauth-apiserver #656 #675  
39 openshift-openstack-infra #4504   #4539 #4540  
40 openshift-operator-controller #100 #120  
41 openshift-operator-lifecycle-manager #703 #828  
42 openshift-route-controller-manager #336 #361  
43 openshift-service-ca #235 #243  
44 openshift-service-ca-operator #235 #243  
45 openshift-sriov-network-operator #995 #999 #1003  
46 openshift-user-workload-monitoring #2335 #2420  
47 openshift-vsphere-infra #4504 #4542 #4539 #4540  
48 (runlevel) kube-system            
49 (runlevel) openshift-cloud-controller-manager            
50 (runlevel) openshift-cloud-controller-manager-operator            
51 (runlevel) openshift-cluster-api            
52 (runlevel) openshift-cluster-machine-approver            
53 (runlevel) openshift-dns            
54 (runlevel) openshift-dns-operator            
55 (runlevel) openshift-etcd            
56 (runlevel) openshift-etcd-operator            
57 (runlevel) openshift-kube-apiserver            
58 (runlevel) openshift-kube-apiserver-operator            
59 (runlevel) openshift-kube-controller-manager            
60 (runlevel) openshift-kube-controller-manager-operator            
61 (runlevel) openshift-kube-proxy            
62 (runlevel) openshift-kube-scheduler            
63 (runlevel) openshift-kube-scheduler-operator            
64 (runlevel) openshift-multus            
65 (runlevel) openshift-network-operator            
66 (runlevel) openshift-ovn-kubernetes            
67 (runlevel) openshift-sdn            
68 (runlevel) openshift-storage            

Feature Overview (aka. Goal Summary)  

Implement Migration core for MAPI to CAPI for AWS

  • This feature covers the design and implementation of converting from using the Machine API (MAPI) to Cluster API (CAPI) for AWS
  • This Design investigates possible solutions for AWS
  • Once AWS shim/sync layer is implemented use the architecture for other clouds in phase-2 & phase 3

Acceptance Criteria

When customers switch over to using CAPI, there must be no negative effect: Machine resources must migrate seamlessly, and the fields in MAPI/CAPI should reconcile from both CRDs.

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

Why is this important?

  • We need to build out the core so that development of the migration for individual providers can then happen in parallel
  •  

Scenarios

  1. ...

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

  1. ...

Open questions::

  1. ...

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

As QE have tried to test upstream CAPI pausing, we've hit a few issues with running the migration controller and the cluster CAPI operator on a real cluster vs envtest.

This card captures the work required to iron out these kinks and get things running (i.e. not crashing).

I also think we want an e2e or some sort of automated testing to ensure we don't break things again.

 

Goal: Stop the CAPI operator crashing on startup in a real cluster.

 

Non-goals: get the entire conversion flow running from CAPI -> MAPI and MAPI -> CAPI. We still need significant feature work before we're there.

Feature Overview (aka. Goal Summary)  

As a cluster administrator, I want to use Karpenter on an OpenShift cluster running in AWS to scale nodes instead of the Cluster Autoscaler (CAS). I want to automatically manage heterogeneous compute resources in my OpenShift cluster without the additional manual task of managing node pools. Additional features I want are:

  • Reducing cloud costs through instance selection and scaling/descaling
  • Support GPUs, spot instances, mixed compute types and other compute types.
  • Automatic node lifecycle management and upgrades

This feature covers the work done to integrate upstream Karpenter 1.x with ROSA HCP. This eliminates the need for manual node pool management while ensuring cost-effective compute selection for workloads. Red Hat manages the node lifecycle and upgrades.

The feature will be rolled out with ROSA (AWS) first, since it has the more mature Karpenter ecosystem, followed by the ARO (Azure) implementation (see OCPSTRAT-1498).

Goals (aka. expected user outcomes)

  1. Run Karpenter in the management cluster and disable CAS
  2. Automate node provisioning in the workload cluster
  3. Automate lifecycle management in the workload cluster
  4. Reduce cost for heterogeneous compute workloads

https://docs.google.com/document/d/1ID_IhXPpYY4K3G_wa1MYJxOb3yz5FYoOj3ONSkEDsZs/edit?tab=t.0#heading=h.yvv1wy2g0utk

Requirements (aka. Acceptance Criteria):

As a cluster-admin or SRE, I should be able to configure Karpenter with OCP on AWS. Both the CLI and the UI should enable users to configure Karpenter and disable CAS.

  1. Run Karpenter in management cluster and disable CAS
  2. OCM API 
    • Enable/Disable Cluster autoscaler
    • Enable/disable AutoNode feature
    • New ARN role configuration for Karpenter
    • Optional: New managed policy or integration with existing nodepool role permissions
  3. Expose NodeClass/Nodepool resources to users. 
  4. secure node provisioning and management, machine approval system for Karpenter instances
  5. HCP Karpenter cleanup/deletion support
  6. ROSA CAPI fields to enable/disable/configure Karpenter
  7. Write end-to-end tests for karpenter running on ROSA HCP

Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.

Deployment considerations List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both managed ROSA HCP
Classic (standalone cluster)  
Hosted control planes yes
Multi node, Compact (three node), or Single node (SNO), or all MNO
Connected / Restricted Network Connected
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) x86_64, ARM (aarch64)
Operator compatibility  
Backport needed (list applicable versions) No
UI need (e.g. OpenShift Console, dynamic plugin, OCM) yes - console
Other (please specify) rosa-cli

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

<your text here>

Questions to Answer (Optional):

Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

<your text here>

 

Out of Scope

High-level list of items that are out of scope.  Initial completion during Refinement status.

  • Supporting this feature in Standalone OCP/self-hosted HCP/ROSA classic
  • Creating a multi-provider cost/pricing operator compatible with CAPI is beyond the scope of this Feature. That may take more time.
  •  

Background

Provide any additional context is needed to frame the feature.  Initial completion during Refinement status.

  • Karpenter.sh is an open-source node provisioning project built for Kubernetes. It is designed to simplify Kubernetes infrastructure by automatically launching and terminating nodes based on the needs of your workloads. Karpenter can help you to reduce costs, improve performance, and simplify operations.
  • Karpenter works by observing the unscheduled pods in your cluster and launching new nodes to accommodate them. Karpenter can also terminate nodes that are no longer needed, which can help you save money on infrastructure costs.
  • Karpenter's architecture splits into a cloud-agnostic core (karpenter-core) and a cloud-specific provider (e.g. karpenter-provider-aws). The core observes unschedulable pods and computes which nodes to provision or consolidate to reduce cost, while the provider implements the cloud-specific calls needed to launch and terminate those nodes. See the sketch after this list.
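
For illustration, node provisioning with upstream Karpenter on AWS is driven by NodePool and EC2NodeClass resources; a minimal sketch using the upstream APIs (the names, requirements, role, and discovery tags below are assumptions, and the exact resources exposed to ROSA HCP users may differ):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]            # mix spot and on-demand to reduce cost
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]               # heterogeneous compute
  limits:
    cpu: "100"                                   # illustrative cap on total provisioned CPU
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2023@latest                         # assumption; the managed offering may pin a different image
  role: KarpenterNodeRole-example                # illustrative IAM role
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: example-cluster    # illustrative discovery tag
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: example-cluster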

Customer Considerations

Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

<your text here>

Documentation Considerations

Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

  • Migration guides from using CAS to Karpenter
  • Performance testing to compare CAS vs Karpenter on ROSA HCP
  • API documentation for NodePool and EC2NodeClass configuration

Interoperability Considerations

Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

<your text here>

Goal

  • Codify and enable usage of a prototype for HCP working with Karpenter on the management side.

Why is this important?

  • A first usable version is critical to democratize knowledge and develop internal feedback.

Acceptance Criteria

  • Deploying a cluster with --auto-node results in Karpenter running on the management side, with the CRDs and a default EC2NodeClass installed within the guest cluster
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Goal Summary

This feature aims to make sure that the HyperShift operator and the control plane it deploys use Managed Service Identities (MSI) and have access to scoped credentials (potentially also via access to AKS's image gallery). Additionally, operators deployed in the customer's account (system components) would be scoped with Azure workload identities.

Epic Goal

The Cluster Network Operator can authenticate with a Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets Store CSI driver will be used to mount the certificate as a volume on the Cluster Network Operator deployment in a hosted control plane.
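
As a rough illustration of the mechanism (not the final ARO HCP wiring), the Azure provider for the Secrets Store CSI driver can surface a Key Vault certificate as a mounted volume via a SecretProviderClass; the names, vault, tenant, and identity values below are assumptions:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: cno-service-principal-cert               # illustrative
  namespace: clusters-example                    # illustrative hosted control plane namespace
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<identity-client-id>"             # illustrative; depends on the identity mode in use
    keyvaultName: example-keyvault               # illustrative
    tenantId: "<tenant-id>"                      # illustrative
    objects: |
      array:
        - |
          objectName: cno-sp-cert
          objectType: secret                     # fetching as "secret" returns the cert plus private key

# ...and the corresponding volume on the operator deployment:
volumes:
- name: sp-cert
  csi:
    driver: secrets-store.csi.k8s.io
    readOnly: true
    volumeAttributes:
      secretProviderClass: cno-service-principal-cert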

Why is this important?

  • This is needed to enable authentication with Service Principal with backing certificates for ARO HCP.

Acceptance Criteria

  • Cluster Network Operator is able to authenticate with Azure in ARO HCP using Service Principal with a backing certificate.
  • Updated documentation
  • ARO HCP CI coverage

Dependencies (internal and external)

Azure SDK

Previous Work (Optional):

SDN-4450

Open questions:

Which degree of coverage should run on AKS e2e vs on existing e2es

Done Checklist

CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

Problem

Today, Azure installation requires a manually created service principal, which involves creation, permission granting, credential setting, credential storage, credential rotation, credential clean-up, and service principal deletion. This is not only mundane and time-consuming but also less secure, risking access to resources by adversaries due to the lack of credential rotation.

Goal

Employ Azure managed credentials, which drastically reduce the required steps to just managed identity creation, permission granting, and resource deletion.

Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.  

Feature Overview

  • An assistant to help developers in ODC edit configuration YML files

Goals

  • Perform an architectural spike to better assess feasibility and value of pursuing this further

Requirements

Requirement Notes isMvp?
CI - MUST be running successfully with test automation This is a requirement for ALL features. YES
Release Technical Enablement Provide necessary release enablement details and documents. YES

(Optional) Use Cases

This Section:

  • Main success scenarios - high-level user stories
  • Alternate flow/scenarios - high-level user stories
  • ...

Questions to answer…

  • Is there overlap with what other teams at RH are already planning? 

Out of Scope

Background, and strategic fit

This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.

Assumptions

  • ...

Customer Considerations

  • More details in the outcome parent RHDP-985

Documentation Considerations

Questions to be addressed:

  • What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
  • Does this feature have doc impact?
  • New Content, Updates to existing content, Release Note, or No Doc Impact
  • If unsure and no Technical Writer is available, please contact Content Strategy.
  • What concepts do customers need to understand to be successful in [action]?
  • How do we expect customers will use the feature? For what purpose(s)?
  • What reference material might a customer want/need to complete [action]?
  • Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
  • What is the doc impact (New Content, Updates to existing content, or Release Note)?

Problem: As a user of OpenShift Lightspeed, I would like to import a YAML generated in the Lightspeed window into the OpenShift console YAML editor.

Why is it important?

Use cases:

  1. <case>

Acceptance criteria:

  • Add an Import in YAML Editor action inside the OLS chat popup.
  • Along with the copy button we can add another button that imports the generated YAML data inside the YAML editor.
  • This action should also be able to redirect users to the YAML editor and then paste the generated YAML inside the editor.
  • We will need to create a new extension point that can help trigger the action from the OLS chat popup and a way to listen to any such triggers inside the YAML editor.
  • We also need to consider certain edge cases, like:
  • What happens if the user has already added something in the editor and then triggers the import action from OLS?
  • What happens when the user imports a YAML from OLS and then regenerates it to modify something?

Dependencies (External/Internal):

Design Artifacts:

Exploration:

Note:

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

Incomplete Epics

This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled

Epic Goal

This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so we don't forget to improve it in a safer and more maintainable way.

Why is this important?

Maintainability and debuggability, and in general fighting technical debt, are critical to keep velocity and ensure overall high quality.

Scenarios

  1. N/A

Acceptance Criteria

  • depends on the specific card

Dependencies (internal and external)

  • depends on the specific card

Previous Work (Optional):

https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479 
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
https://issues.redhat.com/browse/CNF-9566 

 Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Capture the necessary accidental work to get CI / Konflux unstuck during the 4.19 cycle

Due to capacity problems on the s390x environment, the Konflux team recommended disabling the s390x platform from the PR pipeline.

 

Slack thread

An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.

When running ./build-frontend.sh, I am getting the following warnings in the build log:

warning " > cypress-axe@0.12.0" has unmet peer dependency "axe-core@^3 || ^4".
warning " > cypress-axe@0.12.0" has incorrect peer dependency "cypress@^3 || ^4 || ^5 || ^6".

To fix:

  • upgrade cypress-axe to a version which supports our current cypress version (13.10.0) and install axe-core to resolve the warnings
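
A possible way to apply that fix (assuming a cypress-axe release that lists Cypress 13 in its peer range):

yarn add --dev cypress-axe@latest axe-core@latest   # run in the package that declares cypress
./build-frontend.sh                                 # re-run to confirm the peer-dependency warnings are gone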

Goal

  • ...

Why is this important?

Scenarios

  1. ...

Acceptance Criteria

  • Dev - Has a valid enhancement if necessary
  • CI - MUST be running successfully with tests automated
  • QE - covered in Polarion test plan and tests implemented
  • Release Technical Enablement - Must have TE slides
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions:

Done Checklist

  • CI - CI is running, tests are automated and merged.
  • Release Technical Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Enhancement merged: <link to meaningful PR or GitHub Issue>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

When adding assignContributorRole to assign contributor roles for the appropriate scopes to existing SPs, we missed assigning the role over the DNS resource group (RG) scope.

We are constantly bumping up against quotas when trying to create new ServicePrincipals per test. Example:

=== NAME  TestCreateClusterV2
    hypershift_framework.go:291: failed to create cluster, tearing down: failed to create infra: ERROR: The directory object quota limit for the Tenant has been exceeded. Please ask your administrator to increase the quota limit or delete objects to reduce the used quota. 

We need to create a set of ServicePrincipals to use during testing, and we need to reuse them while executing the e2e-aks.

The details of this Jira Card are restricted (Red Hat Employee and Contractors only)

These are items that the team has prioritized to address in 4.18.

In https://issues.redhat.com/browse/MCO-1469, we are migrating my helper binaries into the MCO repository. I had to make changes to several of my helpers in the original repository to address bugs and other issues in order to unblock https://github.com/openshift/release/pull/58241. Because of the changes I requested during the PR review to make the integration easier, it may be a little tricky to incorporate all of my changes into the MCO repository, but it is still doable.

Done When:

  • The latest changes to zacks-openshift-helpers are incorporated into the MCO repository versions of the relevant helper binaries.

In OCP 4.7 and before, you were able to see the MCD logs of the previous container post-upgrade. In newer versions it seems that we no longer do. I am not sure if this is a change in kube pod logging behaviour, in how the pod gets shut down and brought up, or something in the MCO.

 

This, however, makes it relatively hard to debug newer versions of the MCO, and in numerous bugs we could not pinpoint the source of the issue since we no longer have the necessary logs. We should find a way to properly save the previous-boot MCD logs if possible.

This epic has been repurposed for handling bugs and issues related to the DataImage API (see comments by Zane and the Slack discussion below). Some issues have already been added; more will be added to improve the stability and reliability of this feature.

Reference links :
Issue opened for IBIO : https://issues.redhat.com/browse/OCPBUGS-43330
Slack discussion threads :
https://redhat-internal.slack.com/archives/CFP6ST0A3/p1729081044547689?thread_ts=1728928990.795199&cid=CFP6ST0A3
https://redhat-internal.slack.com/archives/C0523LQCQG1/p1732110124833909?thread_ts=1731660639.803949&cid=C0523LQCQG1

Description of problem:

After deleting a BareMetalHost which has a related DataImage, the DataImage is still present. I'd expect the DataImage to be deleted together with the BMH.

Version-Release number of selected component (if applicable):

4.17.0-rc.0    

How reproducible:

100%    

Steps to Reproduce:

    1. Create BaremetalHost object as part of the installation process using Image Based Install operator

    2. Image Based Install operator will create a dataimage as part of the install process
     
    3. Delete the BaremetalHost object 

    4. Check the DataImage assigned to the BareMetalHost     

Actual results:

While the BaremetalHost was deleted the DataImage is still present:

oc -n kni-qe-1 get bmh
No resources found in kni-qe-1 namespace.

 oc -n kni-qe-1 get dataimage -o yaml
apiVersion: v1
items:
- apiVersion: metal3.io/v1alpha1
  kind: DataImage
  metadata:
    creationTimestamp: "2024-09-24T11:58:10Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2024-09-24T14:06:15Z"
    finalizers:
    - dataimage.metal3.io
    generation: 2
    name: sno.kni-qe-1.lab.eng.rdu2.redhat.com
    namespace: kni-qe-1
    ownerReferences:
    - apiVersion: metal3.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: BareMetalHost
      name: sno.kni-qe-1.lab.eng.rdu2.redhat.com
      uid: 0a8bb033-5483-4fe8-8e44-06bf43ae395f
    resourceVersion: "156761793"
    uid: 2358cae9-b660-40e6-9095-7daabb4d9e48
  spec:
    url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso
  status:
    attachedImage:
      url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso
    error:
      count: 0
      message: ""
    lastReconciled: "2024-09-24T12:03:28Z"
kind: List
metadata:
  resourceVersion: ""
    

Expected results:

    The DataImage gets deleted when the BaremetalHost owner gets deleted.

Additional info:

This is impacting automated test pipelines which use the Image Based Install operator, as the cleanup stage gets stuck waiting for the namespace deletion, which still holds the DataImage. Also, deleting the DataImage gets stuck, and it can only be deleted by removing the finalizer.
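
For reference, the workaround mentioned above amounts to clearing the finalizer on the stuck DataImage, roughly:

oc -n kni-qe-1 patch dataimage sno.kni-qe-1.lab.eng.rdu2.redhat.com \
  --type merge -p '{"metadata":{"finalizers":null}}'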

oc  get namespace kni-qe-1 -o yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c33,c2
    openshift.io/sa.scc.supplemental-groups: 1001060000/10000
    openshift.io/sa.scc.uid-range: 1001060000/10000
  creationTimestamp: "2024-09-24T11:40:03Z"
  deletionTimestamp: "2024-09-24T14:06:14Z"
  labels:
    app.kubernetes.io/instance: clusters
    cluster.open-cluster-management.io/managedCluster: kni-qe-1
    kubernetes.io/metadata.name: kni-qe-1
    name: kni-qe-1-namespace
    open-cluster-management.io/cluster-name: kni-qe-1
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.24
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.24
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.24
  name: kni-qe-1
  resourceVersion: "156764765"
  uid: ee984850-665a-4f5e-8f17-0c44b57eb925
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: 'Some resources are remaining: dataimages.metal3.io has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2024-09-24T14:06:23Z"
    message: 'Some content in the namespace has finalizers remaining: dataimage.metal3.io
      in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating
     

Tracking all things Konflux related for the Metal Platform Team
Full enablement should happen during OCP 4.19 development cycle

Description of problem:


The host that gets used in production builds to download the iso will change soon.

It would be good to allow this host to be set through configuration from the release team / ocp-build-data

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

List of component:

  • assisted-image-service
  • assisted-installer
  • assisted-installer-controller
  • assisted-installer-agent
  • assisted-service-rhel9
  • assisted-service-rhel8

Adapt the current Dockerfiles and add them to the upstream repos.

List of components:

  • assisted-installer
  • assisted-installer-reporter (aka assisted-installer-controller)
  • assisted-installer-agent

A lot of the time, our pipelines, as well as other teams' pipelines, are stuck because they are unable to provision hosts with different architectures to build the images.

Because we currently don't use the multi-arch images we build with Konflux, we will stop building multi-arch for now and re-add those architectures when we need them.

Currently, the monitoring stack is configured using a configmap. In OpenShift though the best practice is to configure operators using custom resources.

Why this matters

  • We can add [cross]validation rules to CRD fields to avoid misconfigurations
  • End users get a much faster feedback loop. No more applying the config and scanning logs if things don't look right. The API server will give immediate feedback
  • Organizational users (such as ACM) can manage a single resource and observe its status

To start the effort we should create a feature gate behind which we can start implementing a CRD config approach. This allows us to iterate in smaller increments without having to support full feature parity with the config map from the start. We can start small and add features as they evolve.

One proposal for a minimal DoD was:

  • We have a feature gate
  • We have outlined our idea and approach in an enhancement proposal. This does not have to be complete, just outline how we intend to implement this. OpenShift members have reviewed this and given their general approval. The OEP does not need to be complete or merged.
  • We have CRD scaffolding that CVO creates and CMO watches
  • We have a clear idea for a migration path. Even with a feature gate in place we may not simply switch config mechanisms, i.e. we must have a mechanism to merge settings from the config maps and CR, with the CR taking precedence.
  • We have at least one or more fields, CMO can act upon. For example
    • a bool field telling CMO to use the config map for configuration
    • ...

Feature parity should be planned in one or more separate epics.

This story covers the implementation of our initial CRD in CMO. When the feature gate is enabled, CMO watches a singleton CR (name TBD) and acts on changes. The initial feature could be a boolean flag (defaulting to true) that tells CMO to merge the config map settings. If a user sets this flag to false, the config map is ignored and default settings are applied.
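
Purely as a hypothetical sketch of what such a singleton CR could look like (the group, version, kind, and field names below are invented for illustration and are not a committed API):

apiVersion: monitoring.openshift.io/v1alpha1     # hypothetical group/version
kind: ClusterMonitoring                          # hypothetical kind; the real name is still TBD
metadata:
  name: cluster                                  # singleton
spec:
  useConfigMap: true                             # hypothetical flag; true = keep merging openshift-monitoring/cluster-monitoring-config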

The history of this epic starts with this PR, which triggered a lengthy conversation around the workings of the image API with respect to importing imagestream images as single vs manifest-listed. Imagestreams today default the `importMode` flag to `Legacy` to avoid breaking the behavior of existing clusters in the field. This makes sense for single-arch clusters deployed with a single-arch payload, but when users migrate to the multi payload, more often than not their intent is to add nodes of other architecture types. When this happens, problems arise when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality for existing users who just want to create an imagestream and use it with existing commands.

There was a discussion with David Eads and other staff engineers, and it was decided that the approach to take is to default imagestreams' importMode to `PreserveOriginal` if the cluster is installed with, or upgraded to, a multi payload. A few things need to happen to achieve this:

  • CVO would need to expose a field in the status section indicative of the type of payload in the cluster (single vs multi)
  • cluster-openshift-apiserver-operator would read this field and add it to the apiserver configmap. openshift-apiserver would use this value to determine the setting of importMode value.
  • Document clearly that the behavior of imagestreams in a cluster with multi payload is different from the traditional single payload
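
For context, the import mode is already surfaced on the ImageStream API and the oc client; for example, a tag can request the full manifest list roughly like this (stream and image names are illustrative):

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: example                                  # illustrative
spec:
  tags:
  - name: latest
    from:
      kind: DockerImage
      name: quay.io/example/app:latest           # illustrative
    importPolicy:
      importMode: PreserveOriginal               # keep the whole manifest list instead of a single sub-manifest

# or via the CLI:
# oc import-image example:latest --from=quay.io/example/app:latest --confirm --import-mode=PreserveOriginal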

Some open questions:

  • What happens to existing imagestreams on upgrades?
  • How do we handle CVO-managed imagestreams? (IMO, CVO-managed imagestreams should always set importMode to PreserveOriginal as the images are associated with the payload.)

 

This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.

This task focuses on ensuring that all OpenStack resources automatically created by Hypershift for Hosted Control Planes are tagged with a unique identifier, such as the HostedCluster ID. These resources include, but are not limited to, servers, ports, and security groups. Proper tagging will enable administrators to clearly identify and manage resources associated with specific OpenShift clusters.

Acceptance Criteria:

  1. Tagging Mechanism: All OpenStack resources created by Hypershift (e.g., servers, ports, security groups) should be tagged with the relevant Cluster ID or other unique identifiers.
  2. Automated Tagging: The tagging should occur automatically when resources are provisioned by Hypershift.
  3. Consistency: Tags should follow a standardized naming convention for easy identification (e.g., cluster-id: <ClusterID>).
  4. Compatibility: Ensure the solution is compatible with the current Hypershift setup for OpenShift Hosted Control Planes, without disrupting functionality.
  5. Testing: Create automated tests or manual test procedures to verify that the resources are properly tagged when created.
  6. Documentation: Update relevant documentation to inform administrators about the new tagging system, how to filter resources by tags, and any related configurations.
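
As an example of how an administrator might later filter by such tags (the CLI invocations and the tag string are illustrative and assume the standard OpenStack client tag filters):

openstack server list --tags cluster-id=<ClusterID>
openstack port list --tags cluster-id=<ClusterID>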

Epic Goal*

What is our purpose in implementing this?  What new capability will be available to customers?

Why is this important? (mandatory)

What are the benefits to the customer or Red Hat?   Does it improve security, performance, supportability, etc?  Why is work a priority?

Scenarios (mandatory) 

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.  

  1.  

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic. 

Contributing Teams(and contacts) (mandatory) 

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

  • Development - 
  • Documentation -
  • QE - 
  • PX - 
  • Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.  

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

  • CI Testing - Tests are merged and completing successfully
  • Documentation - Content development is complete.
  • QE - Test scenarios are written and executed successfully.
  • Technical Enablement - Slides are complete (if requested by PLM)
  • Other 

Before implementing any changes in the API, we need to send an enhancement proposal containing the design changes we suggest for openshift/api/config/v1/types_cluster_version.go to allow changing the log level of the CVO through an API configuration.
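
Purely as a hypothetical illustration of the kind of change being proposed (the field is invented for this sketch and mirrors the logLevel convention used by operator.openshift.io operators; it does not exist until the enhancement and API change merge):

oc patch clusterversion version --type merge -p '{"spec":{"logLevel":"Debug"}}'   # hypothetical field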

Definition of Done:

OCP/Telco Definition of Done
Epic Template descriptions and documentation.

<--- Cut-n-Paste the entire contents of this description into your new Epic --->

Epic Goal

  • We want to remove official UPI and IPI support for the Alibaba Cloud provider. Going forward, we recommend installations on Alibaba Cloud with either the external platform or the agnostic platform installation method.

Why is this important?

An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:

(1) Low customer interest of using Openshift on Alibaba Cloud

(2) Removal of Terraform usage

(3) MAPI to CAPI migration

(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)

Scenarios

Impacted areas based on CI:

alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml

Acceptance Criteria

  • CI - MUST be running successfully with tests automated
  • Release Technical Enablement - Provide necessary release enablement details and documents.
  • ...

Dependencies (internal and external)

  1. ...

Previous Work (Optional):

Open questions::

Done Checklist

  • CI - CI jobs are removed
  • Release Enablement <link to Feature Enablement Presentation>
  • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
  • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
  • DEV - Downstream build attached to advisory: <link to errata>
  • QE - Test plans in Polarion: <link or reference to Polarion>
  • QE - Automated tests merged: <link or reference to automated tests>
  • DOC - Downstream documentation merged: <link to meaningful PR>

Acceptance Criteria

  • Since api and library-go are the last projects for removal, remove only alibaba specific code and vendoring

Other Complete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled

Description of problem:

In the ASH ARM template 06_workers.json[1], there is an unused variable "identityName" defined. This is harmless, but a little odd to be present in the official UPI installation doc[2], and it might confuse users when installing a UPI cluster on ASH.

[1] https://github.com/openshift/installer/blob/master/upi/azurestack/06_workers.json#L52
[2]  https://docs.openshift.com/container-platform/4.17/installing/installing_azure_stack_hub/upi/installing-azure-stack-hub-user-infra.html#installation-arm-worker_installing-azure-stack-hub-user-infra

We suggest removing it from the ARM template.
    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Initially, the clusters at version 4.16.9 were having issues with reconciling the IDP. The error which was found in Dynatrace was

 

  "error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: Service Unavailable",  

 

Initially it was assumed that the IDP service was unavailable, but the customer confirmed that they also have the GroupSync operator running inside all clusters, which can successfully connect to the customer IDP and sync User + Group information from the IDP into the cluster.

The customer was advised to upgrade to 4.16.18, keeping in mind a few other OCPBUGS which were related to the proxy and would be resolved by upgrading to 4.16.15+.

However, after the upgrade, the IDP still seems to be failing to apply. It looks like the IDP reconciler isn't considering the Additional Trust Bundle for the customer proxy.

Checking DT Logs, it seems to fail to verify the certificate

"error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority",

  "error": "failed to update control plane: [failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority, failed to update status: Operation cannot be fulfilled on hostedcontrolplanes.hypershift.openshift.io \"rosa-staging\": the object has been modified; please apply your changes to the latest version and try again]", 

Version-Release number of selected component (if applicable):

4.16.18

How reproducible:

Customer has a few clusters deployed and each of them has the same issue.    

Steps to Reproduce:

    1. Create a HostedCluster with a proxy configuration that specifies an additionalTrustBundle, and an OpenID idp that can be publicly verified (ie. EntraID or Keycloak with LetsEncrypt certs)
    2. Wait for the cluster to come up and try to use the IDP
    3.
    

Actual results:

IDP is failing to work for HCP

Expected results:

IDP should be working for the clusters

Additional info:

    The issue will happen only if the IDP does not require a custom trust bundle to be verified.

Description of problem:

The initial set of default endpoint overrides we specified in the installer is missing a v1 at the end of the DNS services override.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Created a service for the DNS server for secondary networks in OpenShift Virtualization using MetalLB, but the IP is still pending; when accessing the service from the UI, it crashes.

Version-Release number of selected component (if applicable):

4.17

How reproducible:

    

Steps to Reproduce:

    1. Create an IP pool (for example, with 1 IP) for MetalLB and fully utilize the IP range (with another service)
    2. Allocate a new IP using the oc expose command as below
    3. Check the service status in the UI
    

Actual results:

UI crash

Expected results:

Should show the service status

Additional info:

oc expose -n openshift-cnv deployment/secondary-dns --name=dns-lb --type=LoadBalancer --port=53 --target-port=5353 --protocol='UDP'

Description of problem:

Improve tests to remove the issue in the following helm test case:
HR-08-TC02: Helm Release - Perform the helm chart upgrade for an already upgraded helm chart

The following error originated from your application code, not from Cypress. It was caused by an unhandled promise rejection.

  > Cannot read properties of undefined (reading 'repoName')

When Cypress detects uncaught errors originating from your application it will automatically fail the current test.

    

Version-Release number of selected component (if applicable):


    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:


    

Expected results:


    

Additional info:


    

Description of problem:

When the master MCP is paused, the below alert is triggered:
Failed to resync 4.12.35 because: Required MachineConfigPool 'master' is paused

The nodes have been rebooted to make sure there is no pending MC rollout.

Affects version

  4.12
    

How reproducible:

Steps to Reproduce:

    1. Create a MC and apply it to master
    2. use below mc
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-cgroupsv2
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=1
    3. Wait until the nodes are rebooted and running
    4. Pause the MCP
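
Step 4 above can be done with a merge patch on the pool, for example:

oc patch mcp master --type merge -p '{"spec":{"paused":true}}'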

Actual results:

Pausing the MCP causes the alert to fire.
    

Expected results:


Alerts should not be fired

Additional info:

    

Description of problem:

The story is to track the i18n upload/download routine tasks which are performed every sprint.

 

A.C.

  - Upload strings to Memsource at the start of the sprint and reach out to the localization team

  - Download translated strings from Memsource when it is ready

  -  Review the translated strings and open a pull request

  -  Open a followup story for next sprint

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

In CAPI, we use a random machineNetwork instead of using the one passed in by the user. 

Description of problem:

Due to the recent changes, using oc 4.17 adm node-image commands on a 4.18 OCP cluster doesn't work.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

    1. oc adm node-image create / monitor
    2.
    3.
    

Actual results:

    The commands fail

Expected results:

    The commands should work as expected

Additional info:

    

Description of problem:

Currently, both the nodepool controller and the CAPI controller set the updatingConfig condition on nodepool upgrades. We should only use one to set the condition, to avoid constant switching between conditions and to ensure the logic used for setting this condition is the same.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    CAPI and Nodepool controller set a different status because their logic is not consistent.

Expected results:

    CAPI and Nodepool controllers set the same status because their logic is consolidated.

Additional info:

    

Description of problem:

    We are currently using node 18, but our types are for node 10

Version-Release number of selected component (if applicable):

    4.19.0

How reproducible:

    Always

Steps to Reproduce:

    1. Open frontend/package.json
    2. Observe @types/node and engine version
    3.
    

Actual results:

    They are different 

Expected results:

    They are the same

Additional info:

    

Description of problem:

checked in 4.18.0-0.nightly-2024-12-05-103644/4.19.0-0.nightly-2024-12-04-031229, OCPBUGS-34533 is reproduced on 4.18+, no such issue with 4.17 and below.

Steps: log in to the admin console or developer console (admin console: go to the "Observe -> Alerting -> Silences" tab; developer console: go to the "Observe -> Silences" tab), create a silence, and edit the "Until" option. Even with a valid timestamp (or an invalid one), you will get the error "[object Object]" in the "Until" field. See screen recording: https://drive.google.com/file/d/14JYcNyslSVYP10jFmsTaOvPFZSky1eg_/view?usp=drive_link

The 4.17 fix for OCPBUGS-34533 is already in the 4.18+ code.

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always

Steps to Reproduce:

1. see the descriptions

Actual results:

Unable to edit "until" filed in silences

Expected results:

able to edit "until" filed in silences

Description of problem:

v1alpha1 schema is still present in the v1 ConsolePlugin CRD and should be removed manually since the generator is re-adding it automatically.    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

To do

  • Change Over Dockerfile base images
  • Double-check image versions in new e2e configs, e.g. initial-4.17, n1minor, n2minor, etc.
  • Do we still need hypershift-aws-e2e-4.17 on newer branches (Seth)
  • MCE config file in release repo
  • Add n-1 e2e test on e2e test file change

Description of problem:

    "destroy cluster" doesn't delete the PVC disks which have the label "kubernetes-io-cluster-<infra-id>: owned"

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-11-27-162629

How reproducible:

    Always

Steps to Reproduce:

1. include the step which sets the cluster default storageclass to the hyperdisk one before ipi-install (see my debug PR https://github.com/openshift/release/pull/59306)
2. "create cluster", and make sure it succeeds
3. "destroy cluster"

Note: although we confirmed the issue with disk type "hyperdisk-balanced", we believe other disk types have the same issue.
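
To illustrate how such orphaned disks can be located after a destroy, the label the uninstaller keys on can be used as a gcloud filter (substitute the infra ID; the invocation is illustrative):

gcloud compute disks list --filter="labels.kubernetes-io-cluster-<infra-id>=owned"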

Actual results:

    The 2 PVC disks of hyperdisk-balanced type are not deleted during "destroy cluster", although the disks have the label "kubernetes-io-cluster-<infra-id>: owned".

Expected results:

    The 2 PVC disks should be deleted during "destroy cluster", because they have the correct/expected labels according to which the uninstaller should be able to detect them. 

Additional info:

    FYI the PROW CI debug job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/59306/rehearse-59306-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.18-installer-rehearse-debug/1861958752689721344

Description of problem:

When the TechPreviewNoUpgrade feature gate is enabled, the console shows a customized 'Create Project' modal to all users.
In the customized modal, the 'Display name' and 'Description' values the user typed are not taking effect.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-16-065305    

How reproducible:

Always when TechPreviewNoUpgrade feature gate is enabled    

Steps to Reproduce:

1. Enable TechPreviewNoUpgrade feature gate
$ oc patch  featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type merge
2. As any user, log in to the console and create a project from the web: set 'Display name' and 'Description', then click 'Create'
3. Check created project YAML
$ oc get project ku-5 -o json | jq .metadata.annotations
{
  "openshift.io/description": "",
  "openshift.io/display-name": "",
  "openshift.io/requester": "kube:admin",
  "openshift.io/sa.scc.mcs": "s0:c28,c17",
  "openshift.io/sa.scc.supplemental-groups": "1000790000/10000",
  "openshift.io/sa.scc.uid-range": "1000790000/10000"
} 

Actual results:

display-name and description are all empty    

Expected results:

display-name and description should be set to the values user had configured    

Additional info:

once TP is enabled, the customized create project modal looks like https://drive.google.com/file/d/1HmIlm0u_Ia_TPsa0ZAGyTloRmpfD0WYk/view?usp=drive_link

Description of problem:

When attempting to install a specific version of an operator from the web console, the install plan of the latest version of that operator is created if the operator version had a + in it. 

Version-Release number of selected component (if applicable):

4.17.6 (Tested version)    

How reproducible:

Easily reproducible   

Steps to Reproduce:

1. Under Operators > Operator Hub, install an operator with a + character in the version.
2. On the next screen, note that the + in the version text box is missing.
3. Make no changes to the default options and proceed to install the operator.
4. An install plan is created to install the operator with the latest version from the channel. 

Actual results:

The install plan is created for the latest version from the channel. 

Expected results:

The install plan is created for the requested version. 

Additional info:

Notes on the reproducer:
- For step 1: the selected version shouldn't be the latest version from the channel for the purposes of this bug. 
- For step 1: The version will need to be selected from the version dropdown to reproduce the bug. If the default version that appears in the dropdown is used, then the bug won't reproduce. 
 
Other Notes: 
- This might also happen with other special characters in the version string other than +, but this is not something that I tested.    

Description of problem:

The StaticPodOperatorStatus API validations permit:
- nodeStatuses[].currentRevision can be cleared and can decrease
- more than one entry in nodeStatuses can have a targetRevision > 0
But both of these signal a bug in one or more of the static pod controllers that write to them.

Version-Release number of selected component (if applicable):

This has been the case ~forever but we are aware of bugs in 4.18+ that are resulting in controllers trying to make these invalid writes. We also have more expressive validation mechanisms today that make it possible to plug the holes.
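
For illustration, a minimal sketch of the kind of transition rule that could close the first gap (a decreasing currentRevision), assuming the standard x-kubernetes-validations CEL mechanism; this is illustrative, not the actual openshift/api change:

~~~yaml
# hypothetical CRD schema fragment for nodeStatuses[].currentRevision
properties:
  currentRevision:
    type: integer
    x-kubernetes-validations:
    - rule: "self >= oldSelf"            # transition rule: new value may not be lower than the old one
      message: "currentRevision must not decrease"
~~~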

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

After the upgrade to OpenShift Container Platform 4.17, it's being observed that aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics is reporting a target down state. When checking the newly created container, one can find the logs below, which may explain the reported effect.

$ oc logs aws-efs-csi-driver-controller-5b8d5cfdf4-zwh67 -c kube-rbac-proxy-8211
W1119 07:53:10.249934       1 deprecated.go:66] 
==== Removed Flag Warning ======================

logtostderr is removed in the k8s upstream and has no effect any more.

===============================================
		
I1119 07:53:10.250382       1 kube-rbac-proxy.go:233] Valid token audiences: 
I1119 07:53:10.250431       1 kube-rbac-proxy.go:347] Reading certificate files
I1119 07:53:10.250645       1 kube-rbac-proxy.go:395] Starting TCP socket on 0.0.0.0:9211
I1119 07:53:10.250944       1 kube-rbac-proxy.go:402] Listening securely on 0.0.0.0:9211
I1119 07:54:01.440714       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:19.860038       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:31.432943       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:54:49.852801       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:01.433635       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:19.853259       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:31.432722       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:55:49.852606       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:01.432707       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:19.853137       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:31.440223       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:56:49.856349       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:01.432528       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:19.853132       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:31.433104       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:57:49.852859       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:58:01.433321       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
I1119 07:58:19.853612       1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.17

How reproducible:

Always

Steps to Reproduce:

1. Install OpenShift Container Platform 4.17
2. Install aws-efs-csi-driver-operator
3. Create efs.csi.aws.com CSIDriver object and wait for the aws-efs-csi-driver-controller to roll out.

Actual results:

The below Target Down Alert is being raised

50% of the aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics targets in Namespace openshift-cluster-csi-drivers namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.

Expected results:

The ServiceMonitor endpoint should be reachable and properly responding with the desired information to monitor the health of the component.

Additional info:


When deploying with an endpoint override for the resourceController, the Power VS machine API provider will ignore the override.

Creating clusters in which machines are created in a public subnet and use a public IP makes it possible to avoid creating NAT gateways (or proxies) for AWS clusters. While not applicable for every test, this configuration will save us money and cloud resources.

Description of problem:

    If the install is performed with an AWS user missing the `ec2:DescribeInstanceTypeOfferings` permission, the installer will use a hardcoded instance type from the set of non-edge machine pools. This can potentially cause the edge node to fail during provisioning, since the instance type doesn't take into account edge/wavelength zones support.

Because edge nodes are not needed for the installation to complete, the issue is not noticed by the installer, only by inspecting the status of the edge nodes.

Version-Release number of selected component (if applicable):

    4.16+ (since edge nodes support was added)

How reproducible:

    always

Steps to Reproduce:

    1. Specify an edge machine pool in the install-config without an instance type
    2. Run the install with a user without `ec2:DescribeInstanceTypeOfferings`
    3.
    

Actual results:

    In CI the `node-readiness` test step will fail and the edge nodes will show

                    errorMessage: 'error launching instance: The requested configuration is currently not supported. Please check the documentation for supported configurations.'         
                    errorReason: InvalidConfiguration
              

Expected results:

    Either
1. the permission is always required when instance type is not set for an edge pool; or
2.  a better instance type default is used

Additional info:

    Example CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1862140149505200128
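
As a possible workaround until a better default (or a hard permission requirement) lands, explicitly setting an instance type on the edge pool bypasses the hardcoded fallback. A minimal install-config sketch, where the instance type and zone name are illustrative and must actually be offered in the target Local/Wavelength Zone:

~~~yaml
compute:
- name: edge
  replicas: 1
  platform:
    aws:
      type: m5.2xlarge          # illustrative; must be available in the selected edge zone
      zones:
      - us-east-1-nyc-1a        # illustrative Local Zone name
~~~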

Description of problem:

    When deploying with endpoint overrides, the block CSI driver will try to use the default endpoints rather than the ones specified.

Description of problem:


In order to test OCL we run e2e automated test cases in a cluster that has OCL enabled in master and worker pools.

We have seen that rarely a new machineconfig is rendered but no MOSB resource is created.




    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Rare
    

Steps to Reproduce:

We don't have any steps to reproduce it. It happens eventually when we run a regression in a cluster with OCL enabled in master and worker pools.


    

Actual results:

We see that in some scenarios a new MC is created, then a new rendered MC is created too, but no MOSB is created and the pool is stuck forever.

    

Expected results:

Whenever a new rendered MC is created, a new MOSB should be created too to build the new image.

    

Additional info:

In the comments section we will add all the must-gather files that are related to this issue.


In some scenarios we can see this error reported by the os-builder pod:


2024-12-03T16:44:14.874310241Z I1203 16:44:14.874268       1 request.go:632] Waited for 596.269343ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/secrets?labelSelector=machineconfiguration.openshift.io%2Fephemeral-build-object%2Cmachineconfiguration.openshift.io%2Fmachine-os-build%3Dmosc-worker-5fc70e666518756a629ac4823fc35690%2Cmachineconfiguration.openshift.io%2Fon-cluster-layering%2Cmachineconfiguration.openshift.io%2Frendered-machine-config%3Drendered-worker-7c0a57dfe9cd7674b26bc5c030732b35%2Cmachineconfiguration.openshift.io%2Ftarget-machine-config-pool%3Dworker


Nevertheless, we only see this error in some of them, not in all of them.

    

Description of problem:

checked on 4.18.0-0.nightly-2024-12-07-130635/4.19.0-0.nightly-2024-12-07-115816, admin console, go to alert details page, "No datapoints found." on alert details graph. see picture for CannotRetrieveUpdates alert: https://drive.google.com/file/d/1RJCxUZg7Z8uQaekt39ux1jQH_kW9KYXd/view?usp=drive_link

issue exists in 4.18+, no such issue with 4.17

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always on 4.18+    

Steps to Reproduce:

1. see the description
    

Actual results:

"No datapoints found." on alert details graph

Expected results:

show correct graph

Description of problem:

AlertmanagerConfig with missing options causes Alertmanager to crash

Version-Release number of selected component (if applicable):

 

How reproducible:

Always

Steps to Reproduce:

A cluster administrator has enabled monitoring for user-defined projects.
CMO 

~~~
 config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      retention: 7d
~~~

A cluster administrator has enabled alert routing for user-defined projects. 

UWM cm / CMO cm 

~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true 
      enableAlertmanagerConfig: true
~~~

verify existing config: 

~~~
$ oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093  
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
route:
  receiver: Default
  group_by:
  - namespace
  continue: false
receivers:
- name: Default
templates: []
~~~

create alertmanager config without options "smtp_from:" and "smtp_smarthost"

~~~
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: example-namespace
spec:
  receivers:
    - emailConfigs:
        - to: some.username@example.com
      name: custom-rules1
  route:
    matchers:
      - name: alertname
    receiver: custom-rules1
    repeatInterval: 1m
~~~

check logs for alertmanager: the following error is seen. 

~~~
ts=2023-09-05T12:07:33.449Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="no global SMTP smarthost set"
~~~ 

Actual results:

Alertmanager fails to restart.

Expected results:

The AlertmanagerConfig should be validated up front so that an invalid config cannot crash Alertmanager.

Additional info:

Reproducible with and without user workload Alertmanager.
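
As a workaround sketch (using the from and smarthost fields of the monitoring.coreos.com/v1alpha1 EmailConfig schema; the sender address and SMTP relay below are illustrative), setting the SMTP options per receiver avoids relying on the missing global settings:

~~~yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: example-namespace
spec:
  receivers:
    - name: custom-rules1
      emailConfigs:
        - to: some.username@example.com
          from: alerts@example.com           # illustrative sender address
          smarthost: smtp.example.com:587    # illustrative SMTP relay
  route:
    matchers:
      - name: alertname
    receiver: custom-rules1
    repeatInterval: 1m
~~~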

Description of problem

When updating a 4.13 cluster to 4.14, the new-in-4.14 ImageRegistry capability will always be enabled, because capabilities cannot be uninstalled.

Version-Release number of selected component (if applicable)

4.14 oc should learn about this, so it will appropriately extract registry CredentialsRequests when connecting to 4.13 clusters for 4.14 manifests. 4.15 oc will get OTA-1010 to handle this kind of issue automatically, but there's no problem with getting an ImageRegistry hack into 4.15 engineering candidates in the meantime.

How reproducible

100%

Steps to Reproduce

1. Connect your oc to a 4.13 cluster.
2. Extract manifests for a 4.14 release.
3. Check for ImageRegistry CredentialsRequests.

Actual results

$ oc adm upgrade | head -n1
Cluster version is 4.13.12
$ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64
$ grep -r ImageRegistry credentials-requests
...no hits...

Expected results

$ oc adm upgrade | head -n1
Cluster version is 4.13.12
$ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64
$ grep -r ImageRegistry credentials-requests
credentials-requests/0000_50_cluster-image-registry-operator_01-registry-credentials-request.yaml:    capability.openshift.io/name: ImageRegistry

Additional info

We already do this for MachineAPI. The ImageRegistry capability landed later, and this is us catching the oc-extract hack up with that change.

The cluster-baremetal-operator sets up a number of watches for resources using Owns() that have no effect because the Provisioning CR does not (and should not) own any resources of the given type or using EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace are different from that of the Provisioning CR.

The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.

The correct way to trigger a reconcile of the provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form.

See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.

 

Description of problem:

    Some bundles in the catalog have been given the property in the FBC (and not in the bundle's CSV), and it does not get propagated through to the Helm chart annotations.

Version-Release number of selected component (if applicable):

    

How reproducible:

    Install elasticsearch 5.8.13

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    cluster is upgradeable

Expected results:

    cluster is not upgradeable

Additional info:

    

Description of problem:

For various reasons, Pods may get evicted. Once they are evicted, the owner of the Pod should recreate the Pod so it is scheduled again.

With OLM, we can see that evicted Pods owned by CatalogSources are not rescheduled. The outcome is that all Subscriptions have a "ResolutionFailed=True" condition, which hinders an upgrade of the operator. Specifically, the customer is seeing that the affected CatalogSource is "multicluster-engine-CENSORED_NAME-redhat-operator-index" in the openshift-marketplace namespace, pod name: "multicluster-engine-CENSORED_NAME-redhat-operator-index-5ng9j"

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.16.21

How reproducible:

Sometimes, when Pods are evicted on the cluster

Steps to Reproduce:

1. Set up an OpenShift Container Platform 4.16 cluster, install various Operators
2. Create a condition that a Node will evict Pods (for example by creating DiskPressure on the Node)
3. Observe if any Pods owned by CatalogSources are being evicted

Actual results:

If Pods owned by CatalogSources are being evicted, they are not recreated / rescheduled.

Expected results:

When Pods owned by CatalogSources are evicted, they are recreated / rescheduled.

Additional info:

Description of problem:

    A similar testing scenario to OCPBUGS-38719, but with the pre-existing dns private zone is not a peering zone, instead it is a normal dns zone which binds to another VPC network. And the installation will fail finally, because the dns record-set "*.apps.<cluster name>.<base domain>" is added to the above dns private zone, rather than the cluster's dns private zone. 

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-10-24-093933

How reproducible:

    Always

Steps to Reproduce:

    Please refer to the steps told in https://issues.redhat.com/browse/OCPBUGS-38719?focusedId=25944076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25944076

Actual results:

    The installation failed, due to the cluster operator "ingress" degraded

Expected results:

    The installation should succeed.

Additional info:

    

Description of problem

From our docs:

Due to fundamental Kubernetes design, all OpenShift Container Platform updates between minor versions must be serialized. You must update from OpenShift Container Platform <4.y> to <4.y+1>, and then to <4.y+2>. You cannot update from OpenShift Container Platform <4.y> to <4.y+2> directly. However, administrators who want to update between two even-numbered minor versions can do so incurring only a single reboot of non-control plane hosts.

We should add a new precondition that enforces that policy, so cluster admins who run --to-image ... don't hop straight from 4.y.z to 4.(y+2).z' or similar without realizing that they were outpacing testing and policy.

Version-Release number of selected component

The policy and current lack-of guard both date back to all OCP 4 releases, and since they're Kube-side constraints, they may date back to the start of Kube.

How reproducible

Every time.

Steps to Reproduce

1. Install a 4.y.z cluster.
2. Use --to-image to request an update to a 4.(y+2).z release.
3. Wait a few minutes for the cluster-version operator to consider the request.
4. Check with oc adm upgrade.

Actual results

Update accepted.

Expected results

Update rejected (unless it was forced), complaining about the excessively long hop.

Description of problem:


When setting up the "webhookTokenAuthenticator", the OAuth config "type" is set to "None". 
The controller then sets the console ConfigMap with "authType=disabled", which causes the console pod to go into CrashLoopBackOff due to the disallowed type:

Error:
validate.go:76] invalid flag: user-auth, error: value must be one of [oidc openshift], not disabled.

This worked before on 4.14, stopped working on 4.15.
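
For reference, a minimal sketch of the configuration that triggers this, assuming the webhook token authenticator is set on the cluster Authentication resource (the secret name is illustrative):

~~~yaml
apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  name: cluster
spec:
  type: None
  webhookTokenAuthenticator:
    kubeConfig:
      name: my-webhook-kubeconfig   # illustrative kubeconfig secret in openshift-config
~~~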

    

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.15
    

How reproducible:


    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

The console can't start; it seems this configuration is no longer allowed.
    

Expected results:


    

Additional info:


    

Description of problem:

    In an effort to ensure all HA components are not degraded by design during normal e2e test or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run.

This card captures machine-config operator that blips Degraded=True during some ci job runs.


Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1843561357304139776
  
Reasons associated with the blip: MachineConfigDaemonFailed or MachineConfigurationFailed

For now, we put an exception in the test. But it is expected that teams take action to fix those and remove the exceptions after the fix goes in.

Exception is defined here: https://github.com/openshift/origin/blob/e5e76d7ca739b5699639dd4c500f6c076c697da6/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L109


See linked issue for more explanation on the effort.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

During bootstrapping we're running into the following scenario:

4 members: masters 0, 1 and 2 (full voting) and bootstrap (torn down/dead member); a revision rollout causes master 0 to restart and leaves you with 2/4 healthy, which means quorum loss.

This causes apiserver unavailability during the installation and should be avoided.
    

Version-Release number of selected component (if applicable):

4.17, 4.18 but is likely a longer standing issue

How reproducible:

rarely    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

apiserver should not return any errors

Additional info:

    

The following test is failing more than expected:

Undiagnosed panic detected in pod

See the sippy test details for additional context.

Observed in 4.18-e2e-vsphere-ovn-upi-serial/1861922894817267712

Undiagnosed panic detected in pod
{  pods/openshift-machine-config-operator_machine-config-daemon-4mzxf_machine-config-daemon_previous.log.gz:E1128 00:28:30.700325    4480 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<}

Description of problem:

    Pull support from upstream kubernetes (see KEP 4800: https://github.com/kubernetes/enhancements/issues/4800) for LLC alignment support in cpumanager

Version-Release number of selected component (if applicable):

    4.19

How reproducible:

    100%

Steps to Reproduce:

    1. try to schedule a pod which requires exclusive CPU allocation and whose CPUs should be affine to the same LLC block
    2. observe random and likely wrong (not LLC-aligned) allocation
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

DEBUG Creating ServiceAccount for control plane nodes 
DEBUG Service account created for XXXXX-gcp-r4ncs-m 
DEBUG Getting policy for openshift-dev-installer   
DEBUG adding roles/compute.instanceAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/compute.networkAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/compute.securityAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
DEBUG adding roles/storage.admin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member 
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: failed to add master roles: failed to set IAM policy, unexpected error: googleapi: Error 400: Service account XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com does not exist., badRequest

It appears that the Service account was created correctly. The roles are assigned to the service account. It is possible that there needs to be a "wait for action to complete" on the server side to ensure that this will all be ok.

Version-Release number of selected component (if applicable):

    

How reproducible:

Random. Appears to be a sync issue    

Steps to Reproduce:

    1. Run the installer for a normal GCP basic install
    2.
    3.
    

Actual results:

    Installer fails saying that the Service Account that the installer created does not have the permissions to perform an action. Sometimes it takes numerous tries for this to happen (very intermittent). 

Expected results:

    Successful install

Additional info:

    

Description of problem:

    When creating a kubevirt hosted cluster with the following apiserver publishing configuration

- service: APIServer
    servicePublishingStrategy:
      type: NodePort
      nodePort:
        address: my.hostna.me
        port: 305030

Shows following error:

"failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address"

And network policies are not properly deployed in the virtual machine namespaces.

Version-Release number of selected component (if applicable):

 4.17

How reproducible:

    Always

Steps to Reproduce:

    1. Create a kubevirt hosted cluster with APIServer NodePort publishing using a hostname
    2. Wait for hosted cluster creation.
    

Actual results:

Following error pops up and network policies are not created

"failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address"    

Expected results:

    No error pops up and network policies are created.

Additional info:

    This is where the error originates -> https://github.com/openshift/hypershift/blob/ef8596d4d69a53eb60838ae45ffce2bca0bfa3b2/hypershift-operator/controllers/hostedcluster/network_policies.go#L644

    That error prevents the network policies from being created.

Description of problem:
machine-approver logs

E0221 20:29:52.377443       1 controller.go:182] csr-dm7zr: Pending CSRs: 1871; Max pending allowed: 604. Difference between pending CSRs and machines > 100. Ignoring all CSRs as too many recent pending CSRs seen

.

oc get csr |wc -l
3818
oc get csr |grep "node-bootstrapper" |wc -l
2152

By approving the pending CSRs manually I can get the cluster to scale up.

We can increase the maxPending to a higher number https://github.com/openshift/cluster-machine-approver/blob/2d68698410d7e6239dafa6749cc454272508db19/pkg/controller/controller.go#L330 

 

Description of problem:

"Cannot read properties of undefined (reading 'state')" Error in search tool when filtering Subscriptions while adding new Subscriptions

Version-Release number of selected component (if applicable):

    

How reproducible:

    Always

Steps to Reproduce:

    1. As an Administrator, go to Home -> Search and filter by Subscription component
    2. Start creating subscriptions (bulk)
 
    

Actual results:

    The filtered results turn into the "Oh no! Something went wrong" view

Expected results:

    Get updated results every few seconds

Additional info:

If the view is reloaded, the error goes away.    

 

Stack Trace:

TypeError: Cannot read properties of undefined (reading 'state')
    at L (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/subscriptions-chunk-89fe3c19814d1f6cdc84.min.js:1:3915)
    at na (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:58879)
    at Hs (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:111315)
    at Sc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98327)
    at Cc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98255)
    at _c (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98118)
    at pc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:95105)
    at https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44774
    at t.unstable_runWithPriority (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:289:3768)
    at Uo (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44551) 

 

Description of problem:

    4.18 HyperShift operator's NodePool controller fails to serialize NodePool ConfigMaps that contain ImageDigestMirrorSet. Inspecting the code, it fails on NTO reconciliation logic, where only machineconfiguration API schemas are loaded into the YAML serializer: https://github.com/openshift/hypershift/blob/f7ba5a14e5d0cf658cf83a13a10917bee1168011/hypershift-operator/controllers/nodepool/nto.go#L415-L421

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    100%

Steps to Reproduce:

    1. Install 4.18 HyperShift operator
    2. Create NodePool with configuration ConfigMap that includes ImageDigestMirrorSet
    3. HyperShift operator fails to reconcile NodePool

Actual results:

    HyperShift operator fails to reconcile NodePool

Expected results:

    HyperShift operator to successfully reconcile NodePool

Additional info:

    Regression introduced by https://github.com/openshift/hypershift/pull/4717
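
For step 2, a minimal reproduction sketch of such a configuration ConfigMap (names, namespace, and registry hosts are illustrative; the ConfigMap is assumed to be referenced from the NodePool's spec.config):

~~~yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-idms
  namespace: clusters
data:
  config: |
    apiVersion: config.openshift.io/v1
    kind: ImageDigestMirrorSet
    metadata:
      name: example
    spec:
      imageDigestMirrors:
      - source: registry.example.com/team/app
        mirrors:
        - mirror.example.com/team/app
~~~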

Description of problem:

    Currently check-patternfly-modules.sh checks the PatternFly module versions serially, which could be improved by checking them in parallel. 

Since `yarn why` does not write to anything, this should be easily parallelizable as there is no race condition with writing back to the yarn.lock

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

Missing metrics - example: cluster_autoscaler_failed_scale_ups_total 

Version-Release number of selected component (if applicable):

    

How reproducible:

Always 

Steps to Reproduce:

#curl the autoscalers metrics endpoint: 

$ oc exec deployment/cluster-autoscaler-default -- curl -s http://localhost:8085/metrics | grep cluster_autoscaler_failed_scale_ups_total 
    

Actual results:

the metric does not return a value until an event has happened   

Expected results:

The metric counter should be initialized at startup, providing a zero value

Additional info:

I have been through the file: 

https://raw.githubusercontent.com/openshift/kubernetes-autoscaler/master/cluster-autoscaler/metrics/metrics.go 

and checked off the metrics that do not appear when scraping the metrics endpoint straight after deployment. 

the following metrics are in metrics.go but are missing from the scrape

~~~
node_group_min_count
node_group_max_count
pending_node_deletions
errors_total
scaled_up_gpu_nodes_total
failed_scale_ups_total
failed_gpu_scale_ups_total
scaled_down_nodes_total
scaled_down_gpu_nodes_total
unremovable_nodes_count 
skipped_scale_events_count
~~~

 

Description of problem:

CVO manifests contain some feature-gated ones:

  • since at least 4.16, there are feature-gated ClusterVersion CRDs
  • UpdateStatus API feature is delivered through DevPreview (now) and TechPreview (later) feature set

We observed HyperShift CI jobs to fail when adding DevPreview-gated deployment manifests to CVO, which was unexpected. Investigating further, we discovered that HyperShift applies them:

error: error parsing /var/payload/manifests/0000_00_update-status-controller_03_deployment-DevPreviewNoUpgrade.yaml: error converting YAML to JSON: yaml: invalid map key: map[interface {}]interface {}{".ReleaseImage":interface {}(nil)}

But even without these added manifests, this happens for existing ClusterVersion CRD manifests present in the payload:

$ ls -1 manifests/*clusterversions*crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml
manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml

In a passing HyperShift CI job, the same log shows that all four manifests are applied instead of just one:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured

Version-Release number of selected component (if applicable):

4.18

How reproducible:

Always

Steps to Reproduce:

1. inspect the cluster-version-operator-*-bootstrap.log of a HyperShift CI job

Actual results:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured

= all four ClusterVersion CRD manifests are applied

Expected results:

customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created

= ClusterVersion CRD manifest is applied just once

Additional info

I'm filing this card so that I can link it to the "easy" fix https://github.com/openshift/hypershift/pull/5093 which is not the perfect fix, but allows us to add featureset-gated manifests to CVO without breaking HyperShift. It is desirable to improve this even further and actually correctly select the manifests to be applied for CVO bootstrap, but that involves non-trivial logic similar to one used by CVO and it seems to be better approached as a feature to be properly assessed and implemented, rather than a bugfix, so I'll file a separate HOSTEDCP card for that.

Description of problem:

    Some permissions are missing when edge zones are specified in the install-config.yaml, probably those related to Carrier Gateways (but maybe more)

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always with minimal permissions

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    time="2024-11-20T22:40:58Z" level=debug msg="\tfailed to describe carrier gateways in vpc \"vpc-0bdb2ab5d111dfe52\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-girt7h2j-4515a-minimal-perm is not authorized to perform: ec2:DescribeCarrierGateways because no identity-based policy allows the ec2:DescribeCarrierGateways action"

Expected results:

    All required permissions are listed in pkg/asset/installconfig/aws/permissions.go

Additional info:

    See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9222/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1859351015715770368 for a failed min-perms install

Description of problem:

    When using PublicIPv4Pool, CAPA will try to allocate IP address in the supplied pool which requires the `ec2:AllocateAddress` permission

Version-Release number of selected component (if applicable):

    4.16+

How reproducible:

    always

Steps to Reproduce:

    1. Minimal permissions and publicIpv4Pool set
    2.
    3.
    

Actual results:

    time="2024-11-21T05:39:49Z" level=debug msg="E1121 05:39:49.352606     327 awscluster_controller.go:279] \"failed to reconcile load balancer\" err=<"
time="2024-11-21T05:39:49Z" level=debug msg="\tfailed to allocate addresses to load balancer: failed to allocate address from Public IPv4 Pool \"ipv4pool-ec2-0768267342e327ea9\" to role lb-apiserver: failed to allocate Elastic IP for \"lb-apiserver\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-2cr41ill-663fd-minimal-perm is not authorized to perform: ec2:AllocateAddress on resource: arn:aws:ec2:us-east-1:460538899914:ipv4pool-ec2/ipv4pool-ec2-0768267342e327ea9 because no identity-based policy allows the ec2:AllocateAddress action. Encoded authorization failure message: Iy1gCtvfPxZ2uqo-SHei1yJQvNwaOBl5F_8BnfeEYCLMczeDJDdS4fZ_AesPLdEQgK7ahuOffqIr--PWphjOUbL2BXKZSBFhn3iN9tZrDCnQQPKZxf9WaQmSkoGNWKNUGn6rvEZS5KvlHV5vf5mCz5Bk2lk3w-O6bfHK0q_dphLpJjU-sTGvB6bWAinukxSYZ3xbirOzxfkRfCFdr7nDfX8G4uD4ncA7_D-XriDvaIyvevWSnus5AI5RIlrCuFGsr1_3yEvrC_AsLENZHyE13fA83F5-Abpm6-jwKQ5vvK1WuD3sqpT5gfTxccEqkqqZycQl6nsxSDP2vDqFyFGKLAmPne8RBRbEV-TOdDJphaJtesf6mMPtyMquBKI769GW9zTYE7nQzSYUoiBOafxz6K1FiYFoc1y6v6YoosxT8bcSFT3gWZWNh2upRJtagRI_9IRyj7MpyiXJfcqQXZzXkAfqV4nsJP8wRXS2vWvtjOm0i7C82P0ys3RVkQVcSByTW6yFyxh8Scoy0HA4hTYKFrCAWA1N0SROJsS1sbfctpykdCntmp9M_gd7YkSN882Fy5FanA"
time="2024-11-21T05:39:49Z" level=debug msg="\t\tstatus code: 403, request id: 27752e3c-596e-43f7-8044-72246dbca486"

Expected results:

    

Additional info:

Seems to happen consistently with shared-vpc-edge-zones CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-edge-zones/1860015198224519168    
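
For context, a minimal sketch of the install-config fragment that exercises this path (the pool ID is illustrative):

~~~yaml
platform:
  aws:
    region: us-east-1
    publicIpv4Pool: ipv4pool-ec2-0123456789abcdef0   # illustrative BYO public IPv4 pool ID
~~~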

Description of problem:

The LB name should be yunjiang-ap55-sk6jl-ext-a6aae262b13b0580, rather than ending with ELB service endpoint (elb.ap-southeast-5.amazonaws.com):

	failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed provisioning resources after infrastructure ready: failed to find HostedZone ID for NLB: failed to list load balancers: ValidationError: The load balancer name 'yunjiang-ap55-sk6jl-ext-a6aae262b13b0580.elb.ap-southeast-5.amazonaws.com' cannot be longer than '32' characters\n\tstatus code: 400, request id: f8adce67-d844-4088-9289-4950ce4d0c83

Checking the tag value, the value of Name key is correct: yunjiang-ap55-sk6jl-ext


    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-30-141716
    

How reproducible:

always
    

Steps to Reproduce:

    1. Deploy a cluster on ap-southeast-5
    2.
    3.
    

Actual results:

The LB can not be created
    

Expected results:

Create a cluster successfully.
    

Additional info:

No such issues on other AWS regions.
    

Description of problem:

oc adm node-image create --pxe does not generate only PXE artifacts, but copies everything from the node-joiner pod. Also, the names of the PXE artifacts are not correct (prefixed with agent instead of node).

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

    1. oc adm node-image create --pxe

Actual results:

    All the files from the node-joiner pod are copied. The PXE artifact names are wrong.

Expected results:

    In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img
* node.x86_64-rootfs.img
* node.x86_64-vmlinuz

Additional info:

    

Description of problem:

    As more systems have been added to Power VS, the assumption that every zone in a region has the same set of systypes has been broken. To properly represent what system types are available, the powervs_regions struct needed to be altered and parts of the installer referencing it needed to be updated.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. Try to deploy with s1022 in dal10
    2. SysType not available, even though it is a valid option in Power VS.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When running the delete command on oc-mirror after a mirrorToMirror, the graph-image is not being deleted.
    

Version-Release number of selected component (if applicable):

    

How reproducible:
With the following ImageSetConfiguration (use the same for the DeleteImageSetConfiguration only changing the kind and the mirror to delete)

kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.13
      minVersion: 4.13.10
      maxVersion: 4.13.10
    graph: true
    

Steps to Reproduce:

    1. Run mirror to mirror
./bin/oc-mirror -c ./alex-tests/alex-isc/isc.yaml --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 docker://localhost:6000 --v2 --dest-tls-verify=false

    2. Run the delete --generate
./bin/oc-mirror delete -c ./alex-tests/alex-isc/isc-delete.yaml --generate --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 --delete-id clid-230-delete-test docker://localhost:6000 --v2 --dest-tls-verify=false

    3. Run the delete
./bin/oc-mirror delete --delete-yaml-file /home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230/working-dir/delete/delete-images-clid-230-delete-test.yaml docker://localhost:6000 --v2 --dest-tls-verify=false
    

Actual results:

During the delete --generate the graph-image is not being included in the delete file 

2024/10/25 09:44:21  [WARN]   : unable to find graph image in local cache: SKIPPING. %!v(MISSING)
2024/10/25 09:44:21  [WARN]   : reading manifest latest in localhost:55000/openshift/graph-image: manifest unknown

Because of that the graph-image is not being deleted from the target registry

[aguidi@fedora oc-mirror]$ curl http://localhost:6000/v2/openshift/graph-image/tags/list | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    51  100    51    0     0  15577      0 --:--:-- --:--:-- --:--:-- 17000
{
  "name": "openshift/graph-image",
  "tags": [
    "latest"
  ]
}
    

Expected results:

graph-image should be deleted even after mirrorToMirror
    

Additional info:

    

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/5200/pull-ci-openshift-hypershift-main-e2e-openstack/1862228917390151680

{Failed  === RUN   TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations
    util.go:1232: 
        the pod  openstack-manila-csi-controllerplugin-676cc65ffc-tnnkb is not in the audited list for safe-eviction and should not contain the safe-to-evict-local-volume annotation
        Expected
            <string>: socket-dir
        to be empty
        --- FAIL: TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations (0.02s)
} 

Description of problem:

We have an OKD 4.12 cluster which has persistent and increasing ingresswithoutclassname alerts with no ingresses normally present in the cluster. I believe the ingresswithoutclassname being counted is created as part of the ACME validation process managed by the cert-manager operator with its openshift route addon, which is torn down once the ACME validation is complete.

Version-Release number of selected component (if applicable):

 4.12.0-0.okd-2023-04-16-041331

How reproducible:

seems very consistent. went away during an update but came back shortly after and continues to increase.

Steps to Reproduce:

1. create ingress w/o classname
2. see counter increase
3. delete classless ingress
4. counter does not decrease.

Additional info:

https://github.com/openshift/cluster-ingress-operator/issues/912

Description of problem:

Observed in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn/1866088107347021824/artifacts/e2e-gcp-ovn/ipi-install-install/artifacts/.openshift_install-1733747884.log

Distinct issues occurring in this job caused the "etcd bootstrap member to be removed from cluster" gate to take longer than its 5 minute timeout, but there was plenty of time left to complete bootstrapping successfully. It doesn't make sense to have a narrow timeout here because progress toward removal of the etcd bootstrap member begins the moment the etcd cluster starts for the first time, not when the installer starts waiting to observe it.

Version-Release number of selected component (if applicable):

4.19.0

How reproducible:

Sometimes

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

 

Incorrect capitalization: `Lightspeed` appears as `LightSpeed` in the ja and zh translations.

 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

This is part of the plan to improve stability of ipsec in ocp releases.

There are several regressions identified in libreswan-4.9 (default in 4.14.z and 4.15.z) which need to be addressed in an incremental approach. The first step is to introduce libreswan-4.6-3.el9_0.3, which is the oldest major version (4.6) that can still be released in rhel9. It includes a libreswan crash fix and some CVE backports that are present in libreswan-4.9 but not in libreswan-4.5 (so that it can pass the internal CVE scanner check).

This pinning of libreswan-4.6-3.el9_0.3 is only needed for 4.14.z since containerized ipsec is used in 4.14. Starting 4.15, ipsec is moved to host and this CNO PR (about to merge as of writing) will allow ovnk to use host ipsec execs which only requires libreswan pkg update in rhcos extension.

 

Description of problem:

bump ovs version to openvswitch3.4-3.4.0-18.el9fdp for ocp 4.19 to include the ovs-monitor-ipsec improvement https://issues.redhat.com/browse/FDP-846

Description of problem:

   This bug is filed as a result of https://access.redhat.com/support/cases/#/case/03977446
Although both node topologies are equivalent, PPC reported a false negative:

  Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

Version-Release number of selected component (if applicable):

    4.14

How reproducible:

    always

Steps to Reproduce:

    1.TBD
    2.
    3.
    

Actual results:

    Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]

Expected results:

    The topologies match; the PPC should work fine

Additional info:

    

Description of problem:

`sts:AssumeRole` is required when creating a Shared-VPC cluster [1]; otherwise the following error occurs:

 level=fatal msg=failed to fetch Cluster Infrastructure Variables: failed to fetch dependency of "Cluster Infrastructure Variables": failed to generate asset "Platform Provisioning Check": aws.hostedZone: Invalid value: "Z01991651G3UXC4ZFDNDU": unable to retrieve hosted zone: could not get hosted zone: Z01991651G3UXC4ZFDNDU: AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-1c2w7jv2-ef4fe-minimal-perm-installer is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::641733028092:role/ci-op-1c2w7jv2-ef4fe-shared-role
level=fatal msg=	status code: 403, request id: ab7160fa-ade9-4afe-aacd-782495dc9978
Installer exit with code 1

[1]https://docs.openshift.com/container-platform/4.17/installing/installing_aws/installing-aws-account.html

    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-03-174639
    

How reproducible:

Always
    

Steps to Reproduce:

1. Create install-config for Shared-VPC cluster
2. Run openshift-install create permissions-policy
3. Create cluster by using the above installer-required policy.

    

Actual results:

See description
    

Expected results:

sts:AssumeRole is included in the policy file when Shared VPC is configured.
    

Additional info:

The configuration of Shared-VPC is like:
platform:
  aws:
    hostedZone:
    hostedZoneRole:

    

Description of problem:

    In an effort to ensure all HA components are not degraded by design during normal e2e test or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run.

This card captures machine-config operator that blips Degraded=True during upgrade runs.


Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-azure-ovn-upgrade/1843023092004163584   

Reasons associated with the blip: RenderConfigFailed   

For now, we put an exception in the test. But it is expected that teams take action to fix those and remove the exceptions after the fix goes in.

Exceptions are defined here: 


See linked issue for more explanation on the effort.

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

OSD-26887: managed services taints several nodes as infrastructure. This taint appears to be applied after some of the platform DS are scheduled there, causing this alert to fire.  Managed services rebalances the DS after the taint is added, and the alert clears, but origin fails this test. Allowing this alert to fire while we investigate why the taint is not added at node birth.

Description of problem:

Missing translations for "PodDisruptionBudget violated" string

Code:

"count PodDisruptionBudget violated_one": "count PodDisruptionBudget violated_one", "count PodDisruptionBudget violated_other": "count PodDisruptionBudget violated_other",
   

Code: 

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

HyperShift CEL validation blocks ARM64 NodePool creation for non-AWS/Azure platforms
Can't add a Bare Metal worker node to the hosted cluster. 
This was discussed on #project-hypershift Slack channel.

Version-Release number of selected component (if applicable):

MultiClusterEngine v2.7.2 
HyperShift Operator image: 
registry.redhat.io/multicluster-engine/hypershift-rhel9-operator@sha256:56bd0210fa2a6b9494697dc7e2322952cd3d1500abc9f1f0bbf49964005a7c3a   

How reproducible:

Always

Steps to Reproduce:

1. Create a HyperShift HostedCluster on a non-AWS/non-Azure platform
2. Try to create a NodePool with ARM64 architecture specification

Actual results:

- CEL validation blocks creating NodePool with arch: arm64 on non-AWS/Azure platforms
- Receive error: "The NodePool is invalid: spec: Invalid value: "object": Setting Arch to arm64 is only supported for AWS and Azure"
- Additional validation in NodePool spec also blocks arm64 architecture

Expected results:

- Allow ARM64 architecture specification for NodePools on BareMetal platform 
- Remove or update the CEL validation to support this use case

Additional info:

NodePool YAML:
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: nodepool-doca5-1
  namespace: doca5
spec:
  arch: arm64
  clusterName: doca5
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: InPlace
  platform:
    agent:
      agentLabelSelector: {}
    type: Agent
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.21-multi
  replicas: 1    

Description of problem:

    ingress-to-route controller does not provide any information about failed conversions from ingress to route. This is a big issue in environments heavily dependent on the ingress objects. The only way to find why a route is not created is guess and try as the only information one can get is that the route is not created. 

Version-Release number of selected component (if applicable):

    OCP 4.14

How reproducible:

    100%

Steps to Reproduce:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    route.openshift.io/termination: passthrough
  name: hello-openshift-class
  namespace: test
spec:
  ingressClassName: openshift-default
  rules:
  - host: ingress01-rhodain-test01.apps.rhodain03.sbr-virt.gsslab.brq2.redhat.com
    http:
      paths:
      - backend:
          service:
            name: myapp02
            port:
              number: 8080
        path: /
        pathType: Prefix
  tls:
  - {}  

Actual results:

    Route is not created and no error is logged

Expected results:

    An error is provided in the events or at least in the controller's logs. The events are preferred, as the Ingress objects are mainly created by users without cluster-admin privileges.

Additional info:

    

 

It looks like OLMv1 doesn't handle proxies correctly, aws-ovn-proxy job is permafailing https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-proxy/1861444783696777216

I suspect it's on the OLM operator side, are you looking at the cluster-wide proxy object and wiring it into your containers if set?
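
For illustration, "wiring it into your containers" usually means projecting the cluster-wide Proxy settings as environment variables on the operator's workloads; a minimal sketch of such a container fragment (image and proxy values are illustrative, not the actual OLM deployment):

~~~yaml
containers:
- name: operator
  image: example.com/operator:latest   # illustrative
  env:
  - name: HTTP_PROXY
    value: http://proxy.example.com:3128
  - name: HTTPS_PROXY
    value: http://proxy.example.com:3128
  - name: NO_PROXY
    value: .cluster.local,.svc,10.0.0.0/16
~~~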

Description of problem:

The HorizontalNav component of @openshift-console/dynamic-plugin-sdk does not have the customData prop, which is available in the console repo. 

This prop is needed to pass values between tabs on a details page

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    


Description of problem:

Applying a PerformanceProfile with an invalid cpuset in one of the reserved/isolated/shared/offlined cpu fields causes the webhook validation to panic instead of returning an informative error.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-22-231049

How reproducible:

Apply a PerformanceProfile with invalid cpu values

Steps to Reproduce:

Apply the following PerformanceProfile with invalid cpu values:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: pp
spec:
  cpu:
    isolated: 'garbage'
    reserved: 0-3
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker-cnf: ""
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""     

Actual results:

On OCP >= 4.18 the error is:
Error from server: error when creating "pp.yaml": admission webhook "vwb.performance.openshift.io" denied the request: panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  
On OCP <= 4.17 the behavior is:
The validation webhook passes without any errors. The invalid configuration propagates to the cluster and breaks it.

Expected results:

We expect an informative error to be returned when an invalid cpuset is entered, without panicking or accepting it.
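
For comparison, a well-formed cpuset is a comma-separated list of CPU IDs and ranges; a minimal sketch of valid cpu fields (the exact IDs are illustrative):

~~~yaml
spec:
  cpu:
    isolated: "4-47"    # illustrative; e.g. "4-23,28-47" is also valid cpuset syntax
    reserved: "0-3"
~~~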

The "oc adm pod-network" command for working with openshift-sdn multitenant mode is now totally useless in OCP 4.17 and newer clusters (since it's only useful with openshift-sdn, and openshift-sdn no longer exists as of OCP 4.17). Of course, people might use a new oc binary to talk to an older cluster, but probably the built-in documentation should make it clearer that this is not a command that will be useful to 99% of users.

If it's possible to make "pod-network" not show up as a subcommand in "oc adm -h" that would probably be good. If not, it should probably have a description like "Manage OpenShift-SDN Multitenant mode networking [DEPRECATED]", and likewise, the longer descriptions of the pod-network subcommands should talk about "OpenShift-SDN Multitenant mode" rather than "the redhat/openshift-ovs-multitenant network plugin" (which is OCP 3 terminology), and maybe should explicitly say something like "this has no effect when using the default OpenShift Networking plugin (OVN-Kubernetes)".

Description of problem:

The release signature configmap file is invalid with no name defined

Version-Release number of selected component (if applicable):

oc-mirror version 
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410011141.p0.g227a9c4.assembly.stream.el9-227a9c4", GitCommit:"227a9c499b6fd94e189a71776c83057149ee06c2", GitTreeState:"clean", BuildDate:"2024-10-01T20:07:43Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:

100%

Steps to Reproduce:

1) with isc :
cat /test/yinzhou/config.yaml 
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.16
2) do mirror2disk + disk2mirror 
3) use the signature configmap  to create resource 

Actual results:

3) failed to create resource with error: 
oc create -f signature-configmap.json 
The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required

oc create -f signature-configmap.yaml 
The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required  

 

 

Expected results:

No error 
 

 

Description of problem:

    If the serverless function is not running, clicking the Test Serverless function button does nothing.

Version-Release number of selected component (if applicable):

    4.15

How reproducible:

    Always

Steps to Reproduce:

    1.Install serverless operator
    2.Create serverless function and make sure the status is false
    3.Click on Test Serverless function
    

Actual results:

    No response

Expected results:

    Maybe show an alert, or hide that option when the function is not ready.

Additional info:

    

Description of problem:

A node was created today with the worker label. It was then labeled as a loadbalancer to match the MCP selector. The MCP saw the selector and moved to Updating, but the machine-config-daemon pod isn't responding. We tried deleting the pod and it still didn't pick up that it needed a new config. Manually editing the desired config (see the sketch after the node annotations below) appears to work around the issue, but shouldn't be necessary.

Node created today:

[dasmall@supportshell-1 03803880]$ oc get nodes worker-048.kub3.sttlwazu.vzwops.com -o yaml | yq .metadata.creationTimestamp
'2024-04-30T17:17:56Z'

Node has worker and loadbalancer roles:

[dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com
NAME                                  STATUS   ROLES                 AGE   VERSION
worker-048.kub3.sttlwazu.vzwops.com   Ready    loadbalancer,worker   1h    v1.25.14+a52e8df


MCP shows a loadbalancer needing Update and 0 nodes in worker pool:

[dasmall@supportshell-1 03803880]$ oc get mcp
NAME           CONFIG                                                   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
loadbalancer   rendered-loadbalancer-1486d925cac5a9366d6345552af26c89   False     True       False      4              3                   3                     0                      87d
master         rendered-master-47f6fa5afe8ce8f156d80a104f8bacae         True      False      False      3              3                   3                     0                      87d
worker         rendered-worker-a6be9fb3f667b76a611ce51811434cf9         True      False      False      0              0                   0                     0                      87d
workerperf     rendered-workerperf-477d3621fe19f1f980d1557a02276b4e     True      False      False      38             38                  38                    0                      87d


Status shows mcp updating:

[dasmall@supportshell-1 03803880]$ oc get mcp loadbalancer -o yaml | yq .status.conditions[4]
lastTransitionTime: '2024-04-30T17:33:21Z'
message: All nodes are updating to rendered-loadbalancer-1486d925cac5a9366d6345552af26c89
reason: ''
status: 'True'
type: Updating


Node still appears happy with worker MC:

[dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com -o yaml | grep rendered-
    machineconfiguration.openshift.io/currentConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9
    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9
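
For context, the manual workaround mentioned above amounts to pointing this annotation on the Node object at the loadbalancer rendered config (a sketch using the names from the outputs above):

# Node metadata fragment; normally the machine-config controller sets this itself
metadata:
  annotations:
    machineconfiguration.openshift.io/desiredConfig: rendered-loadbalancer-1486d925cac5a9366d6345552af26c89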


machine-config-daemon pod appears idle:

[dasmall@supportshell-1 03803880]$ oc logs -n openshift-machine-config-operator machine-config-daemon-wx2b8 -c machine-config-daemon
2024-04-30T17:48:29.868191425Z I0430 17:48:29.868156   19112 start.go:112] Version: v4.12.0-202311220908.p0.gef25c81.assembly.stream-dirty (ef25c81205a65d5361cfc464e16fd5d47c0c6f17)
2024-04-30T17:48:29.871340319Z I0430 17:48:29.871328   19112 start.go:125] Calling chroot("/rootfs")
2024-04-30T17:48:29.871602466Z I0430 17:48:29.871593   19112 update.go:2110] Running: systemctl daemon-reload
2024-04-30T17:48:30.066554346Z I0430 17:48:30.066006   19112 rpm-ostree.go:85] Enabled workaround for bug 2111817
2024-04-30T17:48:30.297743470Z I0430 17:48:30.297706   19112 daemon.go:241] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 (412.86.202311271639-0) 828584d351fcb58e4d799cebf271094d5d9b5c1a515d491ee5607b1dcf6ebf6b
2024-04-30T17:48:30.324852197Z I0430 17:48:30.324543   19112 start.go:101] Copied self to /run/bin/machine-config-daemon on host
2024-04-30T17:48:30.325677959Z I0430 17:48:30.325666   19112 start.go:188] overriding kubernetes api to https://api-int.kub3.sttlwazu.vzwops.com:6443
2024-04-30T17:48:30.326381479Z I0430 17:48:30.326368   19112 metrics.go:106] Registering Prometheus metrics
2024-04-30T17:48:30.326447815Z I0430 17:48:30.326440   19112 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
2024-04-30T17:48:30.327835814Z I0430 17:48:30.327811   19112 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig
2024-04-30T17:48:30.327932144Z I0430 17:48:30.327923   19112 update.go:2125] Starting to manage node: worker-048.kub3.sttlwazu.vzwops.com
2024-04-30T17:48:30.332123862Z I0430 17:48:30.332097   19112 rpm-ostree.go:394] Running captured: rpm-ostree status
2024-04-30T17:48:30.332928272Z I0430 17:48:30.332909   19112 daemon.go:1049] Detected a new login session: New session 1 of user core.
2024-04-30T17:48:30.332935796Z I0430 17:48:30.332926   19112 daemon.go:1050] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
2024-04-30T17:48:30.368619942Z I0430 17:48:30.368598   19112 daemon.go:1298] State: idle
2024-04-30T17:48:30.368619942Z Deployments:
2024-04-30T17:48:30.368619942Z * ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                    Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                   Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z)
2024-04-30T17:48:30.368619942Z           LayeredPackages: kernel-devel kernel-headers
2024-04-30T17:48:30.368619942Z
2024-04-30T17:48:30.368619942Z   ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                    Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858
2024-04-30T17:48:30.368619942Z                   Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z)
2024-04-30T17:48:30.368619942Z           LayeredPackages: kernel-devel kernel-headers
2024-04-30T17:48:30.368907860Z I0430 17:48:30.368884   19112 coreos.go:54] CoreOS aleph version: mtime=2023-08-08 11:20:41.285 +0000 UTC build=412.86.202308081039-0 imgid=rhcos-412.86.202308081039-0-metal.x86_64.raw
2024-04-30T17:48:30.368932886Z I0430 17:48:30.368926   19112 coreos.go:71] Ignition provisioning: time=2024-04-30T17:03:44Z
2024-04-30T17:48:30.368938120Z I0430 17:48:30.368931   19112 rpm-ostree.go:394] Running captured: journalctl --list-boots
2024-04-30T17:48:30.372893750Z I0430 17:48:30.372884   19112 daemon.go:1307] journalctl --list-boots:
2024-04-30T17:48:30.372893750Z -2 847e119666d9498da2ae1bd89aa4c4d0 Tue 2024-04-30 17:03:13 UTC—Tue 2024-04-30 17:06:32 UTC
2024-04-30T17:48:30.372893750Z -1 9617b204b8b8412fb31438787f56a62f Tue 2024-04-30 17:09:06 UTC—Tue 2024-04-30 17:36:39 UTC
2024-04-30T17:48:30.372893750Z  0 3cbf6edcacde408b8979692c16e3d01b Tue 2024-04-30 17:39:20 UTC—Tue 2024-04-30 17:48:30 UTC
2024-04-30T17:48:30.372912686Z I0430 17:48:30.372891   19112 rpm-ostree.go:394] Running captured: systemctl list-units --state=failed --no-legend
2024-04-30T17:48:30.378069332Z I0430 17:48:30.378059   19112 daemon.go:1322] systemd service state: OK
2024-04-30T17:48:30.378069332Z I0430 17:48:30.378066   19112 daemon.go:987] Starting MachineConfigDaemon
2024-04-30T17:48:30.378121340Z I0430 17:48:30.378106   19112 daemon.go:994] Enabling Kubelet Healthz Monitor
2024-04-30T17:48:31.486786667Z I0430 17:48:31.486747   19112 daemon.go:457] Node worker-048.kub3.sttlwazu.vzwops.com is not labeled node-role.kubernetes.io/master
2024-04-30T17:48:31.491674986Z I0430 17:48:31.491594   19112 daemon.go:1243] Current+desired config: rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:31.491674986Z I0430 17:48:31.491603   19112 daemon.go:1253] state: Done
2024-04-30T17:48:31.495704843Z I0430 17:48:31.495617   19112 daemon.go:617] Detected a login session before the daemon took over on first boot
2024-04-30T17:48:31.495704843Z I0430 17:48:31.495624   19112 daemon.go:618] Applying annotation: machineconfiguration.openshift.io/ssh
2024-04-30T17:48:31.503165515Z I0430 17:48:31.503052   19112 update.go:2110] Running: rpm-ostree cleanup -r
2024-04-30T17:48:32.232728843Z Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1
2024-04-30T17:48:35.755815139Z Freed: 92.3 MB (pkgcache branches: 0)
2024-04-30T17:48:35.764568364Z I0430 17:48:35.764548   19112 daemon.go:1563] Validating against current config rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:36.120148982Z I0430 17:48:36.120119   19112 rpm-ostree.go:394] Running captured: rpm-ostree kargs
2024-04-30T17:48:36.179660790Z I0430 17:48:36.179631   19112 update.go:2125] Validated on-disk state
2024-04-30T17:48:36.182434142Z I0430 17:48:36.182406   19112 daemon.go:1646] In desired config rendered-worker-a6be9fb3f667b76a611ce51811434cf9
2024-04-30T17:48:36.196911084Z I0430 17:48:36.196879   19112 config_drift_monitor.go:246] Config Drift Monitor started

Version-Release number of selected component (if applicable):

    4.12.45

How reproducible:

    They can reproduce in multiple clusters

Actual results:

    Node stays with rendered-worker config

Expected results:

    The MachineConfigPool moving to Updating should prompt a change to the node's desired config, which the machine-config-daemon pod then updates the node to.

Additional info:

    Here is the latest must-gather where this issue is occurring:
https://attachments.access.redhat.com/hydra/rest/cases/03803880/attachments/3fd0cf52-a770-4525-aecd-3a437ea70c9b?usePresignedUrl=true

Description of problem:

    Destroying a private cluster doesn't delete the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-multi-2024-10-23-202329

How reproducible:

    Always

Steps to Reproduce:

1. pre-create vpc network/subnets/router and a bastion host
2. "create install-config", and then insert the network settings under platform.gcp, along with "publish: Internal" (see [1])
3. "create cluster" (use the above bastion host as http proxy)
4. "destroy cluster" (see [2])

Actual results:

    Although "destroy cluster" completes successfully, the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator are not deleted (see [3]), which leads to deleting the vpc network/subnets failure.

Expected results:

    The forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator should also be deleted during "destroy cluster".

Additional info:

FYI one history bug https://issues.redhat.com/browse/OCPBUGS-37683    

Managed services marks a couple of nodes as "infra" so user workloads don't get scheduled on them.  However, platform daemonsets like iptables-alerter should run there – and the typical toleration for that purpose should be:

 tolerations:
- operator: Exists

instead the toleration is

tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule" 

 

Examples from other platform DS:

 

$ for ns in openshift-cluster-csi-drivers openshift-cluster-node-tuning-operator openshift-dns openshift-image-registry openshift-machine-config-operator openshift-monitoring openshift-multus openshift-multus openshift-multus openshift-network-diagnostics openshift-network-operator openshift-ovn-kubernetes openshift-security; do echo "NS: $ns"; oc get ds -o json -n $ns|jq '.items.[0].spec.template.spec.tolerations'; done
NS: openshift-cluster-csi-drivers
[
  {
    "operator": "Exists"
  }
]
NS: openshift-cluster-node-tuning-operator
[
  {
    "operator": "Exists"
  }
]
NS: openshift-dns
[
  {
    "key": "node-role.kubernetes.io/master",
    "operator": "Exists"
  }
]
NS: openshift-image-registry
[
  {
    "operator": "Exists"
  }
]
NS: openshift-machine-config-operator
[
  {
    "operator": "Exists"
  }
]
NS: openshift-monitoring
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-multus
[
  {
    "operator": "Exists"
  }
]
NS: openshift-network-diagnostics
[
  {
    "operator": "Exists"
  }
]
NS: openshift-network-operator
[
  {
    "effect": "NoSchedule",
    "key": "node-role.kubernetes.io/master",
    "operator": "Exists"
  }
]
NS: openshift-ovn-kubernetes
[
  {
    "operator": "Exists"
  }
]
NS: openshift-security
[
  {
    "operator": "Exists"
  }
] 

The helper doesn't have all the namespaces in it, and we're getting some flakes in CI like this:

 
 
batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources
does not have a cpu request (rule: "batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources/request[cpu]")

Description of problem:

The machine-os-builder deployment manifest does not set the openshift.io/required-scc annotation, which appears to be required for the upgrade conformance suite to pass. The rest of the MCO components currently set this annotation, and we can probably use the same setting as the Machine Config Controller (which is restricted-v2). What I'm unsure of is whether this also needs to be set on the builder pods, and what the appropriate setting would be in that case.
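
A minimal sketch of the missing annotation on the machine-os-builder deployment's pod template, assuming the same restricted-v2 value used by the other MCO components:

# Deployment pod-template fragment (not the full manifest)
spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: restricted-v2   # assumed value, matching the Machine Config Controller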

Version-Release number of selected component (if applicable):

 

How reproducible:

This always occurs in the new CI jobs e2e-aws-ovn-upgrade-ocb-techpreview and e2e-aws-ovn-upgrade-ocb-conformance-suite-techpreview. Here are two examples from rehearsal failures:

Steps to Reproduce:

Run either of the aforementioned CI jobs.

Actual results:

Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation fails.

Expected results:

Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation should pass.

 

Additional info:

    

Description of problem:

Under some circumstances (not clear exactly which ones), the OVN databases of 2 nodes ended up having 2 src-ip static routes in ovn_cluster_router instead of one: one of them points to the correct IP of the rtoj-GR_${NODE_NAME} LRP and one points to a wrong IP on the join subnet (that IP is not used in any other LRP or LSP).

Both static routes are taken into consideration while routing traffic out from the cluster, so packets that use the right route are able to egress while the packets that use the wrong route are dropped.

Version-Release number of selected component (if applicable):

Reproduced in 4.14.20

How reproducible:

Observed at least once, on only 2 nodes of the cluster.

Steps to Reproduce:

(Not sure, it was just found after investigation of strange packet drop)

Actual results:

Wrong static route to some non-existent IP in the join subnet. Intermittent packet drop.

Expected results:

No wrong static routes. No packet drop.

Additional info:

This can be worked around by wiping the OVN databases of the impacted node.

Our unit test runtime is slow. It seems to run anywhere from ~16-20 minutes locally. On CI it can take at least 30 minutes to run. Investigate whether or not any changes can be made to improve the unit test runtime.

This issue tracks updating k8s and related openshift APIs to a recent version, to keep in line with other MAPI providers.

Description of problem:

The ConsolePlugin details page throws an error for some specific YAML.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-30-141716    

How reproducible:

Always    

Steps to Reproduce:

1. Create a ConsolePlugin with minimum required fields  apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: console-demo-plugin-two
spec:
  backend:
    type: Service
  displayName: OpenShift Console Demo Plugin

2. Visit consoleplugin details page at /k8s/cluster/console.openshift.io~v1~ConsolePlugin/console-demo-plugin

Actual results:

2. We will see an error page    

Expected results:

2. We should not show an error page, since the ConsolePlugin YAML has every required field even though the optional fields are not set.
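
For comparison, a ConsolePlugin with the optional backend service fields populated looks roughly like this (the service name, namespace, and port are illustrative):

apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: console-demo-plugin-two
spec:
  displayName: OpenShift Console Demo Plugin
  backend:
    type: Service
    service:
      name: console-demo-plugin-two    # illustrative
      namespace: console-demo-plugin   # illustrative
      port: 9001                       # illustrative
      basePath: /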

Additional info:

    

This is a clone of issue OCPBUGS-45859. The following is the description of the original issue:

The following test is failing more than expected:

Undiagnosed panic detected in pod

See the sippy test details for additional context.

Observed in 4.18-e2e-azure-ovn/1864410356567248896 as well as pull-ci-openshift-installer-master-e2e-azure-ovn/1864312373058211840

: Undiagnosed panic detected in pod
{  pods/openshift-cloud-controller-manager_azure-cloud-controller-manager-5788c6f7f9-n2mnh_cloud-controller-manager_previous.log.gz:E1204 22:27:54.558549       1 iface.go:262] "Observed a panic" panic="interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.EndpointSlice" panicGoValue="&runtime.TypeAssertionError{_interface:(*abi.Type)(0x291daa0), concrete:(*abi.Type)(0x2b73880), asserted:(*abi.Type)(0x2f5cc20), missingMethod:\"\"}" stacktrace=<}

Description of problem:

Previously, failed task runs did not emit results; now they do, but the UI still shows "No TaskRun results available due to failure" even though the task run's status contains a result.
    

Version-Release number of selected component (if applicable):

4.14.3
    

How reproducible:

Always with a task run producing a result but failing afterwards
    

Steps to Reproduce:

    1. Create the pipelinerun below
    2. have a look on its task run
    
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: hello-pipeline
spec:
  tasks:
  - name: hello
    taskSpec:
      results:
      - name: greeting1
      steps:
      - name: greet
        image: registry.access.redhat.com/ubi8/ubi-minimal
        script: |
          #!/usr/bin/env bash
          set -e
          echo -n "Hello World!" | tee $(results.greeting1.path)
          exit 1
  results:
  - name: greeting2
    value: $(tasks.hello.results.greeting1)
    

Actual results:

No results in UI
    

Expected results:

One result should be displayed even though task run failed
    

Additional info:

Pipelines 1.13.0
    

Description of problem:

    While upgrading the Fusion operator, the IBM team is facing the following error in the operator's subscription:
error validating existing CRs against new CRD's schema for "fusionserviceinstances.service.isf.ibm.com": error validating service.isf.ibm.com/v1, Kind=FusionServiceInstance "ibm-spectrum-fusion-ns/odfmanager": updated validation is too restrictive: [].status.triggerCatSrcCreateStartTime: Invalid value: "number": status.triggerCatSrcCreateStartTime in body must be of type integer: "number"


The question here: "triggerCatSrcCreateStartTime" has been present in the operator for the past few releases and its datatype (integer) hasn't changed in the latest release either. There was one "FusionServiceInstance" CR present in the cluster when this issue was hit, and the value of its "triggerCatSrcCreateStartTime" field was "1726856593000774400".

Version-Release number of selected component (if applicable):

    It impacts upgrades between OCP 4.16.7 and OCP 4.16.14

How reproducible:

    Always

Steps to Reproduce:

    1.Upgrade the fusion operator ocp version 4.16.7 to ocp 4.16.14
    2.
    3.
    

Actual results:

    Upgrade fails with error in description

Expected results:

    The upgrade should not fail

Additional info:

    

The aks-e2e test keeps failing on the CreateClusterV2 test because the `ValidReleaseInfo` condition is not set. The patch that sets this status keeps failing. Investigate why & provide a fix.

Description of problem:

    

Version-Release number of selected component (if applicable):

    

How reproducible:

    every time

Steps to Reproduce:

    1. Create the dashboard with a bar chart and sort query result asc.
    2. 
    3.
    

Actual results:

 bar goes outside of the border

Expected results:

The bar should not go outside of the border.

Additional info:

    

screenshot: https://drive.google.com/file/d/1xPRgenpyCxvUuWcGiWzmw5kz51qKLHyI/view?usp=drive_link

Description of problem:

Trying to set up a disconnected HCP cluster with a self-managed image registry.

After the cluster was installed, all imagestreams failed to import images.
With error:
```
Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client
```

Imagestream imports go through the openshift-apiserver, which fetches the image metadata from the target registry.

After logging in to the pods in the HCP namespace, I figured out that I cannot access any external network over HTTPS.

Version-Release number of selected component (if applicable):

4.14.35    

How reproducible:

    always

Steps to Reproduce:

    1. Install the hypershift hosted cluster with above setup
    2. The cluster can be created successfully and all the pods on the cluster can be running with the expected images pulled
    3. Check the internal image-registry
    4. Check the openshift-apiserver pod from management cluster
    

Actual results:

All the imagestreams failed to sync from the remote registry.
$ oc describe is cli -n openshift
Name:            cli
Namespace:        openshift
Created:        6 days ago
Labels:            <none>
Annotations:        include.release.openshift.io/ibm-cloud-managed=true
            include.release.openshift.io/self-managed-high-availability=true
            openshift.io/image.dockerRepositoryCheck=2024-11-06T22:12:32Z
Image Repository:    image-registry.openshift-image-registry.svc:5000/openshift/cli
Image Lookup:        local=false
Unique Images:        0
Tags:            1latest
  updates automatically from registry quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d  ! error: Import failed (InternalError): Internal error occurred: [122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-1@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-2@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-3@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-4@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-5@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://quay.io/v2/": http: server gave HTTP response to HTTPS client]


Access the external network from the openshift-apiserver pod:
sh-5.1$ curl --connect-timeout 5 https://quay.io/v2
curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received
sh-5.1$ curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/
curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received

sh-5.1$ env | grep -i http.*proxy
HTTPS_PROXY=http://127.0.0.1:8090
HTTP_PROXY=http://127.0.0.1:8090

Expected results:

The openshift-apiserver should be able to talk to the remote https services.

Additional info:

It works after adding the registry to no_proxy:

sh-5.1$ NO_PROXY=122610517469.dkr.ecr.us-west-2.amazonaws.com curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/
Not Authorized
 

 

Description of problem:

The additional network is not correctly configured on the secondary interface of the masters and the workers.

With install-config.yaml with this section:

# This file is autogenerated by infrared openshift plugin                                                                                                                                                                                                                                                                    
apiVersion: v1                                                                                                                                                                                                                                                                                                               
baseDomain: "shiftstack.local"
compute:
- name: worker
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9']
      type: "worker"
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9']
      type: "master"
  replicas: 3
metadata:
  name: "ostest"
networking:
  clusterNetworks:
  - cidr: fd01::/48
    hostPrefix: 64
  serviceNetwork:
    - fd02::/112
  machineNetwork:
    - cidr: "fd2e:6f44:5dd8:c956::/64"
  networkType: "OVNKubernetes"
platform:
  openstack:
    cloud:            "shiftstack"
    region:           "regionOne"
    defaultMachinePlatform:
      type: "master"
    apiVIPs: ["fd2e:6f44:5dd8:c956::5"]
    ingressVIPs: ["fd2e:6f44:5dd8:c956::7"]
    controlPlanePort:
      fixedIPs:
        - subnet:
            name: "subnet-ssipv6"
pullSecret: |
  {"auths": {"installer-host.example.com:8443": {"auth": "ZHVtbXkxMjM6ZHVtbXkxMjM="}}}
sshKey: <hidden>
additionalTrustBundle: <hidden>
imageContentSources:
- mirrors:
  - installer-host.example.com:8443/registry
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - installer-host.example.com:8443/registry
  source: registry.ci.openshift.org/ocp/release

The installation works. However, the additional network is not configured on the masters or the workers, which leads in our case to faulty manila integration.

In the journal of all OCP nodes, logs like the one below from master-0 are observed repeatedly:

Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9667] device (enp4s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <warn>  [1731590504.9672] device (enp4s0): Activation: failed for connection 'Wired connection 1'
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9674] device (enp4s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): canceled DHCP transaction
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): activation: beginning transaction (timeout in 45 seconds)
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info>  [1731590504.9768] dhcp4 (enp4s0): state changed no lease

That server specifically has an interface connected to the "StorageNFSSubnet" subnet:

$ openstack server list | grep master-0
| da23da4a-4af8-4e54-ac60-88d6db2627b6 | ostest-kmmtt-master-0       | ACTIVE | StorageNFS=fd00:fd00:fd00:5000::fb:d8; network-ssipv6=fd2e:6f44:5dd8:c956::2e4            | ostest-kmmtt-rhcos                            | master    |

That subnet is defined in openstack as dhcpv6-stateful:

$ openstack subnet show StorageNFSSubnet
+----------------------+-------------------------------------------------------+
| Field                | Value                                                 |
+----------------------+-------------------------------------------------------+
| allocation_pools     | fd00:fd00:fd00:5000::fb:10-fd00:fd00:fd00:5000::fb:fe |
| cidr                 | fd00:fd00:fd00:5000::/64                              |
| created_at           | 2024-11-13T12:34:41Z                                  |
| description          |                                                       |
| dns_nameservers      |                                                       |
| dns_publish_fixed_ip | None                                                  |
| enable_dhcp          | True                                                  |
| gateway_ip           | None                                                  |
| host_routes          |                                                       |
| id                   | 480d7b2a-915f-4f0c-9717-90c55b48f912                  |
| ip_version           | 6                                                     |
| ipv6_address_mode    | dhcpv6-stateful                                       |
| ipv6_ra_mode         | dhcpv6-stateful                                       |
| name                 | StorageNFSSubnet                                      |
| network_id           | 26a751c3-c316-483c-91ed-615702bcbba9                  |
| prefix_length        | None                                                  |
| project_id           | 4566c393806c43b9b4e9455ebae1cbb6                      |
| revision_number      | 0                                                     |
| segment_id           | None                                                  |
| service_types        | None                                                  |
| subnetpool_id        | None                                                  |
| tags                 |                                                       |
| updated_at           | 2024-11-13T12:34:41Z                                  |
+----------------------+-------------------------------------------------------+

I also compared with an IPv4 installation, where the StorageNFSSubnet IP is successfully configured on enp4s0.

Version-Release number of selected component (if applicable):

  • 4.18.0-0.nightly-2024-11-12-201730,
  • RHOS-17.1-RHEL-9-20240701.n.1

How reproducible: Always
Additional info: must-gather and journal of the OCP nodes provided in private comment.

Description of problem:

The 'Plus' button in the 'Edit Pod Count' popup window overlaps the input field, which is incorrect.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-12-05-103644

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Workloads -> ReplicaSets page, choose one resource, click the kebab menu button, and choose 'Edit Pod count'
    2.
    3.
    

Actual results:

    The Layout is incorrect

Expected results:

    The 'Plus' button in the 'Edit Pod Count' popup window should not overlap the input field.

Additional info:

 Snapshot: https://drive.google.com/file/d/1mL7xeT7FzkdsM1TZlqGdgCP5BG6XA8uh/view?usp=drive_link
https://drive.google.com/file/d/1qmcal_4hypEPjmG6PTG11AJPwdgt65py/view?usp=drive_link

Description of problem:

The console shows 'View release notes' links in several places, but the current link only points to the main release notes of the Y-stream release.

Version-Release number of selected component (if applicable):

4.17.2    

How reproducible:

Always    

Steps to Reproduce:

1. set up 4.17.2 cluster
2. navigate to Cluster Settings page, check 'View release note' link in 'Update history' table 

Actual results:

The link only points users to the main release notes of the Y-stream release.

Expected results:

The link should point to the release notes of the specific version.
the correct link should be 
https://access.redhat.com/documentation/en-us/openshift_container_platform/${major}.${minor}/html/release_notes/ocp-${major}-${minor}-release-notes#ocp-${major}-${minor}-${patch}_release_notes   

Additional info:

    

Description of problem:

Sippy complains about pathological events in ns/openshift-cluster-csi-drivers in vsphere-ovn-serial jobs. See this job as one example.

Jan noticed that the DaemonSet generation is 10-12, while in 4.17 it is 2. Why is our operator updating the DaemonSet so often?

I wrote a quick "one-liner" to generate json diffs from the vmware-vsphere-csi-driver-operator logs:

prev=''; grep 'DaemonSet "openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node" changes' openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-operator-5b79c58f6f-hpr6g_vmware-vsphere-csi-driver-operator.log | sed 's/^.*changes: //' | while read -r line; do diff <(echo $prev | jq .) <(echo $line | jq .); prev=$line; echo "####"; done 

It really seems to be only operator.openshift.io/spec-hash and operator.openshift.io/dep-* fields changing in the json diffs:

####
4,5c4,5
<       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
<       "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
---
>       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
>       "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
13c13
<           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
---
>           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
####
4,5c4,5
<       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
<       "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
---
>       "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
>       "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
13c13
<           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
---
>           "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
#### 

The deployment is also changing in the same way. We need to find what is causing the spec-hash and dep-* fields to change and avoid the unnecessary churn that causes new daemonset / deployment rollouts.

 

Version-Release number of selected component (if applicable):

4.18.0

How reproducible:

~20% failure rate in 4.18 vsphere-ovn-serial jobs

Steps to Reproduce:

    

Actual results:

operator rolls out unnecessary daemonset / deployment changes

Expected results:

don't roll out changes unless there is a spec change

Additional info:

    

Please review the following PR: https://github.com/openshift/coredns/pull/130

The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.

Differences in upstream and downstream builds impact the fidelity of your CI signal.

If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.

Closing this issue without addressing the difference will cause the issue to
be reopened automatically.

Description of problem:

A customer is trying to install a self-managed OCP cluster in AWS. The customer uses an AWS VPC DHCP option set that has a trailing dot (.) at the end of the domain name. Due to this setting the master node hostnames also get a trailing dot, which causes the OpenShift installation to fail.

Version-Release number of selected component (if applicable):

    

How reproducible:

    always

Steps to Reproduce:

1. Create an AWS VPC with a DHCP option set whose domain name has a trailing dot.
2. Try an IPI cluster installation.

Actual results:

    The installation fails because the master node hostnames inherit the trailing dot from the DHCP option set domain name.

Expected results:

    The OpenShift installer should allow creating AWS master nodes whose domain name has a trailing dot (.), so the installation succeeds.

Additional info:

    

Description of problem:

Unit tests for openshift/builder are permanently failing for v4.18.
    

Version-Release number of selected component (if applicable):

4.18
    

How reproducible:

Always
    

Steps to Reproduce:

    1. Run PR against openshift/builder
    

Actual results:

Test fails: 
--- FAIL: TestUnqualifiedClone (0.20s)
    source_test.go:171: unable to add submodule: "Cloning into '/tmp/test-unqualified335202210/sub'...\nfatal: transport 'file' not allowed\nfatal: clone of 'file:///tmp/test-submodule643317239' into submodule path '/tmp/test-unqualified335202210/sub' failed\n"
    source_test.go:195: unable to find submodule dir
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
    

Expected results:

Tests pass
    

Additional info:

Example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_builder/401/pull-ci-openshift-builder-master-unit/1853816128913018880
    

Description of problem:

    We have recently enabled a few endpoint overrides, but ResourceManager was accidentally excluded.

Description of problem:

Installing a 4.17 agent-based hosted cluster on bare metal with an IPv6 stack in a disconnected environment. We cannot install the MetalLB operator on the hosted cluster to expose the openshift router and handle ingress, because the openshift-marketplace pods that extract the operator bundle (and the related pods) are in Error state. They try to execute the following command but cannot reach the cluster apiserver:

opm alpha bundle extract -m /bundle/ -n openshift-marketplace -c b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1 -z

INFO[0000] Using in-cluster kube client config          
Error: error loading manifests from directory: Get "https://[fd02::1]:443/api/v1/namespaces/openshift-marketplace/configmaps/b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1": dial tcp [fd02::1]:443: connect: connection refused



In our hosted cluster fd02::1 is the clusterIP of the kubernetes service and the endpoint associated with the service is [fd00::1]:6443. By debugging the pods we see that the connection to the clusterIP is refused, but if we try to connect to its endpoint the connection is established and we get 403 Forbidden:

sh-5.1$ curl -k https://[fd02::1]:443
curl: (7) Failed to connect to fd02::1 port 443: Connection refused


sh-5.1$ curl -k https://[fd00::1]:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403

This issue also happens in other pods in the hosted cluster that are in Error or CrashLoopBackOff; we see a similar error in their logs, e.g.:

F1011 09:11:54.129077       1 cmd.go:162] failed checking apiserver connectivity: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-service-ca-operator/leases/service-ca-operator-lock": dial tcp [fd02::1]:443: connect: connection refused


An IPv6 disconnected 4.16 hosted cluster with the same configuration was installed successfully and didn't show this issue, and neither did an IPv4 disconnected 4.17 cluster. So the issue is with the IPv6 stack only.

Version-Release number of selected component (if applicable):

Hub cluster: 4.17.0-0.nightly-2024-10-10-004834

MCE 2.7.0-DOWNANDBACK-2024-09-27-14-52-56

Hosted cluster: version 4.17.1
image: registry.ci.openshift.org/ocp/release@sha256:e16ac60ac6971e5b6f89c1d818f5ae711c0d63ad6a6a26ffe795c738e8cc4dde

How reproducible:

100%

Steps to Reproduce:

    1. Install MCE 2.7 on 4.17 IPv6 disconnected BM hub cluster
    2. Install 4.17 agent-based hosted cluster and scale up the nodepool 
    3. After worker nodes are installed, attempt to install the MetalLB operator to handle ingress
    

Actual results:

MetalLB operator cannot be installed because pods cannot connect to the cluster apiserver.

Expected results:

Pods in the cluster can connect to apiserver. 

Additional info:

 

 

Description of problem:

    During the EUS to EUS upgrade of an MNO cluster from 4.14.16 to 4.16.11 on baremetal, we have seen that, depending on the custom configuration (like a performance profile or container runtime config), one or more control plane nodes are rebooted multiple times.

This seems to be a race condition. When the first rendered MachineConfig is generated, the first control plane node starts rebooting (maxUnavailable is set to 1 on the master MCP), and at that moment a new MachineConfig render is generated, which means a second reboot. Once this first node has rebooted the second time, the rest of the control plane nodes are rebooted just once, because no more MachineConfig renders are generated.

Version-Release number of selected component (if applicable):

    OCP 4.14.16 > 4.15.31  > 4.16.11

How reproducible:

    Perform the upgrade of a Multi Node OCP with a custom configuration like a performance profile or container runtime configuration (like force cgroups v1, or update runc to crun)

Steps to Reproduce:

    1. Deploy on baremetal a MNO OCP 4.14 with a custom manifest, like the below:

---
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: v1

    2. Upgrade the cluster to the next minor version available, for instance 4.15.31, make a partial upgrade pausing the worker Machine Config Pool.

    3. Monitoring the upgrade process (cluster operators, Machine Configs, Machine Config Pools and nodes)
    

Actual results:

    Once almost all the cluster operators are at version 4.15.31, except the Machine Config Operator, review the MachineConfig renders generated for the master Machine Config Pool and monitor the nodes: you will see that a new MachineConfig render is generated after the first control plane node has already been rebooted.

Expected results:

  What is expected is that in an upgrade only one MachineConfig render is generated per Machine Config Pool, and only one reboot per node is needed to finish the upgrade.

Additional info:

    

Description of problem:

    1. Clients cannot connect to the kube-apiserver via the kubernetes svc, because the kubernetes svc IP is not in the cert SANs.
    2. The kube-apiserver-operator generates the apiserver certs and inserts the kubernetes svc IP taken from the network CR status.ServiceNetwork (see the sketch below).
    3. When the temporary control plane is down and the network CR is not ready yet, clients will never be able to connect to the apiserver.
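
For reference, the value read in point 2 comes from the cluster Network config status; a sketch (the CIDR is illustrative):

apiVersion: config.openshift.io/v1
kind: Network
metadata:
  name: cluster
status:
  serviceNetwork:
  - 172.30.0.0/16   # illustrative; the first IP of this range is the kubernetes svc clusterIP that must appear in the cert SANs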

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1. I have only hit this under very rare conditions, especially when machine performance is poor
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

When deploying a disconnected cluster, creating the ISO with "openshift-install agent create image" fails with "authentication required" when the release image resides in a secured local registry.
Actually the issue is this:
openshift-install generates the registry-config out of the install-config.yaml, and it contains only the local registry credentials (disconnected deploy), but it does not create an ICSP file to pull the image from the local registry.
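
For context, the kind of digest-mirror mapping that would need to be generated alongside the registry-config looks roughly like this (the mirror host and source are illustrative):

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: release-mirror   # illustrative
spec:
  repositoryDigestMirrors:
  - mirrors:
    - local-registry.example.com:8443/registry   # illustrative local mirror
    source: registry.ci.openshift.org/ocp/release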

Version-Release number of selected component (if applicable):

    

How reproducible:

    Run an agent-based ISO image creation for a disconnected cluster. Choose a version (nightly) whose image is in a secured registry (such as registry.ci). It will fail with "authentication required".

Steps to Reproduce:

    1. openshift-install agent create image
    2.
    3.
    

Actual results:

failing on authentication required    

Expected results:

    The ISO should be created.

Additional info:

    

Description of problem:

A dynamic plugin in Pending status blocks the Console plugins tab page from loading.

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-11-27-162407    

How reproducible:

Always    

Steps to Reproduce:

1. Create a dynamic plugin which will be in Pending status, we can create from file https://github.com/openshift/openshift-tests-private/blob/master/frontend/fixtures/plugin/pending-console-demo-plugin-1.yaml 

2. Enable the 'console-demo-plugin-1' plugin and navigate to Console plugins tab at /k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins

Actual results:

2. The page keeps loading forever.

Expected results:

2. console plugins list table should be displayed    

Additional info:

    

Description of problem:

'Channel' and 'Version' dropdowns do not collapse if the user does not select an option    

Version-Release number of selected component (if applicable):

4.18.0-0.nightly-2024-12-04-113014    

How reproducible:

    Always

Steps to Reproduce:

    1. Navigate to the Operator Installation page OR the Operator Install details page
       eg: /operatorhub/ns/openshift-console?source=["Red+Hat"]&details-item=datagrid-redhat-operators-openshift-marketplace&channel=stable&version=8.5.4
       /operatorhub/subscribe?pkg=datagrid&catalog=redhat-operators&catalogNamespace=openshift-marketplace&targetNamespace=openshift-console&channel=stable&version=8.5.4&tokenizedAuth=     
    2. Click the Channel/Update channel OR 'Version' dropdown list
    3. Click the dropdown again
    

Actual results:

The dropdown list does not collapse; it only closes if the user selects an option or clicks another area.

Expected results:

 The dropdown should collapse when clicked again.

Additional info:

    

Description of problem:

    The --report and --pxe flags were introduced in 4.18. They should be marked as experimental until 4.19.

Version-Release number of selected component (if applicable):

    4.18

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

We should expand upon our current pre-commit hooks:

  • all hooks will run in either the pre-commit stage or the pre-push stage
  • add a pre-push hook to run make verify
  • add a pre-push hook to run make test

This will help prevent errors before code makes it on GitHub and CI.
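
A minimal sketch of what these local hooks could look like in a .pre-commit-config.yaml, assuming the repo already uses the pre-commit framework (a recent version, which uses the pre-push stage name) and has make verify / make test targets:

repos:
  - repo: local
    hooks:
      - id: make-verify
        name: make verify
        entry: make verify      # runs the repo's verify target before pushing
        language: system
        pass_filenames: false
        stages: [pre-push]
      - id: make-test
        name: make test
        entry: make test        # runs the unit tests before pushing
        language: system
        pass_filenames: false
        stages: [pre-push]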

This is a clone of issue OCPBUGS-41727. The following is the description of the original issue:

Original bug title:

cert-manager [v1.15 Regression] Failed to issue certs with ACME Route53 dns01 solver in AWS STS env

Description of problem:

    When using Route53 as the dns01 solver to create certificates, it fails in both automated and manual tests. For the full log, please refer to the "Actual results" section.

Version-Release number of selected component (if applicable):

    cert-manager operator v1.15.0 staging build

How reproducible:

    Always

Steps to Reproduce: also documented in gist

    1. Install the cert-manager operator 1.15.0
    2. Follow the doc to auth operator with AWS STS using ccoctl: https://docs.openshift.com/container-platform/4.16/security/cert_manager_operator/cert-manager-authenticate.html#cert-manager-configure-cloud-credentials-aws-sts_cert-manager-authenticate
     3. Create a ACME issuer with Route53 dns01 solver
     4. Create a cert using the created issuer

OR:

Refer by running `/pj-rehearse pull-ci-openshift-cert-manager-operator-master-e2e-operator-aws-sts` on https://github.com/openshift/release/pull/59568 

Actual results:

1. The certificate is not Ready.
2. The challenge of the cert is stuck in the pending status:

PresentError: Error presenting challenge: failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region  

Expected results:

The certificate should be Ready. The challenge should succeed.

Additional info:

The only way to get it working again seems to be injecting the "AWS_REGION" environment variable into the controller pod. See upstream discussion/change:

I couldn't find a way to inject the env var into our operator-managed operands, so I only verified this workaround using the upstream build v1.15.3. After applying the patch with the following command, the challenge succeeded and the certificate became Ready.

oc patch deployment cert-manager -n cert-manager \
--patch '{"spec": {"template": {"spec": {"containers": [{"name": "cert-manager-controller", "env": [{"name": "AWS_REGION", "value": "aws-global"}]}]}}}}' 

Description of problem:

The manila controller[1] defines labels that are not based on the asset prefix defined in the manila config[2]. Consequently, when assets that select this resource are generated, they use the asset prefix as a base to define the label, resulting in the controller not being selected, for example in the pod anti-affinity[3] and the controller PDB[4]. We need to change the labels used in the selectors to match the actual labels of the controller.

[1]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/standalone/controller.yaml#L45-L47

[2]https://github.com/openshift/csi-operator/blob/master/pkg/driver/openstack-manila/openstack_manila.go#L51

[3]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/standalone/controller.yaml#L55

[4]https://github.com/openshift/csi-operator/blob/master/assets/overlays/openstack-manila/generated/hypershift/controller_pdb.yaml#L16

 

Version-Release number of selected component (if applicable):

    

How reproducible:

    

Steps to Reproduce:

    1.
    2.
    3.
    

Actual results:

    

Expected results:

    

Additional info:

    

Description of problem:

[Azure disk/file CSI driver] on ARO HCP cannot provision volumes successfully.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-13-083421    

How reproducible:

Always    

Steps to Reproduce:

    1. Install an AKS cluster on Azure.
    2. Install the hypershift operator on the AKS cluster.
    3. Use the hypershift CLI to create a hosted cluster with the Client Certificate mode.
    4. Check that the Azure disk/file CSI drivers work well on the hosted cluster.

Actual results:

    In step 4: the Azure disk/file CSI driver fails to provision volumes on the hosted cluster

# azure disk pvc provision failed
$ oc describe pvc mypvc
...
  Normal   WaitForFirstConsumer  74m                    persistentvolume-controller                                                                                waiting for first consumer to be created before binding
  Normal   Provisioning          74m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073  External provisioner is provisioning volume for claim "default/mypvc"
  Warning  ProvisioningFailed    74m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073  failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF
  Warning  ProvisioningFailed    71m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8  failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF
  Normal   Provisioning          71m                    disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8  External provisioner is provisioning volume for claim "default/mypvc"
...

$ oc logs azure-disk-csi-driver-controller-74d944bbcb-7zz89 -c csi-driver
W1216 08:07:04.282922       1 main.go:89] nodeid is empty
I1216 08:07:04.290689       1 main.go:165] set up prometheus server on 127.0.0.1:8201
I1216 08:07:04.291073       1 azuredisk.go:213]
DRIVER INFORMATION:
-------------------
Build Date: "2024-12-13T02:45:35Z"
Compiler: gc
Driver Name: disk.csi.azure.com
Driver Version: v1.29.11
Git Commit: 4d21ae15d668d802ed5a35068b724f2e12f47d5c
Go Version: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime
Platform: linux/amd64
Topology Key: topology.disk.csi.azure.com/zone

I1216 08:09:36.814776       1 utils.go:77] GRPC call: /csi.v1.Controller/CreateVolume
I1216 08:09:36.814803       1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":""}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":""}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","parameters":{"csi.storage.k8s.io/pv/name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","csi.storage.k8s.io/pvc/name":"mypvc","csi.storage.k8s.io/pvc/namespace":"default","skuname":"Premium_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":7}}]}
I1216 08:09:36.815338       1 controllerserver.go:208] begin to create azure disk(pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316) account type(Premium_LRS) rg(ci-op-zj9zc4gd-12c20-rg) location(centralus) size(1) diskZone() maxShares(0)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x190c61d]

goroutine 153 [running]:
sigs.k8s.io/cloud-provider-azure/pkg/provider.(*ManagedDiskController).CreateManagedDisk(0x0, {0x2265cf0, 0xc0001285a0}, 0xc0003f2640)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_managedDiskController.go:127 +0x39d
sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).CreateVolume(0xc000564540, {0x2265cf0, 0xc0001285a0}, 0xc000272460)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/controllerserver.go:297 +0x2c59
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1({0x2265cf0?, 0xc0001285a0?}, {0x1e5a260?, 0xc000272460?})
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6420 +0xcb
sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x2265cf0, 0xc0001285a0}, {0x1e5a260, 0xc000272460}, 0xc00017cb80, 0xc00014ea68)
	/go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x1f3e440, 0xc000564540}, {0x2265cf0, 0xc0001285a0}, 0xc00029a700, 0x2084458)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6422 +0x143
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00059cc00, {0x2265cf0, 0xc000128510}, {0x2270d60, 0xc0004f5980}, 0xc000308480, 0xc000226a20, 0x31c8f80, 0x0)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1379 +0xdf8
google.golang.org/grpc.(*Server).handleStream(0xc00059cc00, {0x2270d60, 0xc0004f5980}, 0xc000308480)
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1790 +0xe8b
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1029 +0x7f
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 16
	/go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1040 +0x125

# azure file pvc provision failed
$ oc describe pvc mypvc
Name:          mypvc
Namespace:     openshift-cluster-csi-drivers
StorageClass:  azurefile-csi
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: file.csi.azure.com
               volume.kubernetes.io/storage-provisioner: file.csi.azure.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type     Reason                Age                From                                                                                                      Message
  ----     ------                ----               ----                                                                                                      -------
  Normal   ExternalProvisioning  14s (x2 over 14s)  persistentvolume-controller                                                                               Waiting for a volume to be created either by the external provisioner 'file.csi.azure.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   Provisioning          7s (x4 over 14s)   file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0  External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/mypvc"
  Warning  ProvisioningFailed    7s (x4 over 14s)   file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0  failed to provision volume with StorageClass "azurefile-csi": rpc error: code = Internal desc = failed to ensure storage account: could not list storage accounts for account type Standard_LRS: StorageAccountClient is nil

Expected results:

    In step 4: the Azure disk/file CSI driver should provision the volume successfully on the hosted cluster.

Additional info:

    

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

Delete the openshift-monitoring/monitoring-plugin-cert secret; SCO will re-create a new one with different content.

Actual results:

- monitoring-plugin is still using the old cert content.
- If the cluster doesn’t show much activity, the hash may take time to be updated.

Expected results:

CMO should detect that exact change and run a sync to recompute and set the new hash.

Additional info:

- We shouldn't rely on another change to trigger the sync loop.
- CMO should probably watch that secret (its name isn't known in advance); see the sketch below.
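
A minimal sketch, assuming client-go informers and illustrative names (the namespace, the hash helper, and the enqueue step are not CMO's actual code): watch Secrets in the monitoring namespace and trigger a sync whenever a secret's content changes, instead of waiting for an unrelated event.

// Sketch only: watch Secrets and react when any secret's data changes.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// hashSecret returns a stable digest of the secret's data, suitable for a
// checksum annotation on the consuming deployment's pod template.
func hashSecret(s *corev1.Secret) string {
	keys := make([]string, 0, len(s.Data))
	for k := range s.Data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(s.Data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The serving-cert secret name is not known in advance, so watch the whole
	// namespace and react to content changes in the update handler.
	factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithNamespace("openshift-monitoring"))
	secretInformer := factory.Core().V1().Secrets().Informer()
	secretInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldSecret := oldObj.(*corev1.Secret)
			newSecret := newObj.(*corev1.Secret)
			if hashSecret(oldSecret) != hashSecret(newSecret) {
				klog.Infof("secret %s/%s changed, enqueueing a sync", newSecret.Namespace, newSecret.Name)
				// enqueue the monitoring-plugin sync here so the checksum on the
				// deployment is recomputed immediately
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}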

Description of problem:

When updating cypress-axe, new changes and bugfixes in the axe-core accessibility auditing package have surfaced various accessibility violations that have to be addressed     

Version-Release number of selected component (if applicable):

    OpenShift 4.18.0

How reproducible:

    always

Steps to Reproduce:

    1. Update axe-core and cypress-axe to the latest versions
    2. Run test-cypress-console and run a Cypress test; I used other-routes.cy.ts

Actual results:

    The tests fail with various accessibility violations

Expected results:

    The tests pass without accessibility violations

Additional info:

    

Context Thread

As a maintainer of the SNO CI lane, I would like to ensure that the following test doesn't fail regularly as part of SNO CI.

[sig-architecture] platform pods in ns/openshift-e2e-loki should not exit an excessive amount of times

This issue is a symptom of a greater problem with SNO: there is downtime in resolving DNS after the upgrade reboot, when the DNS operator has an outage while it is deploying the new DNS pods. During that time, Loki exits after hitting the following error:

2024/10/23 07:21:32 OIDC provider initialization failed: Get "https://sso.redhat.com/auth/realms/redhat-external/.well-known/openid-configuration": dial tcp: lookup sso.redhat.com on 172.30.0.10:53: read udp 10.128.0.4:53104->172.30.0.10:53: read: connection refused

This issue is important because it can contribute to payload rejection in our blocking CI jobs.

Acceptance Criteria:

  • Problem is discussed with the networking team to understand the best path to resolution and decision is documented
  • Either the DNS operator or test are adjusted to address or mitigate the issue.
  • CI is free from the issue in test results for an extended period. (Need to confirm how often we're seeing it first before this period can be defined with confidence).

Description of problem:

Bare Metal UPI cluster

Nodes lose communication with other nodes, and this affects pod communication on those nodes as well. The issue can be fixed with an OVN database rebuild on the affected nodes, but the nodes eventually degrade and lose communication again. Note that, despite an OVN rebuild fixing the issue temporarily, host networking is set to true, so the kernel routing table is in use.

Update: also observed on vSphere with routingViaHost: false, ipForwarding: global configuration.

Version-Release number of selected component (if applicable):

 4.14.7, 4.14.30

How reproducible:

Can't reproduce locally, but reproducible and repeatedly occurring in the customer environment.

Steps to Reproduce:

Identify a host node whose pods can't be reached from other hosts in default namespaces (tested via openshift-dns). Observe that curls to that peer pod consistently time out. Tcpdumps to the target pod show that packets arrive and are acknowledged, but never route back to the client pod successfully (SYN/ACK is seen at the pod network layer but not at the Geneve layer, so packets are dropped before hitting the Geneve tunnel).

Actual results:

Nodes will repeatedly degrade and lose communication despite fixing the issue with an OVN db rebuild (the rebuild only provides hours/days of respite, not a permanent resolution).

Expected results:

Nodes should not be losing communication, and even if they did, it should not happen repeatedly.

Additional info:

What's been tried so far
========================

- Multiple OVN rebuilds on different nodes (works but node will eventually hit issue again)

- Flushing the conntrack (Doesn't work)

- Restarting nodes (doesn't work)

Data gathered
=============

- Tcpdump from all interfaces for dns-pods going to port 7777 (to segregate traffic)

- ovnkube-trace

- SOSreports of two nodes having communication issues before an OVN rebuild

- SOSreports of two nodes having communication issues after an OVN rebuild 

- OVS trace dumps of br-int and br-ex 


====

More data in nested comments below. 

linking KCS: https://access.redhat.com/solutions/7091399 

Description of problem:

In 4.8's installer#4760, the installer began passing oc adm release new ... a manifest so the cluster-version operator would manage a coreos-bootimages ConfigMap in the openshift-machine-config-operator namespace. installer#4797 reported issues with the 0.0.1-snapshot placeholder not getting substituted, and installer#4814 attempted to fix that issue by converting the manifest from JSON to YAML to align with the replacement regexp. But for reasons I don't understand, that manifest still doesn't seem to be getting replaced.
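
For illustration only (not the actual `oc adm release new` code), a minimal sketch of the kind of placeholder substitution involved; it also shows why the manifest's serialization matters, since the regexp has to match the serialized form:

// Sketch only: replace the 0.0.1-snapshot placeholder in a manifest with the
// real release version. Whether this works depends on the regexp actually
// matching the manifest's serialization, which is what the JSON-to-YAML
// conversion in installer#4814 was trying to line up.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	manifest := []byte(`apiVersion: v1
kind: ConfigMap
metadata:
  name: coreos-bootimages
  namespace: openshift-machine-config-operator
data:
  releaseVersion: 0.0.1-snapshot
`)

	placeholder := regexp.MustCompile(`0\.0\.1-snapshot`)
	fmt.Print(string(placeholder.ReplaceAll(manifest, []byte("4.8.0"))))
}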

Version-Release number of selected component (if applicable):

From 4.8 through 4.15.

How reproducible:

100%

Steps to Reproduce:

With 4.8.0:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64
$ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml

Actual results:

  releaseVersion: 0.0.1-snapshot

Expected results:

  releaseVersion: 4.8.0

or other output that matches the extracted release. We just don't want the 0.0.1-snapshot placeholder.

Additional info:

Reproducing in the latest 4.14 RC:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64
$ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml
  releaseVersion: 0.0.1-snapshot

Description of problem:

    When applying a profile with an isolated field containing a huge CPU list, the profile doesn't apply and no error is reported.

Version-Release number of selected component (if applicable):

    4.18.0-0.nightly-2024-11-26-075648

How reproducible:

    Everytime.

Steps to Reproduce:

    1. Create a profile as specified below:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  annotations:
    kubeletconfig.experimental: '{"topologyManagerPolicy":"restricted"}'
  creationTimestamp: "2024-11-27T10:25:13Z"
  finalizers:
  - foreground-deletion
  generation: 61
  name: performance
  resourceVersion: "3001998"
  uid: 8534b3bf-7bf7-48e1-8413-6e728e89e745
spec:
  cpu:
    isolated: 25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,371,118,374,104,360,108,364,70,326,72,328,76,332,96,352,99,355,64,320,80,336,97,353,8,264,11,267,38,294,53,309,57,313,103,359,14,270,87,343,7,263,40,296,51,307,94,350,116,372,39,295,46,302,90,346,101,357,107,363,26,282,67,323,98,354,106,362,113,369,6,262,10,266,20,276,33,289,112,368,85,341,121,377,68,324,71,327,79,335,81,337,83,339,88,344,9,265,89,345,91,347,100,356,54,310,31,287,58,314,59,315,22,278,47,303,105,361,17,273,114,370,111,367,28,284,49,305,55,311,84,340,27,283,95,351,5,261,36,292,41,297,43,299,45,301,75,331,102,358,109,365,37,293,56,312,63,319,65,321,74,330,125,381,13,269,42,298,44,300,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,225,481,236,492,152,408,203,459,214,470,166,422,207,463,212,468,130,386,155,411,215,471,188,444,201,457,210,466,193,449,200,456,248,504,141,397,167,423,191,447,181,437,222,478,252,508,128,384,139,395,174,430,164,420,168,424,187,443,232,488,133,389,157,413,208,464,140,396,185,441,241,497,219,475,175,431,184,440,213,469,154,410,197,453,249,505,209,465,218,474,227,483,244,500,134,390,153,409,178,434,160,416,195,451,196,452,211,467,132,388,136,392,146,402,138,394,150,406,239,495,173,429,192,448,202,458,205,461,216,472,158,414,159,415,176,432,189,445,237,493,242,498,177,433,182,438,204,460,240,496,254,510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480
    reserved: 0,256,1,257
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: true
    perPodPowerManagement: false
    realTime: true

    2. The worker-cnf node doesn't contain any kernel args associated with the above profile.
    3.
    

Actual results:

    The system doesn't boot with the kernel args associated with the above profile.

Expected results:

    The system should boot with the kernel args from the PerformanceProfile.

Additional info:

We can see MCO gets the details and creates the mc:

Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: machine-config-daemon[9550]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1=\"all\" --delete=psi=0 --delete=skew_tick=1 --delete=tsc=reliable --delete=rcupda>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: cbs=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,3>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 4,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,2>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: systemd.cpu_affinity=0,1,256,257 --append=iommu=pt --append=amd_pstate=guided --append=tsc=reliable --append=nmi_watchdog=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=is>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480 --append=nohz_full=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,49>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ppend=nosoftlockup --append=skew_tick=1 --append=rcutree.kthread_prio=11 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=20]"
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: client(id:machine-config-operator dbus:1.336 unit:crio-36c845a9c9a58a79a0e09dab668f8b21b5e46e5734a527c269c6a5067faa423b.scope uid:0) added; new total=1
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: Loaded sysroot

Actual Kernel args:
BOOT_IMAGE=(hd1,gpt3)/boot/ostree/rhcos-854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/vmlinuz-5.14.0-427.44.1.el9_4.x86_64 rw ostree=/ostree/boot.0/rhcos/854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/0 ignition.platform.id=metal ip=dhcp root=UUID=0068e804-432c-409d-aabc-260aa71e3669 rw rootflags=prjquota boot=UUID=7797d927-876e-426b-9a30-d1e600c1a382 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on

    

Description of problem:

The create button on the MultiNetworkPolicies and NetworkPolicies list pages is in the wrong position; it should be at the top right.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1.
2.
3.

Actual results:


Expected results:


Additional info:


Description of problem:

    See https://github.com/kubernetes/kubernetes/issues/127352

Version-Release number of selected component (if applicable):

    See https://github.com/kubernetes/kubernetes/issues/127352    

How reproducible:

    See https://github.com/kubernetes/kubernetes/issues/127352

Steps to Reproduce:

    See https://github.com/kubernetes/kubernetes/issues/127352

Actual results:

    See https://github.com/kubernetes/kubernetes/issues/127352

Expected results:

    See https://github.com/kubernetes/kubernetes/issues/127352

Additional info:

    See https://github.com/kubernetes/kubernetes/issues/127352

Description of problem:

Checked in 4.18.0-0.nightly-2024-12-05-103644 / 4.19.0-0.nightly-2024-12-04-03122. In the admin console, go to "Observe -> Metrics", execute a query that returns results (for example "cluster_version"), and click the kebab menu. "Show all series" is shown under the list, which is wrong; it should be "Hide all series". Clicking "Show all series" unselects all series, yet "Hide all series" keeps showing under the menu; clicking it toggles the series between selected and unselected, but the label always reads "Hide all series". See recording: https://drive.google.com/file/d/1kfwAH7FuhcloCFdRK--l01JYabtzcG6e/view?usp=drive_link

The same issue exists in the developer console for 4.18+; 4.17 and below do not have it.

Version-Release number of selected component (if applicable):

4.18+

How reproducible:

always with 4.18+

Steps to Reproduce:

see the description

Actual results:

The Hide/Show all series state under the "Observe -> Metrics" kebab menu is wrong.

Expected results:

The Hide/Show all series label should match the actual selection state.

Description of problem:

The "go to" arrow and the new-doc link icon are no longer aligned with their text.

Version-Release number of selected component (if applicable):

4.19.0-0.nightly-2024-12-12-144418    

How reproducible:

Always    

Steps to Reproduce:

    1. Go to the Home -> Overview page
    2.
    3.
    

Actual results:

The "go to" arrow and the new-doc link icon are not horizontally aligned with their text any more.

Expected results:

The icons and text should be aligned.

Additional info:

    screenshot https://drive.google.com/file/d/1S61XY-lqmmJgGbwB5hcR2YU_O1JSJPtI/view?usp=drive_link 

Description of problem:

CAPI install fails with ImageReconciliationFailed when creating the VPC custom image.

Version-Release number of selected component (if applicable):

 4.19.0-0.nightly-2024-12-06-101930    

How reproducible:

always    

Steps to Reproduce:

1. Add the following to install-config.yaml:
featureSet: CustomNoUpgrade
featureGates: [ClusterAPIInstall=true]
2. Create an IBM Cloud cluster with IPI.

Actual results:

level=info msg=Done creating infra manifests
level=info msg=Creating kubeconfig entry for capi cluster ci-op-h3ykp5jn-32a54-xprzg
level=info msg=Waiting up to 30m0s (until 11:25AM UTC) for network infrastructure to become ready...
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 30m0s: client rate limiter Wait returned an error: context deadline exceeded  

in IBMVPCCluster-openshift-cluster-api-guests log

reason: ImageReconciliationFailed
    message: 'error failure trying to create vpc custom image: error unknown failure
      creating vpc custom image: The IAM token that was specified in the request has
      expired or is invalid. The request is not authorized to access the Cloud Object
      Storage resource.'  

Expected results:

Cluster creation succeeds.

Additional info:

The resources created when the install failed:
ci-op-h3ykp5jn-32a54-xprzg-cos  dff97f5c-bc5e-4455-b470-411c3edbe49c crn:v1:bluemix:public:cloud-object-storage:global:a/fdc2e14cf8bc4d53a67f972dc2e2c861:f648897a-2178-4f02-b948-b3cd53f07d85::
ci-op-h3ykp5jn-32a54-xprzg-vpc  is.vpc crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::vpc:r022-46c7932d-8f4d-4d53-a398-555405dfbf18
copier-resurrect-panzer-resistant  is.security-group crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::security-group:r022-2367a32b-41d1-4f07-b148-63485ca8437b
deceiving-unashamed-unwind-outward  is.network-acl crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::network-acl:r022-b50286f6-1052-479f-89bc-fc66cd9bf613
    

Description of problem:

node-joiner --pxe does not rename the PXE artifacts

Version-Release number of selected component (if applicable):

    

How reproducible:

always

Steps to Reproduce:

    1. node-joiner --pxe

Actual results:

   agent*.* artifacts are generated in the working dir

Expected results:

    In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img
* node.x86_64-rootfs.img
* node.x86_64-vmlinuz
* node.x86_64.ipxe (if required)

Additional info:

    

Today, when source images are by digest only, oc-mirror applies a default tag:

  • for operators and additional images it is the digest
  • for helm images it is digestAlgorithm+"-"+digest

This should be unified.
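
A minimal sketch of one possible unified rule; the helper name and the exact tag format are assumptions rather than oc-mirror's current behaviour:

// Sketch only: derive a single default tag from a digest-only reference, the
// same way for operators, additional images, and helm images. The exact rule
// (algorithm prefix, truncation length) is an assumption for illustration.
package main

import (
	"fmt"
	"strings"
)

func defaultTag(digest string) string {
	// digest looks like "sha256:4d21ae15d668..."; turn it into a tag-safe
	// string such as "sha256-4d21ae15d668".
	algo, hex, found := strings.Cut(digest, ":")
	if !found {
		return digest
	}
	if len(hex) > 12 {
		hex = hex[:12]
	}
	return algo + "-" + hex
}

func main() {
	fmt.Println(defaultTag("sha256:4d21ae15d668d802ed5a35068b724f2e12f47d5c"))
}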

Component Readiness has found a potential regression in the following test:

install should succeed: infrastructure

installer fails with:

time="2024-10-20T04:34:57Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded" 

Significant regression detected.
Fishers Exact probability of a regression: 99.96%.
Test pass rate dropped from 98.94% to 89.29%.

Sample (being evaluated) Release: 4.18
Start Time: 2024-10-14T00:00:00Z
End Time: 2024-10-21T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0

Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 98.94%
Successes: 93
Failures: 1
Flakes: 0

View the test details report at https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?Aggregation=none&Architecture=amd64&FeatureSet=default&Installer=ipi&Network=ovn&NetworkAccess=default&Platform=azure&Scheduler=default&SecurityMode=default&Suite=serial&Topology=ha&Upgrade=none&baseEndTime=2024-10-01%2023%3A59%3A59&baseRelease=4.17&baseStartTime=2024-09-01%2000%3A00%3A00&capability=Other&columnGroupBy=Architecture%2CNetwork%2CPlatform&component=Installer%20%2F%20openshift-installer&confidence=95&dbGroupBy=Platform%2CArchitecture%2CNetwork%2CTopology%2CFeatureSet%2CUpgrade%2CSuite%2CInstaller&environment=amd64%20default%20ipi%20ovn%20azure%20serial%20ha%20none&ignoreDisruption=true&ignoreMissing=false&includeVariant=Architecture%3Aamd64&includeVariant=CGroupMode%3Av2&includeVariant=ContainerRuntime%3Arunc&includeVariant=FeatureSet%3Adefault&includeVariant=Installer%3Aipi&includeVariant=Installer%3Aupi&includeVariant=Owner%3Aeng&includeVariant=Platform%3Aaws&includeVariant=Platform%3Aazure&includeVariant=Platform%3Agcp&includeVariant=Platform%3Ametal&includeVariant=Platform%3Avsphere&includeVariant=Topology%3Aha&minFail=3&passRateAllTests=0&passRateNewTests=95&pity=5&sampleEndTime=2024-10-21%2023%3A59%3A59&sampleRelease=4.18&sampleStartTime=2024-10-14%2000%3A00%3A00&testId=cluster%20install%3A3e14279ba2c202608dd9a041e5023c4c&testName=install%20should%20succeed%3A%20infrastructure

Description of problem:

    The period is placed inside the quotes of the missingKeyHandler i18n error 

Version-Release number of selected component (if applicable):

    4.18.0

How reproducible:

    Always when there is a missingKeyHandler error

Steps to Reproduce:

    1. Check browser console
    2. Observe the period is placed inside the quotes
    3.
    

Actual results:

    It is placed inside the quotes

Expected results:

    It should be placed outside the quotes

Additional info:

    

Other Incomplete

This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled

One issue is that the ovnk build needed to test the PR takes close to an hour, and sometimes closer to 2 hours. One major problem is that the 'dnf install' done
in the ovnk Dockerfile(s) is not able to reach certain dnf repositories that are
default in the base openshift image. They may be behind the Red Hat VPN, which
our test clusters don't have access to. If the problem is a timeout, the default
dnf timeout is 30s, and that can cause 5+ minutes of delay. We do more than one
dnf install in our Dockerfiles too, so the problem is amplified.

Another issue is that the default job timeout is 4h0m; with a ~1h build time and other
time-consuming steps like must-gather, watchers, and long e2e runs, getting to that
4h is pretty easy.

User Story

As a developer looking to contribute to OCP BuildConfig, I want contribution guidelines that make it easy for me to build and test all the components.

Background

Much of the contributor documentation for openshift/builder is either extremely out of date or buggy. This hinders the ability for newcomers to contribute.

Approach

  1. Document dependencies needed to build openshift/builder from source.
  2. Update "dev" container image for openshift/builder so teams can experiment locally.
  3. Provide instructions on how to test
    1. "WIP Pull Request" process
    2. "Disable operators" mode.
    3. Red Hatter instructions: using cluster-bot

Acceptance Criteria

  • New contributors can compile openshift/builder from GitHub instructions
  • New contributors can test their code changes on an OpenShift instance
  • Red Hatters can test their code changes with cluster-bot.

Description of problem:

s2i conformance test appears to fail permanently on OCP 4.16.z
    

Version-Release number of selected component (if applicable):

4.16.z
    

How reproducible:

Since 2024-11-04 at least
    

Steps to Reproduce:

    Run OpenShift build test suite in PR
    

Actual results:

Test fails - root cause appears to be that a built/deployed pod crashloops
    

Expected results:

Test succeeds
    

Additional info:

Job history https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-openshift-controller-manager-release-4.16-e2e-gcp-ovn-builds
    

CMO should create and deploy a ConfigMap that contains the data for accelerators monitoring. When CMO creates the node-exporter DaemonSet, it should mount the ConfigMap into the node-exporter pods (a rough sketch follows below).
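
A minimal sketch, assuming illustrative names for the ConfigMap, mount path, and container (none of these are CMO's actual values):

// Sketch only: wire a ConfigMap with accelerators monitoring data into the
// node-exporter DaemonSet pod template.
package monitoring

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func mountAcceleratorsConfig(ds *appsv1.DaemonSet) {
	// Add the ConfigMap as a volume on the pod template.
	ds.Spec.Template.Spec.Volumes = append(ds.Spec.Template.Spec.Volumes, corev1.Volume{
		Name: "accelerators-config",
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: "node-exporter-accelerators"},
			},
		},
	})
	// Mount it read-only into the node-exporter container.
	for i := range ds.Spec.Template.Spec.Containers {
		c := &ds.Spec.Template.Spec.Containers[i]
		if c.Name != "node-exporter" {
			continue
		}
		c.VolumeMounts = append(c.VolumeMounts, corev1.VolumeMount{
			Name:      "accelerators-config",
			MountPath: "/etc/node-exporter/accelerators",
			ReadOnly:  true,
		})
	}
}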

Background 

The monitoring-plugin is still using Patternfly v4; it needs to be upgraded to Patternfly v5. This major version release deprecates components in the monitoring-plugin. These components will need to be replaced/removed to accommodate the version update. 

We need to remove the deprecated components from the monitoring plugin, extending the work from CONSOLE-4124

Work to be done: 

  • upgrade monitoring-plugin > package.json > Patternfly v5
  • Remove/replace any deprecated components after upgrading to Patternfly v5. 

Outcome 

  • The monitoring-plugin > package.json will be upgraded to use Patternfly v5
  • Any deprecated components from Patternfly v4 will be removed or replaced by similar Patternfly v5 components

One of our customers observed this issue. To reproduce, in my test cluster I intentionally increased the overall CPU limits to over 200% and monitored the cluster for more than 2 days. However, I did not see the KubeCPUOvercommit alert, which should ideally trigger after 10 minutes of overcommitment.

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                2654m (75%)    8450m (241%)
  memory             5995Mi (87%)   12264Mi (179%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)

 

OCP console --> Observe --> Alerting --> alerting rules, and select the `KubeCPUOvercommit` alert.

Expression:

sum by (cluster) (namespace_cpu:kube_pod_container_resource_requests:sum{job="kube-state-metrics"})
  - (sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})
     - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}))
  > 0
and
(sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})
  - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}))
  > 0

The following Insights APIs use duration attributes:

  • insightsoperator.operator.openshift.io
  • datagathers.insights.openshift.io

The kubebuilder validation patterns are defined as

^0|([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$

and

^([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$

Unfortunately this is not enough, and it fails when updating the resource with, for example, the value "2m0s".

The validation pattern must allow these values.
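
An illustrative sketch of a relaxed kubebuilder pattern that also accepts normalized values such as "2m0s" (and plain "0"); the type name and the final pattern chosen for the Insights APIs are assumptions:

// Sketch only: a kubebuilder marker whose pattern accepts "2m0s". Allowing a
// leading zero in each unit segment is what lets normalized durations pass;
// the pattern actually adopted by the Insights APIs may differ.
package v1alpha1

type GathererConfig struct {
	// interval is a Go-style duration string, e.g. "2m", "2m0s", or "1h30m".
	// +kubebuilder:validation:Pattern=`^(0|([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+)$`
	Interval string `json:"interval,omitempty"`
}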

User Story:
As an OpenShift engineer, I want to create a PR for the machine-api refactoring of feature gate parameters, so that we can pull the logic out of Neil's PR that removes individual feature gate parameters in favor of the new FeatureGate mutable map (see the sketch after this card).

Description:
< Record any background information >

Acceptance Criteria:
< Record how we'll know we're done >

Other Information:
< Record anything else that may be helpful to someone else picking up the card >
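
A rough sketch of the general shape of a mutable feature-gate map; the package and type names are illustrative, not the machine-api code:

// Sketch only: replace individual feature-gate parameters with one mutable,
// concurrency-safe map keyed by gate name.
package featuregates

import "sync"

type Gate string

type MutableGates struct {
	mu    sync.RWMutex
	gates map[Gate]bool
}

func New(initial map[Gate]bool) *MutableGates {
	g := &MutableGates{gates: map[Gate]bool{}}
	for name, enabled := range initial {
		g.gates[name] = enabled
	}
	return g
}

// Enabled reports whether a gate is currently on.
func (g *MutableGates) Enabled(name Gate) bool {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return g.gates[name]
}

// Set flips a gate at runtime, e.g. in response to a FeatureGate status update.
func (g *MutableGates) Set(name Gate, enabled bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.gates[name] = enabled
}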

issue created by splat-bot

Description of problem:

The openstack-manila-csi-controllerplugin-csi-driver container is not functional on the first run; it needs to restart once and then it's good. This causes HCP e2e to fail on the EnsureNoCrashingPods test.

Version-Release number of selected component (if applicable):

4.19, 4.18

How reproducible:

Deploy Shift on Stack with Manila available in the cloud.

Actual results:

The openstack-manila-csi-controllerplugin pod will restart once and then it'll be functional.

Expected results:

No restart should be needed. This is likely an orchestration issue.

 

Issue present in Standalone clusters: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-storage-operator/544/pull-ci-openshift-cluster-storage-operator-master-e2e-openstack-manila-csi/1862115443632771072/artifacts/e2e-openstack-manila-csi/gather-extra/artifacts/pods/openshift-manila-csi-driver_openstack-manila-csi-nodeplugin-5cqcw_csi-driver_previous.log

Also present in HCP clusters: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_hypershift/5138/pull-ci-openshift-hypershift-main-e2e-openstack/1862138522127831040/artifacts/e2e-openstack/hypershift-openstack-e2e-execute/artifacts/TestCreateCluster/namespaces/e2e-clusters-kktnx-example-cq7xp/core/pods/logs/openstack-manila-csi-controllerplugin-675ff65bf5-gqf65-csi-driver-previous.log

  Security Tracking Issue

Do not make this issue public.

Flaw:


Non-linear parsing of case-insensitive content in golang.org/x/net/html
https://bugzilla.redhat.com/show_bug.cgi?id=2333122

An attacker can craft an input to the Parse functions that would be processed non-linearly with respect to its length, resulting in extremely slow parsing. This could cause a denial of service.


Description of problem:

We have two EAP application server clusters, and for each of them a service is created. We have a route configured to one of the services. When we update the route programmatically to point to the second service/cluster, the response shows it is still attached to the same service.

Steps to Reproduce:
1. Create two separate clusters of the EAP servers
2. Create one service for the first cluster (hsc1) and one for the second one (hsc2)
3. Create a route for the first service (hsc1)
4. Start both of the clusters and assure the replication works
5. Send a request to the first cluster using the route URL - response should contain identification of the first cluster (hsc-1-xxx)

[2024-08-29 11:30:44,544] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 11:30:44,654] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

6. Update the route programmatically to redirect to the second service (hsc2)

...
builder.editSpec().editTo().withName("hsc2").endTo().endSpec();
...

7. Send the request again using the same route - the response still contains the identification of the first cluster

[2024-08-29 11:31:45,098] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

although the service was updated in the route yaml:

...
kind: Service
    name: hsc2

When creating a new route hsc2 for the service hsc2 and using it for the third request, we can see the second cluster is targeted correctly, with its own separate replication working:

[2024-08-29 13:43:13,679] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:43:13,790] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:44:14,056] INFO - [ForkJoinPool-1-worker-1] responseString after second route for service hsc2 was used hsc-2-2-614582a9-3c71-4690-81d3-32a616ed8e54 1 with route hsc2-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

I also made a different attempt: I stopped the test in debug mode after the two requests were executed.

[2024-08-30 14:23:43,101] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-30 14:23:43,210] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

Then I manually changed the route yaml to use the hsc2 service and sent the request manually:

curl http://hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com/Counter
hsc-2-2-84fa1d7e-4045-4708-b89e-7d7f3cd48541 1

responded correctly with the second service/cluster.

Then I resumed the test run in debug mode and sent the request programmatically:

[2024-08-30 14:24:59,509] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com

responded with the wrong first service/cluster.

Actual results: Route directs to the same service and EAP cluster

Expected results: After the update the route should direct to the second service and EAP cluster

Additional info:
This issue started to occur with OCP 4.16. Going through the 4.16 release notes and the suggested route configuration did not reveal any configuration changes that should have been applied.

The code of MultipleClustersTest.twoClustersTest, where this issue was discovered, is available here.

All the logs as well as services and route yamls are attached to the EAPQE jira.

CBO-installed Ironic unconditionally has TLS, even though we don't do proper host validation just yet (see bug OCPBUGS-20412). Ironic in the installer does not use TLS (mostly for historical reasons). Now that OCPBUGS-36283 added a TLS certificate for virtual media, we can use the same certificate for the Ironic API. At least initially, this will involve disabling host validation for IPA.

Description of problem:

OpenShift Virtualization allows hotplugging block volumes into its pods, which relies on the fact that changing the cgroup corresponding to the PID of the container suffices.

crun is test-driving some changes it integrated recently: it now configures two cgroups, `*.scope` and a sub-cgroup called `container`, whereas before the parent existed as a sort of no-op (it wasn't configured, so all devices were allowed, for example). This results in the volume hotplug breaking, since applying the device filter to the sub-cgroup is no longer enough.

Version-Release number of selected component (if applicable):

4.18.0 RC2

How reproducible:

100%    

Steps to Reproduce:

    1. Block volume hotplug to VM
    2.
    3.
    

Actual results:

    Failure

Expected results:

    Success

Additional info:

https://kubevirt.io/user-guide/storage/hotplug_volumes/