Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Placeholder feature for ccx-ocp-core maintenance tasks.
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
Description of problem:
The operatorLogLevel reverts back to `Normal` automatically for the insightsoperator object.
Version-Release number of selected component (if applicable):
4.14,4.15,4.16,4.17
How reproducible:
100%
Steps to Reproduce:
1. Edit the insightsoperator/cluster object and change operatorLogLevel to Debug/Trace/TraceAll:
   $ oc edit insightsoperator/cluster
   spec:
     logLevel: Normal
     managementState: Managed
     operatorLogLevel: Normal   # ====> change it to Debug/Trace/TraceAll
2. After 1 minute, check whether the change persists:
   $ oc get insightsoperator/cluster -o yaml
   spec:
     logLevel: Normal
     managementState: Managed
     operatorLogLevel: Normal
3. The oc CLI documentation states that the field supports the following logLevels:
   $ oc explain insightsoperator.spec.operatorLogLevel
   GROUP: operator.openshift.io
   KIND: InsightsOperator
   VERSION: v1
   FIELD: operatorLogLevel <string>
   ENUM: "" Normal Debug Trace TraceAll
   DESCRIPTION: operatorLogLevel is an intent based logging for the operator itself. It does not give fine grained control, but it is a simple way to manage coarse grained logging choices that operators have to interpret for themselves. Valid values are: "Normal", "Debug", "Trace", "TraceAll". Defaults to "Normal".
Actual results:
The logLevel is set back to Normal automatically.
Expected results:
The logLevel should not be changed to Normal until manually modified.
Additional info:
The same issue is observed with insightsoperator.spec.logLevel, where any logLevel other than Normal gets reverted.
Goal:
Track Insights Operator Data Enhancements epic in 2024
Description of problem:
Context: OpenShift Logging is migrating from Elasticsearch to Loki. While the option to use Loki has existed for quite a while, the information about the end of Elasticsearch support has not been available until recently. With that information available now, we can expect more and more customers to migrate and hit the issue described in INSIGHTOCP-1927. P.S. Note the bar chart in INSIGHTOCP-1927, which shows how frequently the related KCS is linked in customer cases.
Data to gather: LokiStack custom resources (any name, any namespace).
Backports: The option to use Loki has been available since Logging 5.5, whose compatibility started at OCP 4.9. Considering the OCP life cycle, backports to up to OCP 4.14 would be nice.
Unknowns: Since Logging 5.7, Logging supports installation of multiple instances in customer namespaces. The Insights Operator would have to look for the CRs in all namespaces, which poses the following questions:
- What is the expected number of LokiStack CRs in a cluster?
- Should the Insights Operator look for the resource in all namespaces? Is there a way to narrow down the scope?
- The CR will contain the name of a customer namespace, which is sensitive information.
- What is the API group of the CR?
- Is there a risk of LokiStack CRs in customer namespaces that would NOT be related to OpenShift Logging?
SME: Oscar Arribas Arribas
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
N/A
Actual results:
Expected results:
Additional info:
Crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.
The benefits of crun are covered here: https://github.com/containers/crun
FAQ: https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit
Note: making crun the default does not mean we will remove support for runc, nor do we have any plans in the foreseeable future to do that.
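For illustration, a minimal sketch of how an admin could pin a pool back to runc once crun becomes the default, via the existing ContainerRuntimeConfig API (the object name and pool selector below are assumptions for the example):

    apiVersion: machineconfiguration.openshift.io/v1
    kind: ContainerRuntimeConfig
    metadata:
      name: keep-runc                    # illustrative name
    spec:
      machineConfigPoolSelector:
        matchLabels:
          pools.operator.machineconfiguration.openshift.io/worker: ""   # assumed target pool
      containerRuntimeConfig:
        defaultRuntime: runc             # opt back into runc while crun is the cluster default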
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Simplify debugging when a cluster fails to update to a new target release image, when that release image is unsigned or otherwise fails to pull.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
– Kubelet/CRI-O to verify RH images & release payload sigstore signatures
– ART will add sigstore signatures to core OCP images
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
These acceptance criteria are for all deployment flavors of OpenShift.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | no |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | Not Applicable |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | none |
Other (please specify) |
Add documentation for sigstore verification and gpg verification
For folks mirroring release images (e.g. disconnected/restricted-network):
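As a hedged sketch (registry hostnames are placeholders), this is the kind of ImageDigestMirrorSet such mirrored environments typically carry for release images; the signature-verification work above has to keep working against these mirrors:

    apiVersion: config.openshift.io/v1
    kind: ImageDigestMirrorSet
    metadata:
      name: release-image-mirror         # illustrative name
    spec:
      imageDigestMirrors:
      - source: quay.io/openshift-release-dev/ocp-release
        mirrors:
        - mirror.example.com/openshift-release-dev/ocp-release   # placeholder mirror registry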
Currently the CVO launches a Job and waits for it to complete to get manifests for an incoming release payload. But the Job controller doesn't bubble up details about why the pod has trouble (e.g. Init:SignatureValidationFailed), so to get those details, we need direct access to the Pod. The Job controller doesn't seem like it's adding much value here, so the goal of this Epic is to drop it and create and monitor the Pod ourselves, so we can deliver better reporting of version-Pod state.
When the version Pod fails to run, the cluster admin will likely need to take some action (clearing the update request, fixing a mirror registry, etc.). The more clearly we share the issues that the Pod is having with the cluster admin, the easier it will be for them to figure out their next steps.
oc adm upgrade and other ClusterVersion status UIs will be able to display Init:SignatureValidationFailed and other version-Pod failure modes directly. We don't expect to be able to give ClusterVersion consumers more detailed next-step advice, but hopefully the easier access to failure-mode context makes it easier for them to figure out next-steps on their own.
This change is purely an updates-team/OTA CVO pull request. No other dependencies.
Definition of done: failure modes like unretrievable image digests (e.g. quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000) or images with missing or unacceptable Sigstore signatures (with OTA-1304's ClusterImagePolicy) have failure-mode details in ClusterVersion's RetrievePayload message, instead of the current "Job was active longer than specified deadline".
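For context, a rough sketch of the kind of ClusterImagePolicy (Tech Preview, config.openshift.io/v1alpha1) referenced by OTA-1304; the scope, key material, and exact field layout are assumptions for illustration only:

    apiVersion: config.openshift.io/v1alpha1
    kind: ClusterImagePolicy
    metadata:
      name: release-image-policy         # illustrative name
    spec:
      scopes:
      - quay.io/openshift-release-dev/ocp-release
      policy:
        rootOfTrust:
          policyType: PublicKey
          publicKey:
            keyData: <base64-encoded cosign public key>   # placeholder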
Limited audience, and failures like Init:SignatureValidationFailed are generic, while CVO version-Pod handling is pretty narrow. This may be redundant work if we end up getting nice generic init-Pod-issue handling like RFE-5627. But even if the work ends up being redundant, thinning the CVO stack by removing the Job controller is kind of nice.
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Currently the CVO launches a Job and waits for it to complete to get manifests for an incoming release payload. But the Job controller doesn't bubble up details about why the pod has trouble (e.g. Init:SignatureValidationFailed), so to get those details, we need direct access to the Pod. The Job controller doesn't seem like it's adding much value here, so we probably want to drop it and create and monitor the Pod ourselves.
Definition of done: failure modes like unretrievable image digests (e.g. quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000) or images with missing or unacceptable Sigstore signatures (with OTA-1304's ClusterImagePolicy) have failure-mode details in ClusterVersion's RetrievePayload message, instead of the current "Job was active longer than specified deadline".
Not clear to me what we want to do with reason, which is currently DeadlineExceeded. Keep that? Split out some subsets like SignatureValidationFailed and whatever we get for image-pull-failures? Other?
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can get shared across namespaces | YES |
NA
NA
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model compared to the node-based (RHEL subscription manager) entitlement model. In order to provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces in order to prevent the need for cluster admins to copy these entitlements into each namespace, which leads to additional operational challenges for updating and refreshing them.
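As an illustrative sketch of the intended consumption model (resource names and namespaces are placeholders; the CR and CSI driver shapes follow the Shared Resource CSI driver), an entitlement Secret could be shared once and mounted from any namespace:

    apiVersion: sharedresource.openshift.io/v1alpha1
    kind: SharedSecret
    metadata:
      name: etc-pki-entitlement             # illustrative name
    spec:
      secretRef:
        name: etc-pki-entitlement           # assumed source secret
        namespace: openshift-config-managed # assumed source namespace
    ---
    # In a consuming pod spec (requires RBAC permission to "use" the SharedSecret):
    volumes:
    - name: entitlements
      csi:
        driver: csi.sharedresource.openshift.io
        readOnly: true
        volumeAttributes:
          sharedSecret: etc-pki-entitlement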
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
Epic Goal*
Remove the Shared Resource CSI Driver as a tech preview feature.
Why is this important? (mandatory)
Shared Resources was originally introduced as a tech preview feature in OpenShift Container Platform. After extensive review, we have decided to GA this component through the Builds for OpenShift layered product.
Expected GA will be alongside OpenShift 4.16. Therefore it is safe to remove in OpenShift 4.17
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We need to do a lot of R&D and fix some known issues (e.g., see linked BZs).
R&D targeted at 4.16 and productisation of this feature in 4.17
Goal
To make the current implementation of the HAProxy config manager the default configuration.
Objectives
The goal of this user story is to combine the code from the smoke test user story and results from the spike into an implementation PR.
Since multiple gaps were discovered, a feature gate will be needed to ensure the stability of OCP before the feature can be enabled by default.
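For reference, gating this would follow the usual FeatureGate pattern; the gate name below is purely hypothetical for illustration:

    apiVersion: config.openshift.io/v1
    kind: FeatureGate
    metadata:
      name: cluster
    spec:
      featureSet: CustomNoUpgrade
      customNoUpgrade:
        enabled:
        - DynamicConfigurationManager     # hypothetical gate name for the HAProxy config manager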
Phase 2 Goal:
for Phase-1, incorporating the assets from different repositories to simplify asset management.
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
The CAPI operator should ensure that, for clusters that are upgraded into a version of OpenShift supporting CAPI, a Cluster object exists in the openshift-cluster-api namespace, named after the infrastructure ID of the cluster.
The cluster spec should be populated with the reference to the infrastructure object and the status should be updated to reflect that the control plane is initialized.
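A minimal sketch of the Cluster object the operator would ensure exists, assuming an AWS cluster and a placeholder infrastructure ID; the infrastructureRef kind and API version depend on the platform:

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: mycluster-abc12               # placeholder infrastructure ID
      namespace: openshift-cluster-api
    spec:
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2   # assumed provider API version
        kind: AWSCluster                  # assumed platform
        name: mycluster-abc12
    # The operator would then update status (e.g. the ControlPlaneInitialized condition) to
    # reflect that the control plane of the upgraded cluster is already initialized.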
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.
Provide a network solution working out of the box, meeting expectations of a typical VM workload.
Ensure the feature can be used on clusters that are not dev-preview or tech-preview.
Involves a PR to the OpenShift API - need to involve the API team.
This task requires periodic e2e tests in openshift/origin and openshift/ovn-kubernetes asserting the correct behavior of the gated feature. Currently focused on that (must add a virt-aware TP lane).
This is the script that decides if the FG can be GA or not:
https://github.com/openshift/api/blob/master/tools/codegen/cmd/featuregate-test-analyzer.go
Interesting Slack thread: https://redhat-internal.slack.com/archives/CE4L0F143/p1729612519105239?thread_ts=1729607662.848259&cid=CE4L0F143
Primary user-defined networks can be managed from the UI and the user flow is seamless.
Placeholder feature for ccx-ocp-core maintenance tasks.
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
It looks like the insights-operator doesn't work with IPv6; there are log errors like this:
E1209 12:20:27.648684 37952 run.go:72] "command failed" err="failed to run groups: failed to listen on secure address: listen tcp: address fd01:0:0:5::6:8000: too many colons in address"
It's showing up in metal techpreview jobs.
The URL isn't being constructed correctly; use net.JoinHostPort instead of fmt.Sprintf. Some more details here: https://github.com/stbenjam/no-sprintf-host-port. There's a non-default linter in golangci-lint for this.
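A minimal Go sketch of the fix being described (not the actual insights-operator code): net.JoinHostPort brackets IPv6 literals, while fmt.Sprintf produces the malformed address seen in the log above.

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        host, port := "fd01:0:0:5::6", "8000"

        // Broken: yields "fd01:0:0:5::6:8000", which net.Listen rejects with
        // "too many colons in address".
        bad := fmt.Sprintf("%s:%s", host, port)

        // Correct: yields "[fd01:0:0:5::6]:8000", valid for both IPv4 and IPv6.
        good := net.JoinHostPort(host, port)

        fmt.Println(bad, good)
    }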
Component Readiness has found a potential regression in the following test:
[sig-architecture] platform pods in ns/openshift-insights should not exit an excessive amount of times
Test has a 56.36% pass rate, but 95.00% is required.
Sample (being evaluated) Release: 4.18
Start Time: 2024-12-02T00:00:00Z
End Time: 2024-12-09T16:00:00Z
Success Rate: 56.36%
Successes: 31
Failures: 24
Flakes: 0
View the test details report for additional context.
The admin console's alert rule details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.
Ensure removal of deprecated patternfly components from kebab-dropdown.tsx and alerting.tsx once this story and OU-257 are completed.
Technical debt led to a majority of the alerting components and all routing components being placed in a single file. This made the file very large, difficult to work with, and confusing for others to work with the routing items.
Alerting components have all been moved into their own separate files and the routing for the monitoring plugin is moved out of alerting.tsx into a routing specific file.
In order for customers or internal teams to troubleshoot better, they need to be able to see the dashboards created using Perses inside the OpenShift console. We will use the monitoring plugin, which already supports console dashboards coming from Grafana, to provide the Perses dashboard functionality.
Create a component in the monitoring plugin that can render a Perses dashboard based on the dashboard schema returned by the Perses API.
There are 2 dropdowns, one for selecting namespaces and another for selecting the dashboards in the selected namespace.
Previous work
Create dashboard flow chart
In order to allow customers and internal teams to see dashboards created using Perses, we must add them as new elements on the current dashboard list
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
Add a NID alias to OWNERS_ALIASES and update the OWNERS file in test/extended/router and add OWNERS file to test/extended/dns
As a cluster-admin, I want to run updates in discrete steps, updating the control plane and worker nodes independently.
I also want to back up and restore in case of a problematic upgrade.
Background:
This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of done tasks.
These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:
Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".
Definition of done:
This feature aims to enable customers of OCP to integrate 3rd party KMS solutions for encrypting etcd values at rest in accordance with:
https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/
Scenario:
For an OCP cluster with external KMS enabled:
How do the above scenarios impact the cluster? The API may be unavailable.
Goal:
Investigation Steps:
Detection:
Actuation:
User stories that might result in KCS:
Plugins research:
POCs:
Acceptance Criteria:
We did something similar for the aesgcm encryption type in https://github.com/openshift/api/pull/1413/.
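For context, today's at-rest encryption is driven by the APIServer config (aesgcm shown below, the type added in the linked PR); a KMS integration would presumably surface as an additional, currently hypothetical, encryption type in the same place:

    apiVersion: config.openshift.io/v1
    kind: APIServer
    metadata:
      name: cluster
    spec:
      encryption:
        type: aesgcm    # existing local encryption type; a "KMS" value here is hypothetical today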
Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span across multiple PVs.
This is also a key requirement for backup and DR solutions.
https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/
https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot
Productise the volume group snapshots feature as tech preview: provide docs and testing, as well as a feature gate to enable it, in order for customers and partners to test it in advance.
The feature should be graduated to beta upstream to become TP in OCP. Tests and CI must pass, and a feature gate should allow customers and partners to easily enable it. We should identify all OCP-shipped CSI drivers that support this feature and configure them accordingly.
CSI drivers development/support of this feature.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.
Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.
Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Add Volume Group Snapshots as Tech Preview. This is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span across multiple PVs.
We will rely on the newly beta promoted feature. This feature is driver dependent.
This will need a new external-snapshotter rebase plus removal of the feature gate check in csi-snapshot-controller-operator. Clusters that are freshly installed or upgraded from an older release will have the group snapshot v1beta1 API enabled, support for it enabled in the snapshot-controller, and the corresponding external-snapshotter sidecar shipped (see the sketch below).
No opt-in, no opt-out.
OCP itself will not ship any CSI driver that supports it.
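A rough usage sketch of the v1beta1 group snapshot API being enabled here (class name, namespace, and labels are placeholders; the backing CSI driver must support the feature):

    apiVersion: groupsnapshot.storage.k8s.io/v1beta1
    kind: VolumeGroupSnapshot
    metadata:
      name: my-app-group-snapshot         # illustrative name
      namespace: my-app                   # placeholder namespace
    spec:
      volumeGroupSnapshotClassName: csi-group-snapclass   # placeholder class provided by the driver
      source:
        selector:
          matchLabels:
            app: my-app                   # PVCs carrying this label are snapshotted together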
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
This is also a key requirement for backup and DR solutions, especially for OCP virt.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
External snapshotter rebase to the upstream version that includes the beta API.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Since we don't ship any driver with OCP that supports the feature, we need to have testing with ODF.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
We're looking at enabling it by default, which could introduce risk. Since the feature has recently landed upstream, we will need to rebase on a newer external snapshotter than we initially targeted.
When moving to v1 there may be non-backward-compatible changes.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As part of https://issues.redhat.com/browse/COS-2572, the OpenShift installer needs to be adapted so that it can work with bootimages which have no OCP content.
iSCSI boot is supported in RHEL and since the implementation of OCPSTRAT-749 it's also available in RHCOS.
Customers require using this feature in different bare metal environments, both on-prem and cloud-based.
Assisted Installer implements support for it in Oracle Cloud Infrastructure (MGMT-16167) to support their bare metal standard "shapes".
This feature extends this support to make it generic and supported in the Agent-Based Installer, the Assisted Installer and in ACM/MCE.
Support iSCSI boot in bare metal nodes, including platform baremetal and platform "none".
Assisted installer can boot and install OpenShift on nodes with iSCSI disks.
Agent-Based Installer can boot and install OpenShift on nodes with iSCSI disks.
MCE/ACM can boot and install OpenShift on nodes with iSCSI disks.
The installation can be done on clusters with platform baremetal and clusters with platform "none".
Support booting from iSCSI using ABI starting with OCP 4.16.
The following PRs are the gaps between release-4.17 branch and master that are needed to make the integration work on 4.17.
https://github.com/openshift/assisted-service/pull/6665
https://github.com/openshift/assisted-service/pull/6603
https://github.com/openshift/assisted-service/pull/6661
The feature has to be backported to 4.16 as well. TBD - list all the PRs that have to be backported.
Instructions to test the AI feature with local env - https://docs.google.com/document/d/1RnRhJN-fgofnVSBTA6mIKcK2_UW7ihbZDLGAVHSdpzc/edit#heading=h.bf4zg53460gu
When testing the agent-based installer using iSCSI with dev-scripts https://github.com/openshift-metal3/dev-scripts/pull/1727 it was found that the installer was not able to complete the installation when using multiple hosts. This same problem did not appear when using SNO.
The iSCSI sessions from all the hosts to their targets work fine until coreos-installer is run, at which time (before reboot) the connection to the target is lost and coreos-installer fails:
Jan 09 16:12:23 master-1 kernel: session1: session recovery timed out after 120 secs
Jan 09 16:12:23 master-1 kernel: sd 7:0:0:0: rejecting I/O to offline device
Jan 09 16:12:23 master-1 kernel: I/O error, dev sdb, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 2
Jan 09 16:12:23 master-1 installer[2937]: time="2025-01-09T16:12:23Z" level=info msg="\nError: syncing data to disk\n\nCaused by:\n Input/output error (os error 5)\n\nResetting partition table\n"
Jan 09 16:12:23 master-1 installer[2937]: time="2025-01-09T16:12:23Z" level=warning msg="Retrying after error: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- coreos-installer install --insecure -i /opt/install-dir/master-f3c24588-2129-483f-9dfb-8a8fe332a4bf.ign --append-karg rd.iscsi.firmware=1 --append-karg ip=enp6s0:dhcp --copy-network /dev/sdb], Error exit status 1, LastOutput \"Error: syncing data to disk\n\nCaused by:\n Input/output error (os error 5)\n\nResetting partition table\nError: syncing partition table to disk\n\nCaused by:\n Input/output error (os error 5)\""
On the host it can be seen that the session shows as logged out:
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.2023-01.com.example:master-1
Iface IPaddress: [default]
Iface HWaddress: default
Iface Netdev: default
SID: 1
iSCSI Connection State: Unknown
iSCSI Session State: FREE
Internal iscsid Session State: Unknown
The problem occurs because the iscsid service is not running. If it is started via iscsiadm, then coreos-installer can successfully write the image to disk.
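A hedged sketch of the manual workaround implied above, using standard systemd/iscsiadm commands (the target IQN and portal are placeholders):

    # Ensure the iSCSI daemon is running before coreos-installer writes the image
    systemctl start iscsid

    # Check the session state and re-login if needed
    iscsiadm -m session
    iscsiadm -m node -T iqn.2023-01.com.example:target0 -p 192.0.2.10:3260 --login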
Goal:
Graduate Gateway API with Istio to GA (full support) to unify the management of cluster ingress with a common, open, expressive, and extensible API.
Description:
Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.
The pluggable nature of the implementation of Gateway API enables support for additional and optional 3rd-party ingress technologies.
The team agrees we should be running the upstream GWAPI conformance tests, as they are readily available and we are an integration product with GWAPI.
We need to answer these questions asked at the March 23, 2023 GWAPI team meeting:
Would it make sense to do it as an optional job in the cluster-ingress-operator?
Is OSSM running the Gateway Conformance test in their CI?
Review what other implementers do with conformance tests to understand what we should do (Do we fork the repo? Clone it? Make a new repo?)
Epic Goal*
Drive the technical part of the Kubernetes 1.31 upgrade, including rebasing the openshift/kubernetes repository and coordination across the OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.18 cannot be released without Kubernetes 1.31
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
PRs:
Retro: Kube 1.31 Rebase Retrospective Timeline (OCP 4.18)
Retro recording: https://drive.google.com/file/d/1htU-AglTJjd-VgFfwE3z_dH5tKXT1Tes/view?usp=drive_web
OVN Kubernetes BGP support as a routing protocol for User Defined Network (Segmentation) pod and VM addressability.
OVN-Kubernetes BGP support enables the capability of dynamically exposing cluster scoped network entities into a provider’s network, as well as program BGP learned routes from the provider’s network into OVN.
OVN-Kubernetes currently has no native routing protocol integration, and relies on a Geneve overlay for east/west traffic, as well as third party operators to handle external network integration into the cluster. This enhancement adds support for BGP as a supported routing protocol with OVN-Kubernetes. The extent of this support will allow OVN-Kubernetes to integrate into different BGP user environments, enabling it to dynamically expose cluster scoped network entities into a provider’s network, as well as program BGP learned routes from the provider’s network into OVN. In a follow-on release, this enhancement will provide support for EVPN, which is a common data center networking fabric that relies on BGP.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Importing Routes from the Provider Network
Today in OpenShift there is no API for a user to be able to configure routes into OVN. In order for a user to change how cluster egress traffic is routed, the user leverages local gateway mode, which forces egress traffic to hop through the Linux host's networking stack, where a user can configure routes inside of the host via NMState. This manual configuration would need to be performed and maintained across nodes and VRFs within each node.
Additionally, if a user chooses to not manage routes within the host and use local gateway mode, then by default traffic is always sent to the default gateway. The only other way to affect egress routing is by using the Multiple External Gateways (MEG) feature. With this feature the user may choose to have multiple different egress gateways per namespace to send traffic to.
As an alternative, configuring BGP peers and which route-targets to import would eliminate the need to manually configure routes in the host, and would allow dynamic routing updates based on changes in the provider’s network.
Exporting Routes into the Provider Network
There exists a need for provider networks to learn routes directly to services and pods today in Kubernetes. MetalLB is already one solution whereby load balancer IPs are advertised by BGP to provider networks, and this feature development does not intend to duplicate or replace the function of MetalLB. MetalLB should be able to interoperate with OVN-Kubernetes, and be responsible for advertising services to a provider's network.
However, there is an alternative need to advertise pod IPs on the provider network. One use case is integration with 3rd party load balancers, where they terminate a load balancer and then send packets directly to OCP nodes with the destination IP address being the pod IP itself. Today these load balancers rely on custom operators to detect which node a pod is scheduled to and then add routes into its load balancer to send the packet to the right node.
By integrating BGP and advertising the pod subnets/addresses directly on the provider network, load balancers and other entities on the network would be able to reach the pod IPs directly.
Extending OVN-Kubernetes VRFs into the Provider Network
This is the most powerful motivation for bringing support of EVPN into OVN-Kubernetes. A previous development effort enabled the ability to create a network per namespace (VRF) in OVN-Kubernetes, allowing users to create multiple isolated networks for namespaces of pods. However, the VRFs terminate at node egress, and routes are leaked from the default VRF so that traffic is able to route out of the OCP node. With EVPN, we can now extend the VRFs into the provider network using a VPN. This unlocks the ability to have L3VPNs that extend across the provider networks.
Utilizing the EVPN Fabric as the Overlay for OVN-Kubernetes
In addition to extending VRFs to the outside world for ingress and egress, we can also leverage EVPN to handle extending VRFs into the fabric for east/west traffic. This is useful in EVPN DC deployments where EVPN is already being used in the TOR network, and there is no need to use a Geneve overlay. In this use case, both layer 2 (MAC-VRFs) and layer 3 (IP-VRFs) can be advertised directly to the EVPN fabric. One advantage of doing this is that with Layer 2 networks, broadcast, unknown-unicast and multicast (BUM) traffic is suppressed across the EVPN fabric. Therefore the flooding domain in L2 networks for this type of traffic is limited to the node.
Multi-homing, Link Redundancy, Fast Convergence
Extending the EVPN fabric to OCP nodes brings other added benefits that are not present in OCP natively today. In this design there are at least 2 physical NICs and links leaving the OCP node to the EVPN leaves. This provides link redundancy, and when coupled with BFD and mass withdrawal, it can also provide fast failover. Additionally, the links can be used by the EVPN fabric to utilize ECMP routing.
OVN Kubernetes support for BGP as a routing protocol.
Additional information on each of the above items can be found here: Networking Definition of Planned
CNO should deploy the new RouteAdvertisements OVN-K CRD.
When the OCP API flag to enable BGP support in the cluster is set, CNO should enable support on OVN-K through a CLI arg.
Customers in highly regulated environments are required to adopt strong ciphers. For control-plane components, this means all components must support the modern TLSProfile with TLS 1.3.
During internal discussions [1] for RFE-4992 and follow-up conversations, it became clear we need a dedicated CI job to track and validate that OpenShift control-plane components are aligned and functional with the TLSProfiles configured on the system.
[1] https://redhat-internal.slack.com/archives/CB48XQ4KZ/p1713288937307359
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | all |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | ?? |
Backport needed (list applicable versions) | n/a |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | n/a |
Other (please specify) | unknown |
If we try to enable a Modern TLS profile:
EnvVarControllerDegraded: no supported cipherSuites not found in observedConfig
Also, if we do manage to pass along the Modern TLS profile cipher suites, we see:
http2: TLSConfig.CipherSuites is missing an HTTP/2-required AES_128_GCM_SHA256 cipher (need at least one of TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 or TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256)
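For reference, the configuration that triggers the errors above is the standard tlsSecurityProfile field on the cluster APIServer resource (the same field exists on several other operator APIs):

    apiVersion: config.openshift.io/v1
    kind: APIServer
    metadata:
      name: cluster
    spec:
      tlsSecurityProfile:
        type: Modern
        modern: {}    # TLS 1.3 only; currently trips the cipher-suite errors quoted above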
This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.
To test:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.
As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.
As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up
(MCO-770, MCO-578, MCO-574 )
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.
Maybe:
Entitlements: MCO-1097, MCO-1099
Not Likely:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.
The original scope of this task is represented across this story & MCO-1494.
With OCL GA'ing soon, we'll need a blocking path within our e2e test suite that must pass before a PR can be merged. This story represents the first stage in creating the blocking path:
[REGRESSION] We need to reinvent the wheel for triggering rebuild functionality and the rebuild mechanism as pool labeling and annotation is no longer a favorable way to interact with layered pools
There are a few situations in which a cluster admin might want to trigger a rebuild of their OS image in addition to situations where cluster state may dictate that we should perform a rebuild. For example, if the custom Dockerfile changes or the machine-config-osimageurl changes, it would be desirable to perform a rebuild in that case. To that end, this particular story covers adding the foundation for a rebuild mechanism in the form of an annotation that can be applied to the target MachineConfigPool. What is out of scope for this story is applying this annotation in response to a change in cluster state (e.g., custom Dockerfile change).
Done When:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.
To test:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.
As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.
As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up
(MCO-770, MCO-578, MCO-574 )
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.
Maybe:
Entitlements: MCO-1097, MCO-1099
Not Likely:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.
As a follow-up to https://issues.redhat.com/browse/MCO-1284, the one field we identified that is best updated pre-GA is making the baseImagePullSecret optional. The builder pod should have the logic to fetch a base image pull secret itself if the user does not specify one via the MachineOSConfig object.
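A rough sketch of the MachineOSConfig shape involved, following the pre-GA v1alpha1 API (push spec and secret names are placeholders, and field names may shift with the v1 graduation); under this story, omitting baseImagePullSecret would make the builder fetch a base image pull secret on its own:

    apiVersion: machineconfiguration.openshift.io/v1alpha1
    kind: MachineOSConfig
    metadata:
      name: worker                        # illustrative name
    spec:
      machineConfigPool:
        name: worker
      buildInputs:
        imageBuilder:
          imageBuilderType: PodImageBuilder
        # baseImagePullSecret:            # becomes optional per this story
        #   name: base-pull-secret        # placeholder
        renderedImagePushSecret:
          name: push-secret               # placeholder
        renderedImagePushspec: registry.example.com/ocl/os-image:latest   # placeholder push spec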
To make OCL ready for GA, the first step would be graduating the MCO's APIs from v1alpha1 to v1. This requires changes in the openshift/api repo.
As a cluster admin for standalone OpenShift, I want to customize the prefix of the machine names created by CPMS due to company policies related to nomenclature. Implement the Control Plane Machine Set (CPMS) feature in OpenShift to support machine names where the user can set a custom name prefix. Note the prefix will always be suffixed by "<5-chars>-<index>", as this is part of the CPMS internal design.
A new field called machineNamePrefix has been added to CPMS CR.
This field would allow the customer to specify a custom prefix for the machine names.
The machine names would then be generated using the format: <machineNamePrefix>-<5-chars>-<index>
Where:
<machineNamePrefix> is the custom prefix provided by the customer
<5-chars> is a random 5 character string (this is required and cannot be changed)
<index> represents the index of the machine (0, 1, 2, etc.)
Ensure that if the machineNamePrefix is changed, the operator reconciles and succeeds in rolling out the changes.
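A minimal sketch of where the new field sits on the CPMS resource (the prefix value is a placeholder; the rest of the spec is omitted):

    apiVersion: machine.openshift.io/v1
    kind: ControlPlaneMachineSet
    metadata:
      name: cluster
      namespace: openshift-machine-api
    spec:
      replicas: 3
      machineNamePrefix: mycompany-cp     # machines become mycompany-cp-<5-chars>-<index>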
To be able to gather test data for this feature, we will need to introduce tech preview periodics, so we need to duplicate each of https://github.com/openshift/release/blob/8f8c7c981c3d81c943363a9435b6c48005eec6e3[…]control-plane-machine-set-operator-release-4.19__periodics.yaml and add the techpreview configuration.
It's configured as an env var, so copy each job, add the env var, and change the name to include -techpreview as a suffix:
env:
  FEATURE_SET: TechPreviewNoUpgrade
Bump openshift/api to vendor machineNamePrefix field and CPMSMachineNamePrefix feature-gate into cpms-operator
Define a new feature gate in openshift/api for this feature so that all the implementation can be safe guarded behind this gate.
Description of problem:
Consecutive dots are not allowed in machineNamePrefix, but per the prompt "spec.machineNamePrefix: Invalid value: "string": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character.", consecutive dots should be allowed. And I can create a machine with consecutive dots in its name on the provider's console: https://drive.google.com/file/d/1p5QLhkL4VI3tt3viO98zYG8uqb1ePTnB/view?usp=sharing
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2025-01-08-165032
How reproducible:
Always
Steps to Reproduce:
1. Update machineNamePrefix in the controlplanemachineset to contain consecutive dots, e.g. machineNamePrefix: huliu..azure
2. The change cannot be saved; the prompt is: spec.machineNamePrefix: Invalid value: "string": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character.
3. If changed to machineNamePrefix: huliu.azure, it saves successfully and the rollout succeeds.
Actual results:
Cannot save, get prompt
Expected results:
Can save successfully
Additional info:
New feature testing for https://issues.redhat.com/browse/OAPE-22
Provide a new field to the CPMS that allows to define a Machine name prefix
Feature description
oc-mirror v2 focuses on major enhancements that include making oc-mirror faster and more robust, introducing caching, and addressing more complex air-gapped scenarios. oc-mirror v2 is a rewritten version with three goals:
There was a selected bundle feature in v2 that needs to be removed in 4.18 because of its risk.
An alternative solution is required to unblock one of our customers.
Edge customers requiring computing on-site to serve business applications (e.g., point of sale, security & control applications, AI inference) are asking for a 2-node HA solution for their environments. Only two nodes at the edge, because a 3rd node induces too much cost, but they still need HA for critical workloads. To address this need, a 2+1 topology is introduced. It supports a small, cheap arbiter node that can optionally be remote/virtual to reduce on-site HW cost.
Support OpenShift on a 2+1 topology, meaning two primary nodes with large capacity to run workloads and the control plane, and a third small "arbiter" node which ensures quorum. See requirements for more details.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | yes |
Hosted control planes | no |
Multi node, Compact (three node), or Single node (SNO), or all | Multi node and Compact (three node) |
Connected / Restricted Network | both |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 and ARM |
Operator compatibility | full |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no |
Other (please specify) | n/a |
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
See requirements - there are two main groups of customers: co-located arbiter node, and remote arbiter node.
We will need to make sure the MCO contains the bootstrap files for the arbiter node similar to the files it contains for master nodes.
We need to make sure kubelet respects the new node role type for arbiter. This will be a simple update on the well_known_openshift_labels.go list to include the new node role.
Once the HighlyAvailableArbiter has been added to the ocp/api, we need to update the cluster-config-operator dependencies to reference the new change, so that it propagates to cluster installs in our payloads.
We need to update CEO (cluster etcd operator) to understand what an arbiter/witness node is so it can properly assign an etcd member on our less powerful node.
Update the dependencies for CEO for library-go and ocp/api to support the Arbiter additions, doing this in a separate PR to keep things clean and easier to test.
The console operand checks for valid infra control plane topology fields.
We'll need to add the HighlyAvailableArbiter flag to that check
The console operator performs a check for HighlyAvailable, we will need to add the HighlyAvailableArbiter check there so that it supports it on the same level as HA
Authentication Operator will need to be updated to take in the newly created `HighlyAvailableArbiter` topology flag, and to allow the minimum number of running kube-apiservers to equal 2.
We need to add the `HighlyAvailableArbiter` value to the controlPlaneTopology in ocp/api package as a TechPreview
https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L95-L103
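For illustration, the new enum would be reported through the Infrastructure status that operators such as the console and authentication operators consume; a 2+1 cluster would look something like this (the infrastructureTopology value is an assumption):

    apiVersion: config.openshift.io/v1
    kind: Infrastructure
    metadata:
      name: cluster
    status:
      controlPlaneTopology: HighlyAvailableArbiter   # new TechPreview enum value
      infrastructureTopology: HighlyAvailable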
Bump the openshift/api dependency to latest for the arbiter node enum.
Make sure pacemaker configured in TNF controls etcd.
Verify the following scenarios:
For TNF we need to replace our currently running etcd after the installation with the one that is managed by pacemaker.
This allows us to keep the following benefits:
AC:
Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.
When defining zones for vSphere, administrators can map regions to vSphere datacenters and zones to vSphere clusters.
There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts, but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites, and grouped by site in vSphere host groups.
In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.
Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.
When defining zones for vSphere, administrators can map regions to vSphere datacenters and zones to vSphere clusters.
There are use cases where vSphere clusters have only one cluster construct with all their ESXi hosts, but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites, and grouped by site in vSphere host groups.
In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.
As an OpenShift engineer, update the Infrastructure and Machine API objects so they can support host VM group zonal.
Acceptance criteria
As an OpenShift engineer, enable host VM group zonal in MAO so that compute nodes are deployed properly.
Acceptance Criteria:
As an OpenShift engineer, enable host VM group zonal in CPMS so that control plane nodes are redeployed properly.
Acceptance Criteria:
USER STORY:
As someone that installs OpenShift on vSphere, I want to install zonal via host and VM groups so that I can use a stretched physical cluster or use a cluster as a region and hosts as zones.
DESCRIPTION:
Required:
Nice to have:
ACCEPTANCE CRITERIA:
ENGINEERING DETAILS:
Configuration steps:
https://github.com/openshift/installer/compare/master...jcpowermac:installer:test-vm-host-group
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Phase 1 & 2 covers implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
Phase 2 Goal:
for Phase-1, incorporating the assets from different repositories to simplify asset management.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
The current ClusterOperator status conditions—Available, Progressing, Upgradeable, and Degraded—are set by the corecluster controller independently of the status of other controllers.
This approach does not align with the intended purpose of these conditions, which are meant to reflect the overall status of the operator, considering all the controllers it manages.
To address this, we should introduce controller-level conditions similar to the top-level ones. These conditions would influence an aggregated top-level status, which a new controller (the ClusterOperator controller) would then consolidate into the Available, Progressing, Upgradeable, and Degraded conditions.
Moreover, when running `oc get co` against the cluster-capi-operator, only the name and version are returned. The status is not rolled up into the additional columns as expected.
Before GA, this state information should be visible from `oc get co`
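As a rough illustration of the desired end state (values are made up), the new ClusterOperator controller would consolidate the controller-level conditions into a normal ClusterOperator status, so that `oc get co cluster-capi-operator` can populate the AVAILABLE/PROGRESSING/DEGRADED columns:
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: cluster-capi-operator
status:
  versions:
  - name: operator
    version: 4.19.0              # example value
  conditions:                    # abridged; lastTransitionTime etc. omitted
  - type: Available              # aggregated from the controller-level Available conditions
    status: "True"
    reason: AsExpected
  - type: Progressing            # True if any managed controller is still progressing
    status: "False"
  - type: Degraded               # True if any managed controller reports Degraded
    status: "False"
  - type: Upgradeable
    status: "True"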
OpenShift is traditionally a single-disk system, meaning the OpenShift installer is designed to install the OpenShift filesystem on one disk. However, new customer use cases have highlighted the need for the ability to configure additional disks during installation. Here are the specific use cases:
User Story:
As an OpenShift administrator, I need to be able to configure my OpenShift cluster to have additional disks on each vSphere VM so that I can use the new data disks for various OS needs.
Description:
The goal of this epic is to allow the cluster administrator to install, and to configure after installation, new machines with additional disks attached to each virtual machine for various OS needs.
Required:
Nice to Have:
Acceptance Criteria:
Notes:
User Story:
As an OpenShift Engineer I need to ensure that the CPMS Operator detects any changes needed when data disks are added to the CPMS definition.
Description:
This task is to verify whether any changes are needed in the CPMS Operator to handle the changed data disk definitions in the CPMS.
Acceptance Criteria:
- CPMS does not roll out changes when initial install is performed.
- Adding a disk to CPMS results in control plane roll out.
- Removing a disk from CPMS results in control plane roll out.
- No errors logged as a result of data disks being present in the CPMS definition.
Notes:
Ideally we just need to make sure the operator is updated to pull in the new CRD object definitions that contain the new data disk field.
User Story:
As an OpenShift Engineer I need to ensure that the MAPI Operator handles the new data disk definitions.
Description:
This task is to verify whether any changes are needed in the MAPI Operator to handle the changed data disk definitions in the CPMS.
Acceptance Criteria:
- Adding a disk to MachineSet does not result in new machines being rolled out.
- Removing a disk from MachineSet does not result in new machines being rolled out.
- After making changes to a MachineSet related to data disks, when MachineSet is scaled down and then up, new machines contain the new data disk configurations.
- All attempts to modify existing data disk definitions in an existing Machine definition are blocked by the webhook.
Notes:
The behaviors for the data disk field should be the same as all other provider spec level fields. We want to make sure that the new fields are no different than the others. This field is not hot swap capable for running machines. A new VM must be created for this feature to work.
User Story:
As an OpenShift Engineer I need to enhance the OpenShift installer to support creating a cluster with additional disks added to control plane and compute nodes so that I can use the new data disks for various OS needs.
Description:
This task is to enhance the installer to allow configuring data disks in the install-config.yaml. This will also require setting the necessary fields in MachineSet and Machine definitions, the important one being for CAPV to do the initial creation of disks for the configured masters.
Acceptance Criteria:
- install-config.yaml supports configuring data disks in all machinepools.
- CAPV has been updated with new multi disk support.
- CAPV machines are created that result in control plane nodes with data disks.
- MachineSet definitions for compute nodes are created correctly with data disk values from compute pool.
- CPMS definition for masters has the data disks configured correctly.
Notes:
We need to be sure that after installing a cluster, the cluster remains stable and has all correct configurations.
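To illustrate the intent of the description above, a hypothetical install-config.yaml fragment; the dataDisks field name and its sub-fields are assumptions, not the final schema:
controlPlane:
  name: master
  replicas: 3
  platform:
    vsphere:
      dataDisks:                 # assumed field name
      - name: etcd               # illustrative: a dedicated disk for etcd data
        sizeGiB: 100
compute:
- name: worker
  replicas: 3
  platform:
    vsphere:
      dataDisks:
      - name: container-storage  # illustrative: a dedicated disk for container storage
        sizeGiB: 200
The installer would then propagate these values into the CAPV manifests for the control plane, the CPMS definition, and the compute MachineSets, per the acceptance criteria above.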
User Story:
As an OpenShift Engineer I need to create a new feature gate and CRD changes for vSphere multi disk so that we can gate the new function until all bugs are ironed out.
Description:
This task is to create the new feature gate to be used by all logical changes around multi-disk support for vSphere. We also need to update the types for the vSphere machine spec to include a new array field that contains data disk definitions.
Acceptance Criteria:
- New feature gate exists for components to use.
- Initial changes to the CRD for data disks are present for components to start using.
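For reference (not a deliverable of this story), gated functionality like this is typically exercised in a test cluster by enabling a feature set on the cluster FeatureGate resource, for example:
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade   # enables tech-preview gates; assumed home of the new vSphere multi-disk gate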
See the UDN Sync Meeting notes: https://docs.google.com/document/d/1wjT8lSDfG6Mj-b45_p9SlbpCVG0af-pjDlmAUnM5mOI/edit#bookmark=id.zemsagq71ou2
In our current UDN API, the subnets field is always mandatory for the primary role and optional for the secondary role. This is because users are allowed to have a pure L2 without subnets for secondary networks. However, in the future, if we want to add egress support on secondary networks, we might need subnets...
CNV has many different use cases:
This card tracks the design changes to the API and the code changes needed to implement this. See https://github.com/openshift/enhancements/pull/1708 for details.
We want to do Network Policies, not MultiNetworkPolicies
Review, refine and harden the CAPI-based Installer implementation introduced in 4.16
From the implementation of the CAPI-based Installer started with OpenShift 4.16 there is some technical debt that needs to be reviewed and addressed to refine and harden this new installation architecture.
Review existing implementation, refine as required and harden as possible to remove all the existing technical debt
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
There should not be any user-facing documentation required for this work
We need a place to add tasks that are not feature oriented.
Most of the platform subdirectories don't have OWNERS files
we should add the aliases for everything that's missing
backport to 4.16
This feature aims to comprehensively refactor and standardize various components across HCP, ensuring consistency, maintainability, and reliability. The overarching goal is to increase customer satisfaction by increasing speed to market, and to save engineering budget by reducing incidents/bugs. This will be achieved by reducing technical debt, improving code quality, and simplifying the developer experience across multiple areas, including CLI consistency, NodePool upgrade mechanisms, networking flows, and more. By addressing these areas holistically, the project aims to create a more sustainable and scalable codebase that is easier to maintain and extend.
Over time, the HyperShift project has grown organically, leading to areas of redundancy, inconsistency, and technical debt. This comprehensive refactor and standardization effort is a response to these challenges, aiming to improve the project's overall health and sustainability. By addressing multiple components in a coordinated way, the goal is to set a solid foundation for future growth and development.
Ensure all relevant project documentation is updated to reflect the refactored components, new abstractions, and standardized workflows.
This overarching feature is designed to unify and streamline the HCP project, delivering a more consistent, maintainable, and reliable platform for developers, operators, and users.
Goal
Refactor and modularize controllers and other components to improve maintainability, scalability, and ease of use.
As a (user persona), I want to be able to:
https://issues.redhat.com//browse/HOSTEDCP-1801 introduced a new abstraction to be used by ControlPlane components. However, when a component's or a sub-resource's predicate changes to false, the resources are not removed from the cluster. All resources should be deleted from the cluster.
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
As a (user persona), I want to be able to:
https://issues.redhat.com//browse/HOSTEDCP-1801 introduced a new abstraction to be used by ControlPlane components. We need to refactor every component to use this abstraction.
Description of criteria:
All ControlPlane Components are refactored:
Example PR to refactor cloud-credential-operator : https://github.com/openshift/hypershift/pull/5203
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
Provide a PR with a CNO standard refactor
Example PR to refactor HCCO: https://github.com/openshift/hypershift/pull/4860
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
Provide a PR with a CSI standard refactor
Example PR to refactor HCCO: https://github.com/openshift/hypershift/pull/4860
docs: https://github.com/openshift/hypershift/blob/main/support/controlplane-component/README.md
Improve the consistency and reliability of APIs by enforcing immutability and clarifying service publish strategy support.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
PopupKebabMenu is orphaned and contains a reference to `@patternfly/react-core/deprecated`. It and related code should be removed so we can drop PF4 and adopt PF6.
Before we can adopt PatternFly 6, we need to drop PatternFly 4. We should drop 4 first so we can understand what impact if any that will have on plugins.
AC:
https://github.com/openshift/console/blob/master/frontend/packages/console-shared/src/components/actions/menu/ActionMenuItem.tsx#L3 contains a reference to `@patternfly/react-core/deprecated`. In order to drop PF4 and adopt PF6, this reference needs to be removed.
This component was never finished and should be removed as it includes a reference to `@patternfly/react-core/deprecated`, which blocks the removal of PF4 and the adoption of PF6.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
As a developer, I do not want to maintain the code for a project that is already dead.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
K8s 1.31 introduces VolumeAttributesClass as beta (code in external provisioner). We should make it available to customers as tech preview.
VolumeAttributesClass allows PVCs to be modified after their creation and while attached. There is a vast number of parameters that can be updated, but the most popular is to change the QoS values. The parameters that can be changed depend on the driver used.
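For reference, a minimal upstream-style example; the driver name and parameters are illustrative and depend on the CSI driver in use. An admin creates a VolumeAttributesClass, and a PVC opts in via spec.volumeAttributesClassName, which can be changed after creation to modify attributes such as QoS:
apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: ebs.csi.aws.com        # illustrative driver
parameters:                        # driver-specific; QoS values are the common case
  iops: "4000"
  throughput: "250"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  volumeAttributesClassName: gold  # can later be updated to point at a different class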
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Productise VolumeAttributesClass as TP in anticipation of GA. Customers can start testing VolumeAttributesClass.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | N/A core storage |
Backport needed (list applicable versions) | None |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD for TP |
Other (please specify) | n/A |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP user, I want to change parameters of my existing PVC such as the QoS attributes.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
UI for TP
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
There have been some limitations and complaints about the fact that PVC attributes are sealed after their creation, preventing customers from updating them. This is particularly impactful when a specific QoS is set and the volume requirements change.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Customers should not use it in production at the moment.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Document VolumeAttributesClass creation and how to update a PVC. Mention any limitations. Mention that it is Tech Preview (no upgrade). Add driver support information if needed.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Check which drivers support it for which parameters.
Support upstream feature "VolumeAttributesClass" in OCP as Beta, i.e. test it and have docs for it.
A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:
This feature seeks to provide mechanisms that bring the upper time boundary for delivering such fixes in line with the current HyperShift Operator <24h expectation.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed (ROSA and ARO) |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported ROSA/HCP topologies |
Connected / Restricted Network | All supported ROSA/HCP topologies |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All supported ROSA/HCP topologies |
Operator compatibility | CPO and Operators depending on it |
Backport needed (list applicable versions) | TBD |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | No |
Discussed previously during incident calls. Design discussion document
SOP needs to be defined for:
The default PR and push Tekton pipeline files that Konflux generates always run builds. We can be more efficient with resources.
It is necessary to get the builds off of main for CPO overrides.
Rolling out new versions of HyperShift Operator or Hosted Control Plane components such as HyperShift's Control Plane Operator will no longer carry the possibility of triggering a Node rollout that can affect customer workloads running on those nodes
The exhaustive list of causes for customer NodePool rollouts will be:
Customers will have visibility on rollouts that are pending so that they can effect a rollout of their affected nodepools at their earliest convenience
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Managed (ROSA and ARO) |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported Managed Hosted Control Plane topologies and configurations |
Connected / Restricted Network | All supported Managed Hosted Control Plane topologies and configurations |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All supported Managed Hosted Control Plane topologies and configurations |
Operator compatibility | N/A |
Backport needed (list applicable versions) | All supported Managed Hosted Control Plane releases |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | Yes. Console representation of Nodepools/Machinepools should indicate pending rollouts and allow them to be triggered |
Other (please specify) |
Past incidents where fixes to ignition generation resulted in rollouts unexpected by the customer, with workload impact
There should be an easy way to see and understand the content of queued updates, and to trigger them
SOPs for the observability above
ROSA documentation for queued updates
ROSA/HCP and ARO/HCP
test to capture HO updates causing nodePool reboots
The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments.
The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but was also useful for a specific set of other customer use cases outside of that context. As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most demanding customer deployments.
Key enhancements include observability and blocked traffic across paths if IPsec encryption is not functioning properly.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default. This encryption must scale to the largest of deployments.
Questions to be addressed:
While running IPsec e2e tests in CI, the data plane traffic is not flowing with the desired traffic type (esp or udp). For example, in IPsec mode External, the traffic type is seen as esp for EW traffic, but it is supposed to be geneve (udp) traffic.
This issue was reproducible on a local cluster after many attempts, and we noticed that IPsec states are not cleaned up on the node, which is residue from a previous test run with IPsec Full mode.
[peri@sdn-09 origin]$ kubectl get networks.operator.openshift.io cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2024-05-13T18:55:57Z"
  generation: 1362
  name: cluster
  resourceVersion: "593827"
  uid: 10f804c9-da46-41ee-91d5-37aff920bee4
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      egressIPConfig: {}
      gatewayConfig:
        ipv4: {}
        ipv6: {}
        routingViaHost: false
      genevePort: 6081
      ipsecConfig:
        mode: External
      mtu: 1400
      policyAuditConfig:
        destination: "null"
        maxFileSize: 50
        maxLogFiles: 5
        rateLimit: 20
        syslogFacility: local0
    type: OVNKubernetes
  deployKubeProxy: false
  disableMultiNetwork: false
  disableNetworkDiagnostics: false
  logLevel: Normal
  managementState: Managed
  observedConfig: null
  operatorLogLevel: Normal
  serviceNetwork:
  - 172.30.0.0/16
  unsupportedConfigOverrides: null
  useMultiNetworkPolicy: false
status:
  conditions:
  - lastTransitionTime: "2024-05-13T18:55:57Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2024-05-14T10:13:12Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-05-13T18:55:57Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2024-05-14T11:50:26Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2024-05-13T18:57:13Z"
    status: "True"
    type: Available
  readyReplicas: 0
  version: 4.16.0-0.nightly-2024-05-08-222442
[peri@sdn-09 origin]$ oc debug node/worker-0
Starting pod/worker-0-debug-k6nlm ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.23
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# toolbox
Checking if there is a newer version of registry.redhat.io/rhel9/support-tools available...
Container 'toolbox-root' already exists. Trying to start...
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-root')
toolbox-root
Container started successfully. To exit, type 'exit'.
[root@worker-0 /]# tcpdump -i enp2s0 -c 1 -v --direction=out esp and src 192.168.111.23 and dst 192.168.111.24
dropped privs to tcpdump
tcpdump: listening on enp2s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:07:01.854214 IP (tos 0x0, ttl 64, id 20451, offset 0, flags [DF], proto ESP (50), length 152)
worker-0 > worker-1: ESP(spi=0x52cc9c8d,seq=0xe1c5c), length 132
1 packet captured
6 packets received by filter
0 packets dropped by kernel
[root@worker-0 /]# exit
exit
sh-5.1# ipsec whack --trafficstatus
006 #20: "ovn-1184d9-0-in-1", type=ESP, add_time=1715687134, inBytes=206148172, outBytes=0, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #19: "ovn-1184d9-0-out-1", type=ESP, add_time=1715687112, inBytes=0, outBytes=40269835, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #27: "ovn-185198-0-in-1", type=ESP, add_time=1715687419, inBytes=71406656, outBytes=0, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #26: "ovn-185198-0-out-1", type=ESP, add_time=1715687401, inBytes=0, outBytes=17201159, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #14: "ovn-922aca-0-in-1", type=ESP, add_time=1715687004, inBytes=116384250, outBytes=0, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #13: "ovn-922aca-0-out-1", type=ESP, add_time=1715686986, inBytes=0, outBytes=986900228, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #6: "ovn-f72f26-0-in-1", type=ESP, add_time=1715686855, inBytes=115781441, outBytes=98, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
006 #5: "ovn-f72f26-0-out-1", type=ESP, add_time=1715686833, inBytes=9320, outBytes=29002449, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
sh-5.1# ip xfrm state; echo ' '; ip xfrm policy
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0x7f7ddcf5 reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6158d9a0f4a28598500e15f81a40ef715502b37ecf979feb11bbc488479c8804598011ee 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x18564, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0xda57e42e reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x810bebecef77951ae8bb9a46cf53a348a24266df8b57bf2c88d4f23244eb3875e88cc796 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0xf84f2fcf reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x0f242efb072699a0f061d4c941d1bb9d4eb7357b136db85a0165c3b3979e27b00ff20ac7 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0x9523c6ca reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xe075d39b6e53c033f5225f8be48efe537c3ba605cee2f5f5f3bb1cf16b6c53182ecf35f7 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x10fb2
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0x459d8516 reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xee778e6db2ce83fa24da3b18e028451bbfcf4259513bca21db832c3023e238a6b55fdacc 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x3ec45, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x3142f53a reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6238fea6dffdd36cbb909f6aab48425ba6e38f9d32edfa0c1e0fc6af8d4e3a5c11b5dfd1 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0xeda1ccb9 reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xef84a90993bd71df9c97db940803ad31c6f7d2e72a367a1ec55b4798879818a6341c38b6 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x02c3c0dd reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x858ab7326e54b6d888825118724de5f0c0ad772be2b39133c272920c2cceb2f716d02754 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x26f8e
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0xc9535b47 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xd7a83ff4bd6e7704562c597810d509c3cdd4e208daabf2ec074d109748fd1647ab2eff9d 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x53d4c, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0xb66203c8 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xc207001a7f1ed7f114b3e327308ddbddc36de5272a11fe0661d03eaecc84b6761c7ec9c4 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0x2e4d4deb reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x91e399d83aa1c2626424b502d4b8dae07d4a170f7ef39f8d1baca8e92b8a1dee210e2502 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0x52cc9c8d reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xb605451f32f5dd7a113cae16e6f1509270c286d67265da2ad14634abccf6c90f907e5c00 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0xe2735
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0x973119c3 reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x87d13e67b948454671fb8463ec0cd4d9c38e5e2dd7f97cbb8f88b50d4965fb1f21b36199 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x2af9a, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0x4c3580ff reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x2c09750f51e86d60647a60e15606f8b312036639f8de2d7e49e733cda105b920baade029 128
lastused 2024-05-14 14:36:43
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0xa3e469dc reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x1d5c5c232e6fd4b72f3dad68e8a4d523cbd297f463c53602fad429d12c0211d97ae26f47 128
lastused 2024-05-14 14:18:42
anti-replay esn context:
seq-hi 0x0, seq 0xb, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 000007ff
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0xdee8476f reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x5895025ce5b192a7854091841c73c8e29e7e302f61becfa3feb44d071ac5c64ce54f5083 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1f1a3
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src ::/0 dst ::/0
socket out priority 0 ptype main
src ::/0 dst ::/0
socket in priority 0 ptype main
src ::/0 dst ::/0
socket out priority 0 ptype main
src ::/0 dst ::/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 135
dir out priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 135
dir fwd priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 135
dir in priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 136
dir out priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 136
dir fwd priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 136
dir in priority 1 ptype main
sh-5.1# cat /etc/ipsec.conf
# /etc/ipsec.conf - Libreswan 4.0 configuration file
#
# see 'man ipsec.conf' and 'man pluto' for more information
#
# For example configurations and documentation, see https://libreswan.org/wiki/
config setup
# If logfile= is unset, syslog is used to send log messages too.
# Note that on busy VPN servers, the amount of logging can trigger
# syslogd (or journald) to rate limit messages.
#logfile=/var/log/pluto.log
#
# Debugging should only be used to find bugs, not configuration issues!
# "base" regular debug, "tmi" is excessive and "private" will log
# sensitive key material (not available in FIPS mode). The "cpu-usage"
# value logs timing information and should not be used with other
# debug options as it will defeat getting accurate timing information.
# Default is "none"
# plutodebug="base"
# plutodebug="tmi"
#plutodebug="none"
#
# Some machines use a DNS resolver on localhost with broken DNSSEC
# support. This can be tested using the command:
# dig +dnssec DNSnameOfRemoteServer
# If that fails but omitting '+dnssec' works, the system's resolver is
# broken and you might need to disable DNSSEC.
# dnssec-enable=no
#
# To enable IKE and IPsec over TCP for VPN server. Requires at least
# Linux 5.7 kernel or a kernel with TCP backport (like RHEL8 4.18.0-291)
# listen-tcp=yes
# To enable IKE and IPsec over TCP for VPN client, also specify
# tcp-remote-port=4500 in the client's conn section.
# if it exists, include system wide crypto-policy defaults
include /etc/crypto-policies/back-ends/libreswan.config
# It is best to add your IPsec connections as separate files
# in /etc/ipsec.d/
include /etc/ipsec.d/*.conf
sh-5.1# cat /etc/ipsec.d/openshift.conf
# Generated by ovs-monitor-ipsec...do not modify by hand!
config setup
uniqueids=yes
conn %default
keyingtries=%forever
type=transport
auto=route
ike=aes_gcm256-sha2_256
esp=aes_gcm256
ikev2=insist
conn ovn-f72f26-0-in-1
left=192.168.111.23
right=192.168.111.22
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp/6081
rightprotoport=udp
conn ovn-f72f26-0-out-1
left=192.168.111.23
right=192.168.111.22
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp
rightprotoport=udp/6081
conn ovn-1184d9-0-in-1
left=192.168.111.23
right=192.168.111.20
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@1184d960-3211-45c4-a482-d7b6fe995446
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp/6081
rightprotoport=udp
conn ovn-1184d9-0-out-1
left=192.168.111.23
right=192.168.111.20
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@1184d960-3211-45c4-a482-d7b6fe995446
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp
rightprotoport=udp/6081
conn ovn-922aca-0-in-1
left=192.168.111.23
right=192.168.111.24
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp/6081
rightprotoport=udp
conn ovn-922aca-0-out-1
left=192.168.111.23
right=192.168.111.24
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp
rightprotoport=udp/6081
conn ovn-185198-0-in-1
left=192.168.111.23
right=192.168.111.21
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp/6081
rightprotoport=udp
conn ovn-185198-0-out-1
left=192.168.111.23
right=192.168.111.21
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp
rightprotoport=udp/6081
sh-5.1#
The e2e-aws-ovn-ipsec-upgrade job is currently an optional job with always_run: false because the job is not reliable and its success rate is low. It must be made a mandatory CI lane after fixing the relevant issues.
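For illustration (the exact job stanza and file in openshift/release are not verified here), the change would be along these lines:
- as: e2e-aws-ovn-ipsec-upgrade
  always_run: true    # currently false; run on every PR once the job is reliable
  optional: false     # currently optional; make it blocking so it acts as a mandatory lane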
Ability to scale the amount of worker nodes in a Hosted Cluster in ARO/HCP to 500 nodes.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | Multi node |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All targeted archs for ARO/HCP |
Operator compatibility | N/A |
Backport needed (list applicable versions) | Yes. Must support 4.17 Hosted Clusters |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) |
Customer workloads require up to 500 nodes in a single hosted cluster.
OCPSTRAT-1251 saw a large amount of work to enable customers to reach 500 nodes in the special ROSA/HCP topology.
500 nodes is the sum of all the maximum machines for the combined machinepools of an ARO/HCP hosted cluster
There should be documentation/SOPs for spotting and reacting to possible noisy neighbor patterns.
ARO/HCP
The ClusterSizingConfiguration resource that was introduced for ROSA allows setting various effects based on the size of a HostedCluster. For a service like ARO where no dedicated request serving nodes are needed, it should be sufficient to apply the effects (as annotations) on HostedClusters based on their size label. The controller/scheduler should simply reconcile HostedClusters and apply the corresponding annotations based on the size label assigned to the given HostedCluster.
As a service provider, I want to be able to:
so that I can
Placeholder to track GA work for CFE-811
Update the API godoc to document that manual intervention is required for using .spec.tls.externalCertificate. Something simple like: "The Router service account needs to be granted read-only access to this secret; please refer to the OpenShift docs for additional details."
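For context, the manual step being referenced is roughly the following (namespace, role, and secret names are illustrative; the router service account is assumed to be router in openshift-ingress):
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: secret-reader                  # illustrative name
  namespace: my-app                    # namespace of the route and its certificate secret
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["my-tls-secret"]     # the secret referenced by .spec.tls.externalCertificate
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: router-secret-reader
  namespace: my-app
subjects:
- kind: ServiceAccount
  name: router                         # assumed router service account
  namespace: openshift-ingress
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: secret-reader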
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
link back to OCPSTRAT-1644 somehow
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As a developer, I want to upgrade the Kubernetes dependencies to 1.32
Epic Goal*
Drive the technical part of the Kubernetes 1.32 upgrade, including rebasing openshift/kubernetes repository and coordination across OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.19 cannot be released without Kubernetes 1.32
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Slack Discussion Channel - https://redhat.enterprise.slack.com/archives/C07V32J0YKF
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
To align with the 4.19 release, dependencies need to be updated to 1.32. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.19 release, dependencies need to be updated to 1.32. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.19 release, dependencies need to be updated to 1.32. This should be done by rebasing/updating as appropriate for the repository
As a customer of self-managed OpenShift, or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and status API which can be used by cluster admins to monitor the progress. The status command/API should also contain data to alert users about potential issues which can make updates problematic.
Here are common update improvements from customer interactions on the update experience.
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
On the call to discuss oc adm upgrade status roadmap to server side-implementation (notes) we agreed on the basic architectural direction and can start moving in that direction:
Let's start building this controller; we can have the controller perform the functionality currently present in the client and just expose it through an API. I am not sure how to deal with the fact that we won't have the API available until it merges into o/api, which is not soon. Maybe we can implement the controller over a temporary fork of o/api and rely on manually inserting the CRD into the cluster when we test the functionality? Not sure.
We need to avoid committing to implementation details and investing effort into things that may change though.
This card only brings a skeleton of the desired functionality to the DevPreviewNoUpgrade feature set. Its purpose is mainly to enable further development by putting the necessary bits in place so that we can start developing more functionality. There's not much point in automating testing of any of the functionality in this card, but it should be useful to start getting familiar with how the new controller is deployed and what are its concepts.
For seeing the new controller in action:
1. Launch a cluster that includes both the code and manifests. As of Nov 11, #1107 is not yet merged so you need to use launch 4.18,openshift/cluster-version-operator#1107 aws,no-spot
2. Enable the DevPreviewNoUpgrade feature set. CVO will restart and will deploy all functionality gated by this feature set, including the USC. It can take a bit of time, ~10-15m should be enough though.
3. Eventually, you should be able to see the new openshift-update-status-controller Namespace created in the cluster
4. You should be able to see an update-status-controller Deployment in that namespace
5. That Deployment should have one replica running and being ready. It should not crashloop or anything like that. You can inspect its logs for obvious failures and such. At this point, its log should, near its end, say something like "the ConfigMap does not exist so doing nothing"
6. Create the ConfigMap that mimics the future API (make sure to create it in the openshift-update-status-controller namespace): oc create configmap -n openshift-update-status-controller status-api-cm-prototype
7. The controller should immediately-ish insert a usc-cv-version key into the ConfigMap. Its content is a YAML-serialized ClusterVersion status insight (see design doc). As of OTA-1269 the content is not that important, but (1) the reference to the CV and (2) the versions field should be correct (see the inspection commands after this list).
8. The status insight should have a condition of Updating type. It should be False at this time (the cluster is not updating).
9. Start upgrading the cluster (it's a cluster-bot cluster with an ephemeral 4.18 version, so you'll need to use --to-image=pullspec and probably force it).
10. While updating, you should be able to observe the controller activity in the log (it logs some diffs), but also the content of the status insight in the ConfigMap changing. The versions field should change appropriately (and startedAt too), and the Updating condition should become True.
11. Eventually the update should finish and the Updating condition should flip to False again.
Some of these will turn into automated testcases, but it does not make sense to implement that automation while we're using the ConfigMap instead of the API.
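If you want to eyeball the insight during steps 7-11, something like the following should work (the namespace, ConfigMap name, and key come from the steps above; the exact YAML content is whatever the controller currently writes):

# Dump the YAML-serialized ClusterVersion status insight maintained by the USC
$ oc get configmap status-api-cm-prototype -n openshift-update-status-controller -o jsonpath='{.data.usc-cv-version}'
# Re-run it during the update to watch the versions field and the Updating condition change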
After OTA-960 is fixed, ClusterVersion/version and oc adm upgrade can be used to monitor the process of migrating a cluster to multi-arch.
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
But oc adm upgrade status reports COMPLETION 100% while the migration/upgrade is still ongoing.
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Completed
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      100% (33 operators updated, 0 updating, 0 waiting)
Duration:        15m
Operator Status: 33 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-95-224.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-33-81.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
ip-10-0-45-170.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100%         3 Total, 2 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-72-40.us-east-2.compute.internal    Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-22-179.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Update Health =
SINCE   LEVEL     IMPACT         MESSAGE
-       Warning   Update Speed   Node ip-10-0-95-224.us-east-2.compute.internal is unavailable
-       Warning   Update Speed   Node ip-10-0-72-40.us-east-2.compute.internal is unavailable

Run with --details=health for additional description and links to related online documentation

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        True          14m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.18.0-ec.3   True        True          False      63m     Working towards 4.18.0-ec.3
The reason is that PROGRESSING=True is not detected for co/machine-config because the status command checks only ClusterOperator.Status.Versions[name=="operator"]; it needs to check ClusterOperator.Status.Versions[name=="operator-image"] as well.
For grooming:
It will be challenging for the status command to check the operator image's pull spec because it does not know the expected value. CVO knows it because CVO holds the manifests (containing the expected value) from the multi-arch payload.
One "hacky" workaround is that the status command gets the pull spec from the MCO deployment:
oc get deployment -n openshift-machine-config-operator machine-config-operator -o json | jq -r '.spec.template.spec.containers[]|select(.name=="machine-config-operator")|.image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5
Note this co/machine-config -> deployment/machine-config-operator trick may not be feasible if we want to extend it to all cluster operators. But it should work as a hacky workaround to check only MCO.
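Putting the two together, a sketch of the MCO-only check (the resource and field names are the ones referenced above; the jsonpath filters are just one way to express the comparison):

# Version entry the status command looks at today
$ oc get co machine-config -o jsonpath='{.status.versions[?(@.name=="operator")].version}'
# The "operator-image" entry it would additionally need to compare
$ oc get co machine-config -o jsonpath='{.status.versions[?(@.name=="operator-image")].version}'
# Hacky source of the expected pull spec, taken from the MCO deployment
$ oc get deployment -n openshift-machine-config-operator machine-config-operator \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="machine-config-operator")].image}'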
We may claim that the status command is not designed for monitoring the multi-arch migration and suggest using oc adm upgrade instead. In that case, we can close this card as Obsolete/Won'tDo.
manifests.zip has the mockData/manifests for the status cmd that were taken during the migration.
oc#1920 started the work for the status command to recognize the migration, and we need to extend it to cover the comments from Petr's review:
Implement a new Informer controller in the Update Status Controller to watch Node resources in the cluster and maintain an update status insight for each. The informer will need to interact with additional resources such as MachineConfigPools and MachineConfigs, e.g. to discover the OCP version tied to config that is being reconciled on the Node, but should not attempt to maintain the MachineConfigPool status insights. Generally the node status insight should carry enough data for any client to be able to render a line that the oc adm upgrade status currently shows:
NAME ASSESSMENT PHASE VERSION EST MESSAGE
build0-gstfj-ci-prowjobs-worker-b-9lztv Degraded Draining 4.16.0-ec.2 ? failed to drain node: <node> after 1 hour. Please see machine-config-controller logs for more informatio
build0-gstfj-ci-prowjobs-worker-d-ddnxd Unavailable Pending ? ? Machine Config Daemon is processing the node
build0-gstfj-ci-tests-worker-b-d9vz2 Unavailable Pending ? ? Not ready
build0-gstfj-ci-tests-worker-c-jq5rk Unavailable Updated 4.16.0-ec.3 - Node is marked unschedulable
The basic expectations for Node status insights are described in the design docs but the current source of truth for the data structure is the NodeStatusInsight structure from https://github.com/openshift/api/pull/2012 .
Extend the Control Plane Informer in the Update Status Controller so it watches ClusterOperator resources in the cluster and maintains an update status insight for each.
The actual API structure for update status insights needs to be taken from whatever state https://github.com/openshift/api/pull/2012 is in at the moment. The story does not cover the actual API form nor how it is exposed by the cluster (depending on the state of API approval, the USC may still expose the API as a ConfigMap or an actual custom resource); it covers just the logic to watch ClusterOperator resources and produce a matching set of cluster operator status insights.
The basic expectations for cluster operator status insights are described in the design docs
Spun out of https://issues.redhat.com/browse/MCO-668
This aims to capture the work required to rotate the MCS-ignition CA + cert.
Original description copied from MCO-668:
Today in OCP there is a TLS certificate generated by the installer, which is called "root-ca" but is really "the MCS CA".
A key derived from this is injected into the pointer Ignition configuration under the "security.tls.certificateAuthorities" section, and this is how the client verifies it's talking to the expected server.
If this key expires (and by default the CA has a 10 year lifetime), newly scaled up nodes will fail in Ignition (and fail to join the cluster).
The MCO should take over management of this cert, and the corresponding user-data secret field, to implement rotation.
Reading:
- There is a section in the customer facing documentation that touches on this: https://docs.openshift.com/container-platform/4.13/security/certificate_types_descriptions/machine-config-operator-certificates.html
- There's a section in the customer facing documentation for this: https://docs.openshift.com/container-platform/4.13/security/certificate_types_descriptions/machine-config-operator-certificates.html that needs updating for clarification.
- There's a pending PR to openshift/api: https://github.com/openshift/api/pull/1484/files
- Also see old (related) bug: https://issues.redhat.com/browse/OCPBUGS-9890
- This is also separate to https://issues.redhat.com/browse/MCO-499 which describes the management of kubelet certs
The machinesets in the machine-api namespace reference a user-data secret (one per pool, and it can be customized) which stores the initial Ignition stub configuration pointing to the MCS, including the TLS cert. Today this doesn't get updated after creation.
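For reference, the piece of the user-data secret that would need to be kept in sync is the CA inside the pointer Ignition config; a minimal way to inspect it, assuming a stock worker-user-data secret name (the jq path follows the Ignition v3 spec):

$ oc get secret -n openshift-machine-api worker-user-data -o jsonpath='{.data.userData}' \
    | base64 -d | jq '.ignition.security.tls.certificateAuthorities'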
The MCO now has the ability to manage some fields of the machineset object as part of the managed bootimage work. We should extend that to also sync in the updated user-data secrets for the ignition tls cert.
The MCC should be able to parse both install-time-generated machinesets and user-created ones, so as not to break compatibility. One way users leverage this today is a custom secret + machineset carrying Ignition fields the MCO doesn't manage, for example to partition disks for different device types for nodes in the same pool. Extra care should be taken not to break this use case.
The CA/cert generated by the installer is not currently managed and the signing key is not preserved, so the cert controller we are adding in the MCO (leveraged from library-go) throws away everything and starts fresh. Normally this happens fairly quickly, so both the MCS and the user-data secrets are updated together. However, in certain cases (such as agent-based installations) where a bootstrap node joins the cluster late, it will have the old CA from the installer while the MCS will have a TLS cert signed by the new CA, resulting in invalid TLS cert errors.
To account for such cases, we have to ensure the first CA embedded in any machine matches the format expected by the cert controller. To do this, we'll have to do the following in the installer:
We need to maintain our dependencies across all the libraries we use in order to stay in compliance.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
The SourceSecretForm component needs to be refactored to address several tech debt issues: * Rename to AuthSecretForm
The BasicAuthSubform component needs to be refactored to address several tech debt issues: * Rename to BasicAuthSecretForm
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
Currently the console is using TypeScript 4, which is preventing us from upgrading to NodeJS 22. Due to that we need to update to TypeScript 5 (not necessarily the latest version).
AC:
Note: In case of higher complexity we should be splitting the story into multiple stories, per console package.
As a developer I want to make sure we are running the latest version of webpack in order to take advantage of the latest benefits and also keep current so that future updating is as painless as possible.
We are currently on v4.47.0.
Changelog: https://webpack.js.org/blog/2020-10-10-webpack-5-release/
By updating to version 5 we will need to update following pkgs as well:
AC: Update webpack to version 5 and determine what should be the ideal minor version.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Some Kubernetes clusters do not have direct Internet access and rely solely on proxies for communication, so OLM v1 needs to support proxies to enable this communication.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Some Kubernetes clusters do not have direct Internet access and rely solely on proxies for communication. This may be done for isolation, testing or to enhance security and minimise vulnerabilities. This is a fully supported configuration in OpenShift, with origin tests designed to validate functionality in proxy-based environments. Supporting proxies is essential to ensure your solution operates reliably within these secure and compliant setups.
To address this need, we have two key challenges to solve:
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | Restricted Network |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | 4.18 |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
OpenShift’s centralized proxy control via the proxies.config.openshift.io (a.k.a. proxy.config.openshift.io) resource makes managing proxies across a cluster easier. At the same time, vanilla Kubernetes requires a manual and decentralized proxy configuration, making it more complex to manage, especially in large clusters. There is no native Kubernetes solution that can adequately address the need for centralized proxy management.
Kubernetes lacks a built-in unified API like OpenShift’s proxies.config.openshift.io, which can streamline proxy configuration and management across any Kubernetes vendor. Consequently, Kubernetes requires more manual work to ensure the proxy configuration is consistent across the cluster, and this complexity increases with the scale of the environment. As such, vanilla Kubernetes does not provide a solution that can natively address proxy configuration across all clusters and vendors without relying on external tools or complex manual processes (such as that devised by OpenShift).
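For contrast, this is what the centralized OpenShift mechanism looks like from a client's point of view; the fields below are real fields of the config.openshift.io/v1 Proxy resource:

# Read the cluster-wide proxy settings that OLM v1 components would be expected to honor
$ oc get proxy/cluster -o jsonpath='{.status.httpProxy}{"\n"}{.status.httpsProxy}{"\n"}{.status.noProxy}{"\n"}'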
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
See RFC for more details
It looks like OLMv1 doesn't handle proxies correctly, aws-ovn-proxy job is permafailing https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-proxy/1861444783696777216
I suspect it's on the OLM operator side; are you looking at the cluster-wide proxy object and wiring it into your containers if set?
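One quick way to answer that on a live cluster, as a sketch (openshift-operator-controller is assumed to be the OLM v1 operator namespace, and the expectation that the proxy shows up as HTTP_PROXY/HTTPS_PROXY/NO_PROXY env vars is an assumption about how the wiring would be done):

# List the env vars on the OLM v1 controller containers to see whether the cluster proxy is wired in
$ oc get deployments -n openshift-operator-controller -o json \
    | jq '.items[].spec.template.spec.containers[] | {name: .name, env: .env}'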
The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.
BYO Identity will help facilitate CLI only workflows and capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD) similar to upstream Kubernetes.
Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customer/Users can then configure their IDPs to support the OIDC protocols and workflows they desire such as Client credential flow.
OpenShift OAuth server is still available as default option, with the ability to tune in the external OIDC provider as a Day-2 configuration.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal
The ability to provide a direct authentication workflow such that OpenShift can consume bearer tokens issued by external OIDC identity providers, replacing the built-in OAuth stack by deactivating/removing its components as necessary.
Why is this important? (mandatory)
OpenShift has its own built-in OAuth server which can be used to obtain OAuth access tokens for authentication to the API. The server can be configured with an external identity provider (including support for OIDC), however it is still the built-in server that issues tokens, and thus authentication is limited to the capabilities of the oauth-server.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
Done - Checklist (mandatory)
Description of problem:
This is a bug found during pre-merge test of 4.18 epic AUTH-528 PRs and filed for better tracking per existing "OpenShift - Testing Before PR Merges - Left-Shift Testing" google doc workflow.
co/console becomes degraded with AuthStatusHandlerDegraded after BYO external OIDC is configured on OCP and then removed (i.e. reverted back to the OAuth IDP).
Version-Release number of selected component (if applicable):
Cluster-bot build which is built at 2024-11-25 09:39 CST (UTC+800) build 4.18,openshift/cluster-authentication-operator#713,openshift/cluster-authentication-operator#740,openshift/cluster-kube-apiserver-operator#1760,openshift/console-operator#940
How reproducible:
Always (tried twice, both hit it)
Steps to Reproduce:
1. Launch a TechPreviewNoUpgrade standalone OCP cluster with above build. Configure htpasswd IDP. Test users can login successfully.
2. Configure BYO external OIDC in this OCP cluster using Microsoft Entra ID. KAS and console pods can roll out successfully. oc login and console login to Microsoft Entra ID can succeed.
3. Remove BYO external OIDC configuration, i.e. go back to original htpasswd OAuth IDP:

[xxia@2024-11-25 21:10:17 CST my]$ oc patch authentication.config/cluster --type=merge -p='
spec:
  type: ""
  oidcProviders: null
'
authentication.config.openshift.io/cluster patched

[xxia@2024-11-25 21:15:24 CST my]$ oc get authentication.config cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2024-11-25T04:11:59Z"
  generation: 5
  name: cluster
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: e814f1dc-0b51-4b87-8f04-6bd99594bf47
  resourceVersion: "284724"
  uid: 2de77b67-7de4-4883-8ceb-f1020b277210
spec:
  oauthMetadata:
    name: ""
  serviceAccountIssuer: ""
  type: ""
  webhookTokenAuthenticator:
    kubeConfig:
      name: webhook-authentication-integrated-oauth
status:
  integratedOAuthMetadata:
    name: oauth-openshift
  oidcClients:
  - componentName: cli
    componentNamespace: openshift-console
  - componentName: console
    componentNamespace: openshift-console
    conditions:
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Degraded
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2024-11-25T13:10:23Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "True"
      type: Available
    currentOIDCClients:
    - clientID: 95fbae1d-69a7-4206-86bd-00ea9e0bb778
      issuerURL: https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/v2.0
      oidcProviderName: microsoft-entra-id

KAS and console pods indeed can roll out successfully; and now oc login and console login indeed can succeed using the htpasswd user and password:

[xxia@2024-11-25 21:49:32 CST my]$ oc login -u testuser-1 -p xxxxxx
Login successful.
...

But co/console degraded, which is weird:

[xxia@2024-11-25 21:56:07 CST my]$ oc get co | grep -v 'True *False *False'
NAME      VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console   4.18.0-0.test-2024-11-25-020414-ci-ln-71cvsj2-latest   True        False         True       9h      AuthStatusHandlerDegraded: Authentication.config.openshift.io "cluster" is invalid: [status.oidcClients[1].currentOIDCClients[0].issuerURL: Invalid value: "": oidcClients[1].currentOIDCClients[0].issuerURL in body should match '^https:\/\/[^\s]', status.oidcClients[1].currentOIDCClients[0].oidcProviderName: Invalid value: "": oidcClients[1].currentOIDCClients[0].oidcProviderName in body should be at least 1 chars long]
Actual results:
co/console degraded, as above.
Expected results:
co/console is normal.
Additional info:
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.19, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn" labels.
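For individual namespaces this is driven by the standard Pod Security Admission labels; a minimal example of opting a namespace into enforcement ahead of the global switch (the namespace name is a placeholder):

$ oc label namespace my-app \
    pod-security.kubernetes.io/enforce=restricted \
    pod-security.kubernetes.io/enforce-version=latest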
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).
Each workload should pin the SCC with the least privilege, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should pin an SCC for tracking purposes).
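Pinning happens on the workload's pod template via the openshift.io/required-scc annotation; a sketch for a hypothetical platform deployment (namespace, deployment name, and the chosen SCC are placeholders):

# Pin the least-privileged SCC that satisfies the workload
$ oc -n openshift-example patch deployment example-operator --type=merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"openshift.io/required-scc":"restricted-v2"}}}}}'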
The following tables track progress.
# namespaces | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 | 82 | 82 |
fix needed | 68 | 68 | 68 | 68 | 68 | 68 |
fixed | 39 | 39 | 35 | 32 | 39 | 1 |
remaining | 29 | 29 | 33 | 36 | 29 | 67 |
~ remaining non-runlevel | 8 | 8 | 12 | 15 | 8 | 46 |
~ remaining runlevel (low-prio) | 21 | 21 | 21 | 21 | 21 | 21 |
~ untested | 3 | 2 | 2 | 2 | 82 | 82 |
# | namespace | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|---|
1 | oc debug node pods | ![]() | ![]() | ![]() | ![]() | ![]() | |
2 | openshift-apiserver-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
3 | openshift-authentication | ![]() | ![]() | ![]() | ![]() | ![]() | |
4 | openshift-authentication-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
5 | openshift-catalogd | ![]() | ![]() | ![]() | ![]() | ![]() | |
6 | openshift-cloud-credential-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
7 | openshift-cloud-network-config-controller | ![]() | ![]() | ![]() | #2496 | ||
8 | openshift-cluster-csi-drivers | #118 #5310 #135 | #524 #131 #306 #265 #75 | #170 #459 | ![]() | ||
9 | openshift-cluster-node-tuning-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
10 | openshift-cluster-olm-operator | ![]() | ![]() | ![]() | ![]() | n/a | n/a |
11 | openshift-cluster-samples-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
12 | openshift-cluster-storage-operator | ![]() | ![]() | #459 #196 | ![]() | ||
13 | openshift-cluster-version | #1038 | ![]() | ||||
14 | openshift-config-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
15 | openshift-console | ![]() | ![]() | ![]() | ![]() | ![]() | |
16 | openshift-console-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
17 | openshift-controller-manager | ![]() | ![]() | ![]() | ![]() | ![]() | |
18 | openshift-controller-manager-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
19 | openshift-e2e-loki | ![]() | ![]() | ![]() | ![]() | ![]() | |
20 | openshift-image-registry | #1008 | ![]() | ||||
21 | openshift-ingress | #1032 |||||
22 | openshift-ingress-canary | #1031 |||||
23 | openshift-ingress-operator | #1031 |||||
24 | openshift-insights | ![]() | ![]() | ![]() | ![]() | ![]() | |
25 | openshift-kni-infra | ![]() | ![]() | ![]() | ![]() | ![]() | |
26 | openshift-kube-storage-version-migrator | ![]() | ![]() | ![]() | ![]() | ![]() | |
27 | openshift-kube-storage-version-migrator-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
28 | openshift-machine-api | #1308 #1317 | #1311 | #407 | #315 #282 #1220 #73 #50 #433 | ![]() | |
29 | openshift-machine-config-operator | ![]() | ![]() | #4219 | #4384 | ![]() | |
30 | openshift-manila-csi-driver | ![]() | ![]() | ![]() | ![]() | ![]() | |
31 | openshift-marketplace | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
32 | openshift-metallb-system | ![]() | ![]() | ![]() | #241 | ||
33 | openshift-monitoring | #2298 #366 | #2498 | #2335 | ![]() | ||
34 | openshift-network-console | ![]() | ![]() | ||||
35 | openshift-network-diagnostics | ![]() | ![]() | ![]() | #2496 | ||
36 | openshift-network-node-identity | ![]() | ![]() | ![]() | #2496 | ||
37 | openshift-nutanix-infra | ![]() | ![]() | ![]() | ![]() | ![]() | |
38 | openshift-oauth-apiserver | ![]() | ![]() | ![]() | ![]() | ![]() | |
39 | openshift-openstack-infra | ![]() | ![]() | ![]() | ![]() | ||
40 | openshift-operator-controller | ![]() | ![]() | ![]() | ![]() | ![]() | |
41 | openshift-operator-lifecycle-manager | ![]() | ![]() | ![]() | ![]() | ![]() | |
42 | openshift-route-controller-manager | ![]() | ![]() | ![]() | ![]() | ![]() | |
43 | openshift-service-ca | ![]() | ![]() | ![]() | ![]() | ![]() | |
44 | openshift-service-ca-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
45 | openshift-sriov-network-operator | ![]() | ![]() | ![]() | ![]() | ![]() | |
46 | openshift-user-workload-monitoring | ![]() | ![]() | ![]() | ![]() | ![]() | |
47 | openshift-vsphere-infra | ![]() | ![]() | ![]() | ![]() | ![]() | |
48 | (runlevel) kube-system | ||||||
49 | (runlevel) openshift-cloud-controller-manager | ||||||
50 | (runlevel) openshift-cloud-controller-manager-operator | ||||||
51 | (runlevel) openshift-cluster-api | ||||||
52 | (runlevel) openshift-cluster-machine-approver | ||||||
53 | (runlevel) openshift-dns | ||||||
54 | (runlevel) openshift-dns-operator | ||||||
55 | (runlevel) openshift-etcd | ||||||
56 | (runlevel) openshift-etcd-operator | ||||||
57 | (runlevel) openshift-kube-apiserver | ||||||
58 | (runlevel) openshift-kube-apiserver-operator | ||||||
59 | (runlevel) openshift-kube-controller-manager | ||||||
60 | (runlevel) openshift-kube-controller-manager-operator | ||||||
61 | (runlevel) openshift-kube-proxy | ||||||
62 | (runlevel) openshift-kube-scheduler | ||||||
63 | (runlevel) openshift-kube-scheduler-operator | ||||||
64 | (runlevel) openshift-multus | ||||||
65 | (runlevel) openshift-network-operator | ||||||
66 | (runlevel) openshift-ovn-kubernetes | ||||||
67 | (runlevel) openshift-sdn | ||||||
68 | (runlevel) openshift-storage |
Implement Migration core for MAPI to CAPI for AWS
When customers use CAPI, there must be no negative effect from switching over to CAPI: migration of Machine resources should be seamless, and the fields in MAPI/CAPI should reconcile from both CRDs.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As QE have tried to test upstream CAPI pausing, we've hit a few issues with running the migration controller and cluster CAPI operator on a real cluster vs envtest.
This card captures the work required to iron out these kinks, and get things running (i.e not crashing).
I also think we want an e2e or some sort of automated testing to ensure we don't break things again.
Goal: Stop the CAPI operator crashing on startup in a real cluster.
Non-goals: getting the entire conversion flow running from CAPI -> MAPI and MAPI -> CAPI. We still need significant feature work before we're there.
For the MachineSet controller, we need to implement a forward conversion, converting the MachineAPI MachineSet to ClusterAPI.
This will involve creating the CAPI MachineSet if it does not exist, and managing the Infrastructure templates.
This card covers the case where MAPI is currently authoritative.
To enable CAPI MachineSets to still mirror MAPI MachineSets accurately, and to enable MAPI MachineSets to be implemented by CAPI MachineSets in the future, we need to implement a way to convert CAPI Machines back into MAPI Machines.
These steps assume that the CAPI Machine is authoritative, or that there are no MAPI Machines.
Epic Goal
This is the epic tracking the work to collect a list of TLS artifacts (certificates, keys and CA bundles).
This list will contain a set of required and optional metadata. Required metadata examples are ownership (name of the Jira component) and the ability to auto-regenerate the certificate after it has expired while offline. In most cases the metadata can be set via annotations on the secret/configmap containing the TLS artifact.
Components not meeting the required metadata will fail CI - i.e. when a pull request makes a component create a new secret, the secret is expected to have all necessary metadata present to pass CI.
This will be enforced by the (WIP) PR "API-1789: make TLS registry tests required".
Description of problem:
In order to make TLS registry tests required we need to make sure all OpenShift variants are using the same metadata for kube-apiserver certs. Hypershift uses several certs stored in the secret without accompanying metadata (namely component ownership).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a cluster administrator, I want to use Karpenter on an OpenShift cluster running in AWS to scale nodes instead of Cluster Autoscaler (CAS). I want to automatically manage heterogeneous compute resources in my OpenShift cluster without the additional manual task of managing node pools. Additional features I want are:
This feature covers the work done to integrate upstream Karpenter 1.x with ROSA HCP. This eliminates the need for manual node pool management while ensuring cost-effective compute selection for workloads. Red Hat manages the node lifecycle and upgrades.
The feature will be rolled out with ROSA (AWS) since it has more mature Karpenter ecosystem, followed by ARO (Azure) implementation(check OCPSTRAT-1498)
As a cluster-admin or SRE, I should be able to configure Karpenter with OCP on AWS. Both the CLI and UI should enable users to configure Karpenter and disable CAS.
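To make the scenario concrete, a minimal sketch of the kind of resource a cluster admin would author with Karpenter instead of fixed node pools; it follows the upstream Karpenter v1 NodePool shape, and the names and values here are illustrative assumptions rather than a committed ROSA HCP interface:

$ cat <<'EOF' | oc apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        # Let Karpenter choose among architectures/capacity types instead of fixed node pools
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "64"
EOF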
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed ROSA HCP |
Classic (standalone cluster) | |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | MNO |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64, ARM (aarch64) |
Operator compatibility | |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | yes - console |
Other (please specify) | rosa-cli |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
We decided to start by using validating admission policies to implement ownership of EC2NodeClass fields, so we can restrict CRUD on those fields to a particular service account.
This has some caveats:
If using validating policies for this proves not to be satisfactory, we'll need to consider alternatives, e.g.:
Move from the Karpenter NodePool to programmatically generated userdata.
For CAPI/MAPI-driven machine management, the cluster-machine-approver uses machine.status.ips to match the CSRs. With Karpenter there are no Machine resources.
We'll need to implement something similar. Some ideas:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
We'll need to implement something similar. Some ideas:
– Explore using the NodeClaim resource info like status.providerID to match the CSRs (a rough CLI sketch follows this list)
– Store the requesting IP when the ec2 instances query ignition and follow similar comparison criteria than machine approver to match CSRs
– Query AWS to get info and compare info to match CSRs
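A very rough CLI starting point for the first two ideas (resource kinds and fields follow upstream Karpenter and Kubernetes; this is exploration, not the chosen design):

# NodeClaims expose the provider ID and node name that could be correlated with CSR content
$ oc get nodeclaims.karpenter.sh -o custom-columns=NAME:.metadata.name,NODE:.status.nodeName,PROVIDERID:.status.providerID
# Pending kubelet-serving CSRs the approver would need to match against those claims
$ oc get csr --field-selector=spec.signerName=kubernetes.io/kubelet-serving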
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Goal Summary
This feature aims to make sure that the HyperShift operator and the control-plane it deploys uses Managed Service Identities (MSI) and have access to scoped credentials (also via access to AKS's image gallery potentially). Additionally, for operators deployed in customers account (system components), they would be scoped with Azure workload identities.
Today Azure installation requires manually created service principal which involves relations, permission granting, credential setting, credential storage, credentials rotation, credentials clean up, and service principal deletion. This is not only mundane and time-consuming but also less secure and risks access to resources by adversaries due to lack of credential rotation.
Employ Azure managed credentials which drastically reduce the steps required to just managed identity creation, permission granting, and resource deletion.
Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.
Currently, we are using the Contributor role for all control plane identities on ARO HCP. We should go ahead and restrict the identities for which we know existing ARO roles are already available.
The Cluster Ingress Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
The Cluster Network Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This epic covers the scope of automation-related stories in ODC
Automation enhancements for ODC
To date our work within the telecommunications radio access network space has focused primarily on x86-based solutions. Industry trends around sustainability, and more specific discussions with partners and customers, indicate a desire to progress towards ARM-based solutions with a view to production deployments in roughly a 2025 timeframe. This would mean being able to support one or more RAN partners' DU applications on ARM-based servers.
Depending on the source, 75-85% of service provider network power consumption is attributable to the RAN sites, with data centers making up the remainder. This means that in the face of increased downward pressure on both TCO and carbon footprint (the former for company performance reasons, the latter for regulatory reasons) it is an attractive place to make substantial improvements using economies of scale.
There are currently three main obvious thrusts to how to go about this:
This BU priority focuses on the third of these approaches.
Reference Documents:
The PerformanceProfile currently allows the user to select either the standard kernel (by default) or the realtime kernel, using the realTimeKernel field. However, for some use cases (e.g. Nvidia based ARM server) a kernel with 64k page size is required. This is supported through the MachineConfig kernelType, which currently supports the following options:
At some point it is likely that 64k page support will be added to the realtime kernel, which would likely mean another "realtime-64k-pages" option (or similar) would be added.
The purpose of this epic is to allow the 64k-pages (standard kernel with 64k pages) option to be selected in the PerformanceProfile and make it easy to support new kernelTypes added to the MachineConfig. There is a workaround for this today (shown below): applying an additional MachineConfig CR which overrides the kernelType, but this is awkward for the user.
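For completeness, the workaround looks roughly like this; kernelType: 64k-pages is the existing MachineConfig value, while the object name and pool label are placeholders:

$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-64k-pages
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelType: 64k-pages
EOF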
One option to support this in the PerformanceProfile would be to deprecate the existing realTimeKernel option and replace it with a new kernelType option. The kernelType option would support the same values as the MachineConfig kernelType (i.e. default, realtime, 64k-pages). The old option could be supported for backwards compatibility - attempting to use both options at the same time would be treated as an error. Another option would be to add a new kernelPageSize option (with values like default or 64k) and then internally map that to the MachineConfig kernelType (after validation that the combination of kernel type and page size was allowed).
This will require updates to the customer documentation and to the performance-profile-creator to support the new option.
This might also require updates to the workloadHints kernel-related sections.
Acceptance criteria:
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so we don't forget to improve it in a safer and more maintainable way.
Maintainability and debuggability, and fighting technical debt in general, are critical to keeping velocity and ensuring overall high quality.
https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
https://issues.redhat.com/browse/CNF-9566
https://issues.redhat.com/browse/OCPNODE-2217
Since OCP 4.18, crun will be used as the default runtime instead of runc.
We need to adjust NTO to conform with this change.
Epic Goal
Revert the PRs that added the extension and packages needed for kSAN Storage.
https://github.com/openshift/machine-config-operator/pull/4792 and https://github.com/openshift/os/pull/1701
Kubesan requires these host level dependencies:
To ensure that operator-provisioned nodes support kubesan, these deps must be added to the machine-config-operator's config.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
Capture the necessary accidental work to get CI / Konflux unstuck during the 4.19 cycle
Due to capacity problems on the s390x environment, the Konflux team recommended disabling the s390x platform from the PR pipeline.
Location:
PF component:
AC:
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
When running ./build-frontend.sh, I am getting the following warnings in the build log:
warning " > cypress-axe@0.12.0" has unmet peer dependency "axe-core@^3 || ^4".
warning " > cypress-axe@0.12.0" has incorrect peer dependency "cypress@^3 || ^4 || ^5 || ^6".
To fix:
The installer makes heavy use of its data/data directory, which contains hundreds of files in various subdirectories that are mostly used for inserting into ignition files. From these files, autogenerated code is created that embeds the contents in the installer binary.
Unfortunately, subdirectories that do not contain .go files are not regarded as Go packages and are therefore not included when building the installer as a library: https://go.dev/wiki/Modules#some-needed-files-may-not-be-present-in-populated-vendor-directory
This is currently handled in the installer fork repo by deleting the compile-time autogeneration and instead doing a one-time autogeneration that is checked in to the repo: https://github.com/openshift/installer-aro/pull/27/commits/26a5ed5afe4df93b6dde8f0b34a1f6b8d8d3e583
Since this does not exist in the upstream installer, we will need some way to copy the data/data associated with the current installer version into the wrapper repo - we should probably encapsulate this in a make vendor target. The wiki page above links to https://github.com/goware/modvendor which unfortunately doesn't work, because it assumes you know the file extensions of all of the files (e.g. .c, .h), and it can't handle directory names matching the glob. We could probably easily fix this by forking the tool and teaching it to ignore directories in the source. Alternatively, John Hixson has a script that can do something similar.
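A minimal sketch of what such a make vendor step could do, assuming the wrapper repo consumes the installer as a normal Go module (the paths and module name are the obvious ones, not a settled design):

# Locate the installer module in the module cache and copy its data/data tree into vendor/
INSTALLER_DIR="$(go list -m -f '{{.Dir}}' github.com/openshift/installer)"
mkdir -p vendor/github.com/openshift/installer/data
cp -R "${INSTALLER_DIR}/data/data" vendor/github.com/openshift/installer/data/
# The module cache is read-only, so restore write permissions on the copy
chmod -R u+w vendor/github.com/openshift/installer/data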
We are constantly bumping up against quotas when trying to create new ServicePrincipals per test. Example:
=== NAME TestCreateClusterV2
hypershift_framework.go:291: failed to create cluster, tearing down: failed to create infra: ERROR: The directory object quota limit for the Tenant has been exceeded. Please ask your administrator to increase the quota limit or delete objects to reduce the used quota.
We need to create a set of ServicePrincipals to use during testing, and we need to reuse them while executing the e2e-aks.
When adding assignContributorRole to assign Contributor roles for the appropriate scopes to existing SPs, we missed the assignment of the role over the DNS RG scope.
With the new changes in the PowerVS workspace delete process, we need to make sure all the child resources are cleaned up before attempting to delete the PowerVS instance.
Child resources:
Error propagation is, generally speaking, not 1-to-1. The operator status will generally capture the pool status, but the full error from the Controller/Daemon does not fully bubble up to the pool/operator, and the journal logs containing the error generally don't get bubbled up at all. This is very confusing for customers/admins working with the MCO without a full understanding of the MCO's internal mechanics:
Using “unexpected on-disk state” as an example, this can be caused by any amount of the following:
Etc. etc.
Since error use cases are wide and varied, there are many improvements we can perform for each individual error state. This epic aims to propose targeted improvements to error messaging and propagation specifically. The goals being:
With a side objective of observability, including reporting all the way to the operator status items such as:
Approaches can include:
The MCD will send an alert when a node fails to pivot to another MachineConfig. This could prevent an OS upgrade from succeeding. The alert contains information on which logs to look at.
The alert describes the following:
"Error detected in pivot logs on {{ $labels.node }} , upgrade may be blocked. For more details: oc logs -f -n {{ $labels.namespace }} {{ $labels.pod }} -c machine-config-daemon "
It is possible that an admin may not be able to determine the exact action to take after looking at the MCD pod logs. Adding a runbook (https://github.com/openshift/runbooks) can help admins troubleshoot and take appropriate action.
Acceptance Criteria:
These are items that the team has prioritized to address in 4.18.
In newer versions of OCP, we have changed our draining mechanism to only fail after 1 hour. This also means that the event which captures the failing drain was also moved to the failure at the 1hr mark.
Today, upgrade tests often fail with timeouts related to drain errors (PDB or other). There is no good way to distinguish which pods are failing and for what reason, so we cannot easily aggregate this data in CI to tackle PDB-related issues and improve the upgrade and CI pass rate.
If the MCD emitted the failing pod and the reason (PDB, timeout) as an event upon a drain failure, it would be easier to write a test that aggregates this data.
Context in this thread: https://coreos.slack.com/archives/C01CQA76KMX/p1633635861184300
Today the MCO bootstraps with a bootstrap MCC/MCS to generate and serve the master configs. When the in-cluster MCC comes up, it then tries to regenerate the same MachineConfig from the in-cluster MCs at that time.
This often causes drift and makes the install fail. See https://github.com/openshift/machine-config-operator/issues/2114 and https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit#heading=h.ny6l9ud82fxx for more context. For the most recent occurrence of this, see: https://github.com/openshift/machine-config-operator/pull/3513
Early on this helped us see differences between bootstrap and in-cluster behaviour more easily, but we do have the bootstrap machineconfig on-disk on the masters. In theory, we should just be able to use that directly and attempt to consolidate the changes.
In the case of drift, instead of failing, we can consider doing an immediate update to the latest version.
In https://issues.redhat.com/browse/MCO-1469, we are migrating my helper binaries into the MCO repository. I had to make changes to several of my helpers in the original repository to address bugs and other issues in order to unblock https://github.com/openshift/release/pull/58241. Because of the changes I requested during the PR review to make the integration easier, it may be a little tricky to incorporate all of my changes into the MCO repository, but it is still doable.
Done When:
In OCP 4.7 and before, you were able to see the MCD logs of the previous container post-upgrade. In newer versions it seems we no longer can. I am not sure if this is a change in kube pod logging behaviour, in how the pod gets shut down and brought up, or something in the MCO.
This however makes newer versions of the MCO relatively hard to debug, and in numerous bugs we could not pinpoint the source of the issue since we no longer have the necessary logs. We should find a way to properly save the previous boot's MCD logs if possible.
This epic has been repurposed for handling bugs and issues related to the DataImage API (see comments by Zane and the Slack discussions below). Some issues have already been added; more will be added to improve the stability and reliability of this feature.
Reference links:
Issue opened for IBIO: https://issues.redhat.com/browse/OCPBUGS-43330
Slack discussion threads:
https://redhat-internal.slack.com/archives/CFP6ST0A3/p1729081044547689?thread_ts=1728928990.795199&cid=CFP6ST0A3
https://redhat-internal.slack.com/archives/C0523LQCQG1/p1732110124833909?thread_ts=1731660639.803949&cid=C0523LQCQG1
Description of problem:
After deleting a BaremetalHost which has a related DataImage, the DataImage is still present. I'd expect the DataImage to be deleted together with the BMH.
Version-Release number of selected component (if applicable):
4.17.0-rc.0
How reproducible:
100%
Steps to Reproduce:
1. Create BaremetalHost object as part of the installation process using Image Based Install operator
2. Image Based Install operator will create a DataImage as part of the install process
3. Delete the BaremetalHost object
4. Check the DataImage assigned to the BareMetalHost
Actual results:
While the BaremetalHost was deleted the DataImage is still present: oc -n kni-qe-1 get bmh No resources found in kni-qe-1 namespace. oc -n kni-qe-1 get dataimage -o yaml apiVersion: v1 items: - apiVersion: metal3.io/v1alpha1 kind: DataImage metadata: creationTimestamp: "2024-09-24T11:58:10Z" deletionGracePeriodSeconds: 0 deletionTimestamp: "2024-09-24T14:06:15Z" finalizers: - dataimage.metal3.io generation: 2 name: sno.kni-qe-1.lab.eng.rdu2.redhat.com namespace: kni-qe-1 ownerReferences: - apiVersion: metal3.io/v1alpha1 blockOwnerDeletion: true controller: true kind: BareMetalHost name: sno.kni-qe-1.lab.eng.rdu2.redhat.com uid: 0a8bb033-5483-4fe8-8e44-06bf43ae395f resourceVersion: "156761793" uid: 2358cae9-b660-40e6-9095-7daabb4d9e48 spec: url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso status: attachedImage: url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso error: count: 0 message: "" lastReconciled: "2024-09-24T12:03:28Z" kind: List metadata: resourceVersion: ""
Expected results:
The DataImage gets deleted when the BaremetalHost owner gets deleted.
Additional info:
This is impacting automated test pipelines which use ImageBasedInstall operator as the cleanup stage gets stuck waiting for the namespace deletion which still holds the DataImage. Also deleting the DataImage gets stuck and it can only be deleted by removing the finalizer. oc get namespace kni-qe-1 -o yaml apiVersion: v1 kind: Namespace metadata: annotations: openshift.io/sa.scc.mcs: s0:c33,c2 openshift.io/sa.scc.supplemental-groups: 1001060000/10000 openshift.io/sa.scc.uid-range: 1001060000/10000 creationTimestamp: "2024-09-24T11:40:03Z" deletionTimestamp: "2024-09-24T14:06:14Z" labels: app.kubernetes.io/instance: clusters cluster.open-cluster-management.io/managedCluster: kni-qe-1 kubernetes.io/metadata.name: kni-qe-1 name: kni-qe-1-namespace open-cluster-management.io/cluster-name: kni-qe-1 pod-security.kubernetes.io/audit: restricted pod-security.kubernetes.io/audit-version: v1.24 pod-security.kubernetes.io/enforce: restricted pod-security.kubernetes.io/enforce-version: v1.24 pod-security.kubernetes.io/warn: restricted pod-security.kubernetes.io/warn-version: v1.24 name: kni-qe-1 resourceVersion: "156764765" uid: ee984850-665a-4f5e-8f17-0c44b57eb925 spec: finalizers: - kubernetes status: conditions: - lastTransitionTime: "2024-09-24T14:06:23Z" message: All resources successfully discovered reason: ResourcesDiscovered status: "False" type: NamespaceDeletionDiscoveryFailure - lastTransitionTime: "2024-09-24T14:06:23Z" message: All legacy kube types successfully parsed reason: ParsedGroupVersions status: "False" type: NamespaceDeletionGroupVersionParsingFailure - lastTransitionTime: "2024-09-24T14:06:23Z" message: All content successfully deleted, may be waiting on finalization reason: ContentDeleted status: "False" type: NamespaceDeletionContentFailure - lastTransitionTime: "2024-09-24T14:06:23Z" message: 'Some resources are remaining: dataimages.metal3.io has 1 resource instances' reason: SomeResourcesRemain status: "True" type: NamespaceContentRemaining - lastTransitionTime: "2024-09-24T14:06:23Z" message: 'Some content in the namespace has finalizers remaining: dataimage.metal3.io in 1 resource instances' reason: SomeFinalizersRemain status: "True" type: NamespaceFinalizersRemaining phase: Terminating
Tracking all things Konflux related for the Metal Platform Team
Full enablement should happen during OCP 4.19 development cycle
Description of problem:
The host that gets used in production builds to download the ISO will change soon. It would be good to allow this host to be set through configuration from the release team / ocp-build-data.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Tracking here all the work that needs to be done to configure the ironic containers (ironic-image and ironic-agent-image) to be ready for OCP 4.20.
This also includes CI configuration, tools, and documentation updates.
All the configuration bits need to happen at least one sprint BEFORE 4.20 branching (current target April 18 2025).
Docs tasks can be completed after the configuration tasks.
The CI tasks need to be completed RIGHT AFTER 4.20 branching happens.
Tag creation is now automated during OCP tag creation.
Builder creation has been automated.
We have been installing the sushy package because of dependencies in the sushy-oem-idrac and proliantutils packages, and then installing sushy from source, de facto updating the code in the image at build time.
This is not a great practice, and we also want to remove the dependency on the package entirely.
Both sushy-oem-idrac and proliantutils don't get many updates, so it should be fine to install them from source, and we can always roll back if things go sideways.
CAPI Agent Control Plane Provider and CAPI Bootstrap Provider will provide an easy way to install clusters through CAPI.
Those providers will not be generic OpenShift providers, as they are geared towards bare metal. They will leverage the Assisted Installer ZTP flow and will benefit bare-metal users by avoiding the need to provision a bootstrap node (as opposed to a regular OpenShift install, where the bootstrap node is required), while complying better with the CAPI interface.
milestones:
Yes
With the standard image (non-live-ISO) flow, we decoupled from BMHs.
The challenge now is that we cannot set the label from the controller directly, as the data will be available in-host only.
We should now:
The flow goes like this: * M3Machine will set providerID on itself from the BMH and from the label on the node - it expects them to be the same, else it won't succeed. This is how it shares data with the host (eventually).
Reference: https://issues.redhat.com/browse/KONFLUX-1611?filter=-1
ACM/MCE flow: https://miro.com/app/board/uXjVLqVek9E=/ (last diagram)
Adapt the current dockerfiles and add them to the upstream repos.
List of components:
Add the dockerfiles for the following components:
In order to release our products with konflux we need to pass the registry-standard EnterpriseContractPolicy.
There are a few small things we need to configure for all of our components:
The ignores need to be listed in the tekton pipelines.
The relevant repos are:
Currently our base images use ubi9 and MintMaker wants to upgrade it to 9.5, because MintMaker looks at the repo and tries to upgrade to the newest tag.
We should change the base image to use a repo that is just for rhel9.4.
Integration uses latest tags, so update the push-event pipelines to also push latest tag for the following components:
All components need to build images for the following architectures:
List of components:
A lot of the time, our pipelines as well as other teams' pipelines are stuck because they are unable to provision hosts with different architectures to build the images.
Because we currently don't use the multi-arch images we build with Konflux, we will stop building multi-arch for now and re-add those architectures when we need them.
Currently each component is in its own application, which means we need a Release object for each one.
We want to have a single application that has all of our components in order to be able to release with a single Release object.
The way to do this is:
List of components:
List of repos:
- assisted-events-stream
- assisted-image-service
- assisted-service
- auto-report
- bug-master-bot
- jira-unfurl-bot (auto-merge)
- prow-jobs-scraper
- assisted-installer (all config)
- assisted-installer-agent (all config)
Please describe what this feature is going to do.
To allow for installs on platforms where ISOs are not easily used/supported, the assisted installer should have an install flow that requires only a disk image and a user-data service.
Please describe what conditions must be met in order to mark this feature as "done".
Installation using a disk image must work as expected.
If the answer is "yes", please make sure to check the corresponding option.
This allows assisted installer to more closely map to the assumptions for installing using CAPI. It also allows us to more easily support installs on platforms where booting a live-iso isn't an option or is very difficult (OCI and AWS, for example)
I don't know. I assume some data about installs using this method vs ISO methods might exist, but I don't have it.
N/A this isn't so much a feature of the installer as a way of installing.
Today users can customize installer args when we run coreos-installer
The new install process should parse those args and preserve the functionality in a way that is seamless for the user.
The most relevant ones from https://coreos.github.io/coreos-installer/cmd/install/ seem to be:
We shouldn't have to worry about the partitioning flags since we're not writing a new disk and the options for fetching ignitions shouldn't apply since we're always getting ignition from the service and writing it locally.
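As a sketch of the argument handling this implies, the helper below filters a user-supplied coreos-installer argument list down to an allow-list; the chosen flags and the package/function names are assumptions for illustration, not a decided set.

```go
// Sketch only: keeping a subset of user-provided coreos-installer arguments in
// the new disk-based flow. The allow-list below and whether each flag takes a
// value are assumptions, not the final behavior.
package installerargs

import "strings"

// preserved maps an allowed flag to whether it takes a value.
var preserved = map[string]bool{
	"--append-karg":  true,
	"--delete-karg":  true,
	"--copy-network": false,
}

// FilterArgs keeps only the allow-listed flags (and their values) from the
// original coreos-installer argument list.
func FilterArgs(args []string) []string {
	var out []string
	for i := 0; i < len(args); i++ {
		flag, _, hasInlineValue := strings.Cut(args[i], "=")
		takesValue, known := preserved[flag]
		if !known {
			continue
		}
		out = append(out, args[i])
		// Accept the separated "--flag value" form as well as "--flag=value".
		if takesValue && !hasInlineValue && i+1 < len(args) {
			out = append(out, args[i+1])
			i++
		}
	}
	return out
}
```

For example, FilterArgs([]string{"--append-karg", "nosmt", "--offline"}) would keep only "--append-karg" and "nosmt" and drop the rest.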
When discovery is running on the real device rather than in a live ISO, assisted-installer needs to know how to install to the local disk using ostree rather than using coreos-installer.
Create that new install flow.
The agent or installer may be able to detect if we're running on the install target disk or a live ISO.
If they detect we're not running on a live ISO they should switch automatically to the new install flow.
If it's not possible to detect this we'll need some kind of API for the user to choose a flow.
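If detection turns out to be feasible, one possible heuristic is sketched below; the marker path (/run/ostree-live) is an assumption about how a CoreOS live environment could be recognized, not a confirmed mechanism.

```go
// Sketch only: deciding between the coreos-installer flow and the new
// install-to-local-disk flow. The marker file path is an assumption for
// illustration.
package installflow

import (
	"errors"
	"os"
)

// runningFromLiveISO reports whether the discovery environment appears to be
// a live ISO rather than the install target disk.
func runningFromLiveISO() (bool, error) {
	_, err := os.Stat("/run/ostree-live") // assumed live-environment marker
	if err == nil {
		return true, nil
	}
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	return false, err
}
```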
Allow users to do a basic OpenShift AI installation with one click in the "operators" page of the cluster creation wizard, similar to how the ODF or MCE operators can be installed.
This feature will be done when users can click on the "OpenShift AI" check box on the operators page of the cluster creation wizard and end up with an installation that can be used for basic tasks.
Yes.
Feature origin (who asked for this feature?)
Currently installing the OpenShift AI operator requires at least one supported GPU. For NVIDIA GPUs it is also necessary to disable secure boot, because otherwise it isn't possible to load the NVIDIA drivers. This ticket is about adding that validation, so that the problem will be detected and reported to the user before installing the cluster.
Currently, the monitoring stack is configured using a configmap. In OpenShift though the best practice is to configure operators using custom resources.
To start the effort we should create a feature gate behind which we can start implementing a CRD config approach. This allows us to iterate in smaller increments without having to support full feature parity with the config map from the start. We can start small and add features as they evolve.
One proposal for a minimal DoD was:
Feature parity should be planned in one or more separate epics.
This story covers the implementation of our initial CRD in CMO. When the feature gate is enabled, CMO watches a singleton CR (name TBD) and acts on changes. The initial feature could be a boolean flag (defaulting to true) that tells CMO to merge the configmap settings. If a user sets this flag to false, the configmap is ignored and default settings are applied.
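A hedged sketch of what such a minimal API type could look like in Go follows; the kind name, field name, and default are placeholders (the CR name is still to be decided above), not the agreed CMO API.

```go
// Sketch only: a minimal Go type for the singleton monitoring configuration CR
// discussed above. The kind name and field name are hypothetical placeholders.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ClusterMonitoring is a hypothetical singleton CR watched by CMO when the
// feature gate is enabled.
type ClusterMonitoring struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec ClusterMonitoringSpec `json:"spec"`
}

type ClusterMonitoringSpec struct {
	// MergeConfigMap tells CMO to keep merging the legacy configmap settings.
	// Defaults to true; when false, the configmap is ignored and defaults apply.
	// +kubebuilder:default=true
	MergeConfigMap *bool `json:"mergeConfigMap,omitempty"`
}
```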
This epic is to track stories that are not completed in MON-3865
For the issue https://issues.redhat.com//browse/OCPBUGS-32510 we identified that we need a separate metrics client cert for the metrics server, but for that we need to add an approver for metrics-server.
With Prometheus v3, the classic histogram's "le" and the summary's "quantile" label values will be floats.
All queries (in Alerts, Recording rules, dashboards, or interactive ones) with selectors that assume "le"/"quantile" values to be integers only should be adjusted.
Same applies to Relabel Configs.
Queries:
foo_bucket{le="1"} should be turned into foo_bucket{le=~"1(.0)?"}
foo_bucket{le=~"1|3"} should be turned into foo_bucket{le=~"1|3(.0)?"}
(same applies to the "quantile" label)
Relabel configs:
- action: foo
  regex: foo_bucket;(1|3|5|15.5)
  sourceLabels:
  - __name__
  - le

should be adjusted to

- action: foo
  regex: foo_bucket;(1|3|5|15.5)(\.0)?
  sourceLabels:
  - __name__
  - le
(same applies to the "quantile" label)
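For bulk adjustments, a small helper along these lines could rewrite integer matcher values into the float-tolerant form shown above. This is an illustrative sketch, not existing tooling; the package and function names are made up.

```go
// Sketch only: rewriting an integer `le` (or `quantile`) matcher value into a
// regex that also matches the Prometheus v3 float formatting,
// e.g. le="1" -> le=~"1(\.0)?". This mirrors the adjustment described above.
package lematchers

import "regexp"

var intValue = regexp.MustCompile(`^[0-9]+$`)

// FloatTolerant returns a regex string matching both the old integer form and
// the new float form of the given label value. Non-integer values are returned
// unchanged.
func FloatTolerant(value string) string {
	if intValue.MatchString(value) {
		return value + `(\.0)?`
	}
	return value
}
```

For example, FloatTolerant("1") yields `1(\.0)?`, which can then be used with an `=~` matcher as in the query examples above.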
Also, from upstream Prometheus:
Aggregation by the `le` and `quantile` labels for vectors that contain the old and new formatting will lead to unexpected results, and range vectors that span the transition between the different formatting will contain additional series. The most common use case for both is the quantile calculation via `histogram_quantile`, e.g. `histogram_quantile(0.95, sum by (le) (rate(histogram_bucket[10m])))`. The `histogram_quantile` function already tries to mitigate the effects to some extent, but there will be inaccuracies, in particular for shorter ranges that cover only a few samples.
A warning about this should suffice, as adjusting the queries would be difficult, if not impossible. Additionally, it might complicate things further.
See attached PRs for examples.
A downstream check to help surface such misconfigurations was added. An alert will fire for configs that aren't enabled by default and that may need to be adjusted.
For more details, see https://docs.google.com/document/d/11c0Pr2-Zn3u3cjn4qio8gxFnu9dp0p9bO7gM45YKcNo/edit?tab=t.0#bookmark=id.f5p0o1s8vyjf
Thanos needs to be upgraded to support Prometheus 3.
All origin tests were failing.
Alertmanager v1 is no longer supported.
The history of this epic starts with this PR which triggered a lengthy conversation around the workings of the image API with respect to importing imagestreams images as single vs manifestlisted. The imagestreams today by default have the `importMode` flag set to `Legacy` to avoid breaking behavior of existing clusters in the field. This makes sense for single arch clusters deployed with a single arch payload, but when users migrate to use the multi payload, more often than not, their intent is to add nodes of other architecture types. When this happens - it gives rise to problems when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality of existing users who just want to create an imagestream and use it with existing commands.
There was a discussion with David Eads and other staff engineers and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with/ upgraded to a multi payload. So a few things need to happen to achieve this:
Some open questions:
As documented in OCPCLOUD-1578, OpenShift would like to migrate from Machine API to Cluster API and eventually remove Machine API. This effort is going to require work from all affected platforms including OpenStack. This epic tracks the implementation of the OpenStack part of the mapi2capi and capi2mapi translation layer being added to cluster-capi-operator, based on the scoping done in OSASINFRA-3440.
Note that it is important that we implement both MAPI to CAPI and CAPI to MAPI for all platforms including OpenStack. This ensures we will always have a mirror copy on both sides, which is particularly useful for third-party components that don't have a way to interact with CAPI yet. In these situations, users can create CAPI resources while these components can (until updated) continue to fetch the MAPI mirror and work out the state of the cluster that way.
Repeating from OCPCLOUD-1578:
From an OpenStack perspective, we simply need to ensure we follow suit with other platforms.
TBD.
TBD.
We need to add support for fake MAPI and CAPI cluster/machineset/machine builders in the openshift/cluster-api-actuator-pkg package. These already exist for other providers, so this is likely to be an exercise in copying, pasting, and tweaking.
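The pattern being copied looks roughly like the sketch below; the type and method names are hypothetical stand-ins, not the actual cluster-api-actuator-pkg builders.

```go
// Sketch only: the fluent builder style used for fake machine resources in
// test helpers. Names (FakeOpenStackMachine, WithFlavor) are hypothetical
// illustrations of the pattern, not the real package API.
package builders

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type FakeOpenStackMachine struct {
	metav1.ObjectMeta
	Flavor string
	Image  string
}

type OpenStackMachineBuilder struct {
	machine FakeOpenStackMachine
}

func NewOpenStackMachineBuilder(name, namespace string) *OpenStackMachineBuilder {
	return &OpenStackMachineBuilder{machine: FakeOpenStackMachine{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	}}
}

// WithFlavor sets the flavor and returns the builder for chaining.
func (b *OpenStackMachineBuilder) WithFlavor(flavor string) *OpenStackMachineBuilder {
	b.machine.Flavor = flavor
	return b
}

// Build returns a copy of the accumulated fake machine.
func (b *OpenStackMachineBuilder) Build() FakeOpenStackMachine {
	return b.machine
}
```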
This EPIC regroups the tasks that need to be finished so we can deliver Hosted Control Plane on OpenStack as TechPreview.
Some tasks were initially in this EPIC and were de-prioritized to be done later once we have customer feedback. What remains in the current list are things we think can be achieved within the 4.19 cycle.
This EPIC will be used by QE to test the quality of the product, and the outcome will have a direct impact on whether this can be TechPreview or not. The HCP team will only accept it as TechPreview if we have QE coverage.
The initial scenario that QE agreed to start with is the following:
This task focuses on ensuring that all OpenStack resources automatically created by Hypershift for Hosted Control Planes are tagged with a unique identifier, such as the HostedCluster ID. These resources include, but are not limited to, servers, ports, and security groups. Proper tagging will enable administrators to clearly identify and manage resources associated with specific OpenShift clusters.
Acceptance Criteria:
When deploying a HostedCluster, etcd will be using the default CSI StorageClass but we can override it with --etcd-storage-class.
Customers should use a local storage CSI Storage class in production to avoid performance issues in their clusters.
Spike / Document: https://docs.google.com/document/d/1qdYHb7YAQJKSDLOziqLG8jaBQ80qEMchsozgVnI7qyc
This task also includes a doc change in Hypershift repo but depends on OSASINFRA-3681.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We need to send an enhancement proposal that would contain the design changes we suggest in openshift/api/config/v1/types_cluster_version.go to allow changing the log level of the CVO using an API configuration before implementing such changes in the API.
Definition of Done:
The ClusterVersionOperator API has been introduced in the DevPreviewNoUpgrade feature set. Enable the CVO in standalone OpenShift to change its log level based on the new API.
Definition of Done:
Make sure that we have dependencies updated and that we are building on the proper base images and with the proper toolchain for OpenShift 4.19.
As an SRE, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:
(1) Low customer interest of using Openshift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
Impacted areas based on CI:
alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml
Acceptance Criteria
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update OCP release number in OLM metadata manifests of:
The OLM metadata of the operators is typically in the /config/manifest directory of each operator. Example of such a bump: https://github.com/openshift/aws-efs-csi-driver-operator/pull/56
We should do it early in the release, so QE can identify new operator builds easily and they are not mixed with the old release.
This epic is part of the 4.18 initiatives we discussed, it includes:
origin should support calling an external test binary implemented using openshift-tests-extension. There's an external test binary already in the hyperkube repo: https://github.com/openshift/kubernetes/tree/master/openshift-hack/cmd/k8s-tests-ext
Here's the existing external binary using the legacy interface:
https://github.com/openshift/origin/blob/master/pkg/test/ginkgo/cmd_runsuite.go#L174-L179
That can just be removed and replaced with k8s-test-ext.
MVP requires for k8s-tests:
Additional things for later:
The design should be flexible enough to allow a scheduling algorithm that takes into account available resources/isolation, but the first pass doesn't need to implement it yet.
If an extension wants to produce artifacts, we need to tell it where to write them, e.g. via an EXTENSION_ARTIFACT_DIR environment variable.
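A minimal sketch of an extension honoring such a variable follows; the fallback directory and artifact file name are assumptions for illustration.

```go
// Sketch only: an extension binary writing an artifact to the directory passed
// via EXTENSION_ARTIFACT_DIR. The fallback and file name are assumptions.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	dir := os.Getenv("EXTENSION_ARTIFACT_DIR")
	if dir == "" {
		dir = "." // assumption: fall back to the current directory
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {
		log.Fatal(err)
	}
	path := filepath.Join(dir, "results.json")
	if err := os.WriteFile(path, []byte(`{"status":"ok"}`), 0o644); err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote artifact to %s", path)
}
```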
Add a caching layer so external binaries only need to be extracted once when running locally
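One way such a cache could work, sketched under the assumption that extracted binaries can be keyed by image digest; none of these names exist in origin today.

```go
// Sketch only: caching extracted external test binaries on local disk, keyed
// by image digest, so repeated local runs skip the extraction step. The cache
// layout and the extract callback are assumptions.
package extcache

import (
	"os"
	"path/filepath"
)

// CachedBinary returns the path to the extracted binary for imageDigest,
// calling extract only when the cache does not already hold it.
func CachedBinary(cacheDir, imageDigest string, extract func(dest string) error) (string, error) {
	dest := filepath.Join(cacheDir, imageDigest)
	if _, err := os.Stat(dest); err == nil {
		return dest, nil // already extracted once; reuse it
	}
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return "", err
	}
	if err := extract(dest); err != nil {
		return "", err
	}
	return dest, nil
}
```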
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of problem:
For various reasons, Pods may get evicted. Once they are evicted, the owner of the Pod should recreate the Pod so it is scheduled again.
With OLM, we can see that evicted Pods owned by CatalogSources are not rescheduled. The outcome is that all Subscriptions have a "ResolutionFailed=True" condition, which hinders an upgrade of the operator. Specifically, the customer is seeing an affected CatalogSource "multicluster-engine-CENSORED_NAME-redhat-operator-index" in the openshift-marketplace namespace, pod name: "multicluster-engine-CENSORED_NAME-redhat-operator-index-5ng9j".
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.21
How reproducible:
Sometimes, when Pods are evicted on the cluster
Steps to Reproduce:
1. Set up an OpenShift Container Platform 4.16 cluster, install various Operators
2. Create a condition that a Node will evict Pods (for example by creating DiskPressure on the Node)
3. Observe if any Pods owned by CatalogSources are being evicted
Actual results:
If Pods owned by CatalogSources are being evicted, they are not recreated / rescheduled.
Expected results:
When Pods owned by CatalogSources are evicted, they are recreated / rescheduled.
Additional info:
// NOTE: The nested `Context` containers inside the following `Describe` container are used to group certain tests based on the environments they demand.
// NOTE: When adding a test-case, ensure that the test-case is placed in the appropriate `Context` container.
// NOTE: The containers themselves are guaranteed to run in the order in which they appear.
var _ = g.Describe("[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set", g.Ordered, func() {
	defer g.GinkgoRecover()

	o.SetDefaultEventuallyTimeout(15 * time.Minute)
	o.SetDefaultEventuallyPollingInterval(5 * time.Second)
r := &runner{}
In OCP Origin we have the above test playing with global variables for poll interval and poll timeout which is causing all other tests in origin to have flakes.
A networking test is failing because we are not polling correctly: the above test overrode the default poll interval of 10ms and instead made it poll every 5 seconds, which caused the test to fail because our poll timeout was itself only 5 seconds.
Please don't use the global variables, or unset them after the test run is over.
Please note that this causes flakes that are hard to debug; we didn't know what was causing the poll interval to be 5 seconds instead of the default 10ms.
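One way a suite can keep the longer timeouts without leaking them to other tests is to scope the overrides to its ordered container and restore the stock defaults afterwards; this is a sketch, and the restored values are simply Gomega's documented defaults.

```go
// Sketch only: scoping Eventually defaults to one ordered container and
// restoring Gomega's stock defaults afterwards so other origin tests are not
// affected by the override.
package sketch

import (
	"time"

	g "github.com/onsi/ginkgo/v2"
	o "github.com/onsi/gomega"
)

var _ = g.Describe("collection profiles", g.Ordered, func() {
	g.BeforeAll(func() {
		o.SetDefaultEventuallyTimeout(15 * time.Minute)
		o.SetDefaultEventuallyPollingInterval(5 * time.Second)
	})

	g.AfterAll(func() {
		// Undo the overrides so the global defaults do not leak into other suites.
		o.SetDefaultEventuallyTimeout(1 * time.Second)
		o.SetDefaultEventuallyPollingInterval(10 * time.Millisecond)
	})
})
```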
Description of problem:
When Swift is not available (for any reason: 403, 404, etc.), Cinder is the backend for the Cluster Image Registry Operator with a PVC. The problem here is that we see an error that Swift is not available, but then no PVC is created.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Disable swiftoperator role for your user and no PVC will be created
Actual results:
E1122 15:37:26.301213 1 controller.go:379] unable to sync: unable to sync storage configuration: persistentvolumeclaims "image-registry-storage" not found, requeuing E1122 15:37:50.851275 1 swift.go:84] error listing swift containers: Expected HTTP response code [200 204 300[] when accessing [GET https://10.8.1.135:13808/v1/AUTH_6640775c6b5d4e5fa997fb9b85254da1/], but got 403 instead: <html><h1>Forbidden</h1><p>Access was denied to this resource.</p></html> I1122 15:37:50.858381 1 controller.go:294] object changed: *v1.Config, Name=cluster (metadata=false, spec=true): added:spec.storage.pvc.claim="image-registry-storage", changed:status.conditions.2.lastTransitionTime={"2024-11-22T15:37:26Z" -> "2024-11-22T15:37:50Z"} I1122 15:37:50.873526 1 controller.go:340] object changed: *v1.Config, Name=cluster (status=true): changed:metadata.generation={"12.000000" -> "11.000000"}, removed:metadata.managedFields.2.apiVersion="imageregistry.operator.openshift.io/v1", removed:metadata.managedFields.2.fieldsType="FieldsV1", removed:metadata.managedFields.2.manager="cluster-image-registry-operator", removed:metadata.managedFields.2.operation="Update", removed:metadata.managedFields.2.time="2024-11-22T15:37:50Z", changed:status.conditions.2.lastTransitionTime={"2024-11-22T15:37:26Z" -> "2024-11-22T15:37:50Z"}, changed:status.observedGeneration={"10.000000" -> "11.000000"} E1122 15:37:50.885488 1 controller.go:379] unable to sync: unable to sync storage configuration: persistentvolumeclaims "image-registry-storage" not found, requeuing
Expected results:
PVC should be created and therefore the operator to become healthy.
Description of problem:
Pull support from upstream kubernetes (see KEP 4800: https://github.com/kubernetes/enhancements/issues/4800) for LLC alignment support in cpumanager
Version-Release number of selected component (if applicable):
4.19
How reproducible:
100%
Steps to Reproduce:
1. Try to schedule a pod which requires exclusive CPU allocation and whose CPUs should be affine to the same LLC block
2. Observe random and likely wrong (not LLC-aligned) allocation
3.
Actual results:
Expected results:
Additional info:
Description of problem:
when the TechPreviewNoUpgrade feature gate is enabled, the console will show a customized 'Create Project' modal to all users. In the customized modal, the 'Display name' and 'Description' values the user typed do not take effect
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-16-065305
How reproducible:
Always when TechPreviewNoUpgrade feature gate is enabled
Steps to Reproduce:
1. Enable the TechPreviewNoUpgrade feature gate:
$ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type merge
2. Any user logs in to the console and creates a project from the web, sets 'Display name' and 'Description', then clicks 'Create'
3. Check the created project YAML:
$ oc get project ku-5 -o json | jq .metadata.annotations
{
  "openshift.io/description": "",
  "openshift.io/display-name": "",
  "openshift.io/requester": "kube:admin",
  "openshift.io/sa.scc.mcs": "s0:c28,c17",
  "openshift.io/sa.scc.supplemental-groups": "1000790000/10000",
  "openshift.io/sa.scc.uid-range": "1000790000/10000"
}
Actual results:
display-name and description are all empty
Expected results:
display-name and description should be set to the values user had configured
Additional info:
Once TP is enabled, the customized Create Project modal looks like https://drive.google.com/file/d/1HmIlm0u_Ia_TPsa0ZAGyTloRmpfD0WYk/view?usp=drive_link
Description of problem:
With balance-slb and nmstate a node got stuck on reboot.
[root@master-1 core]# systemctl list-jobs
JOB UNIT                                 TYPE  STATE
307 wait-for-br-ex-up.service            start running
341 afterburn-checkin.service            start waiting
187 multi-user.target                    start waiting
186 graphical.target                     start waiting
319 crio.service                         start waiting
292 kubelet.service                      start waiting
332 afterburn-firstboot-checkin.service  start waiting
306 node-valid-hostname.service          start waiting
293 kubelet-dependencies.target          start waiting
321 systemd-update-utmp-runlevel.service start waiting

systemctl status wait-for-br-ex-up.service
Dec 10 20:11:39 master-1.ostest.test.metalkube.org systemd[1]: Starting Wait for br-ex up event from NetworkManager...
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-04-113014
How reproducible:
Sometimes
Steps to Reproduce:
1. create nmstate config
interfaces: - name: bond0 type: bond state: up copy-mac-from: eno2 ipv4: enabled: false link-aggregation: mode: balance-xor options: xmit_hash_policy: vlan+srcmac balance-slb: 1 port: - eno2 - eno3 - name: br-ex type: ovs-bridge state: up ipv4: enabled: false dhcp: false ipv6: enabled: false dhcp: false bridge: port: - name: bond0 - name: br-ex - name: br-ex type: ovs-interface state: up copy-mac-from: eno2 ipv4: enabled: true address: - ip: "192.168.111.111" prefix-length: 24 ipv6: enabled: false dhcp: false - name: eno1 type: interface state: up ipv4: enabled: false ipv6: enabled: false dns-resolver: config: server: - 192.168.111.1 routes: config: - destination: 0.0.0.0/0 next-hop-address: 192.168.111.1 next-hop-interface: br-ex
2. reboot
3.
Actual results:
systemctl status wait-for-br-ex-up.service Dec 10 20:11:39 master-1.ostest.test.metalkube.org systemd[1]: Starting Wait for br-ex up event from NetworkManager...
bond0 fails, network is in odd state
[root@master-1 core]# ip -c a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens2f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 90:e2:ba:ca:9f:28 brd ff:ff:ff:ff:ff:ff
altname enp181s0f0
inet6 fe80::92e2:baff:feca:9f28/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 30:d0:42:56:66:bb brd ff:ff:ff:ff:ff:ff
altname enp23s0f0
4: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 90:e2:ba:ca:9f:29 brd ff:ff:ff:ff:ff:ff
altname enp181s0f1
inet6 fe80::92e2:baff:feca:9f29/64 scope link noprefixroute
valid_lft forever preferred_lft forever
5: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 30:d0:42:56:66:bc brd ff:ff:ff:ff:ff:ff
altname enp23s0f1
6: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 30:d0:42:56:66:bd brd ff:ff:ff:ff:ff:ff
altname enp23s0f2
inet 192.168.111.34/24 brd 192.168.111.255 scope global dynamic noprefixroute eno3
valid_lft 3576sec preferred_lft 3576sec
inet6 fe80::32d0:42ff:fe56:66bd/64 scope link noprefixroute
valid_lft forever preferred_lft forever
7: eno4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 30:d0:42:56:66:be brd ff:ff:ff:ff:ff:ff
altname enp23s0f3
8: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 56:92:14:97:ed:10 brd ff:ff:ff:ff:ff:ff
9: ovn-k8s-mp0: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
link/ether ae:b9:9e:dc:17:d1 brd ff:ff:ff:ff:ff:ff
10: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
link/ether e6:68:4d:df:e0:bd brd ff:ff:ff:ff:ff:ff
inet6 fe80::e468:4dff:fedf:e0bd/64 scope link
valid_lft forever preferred_lft forever
11: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
link/ether 32:5b:1f:35:ce:f5 brd ff:ff:ff:ff:ff:ff
12: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue master ovs-system state DOWN group default qlen 1000
link/ether aa:c8:8c:e3:71:aa brd ff:ff:ff:ff:ff:ff
13: bond0.104@bond0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue master ovs-system state LOWERLAYERDOWN group default qlen 1000
link/ether aa:c8:8c:e3:71:aa brd ff:ff:ff:ff:ff:ff
14: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 30:d0:42:56:66:bd brd ff:ff:ff:ff:ff:ff
inet 192.168.111.111/24 brd 192.168.111.255 scope global noprefixroute br-ex
valid_lft forever preferred_lft forever
Expected results:
System reboots correctly.
Additional info:
br-ex up/down re-generates the event
[root@master-1 core]# nmcli device down br-ex ; nmcli device up br-ex
Description of problem:
On an Azure HCP cluster, when creating an internal ingress controller we get an authorization error.
Version-Release number of selected component (if applicable):
4.19 and possibly later versions
How reproducible:
Create an internal ingress controller in a cluster-bot or Prow-CI-created Azure HCP cluster
Steps to Reproduce:
1.Create a internal ingress controller mjoseph@mjoseph-mac Downloads % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE console 4.19.0-0.nightly-2025-01-21-163021 True False False 107m csi-snapshot-controller 4.19.0-0.nightly-2025-01-21-163021 True False False 120m dns 4.19.0-0.nightly-2025-01-21-163021 True False False 107m image-registry 4.19.0-0.nightly-2025-01-21-163021 True False False 107m ingress 4.19.0-0.nightly-2025-01-21-163021 True False False 108m insights 4.19.0-0.nightly-2025-01-21-163021 True False False 109m kube-apiserver 4.19.0-0.nightly-2025-01-21-163021 True False False 121m kube-controller-manager 4.19.0-0.nightly-2025-01-21-163021 True False False 121m kube-scheduler 4.19.0-0.nightly-2025-01-21-163021 True False False 121m kube-storage-version-migrator 4.19.0-0.nightly-2025-01-21-163021 True False False 109m monitoring 4.19.0-0.nightly-2025-01-21-163021 True False False 102m network 4.19.0-0.nightly-2025-01-21-163021 True False False 120m node-tuning 4.19.0-0.nightly-2025-01-21-163021 True False False 112m openshift-apiserver 4.19.0-0.nightly-2025-01-21-163021 True False False 121m openshift-controller-manager 4.19.0-0.nightly-2025-01-21-163021 True False False 121m openshift-samples 4.19.0-0.nightly-2025-01-21-163021 True False False 107m operator-lifecycle-manager 4.19.0-0.nightly-2025-01-21-163021 True False False 121m operator-lifecycle-manager-catalog 4.19.0-0.nightly-2025-01-21-163021 True False False 121m operator-lifecycle-manager-packageserver 4.19.0-0.nightly-2025-01-21-163021 True False False 121m service-ca 4.19.0-0.nightly-2025-01-21-163021 True False False 109m storage 4.19.0-0.nightly-2025-01-21-163021 True False False 109m mjoseph@mjoseph-mac Downloads % oc get ingresses.config/cluster -o jsonpath={.spec.domain} apps.93499d233a19644b81ad.qe.azure.devcluster.openshift.com% mjoseph@mjoseph-mac Downloads % oc create -f New\ Folder\ With\ Items/internal_ingress_controller.yaml ingresscontroller.operator.openshift.io/internal created mjoseph@mjoseph-mac Downloads % mjoseph@mjoseph-mac Downloads % mjoseph@mjoseph-mac Downloads % mjoseph@mjoseph-mac Downloads % cat New\ Folder\ With\ Items/internal_ingress_controller.yaml kind: IngressController apiVersion: operator.openshift.io/v1 metadata: name: internal namespace: openshift-ingress-operator spec: domain: internal.93499d233a19644b81ad.qe.azure.devcluster.openshift.com replicas: 1 endpointPublishingStrategy: loadBalancer: scope: Internal type: LoadBalancerService 2. Check the controller status mjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator get ingresscontroller NAME AGE default 139m internal 29s mjoseph@mjoseph-mac Downloads % oc get po -n openshift-ingress NAME READY STATUS RESTARTS AGE router-default-5c4db6659b-7cq46 1/1 Running 0 128m router-internal-6b6547cb9-hhtzq 1/1 Running 0 39s mjoseph@mjoseph-mac Downloads % mjoseph@mjoseph-mac Downloads % mjoseph@mjoseph-mac Downloads % oc get co/ingress NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE ingress 4.19.0-0.nightly-2025-01-21-163021 True True False 127m Not all ingress controllers are available. 3. 
Check the internal ingress controller status mjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator get ingresscontroller internal -oyaml apiVersion: operator.openshift.io/v1 kind: IngressController metadata: creationTimestamp: "2025-01-23T07:46:15Z" finalizers: - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller generation: 2 name: internal namespace: openshift-ingress-operator resourceVersion: "29755" uid: 29244558-4d19-4ea4-a5b8-e98b9c07edb3 spec: clientTLS: clientCA: name: "" clientCertificatePolicy: "" domain: internal.93499d233a19644b81ad.qe.azure.devcluster.openshift.com endpointPublishingStrategy: loadBalancer: dnsManagementPolicy: Managed scope: Internal type: LoadBalancerService httpCompression: {} httpEmptyRequestsPolicy: Respond httpErrorCodePages: name: "" replicas: 1 tuningOptions: reloadInterval: 0s unsupportedConfigOverrides: null status: availableReplicas: 1 conditions: - lastTransitionTime: "2025-01-23T07:46:15Z" reason: Valid status: "True" type: Admitted - lastTransitionTime: "2025-01-23T07:46:50Z" message: The deployment has Available status condition set to True reason: DeploymentAvailable status: "True" type: DeploymentAvailable - lastTransitionTime: "2025-01-23T07:46:50Z" message: Minimum replicas requirement is met reason: DeploymentMinimumReplicasMet status: "True" type: DeploymentReplicasMinAvailable - lastTransitionTime: "2025-01-23T07:46:50Z" message: All replicas are available reason: DeploymentReplicasAvailable status: "True" type: DeploymentReplicasAllAvailable - lastTransitionTime: "2025-01-23T07:46:50Z" message: Deployment is not actively rolling out reason: DeploymentNotRollingOut status: "False" type: DeploymentRollingOut - lastTransitionTime: "2025-01-23T07:46:16Z" message: The endpoint publishing strategy supports a managed load balancer reason: WantedByEndpointPublishingStrategy status: "True" type: LoadBalancerManaged - lastTransitionTime: "2025-01-23T07:46:16Z" message: |- The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, RawError: {"error":{"code":"AuthorizationFailed","message":"The client '51b4e7f0-f41b-4b52-9bfc-412366b68308' with object id '51b4e7f0-f41b-4b52-9bfc-412366b68308' does not have authorization to perform action 'Microsoft.Network/virtualNetworks/subnets/read' over scope '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-ln-wqg34k2-c04e6-vnet-rg/providers/Microsoft.Network/virtualNetworks/ci-ln-wqg34k2-c04e6-vnet/subnets/ci-ln-wqg34k2-c04e6-subnet' or the scope is invalid. If access was recently granted, please refresh your credentials."}} The cloud-controller-manager logs may contain more details. reason: SyncLoadBalancerFailed status: "False" type: LoadBalancerReady - lastTransitionTime: "2025-01-23T07:46:16Z" message: LoadBalancer is not progressing reason: LoadBalancerNotProgressing status: "False" type: LoadBalancerProgressing - lastTransitionTime: "2025-01-23T07:46:16Z" message: DNS management is supported and zones are specified in the cluster DNS config. reason: Normal status: "True" type: DNSManaged - lastTransitionTime: "2025-01-23T07:46:16Z" message: The wildcard record resource was not found. 
reason: RecordNotFound status: "False" type: DNSReady - lastTransitionTime: "2025-01-23T07:46:16Z" message: |- One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, RawError: {"error":{"code":"AuthorizationFailed","message":"The client '51b4e7f0-f41b-4b52-9bfc-412366b68308' with object id '51b4e7f0-f41b-4b52-9bfc-412366b68308' does not have authorization to perform action 'Microsoft.Network/virtualNetworks/subnets/read' over scope '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-ln-wqg34k2-c04e6-vnet-rg/providers/Microsoft.Network/virtualNetworks/ci-ln-wqg34k2-c04e6-vnet/subnets/ci-ln-wqg34k2-c04e6-subnet' or the scope is invalid. If access was recently granted, please refresh your credentials."}} The cloud-controller-manager logs may contain more details.) reason: IngressControllerUnavailable status: "False" type: Available - lastTransitionTime: "2025-01-23T07:46:50Z" status: "False" type: Progressing - lastTransitionTime: "2025-01-23T07:47:46Z" message: |- One or more other status conditions indicate a degraded state: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, RawError: {"error":{"code":"AuthorizationFailed","message":"The client '51b4e7f0-f41b-4b52-9bfc-412366b68308' with object id '51b4e7f0-f41b-4b52-9bfc-412366b68308' does not have authorization to perform action 'Microsoft.Network/virtualNetworks/subnets/read' over scope '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-ln-wqg34k2-c04e6-vnet-rg/providers/Microsoft.Network/virtualNetworks/ci-ln-wqg34k2-c04e6-vnet/subnets/ci-ln-wqg34k2-c04e6-subnet' or the scope is invalid. If access was recently granted, please refresh your credentials."}} The cloud-controller-manager logs may contain more details.) reason: DegradedConditions status: "True" type: Degraded - lastTransitionTime: "2025-01-23T07:46:16Z" message: IngressController is upgradeable. reason: Upgradeable status: "True" type: Upgradeable - lastTransitionTime: "2025-01-23T07:46:16Z" message: No evaluation condition is detected. reason: NoEvaluationCondition status: "False" type: EvaluationConditionsDetected domain: internal.93499d233a19644b81ad.qe.azure.devcluster.openshift.com endpointPublishingStrategy: loadBalancer: dnsManagementPolicy: Managed scope: Internal type: LoadBalancerService observedGeneration: 2 selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=internal tlsProfile: ciphers: - ECDHE-ECDSA-AES128-GCM-SHA256 - ECDHE-RSA-AES128-GCM-SHA256 - ECDHE-ECDSA-AES256-GCM-SHA384 - ECDHE-RSA-AES256-GCM-SHA384 - ECDHE-ECDSA-CHACHA20-POLY1305 - ECDHE-RSA-CHACHA20-POLY1305 - DHE-RSA-AES128-GCM-SHA256 - DHE-RSA-AES256-GCM-SHA384 - TLS_AES_128_GCM_SHA256 - TLS_AES_256_GCM_SHA384 - TLS_CHACHA20_POLY1305_SHA256 minTLSVersion: VersionTLS12 mjoseph@mjoseph-mac Downloads %
Actual results:
mjoseph@mjoseph-mac Downloads % oc get co/ingress NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE ingress 4.19.0-0.nightly-2025-01-21-163021 True True False 127m Not all ingress controllers are available.
Expected results:
the internal controller should come up
Additional info:
One more test scenario which is causing the similar error in the HCP cluster in internal LB 1. Create a web server with two services mjoseph@mjoseph-mac Downloads % oc create -f New\ Folder\ With\ Items/webrc.yaml replicationcontroller/web-server-rc created service/service-secure created service/service-unsecure created mjoseph@mjoseph-mac Downloads % oc get po NAME READY STATUS RESTARTS AGE web-server-rc-q87rv 1/1 Running 0 40s mjoseph@mjoseph-mac Downloads % oc get svc oc geNAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 172.31.0.1 <none> 443/TCP 152m openshift ExternalName <none> kubernetes.default.svc.cluster.local <none> 147m openshift-apiserver ClusterIP 172.31.165.239 <none> 443/TCP 150m openshift-oauth-apiserver ClusterIP 172.31.254.44 <none> 443/TCP 150m packageserver ClusterIP 172.31.131.10 <none> 443/TCP 150m service-secure ClusterIP 172.31.6.17 <none> 27443/TCP 46s service-unsecure ClusterIP 172.31.199.11 <none> 27017/TCP 46s 2. Add two lb services mjoseph@mjoseph-mac Downloads % oc create -f ../Git/openshift-tests-private/test/extended/testdata/router/bug2013004-lb-services.yaml service/external-lb-57089 created service/internal-lb-57089 created mjoseph@mjoseph-mac Downloads % cat ../Git/openshift-tests-private/test/extended/testdata/router/bug2013004-lb-services.yaml apiVersion: v1 kind: List items: - apiVersion: v1 kind: Service metadata: name: external-lb-57089 spec: ports: - name: https port: 28443 protocol: TCP targetPort: 8443 selector: name: web-server-rc type: LoadBalancer - apiVersion: v1 kind: Service metadata: name: internal-lb-57089 annotations: service.beta.kubernetes.io/azure-load-balancer-internal: "true" spec: ports: - name: https port: 29443 protocol: TCP targetPort: 8443 selector: name: web-server-rc type: LoadBalancer 3. Check the external ip of the internal service, which is not yet assigned mjoseph@mjoseph-mac Downloads % oc get svc -owide NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR external-lb-57089 LoadBalancer 172.31.248.177 20.83.73.54 28443:30437/TCP 44s name=web-server-rc internal-lb-57089 LoadBalancer 172.31.156.88 <pending> 29443:31885/TCP 44s name=web-server-rc kubernetes ClusterIP 172.31.0.1 <none> 443/TCP 153m <none> openshift ExternalName <none> kubernetes.default.svc.cluster.local <none> 148m <none> openshift-apiserver ClusterIP 172.31.165.239 <none> 443/TCP 151m <none> openshift-oauth-apiserver ClusterIP 172.31.254.44 <none> 443/TCP 151m <none> packageserver ClusterIP 172.31.131.10 <none> 443/TCP 151m <none> service-secure ClusterIP 172.31.6.17 <none> 27443/TCP 112s name=web-server-rc service-unsecure ClusterIP 172.31.199.11 <none> 27017/TCP 112s name=web-server-rc
Description of problem:
=== RUN   TestNodePool/HostedCluster2/EnsureHostedCluster/EnsureSATokenNotMountedUnlessNecessary
    util.go:1943: Expected
        <string>: kube-api-access-5jlcn
    not to have prefix
        <string>: kube-api-access-
Pod spec:
  name: manila-csi-driver-operator
  resources:
    requests:
      cpu: 10m
      memory: 50Mi
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    runAsNonRoot: true
    runAsUser: 1000690000
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: FallbackToLogsOnError
  volumeMounts:
  - mountPath: /etc/guest-kubeconfig
    name: guest-kubeconfig
  - mountPath: /etc/openstack-ca/
    name: cacert
  - mountPath: /etc/openstack/
    name: cloud-credentials
  - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    name: kube-api-access-5jlcn
    readOnly: true
// This is [Serial] because it modifies ClusterCSIDriver.
var _ = g.Describe("[sig-storage][FeatureGate:VSphereDriverConfiguration][Serial][apigroup:operator.openshift.io] vSphere CSI Driver Configuration", func() {
	defer g.GinkgoRecover()

	var (
		ctx                      = context.Background()
		oc                       = exutil.NewCLI(projectName)
		originalDriverConfigSpec *opv1.CSIDriverConfigSpec
	)

	o.SetDefaultEventuallyTimeout(5 * time.Minute)
	o.SetDefaultEventuallyPollingInterval(5 * time.Second)
In OCP Origin we have the above test playing with global variables for poll interval and poll timeout which is causing all other tests in origin to have flakes.
A networking test is failing because we are not polling correctly: the above test overrode the default poll interval of 10ms and instead made it poll every 5 seconds, which caused the test to fail because our poll timeout was itself only 5 seconds.
Please don't use the global variables, or unset them after the test run is over.
Please note that this causes flakes that are hard to debug; we didn't know what was causing the poll interval to be 5 seconds instead of the default 10ms.
The "oc adm pod-network" command for working with openshift-sdn multitenant mode is now totally useless in OCP 4.17 and newer clusters (since it's only useful with openshift-sdn, and openshift-sdn no longer exists as of OCP 4.17). Of course, people might use a new oc binary to talk to an older cluster, but probably the built-in documentation should make it clearer that this is not a command that will be useful to 99% of users.
If it's possible to make "pod-network" not show up as a subcommand in "oc adm -h" that would probably be good. If not, it should probably have a description like "Manage OpenShift-SDN Multitenant mode networking [DEPRECATED]", and likewise, the longer descriptions of the pod-network subcommands should talk about "OpenShift-SDN Multitenant mode" rather than "the redhat/openshift-ovs-multitenant network plugin" (which is OCP 3 terminology), and maybe should explicitly say something like "this has no effect when using the default OpenShift Networking plugin (OVN-Kubernetes)".
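For reference, cobra supports both options discussed above; the sketch below shows the relevant fields, but it is not oc's actual command registration code, and the wording is only an example.

```go
// Sketch only: how a cobra command can be hidden from help output or marked
// deprecated. This illustrates the two options discussed above.
package podnetworksketch

import "github.com/spf13/cobra"

func newPodNetworkCommand() *cobra.Command {
	return &cobra.Command{
		Use:   "pod-network",
		Short: "Manage OpenShift-SDN Multitenant mode networking [DEPRECATED]",
		// Hidden removes the command from `oc adm -h` output entirely.
		Hidden: true,
		// Deprecated prints this notice whenever the command is invoked.
		Deprecated: "openshift-sdn was removed in OCP 4.17; this command has no effect with OVN-Kubernetes.",
	}
}
```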
Tracker issue for bootimage bump in 4.19. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-44977.
Description of problem:
Users could not install 4.19 ocp clusters with 4.19 oc-mirror it fails with error below [core@ci-op-r0wcschh-0f79b-mxw75-bootstrap ~]$ sudo crictl ps FATA[0000] validate service connection: validate CRI v1 runtime API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/crio/crio.sock: connect: no such file or directory" [core@ci-op-r0wcschh-0f79b-mxw75-bootstrap ~]$ sudo crictl img FATA[0000] validate service connection: validate CRI v1 image API for endpoint "unix:///var/run/crio/crio.sock": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/crio/crio.sock: connect: no such file or directory" [core@ci-op-r0wcschh-0f79b-mxw75-bootstrap ~]$ journalctl -b -f -u release-image.service -u bootkube.service Dec 27 04:04:04 ci-op-r0wcschh-0f79b-mxw75-bootstrap release-image-download.sh[2568]: Error: initializing source docker://registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: (Mirrors also failed: [ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: reading manifest sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 in ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release: manifest unknown]): registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: pinging container registry registry.build02.ci.openshift.org: Get "https://registry.build02.ci.openshift.org/v2/": dial tcp 34.74.144.21:443: i/o timeout Dec 27 04:04:04 ci-op-r0wcschh-0f79b-mxw75-bootstrap podman[2568]: 2024-12-27 04:04:04.637824679 +0000 UTC m=+243.178748520 image pull-error registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 initializing source docker://registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: (Mirrors also failed: [ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: reading manifest sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 in ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release: manifest unknown]): registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: pinging container registry registry.build02.ci.openshift.org: Get "https://registry.build02.ci.openshift.org/v2/": dial tcp 34.74.144.21:443: i/o timeout Dec 27 04:04:04 ci-op-r0wcschh-0f79b-mxw75-bootstrap release-image-download.sh[2107]: Pull failed. Retrying registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1... Dec 27 04:05:04 ci-op-r0wcschh-0f79b-mxw75-bootstrap release-image-download.sh[2656]: time="2024-12-27T04:05:04Z" level=warning msg="Failed, retrying in 1s ... (1/3). 
Error: initializing source docker://registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: (Mirrors also failed: [ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: reading manifest sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 in ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release: manifest unknown]): registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: pinging container registry registry.build02.ci.openshift.org: Get \"https://registry.build02.ci.openshift.org/v2/\": dial tcp 34.74.144.21:443: i/o timeout" Dec 27 04:06:05 ci-op-r0wcschh-0f79b-mxw75-bootstrap release-image-download.sh[2656]: time="2024-12-27T04:06:05Z" level=warning msg="Failed, retrying in 1s ... (2/3). Error: initializing source docker://registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: (Mirrors also failed: [ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1: reading manifest sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e8a24cc83f6eb6c1eda1 in ci-op-r0wcschh-0f79b.mirror-registry.qe.gcp.devcluster.openshift.com:5000/ci-op-r0wcschh/release/openshift/release: manifest unknown]): registry.build02.ci.openshift.org/ci-op-r0wcschh/release@sha256:bb836eae0322852a45fc1787716d8c4ddc935f0cfc44e
Version-Release number of selected component (if applicable):
Running command: 'oc-mirror' version --output=yaml W1227 14:24:55.919668 102 mirror.go:102] ⚠️ oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2 clientVersion: buildDate: "2024-12-17T11:21:05Z" compiler: gc gitCommit: 27a04ae182eda7a668d0ad99c66a5f1e0435010b gitTreeState: clean gitVersion: 4.19.0-202412170739.p0.g27a04ae.assembly.stream.el9-27a04ae goVersion: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime major: "" minor: "" platform: linux/amd64
How reproducible:
Always
Steps to Reproduce:
1. Install an OCP 4.19 cluster via oc-mirror 4.19 2. 3.
Actual results:
Users see the error as described in the Description
Expected results:
Installation should be successful.
Additional info:
More details in jira https://issues.redhat.com/browse/OCPQE-27853 More details in thread https://redhat-internal.slack.com/archives/C050P27C71S/p1735550241970219
Description of problem:
Sippy complains about pathological events in ns/openshift-cluster-csi-drivers in vsphere-ovn-serial jobs. See this job as one example.
Jan noticed that the DaemonSet generation is 10-12, while in 4.17 it is 2. Why is our operator updating the DaemonSet so often?
I wrote a quick "one-liner" to generate json diffs from the vmware-vsphere-csi-driver-operator logs:
prev=''
grep 'DaemonSet "openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node" changes' openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-operator-5b79c58f6f-hpr6g_vmware-vsphere-csi-driver-operator.log \
  | sed 's/^.*changes: //' \
  | while read -r line; do
      diff <(echo $prev | jq .) <(echo $line | jq .)
      prev=$line
      echo "####"
    done
It really seems to be only operator.openshift.io/spec-hash and operator.openshift.io/dep-* fields changing in the json diffs:
####
4,5c4,5
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
< "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
> "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
13c13
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
####
4,5c4,5
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
< "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
> "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
13c13
< "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
---
> "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
####
The deployment is also changing in the same way. We need to find what is causing the spec-hash and dep-* fields to change and avoid the unnecessary churn that causes new daemonset / deployment rollouts.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
~20% failure rate in 4.18 vsphere-ovn-serial jobs
Steps to Reproduce:
Actual results:
operator rolls out unnecessary daemonset / deployment changes
Expected results:
don't roll out changes unless there is a spec change
Additional info:
Description of problem:
The resource-controller endpoint override is not honored in all parts of the machine API provider for Power VS.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
While generating delete-images.yaml for the pruning of images using oc-mirror v2, the manifests that were generated under working-dir/cluster-resources (IDMS, ITMS, etc.) are deleted automatically
Version-Release number of selected component (if applicable):
4.17
How reproducible:
100% reproducible
Steps to Reproduce:
1- Create a DeleteImageSetConfiguration file like below
apiVersion: mirror.openshift.io/v2alpha1
kind: DeleteImageSetConfiguration
delete:
  platform:
    channels:
    - name: stable-4.17
      minVersion: 4.17.3
      maxVersion: 4.17.3
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.17
    packages:
    - name: aws-load-balancer-operator
    - name: node-observability-operator
    - name: 3scale-operator
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: registry.redhat.io/ubi9/ubi@sha256:20f695d2a91352d4eaa25107535126727b5945bff38ed36a3e59590f495046f0
2- Ensure that the manifests generated by oc-mirror are present in working-dir/cluster-resources
ls -lrth /opt/417ocmirror/working-dir/cluster-resources/
total 16K
-rw-r--r--. 1 root root 491 Nov 18 21:57 itms-oc-mirror.yaml
-rw-r--r--. 1 root root 958 Nov 18 21:57 idms-oc-mirror.yaml
-rw-r--r--. 1 root root 322 Nov 18 21:57 updateService.yaml
-rw-r--r--. 1 root root 268 Nov 18 21:57 cs-redhat-operator-index-v4-17.yaml
3- Generate the delete-images.yaml using below command
./oc-mirror delete --config ./deleteimageset.yaml --workspace file:///opt/417ocmirror --v2 --generate docker://bastionmirror.amuhamme.upi:8443/417images
2024/11/18 23:53:12 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/11/18 23:53:12 [INFO] : 👋 Hello, welcome to oc-mirror
2024/11/18 23:53:12 [INFO] : ⚙️ setting up the environment for you...
2024/11/18 23:53:12 [INFO] : 🔀 workflow mode: diskToMirror / delete
2024/11/18 23:53:12 [INFO] : 🕵️ going to discover the necessary images...
2024/11/18 23:53:12 [INFO] : 🔍 collecting release images...
2024/11/18 23:53:12 [INFO] : 🔍 collecting operator images...
2024/11/18 23:53:13 [INFO] : 🔍 collecting additional images...
2024/11/18 23:53:13 [INFO] : 📄 Generating delete file...
2024/11/18 23:53:13 [INFO] : /opt/417ocmirror/working-dir/delete file created
2024/11/18 23:53:13 [INFO] : delete time : 712.42082ms
2024/11/18 23:53:13 [INFO] : 👋 Goodbye, thank you for using oc-mirror
4- Verify that, after generating delete-images.yaml, the manifests that were present in working-dir/cluster-resources/ got deleted.
# ls -lrth /opt/417ocmirror/working-dir/cluster-resources/
total 0
# ls -lrth /opt/417ocmirror/working-dir/delete
total 72K
-rwxr-xr-x. 1 root root 65K Nov 18 23:53 delete-images.yaml
-rwxr-xr-x. 1 root root 617 Nov 18 23:53 delete-imageset-config.yaml
Actual results:
Generating delete-images.yaml deletes the manifests under working-dir/cluster-resources/
Expected results:
Generating delete-images.yaml should not delete the manifests under working-dir/cluster-resources/
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We think that low disk space is likely the cause of https://issues.redhat.com/browse/OCPBUGS-37785
It's not immediately obvious that this happened during the run without digging into the events.
Could we create a new test to enforce that the kubelet never reports disk pressure during a run?
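A minimal sketch (not the actual origin test) of what such a check could look like, assuming access to a client-go clientset: it fails if any node currently reports the DiskPressure condition as True. A real invariant test would more likely watch node conditions or NodeHasDiskPressure events across the whole run.
```go
package disktest

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkNoDiskPressure is a hypothetical helper: it lists all nodes and returns
// an error if any of them currently reports DiskPressure=True.
func checkNoDiskPressure(ctx context.Context, client kubernetes.Interface) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing nodes: %w", err)
	}
	var failures []string
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeDiskPressure && cond.Status == corev1.ConditionTrue {
				failures = append(failures, fmt.Sprintf("node %s reports DiskPressure since %s: %s",
					node.Name, cond.LastTransitionTime, cond.Message))
			}
		}
	}
	if len(failures) > 0 {
		return fmt.Errorf("kubelet reported disk pressure during the run:\n%s", strings.Join(failures, "\n"))
	}
	return nil
}
```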
Description of problem:
On an Azure (or vSphere) TP cluster, upgrade failed from 4.15.0-rc.5 -> 4.15.0-rc.7 or 4.15.0-rc.4 -> 4.15.0-rc.5, stuck on cluster-api. This seems to happen only on platforms that don't support CAPI; it could not be reproduced on AWS and GCP.
Version-Release number of selected component (if applicable):
4.15.0-rc.5-> 4.15.0-rc.7 or 4.15.0-rc.4-> 4.15.0-rc.5
How reproducible:
always
Steps to Reproduce:
1. Build a TP cluster at 4.15.0-rc.5 on Azure (or vSphere) 2. Upgrade to 4.15.0-rc.7 3.
Actual results:
Upgrade stuck in cluster-api. must-gather: https://drive.google.com/file/d/12ykhEVZvqY_0eNdLwJOWFSxTSdQQrm_y/view?usp=sharing $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-rc.5 True True 82m Working towards 4.15.0-rc.7: 257 of 929 done (27% complete), waiting on cluster-api I0222 04:53:18.733907 1 sync_worker.go:1134] Update error 198 of 929: ClusterOperatorUpdating Cluster operator cluster-api is updating versions (*errors.errorString: cluster operator cluster-api is available and not degraded but has not finished updating to target version) E0222 04:53:18.733944 1 sync_worker.go:638] unable to synchronize image (waiting 2m44.892272217s): Cluster operator cluster-api is updating versions $ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.15.0-rc.5 True False False 99m baremetal 4.15.0-rc.5 True False False 123m cloud-controller-manager 4.15.0-rc.7 True False False 128m cloud-credential 4.15.0-rc.5 True False False 135m cluster-api 4.15.0-rc.5 True False False 124m cluster-autoscaler 4.15.0-rc.5 True False False 123m config-operator 4.15.0-rc.7 True False False 124m console 4.15.0-rc.5 True False False 101m control-plane-machine-set 4.15.0-rc.7 True False False 113m csi-snapshot-controller 4.15.0-rc.5 True False False 112m dns 4.15.0-rc.5 True False False 115m etcd 4.15.0-rc.7 True False False 122m image-registry 4.15.0-rc.5 True False False 107m ingress 4.15.0-rc.5 True False False 106m insights 4.15.0-rc.5 True False False 118m kube-apiserver 4.15.0-rc.7 True False False 108m kube-controller-manager 4.15.0-rc.7 True False False 121m kube-scheduler 4.15.0-rc.7 True False False 120m kube-storage-version-migrator 4.15.0-rc.5 True False False 115m machine-api 4.15.0-rc.7 True False False 111m machine-approver 4.15.0-rc.5 True False False 124m machine-config 4.15.0-rc.5 True False False 121m marketplace 4.15.0-rc.5 True False False 123m monitoring 4.15.0-rc.5 True False False 106m network 4.15.0-rc.5 True False False 126m node-tuning 4.15.0-rc.5 True False False 112m olm 4.15.0-rc.5 True False False 106m openshift-apiserver 4.15.0-rc.5 True False False 115m openshift-controller-manager 4.15.0-rc.5 True False False 115m openshift-samples 4.15.0-rc.5 True False False 111m operator-lifecycle-manager 4.15.0-rc.5 True False False 123m operator-lifecycle-manager-catalog 4.15.0-rc.5 True False False 123m operator-lifecycle-manager-packageserver 4.15.0-rc.5 True False False 112m platform-operators-aggregated 4.15.0-rc.5 True False False 73m service-ca 4.15.0-rc.5 True False False 124m storage 4.15.0-rc.5 True False False 107m
Expected results:
Upgrade is successful
Additional info:
The upgrade succeeded from 4.15.0-rc.3 -> 4.15.0-rc.4
The --metrics-bind-addr flag was deprecated and has now been removed; we need to update how we deploy CAPO.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
4.16 and newer
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
$ oc --context sharedocp416-sbr patch clustercsidriver csi.vsphere.vmware.com --type merge -p "{\"spec\":{\"managementState\":\"Unmanaged\"}}"
$ oc --context sharedocp416-sbr -n openshift-cluster-csi-drivers get deploy/vmware-vsphere-csi-driver-controller -o json | jq -r '.spec.template.spec.containers[] | select(.name == "csi-attacher").args'
[
  "--csi-address=$(ADDRESS)",
  "--timeout=300s",
  "--http-endpoint=localhost:8203",
  "--leader-election",
  "--leader-election-lease-duration=137s",
  "--leader-election-renew-deadline=107s",
  "--leader-election-retry-period=26s",
  "--v=2",
  "--reconcile-sync=10m" <<----------------- ADD THE INCREASED RSYNC INTERVAL
]
Description of problem:
[control-plane-operator] azure-file-csi using the nfs protocol fails to provision volumes with "vnetName or location is empty"
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2025-01-21-070749
How reproducible:
Always
Steps to Reproduce:
1. Create an ARO hosted cluster on Azure.
2. Create a new StorageClass using the Azure File CSI provisioner and the nfs protocol, create a PVC with that StorageClass, and create a pod that consumes the PVC.
$ oc apply -f - <<EOF
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-csi-nfs
parameters:
  protocol: nfs
provisioner: file.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mypvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: azurefile-csi-nfs
  volumeMode: Filesystem
---
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: hello-app
    image: quay.io/openshifttest/hello-openshift@sha256:56c354e7885051b6bb4263f9faa58b2c292d44790599b7dde0e49e7c466cf339
    volumeMounts:
    - mountPath: /mnt/storage
      name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: mypvc
EOF
3. Check that the volume is provisioned and that the pod can read and write inside the file volume.
Actual results:
In step 3: the volume provision failed of vnetName or location is empty $ oc describe pvc mypvc Name: mypvc Namespace: default StorageClass: azurefile-csi-nfs Status: Pending Volume: Labels: <none> Annotations: volume.beta.kubernetes.io/storage-provisioner: file.csi.azure.com volume.kubernetes.io/storage-provisioner: file.csi.azure.com Finalizers: [kubernetes.io/pvc-protection] Capacity: Access Modes: VolumeMode: Filesystem Used By: mypod Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal ExternalProvisioning 7s (x3 over 10s) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'file.csi.azure.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered. Normal Provisioning 3s (x4 over 10s) file.csi.azure.com_azure-file-csi-driver-controller-7cb9b5f788-n9ztr_85399802-95c4-468e-814d-2c4df5140069 External provisioner is provisioning volume for claim "default/mypvc" Warning ProvisioningFailed 3s (x4 over 10s) file.csi.azure.com_azure-file-csi-driver-controller-7cb9b5f788-n9ztr_85399802-95c4-468e-814d-2c4df5140069 failed to provision volume with StorageClass "azurefile-csi-nfs": rpc error: code = Internal desc = update service endpoints failed with error: vnetName or location is empty
Expected results:
In step 3: the volume should be provisioned and pod could read and write inside the file volume.
Additional info:
# ARO HCP misses the vnetName/vnetResourceGroup, which causes volume provisioning with the nfs protocol to fail
oc extract secret/azure-file-csi-config --to=-
# cloud.conf
{
  "cloud": "AzurePublicCloud",
  "tenantId": "XXXXXXXXXX",
  "useManagedIdentityExtension": false,
  "subscriptionId": "XXXXXXXXXX",
  "aadClientId": "XXXXXXXXXX",
  "aadClientSecret": "",
  "aadClientCertPath": "/mnt/certs/ci-op-gcprj1wl-0a358-azure-file",
  "resourceGroup": "ci-op-gcprj1wl-0a358-rg",
  "location": "centralus",
  "vnetName": "",
  "vnetResourceGroup": "",
  "subnetName": "",
  "securityGroupName": "",
  "securityGroupResourceGroup": "",
  "routeTableName": "",
  "cloudProviderBackoff": false,
  "cloudProviderBackoffDuration": 0,
  "useInstanceMetadata": false,
  "loadBalancerSku": "",
  "disableOutboundSNAT": false,
  "loadBalancerName": ""
}
Description of problem:
4.18 HyperShift operator's NodePool controller fails to serialize NodePool ConfigMaps that contain ImageDigestMirrorSet. Inspecting the code, it fails on NTO reconciliation logic, where only machineconfiguration API schemas are loaded into the YAML serializer: https://github.com/openshift/hypershift/blob/f7ba5a14e5d0cf658cf83a13a10917bee1168011/hypershift-operator/controllers/nodepool/nto.go#L415-L421
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1. Install 4.18 HyperShift operator 2. Create NodePool with configuration ConfigMap that includes ImageDigestMirrorSet 3. HyperShift operator fails to reconcile NodePool
Actual results:
HyperShift operator fails to reconcile NodePool
Expected results:
HyperShift operator to successfully reconcile NodePool
Additional info:
Regression introduced by https://github.com/openshift/hypershift/pull/4717
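A minimal sketch of the failure mode described above, assuming the usual openshift/api package layout and Install helpers (this is an illustration, not a copy of the HyperShift code): a serializer whose scheme only has the machineconfiguration group installed cannot decode an ImageDigestMirrorSet, while also installing config.openshift.io/v1 makes the decode succeed.
```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer/json"
)

func main() {
	scheme := runtime.NewScheme()
	// What the NTO reconciliation path registers today.
	_ = mcfgv1.Install(scheme)
	// Without this, YAML containing kind: ImageDigestMirrorSet cannot be decoded.
	_ = configv1.Install(scheme)

	dec := json.NewSerializerWithOptions(json.DefaultMetaFactory, scheme, scheme,
		json.SerializerOptions{Yaml: true})

	doc := []byte("apiVersion: config.openshift.io/v1\nkind: ImageDigestMirrorSet\nmetadata:\n  name: example\n")
	obj, gvk, err := dec.Decode(doc, nil, nil)
	fmt.Println(obj != nil, gvk, err)
}
```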
Description of problem:
Application of a PerformanceProfile with an invalid cpuset in one of the reserved/isolated/shared/offlined cpu fields causes webhook validation to panic instead of returning an informative error.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-22-231049
How reproducible:
Apply a PerformanceProfile with invalid cpu values
Steps to Reproduce:
Apply the following PerformanceProfile with invalid cpu values:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: pp
spec:
  cpu:
    isolated: 'garbage'
    reserved: 0-3
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker-cnf: ""
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
Actual results:
On OCP >= 4.18 the error is:
Error from server: error when creating "pp.yaml": admission webhook "vwb.performance.openshift.io" denied the request: panic: runtime error: invalid memory address or nil pointer dereference [recovered]
On OCP <= 4.17 the error is:
The validation webhook passes without any errors. The invalid configuration propagates to the cluster and breaks it.
Expected results:
We expect an informative error to be returned when an invalid cpuset has been entered, without panicking or accepting it.
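A minimal sketch of the kind of validation expected here, assuming the k8s.io/utils/cpuset parser (the actual webhook may use a different parser, so treat the helper name and package choice as illustrative): parse each CPU field and surface the parse error instead of panicking.
```go
package validation

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// validateCPUField parses a PerformanceProfile CPU field (e.g. spec.cpu.isolated)
// and returns an informative error for inputs like "garbage" instead of panicking.
func validateCPUField(field, value string) error {
	if value == "" {
		return nil
	}
	if _, err := cpuset.Parse(value); err != nil {
		return fmt.Errorf("spec.cpu.%s: %q is not a valid cpuset: %w", field, value, err)
	}
	return nil
}
```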
Description of problem:
Currently we create an HFC for every BMH, no matter whether it uses ipmi or redfish; this can lead to misunderstanding.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The master branch is going to be renamed to 'main'.
We need to update the automation for the sync to reduce breakage.
Description of problem:
Currently check-patternfly-modules.sh checks the PatternFly modules serially, which could be improved by checking them in parallel. Since `yarn why` does not write to anything, this should be easily parallelizable, as there is no race condition with writing back to the yarn.lock.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
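The check itself lives in a shell script, but as a sketch of the parallelization idea (written in Go here for illustration, with a hypothetical module list), each `yarn why <module>` invocation can run concurrently since none of them writes to yarn.lock:
```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	// Hypothetical module list; the real script derives it from the repo's dependencies.
	modules := []string{"@patternfly/react-core", "@patternfly/react-table"}

	var wg sync.WaitGroup
	results := make(chan string, len(modules))
	for _, m := range modules {
		wg.Add(1)
		go func(mod string) {
			defer wg.Done()
			out, err := exec.Command("yarn", "why", mod).CombinedOutput()
			if err != nil {
				results <- fmt.Sprintf("%s: FAILED: %v", mod, err)
				return
			}
			results <- fmt.Sprintf("%s:\n%s", mod, out)
		}(m)
	}
	wg.Wait()
	close(results)
	for r := range results {
		fmt.Println(r)
	}
}
```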
Component Readiness has found a potential regression in the following test:
[sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router converges when multiple routers are writing conflicting status [Suite:openshift/conformance/parallel]
Significant regression detected.
Fishers Exact probability of a regression: 99.98%.
Test pass rate dropped from 99.51% to 93.75%.
Sample (being evaluated) Release: 4.18
Start Time: 2024-10-29T00:00:00Z
End Time: 2024-11-05T23:59:59Z
Success Rate: 93.75%
Successes: 40
Failures: 3
Flakes: 5
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 99.51%
Successes: 197
Failures: 1
Flakes: 6
Description of problem:
s2i conformance test appears to fail permanently on OCP 4.16.z
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
Since 2024-11-04 at least
Steps to Reproduce:
Run OpenShift build test suite in PR
Actual results:
Test fails - root cause appears to be that a built/deployed pod crashloops
Expected results:
Test succeeds
Additional info:
Job history https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-openshift-controller-manager-release-4.16-e2e-gcp-ovn-builds
Description of problem:
There is a new menu "UserDefinedNetworks" under "Networking", it shows 404 error on the page after nav to this menu.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-20-085127
How reproducible:
Always
Steps to Reproduce:
1.Go to Networking->UserDefinedNetworks page. 2. 3.
Actual results:
1. 404 error is shown on the page : 404: Page Not Found The server doesn't have a resource type "UserDefinedNetwork" in "k8s.ovn.org/v1". Try refreshing the page if it was recently added.
Expected results:
1. Should not show 404 error.
Additional info:
Description of problem:
MCO failed to roll out imagepolicy configuration with imagepoliy objects for different namespaces
Version-Release number of selected component (if applicable):
How reproducible:
Create ImagePolicy for testnamespace and mynamespace
apiVersion: config.openshift.io/v1alpha1
kind: ImagePolicy
metadata:
  name: p1
  namespace: testnamespace
spec:
  scopes:
  - example.com/global/image
  - example.com
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KTUZrd0V3WUhLb1pJemowQ0FRWUlLb1pJemowREFRY0RRZ0FFVW9GVW9ZQVJlS1hHeTU5eGU1U1FPazJhSjhvKwoyL1l6NVk4R2NOM3pGRTZWaUl2a0duSGhNbEFoWGFYL2JvME05UjYyczAvNnErK1Q3dXdORnVPZzhBPT0KLS0tLS1FTkQgUFVCTElDIEtFWS0tLS0t
    signedIdentity:
      matchPolicy: ExactRepository
      exactRepository:
        repository: example.com/foo/bar
---
apiVersion: config.openshift.io/v1alpha1
kind: ImagePolicy
metadata:
  name: p2
  namespace: mynamespace
spec:
  scopes:
  - registry.namespacepolicy.com
  policy:
    rootOfTrust:
      policyType: PublicKey
      publicKey:
        keyData: Zm9vIGJhcg==
    signedIdentity:
      matchPolicy: ExactRepository
      exactRepository:
        repository: example.com/foo/bar
Steps to Reproduce:
1.create namespace test-namespace, the first imagepolicy 2.create the second namespace and imagepolicy
Actual results:
Only the first imagepolicy got rolled out.
machine-config-controller log error:
$ oc logs -f machine-config-controller-c997df58b-9dk8t
I0108 23:05:09.141699 1 container_runtime_config_controller.go:499] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update namespace policy JSON from imagepolicy: error decoding policy json for namespaced policies: EOF
Expected results:
both /etc/crio/policies/mynamespace.json and /etc/crio/policies/testnamespace.json created
Additional info:
Description of problem:
1. When there are no UDNs, there is just a button to create a UDN from a form 2. When there are UDNs, there are two options: create Cluster UDN and UDN
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If a user updates a deployment config using Form view instead of yaml, the image pull secret is getting duplicated. ~~~ $ oc get pods ubi9-2-deploy message: | error: couldn't assign source annotation to deployment ubi9-2: failed to create manager for existing fields: failed to convert new object (testdc-dup-sec/ubi9-2; /v1, Kind=ReplicationController) to smd typed: .spec.template.spec.imagePullSecrets: duplicate entries for key [name="test-pull-secret"] reason: Error ~~~
Version-Release number of selected component (if applicable):
4.13.z,4.14.z,4.15.z
How reproducible:
Steps to Reproduce:
1. Edit DeploymentConfig in Form view 2. Update image version 3. Save
Actual results:
Expected results:
Additional info:
Issue is not reproducible on OCP 4.16.7+ version.
Description of problem:
Some control plane pods are not receiving the tolerations specified using the hypershift create cluster azure --toleration command.
Steps to Reproduce:
1. Create Azure HC with hypershift create cluster azure --toleration key=foo-bar.baz/quux,operator=Exists --toleration=key=fred,operator=Equal,value=foo,effect=NoSchedule --toleration key=waldo,operator=Equal,value=bar,effect=NoExecute,tolerationSeconds=3600 ...
2. Run the following script against the MC
NAMESPACE="clusters-XXX"
PODS="$(oc get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}')"
for POD in $PODS; do
  echo "Checking pod: $POD"
  tolerations="$(oc get po -n $NAMESPACE $POD -o jsonpath='{.spec.tolerations}' | jq -c --sort-keys)"
  failed="false"
  if ! grep -q '"key":"foo-bar.baz/quux","operator":"Exists"' <<< "$tolerations"; then
    echo "No foo-bar.baz/quux key found" >&2
    failed="true"
  fi
  if ! grep -q '"effect":"NoSchedule","key":"fred","operator":"Equal","value":"foo"' <<< "$tolerations"; then
    echo "No fred key found" >&2
    failed="true"
  fi
  if ! grep -q '"effect":"NoExecute","key":"waldo","operator":"Equal","tolerationSeconds":3600,"value":"bar"' <<< "$tolerations"; then
    echo "No waldo key found" >&2
    failed="true"
  fi
  if [[ $failed == "true" ]]; then
    echo "Tolerations: "
    echo "$tolerations" | jq --sort-keys
  fi
  echo
done
3. Take note of the results
Actual results (and dump files):
https://drive.google.com/drive/folders/1MQYihLSaK_9WDq3b-H7vx-LheSX69d2O?usp=sharing
Expected results:
All specified tolerations are propagated to all control plane pods.
The following test is failing more than expected:
Undiagnosed panic detected in pod
See the sippy test details for additional context.
Observed in 4.18-e2e-azure-ovn/1864410356567248896 as well as pull-ci-openshift-installer-master-e2e-azure-ovn/1864312373058211840
: Undiagnosed panic detected in pod { pods/openshift-cloud-controller-manager_azure-cloud-controller-manager-5788c6f7f9-n2mnh_cloud-controller-manager_previous.log.gz:E1204 22:27:54.558549 1 iface.go:262] "Observed a panic" panic="interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.EndpointSlice" panicGoValue="&runtime.TypeAssertionError{_interface:(*abi.Type)(0x291daa0), concrete:(*abi.Type)(0x2b73880), asserted:(*abi.Type)(0x2f5cc20), missingMethod:\"\"}" stacktrace=<}
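The panic message points at a delete handler that type-asserts the informer object directly to *v1.EndpointSlice. A minimal sketch of the conventional client-go pattern that tolerates cache.DeletedFinalStateUnknown tombstones (illustrative only, not the cloud-controller-manager code itself; the discovery/v1 type is assumed from the message):
```go
package handlers

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/client-go/tools/cache"
)

// onEndpointSliceDelete shows the usual tombstone handling: the object handed
// to a delete handler may be a DeletedFinalStateUnknown wrapper rather than
// the resource itself, so a direct type assertion can panic.
func onEndpointSliceDelete(obj interface{}) {
	eps, ok := obj.(*discoveryv1.EndpointSlice)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			fmt.Printf("unexpected object type %T\n", obj)
			return
		}
		eps, ok = tombstone.Obj.(*discoveryv1.EndpointSlice)
		if !ok {
			fmt.Printf("tombstone contained unexpected object %T\n", tombstone.Obj)
			return
		}
	}
	fmt.Printf("endpointslice deleted: %s/%s\n", eps.Namespace, eps.Name)
}
```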
Implementing RFE-3017.
As a bare Story, without a Feature or Epic, because I'm trying to limit the amount of MCO-side paperwork required to get my own RFE itch scratched. As a Story and not a NO-ISSUE pull, because OCP QE had a bit of trouble handling mco#4637 when I went NO-ISSUE on that one, and I think this might be worth a 4.19 release note.
Description of problem:
The router's fallback default cert, default_pub_keys.pem, uses SHA1 and fails to load when none of DEFAULT_CERTIFICATE, DEFAULT_CERTIFICATE_PATH, or DEFAULT_CERTIFICATE_DIR is specified on the router deployment. This isn't an active problem for our supported router scenarios because default_pub_keys.pem is never used, since DEFAULT_CERTIFICATE_DIR is always specified. But it does impact E2E testing, such as when we create router deployments with no default cert: that attempts to load default_pub_keys.pem, which HAProxy now fails on because it's SHA1. So this is both a completeness fix and a fix to help make E2E tests simpler in origin.
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
100%
Steps to Reproduce:
1. openssl x509 -in ./images/router/haproxy/conf/default_pub_keys.pem -noout -text
Actual results:
... Signature Algorithm: sha1WithRSAEncryption ...
Expected results:
... Signature Algorithm: sha256WithRSAEncryption ...
Additional info:
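For the fix, the placeholder cert just needs to be regenerated with a SHA-256 signature. A minimal Go sketch of producing such a self-signed certificate (the repo may simply use openssl instead; the subject and output path here are illustrative):
```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"os"
	"time"
)

func main() {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	tmpl := x509.Certificate{
		SerialNumber:       big.NewInt(1),
		Subject:            pkix.Name{CommonName: "router-placeholder"},
		NotBefore:          time.Now(),
		NotAfter:           time.Now().AddDate(10, 0, 0),
		SignatureAlgorithm: x509.SHA256WithRSA, // instead of the legacy SHA1 signature
	}
	der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	out, _ := os.Create("default_pub_keys.pem") // illustrative output path
	defer out.Close()
	pem.Encode(out, &pem.Block{Type: "CERTIFICATE", Bytes: der})
	pem.Encode(out, &pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
}
```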
Description of problem:
Created a service for a DNS server for secondary networks in OpenShift Virtualization using MetalLB, but the IP is still pending; when accessing the service from the UI, it crashes.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. Create an IP pool (for example 1 IP) for MetalLB and fully utilize the IP range (with another service) 2. Allocate a new IP using the oc expose command like below 3. Check the service status in the UI
Actual results:
UI crash
Expected results:
Should show the service status
Additional info:
oc expose -n openshift-cnv deployment/secondary-dns --name=dns-lb --type=LoadBalancer --port=53 --target-port=5353 --protocol='UDP'
Tracking https://github.com/distribution/distribution/issues/4112 and/or our own fixes.
The user specified TLS skip-verify, which triggers a bug that does not respect proxy values.
Short-term fix: if a self-signed cert is used, specify the CA cert accordingly instead of skipping verification.
Description of problem:
If a customer populates the serviceEndpoints for Power VS via the install config in 4.17, the validation is incorrect and persists lowercase values.
status:
  platformStatus:
    type: PowerVS
    powervs:
      serviceEndpoints:
      - name: dnsservices
        url: ...
On upgrade, the schema is currently updated to an enum, courtesy of https://github.com/openshift/api/pull/2076
The validation upgrade and ratcheting was tested, but only for the `spec` version of the field. It was assumed that spec and status validation behaved the same. However, https://issues.redhat.com/browse/OCPBUGS-48077 has recently been found, and this means that on upgrade, all writes to the status subresource of the infrastructure object fail until the serviceEndpoints are fixed.
In a steady state, this may not cause general cluster degradation, since writing to the status of the infrastructure object is not common. However, any controller that does attempt to write to it will fail and end up erroring until the value has been fixed.
There are several possible approaches to resolve this:
1. Revert https://github.com/openshift/api/pull/2076 and anything else that depended on it
2. Merge and backport the fix for https://issues.redhat.com/browse/OCPBUGS-48077
3. Introduce something in 4.18 to fix invalid values in the status (eg convert dnsservices to DNSServices)
Until one of these three (or perhaps other fixes) is taken, I think this needs to be considered a PowerVS upgrade blocker, and then management can decide if this is enough to block 4.18
Version-Release number of selected component (if applicable):
4.17 to 4.18
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
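If option 3 were pursued, the work is essentially normalizing the previously persisted lowercase names to the enum casing before (or while) they are written back to status. A tiny, hypothetical sketch of that mapping; only the dnsservices -> DNSServices pair comes from this report, and a real table would have to cover every value in the API enum.
```go
package main

import "fmt"

// endpointNameFixups maps previously-persisted lowercase PowerVS service
// endpoint names to the casing required by the enum. Illustrative only.
var endpointNameFixups = map[string]string{
	"dnsservices": "DNSServices",
}

func normalizeServiceEndpointName(name string) string {
	if fixed, ok := endpointNameFixups[name]; ok {
		return fixed
	}
	return name
}

func main() {
	fmt.Println(normalizeServiceEndpointName("dnsservices")) // DNSServices
}
```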
Description of problem:
We have an OKD 4.12 cluster which has persistent and increasing ingresswithoutclassname alerts with no ingresses normally present in the cluster. I believe the ingress without a classname being counted is created as part of the ACME validation process managed by the cert-manager operator with its openshift route addon, and is torn down once the ACME validation is complete.
Version-Release number of selected component (if applicable):
4.12.0-0.okd-2023-04-16-041331
How reproducible:
Seems very consistent. Went away during an update but came back shortly after and continues to increase.
Steps to Reproduce:
1. create ingress w/o classname 2. see counter increase 3. delete classless ingress 4. counter does not decrease.
Additional info:
https://github.com/openshift/cluster-ingress-operator/issues/912
Description of problem:
Checked in 4.18.0-0.nightly-2024-12-05-103644/4.19.0-0.nightly-2024-12-04-03122: in the admin console go to "Observe -> Metrics", execute one query and make sure there is a result for it (for example "cluster_version"), then click the kebab menu. "Show all series" is shown under the list, which is wrong; it should be "Hide all series". Clicking "Show all series" will unselect all series, yet "Hide all series" is still shown under the menu. Clicking it toggles the series between selected and unselected, but the menu always shows "Hide all series". See recording: https://drive.google.com/file/d/1kfwAH7FuhcloCFdRK--l01JYabtzcG6e/view?usp=drive_link
The same issue exists in the developer console for 4.18+; 4.17 and below do not have this issue.
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
always with 4.18+
Steps to Reproduce:
see the description
Actual results:
The Hide/Show all series status under the "Observe -> Metrics" kebab menu is wrong
Expected results:
should be right
Description of problem:
Routes with SHA1 CA certificates (spec.tls.caCertificate) break HAProxy preventing reload
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. create Route with SHA1 CA certificates 2. 3.
Actual results:
HAProxy router fails to reload
Expected results:
HAProxy router should either reject Routes with SHA1 CA certificates, or reload successfully
Additional info:
[ALERT] (312) : config : parsing [/var/lib/haproxy/conf/haproxy.config:131] : 'bind unix@/var/lib/haproxy/run/haproxy-sni.sock' in section 'frontend' : 'crt-list' : error processing line 1 in file '/var/lib/haproxy/conf/cert_config.map' : unable to load chain certificate into SSL Context '/var/lib/haproxy/router/certs/test:test.pem': ca md too weak.
[ALERT] (312) : config : Error(s) found in configuration file : /var/lib/haproxy/conf/haproxy.config
[ALERT] (312) : config : Fatal errors found in configuration.
This is a continuation/variance of https://issues.redhat.com/browse/OCPBUGS-26498
(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following tests:
install should succeed: infrastructure
install should succeed: overall
Significant regression detected.
Fishers Exact probability of a regression: 100.00%.
Test pass rate dropped from 99.24% to 89.63%.
Sample (being evaluated) Release: 4.18
Start Time: 2024-12-13T00:00:00Z
End Time: 2024-12-20T12:00:00Z
Success Rate: 89.63%
Successes: 120
Failures: 17
Flakes: 27
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 99.24%
Successes: 939
Failures: 8
Flakes: 99
View the test details report for additional context.
The kind folks at Pure Storage tell us that if customers upgrade to 4.18 without the following patch, issues will occur in CSI migration.
Kube 1.31 backport https://github.com/kubernetes/kubernetes/pull/129675
Master branch PR with the full issue description and testing procedure:
https://github.com/kubernetes/kubernetes/pull/129630
Description of problem:
Create a UDN network and check the NAD list; the UDN network is also present there
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
UDN network only presents under UDN, not presents in NAD
Expected results:
Additional info:
Description of problem:
v1alpha1 schema is still present in the v1 ConsolePlugin CRD and should be removed manually since the generator is re-adding it automatically.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Non admin users cannot create UserDefinedNetwork instances.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create UDN instance as non-admin users.
2.
3.
Actual results:
In the UI, opening the UserDefinedNetworks page fails with the following error
```
userdefinednetworks.k8s.ovn.org is forbidden: User "test" cannot list resource "userdefinednetworks" in API group "k8s.ovn.org" at the cluster scope
```
We get a similar error when trying to create one.
Expected results:
As a non-admin user I want to be able to create UDN CR w/o cluster-admin intervention.
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Description of problem:
Incorrect capitalization: `Lightspeed` is rendered as the capitalized `LightSpeed` in the ja and zh languages
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
{Failed === RUN TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations
util.go:1232:
the pod openstack-manila-csi-controllerplugin-676cc65ffc-tnnkb is not in the audited list for safe-eviction and should not contain the safe-to-evict-local-volume annotation
Expected
<string>: socket-dir
to be empty
--- FAIL: TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations (0.02s)
}
Description of problem:
When running oc-mirror v2 (both 4.16 and 4.17 have been tested) on a RHEL 9 system with FIPS enabled and the STIG security profile enforced, oc-mirror fails due to a hard-coded PGP key in oc-mirror v2.
Version-Release number of selected component (if applicable):
At least 4.16-4.17
How reproducible:
Very reproducible
Steps to Reproduce:
ImageSetConfiguration:
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.16
      minVersion: 4.16.18
      maxVersion: 4.16.24
      shortestPath: true
3. run oc-mirror with the following flags:
[cnovak@localhost ocp4-disconnected-config]$ /pods/content/bin/oc-mirror --config /pods/content/images/cluster-images.yml file:///pods/content/images/cluster-images --v2
2024/12/18 14:40:01 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready.
2024/12/18 14:40:01 [INFO] : 👋 Hello, welcome to oc-mirror
2024/12/18 14:40:01 [INFO] : ⚙️ setting up the environment for you...
2024/12/18 14:40:01 [INFO] : 🔀 workflow mode: mirrorToDisk
2024/12/18 14:40:01 [INFO] : 🕵️ going to discover the necessary images...
2024/12/18 14:40:01 [INFO] : 🔍 collecting release images...
2024/12/18 14:40:02 [ERROR] : openpgp: invalid data: user ID self-signature invalid: openpgp: invalid signature: RSA verification failure
2024/12/18 14:40:02 [ERROR] : generate release signatures: error list invalid signature for 3f14e29f5b42e1fee7d7e49482cfff4df0e63363bb4a5e782b65c66aba4944e7 image quay.io/openshift-release-dev/ocp-release@sha256:3f14e29f5b42e1fee7d7e49482cfff4df0e63363bb4a5e782b65c66aba4944e7
2024/12/18 14:40:02 [INFO] : 🔍 collecting operator images...
2024/12/18 14:40:02 [INFO] : 🔍 collecting additional images...
2024/12/18 14:40:02 [INFO] : 🚀 Start copying the images...
2024/12/18 14:40:02 [INFO] : images to copy 0
2024/12/18 14:40:02 [INFO] : === Results ===
2024/12/18 14:40:02 [INFO] : 📦 Preparing the tarball archive...
2024/12/18 14:40:02 [INFO] : 👋 Goodbye, thank you for using oc-mirror
2024/12/18 14:40:02 [ERROR] : unable to add cache repositories to the archive : lstat /home/cnovak/.oc-mirror/.cache/docker/registry/v2/repositories: no such file or directory
Expected results/immediate workaround:
[cnovak@localhost ~]$ curl -s https://raw.githubusercontent.com/openshift/cluster-update-keys/d44fca585d081a72cb2c67734556a27bbfc9470e/manifests.rhel/0000_90_cluster-update-keys_configmap.yaml | sed -n '/openshift[.]io/d;s/Comment:.*//;s/^ //p' > /tmp/pgpkey [cnovak@localhost ~]$ export OCP_SIGNATURE_VERIFICATION_PK=/tmp/pgpkey [cnovak@localhost ~]$ /pods/content/bin/oc-mirror --config /pods/content/images/cluster-images.yml file:///pods/content/images/cluster-images --v22024/12/19 08:54:42 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/12/19 08:54:42 [INFO] : 👋 Hello, welcome to oc-mirror 2024/12/19 08:54:42 [INFO] : ⚙️ setting up the environment for you... 2024/12/19 08:54:42 [INFO] : 🔀 workflow mode: mirrorToDisk 2024/12/19 08:54:42 [INFO] : 🕵️ going to discover the necessary images... 2024/12/19 08:54:42 [INFO] : 🔍 collecting release images... 2024/12/19 08:54:42 [INFO] : 🔍 collecting operator images... 2024/12/19 08:54:42 [INFO] : 🔍 collecting additional images... 2024/12/19 08:54:42 [INFO] : 🚀 Start copying the images... 2024/12/19 08:54:42 [INFO] : images to copy 382 ⠸ 1/382 : (7s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:32f80a2ee0f52e0c07a6790171be70a1b92010d8d395e9e14b4ee5f268e384bb ✓ 2/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a61b758c659f93e64d4c13a7bbc6151fe8191c2421036d23aa937c44cd478ace ✓ 3/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29ba4e3ff278741addfa3c670ea9cc0de61f7e6265ebc1872391f5b3d58427d0 ✓ 4/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2809165826b9094873f2bc299a28980f92d7654adb857b73463255eac9265fd8 ⠋ 1/382 : (19s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:32f80a2ee0f52e0c07a6790171be70a1b92010d8d395e9e14b4ee5f268e384bb ✓ 2/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a61b758c659f93e64d4c13a7bbc6151fe8191c2421036d23aa937c44cd478ace ✓ 3/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29ba4e3ff278741addfa3c670ea9cc0de61f7e6265ebc1872391f5b3d58427d0 ✓ 4/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2809165826b9094873f2bc299a28980f92d7654adb857b73463255eac9265fd8 ✓ 5/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1e54fc21197c341fe257d2f2f2ad14b578483c4450474dc2cf876a885f11e745 ✓ 6/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c934b4d95545e29f9cb7586964fd43cdb7b8533619961aaa932fe2923ab40db ✓ 7/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:851ba9ac5219a9f11e927200715e666ae515590cd9cc6dde9631070afb66b5d7 ✓ 8/382 : (1s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f614ef855220f2381217c31b8cb94c05ef20edf3ca23b5efa0be1b957cdde3a4
Additional info:
The reason this is a critical issue, is Red Hat has a relatively large footprint within the DoD/U.S Government space, and anyone who is working in a disconnected environment, with a STIG Policy enforced on a RHEL 9 machine, will run into this problem. Additionally, below is output from oc-mirror version [cnovak@localhost ~]$ oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202411251634.p0.g07714b7.assembly.stream.el9-07714b7", GitCommit:"07714b7c836ec3ad1b776f25b44c3b2c2f083aa2", GitTreeState:"clean", BuildDate:"2024-11-26T08:28:42Z", GoVersion:"go1.22.9 (Red Hat 1.22.9-2.el9_5) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
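A minimal sketch of the loading behavior the workaround above relies on: read the release verification key from the path given by OCP_SIGNATURE_VERIFICATION_PK when it is set, otherwise fall back to an embedded default. This is an illustration of the idea rather than oc-mirror's actual code, and it uses the (deprecated but available) golang.org/x/crypto/openpgp package.
```go
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/crypto/openpgp"
)

// defaultArmoredKey stands in for the key that is currently hard coded.
const defaultArmoredKey = "-----BEGIN PGP PUBLIC KEY BLOCK-----\n...\n-----END PGP PUBLIC KEY BLOCK-----"

func loadVerificationKeyring() (openpgp.EntityList, error) {
	armored := defaultArmoredKey
	if path := os.Getenv("OCP_SIGNATURE_VERIFICATION_PK"); path != "" {
		data, err := os.ReadFile(path)
		if err != nil {
			return nil, fmt.Errorf("reading %s: %w", path, err)
		}
		armored = string(data)
	}
	return openpgp.ReadArmoredKeyRing(strings.NewReader(armored))
}

func main() {
	if _, err := loadVerificationKeyring(); err != nil {
		fmt.Println("unable to load verification key:", err)
	}
}
```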
Description of problem:
Trying to set up a disconnected HCP cluster with a self-managed image registry. After the cluster installed, all the imagestreams failed to import images, with error: ``` Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client ``` The imagestream will talk to openshift-apiserver and get the image target there. After logging in to the HCP namespace, I figured out that I cannot access any external network over the https protocol.
Version-Release number of selected component (if applicable):
4.14.35
How reproducible:
always
Steps to Reproduce:
1. Install the hypershift hosted cluster with above setup 2. The cluster can be created successfully and all the pods on the cluster can be running with the expected images pulled 3. Check the internal image-registry 4. Check the openshift-apiserver pod from management cluster
Actual results:
All the imagestreams failed to sync from the remote registry. $ oc describe is cli -n openshift Name: cli Namespace: openshift Created: 6 days ago Labels: <none> Annotations: include.release.openshift.io/ibm-cloud-managed=true include.release.openshift.io/self-managed-high-availability=true openshift.io/image.dockerRepositoryCheck=2024-11-06T22:12:32Z Image Repository: image-registry.openshift-image-registry.svc:5000/openshift/cli Image Lookup: local=false Unique Images: 0 Tags: 1latest updates automatically from registry quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d ! error: Import failed (InternalError): Internal error occurred: [122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-1@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-2@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-3@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-4@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-5@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://quay.io/v2/": http: server gave HTTP response to HTTPS client] Access the external network from the openshift-apiserver pod: sh-5.1$ curl --connect-timeout 5 https://quay.io/v2 curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received sh-5.1$ curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/ curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received sh-5.1$ env | grep -i http.*proxy HTTPS_PROXY=http://127.0.0.1:8090 HTTP_PROXY=http://127.0.0.1:8090
Expected results:
The openshift-apiserver should be able to talk to the remote https services.
Additional info:
It works after setting the registry in no_proxy:
sh-5.1$ NO_PROXY=122610517469.dkr.ecr.us-west-2.amazonaws.com curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/
Not Authorized
Console reports its internal version back in segment.io telemetry. This version is opaque and cannot easily be correlated back to a particular OpenShift version. We should use an OpenShift version like 4.17.4 instead in segment.io events.
There are a number of inconsistencies when using the Observe section in the Virtualization perspective.
Description of problem:
The story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
From our CI, reproduced several times lately - trying to install 4.17 + ODF + CNV.
Getting these messages after 40 minutes:
Operator odf status: progressing message: installing: waiting for deployment odf-operator-controller-manager to become ready: deployment "odf-operator-controller-manager" not available: Deployment does not have minimum availability.
"Operator odf status: progressing message: installing: waiting for deployment odf-console to become ready: deployment "odf-console" not available: Deployment does not have minimum availability."
CI job waits for cluster to complete for 2.5h. Cluster link - https://console.dev.redhat.com/openshift/assisted-installer/clusters/74b62ea4-61ce-4fde-acbe-cc1cf41f1fb8 Attached the installation logs and a video of installation.
How reproducible:
Still checking
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
The A-06-TC02, A-06-TC05, and A-06-TC10 test cases are failing for the create-from-git.feature file. The file requires an update.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Tests are failing with timeout error
Expected results:
Test should run green
Additional info:
The story is to track the i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Description of problem:
The CEL for AWSNetworkLoadBalancerParameters that ensures Subnets and EIPs are equal should be "feature gated" by both SetEIPForNLBIngressController and IngressControllerLBSubnetsAWS. Meaning, the CEL should only be present/executed if both feature gates are enabled.
At the time we released this feature, there wasn't a way to do "AND" for the FeatureGateAwareXValidation marker, but recently https://github.com/openshift/kubernetes-sigs-controller-tools/pull/21 has been merged which now supports that.
However, nothing is currently broken since both feature gates are now enabled by default, but if the IngressControllerLBSubnetsAWS feature gate was disabled for any reason, the IngressController CRD would become invalid and unable to install. You'd get an error message similar to:
ERROR: <input>:1:157: undefined field 'subnets'
Version-Release number of selected component (if applicable):
4.17 and 4.18
How reproducible:
100%?
Steps to Reproduce:
1. Disable IngressControllerLBSubnetsAWS feature gate
Actual results:
IngressController CRD is now broken
Expected results:
IngressController shouldn't be broken.
Additional info:
To be clear, this is not a bug with an active impact, but this is more of an inconsistency that could cause problems in the future.
Description of problem:
In the ASH ARM template 06_workers.json[1], there is an unused variable "identityName" defined. This is harmless, but a little weird to be present in the official UPI installation doc[2], and it might confuse users when installing a UPI cluster on ASH. [1] https://github.com/openshift/installer/blob/master/upi/azurestack/06_workers.json#L52 [2] https://docs.openshift.com/container-platform/4.17/installing/installing_azure_stack_hub/upi/installing-azure-stack-hub-user-infra.html#installation-arm-worker_installing-azure-stack-hub-user-infra The suggestion is to remove it from the ARM template.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The following test is failing more than expected:
Undiagnosed panic detected in pod
See the sippy test details for additional context.
Observed in 4.18-e2e-vsphere-ovn-upi-serial/1861922894817267712
Undiagnosed panic detected in pod { pods/openshift-machine-config-operator_machine-config-daemon-4mzxf_machine-config-daemon_previous.log.gz:E1128 00:28:30.700325 4480 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<}
In 4.8's installer#4760, the installer began passing oc adm release new ... a manifest so the cluster-version operator would manage a coreos-bootimages ConfigMap in the openshift-machine-config-operator namespace. installer#4797 reported issues with the 0.0.1-snapshot placeholder not getting substituted, and installer#4814 attempted to fix that issue by converting the manifest from JSON to YAML to align with the replacement rexexp. But for reasons I don't understand, that manifest still doesn't seem to be getting replaced.
From 4.8 through 4.15.
100%
With 4.8.0:
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64 $ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml
releaseVersion: 0.0.1-snapshot
releaseVersion: 4.8.0
or other output that matches the extracted release. We just don't want the 0.0.1-snapshot placeholder.
Reproducing in the latest 4.14 RC:
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64 $ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml releaseVersion: 0.0.1-snapshot
Description of problem:
https://github.com/openshift/api/pull/1997 is the PR where we plan to promote the FG for UDNs in early Jan - the week of 6th
I need a 95% pass rate for the overlapping IPs test, which is currently at the 92% threshold.
See the verify jobs: https://storage.googleapis.com/test-platform-results/pr-logs/pull/openshift_api/1997/pull-ci-openshift-api-master-verify/1864369939800920064/build-log.txt that are failing on that PR
Talk with TRT and understand the flakes and improve the tests to get good pass results.
This is a blocker for GA.
Fix:
INSUFFICIENT CI testing for "NetworkSegmentation".
F1204 18:09:40.736814 181286 root.go:64] Error running codegen:
error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions is isolated from the default network with L2 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {gcp amd64 ha }
error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {gcp amd64 ha }
error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using UserDefinedNetwork is isolated from the default network with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {gcp amd64 ha }
error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using UserDefinedNetwork isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {gcp amd64 ha }
error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {metal amd64 ha ipv4}
error: "[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using UserDefinedNetwork isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]" only passed 92%, need at least 95% for "NetworkSegmentation" on {metal amd64 ha ipv4}
Description of problem:
When "users in ns/openshift-... must not produce too many applies" test flakes it doesn't output a useful output: it `{ details in audit log}`. Instead it should be ``` {user system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller had 43897 applies, check the audit log and operator log to figure out why user system:serviceaccount:openshift-infra:podsecurity-admission-label-syncer-controller had 1034 applies, check the audit log and operator log to figure out why details in audit log} ```
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
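A small sketch of the message construction being asked for, assuming the test already has per-user apply counts (the threshold value and package name are hypothetical):
```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatApplyFailure renders the per-user detail requested in this bug instead
// of the bare "details in audit log" message.
func formatApplyFailure(appliesByUser map[string]int, threshold int) string {
	var lines []string
	for user, count := range appliesByUser {
		if count > threshold {
			lines = append(lines, fmt.Sprintf("user %s had %d applies, check the audit log and operator log to figure out why", user, count))
		}
	}
	sort.Strings(lines)
	lines = append(lines, "details in audit log")
	return "{" + strings.Join(lines, "\n") + "}"
}

func main() {
	fmt.Println(formatApplyFailure(map[string]int{
		"system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller": 43897,
	}, 200))
}
```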
Description of problem:
A testing scenario similar to OCPBUGS-38719, but the pre-existing DNS private zone is not a peering zone; instead it is a normal DNS zone bound to another VPC network. The installation eventually fails, because the DNS record-set "*.apps.<cluster name>.<base domain>" is added to the above DNS private zone rather than to the cluster's DNS private zone.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-10-24-093933
How reproducible:
Always
Steps to Reproduce:
Please refer to the steps described in https://issues.redhat.com/browse/OCPBUGS-38719?focusedId=25944076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25944076
Actual results:
The installation failed due to the cluster operator "ingress" being degraded
Expected results:
The installation should succeed.
Additional info:
Description of problem:
The plugin names are shown as {{plugin}} on the Operator details page
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-26-075648
How reproducible:
Always
Steps to Reproduce:
1. Prepare an operator that has the `console.openshift.io/plugins` annotation, or create a catalogsource with image quay.io/openshifttest/dynamic-plugin-oprs:latest
annotations:
  alm-examples: 'xxx'
  console.openshift.io/plugins: '["prometheus-plugin1", "prometheus-plugin2"]'
2. Install the operator; on the operator installation page, choose Enable or Disable for the associated plugins
3. Check the Operator details page
Actual results:
2. On the Operator installation page, the associated plugin names are correctly shown 3. There is a Console plugins section on the Operator details page; in this section every plugin name is shown as {{plugin}}
Expected results:
3. Plugin names associated with the operator should be displayed correctly
Additional info:
Component Readiness has found a potential regression in the following test:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
Significant regression detected.
Fisher's Exact probability of a regression: 99.95%.
Test pass rate dropped from 99.06% to 93.75%.
Sample (being evaluated) Release: 4.18
Start Time: 2025-01-06T00:00:00Z
End Time: 2025-01-13T16:00:00Z
Success Rate: 93.75%
Successes: 45
Failures: 3
Flakes: 0
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 99.06%
Successes: 210
Failures: 2
Flakes: 0
View the test details report for additional context.
From the test details link, two of the three referenced failures are as follows:
[ { "metric": { "__name__": "ALERTS", "alertname": "OperatorHubSourceError", "alertstate": "firing", "container": "catalog-operator", "endpoint": "https-metrics", "exported_namespace": "openshift-marketplace", "instance": "[fd01:0:0:1::1a]:8443", "job": "catalog-operator-metrics", "name": "community-operators", "namespace": "openshift-operator-lifecycle-manager", "pod": "catalog-operator-6c446dcbbb-sxvjz", "prometheus": "openshift-monitoring/k8s", "service": "catalog-operator-metrics", "severity": "warning" }, "value": [ 1736751753.045, "1" ] } ]
This appears to happen sporadically in CI lately: https://search.dptools.openshift.org/?search=OperatorHubSourceError&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Though overall it looks quite rare.
What is happening to cause these alerts to fire?
At this moment, it's a regression for 4.18 and thus a release blocker. I suspect it will clear naturally, but it might be a good opportunity to look for a reason why. Could use some input from OLM on what exactly is happening in the runs such as these two:
Description of problem:
Konnectivity introduced a smarter readiness check with kubernetes-sigs/apiserver-network-proxy#485. It would be nice to do better readiness and liveness checks on startup.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
-
Actual results:
Expected results:
Additional info: Implementation in https://github.com/openshift/hypershift/pull/4829
Description of problem:
When deleting platform images, oc-mirror fails with the error: Unable to delete my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com/openshift/release:4.15.37-s390x-alibaba-cloud-csi-driver. Image may not exist or is not stored with a v2 Schema in a v2 registry
Version-Release number of selected component (if applicable):
./oc-mirror.rhel8 version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202411090338.p0.g0a7dbc9.assembly.stream.el9-0a7dbc9", GitCommit:"0a7dbc90746a26ddff3bd438c7db16214dcda1c3", GitTreeState:"clean", BuildDate:"2024-11-09T08:33:46Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.module+el8.10.0+22325+dc584f75) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. imagesetconfig as follow : kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: additionalImages: - name: registry.redhat.io/ubi8/ubi:latest - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27 operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: devworkspace-operator platform: architectures: - "s390x" channels: - name: stable-4.15 type: ocp 2. run the mirror2disk and disk2mirror command : `oc mirror -c /home/fedora/yinzhou/openshift-tests-private/test/extended/testdata/workloads/config-72708.yaml file://test/yinzhou/debug72708 --v2` `oc mirror -c /home/fedora/yinzhou/openshift-tests-private/test/extended/testdata/workloads/config-72708.yaml --from file://test/yinzhou/debug72708 --v2 docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --dest-tls-verify=false` 3. generate delete image list: `oc mirror delete --config /home/fedora/yinzhou/openshift-tests-private/test/extended/testdata/workloads/delete-config-72708.yaml docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --v2 --workspace file://test/yinzhou/debug72708 --generate` 4. execute the delete command : `oc mirror delete --delete-yaml-file test/yinzhou/debug72708/working-dir/delete/delete-images.yaml docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --v2 --dest-tls-verify=false --force-cache-delete=true`
Actual results:
4. delete hit error : ⠋ 21/396 : (0s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cd62cc631a6bf6e13366d29da5ae64088d3b42410f9b52579077cc82d2ea2ab9 2024/11/12 03:10:07 [ERROR] : Unable to delete my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com/openshift/release:4.15.37-s390x-alibaba-cloud-csi-driver. Image may not exist or is not stored with a v2 Schema in a v2 registry
Expected results:
4. no error
Additional info:
Missing RBAC causes an error when OVNK tries to annotate the network ID on the NADs. The regression was noticed when test coverage for secondary networks was added.
When building a container image using Dockerfile.dev, the resulting image does not include the necessary font files provided by PatternFly (e.g., RedHatText). As a result, the console renders with a system fallback font. The root cause of this issue is an overly broad ignore rule introduced with https://github.com/openshift/console/pull/12538.
Description of problem:
If the serverless function is not running, clicking the Test Serverless function button does nothing.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Install the serverless operator 2. Create a serverless function and make sure its status is false 3. Click on Test Serverless function
Actual results:
No response
Expected results:
Maybe show an alert, or hide the option if the function is not ready.
Additional info:
Description of problem:
The deprovision CI step[1] e2e-aws-ovn-shared-vpc-edge-zones-ipi-deprovision-deprovision is missing the ec2:ReleaseAddress permission in the installer user, needed to remove the custom IPv4 address (EIP) allocated during cluster creation. BYO IPv4 is the default on CI jobs and is enabled when the pool has IP addresses. Error: level=warning msg=UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-rxxt8srv-bf840-minimal-perm-installer is not authorized to perform: ec2:ReleaseAddress on resource: arn:aws:ec2:us-east-1:[redacted]:elastic-ip/eipalloc-0f4b652b702e73204 because no identity-based policy allows the ec2:ReleaseAddress action. Job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9413/pull-ci-openshift-installer-main-e2e-aws-ovn-shared-vpc-edge-zones/1884340831955980288
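As a quick check of whether the installer user is missing this permission, the IAM policy evaluation can be simulated (a sketch; the user ARN below is the one from the log and will differ per job):

```bash
# Simulate the IAM policy evaluation for the installer user (ARN is illustrative).
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::460538899914:user/ci-op-rxxt8srv-bf840-minimal-perm-installer \
  --action-names ec2:ReleaseAddress \
  --query 'EvaluationResults[].{action:EvalActionName,decision:EvalDecision}'
```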
Version-Release number of selected component (if applicable):
4.19
How reproducible:
always when BYO Public IPv4 pool is activated in the install-config
Steps to Reproduce:
1. install a cluster with byo IPv4 pool set on install-config 2. 3.
Actual results:
level=warning msg=UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-rxxt8srv-bf840-minimal-perm-installer is not authorized to perform: ec2:ReleaseAddress on resource: arn:aws:ec2:us-east-1:[Redacted]:elastic-ip/eipalloc-0f4b652b702e73204 because no identity-based policy allows the ec2:ReleaseAddress action.
Expected results:
Permissions granted, EIP released.
Additional info:
Description of problem:
As a user, when we attempt to "ReRun" a resolver-based PipelineRun from the OpenShift Console, the UI errors with the message "Invalid PipelineRun configuration, unable to start Pipeline." Slack thread: https://redhat-internal.slack.com/archives/CG5GV6CJD/p1730876734675309
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a resolver based pipelinerun 2. Attempt to "ReRun" the same from Console
Actual results:
ReRun Errors
Expected results:
ReRun should be triggered successfully
Additional info:
The aks-e2e test keeps failing on the CreateClusterV2 test because the `ValidReleaseInfo` condition is not set. The patch that sets this status keeps failing. Investigate why & provide a fix.
Description of problem:
Tracking per-operator fixes for the following related issues in the static pod node, installer, and revision controllers: https://issues.redhat.com/browse/OCPBUGS-45924 https://issues.redhat.com/browse/OCPBUGS-46372 https://issues.redhat.com/browse/OCPBUGS-48276
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Bootstrap process failed due to API_URL and API_INT_URL are not resolvable: Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'. Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time. Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster. Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap... Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane... Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up install logs: ... time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host" time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz" time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition" time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane." ...
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-05-184957,openshift/machine-config-operator#4165
How reproducible:
Always.
Steps to Reproduce:
1. Enable custom DNS on gcp: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade 2. Create cluster 3.
Actual results:
Failed to complete bootstrap process.
Expected results:
See description.
Additional info:
I believe 4.15 is affected as well once https://github.com/openshift/machine-config-operator/pull/4165 is backported to 4.15; currently, it fails at an earlier phase, see https://issues.redhat.com/browse/OCPBUGS-28969
Description of problem:
network-tools pod-run-netns-command failed due to "ERROR: Can't get netns pid". It seems the container runtime changed from runc to crun, so we need to update the network-tools utils: https://github.com/openshift/network-tools/blob/1df82dfade80ce31b325dab703b37bf7e8924e99/debug-scripts/utils#L108
Version-Release number of selected component (if applicable):
4.18.0-0.test-2024-11-27-013900-ci-ln-s87rfh2-latest
How reproducible:
always
Steps to Reproduce:
1. create test pod in namespace test $ oc get pod -n test NAME READY STATUS RESTARTS AGE hello-pod2 1/1 Running 0 22s 2.run command "ip a" with network-tools script pod-run-netns-command
Actual results:
$ ./network-tools pod-run-netns-command test hello-pod2 "ip route show" Temporary namespace openshift-debug-btzwc is created for debugging node... Starting pod/qiowang-120303-zb568-worker-0-5phll-debug ... To use host binaries, run `chroot /host` Removing debug pod ... Temporary namespace openshift-debug-btzwc was removed. error: non-zero exit code from debug container ERROR: Can't get netns pid <--- Failed INFO: Running ip route show in the netns of pod hello-pod2 Temporary namespace openshift-debug-l7xv4 is created for debugging node... Starting pod/qiowang-120303-zb568-worker-0-5phll-debug ... To use host binaries, run `chroot /host` nsenter: failed to parse pid: 'parse' Removing debug pod ... Temporary namespace openshift-debug-l7xv4 was removed. error: non-zero exit code from debug container ERROR: Command returned non-zero exit code, check output or logs.
Expected results:
The command runs successfully with the network-tools script pod-run-netns-command
Additional info:
There is no container running: $ oc debug node/qiowang-120303-zb568-worker-0-5phll Temporary namespace openshift-debug-hrr94 is created for debugging node... Starting pod/qiowang-120303-zb568-worker-0-5phll-debug ... To use host binaries, run `chroot /host` Pod IP: 192.168.2.190 If you don't see a command prompt, try pressing enter. sh-5.1# chroot /host sh-5.1# runc list ID PID STATUS BUNDLE CREATED OWNER sh-5.1#
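To confirm the runtime mismatch described above, the runtime-specific listings can be compared on the node (a sketch; exact output depends on the CRI-O/crun versions):

```bash
# From a debug shell on the node (oc debug node/<node>, then chroot /host):
runc list    # empty when CRI-O is configured to use crun
crun list    # should show the running containers instead
crictl ps    # runtime-agnostic listing via CRI-O
```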
Description of problem:
The UDN Details page shows a "group" of attributes called "Layer configuration". That does not really add any benefit and the name is just confusing. Let's just remove the grouping and keep the attributes flat.
Version-Release number of selected component (if applicable):
rc.4
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
OWNERS file updated to include prabhakar and Moe as owners and reviewers
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is to facilitate easy backports via automation.
The image tests are currently failing on presubmits as they cannot build our hypershift-tests image. This is caused by the fact that in 4.19 the dnf used in CI is really a wrapper, and we should now install the Microsoft repositories in /etc/yum/repos.art/ci/ so that the azure-cli can be found when attempting to install it with dnf.
Slack Thread: https://redhat-internal.slack.com/archives/CB95J6R4N/p1737060704212009
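A minimal sketch of the kind of fix described above, assuming the standard Microsoft yum repository for azure-cli (the repo file name and contents are assumptions, not the actual CI change):

```bash
# Drop the Microsoft repo definition where the CI dnf wrapper looks for repos,
# then install azure-cli through the wrapper.
cat > /etc/yum/repos.art/ci/azure-cli.repo <<'EOF'
[azure-cli]
name=Azure CLI
baseurl=https://packages.microsoft.com/yumrepos/azure-cli
enabled=1
gpgcheck=1
gpgkey=https://packages.microsoft.com/keys/microsoft.asc
EOF
dnf install -y azure-cli
```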
Description of problem:
The proposed name for Services in the UI has an extra 'asd' appended after 'example': `exampleasd`.
4.18.0-0.nightly-multi-2025-01-15-030049
How reproducible: Always
Steps to Reproduce:
1. Go to the UI -> Networking -> Services 2. Click create a new service
Actual results:
--- apiVersion: v1 kind: Service metadata: name: exampleasd namespace: test spec: selector: app: name spec: ...
Expected results:
--- apiVersion: v1 kind: Service metadata: name: example namespace: test spec: ....
Additional info:
Integrate codespell into Make Verify so that things are spelled correctly in our upstream docs and codebase.
Description of problem:
The layout is incorrect for the Service weight field on the Create Route page.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-05-103644
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the 'Create Route' page, e.g.: /k8s/ns/default/route.openshift.io~v1~Route/~new/form 2. Check the 'Service weight' field 3.
Actual results:
the input field for 'Service weight' is too long
Expected results:
Compared to a similar component in OpenShift, the input field should be shorter
Additional info:
Description of problem:
Additional network is not correctly configured on the secondary interface inside the masters and the workers.
With install-config.yaml with this section:
# This file is autogenerated by infrared openshift plugin apiVersion: v1 baseDomain: "shiftstack.local" compute: - name: worker platform: openstack: zones: [] additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9'] type: "worker" replicas: 3 controlPlane: name: master platform: openstack: zones: [] additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9'] type: "master" replicas: 3 metadata: name: "ostest" networking: clusterNetworks: - cidr: fd01::/48 hostPrefix: 64 serviceNetwork: - fd02::/112 machineNetwork: - cidr: "fd2e:6f44:5dd8:c956::/64" networkType: "OVNKubernetes" platform: openstack: cloud: "shiftstack" region: "regionOne" defaultMachinePlatform: type: "master" apiVIPs: ["fd2e:6f44:5dd8:c956::5"] ingressVIPs: ["fd2e:6f44:5dd8:c956::7"] controlPlanePort: fixedIPs: - subnet: name: "subnet-ssipv6" pullSecret: | {"auths": {"installer-host.example.com:8443": {"auth": "ZHVtbXkxMjM6ZHVtbXkxMjM="}}} sshKey: <hidden> additionalTrustBundle: <hidden> imageContentSources: - mirrors: - installer-host.example.com:8443/registry source: quay.io/openshift-release-dev/ocp-v4.0-art-dev - mirrors: - installer-host.example.com:8443/registry source: registry.ci.openshift.org/ocp/release
The installation works. However, the additional network is not configured on the masters or the workers, which leads in our case to faulty manila integration.
In the journal of all OCP nodes, logs like the one below from master-0 are observed repeatedly:
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9667] device (enp4s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed') Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <warn> [1731590504.9672] device (enp4s0): Activation: failed for connection 'Wired connection 1' Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9674] device (enp4s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed') Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9768] dhcp4 (enp4s0): canceled DHCP transaction Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9768] dhcp4 (enp4s0): activation: beginning transaction (timeout in 45 seconds) Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9768] dhcp4 (enp4s0): state changed no lease
Where that server has specifically an interface connected to the subnet "StorageNFSSubnet":
$ openstack server list | grep master-0 | da23da4a-4af8-4e54-ac60-88d6db2627b6 | ostest-kmmtt-master-0 | ACTIVE | StorageNFS=fd00:fd00:fd00:5000::fb:d8; network-ssipv6=fd2e:6f44:5dd8:c956::2e4 | ostest-kmmtt-rhcos | master |
That subnet is defined in openstack as dhcpv6-stateful:
$ openstack subnet show StorageNFSSubnet +----------------------+-------------------------------------------------------+ | Field | Value | +----------------------+-------------------------------------------------------+ | allocation_pools | fd00:fd00:fd00:5000::fb:10-fd00:fd00:fd00:5000::fb:fe | | cidr | fd00:fd00:fd00:5000::/64 | | created_at | 2024-11-13T12:34:41Z | | description | | | dns_nameservers | | | dns_publish_fixed_ip | None | | enable_dhcp | True | | gateway_ip | None | | host_routes | | | id | 480d7b2a-915f-4f0c-9717-90c55b48f912 | | ip_version | 6 | | ipv6_address_mode | dhcpv6-stateful | | ipv6_ra_mode | dhcpv6-stateful | | name | StorageNFSSubnet | | network_id | 26a751c3-c316-483c-91ed-615702bcbba9 | | prefix_length | None | | project_id | 4566c393806c43b9b4e9455ebae1cbb6 | | revision_number | 0 | | segment_id | None | | service_types | None | | subnetpool_id | None | | tags | | | updated_at | 2024-11-13T12:34:41Z | +----------------------+-------------------------------------------------------+
I also compared with an IPv4 installation, where the StorageNFSSubnet IP is successfully configured on enp4s0.
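To confirm the failure mode on an affected node, the secondary interface state can be inspected directly (a sketch; enp4s0 is the interface name from the logs above):

```bash
# From a debug shell on the node (oc debug node/<node>, then chroot /host):
nmcli -f GENERAL.STATE,IP6.ADDRESS device show enp4s0   # no IPv6 address when DHCPv6 fails
journalctl -u NetworkManager --no-pager | grep enp4s0 | tail -n 20
```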
Version-Release number of selected component (if applicable):
How reproducible: Always
Additional info: must-gather and journal of the OCP nodes provided in private comment.
Description of problem:
dynamic plugin in Pending status will block console plugins tab page loading
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-27-162407
How reproducible:
Always
Steps to Reproduce:
1. Create a dynamic plugin which will be in Pending status, we can create from file https://github.com/openshift/openshift-tests-private/blob/master/frontend/fixtures/plugin/pending-console-demo-plugin-1.yaml 2. Enable the 'console-demo-plugin-1' plugin and navigate to Console plugins tab at /k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins
Actual results:
2. page will always be loading
Expected results:
2. console plugins list table should be displayed
Additional info:
Description of problem:
When applying a profile whose isolated field contains a huge CPU list, the profile doesn't apply and no error is reported
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-26-075648
How reproducible:
Every time.
Steps to Reproduce:
1. Create a profile as specified below: apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: annotations: kubeletconfig.experimental: '{"topologyManagerPolicy":"restricted"}' creationTimestamp: "2024-11-27T10:25:13Z" finalizers: - foreground-deletion generation: 61 name: performance resourceVersion: "3001998" uid: 8534b3bf-7bf7-48e1-8413-6e728e89e745 spec: cpu: isolated: 25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,371,118,374,104,360,108,364,70,326,72,328,76,332,96,352,99,355,64,320,80,336,97,353,8,264,11,267,38,294,53,309,57,313,103,359,14,270,87,343,7,263,40,296,51,307,94,350,116,372,39,295,46,302,90,346,101,357,107,363,26,282,67,323,98,354,106,362,113,369,6,262,10,266,20,276,33,289,112,368,85,341,121,377,68,324,71,327,79,335,81,337,83,339,88,344,9,265,89,345,91,347,100,356,54,310,31,287,58,314,59,315,22,278,47,303,105,361,17,273,114,370,111,367,28,284,49,305,55,311,84,340,27,283,95,351,5,261,36,292,41,297,43,299,45,301,75,331,102,358,109,365,37,293,56,312,63,319,65,321,74,330,125,381,13,269,42,298,44,300,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,225,481,236,492,152,408,203,459,214,470,166,422,207,463,212,468,130,386,155,411,215,471,188,444,201,457,210,466,193,449,200,456,248,504,141,397,167,423,191,447,181,437,222,478,252,508,128,384,139,395,174,430,164,420,168,424,187,443,232,488,133,389,157,413,208,464,140,396,185,441,241,497,219,475,175,431,184,440,213,469,154,410,197,453,249,505,209,465,218,474,227,483,244,500,134,390,153,409,178,434,160,416,195,451,196,452,211,467,132,388,136,392,146,402,138,394,150,406,239,495,173,429,192,448,202,458,205,461,216,472,158,414,159,415,176,432,189,445,237,493,242,498,177,433,182,438,204,460,240,496,254,510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480 reserved: 0,256,1,257 hugepages: defaultHugepagesSize: 1G pages: - count: 20 size: 2M machineConfigPoolSelector: machineconfiguration.openshift.io/role: worker-cnf net: userLevelNetworking: true nodeSelector: node-role.kubernetes.io/worker-cnf: "" numa: topologyPolicy: restricted realTimeKernel: enabled: false workloadHints: highPowerConsumption: true perPodPowerManagement: false realTime: true 2. The worker-cnf node doesn't contain any kernel args associated with the above profile. 3.
Actual results:
System doesn't boot with kernel args associated with above profile
Expected results:
System should boot with Kernel args presented from Performance Profile.
Additional info:
We can see MCO gets the details and creates the mc: Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: machine-config-daemon[9550]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1=\"all\" --delete=psi=0 --delete=skew_tick=1 --delete=tsc=reliable --delete=rcupda> Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: cbs=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,3> Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 4,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,2> Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: systemd.cpu_affinity=0,1,256,257 --append=iommu=pt --append=amd_pstate=guided --append=tsc=reliable --append=nmi_watchdog=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=is> Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393> Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480 --append=nohz_full=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317> Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,49> Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ppend=nosoftlockup --append=skew_tick=1 --append=rcutree.kthread_prio=11 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=20]" Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: client(id:machine-config-operator dbus:1.336 unit:crio-36c845a9c9a58a79a0e09dab668f8b21b5e46e5734a527c269c6a5067faa423b.scope uid:0) added; new total=1 Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: Loaded sysroot Actual Kernel args: BOOT_IMAGE=(hd1,gpt3)/boot/ostree/rhcos-854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/vmlinuz-5.14.0-427.44.1.el9_4.x86_64 rw ostree=/ostree/boot.0/rhcos/854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/0 ignition.platform.id=metal ip=dhcp root=UUID=0068e804-432c-409d-aabc-260aa71e3669 rw rootflags=prjquota boot=UUID=7797d927-876e-426b-9a30-d1e600c1a382 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on
I had extracted this code, but I broke the case where we're getting the pull secret from the cluster, because it was immediately getting deleted on return. Inline the code instead to prevent that, so it gets deleted when we're done using it.
The error you get when this happens looks like: error running options:
could not create external binary provider: couldn't extract release payload image stream: failed extracting image-references from "quay.io/openshift-release-dev/ocp-release-nightly@sha256:856d044a4f97813fb31bc4edda39b05b2b7c02de1327b9b297bdf93edc08fa95": error during image extract: exit status 1 (error: unable to load --registry-config: stat /tmp/external-binary2166428580/.dockerconfigjson: no such file or directory
During review of ARO MiWi permissions, it was found that some permissions in the CCM CredentialsRequest for Azure have other permissions, identified through linked actions, that are missing from the request.
A linked access check is an action performed by Azure Resource Manager during an incoming request. For example, when you issue a create operation to a network interface (Microsoft.Network/networkInterfaces/write) you specify a subnet in the payload. ARM parses the payload, sees you're setting a subnet property, and as a result requires the linked access check Microsoft.Network/virtualNetworks/subnets/join/action on the subnet resource specified in the network interface. If you update a resource but don't include the property in the payload, it will not perform the permission check.
The following permissions were identified as possibly needed in the CCM CredsRequest, as they are specified as a linked action of one of CCM's existing permissions:
Microsoft.Network/applicationGateways/backendAddressPools/join/action Microsoft.Network/applicationSecurityGroups/joinIpConfiguration/action Microsoft.Network/applicationSecurityGroups/joinNetworkSecurityRule/action Microsoft.Network/ddosProtectionPlans/join/action Microsoft.Network/gatewayLoadBalancerAliases/join/action Microsoft.Network/loadBalancers/backendAddressPools/join/action Microsoft.Network/loadBalancers/frontendIPConfigurations/join/action Microsoft.Network/loadBalancers/inboundNatRules/join/action Microsoft.Network/networkInterfaces/join/action Microsoft.Network/networkSecurityGroups/join/action Microsoft.Network/publicIPAddresses/join/action Microsoft.Network/publicIPPrefixes/join/action Microsoft.Network/virtualNetworks/subnets/join/action
Each permission needs to be validated as to whether it is needed by CCM through any of its code paths.
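One way to validate each candidate is to look it up in the Azure provider operation metadata, e.g. for the subnet join action (a sketch; the JMESPath filter may need adjusting to the actual output shape):

```bash
# List the Microsoft.Network operations and filter for the linked action in question.
az provider operation show --namespace Microsoft.Network \
  --query "resourceTypes[].operations[] | [?contains(name, 'subnets/join/action')].{name:name, description:description}"
```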
Description of problem:
When trying to mirror a full catalog, it fails with the error: 2024/11/21 02:55:48 [ERROR] : unable to rebuild catalog docker://registry.redhat.io/redhat/redhat-operator-index:v4.17: filtered declarative config not found
Version-Release number of selected component (if applicable):
oc-mirror version W1121 02:59:58.748933 61010 mirror.go:102] ⚠️ oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-324-gbae91d5", GitCommit:"bae91d55", GitTreeState:"clean", BuildDate:"2024-11-20T02:06:04Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. mirror ocp with full==true for catalog : cat config.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.17 full: true oc-mirror -c config.yaml docker://localhost:5000 --workspace file://full-catalog --v2
Actual results:
oc-mirror -c config.yaml docker://localhost:5000 --workspace file://full-catalog --v22024/11/21 02:55:27 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/11/21 02:55:27 [INFO] : 👋 Hello, welcome to oc-mirror 2024/11/21 02:55:27 [INFO] : ⚙️ setting up the environment for you... 2024/11/21 02:55:27 [INFO] : 🔀 workflow mode: mirrorToMirror 2024/11/21 02:55:27 [INFO] : 🕵️ going to discover the necessary images... 2024/11/21 02:55:27 [INFO] : 🔍 collecting release images... 2024/11/21 02:55:27 [INFO] : 🔍 collecting operator images... ⠦ (20s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.17 2024/11/21 02:55:48 [WARN] : error parsing image registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9 : registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9 unable to parse image co ✓ (20s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.17 2024/11/21 02:55:48 [WARN] : registry.redhat.io/openshift4/ose-kube-rbac-proxy-rhel9 unable to parse image correctly : tag and digest are empty : SKIPPING 2024/11/21 02:55:48 [WARN] : [OperatorImageCollector] gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1@sha256:d4883d7c622683b3319b5e6b3a7edfbf2594c18060131a8bf64504805f875522 has both tag and digest : using digest to pull, but tag only for mirroring 2024/11/21 02:55:48 [INFO] : 🔍 collecting additional images... 2024/11/21 02:55:48 [INFO] : 🔍 collecting helm images... 2024/11/21 02:55:48 [INFO] : 🔂 rebuilding catalogs 2024/11/21 02:55:48 [INFO] : 👋 Goodbye, thank you for using oc-mirror 2024/11/21 02:55:48 [ERROR] : unable to rebuild catalog docker://registry.redhat.io/redhat/redhat-operator-index:v4.17: filtered declarative config not found
Expected results:
no error
Additional info:
Description of problem:
A new format library was introduced for CEL in 4.18, but, it is not usable in 4.18 due to upgrade checks put in place (to allow version skew between API servers and rollbacks). This means that the library is actually only presently usable in 4.19 once 1.32 ships. However, there are some issues we may face. We have a number of APIs in flight currently that would like to use this new library, we cannot get started on those features until this library is enabled. Some of those features would also like to be backported to 4.18. We also have risks on upgrades. If we decide to use this format library in any API that is upgraded prior to KAS, then during an upgrade, the CRD will be applied to the older version of the API server, blocking the upgrade as it will fail. By backporting the library (pretending it was introduced earlier, and then introducing it directly into 4.17), we can enable anything that installs post KAS upgrade to leverage this from 4.18 (solving those features asking for backports), and enable anything that upgrades pre-kas to actually leverage this in 4.19. API approvers will be responsible for making sure the libraries and upgrade compatibility are considered as new APIs are introduced. Presently, the library has had no bug fixes applied to the release-1.31 or release-1.32 branches upstream. The backport from 4.18 to 4.17 was clean bar some conflict in the imports that was easily resolved. So I'm confident that if we do need to backport any bug fixes, this should be straight forward. Any bugs in these libraries can be assigned to me (jspeed)
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The Power VS Machine API provider ignores the authentication endpoint override.
Description of problem:
We're unable to find a stable and accessible OAuth-proxy image, which is causing a bug that we haven't fully resolved yet. Krzys made a PR to address this, but it's not a complete solution since the image path doesn't seem consistently available. Krzys tried referencing the OAuth-proxy image from the openshift namespace, but it didn't work reliably. There's an imagestream for OAuth-proxy in the openshift namespace, which we might be able to reference in tests, but we're not certain of the correct Docker URL format for it. Also, it's possible that there are permission issues, which could be why the image isn't accessible when referenced this way.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
While upgrading the Fusion operator, the IBM team is facing the following error in the operator's subscription: error validating existing CRs against new CRD's schema for "fusionserviceinstances.service.isf.ibm.com": error validating service.isf.ibm.com/v1, Kind=FusionServiceInstance "ibm-spectrum-fusion-ns/odfmanager": updated validation is too restrictive: [].status.triggerCatSrcCreateStartTime: Invalid value: "number": status.triggerCatSrcCreateStartTime in body must be of type integer: "number" Question here: "triggerCatSrcCreateStartTime" has been present in the operator for the past few releases and its datatype (integer) hasn't changed in the latest release either. There was one "FusionServiceInstance" CR present in the cluster when this issue was hit, with the value of the "triggerCatSrcCreateStartTime" field being "1726856593000774400".
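To narrow this down, the stored value and the declared schema type can be compared directly (a sketch using the resource and field names from the report):

```bash
# Value stored on the existing CR.
oc get fusionserviceinstances.service.isf.ibm.com odfmanager -n ibm-spectrum-fusion-ns \
  -o jsonpath='{.status.triggerCatSrcCreateStartTime}{"\n"}'
# Declared schema for the field in the installed CRD.
oc get crd fusionserviceinstances.service.isf.ibm.com \
  -o jsonpath='{.spec.versions[*].schema.openAPIV3Schema.properties.status.properties.triggerCatSrcCreateStartTime}{"\n"}'
```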
Version-Release number of selected component (if applicable):
It impacts upgrades between OCP 4.16.7 and OCP 4.16.14.
How reproducible:
Always
Steps to Reproduce:
1. Upgrade the Fusion operator, OCP version 4.16.7 to OCP 4.16.14 2. 3.
Actual results:
Upgrade fails with error in description
Expected results:
The upgrade should not fail
Additional info:
Description of problem:
When using PublicIPv4Pool, CAPA will try to allocate an IP address from the supplied pool, which requires the `ec2:AllocateAddress` permission
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Minimal permissions and publicIpv4Pool set 2. 3.
Actual results:
time="2024-11-21T05:39:49Z" level=debug msg="E1121 05:39:49.352606 327 awscluster_controller.go:279] \"failed to reconcile load balancer\" err=<" time="2024-11-21T05:39:49Z" level=debug msg="\tfailed to allocate addresses to load balancer: failed to allocate address from Public IPv4 Pool \"ipv4pool-ec2-0768267342e327ea9\" to role lb-apiserver: failed to allocate Elastic IP for \"lb-apiserver\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-2cr41ill-663fd-minimal-perm is not authorized to perform: ec2:AllocateAddress on resource: arn:aws:ec2:us-east-1:460538899914:ipv4pool-ec2/ipv4pool-ec2-0768267342e327ea9 because no identity-based policy allows the ec2:AllocateAddress action. Encoded authorization failure message: Iy1gCtvfPxZ2uqo-SHei1yJQvNwaOBl5F_8BnfeEYCLMczeDJDdS4fZ_AesPLdEQgK7ahuOffqIr--PWphjOUbL2BXKZSBFhn3iN9tZrDCnQQPKZxf9WaQmSkoGNWKNUGn6rvEZS5KvlHV5vf5mCz5Bk2lk3w-O6bfHK0q_dphLpJjU-sTGvB6bWAinukxSYZ3xbirOzxfkRfCFdr7nDfX8G4uD4ncA7_D-XriDvaIyvevWSnus5AI5RIlrCuFGsr1_3yEvrC_AsLENZHyE13fA83F5-Abpm6-jwKQ5vvK1WuD3sqpT5gfTxccEqkqqZycQl6nsxSDP2vDqFyFGKLAmPne8RBRbEV-TOdDJphaJtesf6mMPtyMquBKI769GW9zTYE7nQzSYUoiBOafxz6K1FiYFoc1y6v6YoosxT8bcSFT3gWZWNh2upRJtagRI_9IRyj7MpyiXJfcqQXZzXkAfqV4nsJP8wRXS2vWvtjOm0i7C82P0ys3RVkQVcSByTW6yFyxh8Scoy0HA4hTYKFrCAWA1N0SROJsS1sbfctpykdCntmp9M_gd7YkSN882Fy5FanA" time="2024-11-21T05:39:49Z" level=debug msg="\t\tstatus code: 403, request id: 27752e3c-596e-43f7-8044-72246dbca486"
Expected results:
Additional info:
Seems to happen consistently with shared-vpc-edge-zones CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-edge-zones/1860015198224519168
Description of problem:
Observed in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn/1866088107347021824/artifacts/e2e-gcp-ovn/ipi-install-install/artifacts/.openshift_install-1733747884.log Distinct issues occurring in this job caused the "etcd bootstrap member to be removed from cluster" gate to take longer than its 5 minute timeout, but there was plenty of time left to complete bootstrapping successfully. It doesn't make sense to have a narrow timeout here because progress toward removal of the etcd bootstrap member begins the moment the etcd cluster starts for the first time, not when the installer starts waiting to observe it.
Version-Release number of selected component (if applicable):
4.19.0
How reproducible:
Sometimes
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When running the delete command on oc-mirror after a mirrorToMirror, the graph-image is not being deleted.
Version-Release number of selected component (if applicable):
How reproducible:
With the following ImageSetConfiguration (use the same for the DeleteImageSetConfiguration, only changing the kind and the mirror to delete):
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.13
      minVersion: 4.13.10
      maxVersion: 4.13.10
    graph: true
Steps to Reproduce:
1. Run mirror to mirror ./bin/oc-mirror -c ./alex-tests/alex-isc/isc.yaml --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 docker://localhost:6000 --v2 --dest-tls-verify=false 2. Run the delete --generate ./bin/oc-mirror delete -c ./alex-tests/alex-isc/isc-delete.yaml --generate --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 --delete-id clid-230-delete-test docker://localhost:6000 --v2 --dest-tls-verify=false 3. Run the delete ./bin/oc-mirror delete --delete-yaml-file /home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230/working-dir/delete/delete-images-clid-230-delete-test.yaml docker://localhost:6000 --v2 --dest-tls-verify=false
Actual results:
During the delete --generate the graph-image is not being included in the delete file 2024/10/25 09:44:21 [WARN] : unable to find graph image in local cache: SKIPPING. %!v(MISSING) 2024/10/25 09:44:21 [WARN] : reading manifest latest in localhost:55000/openshift/graph-image: manifest unknown Because of that the graph-image is not being deleted from the target registry [aguidi@fedora oc-mirror]$ curl http://localhost:6000/v2/openshift/graph-image/tags/list | jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 51 100 51 0 0 15577 0 --:--:-- --:--:-- --:--:-- 17000 { "name": "openshift/graph-image", "tags": [ "latest" ] }
Expected results:
graph-image should be deleted even after mirrorToMirror
Additional info:
Description of problem:
In integration, creating a rosa HostedCluster with a shared vpc will result in a VPC endpoint that is not available.
Version-Release number of selected component (if applicable):
4.17.3
How reproducible:
Sometimes (currently every time in integration, but could be due to timing)
Steps to Reproduce:
1. Create a HostedCluster with shared VPC 2. Wait for HostedCluster to come up
Actual results:
VPC endpoint never gets created due to errors like: {"level":"error","ts":"2024-11-18T20:37:51Z","msg":"Reconciler error","controller":"awsendpointservice","controllerGroup":"hypershift.openshift.io","controllerKind":"AWSEndpointService","AWSEndpointService":{"name":"private-router","namespace":"ocm-int-2f4labdgi2grpumbq5ufdsfv7nv9ro4g-cse2etests-gdb"},"namespace":"ocm-int-2f4labdgi2grpumbq5ufdsfv7nv9ro4g-cse2etests-gdb","name":"private-router","reconcileID":"bc5d8a6c-c9ad-4fc8-8ead-6b6c161db097","error":"failed to create vpc endpoint: UnauthorizedOperation","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:222"}
Expected results:
VPC endpoint gets created
Additional info:
Deleting the control plane operator pod will get things working. The theory is that if the control plane operator pod is delayed in obtaining a web identity token, then the client will not assume the role that was passed to it. Currently the client is only created once at start; we should create it on every reconcile.
Description of problem:
When deploying a disconnected cluster, creating the ISO with "openshift-install agent create image" is failing (authentication required) when the release image resides in a secured local registry. The actual issue is this: openshift-install generates a registry-config out of the install-config.yaml containing only the local registry credentials (disconnected deploy), but it does not create an ICSP file to pull the image from the local registry.
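To reproduce the pull failure outside the installer, the release image can be inspected using only the registry credentials file the installer generates (a sketch; the registry-config path and nightly tag are placeholders):

```bash
# Without an ICSP/mirror mapping this fails with "authentication required",
# since the credentials only cover the local registry.
oc adm release info --registry-config=registry-config.json \
  registry.ci.openshift.org/ocp/release:<nightly-tag>
```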
Version-Release number of selected component (if applicable):
How reproducible:
Run an agent-based ISO image creation of a disconnected cluster. Choose a version (nightly) where the image is in a secured registry (such as registry.ci). It will fail on authentication required.
Steps to Reproduce:
1. openshift-install agent create image 2. 3.
Actual results:
Fails with "authentication required"
Expected results:
The ISO should be created
Additional info:
Description of problem:
The LB name should be yunjiang-ap55-sk6jl-ext-a6aae262b13b0580, rather than ending with ELB service endpoint (elb.ap-southeast-5.amazonaws.com): failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed provisioning resources after infrastructure ready: failed to find HostedZone ID for NLB: failed to list load balancers: ValidationError: The load balancer name 'yunjiang-ap55-sk6jl-ext-a6aae262b13b0580.elb.ap-southeast-5.amazonaws.com' cannot be longer than '32' characters\n\tstatus code: 400, request id: f8adce67-d844-4088-9289-4950ce4d0c83 Checking the tag value, the value of Name key is correct: yunjiang-ap55-sk6jl-ext
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-30-141716
How reproducible:
always
Steps to Reproduce:
1. Deploy a cluster on ap-southeast-5 2. 3.
Actual results:
The LB cannot be created
Expected results:
Create a cluster successfully.
Additional info:
No such issues on other AWS regions.
Description of problem:
'Channel' and 'Version' dropdowns do not collapse if the user does not select an option
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-04-113014
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Operator Installation page OR the Operator Install details page, e.g.: /operatorhub/ns/openshift-console?source=["Red+Hat"]&details-item=datagrid-redhat-operators-openshift-marketplace&channel=stable&version=8.5.4 /operatorhub/subscribe?pkg=datagrid&catalog=redhat-operators&catalogNamespace=openshift-marketplace&targetNamespace=openshift-console&channel=stable&version=8.5.4&tokenizedAuth= 2. Click the Channel/Update channel OR 'Version' dropdown list 3. Click the dropdown again
Actual results:
The dropdown list does not collapse unless the user selects an option or clicks another area
Expected results:
The dropdown should collapse after being clicked again
Additional info:
Description of problem:
Clicking "Don't show again" won't spot "Hide Lightspeed" if current page is on Language/Notifications/Applications tab of "user-preferences"
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-16-065305
How reproducible:
Always
Steps to Reproduce:
1. User goes to one of the Language/Notifications/Applications tabs on the "user-preferences" page. 2. Open the "Lightspeed" modal at the right bottom and click "Don't show again". 3.
Actual results:
2. The URL changes to "/user-preferences/general?spotlight=[data-test="console.hideLightspeedButton%20field"]", but the page still stays on the original tab.
Expected results:
2. It should jump to the "Hide Lightspeed" section on the General tab of the "user-preferences" page.
Additional info:
Description of problem:
The status controller creates a ClusterOperator when one does not exist. In the test case verifying behavior with an already present ClusterOperator, it is required to wait until the ClusterOperator created by the test is ready. Failing to do so can result in the controller attempting to create a duplicate ClusterOperator, causing the test to fail with an "already exists" error.
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes, race condition
Steps to Reproduce:
1. Run ci/prow/unit presubmit job
Actual results:
Test fails with: clusteroperators.config.openshift.io \"machine-approver\" already exists",
Expected results:
Test passes
Additional info:
Unit test only issue. No customer impact.
Description of problem:
The console shows 'View release notes' in several places, but the current link only points to the Y-release main release notes
Version-Release number of selected component (if applicable):
4.17.2
How reproducible:
Always
Steps to Reproduce:
1. set up 4.17.2 cluster 2. navigate to Cluster Settings page, check 'View release note' link in 'Update history' table
Actual results:
the link only points the user to the Y-release main release notes
Expected results:
The link should point to the release notes of the specific version. The correct link should be https://access.redhat.com/documentation/en-us/openshift_container_platform/${major}.${minor}/html/release_notes/ocp-${major}-${minor}-release-notes#ocp-${major}-${minor}-${patch}_release_notes
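For example, instantiating that URL pattern for the 4.17.2 cluster mentioned above:

```bash
# Build the per-patch release notes URL from the version components.
major=4; minor=17; patch=2
echo "https://access.redhat.com/documentation/en-us/openshift_container_platform/${major}.${minor}/html/release_notes/ocp-${major}-${minor}-release-notes#ocp-${major}-${minor}-${patch}_release_notes"
```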
Additional info:
Original bug title:
cert-manager [v1.15 Regression] Failed to issue certs with ACME Route53 dns01 solver in AWS STS env
Description of problem:
When using Route53 as the dns01 solver to create certificates, it fails in both automated and manual tests. For the full log, please refer to the "Actual results" section.
Version-Release number of selected component (if applicable):
cert-manager operator v1.15.0 staging build
How reproducible:
Always
Steps to Reproduce: also documented in gist
1. Install the cert-manager operator 1.15.0 2. Follow the doc to auth operator with AWS STS using ccoctl: https://docs.openshift.com/container-platform/4.16/security/cert_manager_operator/cert-manager-authenticate.html#cert-manager-configure-cloud-credentials-aws-sts_cert-manager-authenticate 3. Create a ACME issuer with Route53 dns01 solver 4. Create a cert using the created issuer
OR:
Refer by running `/pj-rehearse pull-ci-openshift-cert-manager-operator-master-e2e-operator-aws-sts` on https://github.com/openshift/release/pull/59568
Actual results:
1. The certificate is not Ready. 2. The challenge of the cert is stuck in the pending status: PresentError: Error presenting challenge: failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region
Expected results:
The certificate should be Ready. The challenge should succeed.
Additional info:
The only way to get it working again seems to be injecting the "AWS_REGION" environment variable into the controller pod. See upstream discussion/change:
I couldn't find a way to inject the env var into our operator-managed operands, so I only verified this workaround using the upstream build v1.15.3. After applying the patch with the following command, the challenge succeeded and the certificate became Ready.
oc patch deployment cert-manager -n cert-manager \ --patch '{"spec": {"template": {"spec": {"containers": [{"name": "cert-manager-controller", "env": [{"name": "AWS_REGION", "value": "aws-global"}]}]}}}}'
Description of problem:
machine-approver logs
E0221 20:29:52.377443 1 controller.go:182] csr-dm7zr: Pending CSRs: 1871; Max pending allowed: 604. Difference between pending CSRs and machines > 100. Ignoring all CSRs as too many recent pending CSRs seen
oc get csr |wc -l
3818
oc get csr |grep "node-bootstrapper" |wc -l
2152
By approving the pending CSRs manually I can get the cluster to scale up.
We can increase the maxPending to a higher number: https://github.com/openshift/cluster-machine-approver/blob/2d68698410d7e6239dafa6749cc454272508db19/pkg/controller/controller.go#L330
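For reference, a common way to do the manual approval mentioned above is to approve all CSRs that have no status yet (use with care; each CSR should normally be reviewed before approval):

```bash
# Approve every pending CSR (those with no .status set yet).
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve
```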
Description of problem:
In the dev console, select a project that has alerts (example: openshift-monitoring), silence one alert (example: Watchdog), go to the "Silence details" page, and click the Watchdog link under the "Firing alerts" section. "No Alert found" shows, but it should go to the alert details page; see screen recording: https://drive.google.com/file/d/1lUKLoHpmBKuzd8MmEUaUJRPgIkI1LjCj/view?usp=drive_link
The issue happens with 4.18.0-0.nightly-2025-01-04-101226/4.19.0-0.nightly-2025-01-07-234605; there is no issue with 4.17
Checked the Watchdog link under the "Firing alerts" section: there is 'undefined' in the link where the namespace (openshift-monitoring) should be, as in 4.17
4.19
https://${console_url}/dev-monitoring/ns/undefined/alerts/1067612101?alertname=Watchdog&namespace=openshift-monitoring&severity=none
4.18
https://${console_url}/dev-monitoring/ns/undefined/alerts/1086044860?alertname=Watchdog&namespace=openshift-monitoring&severity=none
4.17
https://${console_url}/dev-monitoring/ns/openshift-monitoring/alerts/3861382580?namespace=openshift-monitoring&prometheus=openshift-monitoring%2Fk8s&severity=none&alertname=Watchdog
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
always for 4.18+
Steps to Reproduce:
1. see the description
Actual results:
"No Alert found" shows
Expected results:
no error
Description of problem:
The alertmanager-user-workload Service Account has "automountServiceAccountToken: true"
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Enable Alertmanager for user-defined monitoring. 2. oc get sa -n openshift-user-workload-monitoring alertmanager-user-workload -o yaml 3.
Actual results:
Has "automountServiceAccountToken: true"
Expected results:
Has "automountServiceAccountToken: false" or no mention of automountServiceAccountToken.
Additional info:
It is recommended to not enable token automount for service accounts in general.
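A quick way to verify the expected state after a fix (service account name and namespace taken from the report above); "false" or empty output is what we want:

```bash
# Print the automount setting on the ServiceAccount.
oc get sa alertmanager-user-workload -n openshift-user-workload-monitoring \
  -o jsonpath='{.automountServiceAccountToken}{"\n"}'
```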
This is a recurring task that needs to be performed from time to time to keep the dependencies updated.
Description of problem:
Installing 4.17 agent-based hosted cluster on bare-metal with IPv6 stack in disconnected environment. We cannot install MetalLB operator on the hosted cluster to expose openshift router and handle ingress because the openshift-marketplace pods that extract the operator bundle and the relative pods are in Error state. They try to execute the following command but cannot reach the cluster apiserver: opm alpha bundle extract -m /bundle/ -n openshift-marketplace -c b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1 -z INFO[0000] Using in-cluster kube client config Error: error loading manifests from directory: Get "https://[fd02::1]:443/api/v1/namespaces/openshift-marketplace/configmaps/b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1": dial tcp [fd02::1]:443: connect: connection refused In our hosted cluster fd02::1 is the clusterIP of the kubernetes service and the endpoint associated to the service is [fd00::1]:6443. By debugging the pods we see that connection to clusterIP is refused but if we try to connect to its endpoint the connection is established and we get 403 Forbidden: sh-5.1$ curl -k https://[fd02::1]:443 curl: (7) Failed to connect to fd02::1 port 443: Connection refused sh-5.1$ curl -k https://[fd00::1]:6443 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"", "reason": "Forbidden", "details": {}, "code": 403 This issue is happening also in other pods in the hosted cluster which are in Error or in CrashLoopBackOff, we have similar error in their logs, e.g.: F1011 09:11:54.129077 1 cmd.go:162] failed checking apiserver connectivity: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-service-ca-operator/leases/service-ca-operator-lock": dial tcp [fd02::1]:443: connect: connection refused IPv6 disconnected 4.16 hosted cluster with same configuration was installed successfully and didn't show this issue, and neither IPv4 disconnected 4.17. So the issue is with IPv6 stack only.
Version-Release number of selected component (if applicable):
Hub cluster: 4.17.0-0.nightly-2024-10-10-004834 MCE 2.7.0-DOWNANDBACK-2024-09-27-14-52-56 Hosted cluster: version 4.17.1 image: registry.ci.openshift.org/ocp/release@sha256:e16ac60ac6971e5b6f89c1d818f5ae711c0d63ad6a6a26ffe795c738e8cc4dde
How reproducible:
100%
Steps to Reproduce:
1. Install MCE 2.7 on a 4.17 IPv6 disconnected BM hub cluster
2. Install a 4.17 agent-based hosted cluster and scale up the nodepool
3. After worker nodes are installed, attempt to install the MetalLB operator to handle ingress
Actual results:
MetalLB operator cannot be installed because pods cannot connect to the cluster apiserver.
Expected results:
Pods in the cluster can connect to apiserver.
Additional info:
Description of problem:
Table layout is missing on the Metrics page. After the change in PR https://github.com/openshift/console/pull/14615, the PatternFly 4 shared modules were removed
Version-Release number of selected component (if applicable):
pre-merge
How reproducible:
Always
Steps to Reproduce:
1. Navigate to Observe -> Metrics page 2. Click 'Insert example query' button 3. Check the layout for Query results table
Actual results:
The results table layout issue
Expected results:
The layout should be the same as OCP 4.18
Additional info:
More information: see PR https://github.com/openshift/console/pull/14615
Description of problem:
Some bundles in the catalog have been given the property in the FBC (and not in the bundle's CSV), and it does not get propagated through to the Helm chart annotations.
Version-Release number of selected component (if applicable):
How reproducible:
Install elasticsearch 5.8.13
Steps to Reproduce:
1. 2. 3.
Actual results:
cluster is upgradeable
Expected results:
cluster is not upgradeable
Additional info:
Description of problem:
We had bugs like https://issues.redhat.com/browse/OCPBUGS-44324 from the payload tests in vSphere and GCP, and this was fixed by https://github.com/openshift/api/commit/ec9bf3faa1aa2f52805c44b7b13cd7ab4b984241
There are a few operators which are missing that openshift/api bump. These operators do not have blocking payload jobs, but we still need this fix before 4.18 is released. It affects the following operators:
https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/go.mod#L11
https://github.com/openshift/ibm-powervs-block-csi-driver-operator/blob/main/go.mod#L6
https://github.com/openshift/gcp-filestore-csi-driver-operator/blob/main/go.mod#L8
https://github.com/openshift/secrets-store-csi-driver-operator/blob/main/go.mod#L8
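A sketch of the usual vendoring bump in each affected repo, assuming standard Go modules with a vendor/ directory as these repos use (the repo name below is one of the four listed above):
# repeat for each affected operator repo
cd ibm-vpc-block-csi-driver-operator
go get github.com/openshift/api@ec9bf3faa1aa2f52805c44b7b13cd7ab4b984241
go mod tidy
go mod vendor
git add go.mod go.sum vendor/ && git commit -m "Bump openshift/api"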
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
All but 4 csi driver operators have the fix
Expected results:
All csi driver operators have this fix vendored: https://github.com/openshift/api/commit/ec9bf3faa1aa2f52805c44b7b13cd7ab4b984241
Additional info:
Description of problem:
Registry storage alerts did not link a runbook
Version-Release number of selected component (if applicable):
4.18
How reproducible:
always
Steps to Reproduce:
According to the doc https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#documentation-required, a runbook link should be added to the registry storage alerts; see pr: https://github.com/openshift/cluster-image-registry-operator/pull/1147/files, thanks
Actual results:
Expected results:
Additional info:
Description of problem:
CAPI install got ImageReconciliationFailed when creating vpc custom image
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2024-12-06-101930
How reproducible:
always
Steps to Reproduce:
1. Add the following in install-config.yaml:
   featureSet: CustomNoUpgrade
   featureGates: [ClusterAPIInstall=true]
2. Create an IBMCloud cluster with IPI
Actual results:
level=info msg=Done creating infra manifests
level=info msg=Creating kubeconfig entry for capi cluster ci-op-h3ykp5jn-32a54-xprzg
level=info msg=Waiting up to 30m0s (until 11:25AM UTC) for network infrastructure to become ready...
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 30m0s: client rate limiter Wait returned an error: context deadline exceeded
in IBMVPCCluster-openshift-cluster-api-guests log
reason: ImageReconciliationFailed message: 'error failure trying to create vpc custom image: error unknown failure creating vpc custom image: The IAM token that was specified in the request has expired or is invalid. The request is not authorized to access the Cloud Object Storage resource.'
Expected results:
create cluster succeed
Additional info:
the resources created when install failed:
ci-op-h3ykp5jn-32a54-xprzg-cos dff97f5c-bc5e-4455-b470-411c3edbe49c crn:v1:bluemix:public:cloud-object-storage:global:a/fdc2e14cf8bc4d53a67f972dc2e2c861:f648897a-2178-4f02-b948-b3cd53f07d85::
ci-op-h3ykp5jn-32a54-xprzg-vpc is.vpc crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::vpc:r022-46c7932d-8f4d-4d53-a398-555405dfbf18
copier-resurrect-panzer-resistant is.security-group crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::security-group:r022-2367a32b-41d1-4f07-b148-63485ca8437b
deceiving-unashamed-unwind-outward is.network-acl crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::network-acl:r022-b50286f6-1052-479f-89bc-fc66cd9bf613
Description of problem:
When the master MCP is paused, the below alert is triggered: "Failed to resync 4.12.35 because: Required MachineConfigPool 'master' is paused". The nodes have been rebooted to make sure there is no pending MC rollout.
Affects version
4.12
How reproducible:
Steps to Reproduce:
1. Create a MC and apply it to master
2. Use the below MC:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-cgroupsv2
spec:
  kernelArguments:
  - systemd.unified_cgroup_hierarchy=1
3. Wait until the nodes are rebooted and running
4. Pause the MCP
Actual results:
MCP pausing causes the alert
Expected results:
Alerts should not be fired
Additional info:
Description of problem:
when more than one release is added to ImageSetConfig.yaml, the number of images is doubled and incorrect; checking the log we can see duplications.
ImageSetConfig.yaml:
=================
[fedora@preserve-fedora-yinzhou test]$ cat /tmp/clid-232.yaml
apiVersion: mirror.openshift.io/v2alpha1
kind: ImageSetConfiguration
mirror:
  platform:
    channels:
    - name: stable-4.16
      minVersion: 4.16.0
      maxVersion: 4.16.0
    - name: stable-4.15
      minVersion: 4.15.0
      maxVersion: 4.15.0
images to copy 958
cat /tmp/sss |grep 1fd628f40d321354832b0f409d2bf9b89910de27bc6263a4fb5a55c25e160a99
✓ 178/958 : (8s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1fd628f40d321354832b0f409d2bf9b89910de27bc6263a4fb5a55c25e160a99
✓ 945/958 : (8s) quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1fd628f40d321354832b0f409d2bf9b89910de27bc6263a4fb5a55c25e160a99
cat /tmp/sss |grep x86_64
✓ 191/958 : (3s) quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64
✓ 383/958 : (2s) quay.io/openshift-release-dev/ocp-release:4.15.0-x86_64
✓ 575/958 : (1s) quay.io/openshift-release-dev/ocp-release:4.15.0-x86_64
✓ 767/958 : (11s) quay.io/openshift-release-dev/ocp-release:4.15.35-x86_64
✓ 958/958 : (5s) quay.io/openshift-release-dev/ocp-release:4.16.0-x86_64
Version-Release number of selected component (if applicable):
./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.0.0-unknown-68e608e2", GitCommit:"68e608e2", GitTreeState:"clean", BuildDate:"2024-10-14T05:57:17Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Clone the oc-mirror repo, cd oc-mirror, run make build
2. Use the imageSetConfig.yaml present above and run the mirror2disk & disk2mirror commands
3. oc-mirror -c /tmp/clid-232.yaml file://CLID-232 --v2 ; oc-mirror -c /tmp/clid-232.yaml --from file://CLID-232 docker://localhost:5000/clid-232 --dest-tls-verify=false --v2
Actual results:
1. see mirror duplication
Expected results:
no dup.
Additional info:
Description of problem:
We have two EAP application server clusters, and a service is created for each of them. We have a route configured to one of the services. When we update the route programmatically to point to the second service/cluster, the response shows it is still attached to the first service.
Steps to Reproduce:
1. Create two separate clusters of the EAP servers
2. Create one service for the first cluster (hsc1) and one for the second one (hsc2)
3. Create a route for the first service (hsc1)
4. Start both of the clusters and assure the replication works
5. Send a request to the first cluster using the route URL - response should contain identification of the first cluster (hsc-1-xxx)
[2024-08-29 11:30:44,544] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com [2024-08-29 11:30:44,654] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
6. Update the route programmatically to redirect to the second service (hsc2)
...
builder.editSpec().editTo().withName("hsc2").endTo().endSpec();
...
7. Send the request again using the same route - in the response there is the same identification of the first cluster
[2024-08-29 11:31:45,098] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
although the service was updated in the route yaml:
... kind: Service name: hsc2
When creating a new route hsc2 for the service hsc2 and using it for the third request, we can see the second cluster was targeted correctly, with its own separate replication working
[2024-08-29 13:43:13,679] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:43:13,790] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:44:14,056] INFO - [ForkJoinPool-1-worker-1] responseString after second route for service hsc2 was used hsc-2-2-614582a9-3c71-4690-81d3-32a616ed8e54 1 with route hsc2-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
I also made a different attempt.
I stopped the test in debug mode after the two requests were executed
[2024-08-30 14:23:43,101] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com [2024-08-30 14:23:43,210] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
Then manually changed the route yaml to use the hsc2 service and send the request manually:
curl http://hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com/Counter
hsc-2-2-84fa1d7e-4045-4708-b89e-7d7f3cd48541 1
responded correctly with the second service/cluster.
Then resumed the test run in debug mode and sent the request programmatically
[2024-08-30 14:24:59,509] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
responded with the wrong first service/cluster.
Actual results: Route directs to the same service and EAP cluster
Expected results: After the update the route should direct to the second service and EAP cluster
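For comparison, the same route update can be sketched with oc (the route name "hsc" and namespace "cihak2" are placeholders inferred from the hostnames in the report, not verified):
oc patch route hsc -n cihak2 --type=merge -p '{"spec":{"to":{"kind":"Service","name":"hsc2"}}}'
# After this, the router is expected to forward new requests to the hsc2 endpoints.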
Additional info:
This issue started to occur from OCP 4.16. Going through the 4.16 release notes and the suggested route configuration didn't reveal any configuration changes which should have been applied.
The code of MultipleClustersTest.twoClustersTest, where this issue was discovered, is available here.
All the logs as well as services and route yamls are attached to the EAPQE jira.
Description of problem:
"destroy cluster" doesn't delete the PVC disks which have the label "kubernetes-io-cluster-<infra-id>: owned"
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-11-27-162629
How reproducible:
Always
Steps to Reproduce:
1. Include the step which sets the cluster default storageclass to the hyperdisk one before ipi-install (see my debug PR https://github.com/openshift/release/pull/59306 and the sketch below)
2. "create cluster", and make sure it succeeds
3. "destroy cluster"
Note: although we confirmed the issue with disk type "hyperdisk-balanced", we believe other disk types have the same issue.
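A minimal sketch of a default hyperdisk StorageClass of the kind referenced in step 1 (the name and parameters are illustrative; the debug PR defines the exact manifest used):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-balanced          # illustrative name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: pd.csi.storage.gke.io   # GCP PD CSI driver
parameters:
  type: hyperdisk-balanced
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true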
Actual results:
The 2 PVC disks of hyperdisk-balanced type are not deleted during "destroy cluster", although the disks have the label "kubernetes-io-cluster-<infra-id>: owned".
Expected results:
The 2 PVC disks should be deleted during "destroy cluster", because they have the correct/expected labels according to which the uninstaller should be able to detect them.
Additional info:
FYI the PROW CI debug job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/59306/rehearse-59306-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.18-installer-rehearse-debug/1861958752689721344
Description of problem:
oc adm node-image create --pxe does not generate only PXE artifacts, but copies everything from the node-joiner pod. Also, the names of the PXE artifacts are not correct (prefixed with agent instead of node)
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. oc adm node-image create --pxe
Actual results:
Everything from the node-joiner pod is copied. The PXE artifact names are wrong.
Expected results:
In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img
* node.x86_64-rootfs.img
* node.x86_64-vmlinuz
Additional info:
Description of problem:
Day2 monitoring does not handle temporary API server disconnection
Version-Release number of selected component (if applicable):
4.17.0-0.ci-2024-08-26-170911
How reproducible:
always in libvirt manually run
Steps to Reproduce:
1. Run an agent install in a libvirt env manually
2. Run a day2 install after the cluster is installed successfully
3. Run 'oc adm node-image monitor' to track the day2 install; when there is a temporary API server disconnection, the monitoring program runs into an error/EOF.
4. Only reproduced in the libvirt env; the baremetal platform is working fine.
Actual results:
Day2 monitoring runs into an error/EOF
Expected results:
Day2 monitoring should run without interruption to track the day2 install in libvirt
Additional info:
Monitoring output link: https://docs.google.com/spreadsheets/d/17cOCfYvqxLHlhzBHkwCnFZDUatDRcG1Ej-HQDTDin0c/edit?gid=0#gid=0
Description of problem:
When attempting to install a specific version of an operator from the web console, the install plan of the latest version of that operator is created if the operator version had a + in it.
Version-Release number of selected component (if applicable):
4.17.6 (Tested version)
How reproducible:
Easily reproducible
Steps to Reproduce:
1. Under Operators > Operator Hub, install an operator with a + character in the version.
2. On the next screen, note that the + in the version text box is missing.
3. Make no changes to the default options and proceed to install the operator.
4. An install plan is created to install the operator with the latest version from the channel.
Actual results:
The install plan is created for the latest version from the channel.
Expected results:
The install plan is created for the requested version.
Additional info:
Notes on the reproducer:
- For step 1: the selected version shouldn't be the latest version from the channel for the purposes of this bug.
- For step 1: The version will need to be selected from the version dropdown to reproduce the bug. If the default version that appears in the dropdown is used, then the bug won't reproduce.
Other Notes:
- This might also happen with other special characters in the version string other than +, but this is not something that I tested.
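For reference, a sketch of the Subscription the console is expected to generate when a specific version is picked (operator name, channel, source, and version below are hypothetical examples, not from the report):
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: example-operator
  namespace: openshift-operators
spec:
  channel: stable
  name: example-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual
  # The bug: when the chosen version contains '+', the console drops the character,
  # so the generated startingCSV no longer pins the requested version and the
  # latest CSV in the channel gets installed instead.
  startingCSV: example-operator.v1.2.3+0.1699999999.p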
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The release signature configmap file is invalid with no name defined
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410011141.p0.g227a9c4.assembly.stream.el9-227a9c4", GitCommit:"227a9c499b6fd94e189a71776c83057149ee06c2", GitTreeState:"clean", BuildDate:"2024-10-01T20:07:43Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1) With ISC:
cat /test/yinzhou/config.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    channels:
    - name: stable-4.16
2) Do mirror2disk + disk2mirror
3) Use the signature configmap to create the resource
Actual results:
3) Failed to create the resource with error:
oc create -f signature-configmap.json
The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required
oc create -f signature-configmap.yaml
The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required
Expected results:
No error
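As a manual workaround sketch, the generated file can be given a name before creating it (assumes yq v4 is available; the name "release-signatures" is hypothetical, since oc-mirror should really generate one itself):
# add a metadata.name to the generated configmap, then create it
yq -i '.metadata.name = "release-signatures"' signature-configmap.yaml
oc create -f signature-configmap.yaml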
Description of problem:
In an Agent-Based installation, the storage network on all nodes is also configured. If both VLAN interfaces in the same L3 switch, used as gateways for the compute cluster management and storage networks of the OCP cluster, have arp-proxy enabled, then the IP collision validation reports errors. The IP collision validation fails because the validation seems to send arp-requests from both interfaces bond0.4082 and bond1.2716 for all addresses used by the compute cluster management and storage networks.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Install the cluster using Agent-based with proxy enabled.
2. Configure both VLAN interfaces in the same L3 switch, used as gateways for the compute cluster management and storage networks of the OCP cluster, with arp-proxy enabled
Actual results:
The IP collision validation fails; the validation seems to send arp-requests from both interfaces bond0.4082 and bond1.2716 for all addresses used by the compute cluster management and storage networks. The validation seems to trigger arp-requests sent out from all NICs of the nodes. If the gateways connected to the different NICs of the node have arp-proxy configured, then the IP collision failure is observed.
Expected results:
Validation should pass in the ARP proxy scenario
Additional info:
In an arp-proxy scenario, an arp reply is not needed from a NIC whose IP address is not in the same subnet as the destination IP address of the arp-request. Generally speaking, when the node tries to communicate with any destination IP address in the same subnet as the IP address of one NIC, it will only send the arp-request from that NIC. So even with arp-proxy in this case, it should not cause any issue.
Description of problem:
Due to the workaround / solution of https://issues.redhat.com/browse/OCPBUGS-42609, namespaces must be created with a specific label to allow the use of primary UDN. This label must be added by the cluster admin - making it impossible for regular users to self-provision their network. With this, the dialog we introduced to the UI where a UDN can be created while defining a Project is no longer functioning (labels cannot be set through Projects). Until a different solution will be introduced, primary UDNs will not be self-service and therefore we should remove the Network tab from Create project dialog.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Open the new project dialog
Actual results:
The "Network" tab is there
Expected results:
It should be removed for now
Additional info:
I'm marking this as critical, since this UI element is very visible and would easily confuse users.
Description of problem:
[must-gather] should collect the clustercsidriver resources of 3rd-party driver operators
Version-Release number of selected component (if applicable):
Client Version: 4.18.0-ec.3 Kustomize Version: v5.4.2 Server Version: 4.18.0-0.nightly-2024-11-05-163516
How reproducible:
Always
Steps to Reproduce:
1. Install an Openshift cluster on Azure.
2. Deploy the SMB CSI driver operator and create the cluster csidriver.
3. Use the oc adm must-gather --dest-dir=./gather-test command to gather the cluster info.
Actual results:
In step 3 the gathered data does not contain the clustercsidriver smb.csi.k8s.io object
wangpenghao@pewang-mac ~ $ omc get clustercsidriver
NAME                 AGE
disk.csi.azure.com   3h
file.csi.azure.com   3h
wangpenghao@pewang-mac ~ $ oc get clustercsidriver
NAME                 AGE
disk.csi.azure.com   4h45m
efs.csi.aws.com      40m
file.csi.azure.com   4h45m
smb.csi.k8s.io       4h13m
wangpenghao@pewang-mac ~ $ ls -l ~/gather-test/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2a3aa11d261a312215bcba80827ab6c75527f44d1ebde54958e7b7798673787c/cluster-scoped-resources/operator.openshift.io/clustercsidrivers
total 32
-rwxr-xr-x@ 1 wangpenghao staff 7191 Nov 6 13:55 disk.csi.azure.com.yaml
-rwxr-xr-x@ 1 wangpenghao staff 7191 Nov 6 13:55 file.csi.azure.com.yaml
Expected results:
In step 3 the gathered data should contain the clustercsidriver smb.csi.k8s.io object
Additional info:
aws efs, gcp filestore also have the same issue
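A possible interim way to capture the missing objects manually until must-gather is fixed (a sketch; the destination directory name is arbitrary):
# collect all ClusterCSIDriver objects, including 3rd-party ones such as smb.csi.k8s.io
oc adm inspect clustercsidrivers --dest-dir=./gather-test-extra
# or grab a single object directly
oc get clustercsidriver smb.csi.k8s.io -o yaml > smb.csi.k8s.io.yaml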
Description of problem:
When running the 4.18 installer QE full function test, the following arm64 instance types were detected and tested successfully, so append them to the installer doc[1]:
* StandardDpdsv6Family
* StandardDpldsv6Family
* StandardDplsv6Family
* StandardDpsv6Family
* StandardEpdsv6Family
* StandardEpsv6Family
[1] https://github.com/openshift/installer/blob/main/docs/user/azure/tested_instance_types_aarch64.md
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Reviewing some 4.17 cluster-ingress-operator logs, I found many (215) of these even when the GatewayAPI feature was disabled:
2024-09-03T08:20:03.726Z INFO operator.gatewayapi_controller controller/controller.go:114 reconciling {"request": {"name":"cluster"}}
This makes it look like the feature was enabled when it was not. Also check for the same in the other gatewayapi controllers in the gateway-service-dns and gatewayclass folders. A search for r.config.GatewayAPIEnabled should show where we are checking whether the feature is enabled.
Version-Release number of selected component (if applicable):
4.17, 4.18 should have the fix
How reproducible:
This was showing up in the CI logs for the e2e-vsphere-ovn-upi test: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-vsphere-ovn-upi/1830872105227390976/artifacts/e2e-vsphere-ovn-upi/gather-extra/artifacts/pods/openshift-ingress-operator_ingress-operator-64df7b9cd4-hqkmh_ingress-operator.log It is probably showing up in all logs to varying degrees.
Steps to Reproduce:
1. Deploy 4.17 2. Review cluster-ingress-operator logs
Actual results:
Seeing a log that makes it look like the GatewayAPI feature is enabled even when it is not.
Expected results:
Only see the log when the GatewayAPI feature is enabled.
Additional info:
The GatewayAPI feature is enabled in the e2e-aws-gatewayapi PR test and any techpreview PR test, and can be manually enabled on a test cluster by running: oc patch featuregates/cluster --type=merge --patch='{"spec":{"featureSet":"CustomNoUpgrade","customNoUpgrade":{"enabled":["GatewayAPI"]}}}'
Managed services marks a couple of nodes as "infra" so user workloads don't get scheduled on them. However, platform daemonsets like iptables-alerter should run there – and the typical toleration for that purpose should be:
tolerations:
- operator: Exists
Instead, the toleration is:
tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"
Examples from other platform DS:
$ for ns in openshift-cluster-csi-drivers openshift-cluster-node-tuning-operator openshift-dns openshift-image-registry openshift-machine-config-operator openshift-monitoring openshift-multus openshift-multus openshift-multus openshift-network-diagnostics openshift-network-operator openshift-ovn-kubernetes openshift-security; do echo "NS: $ns"; oc get ds -o json -n $ns|jq '.items.[0].spec.template.spec.tolerations'; done NS: openshift-cluster-csi-drivers [ { "operator": "Exists" } ] NS: openshift-cluster-node-tuning-operator [ { "operator": "Exists" } ] NS: openshift-dns [ { "key": "node-role.kubernetes.io/master", "operator": "Exists" } ] NS: openshift-image-registry [ { "operator": "Exists" } ] NS: openshift-machine-config-operator [ { "operator": "Exists" } ] NS: openshift-monitoring [ { "operator": "Exists" } ] NS: openshift-multus [ { "operator": "Exists" } ] NS: openshift-multus [ { "operator": "Exists" } ] NS: openshift-multus [ { "operator": "Exists" } ] NS: openshift-network-diagnostics [ { "operator": "Exists" } ] NS: openshift-network-operator [ { "effect": "NoSchedule", "key": "node-role.kubernetes.io/master", "operator": "Exists" } ] NS: openshift-ovn-kubernetes [ { "operator": "Exists" } ] NS: openshift-security [ { "operator": "Exists" } ]
Description of problem:
Bootstrapping currently waits to observe 2 endpoints in the "kubernetes" service in HA topologies. The bootstrap kube-apiserver instance itself appears to be included in that number. Soon after observing 2 (bootstrap instance plus one permanent instance), the bootstrap instance is torn down and leaves the cluster with only 1 instance. Each rollout to that instance causes disruption to kube-apiserver availability until the second permanent instance is started for the first time, easily totaling multiple minutes of 0% kube-apiserver availability.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
[Azure disk/file CSI driver] on ARO HCP cannot provision volumes successfully
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2024-12-13-083421
How reproducible:
Always
Steps to Reproduce:
1. Install an AKS cluster on Azure.
2. Install the hypershift operator on the AKS cluster.
3. Use the hypershift CLI to create a hosted cluster with the Client Certificate mode.
4. Check that the Azure disk/file CSI drivers work well on the hosted cluster.
Actual results:
In step 4: the the azure disk/file csi dirver provision volume failed on hosted cluster # azure disk pvc provision failed $ oc describe pvc mypvc ... Normal WaitForFirstConsumer 74m persistentvolume-controller waiting for first consumer to be created before binding Normal Provisioning 74m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073 External provisioner is provisioning volume for claim "default/mypvc" Warning ProvisioningFailed 74m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073 failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF Warning ProvisioningFailed 71m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8 failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF Normal Provisioning 71m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8 External provisioner is provisioning volume for claim "default/mypvc" ... $ oc logs azure-disk-csi-driver-controller-74d944bbcb-7zz89 -c csi-driver W1216 08:07:04.282922 1 main.go:89] nodeid is empty I1216 08:07:04.290689 1 main.go:165] set up prometheus server on 127.0.0.1:8201 I1216 08:07:04.291073 1 azuredisk.go:213] DRIVER INFORMATION: ------------------- Build Date: "2024-12-13T02:45:35Z" Compiler: gc Driver Name: disk.csi.azure.com Driver Version: v1.29.11 Git Commit: 4d21ae15d668d802ed5a35068b724f2e12f47d5c Go Version: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime Platform: linux/amd64 Topology Key: topology.disk.csi.azure.com/zone I1216 08:09:36.814776 1 utils.go:77] GRPC call: /csi.v1.Controller/CreateVolume I1216 08:09:36.814803 1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":""}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":""}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","parameters":{"csi.storage.k8s.io/pv/name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","csi.storage.k8s.io/pvc/name":"mypvc","csi.storage.k8s.io/pvc/namespace":"default","skuname":"Premium_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":7}}]} I1216 08:09:36.815338 1 controllerserver.go:208] begin to create azure disk(pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316) account type(Premium_LRS) rg(ci-op-zj9zc4gd-12c20-rg) location(centralus) size(1) diskZone() maxShares(0) panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x190c61d] goroutine 153 [running]: sigs.k8s.io/cloud-provider-azure/pkg/provider.(*ManagedDiskController).CreateManagedDisk(0x0, {0x2265cf0, 0xc0001285a0}, 0xc0003f2640) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_managedDiskController.go:127 +0x39d sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).CreateVolume(0xc000564540, {0x2265cf0, 0xc0001285a0}, 0xc000272460) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/controllerserver.go:297 +0x2c59 github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1({0x2265cf0?, 0xc0001285a0?}, {0x1e5a260?, 0xc000272460?}) 
/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6420 +0xcb sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x2265cf0, 0xc0001285a0}, {0x1e5a260, 0xc000272460}, 0xc00017cb80, 0xc00014ea68) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409 github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x1f3e440, 0xc000564540}, {0x2265cf0, 0xc0001285a0}, 0xc00029a700, 0x2084458) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6422 +0x143 google.golang.org/grpc.(*Server).processUnaryRPC(0xc00059cc00, {0x2265cf0, 0xc000128510}, {0x2270d60, 0xc0004f5980}, 0xc000308480, 0xc000226a20, 0x31c8f80, 0x0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1379 +0xdf8 google.golang.org/grpc.(*Server).handleStream(0xc00059cc00, {0x2270d60, 0xc0004f5980}, 0xc000308480) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1790 +0xe8b google.golang.org/grpc.(*Server).serveStreams.func2.1() /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1029 +0x7f created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 16 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1040 +0x125 # azure file pvc provision failed $ oc describe pvc mypvc Name: mypvc Namespace: openshift-cluster-csi-drivers StorageClass: azurefile-csi Status: Pending Volume: Labels: <none> Annotations: volume.beta.kubernetes.io/storage-provisioner: file.csi.azure.com volume.kubernetes.io/storage-provisioner: file.csi.azure.com Finalizers: [kubernetes.io/pvc-protection] Capacity: Access Modes: VolumeMode: Filesystem Used By: <none> Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal ExternalProvisioning 14s (x2 over 14s) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'file.csi.azure.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered. Normal Provisioning 7s (x4 over 14s) file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0 External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/mypvc" Warning ProvisioningFailed 7s (x4 over 14s) file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0 failed to provision volume with StorageClass "azurefile-csi": rpc error: code = Internal desc = failed to ensure storage account: could not list storage accounts for account type Standard_LRS: StorageAccountClient is nil
Expected results:
In step 4, the Azure disk/file CSI drivers should provision volumes successfully on the hosted cluster
Additional info:
Refactor name to Dockerfile.ocp as a better, version independent alternative
Description of problem:
Node was created today with worker label. It was labeled as a loadbalancer to match mcp selector. MCP saw the selector and moved to Updating but the machine-config-daemon pod isn't responding. We tried deleting the pod and it still didn't pick up that it needed to get a new config. Manually editing the desired config appears to workaround the issue but shouldn't be necessary.
Node created today: [dasmall@supportshell-1 03803880]$ oc get nodes worker-048.kub3.sttlwazu.vzwops.com -o yaml | yq .metadata.creationTimestamp '2024-04-30T17:17:56Z' Node has worker and loadbalancer roles: [dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com NAME STATUS ROLES AGE VERSION worker-048.kub3.sttlwazu.vzwops.com Ready loadbalancer,worker 1h v1.25.14+a52e8df MCP shows a loadbalancer needing Update and 0 nodes in worker pool: [dasmall@supportshell-1 03803880]$ oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE loadbalancer rendered-loadbalancer-1486d925cac5a9366d6345552af26c89 False True False 4 3 3 0 87d master rendered-master-47f6fa5afe8ce8f156d80a104f8bacae True False False 3 3 3 0 87d worker rendered-worker-a6be9fb3f667b76a611ce51811434cf9 True False False 0 0 0 0 87d workerperf rendered-workerperf-477d3621fe19f1f980d1557a02276b4e True False False 38 38 38 0 87d Status shows mcp updating: [dasmall@supportshell-1 03803880]$ oc get mcp loadbalancer -o yaml | yq .status.conditions[4] lastTransitionTime: '2024-04-30T17:33:21Z' message: All nodes are updating to rendered-loadbalancer-1486d925cac5a9366d6345552af26c89 reason: '' status: 'True' type: Updating Node still appears happy with worker MC: [dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com -o yaml | grep rendered- machineconfiguration.openshift.io/currentConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9 machineconfiguration.openshift.io/desiredConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9 machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9 machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9 machine-config-daemon pod appears idle: [dasmall@supportshell-1 03803880]$ oc logs -n openshift-machine-config-operator machine-config-daemon-wx2b8 -c machine-config-daemon 2024-04-30T17:48:29.868191425Z I0430 17:48:29.868156 19112 start.go:112] Version: v4.12.0-202311220908.p0.gef25c81.assembly.stream-dirty (ef25c81205a65d5361cfc464e16fd5d47c0c6f17) 2024-04-30T17:48:29.871340319Z I0430 17:48:29.871328 19112 start.go:125] Calling chroot("/rootfs") 2024-04-30T17:48:29.871602466Z I0430 17:48:29.871593 19112 update.go:2110] Running: systemctl daemon-reload 2024-04-30T17:48:30.066554346Z I0430 17:48:30.066006 19112 rpm-ostree.go:85] Enabled workaround for bug 2111817 2024-04-30T17:48:30.297743470Z I0430 17:48:30.297706 19112 daemon.go:241] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 (412.86.202311271639-0) 828584d351fcb58e4d799cebf271094d5d9b5c1a515d491ee5607b1dcf6ebf6b 2024-04-30T17:48:30.324852197Z I0430 17:48:30.324543 19112 start.go:101] Copied self to /run/bin/machine-config-daemon on host 2024-04-30T17:48:30.325677959Z I0430 17:48:30.325666 19112 start.go:188] overriding kubernetes api to https://api-int.kub3.sttlwazu.vzwops.com:6443 2024-04-30T17:48:30.326381479Z I0430 17:48:30.326368 19112 metrics.go:106] Registering Prometheus metrics 2024-04-30T17:48:30.326447815Z I0430 17:48:30.326440 19112 metrics.go:111] Starting metrics listener on 127.0.0.1:8797 2024-04-30T17:48:30.327835814Z I0430 17:48:30.327811 19112 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig 2024-04-30T17:48:30.327932144Z I0430 17:48:30.327923 19112 update.go:2125] Starting 
to manage node: worker-048.kub3.sttlwazu.vzwops.com 2024-04-30T17:48:30.332123862Z I0430 17:48:30.332097 19112 rpm-ostree.go:394] Running captured: rpm-ostree status 2024-04-30T17:48:30.332928272Z I0430 17:48:30.332909 19112 daemon.go:1049] Detected a new login session: New session 1 of user core. 2024-04-30T17:48:30.332935796Z I0430 17:48:30.332926 19112 daemon.go:1050] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh 2024-04-30T17:48:30.368619942Z I0430 17:48:30.368598 19112 daemon.go:1298] State: idle 2024-04-30T17:48:30.368619942Z Deployments: 2024-04-30T17:48:30.368619942Z * ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 2024-04-30T17:48:30.368619942Z Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 2024-04-30T17:48:30.368619942Z Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z) 2024-04-30T17:48:30.368619942Z LayeredPackages: kernel-devel kernel-headers 2024-04-30T17:48:30.368619942Z 2024-04-30T17:48:30.368619942Z ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 2024-04-30T17:48:30.368619942Z Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 2024-04-30T17:48:30.368619942Z Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z) 2024-04-30T17:48:30.368619942Z LayeredPackages: kernel-devel kernel-headers 2024-04-30T17:48:30.368907860Z I0430 17:48:30.368884 19112 coreos.go:54] CoreOS aleph version: mtime=2023-08-08 11:20:41.285 +0000 UTC build=412.86.202308081039-0 imgid=rhcos-412.86.202308081039-0-metal.x86_64.raw 2024-04-30T17:48:30.368932886Z I0430 17:48:30.368926 19112 coreos.go:71] Ignition provisioning: time=2024-04-30T17:03:44Z 2024-04-30T17:48:30.368938120Z I0430 17:48:30.368931 19112 rpm-ostree.go:394] Running captured: journalctl --list-boots 2024-04-30T17:48:30.372893750Z I0430 17:48:30.372884 19112 daemon.go:1307] journalctl --list-boots: 2024-04-30T17:48:30.372893750Z -2 847e119666d9498da2ae1bd89aa4c4d0 Tue 2024-04-30 17:03:13 UTC—Tue 2024-04-30 17:06:32 UTC 2024-04-30T17:48:30.372893750Z -1 9617b204b8b8412fb31438787f56a62f Tue 2024-04-30 17:09:06 UTC—Tue 2024-04-30 17:36:39 UTC 2024-04-30T17:48:30.372893750Z 0 3cbf6edcacde408b8979692c16e3d01b Tue 2024-04-30 17:39:20 UTC—Tue 2024-04-30 17:48:30 UTC 2024-04-30T17:48:30.372912686Z I0430 17:48:30.372891 19112 rpm-ostree.go:394] Running captured: systemctl list-units --state=failed --no-legend 2024-04-30T17:48:30.378069332Z I0430 17:48:30.378059 19112 daemon.go:1322] systemd service state: OK 2024-04-30T17:48:30.378069332Z I0430 17:48:30.378066 19112 daemon.go:987] Starting MachineConfigDaemon 2024-04-30T17:48:30.378121340Z I0430 17:48:30.378106 19112 daemon.go:994] Enabling Kubelet Healthz Monitor 2024-04-30T17:48:31.486786667Z I0430 17:48:31.486747 19112 daemon.go:457] Node worker-048.kub3.sttlwazu.vzwops.com is not labeled node-role.kubernetes.io/master 2024-04-30T17:48:31.491674986Z I0430 17:48:31.491594 19112 daemon.go:1243] Current+desired config: rendered-worker-a6be9fb3f667b76a611ce51811434cf9 2024-04-30T17:48:31.491674986Z I0430 17:48:31.491603 19112 daemon.go:1253] state: Done 2024-04-30T17:48:31.495704843Z I0430 17:48:31.495617 19112 daemon.go:617] Detected a login session before the daemon took over on first boot 2024-04-30T17:48:31.495704843Z I0430 17:48:31.495624 19112 daemon.go:618] Applying annotation: 
machineconfiguration.openshift.io/ssh 2024-04-30T17:48:31.503165515Z I0430 17:48:31.503052 19112 update.go:2110] Running: rpm-ostree cleanup -r 2024-04-30T17:48:32.232728843Z Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1 2024-04-30T17:48:35.755815139Z Freed: 92.3 MB (pkgcache branches: 0) 2024-04-30T17:48:35.764568364Z I0430 17:48:35.764548 19112 daemon.go:1563] Validating against current config rendered-worker-a6be9fb3f667b76a611ce51811434cf9 2024-04-30T17:48:36.120148982Z I0430 17:48:36.120119 19112 rpm-ostree.go:394] Running captured: rpm-ostree kargs 2024-04-30T17:48:36.179660790Z I0430 17:48:36.179631 19112 update.go:2125] Validated on-disk state 2024-04-30T17:48:36.182434142Z I0430 17:48:36.182406 19112 daemon.go:1646] In desired config rendered-worker-a6be9fb3f667b76a611ce51811434cf9 2024-04-30T17:48:36.196911084Z I0430 17:48:36.196879 19112 config_drift_monitor.go:246] Config Drift Monitor started
Version-Release number of selected component (if applicable):
4.12.45
How reproducible:
They can reproduce in multiple clusters
Actual results:
Node stays with rendered-worker config
Expected results:
A MachineConfigPool update should prompt a change to the node's desired config, which the machine-config-daemon pod then applies to the node
Additional info:
here is the latest must-gather where this issue is occurring: https://attachments.access.redhat.com/hydra/rest/cases/03803880/attachments/3fd0cf52-a770-4525-aecd-3a437ea70c9b?usePresignedUrl=true
Description of problem:
The period is placed inside the quotes of the missingKeyHandler i18n error
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always when there is a missingKeyHandler error
Steps to Reproduce:
1. Check the browser console
2. Observe the period is placed inside the quotes
Actual results:
It is placed inside the quotes
Expected results:
It should be placed outside the quotes
Additional info:
Description of problem:
Console plugin details page is throwing error on some specific YAML
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-30-141716
How reproducible:
Always
Steps to Reproduce:
1. Create a ConsolePlugin with the minimum required fields:
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: console-demo-plugin-two
spec:
  backend:
    type: Service
  displayName: OpenShift Console Demo Plugin
2. Visit the consoleplugin details page at /k8s/cluster/console.openshift.io~v1~ConsolePlugin/console-demo-plugin
Actual results:
2. We will see an error page
Expected results:
2. we should not show an error page since ConsolePlugin YAML has every required fields although they are not complete
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Delete the openshift-monitoring/monitoring-plugin-cert secret; the SCO will re-create a new one with different content
Actual results:
- monitoring-plugin is still using the old cert content.
- If the cluster doesn't show much activity, the hash may take time to be updated.
Expected results:
CMO should detect that exact change and run a sync to recompute and set the new hash.
Additional info:
- We shouldn't rely on another change to trigger the sync loop.
- CMO should maybe watch that secret? (its name isn't known in advance).
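A quick manual check sketch (this assumes CMO surfaces the cert content as a hash-like annotation on the deployment's pod template; the exact annotation key is not confirmed here):
# compare the secret's current revision with what the deployment was last rolled out against
oc -n openshift-monitoring get secret monitoring-plugin-cert -o jsonpath='{.metadata.resourceVersion}'
oc -n openshift-monitoring get deployment monitoring-plugin -o yaml | grep -i hash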
In payloads 4.18.0-0.ci-2024-11-01-110334 and 4.18.0-0.nightly-2024-11-01-101707 we observed GCP install failures
Container test exited with code 3, reason Error
---
ails: level=error msg=[
level=error msg= {
level=error msg= "@type": "type.googleapis.com/google.rpc.ErrorInfo",
level=error msg= "domain": "googleapis.com",
level=error msg= "metadatas": {
level=error msg= "consumer": "projects/711936183532",
level=error msg= "quota_limit": "ListRequestsFilterCostOverheadPerMinutePerProject",
level=error msg= "quota_limit_value": "75",
level=error msg= "quota_location": "global",
level=error msg= "quota_metric": "compute.googleapis.com/filtered_list_cost_overhead",
level=error msg= "service": "compute.googleapis.com"
level=error msg= },
level=error msg= "reason": "RATE_LIMIT_EXCEEDED"
level=error msg= },
level=error msg= {
level=error msg= "@type": "type.googleapis.com/google.rpc.Help",
level=error msg= "links": [
level=error msg= {
level=error msg= "description": "The request exceeds API Quota limit, please see help link for suggestions.",
level=error msg= "url": "https://cloud.google.com/compute/docs/api/best-practices#client-side-filter"
level=error msg= }
level=error msg= ]
level=error msg= }
level=error msg=]
level=error msg=, rateLimitExceeded
Patrick Dillon noted that ListRequestsFilterCostOverheadPerMinutePerProject cannot have its quota limit increased.
The problem subsided over the weekend, presumably with fewer jobs run, but has started to appear again. Opening this to track the ongoing issue and potential workarounds.
This contributes to the following test failures for GCP
install should succeed: configuration
install should succeed: overall
Description of problem:
In CAPI, we use a random machineNetwork instead of using the one passed in by the user.
Description of problem:
Upgrade to 4.18 is not working, because the machine-config update is stuck:
$ oc get co/machine-config
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
machine-config 4.17.0-0.nightly-2025-01-13-120007 True True True 133m Unable to apply 4.18.0-rc.4: error during syncRequiredMachineConfigPools: [context deadline exceeded, MachineConfigPool master has not progressed to latest configuration: controller version mismatch for rendered-master-ef1c06aa9aeedcebfa50569c3aa9472a expected a964f19a214946f0e5f1197c545d3805393d0705 has 3594c4b2eb42d8c9e56a146baea52d9c147721b0: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-826ddf793cf0a677228234437446740f, retrying]
The machine-config-controller shows the responsible for that:
$ oc logs -n openshift-machine-config-operator machine-config-controller-69f59598f7-57lkv [...] I0116 13:54:16.605692 1 drain_controller.go:183] node ostest-xgjnz-master-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"openstack-manila-csi-controllerplugin-6754c7589f-dwjtm" -n "openshift-manila-csi-driver": This pod has more than one PodDisruptionBudget, which the eviction subresource does not support.
There are 2 PDBs on the manila namespace:
$ oc get pdb -n openshift-manila-csi-driver
NAME                                        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
manila-csi-driver-controller-pdb            N/A             1                 1                     80m
openstack-manila-csi-controllerplugin-pdb   N/A             1                 1                     134m
So a workaround is to remove the pdb openstack-manila-csi-controllerplugin-pdb.
Version-Release number of selected component (if applicable):
From 4.17.0-0.nightly-2025-01-13-120007 to 4.18.0-rc.4 on top of RHOS-17.1-RHEL-9-20241030.n.1
How reproducible:
Always
Steps to Reproduce:
1. Install latest 4.17
2. Update to 4.18, for example:
oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.18.0-rc.4 --allow-explicit-upgrade --force
Additional info: must-gather on private comment.
Description of problem:
openshift-install has no zsh completion
Version-Release number of selected component (if applicable):
How reproducible:
everytime
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
PR opened at https://github.com/openshift/installer/pull/9116
Description of problem:
There is no clipValue function for the annotation router.openshift.io/haproxy.health.check.interval. Once an abnormal value is set, the router-default starts to report the following messages:
[ALERT] (50) : config : [/var/lib/haproxy/conf/haproxy.config:13791] : 'server be_secure:xxx:httpd-gateway-route/pod:xxx:xxx-gateway-service:pass-through-https:10.129.xx.xx:8243' : timer overflow in argument <50000d> to <inter> of server pod:xxx:xxx:pass-through-https:10.129.xx.xx:8243, maximum value is 2147483647 ms (~24.8 days).
In the above case, the value 50000d was passed to the route annotation router.openshift.io/haproxy.health.check.interval accidentally.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Run the following script and this will break the cluster:
oc get routes -A | awk '{print $1 " " $2}' | tail -n+2 | while read line; do
  read -r namespace routename <<<$(echo $line)
  echo -n "NS: $namespace | "
  echo "ROUTENAME: $routename"
  CMD="oc annotate route -n $namespace $routename --overwrite router.openshift.io/haproxy.health.check.interval=50000d"
  echo "Annotating route with:"
  echo $CMD ; eval "$CMD"
  echo "---"
done
Actual results:
The alert messages are reported and the router-default pod never reaches the ready state.
Expected results:
Clip the value in order to prevent the issue
Additional info:
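For reference, a sketch of resetting the annotation to a sane value on an affected route (route name and namespace are placeholders):
oc annotate route <route-name> -n <namespace> --overwrite \
  router.openshift.io/haproxy.health.check.interval=5s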
Description of problem:
A cluster with a default (empty) `configs.spec.samplesRegistry` field but with whitelist entries in `image.spec.registrySources.allowedRegistries` causes the openshift-samples CO to go into a degraded state.
Version-Release number of selected component (if applicable):
4.13.30, 4.13.32
How reproducible:
100%
Steps to Reproduce:
1. Add the whitelist entries in image.spec.registrySources.allowedRegistries:
~~~
oc get image.config/cluster -o yaml
spec:
  registrySources:
    allowedRegistries:
    - registry.apps.example.com
    - quay.io
    - registry.redhat.io
    - image-registry.openshift-image-registry.svc:5000
    - ghcr.io
    - quay.apps.example.com
~~~
2. Delete the pod, so it recreates:
~~~
oc delete pod -l name=cluster-samples-operator -n openshift-cluster-samples-operator
~~~
3. The openshift-samples CO goes to a degraded state:
~~~
# oc get co openshift-samples
NAME                VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
openshift-samples   4.13.30   True        True          True       79m     Samples installation in error at 4.13.30: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
~~~
4. The configs.samples spec is empty:
~~~
# oc get configs.samples.operator.openshift.io cluster -o jsonpath='{.spec}{"\n"}'
{"architectures":["x86_64"],"managementState":"Managed"}
~~~
Actual results:
The openshift-samples CO goes to a degraded state.
Expected results:
The openshift-samples CO should remain in a healthy state.
Additional info:
We had a Bug (https://bugzilla.redhat.com/show_bug.cgi?id=2027745) earlier which was fixed in OCP 4.10.3 as per the errata (https://access.redhat.com/errata/RHSA-2022:0056). One of my customers faced this issue when they upgraded the cluster from 4.12 to 4.13.32. As a workaround, updating the below lines under `image.config.spec` helped.
~~~
allowedRegistriesForImport
- domainName: registry.redhat.io
  insecure: false
~~~
In order to fix security issue https://github.com/openshift/assisted-service/security/dependabot/94
Description of problem:
Missing translation for "Read write once pod (RWOP)" in ja and zh
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Beginning with 4.19.0-0.nightly-2024-11-27-025041 this job failed with a pattern I don't recognize.
I'll note some other aws jobs failed on the same payload which looked like infra issues; however this test re-ran in full and so its timing was very different.
Then it failed with much the same pattern on the next payload too.
The failures are mainly on tests like these:
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set initially, in a homogeneous default environment, should expose default metrics [Suite:openshift/conformance/parallel]
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set in a heterogeneous environment, should revert to default collection profile when an empty collection profile value is specified [Suite:openshift/conformance/parallel]
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set in a heterogeneous environment, should expose information about the applied collection profile using meta-metrics [Suite:openshift/conformance/parallel]
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set in a heterogeneous environment, should have at least one implementation for each collection profile [Suite:openshift/conformance/parallel]
[sig-instrumentation][OCPFeatureGate:MetricsCollectionProfiles] The collection profiles feature-set in a homogeneous minimal environment, should hide default metrics [Suite:openshift/conformance/parallel]
Each has a run where it looks like something timed out:
fail [github.com/openshift/origin/test/extended/prometheus/collection_profiles.go:99]: Interrupted by User Ginkgo exit error 1: exit with code 1
and a second run failing to update configmap cluster-monitoring-config
{ fail [github.com/openshift/origin/test/extended/prometheus/collection_profiles.go:197]: Expected
<*errors.StatusError | 0xc006738280>:
Operation cannot be fulfilled on configmaps "cluster-monitoring-config": the object has been modified; please apply your changes to the latest version and try again
{
  ErrStatus:
    code: 409
    details:
      kind: configmaps
      name: cluster-monitoring-config
    message: 'Operation cannot be fulfilled on configmaps "cluster-monitoring-config": the object has been modified; please apply your changes to the latest version and try again'
    metadata: {}
    reason: Conflict
    status: Failure,
}
to be nil
Ginkgo exit error 1: exit with code 1}
Description of problem:
If the install is performed with an AWS user missing the `ec2:DescribeInstanceTypeOfferings`, the installer will use a hardcoded instance type from the set of non-edge machine pools. This can potentially cause the edge node to fail during provisioning, since the instance type doesn't take into account edge/wavelength zones support. Because edge nodes are not needed for the installation to complete, the issue is not noticed by the installer, only by inspecting the status of the edge nodes.
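An illustrative install-config.yaml edge pool with an explicit instance type, which avoids the hardcoded fallback (the instance type and zone below are examples only, not recommendations from the report):
compute:
- name: edge
  replicas: 1
  platform:
    aws:
      type: m5.2xlarge        # explicitly set; otherwise the installer falls back to a hardcoded type
      zones:
      - us-east-1-nyc-1a      # example Local Zone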
Version-Release number of selected component (if applicable):
4.16+ (since edge nodes support was added)
How reproducible:
always
Steps to Reproduce:
1. Specify an edge machine pool in the install-config without an instance type 2. Run the install with an user without `ec2:DescribeInstanceTypeOfferings` 3.
Actual results:
In CI the `node-readiness` test step will fail and the edge nodes will show
errorMessage: 'error launching instance: The requested configuration is currently not supported. Please check the documentation for supported configurations.'
errorReason: InvalidConfiguration
Expected results:
Either 1. the permission is always required when instance type is not set for an edge pool; or 2. a better instance type default is used
Additional info:
Example CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1862140149505200128
Description of the problem:
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
Description of problem:
When deploying with endpoint overrides, the block CSI driver will try to use the default endpoints rather than the ones specified.
Description of problem:
CSI Operator doesn't propagate HCP labels to 2nd level operands
Version-Release number of selected component (if applicable):
4.19.0
How reproducible:
100%
Steps to Reproduce:
1. Create hostedCluster with .spec.Labels
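For illustration, a minimal sketch of step 1; the cluster name, namespace, and label are assumptions, and other required HostedCluster fields are omitted:

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example                  # assumed name
  namespace: clusters            # assumed namespace
spec:
  labels:
    example.com/team: storage    # assumed label expected to propagate to the control-plane pods
  # ... remaining HostedCluster spec omitted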
Actual results:
aws-ebs-csi-driver-controller, aws-ebs-csi-driver-operator, csi-snapshot-controller, csi-snapshot-webhook pods don't have the specified labels.
Expected results:
aws-ebs-csi-driver-controller, aws-ebs-csi-driver-operator, csi-snapshot-controller, csi-snapshot-webhook pods have the specified labels.
Additional info:
Description of problem:
Failed to create a disconnected cluster using HCP/HyperShift CLI
Version-Release number of selected component (if applicable):
4.19 4.18
How reproducible:
100%
Steps to Reproduce:
1. Create a disconnected hosted cluster with the hcp CLI 2. Run the command from an environment that cannot access the payload.
Actual results:
/tmp/hcp create cluster agent --cluster-cidr fd03::/48 --service-cidr fd04::/112 --additional-trust-bundle=/tmp/secret/registry.2.crt --network-type=OVNKubernetes --olm-disable-default-sources --name=b2ce1d5218a2c7b561d6 --pull-secret=/tmp/.dockerconfigjson --agent-namespace=hypershift-agents --namespace local-cluster --base-domain=ostest.test.metalkube.org --api-server-address=api.b2ce1d5218a2c7b561d6.ostest.test.metalkube.org --image-content-sources /tmp/secret/mgmt_icsp.yaml --ssh-key=/tmp/secret/id_rsa.pub --release-image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:7acdfad179f4571cbf211a87bce87749a1576b72f1d57499e6d9be09b0c4d31d
2024-12-31T08:01:05Z ERROR Failed to create cluster {"error": "failed to retrieve manifest virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:7acdfad179f4571cbf211a87bce87749a1576b72f1d57499e6d9be09b0c4d31d: failed to create repository client for https://virthost.ostest.test.metalkube.org:5000: Get \"https://virthost.ostest.test.metalkube.org:5000/v2/\": Internal Server Error"}
github.com/openshift/hypershift/product-cli/cmd/cluster/agent.NewCreateCommand.func1
  /remote-source/app/product-cli/cmd/cluster/agent/create.go:32
github.com/spf13/cobra.(*Command).execute
  /remote-source/app/vendor/github.com/spf13/cobra/command.go:985
github.com/spf13/cobra.(*Command).ExecuteC
  /remote-source/app/vendor/github.com/spf13/cobra/command.go:1117
github.com/spf13/cobra.(*Command).Execute
  /remote-source/app/vendor/github.com/spf13/cobra/command.go:1041
github.com/spf13/cobra.(*Command).ExecuteContext
  /remote-source/app/vendor/github.com/spf13/cobra/command.go:1034
main.main
  /remote-source/app/product-cli/main.go:59
runtime.main
  /usr/lib/golang/src/runtime/proc.go:272
Error: failed to retrieve manifest virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:7acdfad179f4571cbf211a87bce87749a1576b72f1d57499e6d9be09b0c4d31d: failed to create repository client for https://virthost.ostest.test.metalkube.org:5000: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": Internal Server Error
Expected results:
The disconnected hosted cluster can be created successfully
Additional info:
Description of problem:
The 'Plus' button in the 'Edit Pod Count' popup window overlaps the input field, which is incorrect.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-05-103644
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Workloads -> ReplicaSets page, choose one resource, click the kebab menu button, and choose 'Edit Pod count' 2. 3.
Actual results:
The Layout is incorrect
Expected results:
The 'Plus' button in the 'Edit Pod Count' popup window should not overlap the input field
Additional info:
Snapshot: https://drive.google.com/file/d/1mL7xeT7FzkdsM1TZlqGdgCP5BG6XA8uh/view?usp=drive_link https://drive.google.com/file/d/1qmcal_4hypEPjmG6PTG11AJPwdgt65py/view?usp=drive_link
Description of problem:
CNO doesn't propagate HCP labels to 2nd level operands
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create hostedCluster with .spec.Labels
Actual results:
cloud-network-config-controller, multus-admission-controller, network-node-identity, ovnkube-control-plane-6dd8775f97-75f89 pods don't have the specified labels.
Expected results:
cloud-network-config-controller, multus-admission-controller, network-node-identity, ovnkube-control-plane-6dd8775f97-75f89 pods have the specified labels.
Additional info:
Currently, in most of the assisted installer components' CI images, we don't have a way to tell from which commit reference the image was built. Since we use an image stream for each component, and we import these streams from one CI component configuration to another, we might end up with images that are not up to date. In this case, we would like to have the ability to check whether this is actually the case.
Description of problem:
Based on what was discussed in bug OCPBUGS-46514, the OpenShift installer should not allow the creation of a cluster with different hostPrefix values for ClusterNetwork CIDRs of the same IP family.
Version-Release number of selected component (if applicable):
all the supported releases.
Description of problem:
In addition to users that update resources often, the audit log analyzer should find resources that are updated often. The existing tests don't trigger when a resource is updated by different users or is not namespaced.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
every time
Steps to Reproduce:
1. Create the dashboard with a bar chart and sort query result asc. 2. 3.
Actual results:
bar goes outside of the border
Expected results:
The bar should not go outside of the border
Additional info:
screenshot: https://drive.google.com/file/d/1xPRgenpyCxvUuWcGiWzmw5kz51qKLHyI/view?usp=drive_link
Description of problem:
DEBUG Creating ServiceAccount for control plane nodes
DEBUG Service account created for XXXXX-gcp-r4ncs-m
DEBUG Getting policy for openshift-dev-installer
DEBUG adding roles/compute.instanceAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member
DEBUG adding roles/compute.networkAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member
DEBUG adding roles/compute.securityAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member
DEBUG adding roles/storage.admin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: failed to add master roles: failed to set IAM policy, unexpected error: googleapi: Error 400: Service account XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com does not exist., badRequest
It appears that the Service account was created correctly. The roles are assigned to the service account. It is possible that there needs to be a "wait for action to complete" on the server side to ensure that this will all be ok.
Version-Release number of selected component (if applicable):
How reproducible:
Random. Appears to be a sync issue
Steps to Reproduce:
1. Run the installer for a normal GCP basic install 2. 3.
Actual results:
Installer fails saying that the Service Account that the installer created does not have the permissions to perform an action. Sometimes it takes numerous tries for this to happen (very intermittent).
Expected results:
Successful install
Additional info:
Description of problem:
OpenShift Virtualization allows hotplugging block volumes into its pods, which relies on the fact that changing the cgroup corresponding to the PID of the container suffices. crun is test-driving some changes it integrated recently: it now configures two cgroups, the `*.scope` cgroup and a sub-cgroup called `container`, whereas before the parent existed as a sort of no-op (it wasn't configured, so all devices were allowed, for example). This breaks volume hotplug, since applying the device filter only to the sub-cgroup is no longer enough.
Version-Release number of selected component (if applicable):
4.18.0 RC2
How reproducible:
100%
Steps to Reproduce:
1. Block volume hotplug to VM 2. 3.
Actual results:
Failure
Expected results:
Success
Additional info:
https://kubevirt.io/user-guide/storage/hotplug_volumes/
https://github.com/sclorg/nodejs-ex
image-registry.openshift-image-registry.svc:5000/demo/nodejs-app
image-registry.openshift-image-registry.svc:5000/openshift/nodejs
BuildRun fails with the following error:
[User error] invalid input params for task nodejs-app-6cf5j-gk6f9: param types don't match the user-specified type: [registries-block registries-insecure]
BuildRun runs successfully
https://gist.github.com/vikram-raj/fa67186f1860612b5ad378655085745e
Description of problem:
Installations on Google Cloud require the constraints/compute.vmCanIpForward to not be enforced. Error: time=\"2024-12-16T10:20:27Z\" level=debug msg=\"E1216 10:20:27.538990 97 reconcile.go:155] \\"Error creating an instance\\" err=\\"googleapi: Error 412: Constraint constraints/compute.vmCanIpForward violated for projects/ino-paas-tst. Enabling IP forwarding is not allowed for projects/ino-paas-tst/zones/europe-west1-b/instances/paas-osd-tst2-68r4m-master-0., conditionNotMet\\" controller=\\"gcpmachine\\" controllerGroup=\\"infrastructure.cluster.x-k8s.io\\" controllerKind=\\"GCPMachine\\" GCPMachine=\\"openshift-cluster-api-guests/paas-osd-tst2-68r4m-master-0\\" namespace=\\"openshift-cluster-api-guests\\" reconcileID=\\"3af74f44-96fe-408a-a0ad-9d63f023d2ee\\" name=\\"paas-osd-tst2-68r4m-master-0\\" zone=\\"europe-west1-b\\"\"
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Every Time
Steps to Reproduce:
1. Enable constraints/compute.vmCanIpForward on a project 2. Install OSD 4.17 on that project 3. Installation fails
Actual results:
Installation fails
Expected results:
Installation does not fail
Additional info:
More info in the attachments
Description of problem:
Improve the OpenShift installer for Azure deployments to comply with PCI-DSS/BAFIN regulations. The OpenShift installer utilizes the github.com/hashicorp/terraform-provider-azurerm module, which in versions < 4 has the cross_tenant_replication_enabled parameter set to true. Two options available to fix this are: 1. adjust the OpenShift installer to create the StorageAccount resource [1] as requested, with the default set to FALSE; 2. upgrade the terraform-provider-azurerm module version used by the OpenShift installer to 4.x, where this parameter now defaults to FALSE. [1] https://github.com/hashicorp/terraform-provider-azurerm/blob/57cd1c81d557a49e18b2f49651a4c741b465937b/internal/services/storage/storage_account_resource.go#L212 This security violation blocks using and scaling clusters in public cloud environments for the banking and financial industry, which needs to comply with BAFIN and PCI-DSS regulations. Affected packages or components: OpenShift Installer 4.x. Azure compliance policy: https://learn.microsoft.com/en-us/azure/storage/common/security-controls-policy.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Due to the recent changes, using the oc 4.17 adm node-image commands on a 4.18 OCP cluster doesn't work
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. oc adm node-image create / monitor 2. 3.
Actual results:
The commands fail
Expected results:
The commands should work as expected
Additional info:
Description of problem:
In Azure, there are 2 regions that don't have availability zones or availability set fault domains (centraluseuap, eastusstg). They are test regions, one of which is in-use by the ARO team. Machine API provider seems to be hardcoding an availability set fault domain count of 2 in creation of the machineset: https://github.com/openshift/machine-api-provider-azure/blob/main/pkg/cloud/azure/services/availabilitysets/availabilitysets.go#L32, so if there is not at least a fault domain count of 2 in the target region, the install will fail because worker nodes get a Failed status. This is the error from Azure, reported by the machine API: `The specified fault domain count 2 must fall in the range 1 to 1.` Because of this, the regions are not able to support OCP clusters.
Version-Release number of selected component (if applicable):
Observed in 4.15
How reproducible:
Very
Steps to Reproduce:
1. Attempt creation of an OCP cluster in centraluseuap or eastusstg regions 2. Observe worker machine failures
Actual results:
Worker machines get a failed state
Expected results:
Worker machines are able to start. I am guessing that this would happen via dynamic setting of the availability set fault domain count rather than hardcoding it to 2, which right now just happens to work in most regions in Azure because the fault domain counts are typically at least 2. In upstream, it looks like we're dynamically setting this by querying the amount of fault domains in a region: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/40f0fabc264388de02a88de7fbe400c21d22e7e2/azure/services/availabilitysets/spec.go#L70
Additional info:
Description of problem:
Ratcheting validation was implemented and made beta in 1.30. Validation ratcheting works for changes to the main resource, but does not work when applying updates to a status subresource. Details in https://github.com/kubernetes/kubernetes/issues/129503
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install 4.17 2. Set powervs serviceEndpoints in the platformStatus to a valid lowercase string 3. Upgrade to 4.18 - validation has changed 4. Attempt to update an adjacent status field
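For reference, a sketch of the status stanza involved in step 2, assuming the Power VS platform status shape; the endpoint name and URL are illustrative only:

status:
  platformStatus:
    type: PowerVS
    powervs:
      serviceEndpoints:
      - name: cos                                                        # assumed endpoint name
        url: https://s3.us-south.cloud-object-storage.appdomain.cloud    # assumed lowercase URL accepted by the 4.17 validation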
Actual results:
Validation fails and rejects the update
Expected results:
Ratcheting should kick in and accept the object
Additional info:
At "-v=2" Says nothing about telemetry being disabled on the cluster https://docs.openshift.com/container-platform/4.17/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html
maybe it does in v>2, check that.
See https://issues.redhat.com/browse/OCPBUGS-45683
Description of problem:
clicking on 'create a Project' button on Getting Started page doesn't work
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-25-210451
How reproducible:
Always
Steps to Reproduce:
1. A new normal user login to OCP web console, user will be redirected to Getting Started page 2. try to create a project via 'create a Project' button in message "Select a Project to start adding to it or create a Project." 3.
Actual results:
Clicking the 'create a Project' button doesn't open the project creation modal
Expected results:
As indicated, 'create a Project' should open the project creation modal
Additional info:
Description of problem:
When mirroring the OCP payload by digest, oc-mirror fails with the error: invalid destination name docker://ci-op-n2k1twzy-c1a88-bastion-mirror-registry-xxxxxxxx-zhouy.apps.yinzhou-1031.qe.devcluster.openshift.com/ci-op-n2k1twzy/release/openshift/release-images:: invalid reference format
Version-Release number of selected component (if applicable):
./oc-mirror.rhel8 version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410251041.p0.g95f0611.assembly.stream.el9-95f0611", GitCommit:"95f0611c1dc9584a4a9e857912b9eaa539234bbc", GitTreeState:"clean", BuildDate:"2024-10-25T11:28:19Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.module+el8.10.0+22325+dc584f75) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Create an imageset config with a digest for the OCP payload:
$ cat config.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  platform:
    release: registry.ci.openshift.org/ocp/release@sha256:e87cdacdf5c575ff99d4cca7ec38758512a408ac1653fef8dd7b2c4b85e295f4
2. Run the mirror-to-mirror command:
$ ./oc-mirror.rhel8.18 -c config.yaml docker://ci-op-n2k1twzy-c1a88-bastion-mirror-registry-xxxxxxxx-zhouy.apps.yinzhou-1031.qe.devcluster.openshift.com/ci-op-n2k1twzy/release --dest-tls-verify=false --v2 --workspace file://out1 --authfile auth.json
Actual results:
2. hit error : ✗ 188/188 : (2s) registry.ci.openshift.org/ocp/release@sha256:e87cdacdf5c575ff99d4cca7ec38758512a408ac1653fef8dd7b2c4b85e295f4 2024/10/31 06:20:03 [INFO] : 👋 Goodbye, thank you for using oc-mirror 2024/10/31 06:20:03 [ERROR] : invalid destination name docker://ci-op-n2k1twzy-c1a88-bastion-mirror-registry-xxxxxxxx-zhouy.apps.yinzhou-1031.qe.devcluster.openshift.com/ci-op-n2k1twzy/release/openshift/release-images:: invalid reference format
Expected results:
3. no error
Additional info:
compared with 4.17 oc-mirror, no such issue : ./oc-mirror -c config.yaml docker://ci-op-n2k1twzy-c1a88-bastion-mirror-registry-xxxxxxxx-zhouy.apps.yinzhou-1031.qe.devcluster.openshift.com/ci-op-n2k1twzy/release --dest-tls-verify=false --v2 --workspace file://out1 --authfile auth.json 2024/10/31 06:23:04 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/10/31 06:23:04 [INFO] : 👋 Hello, welcome to oc-mirror ... ✓ 188/188 : (1s) registry.ci.openshift.org/ocp/release@sha256:e87cdacdf5c575ff99d4cca7ec38758512a408ac1653fef8dd7b2c4b85e295f4 2024/10/31 06:27:58 [INFO] : === Results === 2024/10/31 06:27:58 [INFO] : ✅ 188 / 188 release images mirrored successfully 2024/10/31 06:27:58 [INFO] : 📄 Generating IDMS file... 2024/10/31 06:27:58 [INFO] : out1/working-dir/cluster-resources/idms-oc-mirror.yaml file created 2024/10/31 06:27:58 [INFO] : 📄 No images by tag were mirrored. Skipping ITMS generation. 2024/10/31 06:27:58 [INFO] : 📄 No catalogs mirrored. Skipping CatalogSource file generation. 2024/10/31 06:27:58 [INFO] : mirror time : 4m54.452548695s 2024/10/31 06:27:58 [INFO] : 👋 Goodbye, thank you for using oc-mirror
Description of problem:
When setting up the "webhookTokenAuthenticator" the oauth configure "type" is set to "None". Then controller sets the console configmap with "authType=disabled". Which will cause that the console pod goes in the crash loop back due to the not allowed type: Error: validate.go:76] invalid flag: user-auth, error: value must be one of [oidc openshift], not disabled. This worked before on 4.14, stopped working on 4.15.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.15
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
The console can't start; it seems this configuration is not allowed for the console.
Expected results:
Additional info:
Description of problem:
OWNERS file updated to include prabhakar and Moe as owners and reviewers
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is to facilitate easy backports via automation
Description of problem:
When overriding the VPC endpoint in a PowerVS deployment, the VPC endpoint override is ignored by MAPI
Version-Release number of selected component (if applicable):
How reproducible:
easily
Steps to Reproduce:
1. Deploy a disconnected cluster 2. network operator will fail to come up 3.
Actual results:
Deploy fails and endpoint is ignored
Expected results:
Deploy should succeed with endpoint honored
Additional info:
Description of problem:
Finding the Console plugins list can be challenging as it is not in the primary nav. We should add it to the primary nav so it is easier to find.
Description of problem:
After the upgrade to OpenShift Container Platform 4.17, it's being observed that aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics is reporting target down state. When checking the newly created Container one can find the below logs, that may explain the effect seen/reported. $ oc logs aws-efs-csi-driver-controller-5b8d5cfdf4-zwh67 -c kube-rbac-proxy-8211 W1119 07:53:10.249934 1 deprecated.go:66] ==== Removed Flag Warning ====================== logtostderr is removed in the k8s upstream and has no effect any more. =============================================== I1119 07:53:10.250382 1 kube-rbac-proxy.go:233] Valid token audiences: I1119 07:53:10.250431 1 kube-rbac-proxy.go:347] Reading certificate files I1119 07:53:10.250645 1 kube-rbac-proxy.go:395] Starting TCP socket on 0.0.0.0:9211 I1119 07:53:10.250944 1 kube-rbac-proxy.go:402] Listening securely on 0.0.0.0:9211 I1119 07:54:01.440714 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:54:19.860038 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:54:31.432943 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:54:49.852801 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:55:01.433635 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:55:19.853259 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:55:31.432722 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:55:49.852606 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:56:01.432707 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:56:19.853137 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:56:31.440223 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:56:49.856349 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:57:01.432528 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:57:19.853132 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:57:31.433104 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:57:49.852859 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:58:01.433321 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:58:19.853612 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.17
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.17 2. Install aws-efs-csi-driver-operator 3. Create efs.csi.aws.com CSIDriver object and wait for the aws-efs-csi-driver-controller to roll out.
Actual results:
The below Target Down Alert is being raised 50% of the aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics targets in Namespace openshift-cluster-csi-drivers namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.
Expected results:
The ServiceMonitor endpoint should be reachable and properly responding with the desired information to monitor the health of the component.
Additional info:
Description of problem:
The --report and --pxe flags were introduced in 4.18. They should be marked as experimental until 4.19.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Our e2e setup `go install`s a few packages with the `@latest` tag. `go install` does not take `go.mod` into consideration, so on older branches we can pull package versions that are not compatible with the system Go version.
Version-Release number of selected component (if applicable):
All branches using Go < 1.23
How reproducible:
always on branch <= 4.18
Steps to Reproduce:
1. 2. 3.
Actual results:
./test/e2e/e2e-simple.sh ././bin/oc-mirror /go/src/github.com/openshift/oc-mirror/test/e2e/operator-test.17343 /go/src/github.com/openshift/oc-mirror go: downloading github.com/google/go-containerregistry v0.20.3 go: github.com/google/go-containerregistry/cmd/crane@latest: github.com/google/go-containerregistry@v0.20.3 requires go >= 1.23.0 (running go 1.22.9; GOTOOLCHAIN=local) /go/src/github.com/openshift/oc-mirror/test/e2e/lib/util.sh: line 17: PID_DISCONN: unbound variable https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oc-mirror/1006/pull-ci-openshift-oc-mirror-release-4.18-e2e/1879913390239911936
Expected results:
The package version selected is compatible with the system Go version.
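One possible direction, sketched as an assumption rather than the actual patch, is to pin the tools to explicit versions whose go directives match the branch toolchain instead of using @latest:

# hypothetical pin; the exact tag would be chosen per release branch
$ go install github.com/google/go-containerregistry/cmd/crane@v0.20.2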
Additional info:
Description of problem:
If resolv-prepender is triggered by the path unit before the dispatcher script has populated the env file, it fails because the env file is mandatory. We should make it optional by using the EnvironmentFile=- prefix (i.e. EnvironmentFile=-/run/resolv-prepender/env).
Version-Release number of selected component (if applicable):
4.16
How reproducible:
$ systemctl cat on-prem-resolv-prepender.service
# /etc/systemd/system/on-prem-resolv-prepender.service
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0

[Service]
Type=oneshot
# Would prefer to do Restart=on-failure instead of this bash retry loop, but
# the version of systemd we have right now doesn't support it. It should be
# available in systemd v244 and higher.
ExecStart=/bin/bash -c " \
  until \
    /usr/local/bin/resolv-prepender.sh; \
  do \
    sleep 10; \
  done"
EnvironmentFile=/run/resolv-prepender/env
$ systemctl cat crio-wipe.service
No files found for crio-wipe.service.
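A minimal sketch of the proposed change, written as a systemd drop-in so the shipped unit stays intact; the drop-in path is an assumption for illustration:

# /etc/systemd/system/on-prem-resolv-prepender.service.d/10-optional-env.conf (assumed path)
[Service]
# Reset the list, then re-add the file with the '-' prefix so a missing file is not fatal
EnvironmentFile=
EnvironmentFile=-/run/resolv-prepender/env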
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
If the vCenter cluster has no ESXi hosts, importing the OVA fails. Add a more sane error message.
To do
Today, when source images are by digest only, oc-mirror applies a default tag:
This should be unified.
Please review the following PR: https://github.com/openshift/service-ca-operator/pull/246
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Reviving rhbz#1948087, the kube-storage-version-migrator ClusterOperator occasionally goes Available=False with reason=KubeStorageVersionMigrator_Deploying. For example, this run includes:
: [bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available 1h34m30s { 1 unexpected clusteroperator state transitions during e2e test run. These did not match any known exceptions, so they cause this test-case to fail: Oct 03 22:09:07.933 - 33s E clusteroperator/kube-storage-version-migrator condition/Available reason/KubeStorageVersionMigrator_Deploying status/False KubeStorageVersionMigratorAvailable: Waiting for Deployment
But that is a node rebooting into newer RHCOS, and does not warrant immediate admin intervention. Teaching the KSVM operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
4.8 and 4.15. Possibly all supported versions of the KSVM operator have this exposure.
Looks like many (all?) 4.15 update jobs have near 100% reproducibility for some kind of issue with KSVM going Available=False; see Actual results below. These are likely for reasons that do not require admin intervention, although figuring that out is tricky today; feel free to push back if you feel that some of these do warrant immediate admin intervention.
w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/kube-storage-version-migrator+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-hypershift-release-4.15-periodics-e2e-kubevirt-conformance (all) - 2 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 42% failed, 163% of failures match = 68% impact periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 18 runs, 61% failed, 118% of failures match = 72% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-aws-sdn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 4 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 19 runs, 47% failed, 189% of failures match = 89% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 9 runs, 78% failed, 86% of failures match = 67% impact periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 11 runs, 64% failed, 114% of failures match = 73% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 65 runs, 45% failed, 169% of failures match = 75% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 6 runs, 50% failed, 133% of failures match = 67% impact periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 75 runs, 24% failed, 361% of failures match = 87% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 75 runs, 29% failed, 277% of failures match = 81% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 50% failed, 175% of failures match = 88% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 74 runs, 36% failed, 185% of failures match = 68% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 69 runs, 49% failed, 156% of failures match = 77% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 7 runs, 57% failed, 175% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 250% of failures match = 83% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 6 runs, 100% failed, 17% of failures match = 17% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 60 runs, 38% failed, 187% of failures match = 72% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-upgrade (all) - 6 runs, 33% failed, 200% of failures match = 67% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-bm-upgrade (all) - 7 runs, 29% failed, 300% of failures match = 86% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 7 runs, 71% failed, 80% of failures match = 57% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 6 runs, 50% failed, 200% of failures match = 100% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) 
- 6 runs, 100% failed, 83% of failures match = 83% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 100% failed, 83% of failures match = 83% impact periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 13 runs, 54% failed, 71% of failures match = 38% impact periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 16 runs, 63% failed, 70% of failures match = 44% impact
KSVM goes Available=False if and only if immediate admin intervention is appropriate.
Description of problem:
This bug is filed as a result of https://access.redhat.com/support/cases/#/case/03977446. Although both nodes' topologies are equivalent, PPC reported a false negative: Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1.TBD 2. 3.
Actual results:
Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
Expected results:
The topologies match, so the PPC should work fine
Additional info:
Description of problem:
The font size of `BuildSpec details` on the BuildRun details page is larger than the other titles on the page
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to Shipwright BuildRun details page
Actual results:
The font size of `BuildSpec details` on the BuildRun details page is larger than the other titles on the page
Expected results:
All titles should be the same size
Additional info:
Screenshots
https://github.com/user-attachments/assets/74853838-1fff-46d5-9ed6-5b605caebbf0
Description of problem:
Bare Metal UPI cluster nodes lose communication with other nodes, and this affects pod communication on those nodes as well. The issue can be fixed with an OVN rebuild of the affected nodes' DBs, but eventually the nodes will degrade and lose communication again. Note that despite an OVN rebuild fixing the issue temporarily, host networking is set to True, so it is using the kernel routing table. Update: also observed on vSphere with routingViaHost: false, ipForwarding: global configuration.
Version-Release number of selected component (if applicable):
4.14.7, 4.14.30
How reproducible:
Can't reproduce locally but reproducible and repeatedly occurring in customer environment
Steps to Reproduce:
Identify a host node whose pods can't be reached from other hosts in default namespaces (tested via openshift-dns). Observe that curls to that peer pod consistently time out. TCP dumps to the target pod show that packets are arriving and are acknowledged, but never route back to the client pod successfully (SYN/ACK seen at the pod network layer, not at geneve; so dropped before hitting the geneve tunnel).
Actual results:
Nodes will repeatedly degrade and lose communication despite fixing the issue with an OVN DB rebuild (a DB rebuild only provides hours/days of respite, not a permanent resolution).
Expected results:
Nodes should not be losing communication and even if they did it should not happen repeatedly
Additional info:
What's been tried so far
========================
- Multiple OVN rebuilds on different nodes (works, but the node will eventually hit the issue again)
- Flushing the conntrack (doesn't work)
- Restarting nodes (doesn't work)

Data gathered
=============
- Tcpdump from all interfaces for dns-pods going to port 7777 (to segregate traffic)
- ovnkube-trace
- SOSreports of two nodes having communication issues before an OVN rebuild
- SOSreports of two nodes having communication issues after an OVN rebuild
- OVS trace dumps of br-int and br-ex

More data in nested comments below.
linking KCS: https://access.redhat.com/solutions/7091399
Description of problem:
Need to bump k8s to v0.31.1 in 4.18
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The cluster-reader role is not able to view controlplanemachineset resources
Version-Release number of selected component (if applicable):
4.19.0-0.ci-2024-12-15-181719
How reproducible:
Always
Steps to Reproduce:
1. Add the cluster-reader role to a common user: $ oc adm policy add-cluster-role-to-user cluster-reader testuser-48 --as system:admin
2. Log in to the cluster as the common user: $ oc login -u testuser-48 Authentication required for https://api.zhsungcp58.qe.gcp.devcluster.openshift.com:6443 (openshift) Username: testuser-48 Password: Login successful.
3. Check whether cluster-reader can view controlplanemachineset resources.
Actual results:
cluster-reader couldn't view controlplanemachineset resources $ oc get controlplanemachineset Error from server (Forbidden): controlplanemachinesets.machine.openshift.io is forbidden: User "testuser-48" cannot list resource "controlplanemachinesets" in API group "machine.openshift.io" in the namespace "openshift-machine-api"
Expected results:
cluster-reader can view controlplanemachineset resources
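A sketch of one possible fix (or manual workaround), assuming cluster-reader aggregates ClusterRoles carrying the label shown below; the role name and the aggregation label are assumptions:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: controlplanemachineset-reader                                  # assumed name
  labels:
    rbac.authorization.k8s.io/aggregate-to-cluster-reader: "true"      # assumed aggregation label
rules:
- apiGroups: ["machine.openshift.io"]
  resources: ["controlplanemachinesets"]
  verbs: ["get", "list", "watch"]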
Additional info:
Description of problem:
When hosted clusters are delayed in deleting, their dedicated request serving nodes may have already been removed, but the configmap indicating that the node pair label is in use remains. Placeholder pods are currently getting scheduled on new nodes that have these pair labels. When the scheduler tries to use these new nodes, it says it can't because there is a configmap associating the pair label with a cluster that is in the process of deleting.
Version-Release number of selected component (if applicable):
4.19
How reproducible:
sometimes
Steps to Reproduce:
1. In a size tagging dedicated request serving architecture, create hosted cluster(s). 2. Place an arbitrary finalizer on the hosted cluster(s) so it cannot be deleted. 3. Delete the hosted clusters 4. Look at placeholder pods in hypershift-request-serving-node-placeholders
Actual results:
some placeholder pods are scheduled on nodes that correspond to fleet manager pairs taken up by the deleting clusters
Expected results:
no placeholder pods are scheduled on nodes that correspond to hosted clusters.
Additional info:
OSD-26887: managed services taints several nodes as infrastructure. This taint appears to be applied after some of the platform DS are scheduled there, causing this alert to fire. Managed services rebalances the DS after the taint is added, and the alert clears, but origin fails this test. Allowing this alert to fire while we investigate why the taint is not added at node birth.
As a ARO HCP user, I would like MachineIdentityID to be removed from the Azure HyperShift API since this field is not needed for ARO HCP.
Description of problem:
No zone for master machines
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-08-15-153405
How reproducible:
Always
Steps to Reproduce:
1. Install an azure cluster 2. Run "oc get machine" 3.
Actual results:
No zone info for master machine $ oc get machine NAME PHASE TYPE REGION ZONE AGE yingwang-0816-tvqdc-master-0 Running Standard_D8s_v3 eastus 104m yingwang-0816-tvqdc-master-1 Running Standard_D8s_v3 eastus 104m yingwang-0816-tvqdc-master-2 Running Standard_D8s_v3 eastus 104m yingwang-0816-tvqdc-worker-eastus1-54ckq Running Standard_D4s_v3 eastus 1 96m yingwang-0816-tvqdc-worker-eastus2-dwr2j Running Standard_D4s_v3 eastus 2 96m yingwang-0816-tvqdc-worker-eastus3-7wchl Running Standard_D4s_v3 eastus 3 96m $ oc get machine --show-labels NAME PHASE TYPE REGION ZONE AGE LABELS yingwang-0816-tvqdc-master-0 Running Standard_D8s_v3 eastus 104m machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=master,machine.openshift.io/cluster-api-machine-type=master,machine.openshift.io/instance-type=Standard_D8s_v3,machine.openshift.io/region=eastus yingwang-0816-tvqdc-master-1 Running Standard_D8s_v3 eastus 104m machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=master,machine.openshift.io/cluster-api-machine-type=master,machine.openshift.io/instance-type=Standard_D8s_v3,machine.openshift.io/region=eastus yingwang-0816-tvqdc-master-2 Running Standard_D8s_v3 eastus 104m machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=master,machine.openshift.io/cluster-api-machine-type=master,machine.openshift.io/instance-type=Standard_D8s_v3,machine.openshift.io/region=eastus yingwang-0816-tvqdc-worker-eastus1-54ckq Running Standard_D4s_v3 eastus 1 96m machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=yingwang-0816-tvqdc-worker-eastus1,machine.openshift.io/instance-type=Standard_D4s_v3,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=eastus,machine.openshift.io/zone=1 yingwang-0816-tvqdc-worker-eastus2-dwr2j Running Standard_D4s_v3 eastus 2 96m machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=yingwang-0816-tvqdc-worker-eastus2,machine.openshift.io/instance-type=Standard_D4s_v3,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=eastus,machine.openshift.io/zone=2 yingwang-0816-tvqdc-worker-eastus3-7wchl Running Standard_D4s_v3 eastus 3 96m machine.openshift.io/cluster-api-cluster=yingwang-0816-tvqdc,machine.openshift.io/cluster-api-machine-role=worker,machine.openshift.io/cluster-api-machine-type=worker,machine.openshift.io/cluster-api-machineset=yingwang-0816-tvqdc-worker-eastus3,machine.openshift.io/instance-type=Standard_D4s_v3,machine.openshift.io/interruptible-instance=,machine.openshift.io/region=eastus,machine.openshift.io/zone=3
Expected results:
Zone info can be shown when run "oc get machine"
Additional info:
Description of problem:
As more systems have been added to Power VS, the assumption that every zone in a region has the same set of systypes has been broken. To properly represent what system types are available, the powervs_regions struct needed to be altered and parts of the installer referencing it needed to be updated.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Try to deploy with s1022 in dal10 2. SysType not available, even though it is a valid option in Power VS. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When creating a new Shipwright Build via the form (ODC-7595), the form shows all the params available on the build strategy.
Expected results:
The parameters should be hidden behind an "Advanced" button.
Additional info:
Except for the following parameters, the rest for each build strategy should be hidden and moved to the advanced section.
Description of problem:
Tracking per-operator fixes for the following related issues in the static pod node, installer, and revision controllers: https://issues.redhat.com/browse/OCPBUGS-45924 https://issues.redhat.com/browse/OCPBUGS-46372 https://issues.redhat.com/browse/OCPBUGS-48276
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
node-joiner --pxe does not rename pxe artifacts
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. node-joiner --pxe
Actual results:
agent*.* artifacts are generated in the working dir
Expected results:
In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img * node.x86_64-rootfs.img * node.x86_64-vmlinuz * node.x86_64.ipxe (if required)
Additional info:
it is meant to select projects that don't even exist yet - it should work with label selectors,
Description of problem:
When installing ROSA/OSD operators, OLM "locks up" the Subscription object with "ConstraintsNotSatisfiable" 3-15% of the time, depending on the environment.
Version-Release number of selected component (if applicable):
Recently tested on: - OSD 4.17.5 - 4.18 nightly (from cluster bot) Though prevalence across the ROSA fleet suggests this is not a new issue.
How reproducible:
Very. This is very prevalent across the OSD/ROSA Classic cluster fleet. Any new OSD/ROSA Classic cluster has a good chance of at least one of its ~12 OSD-specific operators being affected on install time.
Steps to Reproduce:
0. Set up a cluster using cluster bot.
1. Label at least one worker node with node-role.kubernetes.io=infra
2. Install the must-gather operator with "oc apply -f mgo.yaml" (file attached)
3. Wait for the pods to come up.
4. Start this loop: for i in `seq -w 999`; do echo -ne ">>>>>>> $i\t\t"; date; oc get -n openshift-must-gather-operator subscription/must-gather-operator -o yaml >mgo-sub-$i.yaml; oc delete -f mgo.yaml; oc apply -f mgo.yaml; sleep 100; done
5. Let it run for a few hours.
Actual results:
Run "grep ConstraintsNotSatisfiable *.yaml" You should find a few of the Subscriptions ended up in a "locked" state from which there is no upgrade without manual intervention: - message: 'constraints not satisfiable: @existing/openshift-must-gather-operator//must-gather-operator.v4.17.281-gd5416c9 and must-gather-operator-registry/openshift-must-gather-operator/stable/must-gather-operator.v4.17.281-gd5416c9 originate from package must-gather-operator, subscription must-gather-operator requires must-gather-operator-registry/openshift-must-gather-operator/stable/must-gather-operator.v4.17.281-gd5416c9, subscription must-gather-operator exists, clusterserviceversion must-gather-operator.v4.17.281-gd5416c9 exists and is not referenced by a subscription' reason: ConstraintsNotSatisfiable status: "True" type: ResolutionFailed
Expected results:
Each installation attempt should've worked fine.
Additional info:
mgo.yaml:
apiVersion: v1 kind: Namespace metadata: name: openshift-must-gather-operator annotations: package-operator.run/collision-protection: IfNoController package-operator.run/phase: namespaces openshift.io/node-selector: "" labels: openshift.io/cluster-logging: "true" openshift.io/cluster-monitoring: 'true' --- apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: name: must-gather-operator-registry namespace: openshift-must-gather-operator annotations: package-operator.run/collision-protection: IfNoController package-operator.run/phase: must-gather-operator labels: opsrc-datastore: "true" opsrc-provider: redhat spec: image: quay.io/app-sre/must-gather-operator-registry@sha256:0a0610e37a016fb4eed1b000308d840795838c2306f305a151c64cf3b4fd6bb4 displayName: must-gather-operator icon: base64data: '' mediatype: '' publisher: Red Hat sourceType: grpc grpcPodConfig: securityContextConfig: restricted nodeSelector: node-role.kubernetes.io: infra tolerations: - effect: NoSchedule key: node-role.kubernetes.io/infra operator: Exists --- apiVersion: operators.coreos.com/v1alpha1 kind: Subscription metadata: name: must-gather-operator namespace: openshift-must-gather-operator annotations: package-operator.run/collision-protection: IfNoController package-operator.run/phase: must-gather-operator spec: channel: stable name: must-gather-operator source: must-gather-operator-registry sourceNamespace: openshift-must-gather-operator --- apiVersion: operators.coreos.com/v1alpha2 kind: OperatorGroup metadata: name: must-gather-operator namespace: openshift-must-gather-operator annotations: package-operator.run/collision-protection: IfNoController package-operator.run/phase: must-gather-operator olm.operatorframework.io/exclude-global-namespace-resolution: 'true' spec: targetNamespaces: - openshift-must-gather-operator
Description of problem:
HyperShift CEL validation blocks ARM64 NodePool creation for non-AWS/Azure platforms. As a result, a bare metal worker node can't be added to the hosted cluster. This was discussed on the #project-hypershift Slack channel.
Version-Release number of selected component (if applicable):
MultiClusterEngine v2.7.2 HyperShift Operator image: registry.redhat.io/multicluster-engine/hypershift-rhel9-operator@sha256:56bd0210fa2a6b9494697dc7e2322952cd3d1500abc9f1f0bbf49964005a7c3a
How reproducible:
Always
Steps to Reproduce:
1. Create a HyperShift HostedCluster on a non-AWS/non-Azure platform 2. Try to create a NodePool with ARM64 architecture specification
Actual results:
- CEL validation blocks creating NodePool with arch: arm64 on non-AWS/Azure platforms - Receive error: "The NodePool is invalid: spec: Invalid value: "object": Setting Arch to arm64 is only supported for AWS and Azure" - Additional validation in NodePool spec also blocks arm64 architecture
Expected results:
- Allow ARM64 architecture specification for NodePools on BareMetal platform - Remove or update the CEL validation to support this use case
Additional info:
NodePool YAML:
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: nodepool-doca5-1
  namespace: doca5
spec:
  arch: arm64
  clusterName: doca5
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: InPlace
  platform:
    agent:
      agentLabelSelector: {}
    type: Agent
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.21-multi
  replicas: 1
Description of problem:
When instance types are not specified in the machine pool, the installer checks which instance types (from a list) are available in a given az. If the ec2:DescribeInstanceType permission is not present, the check will fail gracefully and default to using the m6i instance type. This instance type is not available in all regions (e.g. ap-southeast-4 and eu-south-2), so those installs will fail. OCPBUGS-45218 describes a similar issue with edge nodes. ec2:DescribeInstanceTypeOfferings is not a controversial permission and should be required by default for all installs to avoid this type of issue.
Version-Release number of selected component (if applicable):
Affects all versions, but we will just fix in main (4.19)
How reproducible:
Always
Steps to Reproduce:
See OCPBUGS-45218 for one example. Another example (unverified) 1. Use permissions without ec2:DescribeInstanceTypeOfferings 2. Install config: set region to eu-south-2 or ap-southeast-4. Do not set instance types 3. Installer should default to m6i instance type (can be confirmed from machine manifests). 4. Install will fail as m6i instances are not available in those regions: https://docs.aws.amazon.com/ec2/latest/instancetypes/ec2-instance-regions.html
Actual results:
Install fails due to unavailable m6i instance
Expected results:
Installer should select different instance type, m5
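Until the default changes, a workaround sketch is to set the instance type explicitly in the install-config; the m5.xlarge value below is an assumption, chosen because m5 is offered in the affected regions:

controlPlane:
  name: master
  platform:
    aws:
      type: m5.xlarge
compute:
- name: worker
  platform:
    aws:
      type: m5.xlarge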
Additional info:
This issue tracks updating k8s and the related OpenShift APIs to a recent version, to keep in line with other MAPI providers.
We should expand upon our current pre-commit hooks:
This will help prevent errors before code makes it to GitHub and CI.
Description of problem:
After adding any directive to the ConsolePlugin CR, a hard refresh is required for the changes to actually take effect, but we are not getting a refresh popover for this.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Enable feature gate (CSP feature is behind the FG in 4.18). 2. Add "DefaultSrc" directive to any ConsolePlugin CR. 3.
Actual results:
No refresh popover is displayed; we need to manually refresh for the changes to take effect.
Expected results:
No manual refresh. An automatic popover should be rendered.
Additional info:
ex: https://github.com/openshift/enhancements/blob/master/enhancements/console/dynamic-plugins.md#example
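For illustration, a sketch of step 2 based on my reading of the linked enhancement; the plugin name and source value are assumptions, and other required ConsolePlugin fields are omitted:

apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: my-plugin                      # assumed plugin name
spec:
  contentSecurityPolicy:
  - directive: DefaultSrc
    values:
    - https://assets.example.com       # assumed allowed source
  # ... displayName, backend, etc. omitted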
Description of problem:
Customers need to be able to configure the DNS nameservers for the OpenStack subnet created by Hypershift (through Cluster API Provider for OpenStack). Without that, the default subnet wouldn't have DNS nameservers and resolution can fail in some environments.
Version-Release number of selected component (if applicable):
4.19, 4.18
How reproducible:
In default RHOSO 18 we don't have DNS forwarded to the DHCP agent so we need to set the DNS nameservers in every subnet that is created.
Currently the location of the cache directory can be set via the environment variable `OC_MIRROR_CACHE`. The only problem is that the env var is not easily discoverable by users. It would be better to have a command-line option (e.g. `--cache-dir <dir>`) which is discoverable via `--help`.
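For comparison, today's env-var usage next to the proposed flag (the flag is hypothetical, and the destination registry is an example modeled on the invocations elsewhere in this log):

$ OC_MIRROR_CACHE=/var/cache/oc-mirror oc-mirror -c config.yaml docker://registry.example.com/mirror --v2 --workspace file://out1
$ oc-mirror --cache-dir /var/cache/oc-mirror -c config.yaml docker://registry.example.com/mirror --v2 --workspace file://out1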
CBO-installed Ironic unconditionally has TLS, even though we don't do proper host validation just yet (see bug OCPBUGS-20412). Ironic in the installer does not use TLS (mostly for historical reasons). Now that OCPBUGS-36283 added a TLS certificate for virtual media, we can use the same for Ironic API. At least initially, it will involve disabling host validation for IPA.
Please review the following PR: https://github.com/openshift/coredns/pull/130
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The e2e-ibmcloud-operator presubmit job for the cluster-ingress-operator repo introduced in https://github.com/openshift/release/pull/56785 always fails due to DNS. Note that this job has `always_run: false` and `optional: true` so it requires calling /test e2e-ibmcloud-operator on a PR to make it appear. These failures are not blocking any PRs from merging. Example failure.
The issue is that IBM Cloud has DNS propagation issues, similar to the AWS DNS issues (OCPBUGS-14966), except:
The PR https://github.com/openshift/cluster-ingress-operator/pull/1164 was an attempt at fixing the issue by both resolving the DNS name inside of the cluster and allowing for a couple minute "warmup" interval to avoid negative caching. I found (via https://github.com/openshift/cluster-ingress-operator/pull/1132) that the SOA TTL is ~30 minutes, which if you trigger negative caching, you will have to wait 30 minutes for the IBM DNS Resolver to refresh the DNS name.
However, I found that if you wait ~7 minutes for the DNS record to propagate and don't query the DNS name, it will work after that 7 minute wait (I call it the "warmup" period).
The tests affected are any tests that use a DNS name (wildcard or load balancer record):
The two paths I can think of are:
Version-Release number of selected component (if applicable):
4.19
How reproducible:
90-100%
Steps to Reproduce:
1. Run /test e2e-ibmcloud-operator
Actual results:
Tests are flakey
Expected results:
Tests should work reliably
Additional info:
Description of problem:
The create button on the MultiNetworkPolicies and NetworkPolicies list pages is in the wrong position; it should be at the top right.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
When updating a 4.13 cluster to 4.14, the new-in-4.14 ImageRegistry capability will always be enabled, because capabilities cannot be uninstalled.
4.14 oc should learn about this, so it will appropriately extract registry CredentialsRequests when connecting to 4.13 clusters for 4.14 manifests. 4.15 oc will get OTA-1010 to handle this kind of issue automatically, but there's no problem with getting an ImageRegistry hack into 4.15 engineering candidates in the meantime.
100%
1. Connect your oc to a 4.13 cluster.
2. Extract manifests for a 4.14 release.
3. Check for ImageRegistry CredentialsRequests.
$ oc adm upgrade | head -n1
Cluster version is 4.13.12
$ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64
$ grep -r ImageRegistry credentials-requests
...no hits...
$ oc adm upgrade | head -n1
Cluster version is 4.13.12
$ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64
$ grep -r ImageRegistry credentials-requests
credentials-requests/0000_50_cluster-image-registry-operator_01-registry-credentials-request.yaml: capability.openshift.io/name: ImageRegistry
We already do this for MachineAPI. The ImageRegistry capability landed later, and this is us catching the oc-extract hack up with that change.
Security Tracking Issue
Do not make this issue public.
Flaw:
Non-linear parsing of case-insensitive content in golang.org/x/net/html
https://bugzilla.redhat.com/show_bug.cgi?id=2333122
An attacker can craft an input to the Parse functions that would be processed non-linearly with respect to its length, resulting in extremely slow parsing. This could cause a denial of service.
Description of problem:
This function https://github.com/openshift/hypershift/blame/c34a1f6cef0cb41c8a1f83acd4ddf10a4b9e8532/support/util/util.go#L391 does not check the IDMS/ICSP overrides during reconciliation, so it breaks disconnected deployments.
Description of problem:
Backport https://github.com/prometheus/prometheus/pull/15723
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem: I'm trying to import this function https://github.com/pierDipi/node-func-logger using the import function UI flow
Version-Release number of selected component (if applicable): 4.14 OCP and Serverless (current development branch)
How reproducible: Always
Steps to Reproduce:
1. Import this function https://github.com/pierDipi/node-func-logger using the import function UI flow
2. Click create
Actual results: An error occurred Cannot read properties of undefined (reading 'filter')
UI error image: https://drive.google.com/file/d/1GrhX2LUNSzvVuhUmeFYEeZwZ1X58LBAB/view?usp=drive_link
Expected results: No errors
Additional info: As noted above, I'm using the Serverless development branch. I'm not sure if it's reproducible with a released Serverless version, but either way we would need to fix it.
Description of problem:
See https://github.com/kubernetes/kubernetes/issues/127352
Version-Release number of selected component (if applicable):
See https://github.com/kubernetes/kubernetes/issues/127352
How reproducible:
See https://github.com/kubernetes/kubernetes/issues/127352
Steps to Reproduce:
See https://github.com/kubernetes/kubernetes/issues/127352
Actual results:
See https://github.com/kubernetes/kubernetes/issues/127352
Expected results:
See https://github.com/kubernetes/kubernetes/issues/127352
Additional info:
See https://github.com/kubernetes/kubernetes/issues/127352
Description of problem:
Missing metrics - example: cluster_autoscaler_failed_scale_ups_total
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
# curl the autoscaler's metrics endpoint:
$ oc exec deployment/cluster-autoscaler-default -- curl -s http://localhost:8085/metrics | grep cluster_autoscaler_failed_scale_ups_total
Actual results:
The metric does not return a value until an event has happened.
Expected results:
The metric counter should be initialized at startup, providing a zero value.
Additional info:
I have been through the file https://raw.githubusercontent.com/openshift/kubernetes-autoscaler/master/cluster-autoscaler/metrics/metrics.go and checked off the metrics that do not appear when scraping the metrics endpoint straight after deployment. The following metrics are in metrics.go but are missing from the scrape:
~~~
node_group_min_count
node_group_max_count
pending_node_deletions
errors_total
scaled_up_gpu_nodes_total
failed_scale_ups_total
failed_gpu_scale_ups_total
scaled_down_nodes_total
scaled_down_gpu_nodes_total
unremovable_nodes_count
skipped_scale_events_count
~~~
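A small helper to check which of the listed counters are actually exposed right after deployment, reusing the exec/curl pattern from the reproduction step above (the deployment name is the default one from this report):
~~~
for m in node_group_min_count node_group_max_count pending_node_deletions \
         errors_total scaled_up_gpu_nodes_total failed_scale_ups_total \
         failed_gpu_scale_ups_total scaled_down_nodes_total \
         scaled_down_gpu_nodes_total unremovable_nodes_count \
         skipped_scale_events_count; do
  if oc exec deployment/cluster-autoscaler-default -- \
       curl -s http://localhost:8085/metrics | grep -q "^cluster_autoscaler_${m}"; then
    echo "present: cluster_autoscaler_${m}"
  else
    echo "missing: cluster_autoscaler_${m}"
  fi
done
~~~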
Description of problem:
It would be good to fail the build if the rt rpm does not match the kernel. Since the 9.4+ based releases, rt comes from the same package as the kernel. With this change, ART's consistency check lost an ability. This bug is to "shift left" that test and have the build fail at build time.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
The following test is failing more than expected:
Undiagnosed panic detected in pod
See the sippy test details for additional context.
Observed in 4.18-e2e-telco5g-cnftests/1863644602574049280 and 4.18-e2e-telco5g/1863677472923455488
Undiagnosed panic detected in pod { pods/openshift-kube-apiserver_kube-apiserver-cnfdu3-master-1_kube-apiserver.log.gz:E1202 22:12:02.806740 12 audit.go:84] "Observed a panic" panic="context deadline exceeded" panicGoValue="context.deadlineExceededError{}" stacktrace=<}
Undiagnosed panic detected in pod { pods/openshift-kube-apiserver_kube-apiserver-cnfdu11-master-2_kube-apiserver.log.gz:E1202 22:11:42.359004 14 timeout.go:121] "Observed a panic" panic=<}
Description of problem:
The current `oc adm inspect --all-namespaces` command line results in something like this:
oc adm inspect --dest-dir must-gather --rotated-pod-logs csistoragecapacities ns/assisted-installer leases --all-namespaces
This is wrong for two reasons:
- `ns/assisted-installer` is there even though a namespace is not itself namespaced, so it should go into the `named_resources` variable (this happens only in 4.16+).
- The rest of the items in the `all_ns_resources` variable are group resources, but they are not separated by `,` as we do with `group_resources` (this happens on 4.14+).
As a result, we never collect what is intended with this command. See the sketch below for what a corrected invocation would roughly look like.
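For illustration only, the corrected invocations would look roughly like this, with the namespace treated as a named resource and the cluster-wide group resources joined with commas (whether this ends up as one call or two is an implementation detail):
~~~
oc adm inspect --dest-dir must-gather --rotated-pod-logs ns/assisted-installer
oc adm inspect --dest-dir must-gather --rotated-pod-logs csistoragecapacities,leases --all-namespaces
~~~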
Version-Release number of selected component (if applicable):
Any 4.14+ version
How reproducible:
Always
Steps to Reproduce:
1. Get a must-gather 2. 3.
Actual results:
Data from "oc adm inspect --all-namespaces" missing
Expected results:
No data missing
Additional info:
Description of problem:
Checked on 4.18.0-0.nightly-2024-12-07-130635/4.19.0-0.nightly-2024-12-07-115816: in the admin console, go to an alert details page and "No datapoints found." is shown on the alert details graph. See the picture for the CannotRetrieveUpdates alert: https://drive.google.com/file/d/1RJCxUZg7Z8uQaekt39ux1jQH_kW9KYXd/view?usp=drive_link
The issue exists in 4.18+; there is no such issue with 4.17.
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
always on 4.18+
Steps to Reproduce:
1. see the description
Actual results:
"No datapoints found." on alert details graph
Expected results:
show correct graph
Description of problem:
When a user starts a Pipeline, the Pipeline visualization first shows the Tasks as Failed and only afterwards shows them in the Running state.
Version-Release number of selected component (if applicable):
4.17.z
How reproducible:
Not always but more frequently
Steps to Reproduce:
1. Create a Pipeline and start it 2. Observe Pipeline visualization in details page
Actual results:
The Pipeline visualization shows all tasks as Failed and only after that goes to the Running state.
Expected results:
The Pipeline visualization should not show all tasks as Failed before it goes to the Running state.
Additional info:
Description of problem:
I was going through OperatorHub and searching for "apache zookeeper operator" and "stackable common operator." When I clicked on those operators, I got an `Oh no, something went wrong` error message.
Version-Release number of selected component (if applicable):
4.17.0
How reproducible:
100%
Steps to Reproduce:
1. Go to the web console and then go to operator hub. 2. Search for "apache zookeeper operator" and "stackable common operator" 3. And then click on that operator to install, and you will see that error message.
Actual results:
Getting Oh no something went wrong message
Expected results:
Should show the next page to install the operator
Additional info:
As a developer looking to contribute to OCP BuildConfig I want contribution guidelines that make it easy for me to build and test all the components.
Much of the contributor documentation for openshift/builder is either extremely out of date or buggy. This hinders the ability for newcomers to contribute.
Description of problem:
1. Clients cannot connect to the kube-apiserver via the kubernetes service, because the kubernetes service IP is not in the certificate SANs.
2. The kube-apiserver-operator generates the apiserver certs and inserts the kubernetes service IP taken from the network CR status.ServiceNetwork.
3. When the temporary control plane is down and the network CR is not ready yet, clients cannot connect to the apiserver again.
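Some standard checks that can help confirm this state on an affected cluster; the apiserver host below is a placeholder:
~~~
# Service-network CIDR the operator reads from the network CR
oc get network cluster -o jsonpath='{.status.serviceNetwork}{"\n"}'

# ClusterIP of the kubernetes service that clients use
oc get svc kubernetes -n default -o jsonpath='{.spec.clusterIP}{"\n"}'

# SANs actually present in the serving certificate
echo | openssl s_client -connect <apiserver-host>:6443 2>/dev/null | \
  openssl x509 -noout -ext subjectAltName
~~~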
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. I have only hit this under very rare conditions, especially when machine performance is poor 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The dialog for creating the primary namespaced UDN does not need a "name" field. Users can only use one primary UDN per namespace, so we can make the flow smoother by generating (or hardcoding) the name in the UI. This should be static (not random). A side effect is that it would prevent users from creating multiple primary UDNs by mistake.
Version-Release number of selected component (if applicable):
rc.4
How reproducible:
Always
Steps to Reproduce:
1. Go to the create UDN dialog 2. 3.
Actual results:
It asks for a name
Expected results:
It should not ask for a name and use "primary-udn" as the hardcoded value, OR it should still give the option to set it but pre-fill "primary-udn" as the default in the textbox.
Additional info:
As noted in https://github.com/openshift/api/pull/1963#discussion_r1910598226, we are currently ignoring tags set on a port in a MAPO Machine or MachineSet. This appears to be a mistake that we should correct.
All versions of MAPO that use CAPO under the hood are affected.
n/a
n/a
n/a
n/a
See https://github.com/openshift/api/pull/1963#discussion_r1910598226
Description of problem:
When a primary UDN or CUDN is created, it creates what is known as a secondary zone network controller that handles configuring OVN and getting the network created so that pods can be attached. The time it takes for this to happen can be up to a minute if namespaces are being deleted on the cluster while the UDN controller is starting.
This is because if the namespace is deleted, GetActiveNetworkForNamespace will fail, and the pod will be retried for up to a minute.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a bunch of namespaces with pods
2. Create a primary CUDN or UDN
3. Very quickly start deleting namespaces that have pods in them while OVNK is starting up its network controller
Actual results:
Logs show time to start the controller taking a minute:
I0131 16:15:30.383221 5583 secondary_layer2_network_controller.go:365] Starting controller for secondary network e2e-network-segmentation-e2e-8086.blue
I0131 16:16:30.390813 5583 secondary_layer2_network_controller.go:369] Starting controller for secondary network e2e-network-segmentation-e2e-8086.blue took 1m0.007579788s
Expected results:
Once started the controller should only take a few seconds (depending on cluster size and load) to finish starting.
Description of problem:
[Azure disk CSI driver] On ARO HCP, volume provisioning does not succeed.
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2024-12-13-083421
How reproducible:
Always
Steps to Reproduce:
1. Install an AKS cluster on Azure.
2. Install the hypershift operator on the AKS cluster.
3. Use the hypershift CLI to create a hosted cluster with the Client Certificate mode.
4. Check that the Azure disk/file CSI drivers work well on the hosted cluster.
Actual results:
In step 4: the the azure disk csi dirver provision volume failed on hosted cluster # azure disk pvc provision failed $ oc describe pvc mypvc ... Normal WaitForFirstConsumer 74m persistentvolume-controller waiting for first consumer to be created before binding Normal Provisioning 74m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073 External provisioner is provisioning volume for claim "default/mypvc" Warning ProvisioningFailed 74m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073 failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF Warning ProvisioningFailed 71m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8 failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF Normal Provisioning 71m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8 External provisioner is provisioning volume for claim "default/mypvc" ... $ oc logs azure-disk-csi-driver-controller-74d944bbcb-7zz89 -c csi-driver W1216 08:07:04.282922 1 main.go:89] nodeid is empty I1216 08:07:04.290689 1 main.go:165] set up prometheus server on 127.0.0.1:8201 I1216 08:07:04.291073 1 azuredisk.go:213] DRIVER INFORMATION: ------------------- Build Date: "2024-12-13T02:45:35Z" Compiler: gc Driver Name: disk.csi.azure.com Driver Version: v1.29.11 Git Commit: 4d21ae15d668d802ed5a35068b724f2e12f47d5c Go Version: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime Platform: linux/amd64 Topology Key: topology.disk.csi.azure.com/zone I1216 08:09:36.814776 1 utils.go:77] GRPC call: /csi.v1.Controller/CreateVolume I1216 08:09:36.814803 1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":""}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":""}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","parameters":{"csi.storage.k8s.io/pv/name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","csi.storage.k8s.io/pvc/name":"mypvc","csi.storage.k8s.io/pvc/namespace":"default","skuname":"Premium_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":7}}]} I1216 08:09:36.815338 1 controllerserver.go:208] begin to create azure disk(pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316) account type(Premium_LRS) rg(ci-op-zj9zc4gd-12c20-rg) location(centralus) size(1) diskZone() maxShares(0) panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x190c61d] goroutine 153 [running]: sigs.k8s.io/cloud-provider-azure/pkg/provider.(*ManagedDiskController).CreateManagedDisk(0x0, {0x2265cf0, 0xc0001285a0}, 0xc0003f2640) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_managedDiskController.go:127 +0x39d sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).CreateVolume(0xc000564540, {0x2265cf0, 0xc0001285a0}, 0xc000272460) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/controllerserver.go:297 +0x2c59 github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1({0x2265cf0?, 0xc0001285a0?}, {0x1e5a260?, 0xc000272460?}) 
/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6420 +0xcb sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x2265cf0, 0xc0001285a0}, {0x1e5a260, 0xc000272460}, 0xc00017cb80, 0xc00014ea68) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409 github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x1f3e440, 0xc000564540}, {0x2265cf0, 0xc0001285a0}, 0xc00029a700, 0x2084458) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6422 +0x143 google.golang.org/grpc.(*Server).processUnaryRPC(0xc00059cc00, {0x2265cf0, 0xc000128510}, {0x2270d60, 0xc0004f5980}, 0xc000308480, 0xc000226a20, 0x31c8f80, 0x0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1379 +0xdf8 google.golang.org/grpc.(*Server).handleStream(0xc00059cc00, {0x2270d60, 0xc0004f5980}, 0xc000308480) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1790 +0xe8b google.golang.org/grpc.(*Server).serveStreams.func2.1() /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1029 +0x7f created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 16 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1040 +0x125
Expected results:
In step 4, the Azure disk CSI driver should provision volumes successfully on the hosted cluster.
Additional info:
Description of problem:
The data in the table columns overlaps on the Helm rollback page when the screen width is reduced.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Open a quick start window 2. Create a helm chart and upgrade it 3. Now select the rollback option from the action menu or kebab menu of the helm chart
Actual results:
The Rollback page has a messed-up UI
https://drive.google.com/file/d/1YXz80YsR5pkRG4dQmqFxpTkzgWQnQWLe/view?usp=sharing
Expected results:
The UI should be similar to the build config page with quick start open
https://drive.google.com/file/d/1UYxdRdV2kGC1m-MjBifTNdsh8gtpYnaU/view?usp=sharing
Additional info:
Description of the problem:
ImageClusterInstall is timing out because the ibi-monitor-cm configmap is missing. This seems to be a result of the installation-configuration.service failing on the spoke cluster when attempting to unmarshal the image-digest-sources.json file containing IDMS information for the spoke.
How reproducible:
100%
Steps to reproduce:
1. Configure a spoke with IBIO using IDMS
Additional information:
Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Post pivot operation has started" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="waiting for block device with label cluster-config or for configuration folder /opt/openshift/cluster-configuration" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Reading seed image info" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Reading seed reconfiguration info" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="/opt/openshift/setSSHKey.done already exists, skipping" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="/opt/openshift/pull-secret.done already exists, skipping" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Copying nmconnection files if they were provided" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="/opt/openshift/apply-static-network.done already exists, skipping" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Running systemctl restart [NetworkManager.service]" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Executing systemctl with args [restart NetworkManager.service]" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Setting new hostname target-0-0" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Executing hostnamectl with args [set-hostname target-0-0]" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Writing machine network cidr 192.168.126.0 into /etc/default/nodeip-configuration" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Setting new dnsmasq and forcedns dispatcher script configuration" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Running systemctl restart [dnsmasq.service]" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Executing systemctl with args [restart dnsmasq.service]" Jan 22 10:32:05 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:05" level=info msg="Executing bash with args [-c update-ca-trust]" Jan 22 10:32:07 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:07" level=info msg="/opt/openshift/recert.done already exists, skipping" Jan 22 10:32:07 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:07" level=info msg="No server ssh keys were provided, fresh keys already regenerated by recert, skipping" Jan 22 10:32:07 target-0-0 lca-cli[1216989]: 2025-01-22T10:32:07Z INFO post-pivot-dynamic-client Setting up retry middleware Jan 22 10:32:07 target-0-0 lca-cli[1216989]: 2025-01-22T10:32:07Z INFO post-pivot-dynamic-client Successfully created dynamic client Jan 22 10:32:07 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:07" level=info msg="Running systemctl enable [kubelet --now]" Jan 22 10:32:07 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:07" level=info msg="Executing systemctl with args [enable kubelet --now]" Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="Start waiting for api" Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="waiting for api" Jan 22 10:32:08 target-0-0 lca-cli[1216989]: 
time="2025-01-22 10:32:08" level=info msg="Deleting ImageContentSourcePolicy and ImageDigestMirrorSet if they exist" Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="Deleting default catalog sources" Jan 22 10:32:08 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:08" level=info msg="Applying manifests from /opt/openshift/cluster-configuration/manifests" Jan 22 10:32:09 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:09" level=info msg="manifest applied: /opt/openshift/cluster-configuration/manifests/99-master-ssh.json" Jan 22 10:32:09 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:09" level=info msg="manifest applied: /opt/openshift/cluster-configuration/manifests/99-worker-ssh.json" Jan 22 10:32:09 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:09" level=error msg="failed apply manifests: failed to decode manifest image-digest-sources.json: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal array into Go value of type map[string]interface {}" Jan 22 10:32:09 target-0-0 lca-cli[1216989]: time="2025-01-22 10:32:09" level=fatal msg="Post pivot operation failed" Jan 22 10:32:09 target-0-0 systemd[1]: installation-configuration.service: Main process exited, code=exited, status=1/FAILURE
This seems to stem from the way imageDigestSources are handled in the installer.
vs
The cluster-baremetal-operator sets up a number of watches for resources using Owns() that have no effect, because the Provisioning CR does not (and should not) own any resources of the given types. Other watches use EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace are different from those of the Provisioning CR.
The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.
The correct way to trigger a reconcile of the provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form.
See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.
The virtualization perspective wants to have the Observe section so that it can be a fully independent perspective.
The prerequisite functionality is added to the monitoring-plugin without showing regressions in the admin and developer perspectives.
Description of problem:
Destroying a private cluster doesn't delete the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-10-23-202329
How reproducible:
Always
Steps to Reproduce:
1. pre-create vpc network/subnets/router and a bastion host 2. "create install-config", and then insert the network settings under platform.gcp, along with "publish: Internal" (see [1]) 3. "create cluster" (use the above bastion host as http proxy) 4. "destroy cluster" (see [2])
Actual results:
Although "destroy cluster" completes successfully, the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator are not deleted (see [3]), which leads to deleting the vpc network/subnets failure.
Expected results:
The forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator should also be deleted during "destroy cluster".
Additional info:
FYI one history bug https://issues.redhat.com/browse/OCPBUGS-37683
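A rough way to spot the leftovers after the destroy, assuming the ingress-created resources can be matched by the cluster's infra ID (the filter expressions are assumptions; adjust them to however the resources are actually named or described):
~~~
INFRA_ID=<cluster-infra-id>
gcloud compute forwarding-rules list --filter="name~${INFRA_ID}"
gcloud compute backend-services list --filter="name~${INFRA_ID}"
gcloud compute health-checks list   --filter="name~${INFRA_ID}"
gcloud compute firewall-rules list  --filter="name~${INFRA_ID}"
~~~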
Description of problem:
AlertmanagerConfig with missing options causes Alertmanager to crash
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
A cluster administrator has enabled monitoring for user-defined projects. CMO config:
~~~
config.yaml: |
  enableUserWorkload: true
  prometheusK8s:
    retention: 7d
~~~
A cluster administrator has enabled alert routing for user-defined projects. UWM cm / CMO cm:
~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
~~~
verify existing config:
~~~
$ oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093
global:
  resolve_timeout: 5m
  http_config:
    follow_redirects: true
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
  telegram_api_url: https://api.telegram.org
route:
  receiver: Default
  group_by:
  - namespace
  continue: false
receivers:
- name: Default
templates: []
~~~
create alertmanager config without options "smtp_from:" and "smtp_smarthost":
~~~
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: example-namespace
spec:
  receivers:
  - emailConfigs:
    - to: some.username@example.com
    name: custom-rules1
  route:
    matchers:
    - name: alertname
    receiver: custom-rules1
    repeatInterval: 1m
~~~
check logs for alertmanager: the following error is seen.
~~~
ts=2023-09-05T12:07:33.449Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="no global SMTP smarthost set"
~~~
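As a point of comparison, a per-receiver email configuration that does not depend on the missing globals would look roughly like the sketch below; the smarthost/from values are placeholders, and the actual fix tracked here is to validate the CR rather than require this workaround:
~~~
cat <<'EOF' | oc apply -f -
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: example-namespace
spec:
  receivers:
  - name: custom-rules1
    emailConfigs:
    - to: some.username@example.com
      from: alerts@example.com          # placeholder
      smarthost: smtp.example.com:587   # placeholder; avoids "no global SMTP smarthost set"
  route:
    receiver: custom-rules1
    matchers:
    - name: alertname
    repeatInterval: 1m
EOF
~~~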
Actual results:
Alertmanager fails to restart.
Expected results:
The CRD should be pre-validated.
Additional info:
Reproducible with and without user workload Alertmanager.
golang.org/x/net is a CVE-prone dependency, and even if we are not actually exposed to some issues, carrying an old dep exposes us to version-based vulnerability scanners.
Description of problem:
The installation failed in the disconnected environment due to a failure to get controlPlaneOperatorImageLabels: failed to look up image metadata.
Version-Release number of selected component (if applicable):
4.19 4.18
How reproducible:
100%
Steps to Reproduce:
1.disconnected env 2.create agent hostedcluster
Actual results:
- lastTransitionTime: "2025-01-05T13:55:14Z" message: 'failed to get controlPlaneOperatorImageLabels: failed to look up image metadata for registry.ci.openshift.org/ocp/4.18-2025-01-04-031500@sha256:ba93b7791accfb38e76634edbc815d596ebf39c3d4683a001f8286b3e122ae69: failed to obtain root manifest for registry.ci.openshift.org/ocp/4.18-2025-01-04-031500@sha256:ba93b7791accfb38e76634edbc815d596ebf39c3d4683a001f8286b3e122ae69: manifest unknown: manifest unknown' observedGeneration: 2 reason: ReconciliationError status: "False" type: ReconciliationSucceeded
Expected results:
The cluster can become ready.
Additional info:
- mirrors:
  - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image
  source: registry.build01.ci.openshift.org/ci-op-p2mqdwjp/release
- mirrors:
  - virthost.ostest.test.metalkube.org:5000/localimages/local-release-image
  source: registry.ci.openshift.org/ocp/4.18-2025-01-04-031500
- mirrors:
  - virthost.ostest.test.metalkube.org:6001/openshifttest
  source: quay.io/openshifttest
- mirrors:
  - virthost.ostest.test.metalkube.org:6001/openshift-qe-optional-operators
  source: quay.io/openshift-qe-optional-operators
- mirrors:
  - virthost.ostest.test.metalkube.org:6001/olmqe
  source: quay.io/olmqe
- mirrors:
  - virthost.ostest.test.metalkube.org:6002
  source: registry.redhat.io
- mirrors:
  - virthost.ostest.test.metalkube.org:6002
  source: brew.registry.redhat.io
- mirrors:
  - virthost.ostest.test.metalkube.org:6002
  source: registry.stage.redhat.io
- mirrors:
  - virthost.ostest.test.metalkube.org:6002
  source: registry-proxy.engineering.redhat.com
Description of problem:
In an effort to ensure all HA components are not degraded by design during normal e2e tests or upgrades, we are collecting all operators that blip Degraded=True during any payload job run. This card captures the machine-config operator, which blips Degraded=True during some CI job runs.
Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1843561357304139776
Reasons associated with the blip: MachineConfigDaemonFailed or MachineConfigurationFailed
For now, we put an exception in the test, but it is expected that teams take action to fix those and remove the exceptions after the fix goes in. The exception is defined here: https://github.com/openshift/origin/blob/e5e76d7ca739b5699639dd4c500f6c076c697da6/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L109
See the linked issue for more explanation on the effort.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Our unit test runtime is slow. It runs anywhere from ~16-20 minutes locally, and on CI it can take at least 30 minutes. Investigate whether any changes can be made to improve the unit test runtime.
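A few standard Go test knobs that are worth checking as part of that investigation (illustrative; package paths and test names are placeholders):
~~~
# See how much of the wall time is real work vs. cache misses
go test ./... -count=1

# Raise package-level build/test parallelism (default is GOMAXPROCS)
go test ./... -p 8

# Profile a suspiciously slow package
go test ./pkg/<slow-package> -run '<SlowTest>' -cpuprofile cpu.out -v
go tool pprof cpu.out
~~~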
Description of problem: [UDN pre-merge testing] not able to create layer3 UDN from CRD on dualstack cluster
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. On a dualstack cluster, created a UDN namespace with label
2. Attempted to create a layer3 UDN from CRD
$ cat /tmp/e2e-test-udn-networking-udn-bhmwv-n3wn8mngresource.json
{
"kind": "List",
"apiVersion": "v1",
"metadata": {},
"items": [
{
"apiVersion": "k8s.ovn.org/v1",
"kind": "UserDefinedNetwork",
"metadata":
,
"spec": {
"layer3": {
"mtu": 1400,
"role": "Primary",
"subnets": [
,
]
},
"topology": "Layer3"
}
}
]
}
3. got the following error message:
The UserDefinedNetwork "udn-network-77827-ns1" is invalid: spec.layer3.subnets[1]: Invalid value: "object": HostSubnet must < 32 for ipv4 CIDR
subnets[1] is an IPv6 host subnet, but it was validated against the IPv4 CIDR limit.
Actual results: Not able to create UDN in UDN namespace on dualstack cluster
Expected results: should be able to create UDN in UDN namespace on dualstack cluster
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
OLMv1 is being GA'd with OCP 4.18, together with OLMv0. The long(ish)-term plan right now is for OLM v0 and v1 to be able to coexist on a cluster. However, their access to installable extensions is through different resources: v0 uses CatalogSource and v1 uses ClusterCatalog. We expect to see catalog content begin to diverge at some point, but don't have a specific timeline for it yet. oc-mirror v2 should generate ClusterCatalog YAML along with CatalogSource YAML. We will also work with the docs team to document that the ClusterCatalog YAML only needs to be applied when managing Operator catalogs with OLM v1.
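For illustration, the pair of resources oc-mirror would emit might look roughly like this; the exact ClusterCatalog group/version and field names are assumptions and should be taken from the OLM v1 (catalogd) API, and the registry/catalog references are placeholders:
~~~
# OLM v0 (existing output)
cat <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operator-index
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: <mirror-registry>/redhat/redhat-operator-index:v4.18
EOF

# OLM v1 (proposed additional output), applied only when using OLM v1
cat <<'EOF'
apiVersion: olm.operatorframework.io/v1
kind: ClusterCatalog
metadata:
  name: redhat-operator-index
spec:
  source:
    type: Image
    image:
      ref: <mirror-registry>/redhat/redhat-operator-index:v4.18
EOF
~~~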
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
ClusterCatalog resource is generated for operators.
Additional info:
OLMv1 currently only works for a small subset of operators in the catalog.
Description of problem:
HyperShift CEL validation blocks ARM64 NodePool creation for non-AWS/Azure platforms, so we can't add a Bare Metal worker node to the hosted cluster. This was discussed in the #project-hypershift Slack channel.
Version-Release number of selected component (if applicable):
MultiClusterEngine v2.7.2 HyperShift Operator image: registry.redhat.io/multicluster-engine/hypershift-rhel9-operator@sha256:56bd0210fa2a6b9494697dc7e2322952cd3d1500abc9f1f0bbf49964005a7c3a
How reproducible:
Always
Steps to Reproduce:
1. Create a HyperShift HostedCluster on a non-AWS/non-Azure platform 2. Try to create a NodePool with ARM64 architecture specification
Actual results:
- CEL validation blocks creating NodePool with arch: arm64 on non-AWS/Azure platforms - Receive error: "The NodePool is invalid: spec: Invalid value: "object": Setting Arch to arm64 is only supported for AWS and Azure" - Additional validation in NodePool spec also blocks arm64 architecture
Expected results:
- Allow ARM64 architecture specification for NodePools on BareMetal platform - Remove or update the CEL validation to support this use case
Additional info:
NodePool YAML:
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: nodepool-doca5-1
  namespace: doca5
spec:
  arch: arm64
  clusterName: doca5
  management:
    autoRepair: false
    replace:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
      strategy: RollingUpdate
    upgradeType: InPlace
  platform:
    agent:
      agentLabelSelector: {}
    type: Agent
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.16.21-multi
  replicas: 1
Description of problem:
https://access.redhat.com/errata/RHSA-2024:5422 did not seemingly fix the issue https://issues.redhat.com/browse/OCPBUGS-37060 in ROSA HCP, so opening a new bug. The builds installed in the hosted clusters have issues git-cloning repositories from external URLs whose CAs are configured in the ca-bundle.crt from the trustedCA section:
spec:
  configuration:
    apiServer: [...]
    proxy:
      trustedCA:
        name: user-ca-bundle <---
In traditional OCP implementations, the *-global-ca configmap is installed in the same namespace as the build and the ca-bundle.crt is injected into this configmap. In hosted clusters the configmap is being created empty:
$ oc get cm -n <app-namespace> <build-name>-global-ca -oyaml
apiVersion: v1
data:
  ca-bundle.crt: ""
As mentioned, the user-ca-bundle has the certificates configured:
$ oc get cm -n openshift-config user-ca-bundle -oyaml
apiVersion: v1
data:
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE----- <---
Version-Release number of selected component (if applicable):
4.16.17
How reproducible:
Steps to Reproduce:
1. Install hosted cluster with trustedCA configmap 2. Run a build in the hosted cluster 3. Check the global-ca configmap
Actual results:
global-ca is empty
Expected results:
global-ca injects the ca-bundle.crt properly
Additional info:
Created a new ROSA HCP cluster behind a transparent proxy at version 4.16.8 as it was mentioned as fixed in the above errata and the issue still exists. The transparent proxy certificate provided at cluster installation time is referenced in proxy/cluster as "user-ca-bundle-abcdefgh" and both "user-ca-bundle" and "user-ca-bundle-abcdefgh" configmaps in the "openshift-config" contain the certificate. However starting a template build for example such as "oc new-app cakephp-mysql-persistent" still results in the certificate not being injected into the "cakephp-mysql-persistent-1-global-ca" configmap and the build failing unlike the same scenario in an OCP cluster. oc logs build.build.openshift.io/cakephp-mysql-persistent-1 Cloning "https://github.com/sclorg/cakephp-ex.git" ... error: fatal: unable to access 'https://github.com/sclorg/cakephp-ex.git/': SSL certificate problem: unable to get local issuer certificate Also upgraded the cluster to 4.16.17 and still the issue persists.
Description of problem:
During the EUS to EUS upgrade of an MNO cluster from 4.14.16 to 4.16.11 on bare metal, we have seen that, depending on the custom configuration (like a performance profile or container runtime config), one or more control plane nodes are rebooted multiple times. This seems to be a race condition: when the first rendered MachineConfig is generated, the first control plane node starts the reboot (maxUnavailable is set to 1 on the master MCP), and at that moment a new rendered MachineConfig is generated, which means a second reboot. Once this first node has rebooted the second time, the rest of the control plane nodes are rebooted just once, because no more new rendered MachineConfigs are generated.
Version-Release number of selected component (if applicable):
OCP 4.14.16 > 4.15.31 > 4.16.11
How reproducible:
Perform the upgrade of a Multi Node OCP with a custom configuration like a performance profile or container runtime configuration (like force cgroups v1, or update runc to crun)
Steps to Reproduce:
1. Deploy on bare metal an MNO OCP 4.14 with a custom manifest, like the one below:
---
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  cgroupMode: v1
2. Upgrade the cluster to the next minor version available, for instance 4.15.31; make it a partial upgrade by pausing the worker Machine Config Pool.
3. Monitor the upgrade process (cluster operators, MachineConfigs, Machine Config Pools and nodes).
Actual results:
Once almost all the cluster operators are at version 4.15.31 (except the Machine Config Operator), review the rendered MachineConfigs that are generated for the master Machine Config Pool and also monitor the nodes: you will see that a new rendered MachineConfig is generated after the first control plane node has already been rebooted.
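One simple way to watch for the race while the upgrade is running (standard commands only; adjust the interval as needed):
~~~
watch -n 30 'oc get machineconfig | grep rendered-master; oc get mcp master; oc get nodes -l node-role.kubernetes.io/master'
~~~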
Expected results:
What is expected is that during an upgrade only one rendered MachineConfig is generated per Machine Config Pool, and each node reboots only once to finish the upgrade.
Additional info:
Description of problem:
The initial set of default endpoint overrides we specified in the installer are missing a v1 at the end of the DNS services override.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This is the testing scenario of QE test case OCP-24405, i.e. after a successful IPI installation, add an additional compute/worker node without the infra_id as the name prefix. The expectation is that "destroy cluster" deletes the additional compute/worker machine smoothly, but the testing result is that "destroy cluster" seems unaware of the machine.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-12-17-192034
How reproducible:
Always
Steps to Reproduce:
1. Install an IPI cluster on GCP and make sure it succeeds (see [1])
2. Add the additional compute/worker node, and ensure the node's name doesn't have the cluster infra ID (see [2])
3. Wait for the node to be ready and all cluster operators to be available
4. (optional) Scale the ingress operator replicas to 3 (see [3]), and wait for the ingress operator to finish progressing
5. Check the new machine on GCP (see [4])
6. "destroy cluster" (see [5])
Actual results:
The additional compute/worker node is not deleted, which also seems to leave the k8s firewall-rules / forwarding-rule / target-pool / http-health-check undeleted.
Expected results:
"destroy cluster" should be able to detect the additional compute/worker node by the label "kubernetes-io-cluster-<infra id>: owned" and delete it along with all resources of the cluster.
Additional info:
Alternatively, we also tested with creating the additional compute/worker machine by a machineset YAML (rather than a machine YAML), and we got the same issue in such case.
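A quick way to confirm whether the extra machine carries the ownership label the destroyer is expected to key off (the infra ID value is a placeholder):
~~~
INFRA_ID=<cluster-infra-id>
gcloud compute instances list \
  --filter="labels.kubernetes-io-cluster-${INFRA_ID}=owned" \
  --format="table(name,zone)"
~~~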
Description of problem:
During bootstrapping we're running into the following scenario: there are 4 members, masters 0, 1 and 2 (full voting) and the bootstrap member (torn down/dead). A revision rollout causes master 0 to restart and leaves you with 2/4 healthy, which means quorum loss. This causes apiserver unavailability during the installation and should be avoided.
Version-Release number of selected component (if applicable):
4.17, 4.18 but is likely a longer standing issue
How reproducible:
rarely
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
apiserver should not return any errors
Additional info:
Description of problem:
Observing these e2e failures consistently:
[sig-builds][Feature:Builds][subscription-content] builds installing subscription content [apigroup:build.openshift.io] should succeed for RHEL 7 base images [Suite:openshift/conformance/parallel]
[sig-builds][Feature:Builds][subscription-content] builds installing subscription content [apigroup:build.openshift.io] should succeed for RHEL 8 base images [Suite:openshift/conformance/parallel]
[sig-builds][Feature:Builds][subscription-content] builds installing subscription content [apigroup:build.openshift.io] should succeed for RHEL 9 base images [Suite:openshift/conformance/parallel]
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Fails consistently, and also fails when the steps are run manually
Steps to Reproduce:
1. Set up a 4.18 cluster
2. Run e2e
3. Test cases are in file origin/test/extended/builds/subscription_content.go (raw: https://raw.githubusercontent.com/openshift/origin/f7e4413793877efb24be86de05319dad00d05897/test/extended/builds/subscription_content.go)
Actual results:
Test case fails
Expected results:
Additional info:
Failures were observed in both OCP 4.17 and OCP 4.18. Following are the logs.
Component Readiness has found a potential regression in the following test:
install should succeed: infrastructure
installer fails with:
time="2024-10-20T04:34:57Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded"
Significant regression detected.
Fishers Exact probability of a regression: 99.96%.
Test pass rate dropped from 98.94% to 89.29%.
Sample (being evaluated) Release: 4.18
Start Time: 2024-10-14T00:00:00Z
End Time: 2024-10-21T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 98.94%
Successes: 93
Failures: 1
Flakes: 0
Description of problem:
OpenShift internal registry panics when deploying OpenShift on AWS ap-southeast-5 region
Version-Release number of selected component (if applicable):
OpenShift 4.15.29
How reproducible:
Always
Steps to Reproduce:
1. Deploy OpenShift 4.15.29 on AWS ap-southeast-5 region 2. The cluster gets deployed but the image-registry Operator is not available and image-registry pods get in CrashLoopBackOff state
Actual results:
panic: invalid region provided: ap-southeast-5

goroutine 1 [running]:
github.com/distribution/distribution/v3/registry/handlers.NewApp({0x2983cd0?, 0xc00005c088?}, 0xc000640c00)
    /go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/handlers/app.go:130 +0x2bf1
github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware.NewApp({0x2983cd0, 0xc00005c088}, 0x0?, {0x2986620?, 0xc000377560})
    /go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware/app.go:96 +0xb9
github.com/openshift/image-registry/pkg/dockerregistry/server.NewApp({0x2983cd0?, 0xc00005c088}, {0x296fa38?, 0xc0008e4148}, 0xc000640c00, 0xc000aa6140, {0x0?, 0x0})
    /go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/app.go:138 +0x485
github.com/openshift/image-registry/pkg/cmd/dockerregistry.NewServer({0x2983cd0, 0xc00005c088}, 0xc000640c00, 0xc000aa6140)
    /go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:212 +0x38a
github.com/openshift/image-registry/pkg/cmd/dockerregistry.Execute({0x2968300, 0xc000666408})
    /go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:166 +0x86b
main.main()
    /go/src/github.com/openshift/image-registry/cmd/dockerregistry/main.go:93 +0x496
Expected results:
The image-registry Operator and pods are available.
Additional info:
We can assume the results will be the same when deploying on 4.16 and 4.17, but that can't be tested yet as only 4.15 is working in this region. Will open another bug for the Installer to solve the issues while deploying on this region.
Description of problem:
In order to test OCL we run e2e automated test cases in a cluster that has OCL enabled in master and worker pools. We have seen that rarely a new machineconfig is rendered but no MOSB resource is created.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Rare
Steps to Reproduce:
We don't have any steps to reproduce it. It happens eventually when we run a regression in a cluster with OCL enabled in master and worker pools.
Actual results:
We see that in some scenarios a new MC is created, then a new rendered MC is created too, but no MOSB is created and the pool is stuck forever.
Expected results:
Whenever a new rendered MC is created, a new MOSB should be created too to build the new image.
Additional info:
In the comments section we will add all the must-gather files that are related to this issue. In some scenarios we can see this error reported by the os-builder pod: 2024-12-03T16:44:14.874310241Z I1203 16:44:14.874268 1 request.go:632] Waited for 596.269343ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/secrets?labelSelector=machineconfiguration.openshift.io%2Fephemeral-build-object%2Cmachineconfiguration.openshift.io%2Fmachine-os-build%3Dmosc-worker-5fc70e666518756a629ac4823fc35690%2Cmachineconfiguration.openshift.io%2Fon-cluster-layering%2Cmachineconfiguration.openshift.io%2Frendered-machine-config%3Drendered-worker-7c0a57dfe9cd7674b26bc5c030732b35%2Cmachineconfiguration.openshift.io%2Ftarget-machine-config-pool%3Dworker Nevertheless, we only see this error in some of them, not in all of them.
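A rough way to spot the stuck state described above on a live cluster (the machine-os-builder deployment name is an assumption based on the os-builder pod mentioned in this report):
~~~
# Newest rendered MCs vs. the MachineOSBuilds that should track them
oc get machineconfig | grep rendered-worker | tail -n 3
oc get machineosconfig
oc get machineosbuild

# Builder logs, where the client-side throttling message above shows up
oc logs -n openshift-machine-config-operator deployment/machine-os-builder
~~~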
Analyze the data from the new tests and determine what, if anything, we should do.
Description of problem:
Sometimes ovs-configuration cannot be started, with errors as below:
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + add_nm_conn br-ex type ovs-bridge conn.interface br-ex 802-3-ethernet.mtu 1500 connection.autoconnect-slaves 1
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + nmcli c add save no con-name br-ex type ovs-bridge conn.interface br-ex 802-3-ethernet.mtu 1500 connection.autoconnect-slaves 1 connection.autoconnect no
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9781]: Connection 'br-ex' (eb9fdfa0-912f-4ee2-b6ac-a5040b290183) successfully added.
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + nmcli connection show ovs-port-phys0
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + ovs-vsctl --timeout=30 --if-exists del-port br-ex ens1f0np0
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + add_nm_conn ovs-port-phys0 type ovs-port conn.interface ens1f0np0 master br-ex connection.autoconnect-slaves 1
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: + nmcli c add save no con-name ovs-port-phys0 type ovs-port conn.interface ens1f0np0 master br-ex connection.autoconnect-slaves 1 connection.autoconnect no
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9790]: Error: failed to modify connection.port-type: 'ovs-interface' not among [bond, bridge, ovs-bridge, ovs-port, team, vrf].
Sep 08 12:45:34 openshift-qe-024.lab.eng.rdu2.redhat.com configure-ovs.sh[9472]: ++ handle_exit
However, there is a workaround: removing the existing `ovs-if-br-ex` connection with `nmcli connection delete ovs-if-br-ex` fixes this issue.
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-09-07-151850
How reproducible:
not always
Steps to Reproduce:
1. Create many bond interfaces via nmstate NNCP 2. Reboot the worker 3.
Actual results:
ovs-configuration service cannot be started up
Expected results:
ovs-configuration service should be started without any issue
Additional info:
Not sure if these bond interface affected this issue. However there is workaround is remove the existing `ovs-if-br-ex` by `nmcli connection delete ovs-if-br-ex` can fix this issue. [root@openshift-qe-024 ~]# nmcli c NAME UUID TYPE DEVICE ens1f0np0 701f8b4e-819d-56aa-9dfb-16c00ea947a8 ethernet ens1f0np0 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet eno1 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet ens1f1np1 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet ens2f2 bond12 ba986131-f4d2-460c-b883-a1d6a9ddfdcb bond bond12 bond12.101 46bc3df0-e093-4096-a747-0e6717573f82 vlan bond12.101 bond12.102 93d68598-7453-4666-aff6-87edfcf2f372 vlan bond12.102 bond12.103 be6013e1-6b85-436f-8ce8-24655db0be17 vlan bond12.103 bond12.104 fabf9a76-3635-48d9-aace-db14ae2fd9c3 vlan bond12.104 bond12.105 0fab3700-ce50-4815-b329-35af8f830cb1 vlan bond12.105 bond12.106 68c20304-f3e9-4238-96d7-5bcce05b3827 vlan bond12.106 bond12.107 f1029614-2e6e-4e20-b9b6-79902dd12ac9 vlan bond12.107 bond12.108 27669b6f-e24d-4ac2-a8ba-35ca0b6c5b05 vlan bond12.108 bond12.109 d421e0bb-a441-4305-be23-d1964cb2bb46 vlan bond12.109 bond12.110 c453e70c-e460-4e80-971c-88fac4bd1d9e vlan bond12.110 bond12.111 2952a2c6-deb4-4982-8a4b-2a962c3dda96 vlan bond12.111 bond12.112 5efe4b2d-2834-4b0b-adb2-8caf153cef2d vlan bond12.112 bond12.113 2ec39bea-2704-4b8a-83fa-d48e1ef1c472 vlan bond12.113 bond12.114 8fc8ae53-cc8f-4412-be7d-a05fc3abdffe vlan bond12.114 bond12.115 58f9e047-fe4f-475d-928f-7dec74cf379f vlan bond12.115 bond12.116 d4d133cb-13cc-43f3-a636-0fbcb1d2b65d vlan bond12.116 bond12.117 3a2d10a1-3fd8-4839-9836-56eb6cab76a7 vlan bond12.117 bond12.118 8d1a22da-efa0-4a06-ab6d-6840aa5617ea vlan bond12.118 bond12.119 b8556371-eba8-43ba-9660-e181ec16f4d2 vlan bond12.119 bond12.120 989f770f-1528-438b-b696-eabcb5500826 vlan bond12.120 bond12.121 b4c651f6-18d7-47ce-b800-b8bbeb28ed60 vlan bond12.121 bond12.122 9a4c9ec2-e5e4-451a-908c-12d5031363c6 vlan bond12.122 bond12.123 aa346590-521a-40c0-8132-a4ef833de60c vlan bond12.123 bond12.124 c26297d6-d965-40e1-8133-a0d284240e46 vlan bond12.124 bond12.125 24040762-b6a0-46f7-a802-a86b74c25a1d vlan bond12.125 bond12.126 24df2984-9835-47c2-b971-b80d911ede8d vlan bond12.126 bond12.127 0cc62ca7-b79d-4d09-8ec3-b48501053e41 vlan bond12.127 bond12.128 bcf53331-84bd-400c-a95c-e7f1b846e689 vlan bond12.128 bond12.129 88631a53-452c-4dfe-bebe-0b736633d15a vlan bond12.129 bond12.130 d157ffb0-2f63-4844-9a16-66a035315a77 vlan bond12.130 bond12.131 a36f8fb2-97d6-4059-8802-ce60faffb04a vlan bond12.131 bond12.132 94aa7a8e-b483-430f-8cd1-a92561719954 vlan bond12.132 bond12.133 7b3a2b6e-72ad-4e0a-8f37-6ecb64d1488c vlan bond12.133 bond12.134 68b80892-414f-4372-8247-9276cea57e88 vlan bond12.134 bond12.135 08f4bdb2-469f-4ff7-9058-4ed84226a1dd vlan bond12.135 bond12.136 a2d13afa-ccac-4efe-b295-1f615f0d001b vlan bond12.136 bond12.137 487e29dc-6741-4406-acec-47e81bed30d4 vlan bond12.137 bond12.138 d6e2438f-2591-4a7a-8a56-6c435550c3ae vlan bond12.138 bond12.139 8a2e21c3-531b-417e-b747-07ca555909b7 vlan bond12.139 bond12.140 8e3c5d65-5098-48a5-80c4-778d41b24634 vlan bond12.140 bond12.141 7aaca678-27e1-4219-9410-956649313c52 vlan bond12.141 bond12.142 6765c730-3240-48c8-ba29-88113c703a88 vlan bond12.142 bond12.143 3e9cef84-4cb1-4f17-98eb-de9a13501453 vlan bond12.143 bond12.144 ebaa63ee-10be-483d-9096-43252757b7fa vlan bond12.144 bond12.145 1ba28e89-0578-4967-85d3-95c03677f036 vlan bond12.145 bond12.146 75ac1594-a761-4066-9ac9-a2f4cc853429 vlan bond12.146 bond12.147 
b8c7e473-8179-49f7-9ea8-3494ce4a0244 vlan bond12.147 bond12.148 4c643923-8412-4550-b43c-cdb831dd28e9 vlan bond12.148 bond12.149 418fa841-24ba-4d6f-bc5a-37c8ffb25d45 vlan bond12.149 bond12.150 1eb8d1ce-256e-42f3-bacd-e7e5ac30bd9a vlan bond12.150 bond12.151 aaab839b-0fbc-4ba9-9371-c460172566a2 vlan bond12.151 bond12.152 de2559c4-255b-45ac-8602-968796e647a6 vlan bond12.152 bond12.153 52b5d827-c212-45f1-975d-c0e5456c19e9 vlan bond12.153 bond12.154 26fc0abd-bfe5-4f66-a3a5-fadefdadb9df vlan bond12.154 bond12.155 0677f4a8-9260-475c-93ca-e811a47d5780 vlan bond12.155 bond12.156 4b4039f4-1e7e-4427-bc3a-92fe37bec27e vlan bond12.156 bond12.157 38b7003e-a20c-4ef6-8767-e4fdfb7cd61b vlan bond12.157 bond12.158 7d073e1b-1cf7-4e49-9218-f96daf97150a vlan bond12.158 bond12.159 3d8c5222-e59c-45c9-acb6-1a6169e4eb6d vlan bond12.159 bond12.160 764bce7a-ec99-4f8b-9e39-d47056733c0c vlan bond12.160 bond12.161 63ee9626-2c17-4335-aa17-07a38fa820d8 vlan bond12.161 bond12.162 6f8298ff-4341-42a6-93a8-66876042ca16 vlan bond12.162 bond12.163 7bb90042-f592-49c6-a0c9-f4d2cf829674 vlan bond12.163 bond12.164 3fd8b04f-8bd0-4e8d-b597-4fd37877d466 vlan bond12.164 bond12.165 06268a05-4533-4bd2-abb8-14c80a6d0411 vlan bond12.165 bond12.166 4fa1f0c1-e55d-4298-bfb5-3602ad446e61 vlan bond12.166 bond12.167 494e1a43-deb2-4a69-90da-2602c03400fb vlan bond12.167 bond12.168 d2c034cd-d956-4d02-8b6e-075acfcd9288 vlan bond12.168 bond12.169 8e2467b7-80dd-45b6-becc-77cbc632f1f0 vlan bond12.169 bond12.170 3df788a3-1715-4a1c-9f5d-b51ffd3a5369 vlan bond12.170 dummy1 b4d7daa3-b112-4606-8b9c-cb99b936b2b9 dummy dummy1 dummy2 c99d8aa1-0627-47f3-ae57-f3f397adf0e8 dummy dummy2 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet enp138s0np0 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet ens2f3 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet ens4f2 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet ens4f3 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet ens8f0 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet ens8f1 Wired Connection b7361c63-fb2a-4a95-80f4-c669fd368bbf ethernet ens8f3 lo ae4bbedd-1a2e-4c97-adf7-4339cf8fb226 loopback lo ovs-if-br-ex 90af89d6-a3b0-4497-b6d0-7d2cc2d5098a ovs-interface --
As a maintainer of the SNO CI lane, I would like to ensure that the following test doesn't fail regularly as part of SNO CI.
[sig-architecture] platform pods in ns/openshift-e2e-loki should not exit an excessive amount of times
This issue is a symptom of a greater problem with SNO: after the upgrade reboot there is a window of DNS resolution downtime while the DNS operator is deploying the new DNS pods. During that time, loki exits after hitting the following error:
2024/10/23 07:21:32 OIDC provider initialization failed: Get "https://sso.redhat.com/auth/realms/redhat-external/.well-known/openid-configuration": dial tcp: lookup sso.redhat.com on 172.30.0.10:53: read udp 10.128.0.4:53104->172.30.0.10:53: read: connection refused
This issue is important because it can contribute to payload rejection in our blocking CI jobs.
Acceptance Criteria:
Update openshift/api to k8s 1.32
Description of problem:
If the bootstrap fails, the installer will try to get the VM console logs via the AWS SDK which requires the ec2:GetConsoleOutput permission.
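For context, a minimal sketch of the SDK call that needs this permission (an illustration with an assumed instance ID and session setup, not the installer's actual gather code):
package main

import (
	"encoding/base64"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// GetConsoleOutput is the API behind "Pulling VM console logs"; without
	// ec2:GetConsoleOutput in the policy this call returns UnauthorizedOperation.
	client := ec2.New(session.Must(session.NewSession()))
	out, err := client.GetConsoleOutput(&ec2.GetConsoleOutputInput{
		InstanceId: aws.String("i-0123456789abcdef0"), // illustrative instance ID
	})
	if err != nil {
		log.Fatalf("pulling VM console logs: %v", err)
	}
	decoded, _ := base64.StdEncoding.DecodeString(aws.StringValue(out.Output))
	fmt.Println(string(decoded))
}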
Version-Release number of selected component (if applicable):
all versions where we enabled VM console log gathering
How reproducible:
always
Steps to Reproduce:
1. Use minimal permissions and force a bootstrap failure 2. 3.
Actual results:
level=info msg=Pulling VM console logs level=error msg=UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-xgq2j8ch-f93c7-minimal-perm is not authorized to perform: ec2:GetConsoleOutput on resource: arn:aws:ec2:us-west-1:460538899914:instance/i-0fa40c9966e9f1ab9 because no identity-based policy allows the ec2:GetConsoleOutput action. Encoded authorization failure message: XYfLhyZ0pKnDzJrs9ZbOH8z8YkG03aPhT6U57EoqiLH8iS5PZvFgbgONlBuZfDswNpaNBVOfZcdPc1dWYoIsoPTXtQ_n32tzrdxloK7qpVbvkuesHtb8ytV8iLkpmOGyArMqp7Muphn2yXG9DQ5aijx-zQh_ShwirruMTZWhkZdx7_f1WtfjnCBVJGRwAc-rMZ_Xh82-jjxQlQbtBfgJ8COc3kQm7E_iJ1Ngyrcmu6bmVKCS6cEcGIVwRi03PZRtiemfejZfUT7yhKppB-zeeRm5bBWWVmRiuJhswquIW4dH0E9obNvq76-C0b2PR_V9ep-t0udUcypKGilqzqT1DY51gaP66GlSEfN5b4CTLTQxEhE73feZn4xEK0Qq4MkatPFJeGsUcxY5TXEBsGMooj4_D7wPFwkY46QEle41oqs-KNCWEifZSlV5f4IUyiSear85LlUIxBS9-_jfitV90Qw7MZM4z8ggIinQ_htfvRKgnW9tjREDj6hzpydQbViaeAyBod3Q-qi2vgeK6uh7Q6kqK3f8upu1hS8I7XD_TH-oP-npbVfkiPMIQGfy3vE3J5g1AyhQ24LUjR15y-jXuBOYvGIir21zo9oGKc0GEWRPdZr4suSbbx68rZ9TnTHXfwa0jrhIns24uwnANdR9U2NStE6XPJk9KWhbbz6VD6gRU72qbr2V7QKPiguNpeO_P5uksRDwEBWxDfQzMyDWx1zOhhPPAjOQRup1-vsPpJhkgkrsdhPebN0duz6Hd4yqy0RiEyb1sSMaQn_8ac_2vW9CLuWWbbt5qo2WlRllo3U7-FpvlP6BRGTPjv5z3O4ejrGsnfDxm7KF0ANvLU0KT2dZvKugB6j-Kkz56HXHebIzpzFPRpmo0B6H3FzpQ5IpzmYiWaQ6sNMoaatmoE2z420AJAOjSRBodqhgi2cVxyHDqHt0E0PQKM-Yt4exBGm1ZddC5TUPnCrDnZpdu2WLRNHMxEBgKyOzEON_POuDaOP0paEXFCflt7kNSlBRMRqAbOpGI_F96wlNmDO58KZDbPKgdOfomwkaR5icdeS-tQyQk2PnhieOTNL1M5hQZpLrzWVeJzZEtmZ_0vsePUdvXYusvL828ldyg8VCwq-B2oGD_ym_iPCINBC7sIy8Q0HVb5v5dzbs4l2UKcC7OzTG-TMlxphV20DqNmC5yCnHEdmnleNA48J69HdTMw_G7N9mo5IrXw049MjvYnia4NwarMGUvoBYnxROfQ2jprN7_BW-Cdyp2Ca2P9uU9AeSubeeQdzieazkXNeR9_4Su_EGsbQm Instance=i-0fa40c9966e9f1ab9
Expected results:
No failures.
Additional info:
See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/57437/rehearse-57437-pull-ci-openshift-installer-master-e2e-aws-ovn-user-provisioned-dns/1860020284598259712 for an example of a failed job
Description of problem:
We are currently using node 18, but our types are for node 10
Version-Release number of selected component (if applicable):
4.19.0
How reproducible:
Always
Steps to Reproduce:
1. Open frontend/package.json 2. Observe @types/node and engine version 3.
Actual results:
They are different
Expected results:
They are the same
Additional info:
Description of problem:
When the user projects monitoring feature is turned off, the operator cleans up the resources for user project monitoring by running multiple DELETE requests against the apiserver. This has several drawbacks:
* The API server can't cache DELETE requests, so it has to query etcd every time
* The audit log is flooded with "delete failed: object 'foo' not found" records
The function should first check that the object exists (GET requests are cacheable) before issuing a DELETE request.
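A minimal sketch of the suggested get-then-delete pattern with client-go, using a ConfigMap as the example (the resource and names are illustrative, not the operator's actual cleanup code):
package cleanup

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteConfigMapIfExists avoids blind DELETEs: the GET can be answered cheaply,
// and a NotFound result means there is nothing to delete, so no failed delete
// ends up in the audit log.
func deleteConfigMapIfExists(ctx context.Context, c kubernetes.Interface, namespace, name string) error {
	_, err := c.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil // already gone, nothing to do
	}
	if err != nil {
		return err
	}
	return c.CoreV1().ConfigMaps(namespace).Delete(ctx, name, metav1.DeleteOptions{})
}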
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Start 4.16.0 2. Check audit log 3.
Actual results:
Audit log has messages like: configmaps "serving-certs-ca-bundle" not found servicemonitors.monitoring.coreos.com "thanos-ruler" not found printed every few minutes
Expected results:
No failed delete requests
Additional info:
There's a possible flake in openshift-tests because the upstream test "Should recreate evicted statefulset" occasionally causes kubelet to emit a "failed to bind hostport" event because it tries to recreate a deleted pod too quickly, and this gets flagged by openshift-tests as a bad thing, even though it's not (because it retries and succeeds).
I filed a PR to fix this a long time ago, it just needs review.
Description of problem:
There is a duplicate external link icon on the Operator details modal for the 'Purchase' button
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2024-12-18-013707
How reproducible:
Always
Steps to Reproduce:
1. find one Marketplace operator and click on operator tile
Actual results:
1. on the Operator details modal there is a `Purchase` button, and two duplicate external link icons are displayed beside it
Expected results:
1. only one external link icon is required
Additional info:
screenshot https://drive.google.com/file/d/1uGCXxXdR8ayXRafhabHepW5mVqwePzcq/view?usp=drive_link
Description of problem:
Once the Machine Config Pool goes into a degraded state due to an incorrect machine config, the pool doesn't recover from this state even after the machine config is updated with a correct configuration.
Version-Release number of selected component (if applicable):
4.16.0, applicable for previous versions as well.
How reproducible:
Always
Steps to Reproduce:
1. Create Machine Config with invalid extension name. 2. Wait for Machine Config Pool goes into degraded state. 3. Update Machine Config with correct extension name or delete Machine Config.
Actual results:
Machine Config Pool doesn't recover and always in degraded state.
Expected results:
The Machine Config Pool must be restored and the degraded condition must be set to false.
Additional info:
conditions:
- lastTransitionTime: "2024-05-16T11:15:51Z"
  message: ""
  reason: ""
  status: "False"
  type: RenderDegraded
- lastTransitionTime: "2024-05-27T15:05:50Z"
  message: ""
  reason: ""
  status: "False"
  type: Updated
- lastTransitionTime: "2024-05-27T15:07:41Z"
  message: 'Node worker-1 is reporting: "invalid extensions found: [ipsec11]"'
  reason: 1 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
- lastTransitionTime: "2024-05-27T15:07:41Z"
  message: ""
  reason: ""
  status: "True"
  type: Degraded
- lastTransitionTime: "2024-05-27T15:05:50Z"
  message: All nodes are updating to MachineConfig rendered-worker-c585a5140738aa0a2792cf5f25b4eb20
  reason: ""
  status: "True"
  type: Updating
Description of problem:
When we enable techpreview and we try to scale up a new node using a 4.5 base image, the node cannot join the cluster
Version-Release number of selected component (if applicable):
IPI on AWS $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.17.0-0.nightly-2024-08-19-165854 True False 5h25m Cluster version is 4.17.0-0.nightly-2024-08-19-165854
How reproducible:
Always
Steps to Reproduce:
1. Create a new machineset using a 4.5 base image and a 2.2.0 ignition version Detailed commands to create this machineset can be found here: [OCP-52822-Create new config resources with 2.2.0 ignition boot image nodes|https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-52822] 2. Scale up this machineset to create a new worker node
Actual results:
The node cannot join the cluster. We can find this message in the machine-config-daemon-pull.service in the failed node Wed 2024-08-21 13:02:19 UTC ip-10-0-29-231 machine-config-daemon-pull.service[1971]: time="2024-08-21T13:02:19Z" level=warning msg="skip_mount_home option is no longer supported, ignoring option" Wed 2024-08-21 13:02:20 UTC ip-10-0-29-231 machine-config-daemon-pull.service[1971]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a0afcde0e240601cb4a761e95f8311984b02ee76f827527d425670be3a39797": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a0afcde0e240601cb4a761e95f8311984b02ee76f827527d425670be3a39797: invalid policy in "/etc/containers/policy.json": Unknown policy requirement type "sigstoreSigned"
Expected results:
Nodes should join the cluster
Additional info:
If techpreview is not enabled, the node can join the cluster without problems The podman version in a 4.5 base image is: $ podman version WARN[0000] skip_mount_home option is no longer supported, ignoring option Version: 1.9.3 RemoteAPI Version: 1 Go Version: go1.13.4 OS/Arch: linux/amd64
Description of problem:
Project dropdown is partially hidden due to web terminal
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Install and initialize web-terminal 2. Open the Project bar 3.
Actual results:
Attaching screenshot:
https://drive.google.com/file/d/1AaYXCZsEcBiXVCIBqXkavKvXbjFb1YlP/view?usp=sharing
Expected results:
Project namespace bar should be at the front
Additional info:
Description of problem:
When doing deployments on baremetal with assisted installer, it is not possible to use nmstate-configuration because it is only enabled for platform baremetal, and AI uses platform none. Since we have many baremetal users on assisted we should enable it there as well.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We are aiming to find containers that restart more than 3 times over the course of a test.
If this link code rots, you just need to search for "restarted .* times" in the openshift CI for 4.18.
PS: I took a guess at who owns the openshift/ingress-operator, so please reassign once you find the correct owner.
We are adding an exclusion for this container but we ask that you look into fixing this.
Description of problem:
While working on ARO-13685 I (accidentally) crashed the CVO payload init containers. I found that the removal logic based on plain "rm" is not idempotent, so if any of the init containers crash mid-way, the restart will never be able to succeed. The fix is to use "rm -f" in all places instead.
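A small sketch of the idempotency idea, the Go equivalent of switching from "rm" to "rm -f" (paths and helper names are illustrative, not the actual HyperShift code):
package main

import (
	"fmt"
	"os"
)

// removeIfPresent treats "already deleted" as success so that a restarted init
// container does not crash loop on files removed by a previous, interrupted run.
func removeIfPresent(paths ...string) error {
	for _, p := range paths {
		if err := os.Remove(p); err != nil && !os.IsNotExist(err) {
			return fmt.Errorf("removing %s: %w", p, err)
		}
	}
	return nil
}

func main() {
	// Running this twice is safe: the second invocation is a no-op, not an error.
	if err := removeIfPresent("/var/payload/manifests/example.yaml"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}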
Version-Release number of selected component (if applicable):
4.18 / main, but existed in prior versions
How reproducible:
always
Steps to Reproduce:
1. inject a crash in the bootstrap init container https://github.com/openshift/hypershift/blob/99c34c1b6904448fb065cd65c7c12545f04fb7c9/control-plane-operator/controllers/hostedcontrolplane/cvo/reconcile.go#L353 2. the restarting previous init container "prepare-payload" will crash loop on "rm" not succeeding as the previous invocation already deleted all manifests
Actual results:
the prepare-payload init container will crash loop forever, preventing the container CVO from running
Expected results:
a crashing init container should be able to restart gracefully without getting stuck on file removal and eventually run the CVO container
Additional info:
based off the work in https://github.com/openshift/hypershift/pull/5315
Description of problem:
In an effort to ensure all HA components are not degraded by design during normal e2e tests or upgrades, we are collecting all operators that blip Degraded=True during any payload job run. This card captures the machine-config operator, which blips Degraded=True during upgrade runs. Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-azure-ovn-upgrade/1843023092004163584 Reasons associated with the blip: RenderConfigFailed For now, we have put an exception in the test, but teams are expected to take action to fix the blips and remove the exceptions after the fixes go in. Exceptions are defined here: See the linked issue for more explanation on the effort.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When the oc-mirror command is run twice, the catalog rebuild fails with the error: [ERROR] : unable to rebuild catalog oci:///test/yinzhou/out20/working-dir/operator-catalogs/redhat-operator-index/33dd53f330f4518bd0427772debd3331aa4e21ef4ff4faeec0d9064f7e4f24a9/catalog-image: filtered declarative config not found
Version-Release number of selected component (if applicable):
oc-mirror version W1120 10:40:11.056507 6751 mirror.go:102] ⚠️ oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-324-gbae91d5", GitCommit:"bae91d55", GitTreeState:"clean", BuildDate:"2024-11-20T02:06:04Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Run the mirror-to-mirror workflow twice with the same ImageSetConfiguration and the same workspace; the second run fails with the error above.
$ cat config-20.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v2alpha1
mirror:
  additionalImages:
  - name: registry.redhat.io/ubi8/ubi:latest
  - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27
  operators:
  - catalog: oci:///test/redhat-operator-index
    packages:
    - name: aws-load-balancer-operator
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15
    packages:
    - name: devworkspace-operator
`oc-mirror -c config-20.yaml docker://my-route-e2e-test-ocmirrorv2-pxbg4.apps.yinzhou11202.qe.devcluster.openshift.com --workspace file://out20 --v2 --dest-tls-verify=false`
Actual results:
oc-mirror -c config-20.yaml docker://my-route-e2e-test-ocmirrorv2-pxbg4.apps.yinzhou11202.qe.devcluster.openshift.com --workspace file://out20 --v2 --dest-tls-verify=false2024/11/20 10:34:00 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/11/20 10:34:00 [INFO] : 👋 Hello, welcome to oc-mirror 2024/11/20 10:34:00 [INFO] : ⚙️ setting up the environment for you... 2024/11/20 10:34:00 [INFO] : 🔀 workflow mode: mirrorToMirror 2024/11/20 10:34:00 [INFO] : 🕵️ going to discover the necessary images... 2024/11/20 10:34:00 [INFO] : 🔍 collecting release images... 2024/11/20 10:34:00 [INFO] : 🔍 collecting operator images... ✓ () Collecting catalog oci:///test/redhat-operator-index ✓ (2s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.15 2024/11/20 10:34:02 [INFO] : 🔍 collecting additional images... 2024/11/20 10:34:02 [INFO] : 🔍 collecting helm images... 2024/11/20 10:34:02 [INFO] : 🔂 rebuilding catalogs 2024/11/20 10:34:02 [INFO] : 👋 Goodbye, thank you for using oc-mirror 2024/11/20 10:34:02 [ERROR] : unable to rebuild catalog oci:///test/yinzhou/out20/working-dir/operator-catalogs/redhat-operator-index/33dd53f330f4518bd0427772debd3331aa4e21ef4ff4faeec0d9064f7e4f24a9/catalog-image: filtered declarative config not found
Expected results:
no error
Additional info:
Deleting the workspace directory and running the command again avoids the issue.
ISSUE:
The cluster storage operator is in a degraded state because it is unable to find the UUID for the Windows node.
DESCRIPTION:
The customer has one Windows node in the OCP environment, which is installed on vSphere. The storage CO is in a degraded state with the following error:
~~~
'VSphereCSIDriverOperatorCRDegraded: VMwareVSphereOperatorCheckDegraded:
unable to find VM win-xx-xx by UUID
The vSphere CSI driver operator is trying to look up the UUID of that Windows machine, which is not intended.
~~~
2024-09-27T15:44:27.836266729Z E0927 15:44:27.836234 1 check_error.go:147] vsphere driver install failed with unable to find VM win-ooiv8vljg7 by UUID , found existing driver
2024-09-27T15:44:27.860300261Z W0927 15:44:27.836249 1 vspherecontroller.go:499] Marking cluster as degraded: vcenter_api_error unable to find VM win--xx-xx by UUID
~~~
So, the operator pod should exclude the Windows node and should not go into a 'Degraded' state.
User Story:
As an OpenShift Engineer, I want to create a PR for the machine-api refactoring of feature gate parameters, so that we can pull out the logic from Neil's PR that removes individual feature gate parameters and use the new FeatureGate mutable map.
Description:
< Record any background information >
Acceptance Criteria:
< Record how we'll know we're done >
Other Information:
< Record anything else that may be helpful to someone else picking up the card >
issue created by splat-bot
Description of problem:
The manila controller[1] defines labels that are not based on the asset prefix defined in the manila config[2]. Consequently, when assets that select this resource are generated, they use the asset prefix as the base for the label, resulting in the controller not being selected, for example in the pod anti-affinity[3] and the controller PDB[4]. We need to change the labels used in the selectors to match the actual labels of the controller.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When executing a delete, the logs are wrong and still say mirroring is ongoing: Mirroring is ongoing. No errors
Version-Release number of selected component (if applicable):
./oc-mirror.rhel8 version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202411090338.p0.g0a7dbc9.assembly.stream.el9-0a7dbc9", GitCommit:"0a7dbc90746a26ddff3bd438c7db16214dcda1c3", GitTreeState:"clean", BuildDate:"2024-11-09T08:33:46Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.module+el8.10.0+22325+dc584f75) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
when execute the delete, the logs still say mirror is ongoing: oc mirror delete --delete-yaml-file test/yinzhou/debug72708/working-dir/delete/delete-images.yaml docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --v2 --dest-tls-verify=false --force-cache-delete=true envar TEST_E2E detected - bypassing unshare2024/11/12 03:10:04 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/11/12 03:10:04 [INFO] : 👋 Hello, welcome to oc-mirror 2024/11/12 03:10:04 [INFO] : ⚙️ setting up the environment for you... 2024/11/12 03:10:04 [INFO] : 🔀 workflow mode: diskToMirror / delete 2024/11/12 03:10:04 [INFO] : 👀 Reading delete file... 2024/11/12 03:10:04 [INFO] : 🚀 Start deleting the images... 2024/11/12 03:10:04 [INFO] : images to delete 396 ✓ 1/396 : (0s) docker://registry.redhat.io/devworkspace/devworkspace-operator-bundle@sha256:5689ad3d80dea99cd842992523debcb1aea17b6db8dbd80e412cb2e… 2024/11/12 03:10:04 [INFO] : Mirroring is ongoing. No errors.
Actual results:
oc mirror delete --delete-yaml-file test/yinzhou/debug72708/working-dir/delete/delete-images.yaml docker://my-route-zhouy.apps.yinzhou-1112.qe.devcluster.openshift.com --v2 --dest-tls-verify=false --force-cache-delete=true envar TEST_E2E detected - bypassing unshare2024/11/12 03:10:04 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/11/12 03:10:04 [INFO] : 👋 Hello, welcome to oc-mirror 2024/11/12 03:10:04 [INFO] : ⚙️ setting up the environment for you... 2024/11/12 03:10:04 [INFO] : 🔀 workflow mode: diskToMirror / delete 2024/11/12 03:10:04 [INFO] : 👀 Reading delete file... 2024/11/12 03:10:04 [INFO] : 🚀 Start deleting the images... 2024/11/12 03:10:04 [INFO] : images to delete 396 ✓ 1/396 : (0s) docker://registry.redhat.io/devworkspace/devworkspace-operator-bundle@sha256:5689ad3d80dea99cd842992523debcb1aea17b6db8dbd80e412cb2e… 2024/11/12 03:10:04 [INFO] : Mirroring is ongoing. No errors.
Expected results:
Show correct delete logs
Additional info:
The edit route action shows an "Edit" button in order to save the changes, instead of a "Save" button.
The button label is "Save" on other forms e.g. Deployment.
Description of problem:
When updating cypress-axe, new changes and bugfixes in the axe-core accessibility auditing package have surfaced various accessibility violations that have to be addressed
Version-Release number of selected component (if applicable):
openshift4.18.0
How reproducible:
always
Steps to Reproduce:
1. Update axe-core and cypress-axe to the latest versions 2. Run test-cypress-console and run a cypress test, I used other-routes.cy.ts
Actual results:
The tests fail with various accessibility violations
Expected results:
The tests pass without accessibility violations
Additional info:
Description of problem:
Following error returns in IPI Baremetal install with recent 4.18 builds. In bootstrap vm, https is not configured on 6180 port used in boot iso url. openshift-master-1: inspection error: Failed to inspect hardware. Reason: unable to start inspection: HTTP POST https://[2620:52:0:834::f1]:8000/redfish/v1/Managers/7fffdce9-ff4a-4e6a-b598-381c58564ca5/VirtualMedia/Cd/Actions/VirtualMedia.InsertMedia returned code 500. Base.1.0.GeneralError: Failed fetching image from URL https://[2620:52:0:834:f112:3cff:fe47:3a0a]:6180/redfish/boot-93d79ad0-0d56-4c8f-a299-6dc1b3f40e74.iso: HTTPSConnectionPool(host='2620:52:0:834:f112:3cff:fe47:3a0a', port=6180): Max retries exceeded with url: /redfish/boot-93d79ad0-0d56-4c8f-a299-6dc1b3f40e74.iso (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1131)'))) Extended information: [{'@odata.type': '/redfish/v1/$metadata#Message.1.0.0.Message', 'MessageId': 'Base.1.0.GeneralError'}]"
Version-Release number of selected component (if applicable):
4.18 ec.4, 4.18.0-0.nightly-2024-11-27-162407
How reproducible:
100%
Steps to Reproduce:
1. trigger ipi baremetal install with dual stack config using virtual media 2. 3.
Actual results:
inspection fails at fetching boot iso
Expected results:
Additional info:
# port 6180 used in ironic ipv6 url is not configured for https. Instead, ssl service is running # at https://[2620:52:0:834:f112:3cff:fe47:3a0a]:6183. # May be introduced by OCPBUGS-39404. [root@api core]# cat /etc/metal3.env AUTH_DIR=/opt/metal3/auth IRONIC_ENDPOINT="http://bootstrap-user:pJ0R9XXsxUfoYVK2@localhost:6385/v1" IRONIC_EXTERNAL_URL_V6="https://[2620:52:0:834:f112:3cff:fe47:3a0a]:6180/" METAL3_BAREMETAL_OPERATOR_IMAGE="quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e142d5989415da3c1035d04f84fa765c127bf2cf3406c4612e36607bb03384d9" [root@api core]# echo "" | openssl s_client -connect localhost:6180 CONNECTED(00000003) 405CE187787F0000:error:0A00010B:SSL routines:ssl3_get_record:wrong version number:ssl/record/ssl3_record.c:354: --- no peer certificate available --- No client certificate CA names sent --- SSL handshake has read 5 bytes and written 295 bytes Verification: OK --- New, (NONE), Cipher is (NONE) Secure Renegotiation IS NOT supported Compression: NONE Expansion: NONE No ALPN negotiated Early data was not sent Verify return code: 0 (ok) ---
4.19 CI payloads have now failed multiple times in a row on hypershift-e2e, with the same two Karpenter tests.
: TestKarpenter/Main expand_less 0s {Failed === RUN TestKarpenter/Main util.go:153: Successfully waited for kubeconfig to be published for HostedCluster e2e-clusters-t8sw8/example-vr6sz in 25ms util.go:170: Successfully waited for kubeconfig secret to have data in 25ms util.go:213: Successfully waited for a successful connection to the guest API server in 25ms karpenter_test.go:52: Expected success, but got an error: <*meta.NoKindMatchError | 0xc002d931c0>: no matches for kind "NodePool" in version "karpenter.sh/v1" { GroupKind: { Group: "karpenter.sh", Kind: "NodePool", }, SearchedVersions: ["v1"], } --- FAIL: TestKarpenter/Main (0.10s) } : TestKarpenter expand_less 27m15s {Failed === RUN TestKarpenter === PAUSE TestKarpenter === CONT TestKarpenter hypershift_framework.go:316: Successfully created hostedcluster e2e-clusters-t8sw8/example-vr6sz in 24s hypershift_framework.go:115: Summarizing unexpected conditions for HostedCluster example-vr6sz util.go:1699: Successfully waited for HostedCluster e2e-clusters-t8sw8/example-vr6sz to have valid conditions in 25ms hypershift_framework.go:194: skipping postTeardown() hypershift_framework.go:175: skipping teardown, already called --- FAIL: TestKarpenter (1635.11s) }
https://github.com/openshift/hypershift/pull/5404 is in the first payload and looks extremely related.
The bootstrap API server should be terminated only after the API is HA; we should wait for the API to be available on at least 2 master nodes. These are the steps:
We should note the difference between a) the bootstrap node itself existing, and b) API being available on the bootstrap node. Today inside the cluster bootstrap, we remove the bootstrap API (b) as soon as two master nodes appear. This is what happens today on the bootstrap node:
a) create the static assets
b) wait for 2 master nodes to appear
c) remove the kube-apiserver from the bootstrap node
d) mark the bootstrap process as completed
But we might already have a time window where the API is not available (starting from c, and lasting until the API is available on a master node).
cluster bootstrap executable is invoked here:
https://github.com/openshift/installer/blob/c534bb90b780ae488bc6ef7901e0f3f6273e2764/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L541
start --tear-down-early=false --asset-dir=/assets --required-pods="${REQUIRED_PODS}"
Then, cluster bootstrap removes the bootstrap API here: https://github.com/openshift/cluster-bootstrap/blob/bcd73a12a957ce3821bdfc0920751b8e3528dc98/pkg/start/start.go#L203-L209
but the wait for API to be HA is done here: https://github.com/openshift/installer/blob/c534bb90b780ae488bc6ef7901e0f3f6273e2764/data/data/bootstrap/files/usr/local/bin/report-progress.sh#L24
The wait should happen from within cluster-bootstrap; this PR moves the wait so it happens before cluster-bootstrap tears down the bootstrap API/control plane.
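A rough sketch of what the in-process wait could look like (an illustration only, not the actual cluster-bootstrap change): poll the default/kubernetes Endpoints object until the API is served from at least two addresses before tearing anything down.
package bootstrap

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForHAAPI blocks until the apiserver Endpoints object lists at least two
// addresses, i.e. the API is reachable on more than one control-plane host, so
// tearing down the bootstrap kube-apiserver no longer creates an outage window.
func waitForHAAPI(ctx context.Context, client kubernetes.Interface) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			ep, err := client.CoreV1().Endpoints("default").Get(ctx, "kubernetes", metav1.GetOptions{})
			if err != nil {
				return false, nil // transient errors: keep polling
			}
			addresses := 0
			for _, subset := range ep.Subsets {
				addresses += len(subset.Addresses)
			}
			return addresses >= 2, nil
		})
}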
During review of ARO MiWi permissions, some permissions in the MAPI CredentialsRequest for Azure were found to be missing other permissions that are identified through a linked action.
A linked access check is an action performed by Azure Resource Manager during an incoming request. For example, when you issue a create operation to a network interface (Microsoft.Network/networkInterfaces/write) you specify a subnet in the payload. ARM parses the payload, sees you're setting a subnet property, and as a result requires the linked access check Microsoft.Network/virtualNetworks/subnets/join/action on the subnet resource specified in the network interface. If you update a resource but don't include the property in the payload, it will not perform the permission check.
The following permissions were identified as possibly needed in MAPI CredsRequest as they are specified as linked action of one of MAPI's existing permissions
Microsoft.Compute/disks/beginGetAccess/action Microsoft.KeyVault/vaults/deploy/action Microsoft.ManagedIdentity/userAssignedIdentities/assign/action Microsoft.Network/applicationGateways/backendAddressPools/join/action Microsoft.Network/applicationSecurityGroups/joinIpConfiguration/action Microsoft.Network/applicationSecurityGroups/joinNetworkSecurityRule/action Microsoft.Network/ddosProtectionPlans/join/action Microsoft.Network/gatewayLoadBalancerAliases/join/action Microsoft.Network/loadBalancers/backendAddressPools/join/action Microsoft.Network/loadBalancers/frontendIPConfigurations/join/action Microsoft.Network/loadBalancers/inboundNatPools/join/action Microsoft.Network/loadBalancers/inboundNatRules/join/action Microsoft.Network/networkInterfaces/join/action Microsoft.Network/networkSecurityGroups/join/action Microsoft.Network/publicIPAddresses/join/action Microsoft.Network/publicIPPrefixes/join/action Microsoft.Network/virtualNetworks/subnets/join/action
Each permission needs to be validated as to whether it is needed by MAPI through any of its code paths.
Description of problem:
`sts:AssumeRole` is required when creating a Shared-VPC cluster [1]; otherwise the following error occurs: level=fatal msg=failed to fetch Cluster Infrastructure Variables: failed to fetch dependency of "Cluster Infrastructure Variables": failed to generate asset "Platform Provisioning Check": aws.hostedZone: Invalid value: "Z01991651G3UXC4ZFDNDU": unable to retrieve hosted zone: could not get hosted zone: Z01991651G3UXC4ZFDNDU: AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-1c2w7jv2-ef4fe-minimal-perm-installer is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::641733028092:role/ci-op-1c2w7jv2-ef4fe-shared-role level=fatal msg= status code: 403, request id: ab7160fa-ade9-4afe-aacd-782495dc9978 Installer exit with code 1 [1]https://docs.openshift.com/container-platform/4.17/installing/installing_aws/installing-aws-account.html
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-03-174639
How reproducible:
Always
Steps to Reproduce:
1. Create install-config for Shared-VPC cluster 2. Run openshift-install create permissions-policy 3. Create cluster by using the above installer-required policy.
Actual results:
See description
Expected results:
sts:AssumeRole is included in the policy file when Shared VPC is configured.
Additional info:
The configuration of Shared-VPC is like:
platform:
  aws:
    hostedZone:
    hostedZoneRole:
Description of problem:
OCP cluster upgrade is stuck with image registry pod in degraded state. The image registry co shows the below error message. - lastTransitionTime: "2024-09-13T03:15:05Z" message: "Progressing: All registry resources are removed\nNodeCADaemonProgressing: The daemon set node-ca is deployed\nAzurePathFixProgressing: Migration failed: I0912 18:18:02.117077 1 main.go:233] Azure Stack Hub environment variables not present in current environment, skipping setup...\nAzurePathFixProgressing: panic: Get \"https://xxxxximageregistry.blob.core.windows.net/xxxxcontainer?comp=list&prefix=docker&restype=container\": dial tcp: lookup xxxximageregistry.blob.core.windows.net on 192.168.xx.xx. no such host\nAzurePathFixProgressing: \nAzurePathFixProgressing: goroutine 1 [running]:\nAzurePathFixProgressing: main.main()\nAzurePathFixProgressing: \t/go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:53 +0x12a\nAzurePathFixProgressing: " reason: AzurePathFixFailed::Removed status: "False" type: Progressing
Version-Release number of selected component (if applicable):
4.14.33
How reproducible:
Steps to Reproduce:
1. configure azure storage in configs.imageregistry.operator.openshift.io/cluster 2. then mark the managementState as Removed 3. check the operator status
Actual results:
The image-registry CO remains in a degraded state
Expected results:
Operator should not be in degraded state
Additional info:
Description of problem:
We can input an invalid value into the zone field of the GCP providerSpec
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-10-16-094159
How reproducible:
Always
Steps to Reproduce:
1.Edit machineset with invalid zone value , scale machineset
Actual results:
Machineset edited successfully Machines stuck with blank status and do not fail miyadav@miyadav-thinkpadx1carbongen8:~/multifieldsgcp$ oc get machines NAME PHASE TYPE REGION ZONE AGE miyadav-1809g-7bdh4-master-0 Running n2-standard-4 us-central1 us-central1-a 62m miyadav-1809g-7bdh4-master-1 Running n2-standard-4 us-central1 us-central1-b 62m miyadav-1809g-7bdh4-master-2 Running n2-standard-4 us-central1 us-central1-c 62m miyadav-1809g-7bdh4-worker-a-9kmdv Running n2-standard-4 us-central1 us-central1-a 57m miyadav-1809g-7bdh4-worker-b-srj28 Running n2-standard-4 us-central1 us-central1-b 57m miyadav-1809g-7bdh4-worker-c-828v9 Running n2-standard-4 us-central1 us-central1-c 57m miyadav-1809g-7bdh4-worker-f-7d9bx 11m miyadav-1809g-7bdh4-worker-f-bcr7v Running n2-standard-4 us-central1 us-central1-f 20m miyadav-1809g-7bdh4-worker-f-tjfjk 7m3s
Expected results:
The machine status should report a failed phase and the reason, possibly after a timeout, instead of waiting continuously.
Additional info:
logs are present in machine-controller "E1018 03:55:39.735293 1 controller.go:316] miyadav-1809g-7bdh4-worker-f-7d9bx: failed to check if machine exists: unable to verify project/zone exists: openshift-qe/us-central1-in; err: googleapi: Error 400: Invalid value for field 'zone': 'us-central1-in'. Unknown zone., invalid" the machines will be stuck in deletion also because of no status. for Invalid ProjectID - Errors in logs - urce project OPENSHIFT-QE. Details: [ { "@type": "type.googleapis.com/google.rpc.Help", "links": [ { "description": "Google developers console", "url": "https://console.developers.google.com" } ] }, { "@type": "type.googleapis.com/google.rpc.ErrorInfo", "domain": "googleapis.com", "metadatas": { "consumer": "projects/OPENSHIFT-QE", "service": "compute.googleapis.com" }, "reason": "CONSUMER_INVALID" } ] , forbidden E1018 08:59:40.405238 1 controller.go:316] "msg"="Reconciler error" "error"="unable to verify project/zone exists: OPENSHIFT-QE/us-central1-f; err: googleapi: Error 403: Permission denied on resource project OPENSHIFT-QE.\nDetails:\n[\n {\n \"@type\": \"type.googleapis.com/google.rpc.Help\",\n \"links\": [\n {\n \"description\": \"Google developers console\",\n \"url\": \"https://console.developers.google.com\"\n }\n ]\n },\n {\n \"@type\": \"type.googleapis.com/google.rpc.ErrorInfo\",\n \"domain\": \"googleapis.com\",\n \"metadatas\": {\n \"consumer\": \"projects/OPENSHIFT-QE\",\n \"service\": \"compute.googleapis.com\"\n },\n \"reason\": \"CONSUMER_INVALID\"\n }\n]\n, forbidden" "controller"="machine-controller" "name"="miyadav-1809g-7bdh4-worker-f-dcnf5" "namespace"="openshift-machine-api" "object"={"name":"miyadav-1809g-7bdh4-worker-f-dcnf5","namespace":"openshift-machine-api"} "reconcileID"="293f9d09-1387-4702-8b67-2d209316585e"
must-gather- https://drive.google.com/file/d/1N--U8V3EfdEYgQUvK-fcrGxBYRDnzK1G/view?usp=sharing
ProjectID issue must-gather -https://drive.google.com/file/d/1lKNOu4eVmJJbo23gbieD5uVNtw_qF7p6/view?usp=sharing
Description of problem:
When PowerVS deletes a cluster, it does it via pattern matching in the name. Limit the searches by resource group ID to prevent collisions.
Testing OCP Console 4.17.9, the NetworkAttachmentDefinition Creation Button is misspelled as NetworkArrachmentDefinition.
I have attached a picture.
https://issues.redhat.com/secure/attachment/13328391/console_bug.png
Description of problem:
The CUDN creation view doesn't prevent a namespace-selector with no rules
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1.In the UI, go to CUDN creation view, create CDUN with empty namespace-selector. 2. 3.
Actual results:
The CUDN will select all namespaces that exist in the cluster, including openshift-* namespaces, affecting cluster system components including the api-server and etcd.
Expected results:
I expect the UI to block creating a CUDN with a namespace-selector that has zero rules.
Additional info:
Description of problem:
The StaticPodOperatorStatus API validations permit:
- nodeStatuses[].currentRevision can be cleared and can decrease
- more than one entry in nodeStatuses can have a targetRevision > 0
But both of these signal a bug in one or more of the static pod controllers that write to them.
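For illustration, a hedged sketch of the kind of CEL validation markers (controller-gen XValidation syntax) that could plug these holes; the exact rules and field wiring in openshift/api may differ:
package operatorv1

// Sketch only; types are trimmed to the two fields being discussed.
type StaticPodOperatorStatus struct {
	// At most one node may be targeted for a new revision at any time.
	// +kubebuilder:validation:XValidation:rule="size(self.filter(s, s.targetRevision > 0)) <= 1",message="at most one nodeStatus may have a targetRevision set"
	// +optional
	NodeStatuses []NodeStatus `json:"nodeStatuses,omitempty"`
}

type NodeStatus struct {
	// currentRevision must never be cleared or decreased once set.
	// +kubebuilder:validation:XValidation:rule="self >= oldSelf",message="currentRevision must not move backwards"
	// +optional
	CurrentRevision int32 `json:"currentRevision,omitempty"`

	// +optional
	TargetRevision int32 `json:"targetRevision,omitempty"`
}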
Version-Release number of selected component (if applicable):
This has been the case ~forever but we are aware of bugs in 4.18+ that are resulting in controllers trying to make these invalid writes. We also have more expressive validation mechanisms today that make it possible to plug the holes.
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
There are bugs in 4.18+ that result in some static pod node/installer controllers trying to make these invalid write requests.
Expected results:
Validation rules should be added to help catch these invalid writes.
Additional info:
Description of problem:
The UserDefinedNetwork page lists UDN and CUDN objects. UDN is namespace-scoped, CUDN is cluster-scoped. The list view "Namespace" column for CUDN objects presents "All Namespaces", which is confusing, making me think the CUDN selects all namespaces in the cluster.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1. Create CUDN, check "Namespace" column in the UDN page list view 2. 3.
Actual results:
UDN page list view "Namespace" column present "All Namespaces" for CUDN objects
Expected results:
I expect "Namespace" column to not present "All Namespaces" for CUDN objects because its confusing. I think its better for "Namespace" to remain empty for CUDNs objects.
Additional info:
The CUDN spec has a namespace selector controlling which namespaces the CUDN affects; I think this is the source of the confusion. Maybe having the "Namespace" column present "All Namespaces" for cluster-scoped objects makes sense in general, but in this particular case I find it confusing.
Due to fundamental Kubernetes design, all OpenShift Container Platform updates between minor versions must be serialized. You must update from OpenShift Container Platform <4.y> to <4.y+1>, and then to <4.y+2>. You cannot update from OpenShift Container Platform <4.y> to <4.y+2> directly. However, administrators who want to update between two even-numbered minor versions can do so incurring only a single reboot of non-control plane hosts.
We should add a new precondition that enforces that policy, so cluster admins who run --to-image ... don't hop straight from 4.y.z to 4.(y+2).z' or similar without realizing that they were outpacing testing and policy.
The policy and current lack-of guard both date back to all OCP 4 releases, and since they're Kube-side constraints, they may date back to the start of Kube.
Every time.
1. Install a 4.y.z cluster.
2. Use --to-image to request an update to a 4.(y+2).z release.
3. Wait a few minutes for the cluster-version operator to consider the request.
4. Check with oc adm upgrade.
Update accepted.
Update rejected (unless it was forced), complaining about the excessively long hop.
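A hedged sketch of the proposed precondition (names are illustrative, not the actual CVO code): compare the minor versions of the current and requested releases and reject a hop larger than one.
package precondition

import (
	"fmt"
	"strconv"
	"strings"
)

func minor(version string) (int, error) {
	parts := strings.SplitN(version, ".", 3)
	if len(parts) < 2 {
		return 0, fmt.Errorf("cannot parse version %q", version)
	}
	return strconv.Atoi(parts[1])
}

// CheckMinorVersionSkew returns an error when the jump from current to target
// skips a minor version (e.g. 4.16.z -> 4.18.z), matching the documented policy.
func CheckMinorVersionSkew(current, target string) error {
	cur, err := minor(current)
	if err != nil {
		return err
	}
	tgt, err := minor(target)
	if err != nil {
		return err
	}
	if tgt > cur+1 {
		return fmt.Errorf("updating from %s to %s skips a minor version; update to 4.%d first", current, target, cur+1)
	}
	return nil
}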
Description of problem:
Testing PXE boot in an ABI day2 install, the day2 host does not reboot from disk properly but boots from PXE again. This is not reproduced on all hosts and reboots.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-26-075648
How reproducible:
Not always, about 70% in amd64 and 100% in arm64
Steps to Reproduce:
1. Run ABI day1 install booting from pxe 2. After day1 cluster is installed, run ABI day2 install booting from pxe 3. Day2 host hasn't reboot from disk as expected, but reboot from pxe again. From agent.service log, we can see the error: level=info msg=\"SetBootOrder, runtime.GOARCH: amd64, device: /dev/sda\"\ntime=\"2024-11-27T06:48:15Z\" level=info msg=\"Setting efibootmgr to boot from disk\"\ntime=\"2024-11-27T06:48:15Z\" level=error msg=\"failed to find EFI directory\" error=\"failed to mount efi device: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- mount /dev/sda2 /mnt], Error exit status 32, LastOutput \\\"mount: /var/mnt: special device /dev/sda2 does not exist.\\\"\"\ntime=\"2024-11-27T06:48:15Z\" level=warning msg=\"Failed to set boot order\" error=\"failed to mount efi device: failed executing /usr/bin/nsenter [--target 1 --cgroup --mount --ipc --pid -- mount /dev/sda2 /mnt], Error exit status 32
Actual results:
The day2 host doesn't reboot from disk as expected
Expected results:
Day2 host should reboot from disk to complete the installation
Additional info:
When deploying with an endpoint override for the resourceController, the Power VS machine API provider will ignore the override.
Description of problem:
When a new configuration is picked up by the authentication operator, it rolls out a new revision of oauth-server pods. However, since the pods define `terminationGracePeriodSeconds`, the old-revision pods may still be running even after the oauth-server deployment reports that all pods have been updated to the newest revision, which triggers the authentication operator to report Progressing=false. As a result, tests (and possibly any other observers) that expect the newer revision may still be routed to the old pods, causing confusion.
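A hedged sketch of the expected behavior (the label selector and revision label below are hypothetical, not the operator's actual keys): before reporting Progressing=false, verify that every pod in the namespace, including terminating ones, already belongs to the new revision.
package auth

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// onlyNewRevisionPodsRunning returns true only when no old-revision pods remain;
// a terminating old pod still serves traffic until it exits, so it counts as "not done".
func onlyNewRevisionPodsRunning(ctx context.Context, c kubernetes.Interface, wantRevision string) (bool, error) {
	pods, err := c.CoreV1().Pods("openshift-authentication").List(ctx, metav1.ListOptions{
		LabelSelector: "app=oauth-openshift", // hypothetical selector
	})
	if err != nil {
		return false, err
	}
	for _, p := range pods.Items {
		if p.Labels["deployment-revision"] != wantRevision { // hypothetical label
			return false, nil
		}
	}
	return true, nil
}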
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Trigger new oauth-server rollout 2. Observe the authentication operator reporting Progressing while also watching the number of pods in the openshift-authentication namespace
Actual results:
CAO reports Progressing=false even though there are more than the expected number of pods running.
Expected results:
CAO waits to report Progressing=false when only the new revision of pods is running in the openshift-authentication NS.
Additional info:
-
Description of problem:
When doing firmware updates, we saw cases where the update is successful but the newer information wasn't stored in the HFC; the root cause was that Ironic didn't save the newer information in the DB.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
When the CCO updates a CredentialsRequest's status, the current logs are not clear on what's changing:
time="2024-12-05T21:44:49Z" level=info msg="status has changed, updating" controller=credreq cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator secret=openshift-cluster-csi-drivers/ebs-cloud-credentials
We should make it possible to get the CCO to log the diff it's trying to push, even if that requires bumping the operator's log level to debug. That would make it easier to understand hotloops like OCPBUGS-47505.
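A minimal sketch of that kind of logging, assuming go-cmp and klog (not the actual CCO change):
package logging

import (
	"github.com/google/go-cmp/cmp"
	"k8s.io/klog/v2"
)

// logStatusDiff prints the concrete field-level diff at debug verbosity before a
// status update is pushed, so a hotloop shows exactly which field keeps flipping.
func logStatusDiff(oldStatus, newStatus any) {
	if diff := cmp.Diff(oldStatus, newStatus); diff != "" {
		klog.V(4).Infof("status has changed, updating (-old +new):\n%s", diff)
	}
}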
Description of problem:
After creating a pair of self-signed tls cert and private key, then add it into trustde-ca-bundle by using the following cmd: oc patch proxy/cluster \ --type=merge \ --patch='{"spec":{"trustedCA":{"name":"custom-ca"}}}' The insights-runtime-extractor pod will return response with 500 status code, this is the https flow details: * Trying 10.129.2.15:8000... * Connected to exporter.openshift-insights.svc.cluster.local (10.129.2.15) port 8000 (#0) * ALPN, offering h2 * ALPN, offering http/1.1 * CAfile: /var/run/configmaps/service-ca-bundle/service-ca.crt * TLSv1.0 (OUT), TLS header, Certificate Status (22): * TLSv1.3 (OUT), TLS handshake, Client hello (1): * TLSv1.2 (IN), TLS header, Certificate Status (22): * TLSv1.3 (IN), TLS handshake, Server hello (2): * TLSv1.2 (IN), TLS header, Finished (20): * TLSv1.2 (IN), TLS header, Unknown (23): * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): * TLSv1.2 (IN), TLS header, Unknown (23): * TLSv1.3 (IN), TLS handshake, Request CERT (13): * TLSv1.2 (IN), TLS header, Unknown (23): * TLSv1.3 (IN), TLS handshake, Certificate (11): * TLSv1.2 (IN), TLS header, Unknown (23): * TLSv1.3 (IN), TLS handshake, CERT verify (15): * TLSv1.2 (IN), TLS header, Unknown (23): * TLSv1.3 (IN), TLS handshake, Finished (20): * TLSv1.2 (OUT), TLS header, Finished (20): * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): * TLSv1.2 (OUT), TLS header, Unknown (23): * TLSv1.3 (OUT), TLS handshake, Certificate (11): * TLSv1.2 (OUT), TLS header, Unknown (23): * TLSv1.3 (OUT), TLS handshake, Finished (20): * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server accepted to use h2 * Server certificate: * subject: CN=*.exporter.openshift-insights.svc * start date: Jan 2 02:19:07 2025 GMT * expire date: Jan 2 02:19:08 2027 GMT * subjectAltName: host "exporter.openshift-insights.svc.cluster.local" matched cert's "exporter.openshift-insights.svc.cluster.local" * issuer: CN=openshift-service-serving-signer@1735784302 * SSL certificate verify ok. * Using HTTP2, server supports multi-use * Connection state changed (HTTP/2 confirmed) * Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0 * TLSv1.2 (OUT), TLS header, Unknown (23): * TLSv1.2 (OUT), TLS header, Unknown (23): * TLSv1.2 (OUT), TLS header, Unknown (23): * Using Stream ID: 1 (easy handle 0x5577a19094a0) * TLSv1.2 (OUT), TLS header, Unknown (23): > GET /gather_runtime_info HTTP/2 > Host: exporter.openshift-insights.svc.cluster.local:8000 > accept: */* > user-agent: insights-operator/one10time200gather184a34f6a168926d93c330 cluster/_f19625f5-ee5f-40c0-bc49-23a8ba1abe61_ > authorization: Bearer sha256~x9jj_SnjJf6LVlhhWFdUG8UqnPDHzZW0xMYa0WU05Gw > * TLSv1.2 (IN), TLS header, Unknown (23): * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): * TLSv1.2 (IN), TLS header, Unknown (23): * TLSv1.2 (OUT), TLS header, Unknown (23): * TLSv1.2 (IN), TLS header, Unknown (23): * TLSv1.2 (IN), TLS header, Unknown (23): < HTTP/2 500 < content-type: text/plain; charset=utf-8 < date: Thu, 02 Jan 2025 08:18:59 GMT < x-content-type-options: nosniff < content-length: 33 < * TLSv1.2 (IN), TLS header, Unknown (23): stat : no such file or directory
Version-Release number of selected component (if applicable):
4.19
How reproducible:
True
Steps to Reproduce:
1. Create a pair of self-signed tls cert and key 2. Update trusted-ca-bundle by using following cmd: oc patch proxy/cluster \ --type=merge \ --patch='{"spec":{"trustedCA":{"name":"custom-ca"}}}' 3. Pull a request to insights-runtime-extractor pod via the following cmd: curl -v --cacert /var/run/configmaps/trusted-ca-bundle/ca-bundle.crt -H "User-Agent: insights-operator/one10time200gather184a34f6a168926d93c330 cluster/_<cluster_id>_" -H "Authorization: <token>" -H 'Cache-Control: no-cache' https://api.openshift.com/api/accounts_mgmt/v1/certificates
Actual results:
3. The status code of response to this request is 500
Expected results:
3. The status code of response to this request should be 200 and return the runtime info as expected.
Additional info:
Description of problem:
The tests below fail on an ipv6-primary dualstack cluster because the router they deploy is not prepared for dualstack:
[sig-network][Feature:Router][apigroup:image.openshift.io] The HAProxy router should serve a route that points to two services and respect weights [Skipped:Disconnected] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should respond with 503 to unrecognized hosts [Skipped:Disconnected] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router][apigroup:operator.openshift.io] The HAProxy router should serve routes that were created from an ingress [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router][apigroup:route.openshift.io][apigroup:operator.openshift.io] The HAProxy router should support reencrypt to services backed by a serving certificate automatically [Skipped:Disconnected] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host for overridden domains with a custom value [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should override the route host with a custom value [Skipped:Disconnected] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should run even if it has no access to update status [apigroup:image.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router][apigroup:route.openshift.io] The HAProxy router should serve the correct routes when scoped to a single namespace and label set [Skipped:Disconnected] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router][apigroup:route.openshift.io] when FIPS is disabled the HAProxy router should serve routes when configured with a 1024-bit RSA key [Feature:Networking-IPv4] [Suite:openshift/conformance/parallel] [sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route [apigroup:route.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
That is confirmed by accessing to the router pod and checking the connectivity locally:
sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://127.0.0.1/Letter" 200 sh-4.4$ echo $? 0
sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://fd01:0:0:5::551/Letter" 000 sh-4.4$ echo $? 3
sh-4.4$ curl -k -s -m 5 -o /dev/null -w '%{http_code}\n' --header 'Host: FIRST.example.com' "http://[fd01:0:0:5::551]/Letter" 000 sh-4.4$ echo $? 7
The default router deployed in the cluster supports dualstack. Hence it's possible and required to update the router image configuration used in the tests so it can answer on both IPv4 and IPv6.
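A small sketch of the fix direction (illustrative only): build the request URLs with net.JoinHostPort so IPv6 literals get the required brackets and the same test code works for both address families.
package main

import (
	"fmt"
	"net"
)

func main() {
	for _, host := range []string{"127.0.0.1", "fd01:0:0:5::551"} {
		// JoinHostPort yields "127.0.0.1:80" and "[fd01:0:0:5::551]:80".
		fmt.Printf("http://%s/Letter\n", net.JoinHostPort(host, "80"))
	}
}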
Version-Release number of selected component (if applicable): https://github.com/openshift/origin/tree/release-4.15/test/extended/router/
How reproducible: Always.
Steps to Reproduce: Run the tests in ipv6primary dualstack cluster.
Actual results: Tests failing as below:
<*errors.errorString | 0xc001eec080>:
last response from server was not 200:
{
s: "last response from server was not 200:\n",
}
occurred
Ginkgo exit error 1: exit with code 1
Expected results: Test passing
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The helper doesn't have all the namespaces in it, and we're getting some flakes in CI like this:
{{batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources
does not have a cpu request (rule: "batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources/request[cpu]")}}
Description of problem:
Issue can be observed on Hypershift e2e-powervs CI (reference : https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.18-periodics-e2e-powervs-ovn/1879800904798965760) HostedCluster is deployed but still getting incorrect condition for status,"HostedCluster is deploying, upgrading, or reconfiguring)" This is happening because of following issue observed on cluster-version Logs reference : https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.18-periodics-e2e-powervs-ovn/1879800904798965760/artifacts/e2e-powervs-ovn/run-e2e/build-log.txt ``` eventually.go:226: - incorrect condition: wanted Progressing=False, got Progressing=True: Progressing(HostedCluster is deploying, upgrading, or reconfiguring) eventually.go:226: - wanted HostedCluster to desire image registry.build01.ci.openshift.org/ci-op-cxr9zifq/release@sha256:7e40dc5dace8cb816ce91829517309e3609c7f4f6de061bf12a8b21ee97bb713, got registry.build01.ci.openshift.org/ci-op-cxr9zifq/release@sha256:e17cb3eab53be67097dc9866734202cb0f882afc04b2972c02997d9bc1a6e96b eventually.go:103: Failed to get *v1beta1.HostedCluster: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline ```
Note : Issue oberved on 4.19.0 Hypershift e2e-aws-multi CI as well (reference : https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-multi/1880072687829651456)
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
100%
Steps to Reproduce:
1. Create PowerVS hypershift cluster with 4.18.0 release 2. 3.
Actual results:
The HostedCluster gets deployed but still reports an incorrect condition for cluster-version
Expected results:
HostedCluster should get deployed successfully with all conditions met
Additional info:
The issue was first observed on Dec 25, 2024. Currently reproducible on 4.19.0 (reference : https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-multi/1880072687829651456/artifacts/e2e-aws-multi/hypershift-aws-run-e2e-external/build-log.txt)
https://github.com/openshift/oc-mirror/blob/main/v2/docs/operator-filtering-investigation.md#annex---acceptance-criteria-set-for-v2 is obsolete since work on https://github.com/sherine-k/catalog-filter/pull/7 was done as part of https://issues.redhat.com/browse/OCPBUGS-43731 and https://issues.redhat.com/browse/CLID-235
Description of problem:
The OpenShift cluster upgrade from 4.12.10 to 4.12.30 is failing because the pod version-4.12.30-xxx is in CreateContainerConfigError. Also reproduced on 4.14.
Steps to Reproduce:
$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.10 True False 12m Cluster version is 4.12.10
---
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups: []
kind: SecurityContextConstraints
metadata:
  name: scc-hostpath-cnf-cat-1
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- hostPath
- persistentVolumeClaim
- projected
- secret
$ oc adm upgrade --to=4.12.30
$ oc get pod -n openshift-cluster-version
NAME READY STATUS RESTARTS AGE
cluster-version-operator-85db98885c-jt25z 1/1 Running 0 41m
version-4.12.30-vw4pm-l2nng 0/1 Init:CreateContainerConfigError 0 42s
$ oc get events | grep Failed
10s Warning Failed pod/version-4.12.30-p6k4r-nmn6m Error: container has runAsNonRoot and image will run as root (pod: "version-4.12.30-p6k4r-nmn6m_openshift-cluster-version(4d1704d9-ca34-4aa3-86e1-1742e8cead0c)", container: cleanup)
$ oc get pod version-4.12.30-97nbr-88mxp -o yaml |grep scc
openshift.io/scc: scc-hostpath-cnf-cat-1
As a workaround, we can remove the SCC "scc-hostpath-cnf-cat-1" and the version-xxx pod, after which the upgrade worked. The customer has created custom SCCs for use by their applications.
$ oc get pod version-4.12.30-nmskz-d5x2c -o yaml | grep scc
openshift.io/scc: node-exporter
$ oc get pod
NAME READY STATUS RESTARTS AGE
cluster-version-operator-6cb5557f8f-v65vb 1/1 Running 0 54s
version-4.12.30-nmskz-d5x2c 0/1 Completed 0 67s
There's an old bug https://issues.redhat.com/browse/OCPBUGSM-47192 which was fixed by setting readOnlyRootFilesystem to false, but in this case the SCC is still failing.
---
container.SecurityContext = &corev1.SecurityContext{
	Privileged:             pointer.BoolPtr(true),
	ReadOnlyRootFilesystem: pointer.BoolPtr(false),
}
---
Description of problem:
Previously, failed task runs did not emit results; now they do, but the UI still shows "No TaskRun results available due to failure" even though the task run's status contains a result.
Version-Release number of selected component (if applicable):
4.14.3
How reproducible:
Always with a task run producing a result but failing afterwards
Steps to Reproduce:
1. Create the pipelinerun below 2. have a look on its task run
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: hello-pipeline
spec:
  tasks:
  - name: hello
    taskSpec:
      results:
      - name: greeting1
      steps:
      - name: greet
        image: registry.access.redhat.com/ubi8/ubi-minimal
        script: |
          #!/usr/bin/env bash
          set -e
          echo -n "Hello World!" | tee $(results.greeting1.path)
          exit 1
  results:
  - name: greeting2
    value: $(tasks.hello.results.greeting1)
Actual results:
No results in UI
Expected results:
One result should be displayed even though task run failed
Additional info:
Pipelines 1.13.0
Description of problem:
Some permissions are missing when edge zones are specified in the install-config.yaml, probably those related to Carrier Gateways (but maybe more)
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always with minimal permissions
Steps to Reproduce:
1. 2. 3.
Actual results:
time="2024-11-20T22:40:58Z" level=debug msg="\tfailed to describe carrier gateways in vpc \"vpc-0bdb2ab5d111dfe52\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-girt7h2j-4515a-minimal-perm is not authorized to perform: ec2:DescribeCarrierGateways because no identity-based policy allows the ec2:DescribeCarrierGateways action"
Expected results:
All required permissions are listed in pkg/asset/installconfig/aws/permissions.go
Additional info:
See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9222/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1859351015715770368 for a failed min-perms install
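As a hedged illustration (not the actual installer code), the fix likely amounts to adding the carrier-gateway action from the error above, and possibly other edge-zone related EC2 actions, to the permission list in pkg/asset/installconfig/aws/permissions.go, roughly:
// Sketch: variable names are hypothetical and the real permission sets in
// permissions.go are structured differently; only the action string below is
// taken directly from the UnauthorizedOperation error in the actual results.
var edgeZonePermissions = []string{
	"ec2:DescribeCarrierGateways", // needed when Local/Wavelength Zones use carrier gateways
}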
Description of problem:
OWNERS file updated to include prabhakar and Moe as owners and reviewers
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is to facilitate easy backports via automation.
We should be using marketplace images when testing AKS, as that is what will be used in production.
Description of problem:
I get this build warning when building: warning Pattern ["asn1js@latest"] is trying to unpack in the same destination "~/.cache/yarn/v6/npm-asn1js-2.0.26-0a6d435000f556a96c6012969d9704d981b71251-integrity/node_modules/asn1js" as pattern ["asn1js@^2.0.26"]. This could result in non-deterministic behavior, skipping.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. Run ./clean-frontend && ./build-frontend.sh 2. Observe build output 3.
Actual results:
I get the warning
Expected results:
No warning
Additional info:
Description of problem:
Currently both the nodepool controller and capi controller set the updatingConfig condition on nodepool upgrades. We should only use one to set the condition to avoid constant switching between conditions and to ensure the logic used for setting this condition is the same.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
CAPI and Nodepool controller set a different status because their logic is not consistent.
Expected results:
CAPI and Nodepool controller set the same status because their logic is consolidated.
Additional info:
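A minimal sketch of what consolidating the condition into a single owner could look like, assuming metav1.Condition-style conditions (the actual NodePool API types and condition names in HyperShift may differ):
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setUpdatingConfig is called from exactly one controller, so the condition is
// never fought over by two reconcilers with diverging logic.
func setUpdatingConfig(conditions *[]metav1.Condition, updating bool, reason, msg string) {
	status := metav1.ConditionFalse
	if updating {
		status = metav1.ConditionTrue
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "UpdatingConfig",
		Status:  status,
		Reason:  reason,
		Message: msg,
	})
}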
Description of problem:
oc-mirror cannot mirror operator images when bundles are specified in the ImageSetConfiguration.
Version-Release number of selected component (if applicable):
oc-mirror version W1121 06:10:37.581138 159435 mirror.go:102] ⚠️ oc-mirror v1 is deprecated (starting in 4.18 release) and will be removed in a future release - please migrate to oc-mirror --v2WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202411191706.p0.g956fc31.assembly.stream.el9-956fc31", GitCommit:"956fc318cc67769aedb2db8c0c4672bf7ed9f909", GitTreeState:"clean", BuildDate:"2024-11-19T18:08:35Z", GoVersion:"go1.22.7 (Red Hat 1.22.7-1.el9_5) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. mirror the image with bundles : kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.16 packages: - name: cluster-kube-descheduler-operator bundles: - name: clusterkubedescheduleroperator.v5.0.1 - name: clusterkubedescheduleroperator.4.13.0-202309181427 - catalog: registry.redhat.io/redhat/community-operator-index:v4.14 packages: - name: 3scale-community-operator bundles: - name: 3scale-community-operator.v0.11.0 oc-mirror -c /tmp/config-73420.yaml file://out73420 --v2
Actual results:
1. hit error : oc-mirror -c /tmp/ssss.yaml file:///home/cloud-user/outss --v22024/11/21 05:57:40 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/11/21 05:57:40 [INFO] : 👋 Hello, welcome to oc-mirror 2024/11/21 05:57:40 [INFO] : ⚙️ setting up the environment for you... 2024/11/21 05:57:40 [INFO] : 🔀 workflow mode: mirrorToDisk 2024/11/21 05:57:40 [INFO] : 🕵️ going to discover the necessary images... 2024/11/21 05:57:40 [INFO] : 🔍 collecting release images... 2024/11/21 05:57:40 [INFO] : 🔍 collecting operator images... ✓ (43s) Collecting catalog registry.redhat.io/redhat/redhat-operator-index:v4.16 ⠼ (20s) Collecting catalog registry.redhat.io/redhat/community-operator-index:v4.14 ✗ (20s) Collecting catalog registry.redhat.io/redhat/community-operator-index:v4.14 2024/11/21 05:58:44 [ERROR] : filtering on the selected bundles leads to invalidating channel "threescale-2.11" for package "3scale-community-operator": cha ✗ (20s) Collecting catalog registry.redhat.io/redhat/community-operator-index:v4.14
Expected results:
1. no error
Additional info:
no such issue with older version : 4.18.0-ec3
Description of problem:
The customer is trying to install a self-managed OCP cluster on AWS. The customer uses an AWS VPC DHCP option set that has a trailing dot (.) at the end of the domain name. Due to this setting, the master nodes' hostnames also have a trailing dot, which causes the OpenShift installation to fail.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1.Please create a aws vpc with DHCPOptionSet, where DHCPoptionSet has trailing dot at the domain name. 2.Try installation of cluster with IPI.
Actual results:
The OpenShift Installer should be allowed to create AWS master nodes where the domain has a trailing dot (.).
Expected results:
Additional info:
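A minimal sketch (not the installer's actual code) of the kind of normalization that would tolerate a DHCP option set domain ending in a dot: a fully qualified name with a trailing "." refers to the same host as the bare form, so stripping it before the hostname is used avoids the mismatch. The hostname below is a made-up example.
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Hostname as a node might report it when the VPC DHCP option set domain ends with a dot.
	hostname := "ip-10-0-1-23.example.internal."
	// The trailing-dot form and the bare form name the same FQDN; normalize before comparisons.
	normalized := strings.TrimSuffix(hostname, ".")
	fmt.Println(normalized) // ip-10-0-1-23.example.internal
}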
Creating clusters in which machines are created in a public subnet and use a public IP makes it possible to avoid creating NAT gateways (or proxies) for AWS clusters. While not applicable for every test, this configuration will save us money and cloud resources.
Description of problem:
For the same image, v1 creates more than one tag while v2 creates only one; when the delete is executed, some tags created by v1 may remain.
Version-Release number of selected component (if applicable):
./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"v0.2.0-alpha.1-309-g63a5556", GitCommit:"63a5556a", GitTreeState:"clean", BuildDate:"2024-10-23T02:42:55Z", GoVersion:"go1.23.0", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. mirror catalog package for v1: kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 mirror: operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: sandboxed-containers-operator `oc-mirror -c config-catalog-v1.yaml docker://localhost:5000/catalog --dest-use-http` 2. mirror same package for v2 : kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: sandboxed-containers-operator `oc-mirror -c config-catalog-v2.yaml --workspace file://ws docker://localhost:5000/catalog --v2 --dest-tls-verify=false` 3. generate the delete image list with config : kind: DeleteImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 delete: operators: - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: sandboxed-containers-operator `oc-mirror delete -c config-d-catalog-v2.yaml --workspace file://ws docker://localhost:5000/catalog --v2 --dest-tls-verify=false --generate` 4. Execute the delete action: `oc-mirror delete --delete-yaml-file ws/working-dir/delete/delete-images.yaml docker://localhost:5000/catalog --v2 --dest-tls-verify=false `
Actual results:
1. v1 has more than 1 tags: [fedora@preserve-fedora-yinzhou ~]$ curl localhost:5000/v2/catalog/openshift4/ose-kube-rbac-proxy/tags/list |json_reformat % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 81 100 81 0 0 24382 0 --:--:-- --:--:-- --:--:-- 27000 { "name": "catalog/openshift4/ose-kube-rbac-proxy", "tags": [ "cb9a8d8a", "d07492b2" ] } 2. v2 only has 1 tag: [fedora@preserve-fedora-yinzhou ~]$ curl localhost:5000/v2/catalog/openshift4/ose-kube-rbac-proxy/tags/list |json_reformat % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 148 100 148 0 0 60408 0 --:--:-- --:--:-- --:--:-- 74000 { "name": "catalog/openshift4/ose-kube-rbac-proxy", "tags": [ "cb9a8d8a", "f6c37678f1eb3279e603f63d2a821b72394c52d25c2ed5058dc29d4caa15d504", "d07492b2" ] } 4. after delete , we could see still has tags remaining: [fedora@preserve-fedora-yinzhou ~]$ curl localhost:5000/v2/catalog/openshift4/ose-kube-rbac-proxy/tags/list |json_reformat % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 70 100 70 0 0 28056 0 --:--:-- --:--:-- --:--:-- 35000 { "name": "catalog/openshift4/ose-kube-rbac-proxy", "tags": [ "d07492b2" ] }
Expected results:
4. All tags for the same image should be deleted.
Additional info:
Description of problem:
The machine-os-builder deployment manifest does not set the openshift.io/required-scc annotation, which appears to be required for the upgrade conformance suite to pass. The rest of the MCO components currently set this annotation and we can probably use the same setting for the Machine Config Controller (which is restricted-v2). What I'm unsure of is whether this also needs to be set on the builder pods as well and what the appropriate setting would be for that case.
Version-Release number of selected component (if applicable):
How reproducible:
This always occurs in the new CI jobs, e2e-aws-ovn-upgrade-ocb-techpreview and e2e-aws-ovn-upgrade-ocb-conformance-suite-techpreview. Here are two examples from rehearsal failures:
Steps to Reproduce:
Run either of the aforementioned CI jobs.
Actual results:
Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation fails.
Expected results:
Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation should pass.
Additional info:
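A minimal sketch of what setting the annotation on the machine-os-builder deployment's pod template could look like; restricted-v2 is the value the description above suggests for parity with the Machine Config Controller. This is illustrative, not the actual MCO code.
import appsv1 "k8s.io/api/apps/v1"

// setRequiredSCC pins the pods created from this deployment to a specific SCC
// via the annotation the conformance test checks for.
func setRequiredSCC(deploy *appsv1.Deployment) {
	if deploy.Spec.Template.Annotations == nil {
		deploy.Spec.Template.Annotations = map[string]string{}
	}
	deploy.Spec.Template.Annotations["openshift.io/required-scc"] = "restricted-v2"
}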
Description of problem:
Add E2E test cases for PPC related to the PerPodPowerManagement workload hint
Version-Release number of selected component (if applicable):
4.19.0
How reproducible:
Steps to Reproduce:
E2E test cases were missing in PPC for the PerPodPowerManagement workload hint.
Actual results:
Expected results:
Additional info:
Description of problem:
Checked in 4.18.0-0.nightly-2024-12-05-103644 / 4.19.0-0.nightly-2024-12-04-031229: OCPBUGS-34533 is reproduced on 4.18+; there is no such issue with 4.17 and below.
Steps: log in to the admin console or developer console (admin console: go to the "Observe -> Alerting -> Silences" tab; developer console: go to the "Observe -> Silences" tab), create a silence, and edit the "Until" option. Even with a valid timestamp (or an invalid one), the error "[object Object]" appears in the "Until" field. See screen recording: https://drive.google.com/file/d/14JYcNyslSVYP10jFmsTaOvPFZSky1eg_/view?usp=drive_link
Checked that the 4.17 fix for OCPBUGS-34533 is already in the 4.18+ code.
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
always
Steps to Reproduce:
1. see the descriptions
Actual results:
Unable to edit "until" filed in silences
Expected results:
able to edit "until" filed in silences
Description of problem:
On clusters running on OpenShift Virt (Agent Based Install), the `metal3-ramdisk-logs` container is eating up a core, but the logs are empty:
oc adm top pod --sort-by=cpu --sum -n openshift-machine-api --containers POD NAME CPU(cores) MEMORY(bytes) metal3-55c9bc8ff4-nh792 metal3-ramdisk-logs 988m 1Mi metal3-55c9bc8ff4-nh792 metal3-httpd 1m 20Mi metal3-55c9bc8ff4-nh792 metal3-ironic 0m 121Mi cluster-baremetal-operator-5bf8bcbbdd-jvhq7 cluster-baremetal-operator 1m 25Mi
Version-Release number of selected component (if applicable):
4.17.12
How reproducible:
always
Steps to Reproduce:
Cluster is reachable on Red Hat VPN - reach out on slack to get access
Actual results:
logs are empty, but a core is consumed
Expected results:
container should be more or less idle
Additional info:
Description of problem:
Unit tests for openshift/builder permanently failing for v4.18
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Run PR against openshift/builder
Actual results:
Test fails: --- FAIL: TestUnqualifiedClone (0.20s) source_test.go:171: unable to add submodule: "Cloning into '/tmp/test-unqualified335202210/sub'...\nfatal: transport 'file' not allowed\nfatal: clone of 'file:///tmp/test-submodule643317239' into submodule path '/tmp/test-unqualified335202210/sub' failed\n" source_test.go:195: unable to find submodule dir panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference
Expected results:
Tests pass
Additional info:
Example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_builder/401/pull-ci-openshift-builder-master-unit/1853816128913018880
Description of problem:
openshift-install always raises a WARNING when installing a cluster in AWS region us-east-1 with the default configuration (no zones set). ~~~ WARNING failed to find default instance type for worker pool: no instance type found for the zone constraint WARNING failed to find default instance type: no instance type found for the zone constraint ~~~ The zone discovery process lists all zones in the region, then tries to describe instance type offerings across all zones for the list of instance types supported by the installer. The problem is that there is a "dead" zone in this region, us-east-1e (ID use1-az3), which does not support any instance type we support. This leads to creating infra resources in a zone that is not useful, since supported instance types may not be launchable there.
Version-Release number of selected component (if applicable):
* (?)
How reproducible:
always
Steps to Reproduce:
1. create install-config targeting AWS region us-east-1, without setting zones (default) 2. create manifests, or create cluster
Actual results:
~~~ WARNING failed to find default instance type for worker pool: no instance type found for the zone constraint WARNING failed to find default instance type: no instance type found for the zone constraint ~~~ The WARNING is raised; the install does not fail because the fallback instance type is supported across the zones used for the control plane and worker nodes.
Expected results:
No WARNINGS/failures
Additional info:
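For illustration, this is roughly how a zone can be checked for instance type offerings with the AWS SDK (aws-sdk-go v1); a zone like us-east-1e that returns no offerings for any supported type could then be skipped instead of receiving infra resources. The instance type below is an example, and this is a sketch rather than the installer's actual implementation.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	client := ec2.New(sess)
	out, err := client.DescribeInstanceTypeOfferings(&ec2.DescribeInstanceTypeOfferingsInput{
		LocationType: aws.String("availability-zone"),
		Filters: []*ec2.Filter{
			{Name: aws.String("location"), Values: aws.StringSlice([]string{"us-east-1e"})},
			{Name: aws.String("instance-type"), Values: aws.StringSlice([]string{"m6i.xlarge"})},
		},
	})
	if err != nil {
		panic(err)
	}
	// An empty result means this zone cannot run the instance type and should be skipped.
	fmt.Println("offerings in us-east-1e:", len(out.InstanceTypeOfferings))
}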
Description of problem:
Consoleplugin could be enabled repeatedly when it's already enabled.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-14-090045
How reproducible:
Always
Steps to Reproduce:
1. Go to the console operator's 'Console Plugins' tab ("/k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins"), choose one console plugin, enable the console plugin from the modal by clicking the 'Enabled/Disabled' edit button, and try several times even though the plugin has already been enabled. 2. Check the console operator YAML. 3.
Actual results:
1. The console plugin could be enabled repeatedly. 2. The same console plugin is added to the console operator several times. $ oc get consoles.operator.openshift.io cluster -ojson | jq '.spec.plugins' [ "monitoring-plugin", "monitoring-plugin", "networking-console-plugin", "networking-console-plugin" ]
Expected results:
1. It should not be possible to enable a plugin repeatedly. 2. The same console plugin should not be added to the console operator multiple times.
Additional info:
We can even add the same plugin name to the console operator YAML directly, which is not correct.
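A minimal sketch of the deduplication that could be applied (on the console frontend or operator side) before appending to spec.plugins, so that enabling an already-enabled plugin is a no-op; the function name is illustrative:
// enablePlugin returns the plugin list with name present at most once.
func enablePlugin(plugins []string, name string) []string {
	for _, p := range plugins {
		if p == name {
			return plugins // already enabled; do not append a duplicate entry
		}
	}
	return append(plugins, name)
}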
Description of problem:
Starting with OpenShift Container Platform 4.16, it was observed that cluster-network-operator is stuck in a CrashLoopBackOff state because of the error reported below. 2024-09-17T16:32:46.503056041Z I0917 16:32:46.503016 1 controller.go:242] "All workers finished" controller="pod-watcher" 2024-09-17T16:32:46.503056041Z I0917 16:32:46.503045 1 internal.go:526] "Stopping and waiting for caches" 2024-09-17T16:32:46.503209536Z I0917 16:32:46.503189 1 internal.go:530] "Stopping and waiting for webhooks" 2024-09-17T16:32:46.503209536Z I0917 16:32:46.503206 1 internal.go:533] "Stopping and waiting for HTTP servers" 2024-09-17T16:32:46.503217413Z I0917 16:32:46.503212 1 internal.go:537] "Wait completed, proceeding to shutdown the manager" 2024-09-17T16:32:46.503231142Z F0917 16:32:46.503221 1 operator.go:130] Failed to start controller-runtime manager: failed to start metrics server: failed to create listener: listen tcp :8080: bind: address already in use The problem seems to be related to the change done in https://github.com/openshift/cluster-network-operator/pull/2274/commits/acd67b432be4ef2efb470710aebba2e3551bc00d#diff-99c0290799daf9abc6240df64063e20bfaf67b371577b67ac7eec6f4725622ff, where passing BindAddress with 0 (https://github.com/openshift/cluster-network-operator/blob/master/vendor/sigs.k8s.io/controller-runtime/pkg/metrics/server/server.go#L70) was missed, which would have kept the previous functionality. With the current code in place, cluster-network-operator exposes a metrics server on port 8080, which was not the case before and can create conflicts with custom applications. This is especially true in environments where compact OpenShift Container Platform 4 clusters (three-node clusters) are running.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.15 (three-node cluster) and create a service that is listening on HostNetwork with port 8080 2. Update to OpenShift Container Platform 4.16 3. Watch cluster-network-operator being stuck in CrashLoopBackOff state because port 8080 is already bound
Actual results:
2024-09-17T16:32:46.503056041Z I0917 16:32:46.503016 1 controller.go:242] "All workers finished" controller="pod-watcher" 2024-09-17T16:32:46.503056041Z I0917 16:32:46.503045 1 internal.go:526] "Stopping and waiting for caches" 2024-09-17T16:32:46.503209536Z I0917 16:32:46.503189 1 internal.go:530] "Stopping and waiting for webhooks" 2024-09-17T16:32:46.503209536Z I0917 16:32:46.503206 1 internal.go:533] "Stopping and waiting for HTTP servers" 2024-09-17T16:32:46.503217413Z I0917 16:32:46.503212 1 internal.go:537] "Wait completed, proceeding to shutdown the manager" 2024-09-17T16:32:46.503231142Z F0917 16:32:46.503221 1 operator.go:130] Failed to start controller-runtime manager: failed to start metrics server: failed to create listener: listen tcp :8080: bind: address already in use
Expected results:
In the previous version, BindAddress was set to 0 for the metrics server, meaning it would not start or expose anything on port 8080. The same should be done in OpenShift Container Platform 4.16 to keep backward compatibility and prevent port conflicts.
Additional info:
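A minimal sketch of how the manager could be configured to keep the pre-4.16 behaviour, assuming controller-runtime v0.16+ where the metrics settings live in metricsserver.Options; a BindAddress of "0" disables the listener entirely:
import (
	ctrl "sigs.k8s.io/controller-runtime"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Metrics: metricsserver.Options{
			// "0" disables the metrics server, so nothing binds to :8080,
			// matching the behaviour before the linked change.
			BindAddress: "0",
		},
	})
}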
CVO manifests contain some feature-gated ones:
We observed HyperShift CI jobs to fail when adding DevPreview-gated deployment manifests to CVO, which was unexpected. Investigating further, we discovered that HyperShift applies them:
error: error parsing /var/payload/manifests/0000_00_update-status-controller_03_deployment-DevPreviewNoUpgrade.yaml: error converting YAML to JSON: yaml: invalid map key: map[interface {}]interface {}{".ReleaseImage":interface {}(nil)}
But even without these added manifests, this happens for existing ClusterVersion CRD manifests present in the payload:
$ ls -1 manifests/*clusterversions*crd.yaml manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml
In a passing HyperShift CI job, the same log shows that all four manifests are applied instead of just one:
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
4.18
Always
1. inspect the cluster-version-operator-*-bootstrap.log of a HyperShift CI job
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
= all four ClusterVersion CRD manifests are applied
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
= ClusterVersion CRD manifest is applied just once
I'm filing this card so that I can link it to the "easy" fix https://github.com/openshift/hypershift/pull/5093 which is not the perfect fix, but allows us to add featureset-gated manifests to CVO without breaking HyperShift. It is desirable to improve this even further and actually correctly select the manifests to be applied for CVO bootstrap, but that involves non-trivial logic similar to one used by CVO and it seems to be better approached as a feature to be properly assessed and implemented, rather than a bugfix, so I'll file a separate HOSTEDCP card for that.
Description of problem:
The HorizontalNav component of @openshift-console/dynamic-plugin-sdk does not have the customData prop which is available in the console repo. This prop is needed to pass values between tabs on a details page.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Improving tests to remove the issue in the following Helm test case "Helm Release: Perform the helm chart upgrade for already upgraded helm chart (HR-08-TC02)" (37s): {The following error originated from your application code, not from Cypress. It was caused by an unhandled promise rejection. > Cannot read properties of undefined (reading 'repoName') When Cypress detects uncaught errors originating from your application it will automatically fail the current test.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Trying to solve the root issue from this bug: https://issues.redhat.com/browse/OCPBUGS-39199?focusedId=26104570&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26104570
To fix this, we need each of the sync functions to be able to individually clear a CO degrade that it set earlier. Our current flow only clears a CO degrade when all of the sync functions are successful, and that tends to be problematic if the operator gets stuck in one of them. We typically see this for syncRequiredMachineConfigPools, which waits until the master nodes have finished updating during an upgrade.
Description of problem:
the go-to arrow and new doc link icons are no longer aligned with their text
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2024-12-12-144418
How reproducible:
Always
Steps to Reproduce:
1. goes to Home -> Overview page 2. 3.
Actual results:
the go-to arrow and new doc link icons are no longer horizontally aligned with their text
Expected results:
icon and text should be aligned
Additional info:
screenshot https://drive.google.com/file/d/1S61XY-lqmmJgGbwB5hcR2YU_O1JSJPtI/view?usp=drive_link
Description of problem:
When a user runs the oc-mirror delete command with `--force-cache-delete=true` after an (M2D + D2M) run for catalog operators, it only deletes the manifests in the local cache and does not delete the blobs, which is not expected. According to the help information, the blobs for catalog operators should also be deleted: --force-cache-delete Used to force delete the local cache manifests and blobs
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410011141.p0.g227a9c4.assembly.stream.el9-227a9c4", GitCommit:"227a9c499b6fd94e189a71776c83057149ee06c2", GitTreeState:"clean", BuildDate:"2024-10-01T20:07:43Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Using follow imagesetconfig to do mirror2disk+disk2mirror: kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: additionalImages: - name: registry.redhat.io/ubi8/ubi:latest - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27 operators: - catalog: oci:///test/redhat-operator-index packages: - name: aws-load-balancer-operator - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: devworkspace-operator 2. Generate delete file : cat delete.yaml kind: DeleteImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 delete: additionalImages: - name: registry.redhat.io/ubi8/ubi:latest - name: quay.io/openshifttest/hello-openshift@sha256:61b8f5e1a3b5dbd9e2c35fd448dc5106337d7a299873dd3a6f0cd8d4891ecc27 operators: - catalog: oci:///test/redhat-operator-index packages: - name: aws-load-balancer-operator - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.15 packages: - name: devworkspace-operator 3. execute the delete with --force-cache-delete=true `oc-mirror delete --v2 --delete-yaml-file out/working-dir/delete/delete-images.yaml --force-cache-delete=true docker://localhost:5000 --dest-tls-verify=false`
Actual results:
3. Checking the local cache, no blobs were deleted.
Expected results:
3. Not only the manifests for the catalog operators but also the blobs should be deleted.
Additional info:
This error is resolved by using --src-tls-verify=false with the oc-mirror delete --generate command. More details in the Slack thread here: https://redhat-internal.slack.com/archives/C050P27C71S/p1722601331671649?thread_ts=1722597021.825099&cid=C050P27C71S
The output also includes log lines from the registry when --force-cache-delete is true.
Description of problem:
After https://github.com/openshift/api/pull/2076, the validation in the image registry operator for the Power VS platform does not match the API's expectations.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Disable delete button for UDN if the UDN cannot be deleted
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a UDN and then delete it 2. 3.
Actual results:
The UDN is not deleted
Expected results:
Disable the delete button if UDN cannot be removed
Additional info:
Description of problem:
We have recently enabled a few endpoint overrides, but ResourceManager was accidentally excluded.
Description of problem:
OWNERS file updated to include prabhakar and Moe as owners and reviewers
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is to fecilitate easy backport via automation
Description of problem: [EIP UDN Layer 3 pre-merge testing] In SGW and LGW modes, after restarting the ovnkube-node pod of the client host of the local EIP pod, EIP traffic from the remote EIP pod cannot be captured on the egress node's ovs-if-phys0
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. labeled a node to be egress node, created an egressIP object
2. Created a namespace, applied layer3 UDN CRD to it
3. Created two test pods, one local to egress node, the other one is remote to egress node
4. Restarted the ovnkube-node pod of the local EIP pod's client host (or egress node), waited till the ovnkube-node pod recreated and ovnkube-node ds rollout succeeded
5. Curl external from both test pods
Actual results: egressing packets from remote EIP pod can not be captured on egress node ovs-if-phys0 after restarting the ovnkube-node pod of egress node
Expected results: egressing packets from either EIP pod can be captured on egress node ovs-if-phys0 after restarting the ovnkube-node pod of egress node
Additional info:
egressing packets from local EIP pod can not be captured on egress node ovs-if-phys0 after restarting the ovnkube-node pod of egress node
must-gather: https://drive.google.com/file/d/12aonBDHMPsmoGKmM47yGTBFzqyl1IRlx/view?usp=drive_link
Description of problem:
The OpenShift installer fails to create a cluster on an OpenStack Single-stack IPv6 environment - failed to run cluster api system
Version-Release number of selected component (if applicable):
Installer version: openshift-install version openshift-install 4.18.0-rc.3 built from commit 0f87b38910a84cfe3243fb878436bc052afc3187 release image registry.ci.openshift.org/ocp/release@sha256:668c92b06279cb5c7a2a692860b297eeb9013af10d49d2095f2c3fe9ad02baaa WARNING Release Image Architecture not detected. Release Image Architecture is unknown release architecture unknown default architecture amd64
RHOSO version:
[zuul@controller-0 ~]$ oc get openstackversions.core.openstack.org NAME TARGET VERSION AVAILABLE VERSION DEPLOYED VERSION controlplane 18.0.4-trunk-20241112.1 18.0.4-trunk-20241112.1 18.0.4-trunk-20241112.1
How reproducible:
Always
Steps to Reproduce:
1. Prepare openstack infra for openshift installation with Single-stack IPv6 (see the install-config.yaml below) 2. openshift-install create cluster
install-config.yaml:
apiVersion: v1 baseDomain: "shiftstack.local" controlPlane: name: master platform: openstack: type: "master" replicas: 3 compute: - name: worker platform: openstack: type: "worker" replicas: 2 metadata: name: "ostest" networking: clusterNetworks: - cidr: fd01::/48 hostPrefix: 64 machineNetwork: - cidr: "fd2e:6f44:5dd8:c956::/64" serviceNetwork: - fd02::/112 networkType: "OVNKubernetes" platform: openstack: cloud: "shiftstack" region: "regionOne" apiVIPs: ["fd2e:6f44:5dd8:c956::5"] ingressVIPs: ["fd2e:6f44:5dd8:c956::7"] controlPlanePort: fixedIPs: - subnet: name: "subnet-ssipv6" pullSecret: <omitted> sshKey: <omitted>
Actual results:
The openshift-install fails to start the controlplane - kube-apiserver:
INFO Started local control plane with envtest E0109 13:17:36.425059 30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=0 E0109 13:17:38.365005 30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=1 E0109 13:17:40.142385 30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=2 E0109 13:17:41.947245 30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=3 E0109 13:17:43.761197 30979 server.go:328] "unable to start the controlplane" err="timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)" logger="controller-runtime.test-env" tries=4 DEBUG Collecting applied cluster api manifests... ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to run cluster api system: failed to run local control plane: unable to start control plane itself: failed to start the controlplane. retried 5 times: timeout waiting for process kube-apiserver to start successfully (it may have failed to start, or stopped unexpectedly before becoming ready)
Additional info:
After the openshift-install failure, we observe that the kube-apiserver attempts to find service IPv4, even though our environment exclusively supports IPv6:
$ cat ostest/.clusterapi_output/kube-apiserver.log I0109 13:17:36.402549 31041 options.go:228] external host was not specified, using fd01:0:0:3::97 E0109 13:17:36.403397 31041 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\"" I0109 13:17:38.351573 31096 options.go:228] external host was not specified, using fd01:0:0:3::97 E0109 13:17:38.352116 31096 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\"" I0109 13:17:40.129451 31147 options.go:228] external host was not specified, using fd01:0:0:3::97 E0109 13:17:40.130026 31147 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\"" I0109 13:17:41.517490 31203 options.go:228] external host was not specified, using fd01:0:0:3::97 E0109 13:17:41.518118 31203 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\"" I0109 13:17:43.750048 31258 options.go:228] external host was not specified, using fd01:0:0:3::97 E0109 13:17:43.750649 31258 run.go:72] "command failed" err="service IP family \"10.0.0.0/24\" must match public address family \"fd01:0:0:3::97\""
$ ip addr show 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0@if174: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default link/ether 0a:58:19:b4:10:b3 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet6 fd01:0:0:3::97/64 scope global valid_lft forever preferred_lft forever inet6 fe80::858:19ff:feb4:10b3/64 scope link valid_lft forever preferred_lft forever
Description of problem:
The UI to create a new project is asking for a "Subnet", while expecting the CIDR for the subnet
Version-Release number of selected component (if applicable):
4.18.0-rc.4
How reproducible:
100%
Steps to Reproduce:
1. Change to dev console 2. Create new project 3. got to network tab
Actual results:
"Subnet" expects a CIDRO
Expected results:
"Subnet CIDR" expects a CIDR
Additional info:
https://ibb.co/CQX7RTJ
Description of problem:
When listing the UserDefinedNetwork via the UI (or CLI), the MTU is not reported.
This makes the user flow cumbersome, since they won't know which MTU they're using unless they log into the VM, and actually check what's there.
Version-Release number of selected component (if applicable):
4.18.0-rc4
How reproducible:
Always
Steps to Reproduce:
1. Provision a primary UDN (namespace scoped)
2. Read the created UDN data.
Actual results:
The MTU of the network is not available to the user.
Expected results:
The MTU of the network should be available in the UDN contents.
Additional info:
Affected Platforms:
All
Description of problem:
When deleting operator images with v2 that were mirrored by v1 from an OCI catalog, oc-mirror doesn't find the same tags to delete and fails to delete the images.
Version-Release number of selected component (if applicable):
GitCommit:"affa0177"
How reproducible:
always
Steps to Reproduce:
1. Mirror to mirror with v1: ./bin/oc-mirror -c config_logs/bugs.yaml docker://localhost:5000/437311 --dest-skip-tls --dest-use-http kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v1alpha2 mirror: operators: - catalog: oci:///home/skhoury/redhat-index-all targetCatalog: "ocicatalog73452" targetTag: "v16" packages: - name: cluster-kube-descheduler-operator 2. mirror to disk with v2, and almost same ISC (but v2alpha1): ./bin/oc-mirror --v2 -c config_logs/bugs.yaml file:///home/skhoury/43774v2 3. delete with ./bin/oc-mirror delete --generate --delete-v1-images --v2 -c config_logs/bugs.yaml --workspace file:///home/skhoury/43774v2 docker://sherinefedora:5000/437311 kind: DeleteImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 delete: operators: - catalog: oci:///home/skhoury/redhat-index-all targetCatalog: "ocicatalog73452" targetTag: "v16" packages: - name: cluster-kube-descheduler-operator
Actual results:
mapping.txt of v1: registry.redhat.io/openshift-sandboxed-containers/osc-cloud-api-adaptor-webhook-rhel9@sha256:4da2fe27ef0235afcac1a1b5e90522d072426f58c0349702093ea59c40e5ca68=localhost:5000/437311/openshift-sandboxed-containers/osc-cloud-api-adaptor-webhook-rhel9:491be520 registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:cb836456974e510eb4bccbffadbc6d99d5f57c36caec54c767a158ffd8a025d5=localhost:5000/437311/openshift4/ose-kube-rbac-proxy:d07492b2 registry.redhat.io/openshift-sandboxed-containers/osc-operator-bundle@sha256:7465f4e228cfc44a3389f042f7d7b68d75cbb03f2adca1134a7ec417bbd89663=localhost:5000/437311/openshift-sandboxed-containers/osc-operator-bundle:a2f35fa7 registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator@sha256:c6c589d5e47ba9564c66c84fc2bc7e5e046dae1d56a3dc99d7343f01e42e4d31=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator:d7b79dea registry.redhat.io/openshift-sandboxed-containers/osc-operator-bundle@sha256:ff2bb666c2696fed365df55de78141a02e372044647b8031e6d06e7583478af4=localhost:5000/437311/openshift-sandboxed-containers/osc-operator-bundle:695e2e19 registry.redhat.io/openshift-sandboxed-containers/osc-rhel8-operator@sha256:5d2b03721043e5221dfb0cf164cf59eba396ba3aae40a56c53aa3496c625eea0=localhost:5000/437311/openshift-sandboxed-containers/osc-rhel8-operator:204cb113 registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator@sha256:b1e824e126c579db0f56d04c3d1796d82ed033110c6bc923de66d95b67099611=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator:1957f330 registry.redhat.io/openshift-sandboxed-containers/osc-rhel8-operator@sha256:a660f0b54b9139bed9a3aeef3408001c0d50ba60648364a98a09059b466fbcc1=localhost:5000/437311/openshift-sandboxed-containers/osc-rhel8-operator:ab38b9d5 registry.redhat.io/openshift-sandboxed-containers/osc-operator-bundle@sha256:8da62ba1c19c905bc1b87a6233ead475b047a766dc2acb7569149ac5cfe7f0f1=localhost:5000/437311/openshift-sandboxed-containers/osc-operator-bundle:1adce9f registry.redhat.io/redhat/redhat-operator-index:v4.15=localhost:5000/437311/redhat/redhat-operator-index:v4.15 registry.redhat.io/openshift-sandboxed-containers/osc-monitor-rhel9@sha256:03381ad7a468abc1350b229a8a7f9375fcb315e59786fdacac8e5539af4a3cdc=localhost:5000/437311/openshift-sandboxed-containers/osc-monitor-rhel9:53bbc3cb registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-operator-bundle@sha256:2808a0397495982b4ea0001ede078803a043d5c9b0285662b08044fe4c11f243=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-operator-bundle:c30c7861 registry.redhat.io/openshift-sandboxed-containers/osc-podvm-payload-rhel9@sha256:4bca24d469a41be77db7450e02fa01660a14f4c68e829cba4a8ae253d427bbfd=localhost:5000/437311/openshift-sandboxed-containers/osc-podvm-payload-rhel9:d25beb31 registry.redhat.io/openshift-sandboxed-containers/osc-cloud-api-adaptor-rhel9@sha256:7185c1b6658147e2cfbb0326e6b5f59899f14f5de73148ef9a07aa5c7b9ead74=localhost:5000/437311/openshift-sandboxed-containers/osc-cloud-api-adaptor-rhel9:18ba6d86 registry.redhat.io/openshift-sandboxed-containers/osc-rhel8-operator@sha256:8f30a9129d817c3f4e404d2c43fb47e196d8c8da3badba4c48f65d440a4d7584=localhost:5000/437311/openshift-sandboxed-containers/osc-rhel8-operator:17b81cfd 
registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator@sha256:051bd7f1dad8cc3251430fee32184be8d64077aba78580184cef0255d267bdcf=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-rhel8-operator:6a87f996 registry.redhat.io/openshift-sandboxed-containers/osc-rhel9-operator@sha256:3e3b8849f8a0c8cd750815e6bde7eb2006e5a2b4ea898c9d3ea27f2bfed635d9=localhost:5000/437311/openshift-sandboxed-containers/osc-rhel9-operator:4c46a1f7 registry.redhat.io/openshift-sandboxed-containers-tech-preview/osc-operator-bundle@sha256:a91cee14f47824ce49759628d06bf4e48276e67dae00b50123d3233d78531720=localhost:5000/437311/openshift-sandboxed-containers-tech-preview/osc-operator-bundle:d22b8cff registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:7efeeb8b29872a6f0271f651d7ae02c91daea16d853c50e374c310f044d8c76c=localhost:5000/437311/openshift4/ose-kube-rbac-proxy:5574585a registry.redhat.io/openshift-sandboxed-containers/osc-podvm-builder-rhel9@sha256:a4099ea5ad907ad1daee3dc2c9d659b5a751adf2da65f8425212e82577b227e7=localhost:5000/437311/openshift-sandboxed-containers/osc-podvm-builder-rhel9:36a60f3f delete-images.yaml of v2 apiVersion: mirror.openshift.io/v2alpha1 items: - imageName: docker://registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator-bundle@sha256:b473fba287414d3ccb09aaabc64f463af2c912c322ca2c41723020b216d98d14 imageReference: docker://sherinefedora:5000/437311/openshift4/ose-cluster-kube-descheduler-operator-bundle:52836815 type: operatorBundle - imageName: docker://registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator-bundle@sha256:b148d5cf4943d0341781a0f7c6f2a7116d315c617f8beb65c9e7a24ac99304ff imageReference: docker://sherinefedora:5000/437311/openshift4/ose-cluster-kube-descheduler-operator-bundle:bd7c9abe type: operatorBundle - imageName: docker://registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:c7b198e686dc7117994d71027710ebc6ac0bf21afa436a79794d2e64970c8003 imageReference: docker://sherinefedora:5000/437311/openshift4/ose-cluster-kube-descheduler-operator:223f8a32 type: operatorRelatedImage - imageName: docker://registry.redhat.io/openshift4/ose-cluster-kube-descheduler-operator@sha256:ba0b71ff2a30a069b4a8a8f3c1e0898aaadc6db112e4cc12aff7c77ced7a0405 imageReference: docker://sherinefedora:5000/437311/openshift4/ose-cluster-kube-descheduler-operator:b0b2a0ab type: operatorRelatedImage - imageName: docker://registry.redhat.io/openshift4/ose-descheduler@sha256:257b69180cc667f2b8c1ce32c60fcd23a119195ad9ba2fdd6a6155ec5290f8cf imageReference: docker://sherinefedora:5000/437311/openshift4/ose-descheduler:6585e5e1 type: operatorRelatedImage - imageName: docker://registry.redhat.io/openshift4/ose-descheduler@sha256:45dc69ad93ab50bdf9ce1bb79f6d98f849e320db68af30475b10b7f5497a1b13 imageReference: docker://sherinefedora:5000/437311/openshift4/ose-descheduler:7ac5ce2 type: operatorRelatedImage kind: DeleteImageList
Expected results:
same tags found for destination images
Additional info:
Description of problem:
The ingress-to-route controller does not provide any information about failed conversions from Ingress to Route. This is a big issue in environments heavily dependent on Ingress objects. The only way to find out why a route is not created is trial and error, as the only information available is that the route is not created.
Version-Release number of selected component (if applicable):
OCP 4.14
How reproducible:
100%
Steps to Reproduce:
apiVersion: networking.k8s.io/v1 kind: Ingress metadata: annotations: route.openshift.io/termination: passthrough name: hello-openshift-class namespace: test spec: ingressClassName: openshift-default rules: - host: ingress01-rhodain-test01.apps.rhodain03.sbr-virt.gsslab.brq2.redhat.com http: paths: - backend: service: name: myapp02 port: number: 8080 path: / pathType: Prefix tls: - {}
Actual results:
Route is not created and no error is logged
Expected results:
An error is provided in the events or at least in the controller's logs. Events are preferred, as the Ingress objects are mainly created by users without cluster-admin privileges.
Additional info:
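A minimal sketch of the kind of event emission the controller could add on conversion failure, assuming access to a client-go record.EventRecorder and the Ingress object (names are illustrative, not the controller's actual code):
import (
	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/client-go/tools/record"
)

// reportConversionFailure emits a warning event on the Ingress so the namespace
// owner can see why no Route was created without needing cluster-admin access
// to the controller logs.
func reportConversionFailure(recorder record.EventRecorder, ing *networkingv1.Ingress, err error) {
	recorder.Eventf(ing, corev1.EventTypeWarning, "RouteNotCreated",
		"failed to convert Ingress %s/%s to Route: %v", ing.Namespace, ing.Name, err)
}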
The test is failing due to oddness with oc adm logs.
We think it is related to the PodLogsQuery feature that went into Kubernetes 1.32.
Description of problem:
When creating a kubevirt hosted cluster with the following apiserver publishing configuration - service: APIServer servicePublishingStrategy: type: NodePort nodePort: address: my.hostna.me port: 305030 - the following error is shown: "failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address" and the network policies are not properly deployed in the virtual machine namespaces.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1.Create a kubevirt hosted cluster with apiserver nodeport publish with a hostname 2. Wait for hosted cluster creation.
Actual results:
Following error pops up and network policies are not created "failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address"
Expected results:
No error pops ups and network policies are created.
Additional info:
This is where the error originates -> https://github.com/openshift/hypershift/blob/ef8596d4d69a53eb60838ae45ffce2bca0bfa3b2/hypershift-operator/controllers/hostedcluster/network_policies.go#L644 That error is what prevents the network policies from being created.
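For illustration, the family check fails because net.ParseIP returns nil for a hostname; this is a sketch of one way the check could handle a hostname (resolve it, or cover both families) instead of returning an error. It is not the HyperShift reconciler's actual code.
package main

import (
	"fmt"
	"net"
)

// addressFamilies returns the IP families an address belongs to, resolving it
// first when it is a hostname rather than a literal IP.
func addressFamilies(addr string) ([]string, error) {
	if ip := net.ParseIP(addr); ip != nil {
		if ip.To4() != nil {
			return []string{"IPv4"}, nil
		}
		return []string{"IPv6"}, nil
	}
	// Not a literal IP: it is a hostname, so resolve it instead of erroring out.
	ips, err := net.LookupIP(addr)
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	var families []string
	for _, ip := range ips {
		family := "IPv6"
		if ip.To4() != nil {
			family = "IPv4"
		}
		if !seen[family] {
			seen[family] = true
			families = append(families, family)
		}
	}
	return families, nil
}

func main() {
	fmt.Println(addressFamilies("example.com"))
}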
Description of problem:
Initially, the clusters at version 4.16.9 were having issues with reconciling the IDP. The error which was found in Dynatrace was
"error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: Service Unavailable",
Initially it was assumed that the IDP service was unavailable, but the CU confirmed that they also had the GroupSync operator running inside all clusters, which can successfully connect to the customer IDP and sync User + Group information from the IDP into the cluster.
The CU was advised to upgrade to 4.16.18, keeping in mind a few of the other OCPBUGS which were related to the proxy and would be resolved by upgrading to 4.16.15+.
However, after the upgrade the IDP still seems to be failing to apply. It looks like the IDP reconciler isn't considering the additional trust bundle for the customer proxy.
Checking DT Logs, it seems to fail to verify the certificate
"error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority", "error": "failed to update control plane: [failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority, failed to update status: Operation cannot be fulfilled on hostedcontrolplanes.hypershift.openshift.io \"rosa-staging\": the object has been modified; please apply your changes to the latest version and try again]",
Version-Release number of selected component (if applicable):
4.16.18
How reproducible:
Customer has a few clusters deployed and each of them has the same issue.
Steps to Reproduce:
1. Create a HostedCluster with a proxy configuration that specifies an additionalTrustBundle, and an OpenID idp that can be publicly verified (ie. EntraID or Keycloak with LetsEncrypt certs) 2. Wait for the cluster to come up and try to use the IDP 3.
Actual results:
IDP is failing to work for HCP
Expected results:
IDP should be working for the clusters
Additional info:
The issue will happen only if the IDP does not require a custom trust bundle to be verified.
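For illustration, honouring the proxy's additionalTrustBundle when calling the IDP usually means appending it to the system roots used by the HTTP client; the following is a sketch under that assumption, not the HyperShift reconciler's actual code:
import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
)

// idpHTTPClient builds a client that trusts both the system roots and the
// proxy's additional trust bundle, so publicly-signed IDPs and proxy
// re-encrypted connections both verify.
func idpHTTPClient(additionalTrustBundlePEM []byte) *http.Client {
	pool, err := x509.SystemCertPool()
	if err != nil || pool == nil {
		pool = x509.NewCertPool()
	}
	pool.AppendCertsFromPEM(additionalTrustBundlePEM)
	return &http.Client{
		Transport: &http.Transport{
			Proxy:           http.ProxyFromEnvironment,
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}
}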
Description of problem:
While creating a project, moving to the network step, and selecting "Refer an existing ClusterUserDefinedNetwork", the field title "Project name" is not correct; it should be "UserDefinedNetwork name".
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Missing translations for "PodDisruptionBudget violated" string
Code:
"count PodDisruptionBudget violated_one": "count PodDisruptionBudget violated_one", "count PodDisruptionBudget violated_other": "count PodDisruptionBudget violated_other", | |
Code:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
2024/10/08 12:50:50 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/10/08 12:50:50 [INFO] : 👋 Hello, welcome to oc-mirror 2024/10/08 12:50:50 [INFO] : ⚙️ setting up the environment for you... 2024/10/08 12:50:50 [INFO] : 🔀 workflow mode: diskToMirror 2024/10/08 12:52:19 [INFO] : 🕵️ going to discover the necessary images... 2024/10/08 12:52:19 [INFO] : 🔍 collecting release images... 2024/10/08 12:52:19 [ERROR] : [ReleaseImageCollector] open /home/fedora/test-oc-mirror/hold-release/working-dir/release-images/ocp-release/4.14.20-x86_64/release-manifests/image-references: no such file or directory 2024/10/08 12:52:19 [INFO] : 👋 Goodbye, thank you for using oc-mirror 2024/10/08 12:52:19 [ERROR] : [ReleaseImageCollector] open /home/fedora/test-oc-mirror/hold-release/working-dir/release-images/ocp-release/4.14.20-x86_64/release-manifests/image-references: no such file or directory
Version-Release number of selected component (if applicable):
[fedora@preserve-fedora-yinzhou test]$ ./oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.17.0-202409120935.p0.gc912303.assembly.stream.el9-c912303", GitCommit:"c9123030d5df99847cf3779856d90ff83cf64dcb", GitTreeState:"clean", BuildDate:"2024-09-12T09:57:57Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
Always
Steps to Reproduce:
1. Install 4.17 cluster and 4.17 oc-mirror 2. Now use the ImageSetConfig.yaml below and perform mirror2disk using the command below [root@bastion-dsal oc-mirror]# cat imageset.yaml kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: platform: channels: - name: stable-4.14 minVersion: 4.14.20 maxVersion: 4.14.20 shortestPath: true graph: true oc-mirror -c /tmp/imagesetconfig.yaml file:///home/fedora/test-oc-mirror/release-images --v2 3. Now perform disk2mirror using the command below oc-mirror -c /tm/imagesetconfig.yaml --from file:///home/fedora/test-oc-mirror/release-images --v2 --dry-run
Actual results:
When performing disk2mirror errors are seen as below [fedora@preserve-fedora-yinzhou test]$ ./oc-mirror -c /tmp/imageset.yaml --from file:///home/fedora/test-oc-mirror/release-images docker://localhost:5000 --v2 --dest-tls-verify=false --dry-run 2024/10/08 12:50:50 [WARN] : ⚠️ --v2 flag identified, flow redirected to the oc-mirror v2 version. This is Tech Preview, it is still under development and it is not production ready. 2024/10/08 12:50:50 [INFO] : 👋 Hello, welcome to oc-mirror 2024/10/08 12:50:50 [INFO] : ⚙️ setting up the environment for you... 2024/10/08 12:50:50 [INFO] : 🔀 workflow mode: diskToMirror 2024/10/08 12:52:19 [INFO] : 🕵️ going to discover the necessary images... 2024/10/08 12:52:19 [INFO] : 🔍 collecting release images... 2024/10/08 12:52:19 [ERROR] : [ReleaseImageCollector] open /home/fedora/test-oc-mirror/hold-release/working-dir/release-images/ocp-release/4.14.20-x86_64/release-manifests/image-references: no such file or directory 2024/10/08 12:52:19 [INFO] : 👋 Goodbye, thank you for using oc-mirror 2024/10/08 12:52:19 [ERROR] : [ReleaseImageCollector] open /home/fedora/test-oc-mirror/hold-release/working-dir/release-images/ocp-release/4.14.20-x86_64/release-manifests/image-references: no such file or directory
Expected results:
No errors should be seen when performing disk2mirror
Additional info:
If nested paths are not used for the file destination, i.e. using file://test-oc-mirror instead of file:///home/fedora/test-oc-mirror/release-images, things work fine and the error above is not seen.
Description of problem:
Reported by customer IHAC, see https://redhat-internal.slack.com/archives/C02A3BM5DGS/p1736514939074049 "The timeout needs to be increased for Nutanix IPI installations using OpenShift versions >= 4.16, as image creation that takes more than 5 minutes will fail. OpenShift versions 4.11-4.15 work as expected with the OpenShift installer because the image creation timeout is set to a value greater than 5 minutes."
Version-Release number of selected component (if applicable):
How reproducible:
In some slow Prism-Central env. (slow network, etc.)
Steps to Reproduce:
In some slow Prism-Central env. (such as slow network), run the installer (4.16 and later) to create a Nutanix OCP cluster. The installation will fail with timeout when trying to upload the RHCOS image to PC.
Actual results:
The installation failed with timeout when uploading the RHCOS image to PC.
Expected results:
The installation successfully create the OCP cluster.
Additional info:
In some slow Prism-Central env. (such as slow network), run the installer (4.16 and later) to create a Nutanix OCP cluster. The installation will fail with a timeout when trying to upload the RHCOS image to PC.
Description of problem:
"Cannot read properties of undefined (reading 'state')" Error in search tool when filtering Subscriptions while adding new Subscriptions
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. As an Administrator, go to Home -> Search and filter by Subscription component 2. Start creating subscriptions (bulk)
Actual results:
The filtered results view turns into the "Oh no! Something went wrong" view
Expected results:
Get updated results every few seconds
Additional info:
Reloading the view fixes the error.
Stack Trace:
TypeError: Cannot read properties of undefined (reading 'state') at L (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/subscriptions-chunk-89fe3c19814d1f6cdc84.min.js:1:3915) at na (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:58879) at Hs (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:111315) at Sc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98327) at Cc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98255) at _c (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98118) at pc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:95105) at https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44774 at t.unstable_runWithPriority (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:289:3768) at Uo (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44551)
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
The UDN CRD is only installed in the OpenShift cluster when the `NetworkSegmentation` feature gate is enabled.
Hence, the tests must reflect this.
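A minimal sketch of gating the tests on the CRD's presence; the exact CRD name and the use of the apiextensions clientset are assumptions for illustration:
~~~
package udn_test

import (
	"context"
	"testing"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// skipIfUDNCRDMissing skips a test when the UserDefinedNetwork CRD is not installed,
// which is the case when the NetworkSegmentation feature gate is disabled.
// The CRD name below is an assumption for illustration.
func skipIfUDNCRDMissing(t *testing.T, client apiextensionsclient.Interface) {
	t.Helper()
	_, err := client.ApiextensionsV1().CustomResourceDefinitions().Get(
		context.Background(), "userdefinednetworks.k8s.ovn.org", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		t.Skip("UserDefinedNetwork CRD not installed; NetworkSegmentation feature gate is likely disabled")
	}
	if err != nil {
		t.Fatalf("checking for UserDefinedNetwork CRD: %v", err)
	}
}
~~~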
Add a new device for accelerator monitoring in order to start inventory data flowing for the GPUs currently on the cluster.
Note: also notify the Hive team we're doing these bumps.
Refer to https://github.com/okd-project/okd/discussions/2060
Upgrading to scos-release:4.16.0-okd-scos.0 from 4.15.0-0.okd-scos-2024-01-18-223523 got stuck for me on the network-operator rolling out the DaemonSet "/openshift-multus/whereabouts-reconciler".
The whereabouts-reconciler Pod is crashlooping with:
[entrypoint.sh] FATAL ERROR: Unsupported OS ID=centos
Indeed, its image is based on:
cat /etc/os-release:
NAME="CentOS Stream"
VERSION="9"
ID="centos"
...
But entrypoint.sh does not handle the centos ID:
# Collect host OS information
. /etc/os-release
rhelmajor=
Need to update the openshift/api project to contain correct disk size limit
Description of problem:
Instead of the synthesised "operator conditions" test, we should create and schedule real pods via the deployment controller to test kube-apiserver / kube-controller-manager / scheduler functionality.
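A minimal sketch of the kind of end-to-end check described; the deployment name, namespace handling, image, and timeouts are illustrative assumptions:
~~~
package synthetic_test

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// verifyControlPlaneViaDeployment exercises kube-apiserver (object creation),
// kube-controller-manager (ReplicaSet/pod creation) and the scheduler (pod
// scheduling) by rolling out a tiny Deployment and waiting for it to become available.
func verifyControlPlaneViaDeployment(ctx context.Context, client kubernetes.Interface, namespace string) error {
	replicas := int32(1)
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "control-plane-probe"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "control-plane-probe"}},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "control-plane-probe"}},
				Spec: corev1.PodSpec{Containers: []corev1.Container{{
					Name:    "probe",
					Image:   "registry.access.redhat.com/ubi9/ubi-minimal", // illustrative image
					Command: []string{"sleep", "infinity"},
				}}},
			},
		},
	}
	if _, err := client.AppsV1().Deployments(namespace).Create(ctx, deploy, metav1.CreateOptions{}); err != nil {
		return fmt.Errorf("creating probe deployment: %w", err)
	}
	// Wait until the Deployment reports an available replica, i.e. a pod was
	// created by KCM, scheduled, and became ready.
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			d, err := client.AppsV1().Deployments(namespace).Get(ctx, "control-plane-probe", metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			return d.Status.AvailableReplicas >= 1, nil
		})
}
~~~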
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The following Insights APIs use duration attributes:
The kubebuilder validation patterns are defined as
^0|([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
and
^([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
Unfortunately, this is not enough: validation fails when the resource is updated with a value such as "2m0s".
The validation pattern must allow these values.
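A small check that reproduces the mismatch. The first pattern below is the second one quoted above; the "relaxed" pattern is only one possible loosening (an assumption, not necessarily the fix that was merged), shown to illustrate why "2m0s" is currently rejected:
~~~
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Current kubebuilder pattern (as quoted above): every segment must start with 1-9,
	// so the "0s" segment produced by time.Duration.String() round-trips (e.g. "2m0s") is rejected.
	current := regexp.MustCompile(`^([1-9][0-9]*(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$`)
	// One possible relaxed pattern (assumption): allow zero-valued segments such as "0s".
	relaxed := regexp.MustCompile(`^(([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+)$`)

	for _, v := range []string{"2m", "2m0s", "1h30m"} {
		fmt.Printf("%-6s current=%-5v relaxed=%v\n", v, current.MatchString(v), relaxed.MatchString(v))
	}
}
~~~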
Description of problem:
Tracking per-operator fixes for the following related issues in the static pod node, installer, and revision controllers: https://issues.redhat.com/browse/OCPBUGS-45924 https://issues.redhat.com/browse/OCPBUGS-46372 https://issues.redhat.com/browse/OCPBUGS-48276
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Open Questions
We've seen a high rate of failure for this test since last Thursday.
(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:
[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]
Extreme regression detected.
Fisher's Exact probability of a regression: 100.00%.
Test pass rate dropped from 100.00% to 78.33%.
Sample (being evaluated) Release: 4.18
Start Time: 2024-11-04T00:00:00Z
End Time: 2024-11-18T23:59:59Z
Success Rate: 78.33%
Successes: 47
Failures: 13
Flakes: 0
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 100.00%
Successes: 133
Failures: 0
Flakes: 0
View the test details report for additional context.
Description of problem:
ConsolePlugin CRD is missing connect-src and object-src CSP directives, which need to be added to its API and ported into both console-operator and console itself.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Set connect-src and object-src CSP directives in the ConsolePlugin CR. 2. Save changes 3.
Actual results:
The API server will error out with an unknown DirectiveType error.
Expected results:
Added CSP directives should be saved as part of the updated ConsolePlugin CR, and aggregated CSP directives should be set as part of the bridge server response header, containing the added CSP directives
Additional info:
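For illustration, a sketch of the kind of API addition this asks for; the type and constant names are assumptions modeled on the existing CSP directive enumeration, not the merged change:
~~~
package v1

// DirectiveType enumerates the CSP directives a ConsolePlugin may contribute.
// The existing directive names here are assumptions for illustration; ConnectSrc
// and ObjectSrc are the additions requested by this bug.
type DirectiveType string

const (
	DefaultSrc DirectiveType = "DefaultSrc"
	ScriptSrc  DirectiveType = "ScriptSrc"
	StyleSrc   DirectiveType = "StyleSrc"
	ImgSrc     DirectiveType = "ImgSrc"
	FontSrc    DirectiveType = "FontSrc"
	// New directives to be accepted by the API server and aggregated into the
	// bridge's Content-Security-Policy response header.
	ConnectSrc DirectiveType = "ConnectSrc"
	ObjectSrc  DirectiveType = "ObjectSrc"
)
~~~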
The Installer does not seem to be checking the server group quota before installing.
With OSASINFRA-2570, each cluster will need two server groups. Additionally, it might be worth checking that server groups are set to accept at least n members (where n is the number of Compute replicas).
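A hedged sketch of the pre-flight validation this asks for. The quota struct below is hypothetical (it stands in for whatever view of Nova quotas the installer can obtain); only the arithmetic, two server groups per cluster and at least n members per group, comes from the description above:
~~~
package validation

import "fmt"

// ServerGroupQuota is a hypothetical view of the relevant Nova quotas and usage.
// A negative limit means "unlimited" in this sketch.
type ServerGroupQuota struct {
	ServerGroupsLimit  int // maximum number of server groups in the project
	ServerGroupsInUse  int
	MembersPerGroupMax int // maximum members allowed per server group
}

// validateServerGroupQuota checks that the project can hold the two server
// groups a cluster needs and that a group can hold all Compute replicas.
func validateServerGroupQuota(q ServerGroupQuota, computeReplicas int) error {
	const groupsNeeded = 2
	if q.ServerGroupsLimit >= 0 && q.ServerGroupsLimit-q.ServerGroupsInUse < groupsNeeded {
		return fmt.Errorf("server group quota exhausted: need %d free, have %d",
			groupsNeeded, q.ServerGroupsLimit-q.ServerGroupsInUse)
	}
	if q.MembersPerGroupMax >= 0 && q.MembersPerGroupMax < computeReplicas {
		return fmt.Errorf("server group member quota too small: need %d, limit is %d",
			computeReplicas, q.MembersPerGroupMax)
	}
	return nil
}
~~~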
One of our customers observed this issue. To reproduce it, I intentionally increased the overall CPU limits in my test cluster to over 200% and monitored the cluster for more than 2 days. However, I did not see the KubeCPUOvercommit alert, which ideally should trigger after 10 minutes of overcommitment.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2654m (75%) 8450m (241%)
memory 5995Mi (87%) 12264Mi (179%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
In the OCP console, go to Observe -> Alerting -> Alerting rules and select the `KubeCPUOvercommit` alert.
Expression:
sum by (cluster) (namespace_cpu:kube_pod_container_resource_requests:sum{job="kube-state-metrics"})
  - (sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})
     - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}))
  > 0
and
(sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})
  - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}))
  > 0
The monitoring-plugin is still using PatternFly v4; it needs to be upgraded to PatternFly v5. This major version release deprecates components used by the monitoring-plugin, which will need to be replaced or removed to accommodate the version update.
We need to remove the deprecated components from the monitoring plugin, extending the work from CONSOLE-4124
Work to be done:
Description of problem:
4.18: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-etcd-operator-1384-ci-4.18-e2e-gcp-ovn/1877471255607644160
40e0ff5ee27c98d0 = bootstrap
1edfd46c2c62c92c = master-0
a0dabab5ae6f967b = master-1
7a509502206916e3 = master-2
22:20:03 (term 2) 1 bootstrap_teardown_controller.go:144] Removing bootstrap member [40e0ff5ee27c98d0]
22:20:03 (term 2) 1 bootstrap_teardown_controller.go:151] Successfully removed bootstrap member [40e0ff5ee27c98d0]
22:20:04 (term 2) leader lost
22:20:04 (term 3) leader=master-1
{"level":"warn","ts":"2025-01-09T22:20:19.459912Z","logger":"raft","caller":"etcdserver/zap_raft.go:85","msg":"a0dabab5ae6f967b stepped down to follower since quorum is not active"}

4.19: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-upgrade/1877300145469526016
26129ddf465384ed = bootstrap 10.0.0.3
fb4ff9e6f7280fcb = master-0 10.0.0.5
283343bd1d4e0df2 = master-1 10.0.0.6
896ca4df9c7807c1 = master-2 10.0.0.4
bootstrap_teardown_controller.go:144] Removing bootstrap member [26129ddf465384ed]
I0109 10:48:33.201639 1 bootstrap_teardown_controller.go:151] Successfully removed bootstrap member [26129ddf465384ed]
{"level":"info","ts":"2025-01-09T10:48:34.588799Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"896ca4df9c7807c1 became leader at term 3"}
{"level":"warn","ts":"2025-01-09T10:48:51.583385Z","logger":"raft","caller":"etcdserver/zap_raft.go:85","msg":"896ca4df9c7807c1 stepped down to follower since quorum is not active"}
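For context, a minimal sketch of how a voting member is removed with the etcd v3 client, as the bootstrap teardown controller does in the logs above. The endpoint is a placeholder and the member ID is taken from the 4.18 log purely for illustration:
~~~
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.5:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Removing a voting member immediately changes the quorum arithmetic: with the
	// bootstrap member gone, the three remaining masters must keep two of three
	// healthy, which is why the "quorum is not active" warnings above are notable.
	const bootstrapID uint64 = 0x40e0ff5ee27c98d0 // placeholder; from the 4.18 log above
	resp, err := cli.MemberRemove(ctx, bootstrapID)
	if err != nil {
		log.Fatalf("removing bootstrap member: %v", err)
	}
	fmt.Printf("cluster now has %d members\n", len(resp.Members))
}
~~~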
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Use the /livez/ping endpoint to proxy unauthenticated health checks on the master. /livez/ping is a faster and more reliable endpoint, so it should be used over /version.
Version-Release number of selected component (if applicable):
Impacts all releases, fix can be limited to 4.18
How reproducible:
Always
Steps to Reproduce:
1. https://github.com/openshift/hypershift/blob/6356dab0c28b77cca1a74d911f7154f70a3cb68d/hypershift-operator/controllers/nodepool/apiserver-haproxy/haproxy.cfg#L26 2. https://github.com/openshift/hypershift/blob/6356dab0c28b77cca1a74d911f7154f70a3cb68d/hypershift-operator/controllers/nodepool/haproxy.go#L381
Actual results:
/version
Expected results:
/livez/ping
Additional info:
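As additional context, a small sketch that probes both endpoints referenced above so their responses can be compared; the API server URL is a placeholder and TLS verification is skipped only because the sketch has no CA bundle to hand:
~~~
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	const apiServer = "https://api.example.cluster:6443" // placeholder

	client := &http.Client{
		Timeout:   5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	// Compare the two health-check targets: /livez/ping (the proposed target)
	// and /version (the current one).
	for _, path := range []string{"/livez/ping", "/version"} {
		resp, err := client.Get(apiServer + path)
		if err != nil {
			fmt.Printf("%-12s error: %v\n", path, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%-12s %s (%d bytes)\n", path, resp.Status, len(body))
	}
}
~~~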
We should add golangci-lint to our `make verify`. This will catch common golang errors. There are a number of linters we can include if we want - https://golangci-lint.run/usage/linters/.
In addition, we can introduce GCI through golangci-lint so that all imports are sorted in a proper format.
Description of problem:
The openstack-manila-csi-controllerplugin csi-driver container is not functional on the first run; it needs one restart before it works. This causes HCP e2e to fail on the EnsureNoCrashingPods test.
Version-Release number of selected component (if applicable):
4.19, 4.18
How reproducible:
Deploy Shift on Stack with Manila available in the cloud.
Actual results:
The openstack-manila-csi-controllerplugin pod will restart once and then it'll be functional.
Expected results:
No restart should be needed. This is likely an orchestration issue.
The Cluster API provider Azure has a deployment manifest that deploys Azure service operator from mcr.microsoft.com/k8s/azureserviceoperator:v2.6.0 image.
We need to set up OpenShift builds of the operator and update the manifest generator to use the OpenShift image.
Azure has split the API calls out of their provider so that they now use the service operator. We now need to ship the service operator as part of the CAPI operator to make sure that we can support CAPZ.
Description of problem:
We are getting the below error on OCP 4.16 while creating an EgressFirewall with an uppercase dnsName:
~~~
[a-z0-9])?\.?$'
#
~~~
When I check the code
https://github.com/openshift/ovn-kubernetes/blob/release-4.15/go-controller/pkg/crd/egressfirewall/v1/types.go#L80-L82
types.go
~~~
// dnsName is the domain name to allow/deny traffic to. If this is set, cidrSelector and nodeSelector must be unset.
// +kubebuilder:validation:Pattern=^([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$
DNSName string `json:"dnsName,omitempty"`
~~~
https://github.com/openshift/ovn-kubernetes/blob/release-4.16/go-controller/pkg/crd/egressfirewall/v1/types.go#L80-L85
types.go
~~~
// dnsName is the domain name to allow/deny traffic to. If this is set, cidrSelector and nodeSelector must be unset.
// For a wildcard DNS name, the '*' will match only one label. Additionally, only a single '*' can be
// used at the beginning of the wildcard DNS name. For example, '*.example.com' will match 'sub1.example.com'
// but won't match 'sub2.sub1.example.com'.
// +kubebuilder:validation:Pattern=`^(\*\.)?([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$`
DNSName string `json:"dnsName,omitempty"`
~~~
The pattern looks like it supports upper case, yet creating the EgressFirewall with an uppercase dnsName is rejected.
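A quick check against the 4.16 pattern (as restored in the snippet above) suggests the CRD pattern itself does accept upper case; the lowercase-only fragment "[a-z0-9])?" in the truncated error output hints that the rejection comes from a different validation. A sketch, not a root-cause analysis:
~~~
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Pattern from the release-4.16 types.go snippet quoted above.
	pattern := regexp.MustCompile(`^(\*\.)?([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$`)
	for _, name := range []string{"example.com", "EXAMPLE.com", "*.Example.COM"} {
		fmt.Printf("%-15s matches: %v\n", name, pattern.MatchString(name))
	}
}
~~~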
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy the cluster with OCP 4.16.x
2. Create EgressFirewall with the upper case for dnsName
~~~
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
~~~
3.
Actual results:
Expected results:
Additional info:
CMO should create and deploy a ConfigMap that contains the data for accelerator monitoring. When CMO creates the node-exporter DaemonSets, it should mount the ConfigMap into the node-exporter pods.
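A minimal sketch of wiring such a ConfigMap into the node-exporter pod spec; the ConfigMap name, volume name, and mount path are illustrative assumptions, not CMO's actual manifests:
~~~
package manifests

import corev1 "k8s.io/api/core/v1"

// addAcceleratorsConfigMount mounts a ConfigMap carrying the accelerators
// monitoring data into a pod spec as a read-only volume. The ConfigMap name
// and mount path below are assumptions for illustration.
func addAcceleratorsConfigMount(spec *corev1.PodSpec, containerName string) {
	const (
		volumeName    = "accelerators-config"
		configMapName = "node-exporter-accelerators" // assumption
		mountPath     = "/etc/node-exporter/accelerators"
	)
	spec.Volumes = append(spec.Volumes, corev1.Volume{
		Name: volumeName,
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: configMapName},
			},
		},
	})
	for i := range spec.Containers {
		if spec.Containers[i].Name != containerName {
			continue
		}
		spec.Containers[i].VolumeMounts = append(spec.Containers[i].VolumeMounts, corev1.VolumeMount{
			Name:      volumeName,
			MountPath: mountPath,
			ReadOnly:  true,
		})
	}
}
~~~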
Description of problem:
Tracking per-operator fixes for the following related issues in the static pod node, installer, and revision controllers: https://issues.redhat.com/browse/OCPBUGS-45924 https://issues.redhat.com/browse/OCPBUGS-46372 https://issues.redhat.com/browse/OCPBUGS-48276
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info: