Jump to: Complete Features | Incomplete Features | Incomplete Epics | Other Complete | Other Incomplete
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Goal:
Track Insights Operator Data Enhancements epic in 2024
Description of problem:
Context: OpenShift Logging is migrating from Elasticsearch to Loki. While the option to use Loki has existed for quite a while, the information about the end of Elasticsearch support has not been available until recently. With that information available now, we can expect more and more customers to migrate and hit the issue described in INSIGHTOCP-1927. P.S. Note the bar chart in INSIGHTOCP-1927, which shows how frequently the related KCS is linked in customer cases.
Data to gather: LokiStack custom resources (any name, any namespace).
Backports: The option to use Loki has been available since Logging 5.5, whose compatibility started at OCP 4.9. Considering the OCP life cycle, backports to up to OCP 4.14 would be nice.
Unknowns: Since Logging 5.7, Logging supports installation of multiple instances in customer namespaces. The Insights Operator would have to look for the CRs in all namespaces, which poses the following questions: What is the expected number of LokiStack CRs in a cluster? Should the Insights Operator look for the resource in all namespaces? Is there a way to narrow down the scope? The CR will contain the name of a customer namespace, which is sensitive information. What is the API group of the CR? Is there a risk of LokiStack CRs in customer namespaces that would NOT be related to OpenShift Logging?
SME: Oscar Arribas Arribas
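A minimal sketch of what a cross-namespace gatherer could look like, assuming the CR turns out to live under the loki.grafana.com/v1 API group (the API group is one of the open questions above); namespace and resource names would still need anonymization before upload:

// Hypothetical gatherer sketch: list LokiStack custom resources in all
// namespaces with a dynamic client. The GVR is an assumption; the epic
// still lists the API group as an open question.
package gather

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// lokiStackGVR is assumed; confirm the group/version before implementing.
var lokiStackGVR = schema.GroupVersionResource{
	Group:    "loki.grafana.com",
	Version:  "v1",
	Resource: "lokistacks",
}

// gatherLokiStacks returns all LokiStack CRs in the cluster. Namespace and
// resource names are customer-sensitive and would need to be anonymized
// before being added to the Insights archive.
func gatherLokiStacks(ctx context.Context, client dynamic.Interface) ([]unstructured.Unstructured, error) {
	list, err := client.Resource(lokiStackGVR).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	return list.Items, nil
}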
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
N/A
Actual results:
Expected results:
Additional info:
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Provide a simple way to get a VM-friendly networking setup, without having to configure the underlying physical network.
Primary user-defined networks can be managed from the UI and the user flow is seamless.
Placeholder feature for ccx-ocp-core maintenance tasks.
This epic tracks "business as usual" requirements / enhancements / bug fixing of the Insights Operator.
It looks like the insights-operator doesn't work with IPv6, there are log errors like this:
E1209 12:20:27.648684 37952 run.go:72] "command failed" err="failed to run groups: failed to listen on secure address: listen tcp: address fd01:0:0:5::6:8000: too many colons in address"
It's showing up in metal techpreview jobs.
The URL isn't being constructed correctly; use net.JoinHostPort instead of fmt.Sprintf. Some more details here: https://github.com/stbenjam/no-sprintf-host-port. There's a non-default linter in golangci-lint for this.
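A minimal before/after sketch of the fix the linter enforces (values are illustrative):

// Building a listen address with fmt.Sprintf breaks for IPv6, because the
// host must be bracketed: [fd01:0:0:5::6]:8000.
package main

import (
	"fmt"
	"net"
)

func main() {
	host, port := "fd01:0:0:5::6", "8000"

	// Broken: produces "fd01:0:0:5::6:8000", which net.Listen rejects
	// with "too many colons in address".
	bad := fmt.Sprintf("%s:%s", host, port)

	// Correct: net.JoinHostPort brackets IPv6 hosts automatically,
	// producing "[fd01:0:0:5::6]:8000".
	good := net.JoinHostPort(host, port)

	fmt.Println(bad, good)
}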
Component Readiness has found a potential regression in the following test:
[sig-architecture] platform pods in ns/openshift-insights should not exit an excessive amount of times
Test has a 56.36% pass rate, but 95.00% is required.
Sample (being evaluated) Release: 4.18
Start Time: 2024-12-02T00:00:00Z
End Time: 2024-12-09T16:00:00Z
Success Rate: 56.36%
Successes: 31
Failures: 24
Flakes: 0
View the test details report for additional context.
The admin console's alert rule details page is provided by https://github.com/openshift/monitoring-plugin, but the dev console's equivalent page is still provided by code in the console codebase.
That dev console page is loaded from monitoring-plugin and the code for the page is removed from the console codebase.
Ensure removal of deprecated patternfly components from kebab-dropdown.tsx and alerting.tsx once this story and OU-257 are completed.
In order to allow customers and internal teams to see dashboards created using Perses, we must add them as new elements on the current dashboard list
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation, where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
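For illustration only, the core of such a capability is small: a periodic dial from a source component to a target endpoint, with the result recorded somewhere queryable. A bare-bones sketch (the names are hypothetical; a real design would also need to cover DNS, TLS, and reporting):

// Hypothetical sketch of a point-to-point connectivity probe: dial a target
// endpoint once and record whether it was reachable and how long it took.
package connectivity

import (
	"context"
	"net"
	"time"
)

// CheckResult captures one probe attempt from this component to a target.
type CheckResult struct {
	Target    string
	Success   bool
	Latency   time.Duration
	Error     string
	Timestamp time.Time
}

// CheckTCP dials the target ("host:port") once and reports the outcome.
func CheckTCP(ctx context.Context, target string) CheckResult {
	start := time.Now()
	var d net.Dialer
	result := CheckResult{Target: target, Timestamp: start}
	conn, err := d.DialContext(ctx, "tcp", target)
	result.Latency = time.Since(start)
	if err != nil {
		result.Error = err.Error()
		return result
	}
	conn.Close()
	result.Success = true
	return result
}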
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
Add a NID alias to OWNERS_ALIASES, update the OWNERS file in test/extended/router, and add an OWNERS file to test/extended/dns.
As a cluster-admin, I want to run updates in discrete steps: update control plane and worker nodes independently.
I also want to back up and restore in case of a problematic upgrade.
Background:
This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking the epics required to get that work done. Below is the list of done tasks.
These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:
Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".
Definition of done:
This feature aims to enable customers of OCP to integrate 3rd party KMS solutions for encrypting etcd values at rest in accordance with:
https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/
Scenario:
For an OCP cluster with external KMS enabled:
How do the above scenario(s) impact the cluster? The API may be unavailable
Goal:
Investigation Steps:
Detection:
Actuation:
User stories that might result in KCS:
Plugins research:
POCs:
Acceptance Criteria:
We did something similar for the aesgcm encryption type in https://github.com/openshift/api/pull/1413/.
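For orientation, the aesgcm change was essentially a new allowed value for the APIServer config's encryption type; a KMS option would presumably follow the same pattern. A rough sketch of what such an addition could look like (the KMS constant, its value, and the validation marker are assumptions, not the final API):

// Sketch of extending the EncryptionType enum, mirroring how aesgcm was
// added in openshift/api#1413. The "KMS" value and its comment are
// illustrative; the real change would go through normal API review.
package v1

// EncryptionType selects how etcd values are encrypted at rest.
// +kubebuilder:validation:Enum="";identity;aescbc;aesgcm;KMS
type EncryptionType string

const (
	// identity: resources are written as-is, without encryption.
	EncryptionTypeIdentity EncryptionType = "identity"

	// aescbc: AES-CBC with PKCS#7 padding and a locally managed key.
	EncryptionTypeAESCBC EncryptionType = "aescbc"

	// aesgcm: AES-GCM with a locally managed, periodically rotated key.
	EncryptionTypeAESGCM EncryptionType = "aesgcm"

	// KMS (assumed name): envelope encryption with keys managed by an
	// external KMS provider, per the upstream KMS provider contract.
	EncryptionTypeKMS EncryptionType = "KMS"
)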
Volume Group Snapshots is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.
This is also a key requirement for backup and DR solutions.
https://kubernetes.io/blog/2023/05/08/kubernetes-1-27-volume-group-snapshot-alpha/
https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/3476-volume-group-snapshot
Productise the volume group snapshots feature as tech preview: have docs and testing, as well as a feature gate to enable it, so that customers and partners can test it in advance.
The feature should be graduated to beta upstream to become TP in OCP. Tests and CI must pass, and a feature gate should allow customers and partners to easily enable it. We should identify all OCP-shipped CSI drivers that support this feature and configure them accordingly.
CSI drivers development/support of this feature.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Drivers must support this feature and enable it. Partners may need to change their operator and/or doc to support it.
Document how to enable the feature, what this feature does and how to use it. Update the OCP driver's table to include this capability.
Can be leveraged by ODF and OCP virt, especially around backup and DR scenarios.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Add Volume Group Snapshots as Tech Preview. This is a key new Kubernetes storage feature that allows multiple PVs to be grouped together and snapshotted at the same time. This enables customers to take consistent snapshots of applications that span multiple PVs.
We will rely on the newly beta promoted feature. This feature is driver dependent.
This will need a new external-snapshotter rebase plus removal of the feature-gate check in csi-snapshot-controller-operator. Clusters that are freshly installed or upgraded from an older release will have the group snapshot v1beta1 API enabled, plus support for it enabled in the snapshot-controller (and ship the corresponding external-snapshotter sidecar). A sketch of how the new API is consumed follows below.
No opt-in, no opt-out.
OCP itself will not ship any CSI driver that supports it.
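Creating a group snapshot with the new API is driven by a label selector over PVCs. The sketch below uses the dynamic client (the group/version, resource, and field names reflect the upstream beta API as we understand it and should be re-checked against the rebase; the class name is an assumption):

// Sketch: create a VolumeGroupSnapshot that groups every PVC labeled
// group=db in namespace "demo". GVR and field names follow the upstream
// v1beta1 group snapshot API as we understand it.
package snapshots

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var vgsGVR = schema.GroupVersionResource{
	Group:    "groupsnapshot.storage.k8s.io",
	Version:  "v1beta1",
	Resource: "volumegroupsnapshots",
}

func createGroupSnapshot(ctx context.Context, client dynamic.Interface) error {
	vgs := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "groupsnapshot.storage.k8s.io/v1beta1",
		"kind":       "VolumeGroupSnapshot",
		"metadata":   map[string]interface{}{"name": "db-group-snap", "namespace": "demo"},
		"spec": map[string]interface{}{
			"volumeGroupSnapshotClassName": "csi-group-snapclass", // assumed class name
			"source": map[string]interface{}{
				"selector": map[string]interface{}{
					"matchLabels": map[string]interface{}{"group": "db"},
				},
			},
		},
	}}
	_, err := client.Resource(vgsGVR).Namespace("demo").Create(ctx, vgs, metav1.CreateOptions{})
	return err
}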
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
This is also a key requirement for backup and DR solutions, especially for OCP virt.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
External snapshotter rebase to the upstream version that includes the beta API.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Since we don't ship any driver with OCP that supports the feature, we need to have testing with ODF.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
We're looking at enabling it by default, which could introduce risk. Since the feature has recently landed upstream, we will need to rebase on a newer external-snapshotter than we initially targeted.
When moving to v1 there may be non-backward-compatible changes.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We need to rebase against a version of snapshotter which has all the latest changes.
crun has been GA as a non-default runtime since OCP 4.14. We want to make it the default in 4.18 while still supporting runc as a non-default option.
The benefits of crun are covered here: https://github.com/containers/crun
FAQ.: https://docs.google.com/document/d/1N7tik4HXTKsXS-tMhvnmagvw6TE44iNccQGfbL_-eXw/edit
Note: making crun the default does not mean we will remove support for runc, nor do we have any plans to do so in the foreseeable future.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
This is Image mode on OpenShift. It uses the rpm-ostree native containers interface and not bootc but that is an implementation detail.
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
As a cluster admin of user provided infrastructure,
when I apply the machine config that opts a pool into On Cluster Layering,
I want to also be able to remove that config and have the pool revert back to its non-layered state with the previously applied config.
As a cluster admin using on cluster layering,
when an image build has failed,
I want it to retry 3 times automatically without my intervention and show me where to find the log of the failure.
As a cluster admin,
when I enable On Cluster Layering,
I want to know that the builder image I am building with is stable and will not change unless I change it
so that I keep the same API promises as we do elsewhere in the platform.
To test:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster and the Cluster Version Operator is not available,
I want the upgrade operation to be blocked.
As a cluster admin,
when I use a disconnected environment,
I want to still be able to use On Cluster Layering.
As a cluster admin using On Cluster layering,
When there has been config drift of any sort that degrades a node and I have resolved the issue,
I want it to resync without forcing a reboot.
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and references an internal registry
I want that registry available on the host network so that the pool can successfully scale up
(MCO-770, MCO-578, MCO-574 )
As a cluster admin using on cluster layering,
when a pool is using on cluster layering and I want to scale up nodes,
the nodes should have the same config as the other nodes in the pool.
Maybe:
Entitlements: MCO-1097, MCO-1099
Not Likely:
As a cluster admin using on cluster layering,
when I try to upgrade my cluster,
I want the upgrade operation to succeed at the same rate as non-OCL upgrades do.
With OCL GA'ing soon, we'll need a blocking path within our e2e test suite that must pass before a PR can be merged. Since e2e-gcp-op-techpreview is a non-blocking job, we should do both of the following:
As a cluster admin for standalone OpenShift, I want to customize the prefix of the machine names created by CPMS due to company policies related to nomenclature. Implement the Control Plane Machine Set (CPMS) feature in OpenShift to support machine names where users can set custom name prefixes. Note the prefix will always be suffixed by "<5-chars>-<index>" as this is part of the CPMS internal design.
A new field called machineNamePrefix has been added to CPMS CR.
This field would allow the customer to specify a custom prefix for the machine names.
The machine names would then be generated using the format: <machineNamePrefix>-<5-chars>-<index>
Where:
<machineNamePrefix> is the custom prefix provided by the customer
<5-chars> is a random 5 character string (this is required and cannot be changed)
<index> represents the index of the machine (0, 1, 2, etc.)
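For illustration, the resulting name shape is sketched below (the helper is hypothetical, not the actual CPMS controller code):

// Hypothetical illustration of the machine name shape with machineNamePrefix
// set: <machineNamePrefix>-<5-chars>-<index>.
package main

import (
	"fmt"

	utilrand "k8s.io/apimachinery/pkg/util/rand"
)

func machineName(prefix string, index int) string {
	// The random 5-character portion is always present and cannot be
	// disabled; only the prefix is user-controlled.
	return fmt.Sprintf("%s-%s-%d", prefix, utilrand.String(5), index)
}

func main() {
	// e.g. "prod-ctrl-x7k2q-0"
	fmt.Println(machineName("prod-ctrl", 0))
}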
Ensure that if the machineNamePrefix is changed, the operator reconciles and succeeds in rolling out the changes.
Bump openshift/api to vendor machineNamePrefix field and CPMSMachineNamePrefix feature-gate into cpms-operator
Edge customers requiring computing on-site to serve business applications (e.g., point of sale, security & control applications, AI inference) are asking for a 2-node HA solution for their environments. They want only two nodes at the edge, because a third node adds too much cost, but they still need HA for critical workloads. To address this need, a 2+1 topology is introduced. It supports a small, cheap arbiter node that can optionally be remote/virtual to reduce on-site HW cost.
Support OpenShift on a 2+1 topology, meaning two primary nodes with large capacity to run workloads and the control plane, and a third small "arbiter" node which ensures quorum. See requirements for more details.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | self-managed |
Classic (standalone cluster) | yes |
Hosted control planes | no |
Multi node, Compact (three node), or Single node (SNO), or all | Multi node and Compact (three node) |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64 and ARM |
Operator compatibility | full |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | no |
Other (please specify) | n/a |
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
See requirements - there are two main groups of customers: co-located arbiter node, and remote arbiter node.
Once the HighlyAvailableArbiter has been added to the ocp/api, we need to update the cluster-config-operator dependencies to reference the new change, so that it propagates to cluster installs in our payloads.
Update the dependencies for CEO for library-go and ocp/api to support the Arbiter additions, doing this in a separate PR to keep things clean and easier to test.
We need to update CEO (cluster etcd operator) to understand what an arbiter/witness node is so it can properly assign an etcd member on our less powerful node.
We need to add the `HighlyAvailableArbiter` value to the controlPlaneTopology in ocp/api package as a TechPreview
https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go#L95-L103
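A rough sketch of what the addition could look like alongside the existing TopologyMode constants (the exact constant name, value, and gating are assumptions pending API review):

// Sketch of adding an arbiter topology value next to the existing
// TopologyMode constants in config/v1/types_infrastructure.go. Names and
// values are illustrative, not the merged API.
package v1

// TopologyMode describes the expected replica layout of control-plane
// and infrastructure components (existing type in openshift/api).
type TopologyMode string

const (
	// HighlyAvailableTopologyMode: operators expect multiple replicas.
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"

	// SingleReplicaTopologyMode: operators run a single replica (SNO).
	SingleReplicaTopologyMode TopologyMode = "SingleReplica"

	// ExternalTopologyMode: control-plane components run outside the cluster.
	ExternalTopologyMode TopologyMode = "External"

	// HighlyAvailableArbiterMode (assumed name): two full control-plane
	// nodes plus a smaller arbiter node that only maintains quorum.
	// Gated behind TechPreview initially.
	HighlyAvailableArbiterMode TopologyMode = "HighlyAvailableArbiter"
)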
Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.
When defining zones for vSphere, administrators can map regions to vSphere datacenters and zones to vSphere clusters.
There are use cases where vSphere environments have only one cluster construct with all their ESXi hosts, but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.
In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.
Support mapping OpenShift zones to vSphere host groups, in addition to vSphere clusters.
When defining zones for vSphere, administrators can map regions to vSphere datacenters and zones to vSphere clusters.
There are use cases where vSphere environments have only one cluster construct with all their ESXi hosts, but the administrators want to divide the ESXi hosts into host groups. A common example is vSphere stretched clusters, where there is only one logical vSphere cluster but the ESXi nodes are distributed across two physical sites and grouped by site in vSphere host groups.
In order for OpenShift to be able to distribute its nodes on vSphere matching the physical grouping of hosts, OpenShift zones have to be able to map to vSphere host groups too.
As an OpenShift engineer, enable host VM group zonal support in the machine-api-operator (MAO) so that compute nodes are properly deployed.
Acceptance Criteria:
Description of problem:
When we set multiple networks on LRP:
port rtoe-GR_227br_tenant.red_ovn-control-plane mac: "02:42:ac:12:00:07" ipv6-lla: "fe80::42:acff:fe12:7" networks: ["169.254.0.15/17", "172.18.0.7/16", "fc00:f853:ccd:e793::7/64", "fd69::f/112"]
and also use lb_force_snat_ip=routerip, it picks the lexicographically first item from the set of networks; there is no defined ordering for this.
This breaks the Services implementation on L2 UDNs.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
We want to do Network Policies, not MultiNetwork Policies.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
PopupKebabMenu is orphaned and contains a reference to `@patternfly/react-core/deprecated`. It and related code should be removed so we can drop PF4 and adopt PF6.
https://github.com/openshift/console/blob/master/frontend/packages/console-shared/src/components/actions/menu/ActionMenuItem.tsx#L3 contains a reference to `@patternfly/react-core/deprecated`. In order to drop PF4 and adopt PF6, this reference needs to be removed.
This component was never finished and should be removed as it includes a reference to `@patternfly/react-core/deprecated`, which blocks the removal of PF4 and the adoption of PF6.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
As a developer, I do not want to maintain the code for a project that is already dead.
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
K8s 1.31 introduces VolumeAttributesClass as beta (code in external provisioner). We should make it available to customers as tech preview.
VolumeAttributesClass allows PVCs to be modified after their creation and while attached. There is a vast number of parameters that can be updated, but the most popular is to change the QoS values. Parameters that can be changed depend on the driver used.
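For illustration, a VolumeAttributesClass pairs a CSI driver with the set of mutable parameters, and an existing PVC opts in by setting spec.volumeAttributesClassName to the class name. A minimal sketch with the beta typed API (the driver name and parameter keys are driver-specific examples, not recommendations):

// Sketch: define a VolumeAttributesClass using the storage/v1beta1 types.
// Valid parameters depend entirely on the CSI driver in use.
package storage

import (
	storagev1beta1 "k8s.io/api/storage/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func goldClass() *storagev1beta1.VolumeAttributesClass {
	return &storagev1beta1.VolumeAttributesClass{
		ObjectMeta: metav1.ObjectMeta{Name: "gold"},
		DriverName: "ebs.csi.aws.com", // example driver
		Parameters: map[string]string{ // driver-specific QoS knobs
			"iops":       "16000",
			"throughput": "500",
		},
	}
}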
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Productise VolumeAttributesClass as TP in anticipation of GA. Customers can start testing VolumeAttributesClass.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | both |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | N/A core storage |
Backport needed (list applicable versions) | None |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD for TP |
Other (please specify) | n/A |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
As an OCP user, I want to change parameters of my existing PVC such as the QoS attributes.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
UI for TP
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
There have been some limitations and complaints about the fact that PVC attributes are sealed after creation, preventing customers from updating them. This is particularly impactful when a specific QoS is set and the volume requirements change.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Customers should not use it in production at the moment.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Document VolumeAttributesClass creation and how to update a PVC. Mention any limitations. Mention that it's tech preview (no upgrade). Add driver support information if needed.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Check which drivers support it for which parameters.
Support upstream feature "VolumeAttributesClass" in OCP as Beta, i.e. test it and have docs for it.
A common concern with dealing with escalations/incidents in Managed OpenShift Hosted Control Planes is the resolution time incurred when the fix needs to be delivered in a component of the solution that ships within the OpenShift release payload. This is because OpenShift's release payloads:
This feature seeks to provide mechanisms that put an upper time bound on delivering such fixes, to match the current HyperShift Operator <24h expectation.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed (ROSA and ARO) |
Classic (standalone cluster) | No |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All supported ROSA/HCP topologies |
Connected / Restricted Network | All supported ROSA/HCP topologies |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All supported ROSA/HCP topologies |
Operator compatibility | CPO and Operators depending on it |
Backport needed (list applicable versions) | TBD |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | No |
Other (please specify) | No |
Discussed previously during incident calls. Design discussion document
SOP needs to be defined for:
The default on-PR and on-push Tekton pipeline files that Konflux generates always build. We can be more efficient with resources.
It is necessary to get the builds off of main for CPO overrides.
The OpenShift IPsec implementation will be enhanced for a growing set of enterprise use cases, and for larger scale deployments.
The OpenShift IPsec implementation was originally built for purpose-driven use cases from telco NEPs, but has also been useful for a specific set of other customer use cases outside of that context. As customer adoption grew and it was adopted by some of the largest (by number of cluster nodes) deployments in the field, it became obvious that some redesign is necessary in order to continue to deliver enterprise-grade IPsec, for both East-West and North-South traffic, and for some of our most demanding customer deployments.
Key enhancements include observability and blocking traffic across paths if IPsec encryption is not functioning properly.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
The OpenShift IPsec feature is fundamental to customer deployments for ensuring that all traffic between cluster nodes (East-West) and between cluster nodes and external-to-the-cluster entities that also are configured for IPsec (North-South) is encrypted by default. This encryption must scale to the largest of deployments.
Questions to be addressed:
While running IPsec e2e tests in CI, the data plane traffic is not flowing with the desired traffic type (ESP or UDP). For example, with IPsec mode External, the traffic type is seen as ESP for EW traffic, but it's supposed to be Geneve (UDP) traffic.
This issue was reproducible on a local cluster after many attempts, and we noticed IPsec states are not cleaned up on the node, which is residue from a previous test run with IPsec Full mode.
[peri@sdn-09 origin]$ kubectl get networks.operator.openshift.io cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
creationTimestamp: "2024-05-13T18:55:57Z"
generation: 1362
name: cluster
resourceVersion: "593827"
uid: 10f804c9-da46-41ee-91d5-37aff920bee4
spec:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
defaultNetwork:
ovnKubernetesConfig:
egressIPConfig: {}
gatewayConfig:
ipv4: {}
ipv6: {}
routingViaHost: false
genevePort: 6081
ipsecConfig:
mode: External
mtu: 1400
policyAuditConfig:
destination: "null"
maxFileSize: 50
maxLogFiles: 5
rateLimit: 20
syslogFacility: local0
type: OVNKubernetes
deployKubeProxy: false
disableMultiNetwork: false
disableNetworkDiagnostics: false
logLevel: Normal
managementState: Managed
observedConfig: null
operatorLogLevel: Normal
serviceNetwork:
- 172.30.0.0/16
unsupportedConfigOverrides: null
useMultiNetworkPolicy: false
status:
conditions:
- lastTransitionTime: "2024-05-13T18:55:57Z"
status: "False"
type: ManagementStateDegraded
- lastTransitionTime: "2024-05-14T10:13:12Z"
status: "False"
type: Degraded
- lastTransitionTime: "2024-05-13T18:55:57Z"
status: "True"
type: Upgradeable
- lastTransitionTime: "2024-05-14T11:50:26Z"
status: "False"
type: Progressing
- lastTransitionTime: "2024-05-13T18:57:13Z"
status: "True"
type: Available
readyReplicas: 0
version: 4.16.0-0.nightly-2024-05-08-222442
[peri@sdn-09 origin]$ oc debug node/worker-0
Starting pod/worker-0-debug-k6nlm ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.23
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# toolbox
Checking if there is a newer version of registry.redhat.io/rhel9/support-tools available...
Container 'toolbox-root' already exists. Trying to start...
(To remove the container and start with a fresh toolbox, run: sudo podman rm 'toolbox-root')
toolbox-root
Container started successfully. To exit, type 'exit'.
[root@worker-0 /]# tcpdump -i enp2s0 -c 1 -v --direction=out esp and src 192.168.111.23 and dst 192.168.111.24
dropped privs to tcpdump
tcpdump: listening on enp2s0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:07:01.854214 IP (tos 0x0, ttl 64, id 20451, offset 0, flags [DF], proto ESP (50), length 152)
worker-0 > worker-1: ESP(spi=0x52cc9c8d,seq=0xe1c5c), length 132
1 packet captured
6 packets received by filter
0 packets dropped by kernel
[root@worker-0 /]# exit
exit
sh-5.1# ipsec whack --trafficstatus
006 #20: "ovn-1184d9-0-in-1", type=ESP, add_time=1715687134, inBytes=206148172, outBytes=0, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #19: "ovn-1184d9-0-out-1", type=ESP, add_time=1715687112, inBytes=0, outBytes=40269835, maxBytes=2^63B, id='@1184d960-3211-45c4-a482-d7b6fe995446'
006 #27: "ovn-185198-0-in-1", type=ESP, add_time=1715687419, inBytes=71406656, outBytes=0, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #26: "ovn-185198-0-out-1", type=ESP, add_time=1715687401, inBytes=0, outBytes=17201159, maxBytes=2^63B, id='@185198f6-7dde-4e9b-b2aa-52439d2beef5'
006 #14: "ovn-922aca-0-in-1", type=ESP, add_time=1715687004, inBytes=116384250, outBytes=0, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #13: "ovn-922aca-0-out-1", type=ESP, add_time=1715686986, inBytes=0, outBytes=986900228, maxBytes=2^63B, id='@922aca42-b893-496e-bb9b-0310884f4cc1'
006 #6: "ovn-f72f26-0-in-1", type=ESP, add_time=1715686855, inBytes=115781441, outBytes=98, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
006 #5: "ovn-f72f26-0-out-1", type=ESP, add_time=1715686833, inBytes=9320, outBytes=29002449, maxBytes=2^63B, id='@f72f2622-e7dc-414e-8369-6013752ea15b'
sh-5.1# ip xfrm state; echo ' '; ip xfrm policy
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0x7f7ddcf5 reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6158d9a0f4a28598500e15f81a40ef715502b37ecf979feb11bbc488479c8804598011ee 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x18564, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0xda57e42e reqid 16413 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x810bebecef77951ae8bb9a46cf53a348a24266df8b57bf2c88d4f23244eb3875e88cc796 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081
src 192.168.111.21 dst 192.168.111.23
proto esp spi 0xf84f2fcf reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x0f242efb072699a0f061d4c941d1bb9d4eb7357b136db85a0165c3b3979e27b00ff20ac7 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081
src 192.168.111.23 dst 192.168.111.21
proto esp spi 0x9523c6ca reqid 16417 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xe075d39b6e53c033f5225f8be48efe537c3ba605cee2f5f5f3bb1cf16b6c53182ecf35f7 128
lastused 2024-05-14 16:07:11
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x10fb2
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0x459d8516 reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xee778e6db2ce83fa24da3b18e028451bbfcf4259513bca21db832c3023e238a6b55fdacc 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x3ec45, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x3142f53a reqid 16397 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x6238fea6dffdd36cbb909f6aab48425ba6e38f9d32edfa0c1e0fc6af8d4e3a5c11b5dfd1 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081
src 192.168.111.20 dst 192.168.111.23
proto esp spi 0xeda1ccb9 reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xef84a90993bd71df9c97db940803ad31c6f7d2e72a367a1ec55b4798879818a6341c38b6 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081
src 192.168.111.23 dst 192.168.111.20
proto esp spi 0x02c3c0dd reqid 16401 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x858ab7326e54b6d888825118724de5f0c0ad772be2b39133c272920c2cceb2f716d02754 128
lastused 2024-05-14 16:07:13
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x26f8e
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0xc9535b47 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xd7a83ff4bd6e7704562c597810d509c3cdd4e208daabf2ec074d109748fd1647ab2eff9d 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x53d4c, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0xb66203c8 reqid 16405 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xc207001a7f1ed7f114b3e327308ddbddc36de5272a11fe0661d03eaecc84b6761c7ec9c4 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081
src 192.168.111.24 dst 192.168.111.23
proto esp spi 0x2e4d4deb reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x91e399d83aa1c2626424b502d4b8dae07d4a170f7ef39f8d1baca8e92b8a1dee210e2502 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081
src 192.168.111.23 dst 192.168.111.24
proto esp spi 0x52cc9c8d reqid 16409 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0xb605451f32f5dd7a113cae16e6f1509270c286d67265da2ad14634abccf6c90f907e5c00 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0xe2735
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0x973119c3 reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x87d13e67b948454671fb8463ec0cd4d9c38e5e2dd7f97cbb8f88b50d4965fb1f21b36199 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x2af9a, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
ffffffff ffffffff ffffffff ffffffff
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0x4c3580ff reqid 16389 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x2c09750f51e86d60647a60e15606f8b312036639f8de2d7e49e733cda105b920baade029 128
lastused 2024-05-14 14:36:43
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081
src 192.168.111.22 dst 192.168.111.23
proto esp spi 0xa3e469dc reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x1d5c5c232e6fd4b72f3dad68e8a4d523cbd297f463c53602fad429d12c0211d97ae26f47 128
lastused 2024-05-14 14:18:42
anti-replay esn context:
seq-hi 0x0, seq 0xb, oseq-hi 0x0, oseq 0x0
replay_window 128, bitmap-length 4
00000000 00000000 00000000 000007ff
sel src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081
src 192.168.111.23 dst 192.168.111.22
proto esp spi 0xdee8476f reqid 16393 mode transport
replay-window 0 flag esn
aead rfc4106(gcm(aes)) 0x5895025ce5b192a7854091841c73c8e29e7e302f61becfa3feb44d071ac5c64ce54f5083 128
lastused 2024-05-14 16:07:14
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x1f1a3
replay_window 128, bitmap-length 4
00000000 00000000 00000000 00000000
sel src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp sport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp dport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16413 mode transport
src 192.168.111.23/32 dst 192.168.111.21/32 proto udp dport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.21/32 dst 192.168.111.23/32 proto udp sport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16417 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp sport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp dport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16397 mode transport
src 192.168.111.23/32 dst 192.168.111.20/32 proto udp dport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.20/32 dst 192.168.111.23/32 proto udp sport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16401 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp sport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp dport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16405 mode transport
src 192.168.111.23/32 dst 192.168.111.24/32 proto udp dport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.24/32 dst 192.168.111.23/32 proto udp sport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16409 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp sport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp dport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16389 mode transport
src 192.168.111.23/32 dst 192.168.111.22/32 proto udp dport 6081
dir out priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src 192.168.111.22/32 dst 192.168.111.23/32 proto udp sport 6081
dir in priority 1360065 ptype main
tmpl src 0.0.0.0 dst 0.0.0.0
proto esp reqid 16393 mode transport
src ::/0 dst ::/0
socket out priority 0 ptype main
src ::/0 dst ::/0
socket in priority 0 ptype main
src ::/0 dst ::/0
socket out priority 0 ptype main
src ::/0 dst ::/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket out priority 0 ptype main
src 0.0.0.0/0 dst 0.0.0.0/0
socket in priority 0 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 135
dir out priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 135
dir fwd priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 135
dir in priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 136
dir out priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 136
dir fwd priority 1 ptype main
src ::/0 dst ::/0 proto ipv6-icmp type 136
dir in priority 1 ptype main
sh-5.1# cat /etc/ipsec.conf
# /etc/ipsec.conf - Libreswan 4.0 configuration file
#
# see 'man ipsec.conf' and 'man pluto' for more information
#
# For example configurations and documentation, see https://libreswan.org/wiki/
config setup
# If logfile= is unset, syslog is used to send log messages too.
# Note that on busy VPN servers, the amount of logging can trigger
# syslogd (or journald) to rate limit messages.
#logfile=/var/log/pluto.log
#
# Debugging should only be used to find bugs, not configuration issues!
# "base" regular debug, "tmi" is excessive and "private" will log
# sensitive key material (not available in FIPS mode). The "cpu-usage"
# value logs timing information and should not be used with other
# debug options as it will defeat getting accurate timing information.
# Default is "none"
# plutodebug="base"
# plutodebug="tmi"
#plutodebug="none"
#
# Some machines use a DNS resolver on localhost with broken DNSSEC
# support. This can be tested using the command:
# dig +dnssec DNSnameOfRemoteServer
# If that fails but omitting '+dnssec' works, the system's resolver is
# broken and you might need to disable DNSSEC.
# dnssec-enable=no
#
# To enable IKE and IPsec over TCP for VPN server. Requires at least
# Linux 5.7 kernel or a kernel with TCP backport (like RHEL8 4.18.0-291)
# listen-tcp=yes
# To enable IKE and IPsec over TCP for VPN client, also specify
# tcp-remote-port=4500 in the client's conn section.
# if it exists, include system wide crypto-policy defaults
include /etc/crypto-policies/back-ends/libreswan.config
# It is best to add your IPsec connections as separate files
# in /etc/ipsec.d/
include /etc/ipsec.d/*.conf
sh-5.1# cat /etc/ipsec.d/openshift.conf
# Generated by ovs-monitor-ipsec...do not modify by hand!
config setup
uniqueids=yes
conn %default
keyingtries=%forever
type=transport
auto=route
ike=aes_gcm256-sha2_256
esp=aes_gcm256
ikev2=insist
conn ovn-f72f26-0-in-1
left=192.168.111.23
right=192.168.111.22
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp/6081
rightprotoport=udp
conn ovn-f72f26-0-out-1
left=192.168.111.23
right=192.168.111.22
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@f72f2622-e7dc-414e-8369-6013752ea15b
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp
rightprotoport=udp/6081
conn ovn-1184d9-0-in-1
left=192.168.111.23
right=192.168.111.20
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@1184d960-3211-45c4-a482-d7b6fe995446
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp/6081
rightprotoport=udp
conn ovn-1184d9-0-out-1
left=192.168.111.23
right=192.168.111.20
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@1184d960-3211-45c4-a482-d7b6fe995446
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp
rightprotoport=udp/6081
conn ovn-922aca-0-in-1
left=192.168.111.23
right=192.168.111.24
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp/6081
rightprotoport=udp
conn ovn-922aca-0-out-1
left=192.168.111.23
right=192.168.111.24
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@922aca42-b893-496e-bb9b-0310884f4cc1
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp
rightprotoport=udp/6081
conn ovn-185198-0-in-1
left=192.168.111.23
right=192.168.111.21
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp/6081
rightprotoport=udp
conn ovn-185198-0-out-1
left=192.168.111.23
right=192.168.111.21
leftid=@cf36db5c-5c54-4329-9141-b83679b18ecc
rightid=@185198f6-7dde-4e9b-b2aa-52439d2beef5
leftcert="ovs_certkey_cf36db5c-5c54-4329-9141-b83679b18ecc"
leftrsasigkey=%cert
rightca=%same
leftprotoport=udp
rightprotoport=udp/6081
sh-5.1#
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
<your text here>
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
link back to OCPSTRAT-1644 somehow
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Epic Goal*
Drive the technical part of the Kubernetes 1.32 upgrade, including rebasing openshift/kubernetes repository and coordination across OpenShift organization to get e2e tests green for the OCP release.
Why is this important? (mandatory)
OpenShift 4.19 cannot be released without Kubernetes 1.32
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Slack Discussion Channel - https://redhat.enterprise.slack.com/archives/C07V32J0YKF
As a customer of self-managed OpenShift or an SRE managing a fleet of OpenShift clusters, I should be able to determine the progress and state of an OCP upgrade and only be alerted if the cluster is unable to progress. Support a CLI status command and a status API which cluster admins can use to monitor progress. The status command/API should also contain data to alert users about potential issues which can make updates problematic.
Here are common update improvements from customer interactions on Update experience
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Reference : https://docs.google.com/presentation/d/1cwxN30uno_9O8RayvcIAe8Owlds-5Wjr970sDxn7hPg/edit#slide=id.g2a2b8de8edb_0_22
Epic Goal*
Add a new `oc adm upgrade status` command which is backed by an API. Please find the mock output of the command attached in this card.
Why is this important? (mandatory)
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
After OTA-960 is fixed, ClusterVersion/version and oc adm upgrade can be used to monitor the process of migrating a cluster to multi-arch.
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

Upgradeable=False

  Reason: PoolUpdating
  Message: Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.18 (available channels: candidate-4.18)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
But oc adm upgrade status reports COMPLETION 100% while the migration/upgrade is still ongoing.
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Completed
Target Version:  4.18.0-ec.3 (from 4.18.0-ec.3)
Completion:      100% (33 operators updated, 0 updating, 0 waiting)
Duration:        15m
Operator Status: 33 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-95-224.us-east-2.compute.internal   Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-33-81.us-east-2.compute.internal    Completed     Updated   4.18.0-ec.3   -
ip-10-0-45-170.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Completed    100%         3 Total, 2 Available, 0 Progressing, 0 Outdated, 0 Draining, 0 Excluded, 0 Degraded

Worker Pool Nodes: worker
NAME                                        ASSESSMENT    PHASE     VERSION       EST   MESSAGE
ip-10-0-72-40.us-east-2.compute.internal    Unavailable   Updated   4.18.0-ec.3   -     Node is unavailable
ip-10-0-17-117.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -
ip-10-0-22-179.us-east-2.compute.internal   Completed     Updated   4.18.0-ec.3   -

= Update Health =
SINCE   LEVEL     IMPACT         MESSAGE
-       Warning   Update Speed   Node ip-10-0-95-224.us-east-2.compute.internal is unavailable
-       Warning   Update Speed   Node ip-10-0-72-40.us-east-2.compute.internal is unavailable

Run with --details=health for additional description and links to related online documentation

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.18.0-ec.3   True        True          14m     Working towards 4.18.0-ec.3: 761 of 890 done (85% complete), waiting on machine-config

$ oc get co machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.18.0-ec.3   True        True          False      63m     Working towards 4.18.0-ec.3
The reason is that PROGRESSING=True is not detected for co/machine-config: the status command checks only ClusterOperator.Status.Versions[name=="operator"], and it needs to check ClusterOperator.Status.Versions[name=="operator-image"] as well.
For grooming:
It will be challenging for the status command to check the operator image's pull spec because it does not know the expected value. CVO knows it because CVO holds the manifests (containing the expected value) from the multi-arch payload.
One "hacky" workaround is that the status command gets the pull spec from the MCO deployment:
oc get deployment -n openshift-machine-config-operator machine-config-operator -o json | jq -r '.spec.template.spec.containers[]|select(.name=="machine-config-operator")|.image'
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:787a505ca594b0a727549353c503dec9233a9d3c2dcd6b64e3de5f998892a1d5
Note this co/machine-config -> deployment/machine-config-operator trick may not be feasible if we want to extend it to all cluster operators. But it should work as a hacky workaround to check only MCO.
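A minimal shell sketch of the check described above, assuming jq is available; it compares both version entries reported by co/machine-config with the image the MCO deployment actually runs:
# List the "operator" and "operator-image" entries from ClusterOperator.Status.Versions
oc get clusteroperator machine-config -o json \
  | jq -r '.status.versions[] | select(.name=="operator" or .name=="operator-image") | "\(.name)=\(.version)"'
# Pull spec the MCO deployment is running (the "hacky" source of the expected value)
oc get deployment -n openshift-machine-config-operator machine-config-operator -o json \
  | jq -r '.spec.template.spec.containers[] | select(.name=="machine-config-operator") | .image'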
We may claim that the status command is not designed for monitoring the multi-arch migration and suggest using oc adm upgrade instead. In that case, we can close this card as Obsolete/Won'tDo.
manifests.zip has the mockData/manifests for the status cmd that were taken during the migration.
oc#1920 started the work for the status command to recognize the migration; we need to extend that work to cover the comments from Petr's review:
We need to maintain our dependencies across all the libraries we use in order to stay in compliance.
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
<your text here>
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | |
Classic (standalone cluster) | |
Hosted control planes | |
Multi node, Compact (three node), or Single node (SNO), or all | |
Connected / Restricted Network | |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | |
Operator compatibility | |
Backport needed (list applicable versions) | |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | |
Other (please specify) |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
<your text here>
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
Currently the console is using TypeScript 4, which is preventing us from upgrading to NodeJS 22. Because of that we need to update to TypeScript 5 (not necessarily the latest version).
AC:
Note: In case of higher complexity we should split the story into multiple stories, one per console package.
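A minimal sketch of the upgrade step for this story, assuming yarn is used in the console frontend; the exact TypeScript 5 minor is still to be chosen, so the range below is a placeholder:
yarn add --dev typescript@^5
# run the compiler in type-check-only mode to surface breakages per package
yarn tsc --noEmit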
As a developer I want to make sure we are running the latest version of webpack in order to take advantage of the latest benefits and also keep current so that future updates are as painless as possible.
We are currently on v4.47.0.
Changelog: https://webpack.js.org/blog/2020-10-10-webpack-5-release/
By updating to version 5 we will need to update the following packages as well:
AC: Update webpack to version 5 and determine what should be the ideal minor version.
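A hedged sketch of the bump itself, assuming yarn; the caret range is a placeholder until the ideal minor version is determined, and the related packages listed above would be upgraded alongside it:
yarn add --dev webpack@^5
# then re-run the existing build to surface loader/plugin incompatibilities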
The ability in OpenShift to create trust and directly consume access tokens issued by external OIDC Authentication Providers using an authentication approach similar to upstream Kubernetes.
BYO Identity will help facilitate CLI only workflows and capabilities of the Authentication Provider (such as Keycloak, Dex, Azure AD) similar to upstream Kubernetes.
Ability in OpenShift to provide a direct, pluggable Authentication workflow such that the OpenShift/K8s API server can consume access tokens issued by external OIDC identity providers. Kubernetes provides this integration as described here. Customer/Users can then configure their IDPs to support the OIDC protocols and workflows they desire such as Client credential flow.
OpenShift OAuth server is still available as default option, with the ability to tune in the external OIDC provider as a Day-2 configuration.
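For illustration only, a rough sketch of what tuning in an external OIDC provider as a Day-2 change could look like, pieced together from the field names that appear later in this document (oidcProviders, issuerURL, oidcClients, componentName, clientID); treat the exact schema, provider name, and placeholder values as assumptions rather than the authoritative API:
oc patch authentication.config/cluster --type=merge -p='
spec:
  type: OIDC
  oidcProviders:
  - name: microsoft-entra-id
    issuer:
      issuerURL: https://login.microsoftonline.com/<tenant-id>/v2.0
      audiences:
      - <client-id>
    oidcClients:
    - componentName: console
      componentNamespace: openshift-console
      clientID: <client-id>
'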
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal
The ability to provide a direct authentication workflow such that OpenShift can consume bearer tokens issued by external OIDC identity providers, replacing the built-in OAuth stack by deactivating/removing its components as necessary.
Why is this important? (mandatory)
OpenShift has its own built-in OAuth server which can be used to obtain OAuth access tokens for authentication to the API. The server can be configured with an external identity provider (including support for OIDC), however it is still the built-in server that issues tokens, and thus authentication is limited to the capabilities of the oauth-server.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Drawbacks or Risk (optional)
Done - Checklist (mandatory)
Description of problem:
This is a bug found during pre-merge test of 4.18 epic AUTH-528 PRs and filed for better tracking per existing "OpenShift - Testing Before PR Merges - Left-Shift Testing" google doc workflow.
co/console degraded with AuthStatusHandlerDegraded after OCP BYO external oidc is configured and then removed (i.e. reverted back to OAuth IDP).
Version-Release number of selected component (if applicable):
Cluster-bot build which is built at 2024-11-25 09:39 CST (UTC+800) build 4.18,openshift/cluster-authentication-operator#713,openshift/cluster-authentication-operator#740,openshift/cluster-kube-apiserver-operator#1760,openshift/console-operator#940
How reproducible:
Always (tried twice, both hit it)
Steps to Reproduce:
1. Launch a TechPreviewNoUpgrade standalone OCP cluster with above build. Configure htpasswd IDP. Test users can login successfully. 2. Configure BYO external OIDC in this OCP cluster using Microsoft Entra ID. KAS and console pods can roll out successfully. oc login and console login to Microsoft Entra ID can succeed. 3. Remove BYO external OIDC configuration, i.e. go back to original htpasswd OAuth IDP: [xxia@2024-11-25 21:10:17 CST my]$ oc patch authentication.config/cluster --type=merge -p=' spec: type: "" oidcProviders: null ' authentication.config.openshift.io/cluster patched [xxia@2024-11-25 21:15:24 CST my]$ oc get authentication.config cluster -o yaml apiVersion: config.openshift.io/v1 kind: Authentication metadata: annotations: include.release.openshift.io/ibm-cloud-managed: "true" include.release.openshift.io/self-managed-high-availability: "true" release.openshift.io/create-only: "true" creationTimestamp: "2024-11-25T04:11:59Z" generation: 5 name: cluster ownerReferences: - apiVersion: config.openshift.io/v1 kind: ClusterVersion name: version uid: e814f1dc-0b51-4b87-8f04-6bd99594bf47 resourceVersion: "284724" uid: 2de77b67-7de4-4883-8ceb-f1020b277210 spec: oauthMetadata: name: "" serviceAccountIssuer: "" type: "" webhookTokenAuthenticator: kubeConfig: name: webhook-authentication-integrated-oauth status: integratedOAuthMetadata: name: oauth-openshift oidcClients: - componentName: cli componentNamespace: openshift-console - componentName: console componentNamespace: openshift-console conditions: - lastTransitionTime: "2024-11-25T13:10:23Z" message: "" reason: OIDCConfigAvailable status: "False" type: Degraded - lastTransitionTime: "2024-11-25T13:10:23Z" message: "" reason: OIDCConfigAvailable status: "False" type: Progressing - lastTransitionTime: "2024-11-25T13:10:23Z" message: "" reason: OIDCConfigAvailable status: "True" type: Available currentOIDCClients: - clientID: 95fbae1d-69a7-4206-86bd-00ea9e0bb778 issuerURL: https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/v2.0 oidcProviderName: microsoft-entra-id KAS and console pods indeed can roll out successfully; and now oc login and console login indeed can succeed using the htpasswd user and password: [xxia@2024-11-25 21:49:32 CST my]$ oc login -u testuser-1 -p xxxxxx Login successful. ... But co/console degraded, which is weird: [xxia@2024-11-25 21:56:07 CST my]$ oc get co | grep -v 'True *False *False' NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE console 4.18.0-0.test-2024-11-25-020414-ci-ln-71cvsj2-latest True False True 9h AuthStatusHandlerDegraded: Authentication.config.openshift.io "cluster" is invalid: [status.oidcClients[1].currentOIDCClients[0].issuerURL: Invalid value: "": oidcClients[1].currentOIDCClients[0].issuerURL in body should match '^https:\/\/[^\s]', status.oidcClients[1].currentOIDCClients[0].oidcProviderName: Invalid value: "": oidcClients[1].currentOIDCClients[0].oidcProviderName in body should be at least 1 chars long]
Actual results:
co/console degraded, as above.
Expected results:
co/console is normal.
Additional info:
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
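As a concrete example of the per-namespace opt-in mentioned above, the standard Pod Security Admission labels can be set directly on a namespace (the namespace name is hypothetical):
oc label namespace example-app \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=latest --overwrite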
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).
Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
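A hedged sketch of what pinning looks like for a single platform workload, assuming the openshift.io/required-scc pod annotation is the pinning mechanism; the namespace and deployment names are placeholders:
oc -n openshift-example patch deployment example-operator --type=merge -p '
spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: restricted-v2
'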
The following tables track progress.
# namespaces | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|
monitored | 82 | 82 | 82 | 82 | 82 | 82 |
fix needed | 68 | 68 | 68 | 68 | 68 | 68 |
fixed | 39 | 39 | 35 | 32 | 39 | 1 |
remaining | 29 | 29 | 33 | 36 | 29 | 67 |
~ remaining non-runlevel | 8 | 8 | 12 | 15 | 8 | 46 |
~ remaining runlevel (low-prio) | 21 | 21 | 21 | 21 | 21 | 21 |
~ untested | 5 | 2 | 2 | 2 | 82 | 82 |
# | namespace | 4.19 | 4.18 | 4.17 | 4.16 | 4.15 | 4.14 |
---|---|---|---|---|---|---|---|
1 | oc debug node pods | #1763 | #1816 | #1818 | |||
2 | openshift-apiserver-operator | #573 | #581 | ||||
3 | openshift-authentication | #656 | #675 | ||||
4 | openshift-authentication-operator | #656 | #675 | ||||
5 | openshift-catalogd | #50 | #58 | ||||
6 | openshift-cloud-credential-operator | #681 | #736 | ||||
7 | openshift-cloud-network-config-controller | #2282 | #2490 | #2496 | |||
8 | openshift-cluster-csi-drivers | #118 #5310 #135 | #524 #131 #306 #265 #75 | #170 #459 | #484 | ||
9 | openshift-cluster-node-tuning-operator | #968 | #1117 | ||||
10 | openshift-cluster-olm-operator | #54 | n/a | n/a | |||
11 | openshift-cluster-samples-operator | #535 | #548 | ||||
12 | openshift-cluster-storage-operator | #516 | #459 #196 | #484 #211 | |||
13 | openshift-cluster-version | #1038 | #1068 | ||||
14 | openshift-config-operator | #410 | #420 | ||||
15 | openshift-console | #871 | #908 | #924 | |||
16 | openshift-console-operator | #871 | #908 | #924 | |||
17 | openshift-controller-manager | #336 | #361 | ||||
18 | openshift-controller-manager-operator | #336 | #361 | ||||
19 | openshift-e2e-loki | #56579 | #56579 | #56579 | #56579 | ||
20 | openshift-image-registry | #1008 | #1067 | ||||
21 | openshift-ingress | #1032 | |||||
22 | openshift-ingress-canary | #1031 | |||||
23 | openshift-ingress-operator | #1031 | |||||
24 | openshift-insights | #1033 | #1041 | #1049 | #915 | #967 | |
25 | openshift-kni-infra | #4504 | #4542 | #4539 | #4540 | ||
26 | openshift-kube-storage-version-migrator | #107 | #112 | ||||
27 | openshift-kube-storage-version-migrator-operator | #107 | #112 | ||||
28 | openshift-machine-api | #1308 #1317 | #1311 | #407 | #315 #282 #1220 #73 #50 #433 | #332 #326 #1288 #81 #57 #443 | |
29 | openshift-machine-config-operator | #4636 | #4219 | #4384 | #4393 | ||
30 | openshift-manila-csi-driver | #234 | #235 | #236 | |||
31 | openshift-marketplace | #578 | #561 | #570 | |||
32 | openshift-metallb-system | #238 | #240 | #241 | |||
33 | openshift-monitoring | #2298 #366 | #2498 | #2335 | #2420 | ||
34 | openshift-network-console | #2545 | |||||
35 | openshift-network-diagnostics | #2282 | #2490 | #2496 | |||
36 | openshift-network-node-identity | #2282 | #2490 | #2496 | |||
37 | openshift-nutanix-infra | #4504 | #4539 | #4540 | |||
38 | openshift-oauth-apiserver | #656 | #675 | ||||
39 | openshift-openstack-infra | #4504 | #4539 | #4540 | |||
40 | openshift-operator-controller | #100 | #120 | ||||
41 | openshift-operator-lifecycle-manager | #703 | #828 | ||||
42 | openshift-route-controller-manager | #336 | #361 | ||||
43 | openshift-service-ca | #235 | #243 | ||||
44 | openshift-service-ca-operator | #235 | #243 | ||||
45 | openshift-sriov-network-operator | #995 | #999 | #1003 | |||
46 | openshift-user-workload-monitoring | #2335 | #2420 | ||||
47 | openshift-vsphere-infra | #4504 | #4542 | #4539 | #4540 | ||
48 | (runlevel) kube-system | ||||||
49 | (runlevel) openshift-cloud-controller-manager | ||||||
50 | (runlevel) openshift-cloud-controller-manager-operator | ||||||
51 | (runlevel) openshift-cluster-api | ||||||
52 | (runlevel) openshift-cluster-machine-approver | ||||||
53 | (runlevel) openshift-dns | ||||||
54 | (runlevel) openshift-dns-operator | ||||||
55 | (runlevel) openshift-etcd | ||||||
56 | (runlevel) openshift-etcd-operator | ||||||
57 | (runlevel) openshift-kube-apiserver | ||||||
58 | (runlevel) openshift-kube-apiserver-operator | ||||||
59 | (runlevel) openshift-kube-controller-manager | ||||||
60 | (runlevel) openshift-kube-controller-manager-operator | ||||||
61 | (runlevel) openshift-kube-proxy | ||||||
62 | (runlevel) openshift-kube-scheduler | ||||||
63 | (runlevel) openshift-kube-scheduler-operator | ||||||
64 | (runlevel) openshift-multus | ||||||
65 | (runlevel) openshift-network-operator | ||||||
66 | (runlevel) openshift-ovn-kubernetes | ||||||
67 | (runlevel) openshift-sdn | ||||||
68 | (runlevel) openshift-storage |
Implement Migration core for MAPI to CAPI for AWS
When customers use CAPI, there must be no negative effect to switching over to using CAPI: Machine resources migrate seamlessly, and the fields in MAPI/CAPI should reconcile from both CRDs.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
As QE have tried to test upstream CAPI pausing, we've hit a few issues with running the migration controller and cluster CAPI operator on a real cluster vs envtest.
This card captures the work required to iron out these kinks and get things running (i.e. not crashing).
I also think we want an e2e or some sort of automated testing to ensure we don't break things again.
Goal: Stop the CAPI operator crashing on startup in a real cluster.
Non-goals: get the entire conversion flow running from CAPI -> MAPI and MAPI -> CAPI. We still need significant feature work before we're there.
As a cluster administrator, I want to use Karpenter on an OpenShift cluster running in AWS to scale nodes instead of the Cluster Autoscaler (CAS). I want to automatically manage heterogeneous compute resources in my OpenShift cluster without the additional manual task of managing node pools. Additional features I want are:
This feature covers the work done to integrate upstream Karpenter 1.x with ROSA HCP. This eliminates the need for manual node pool management while ensuring cost-effective compute selection for workloads. Red Hat manages the node lifecycle and upgrades.
The feature will be rolled out with ROSA (AWS) first since it has the more mature Karpenter ecosystem, followed by the ARO (Azure) implementation (see OCPSTRAT-1498).
As a cluster-admin or SRE I should be able to configure Karpenter with OCP on AWS. Both cli and UI should enable users to configure Karpenter and disable CAS.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | managed ROSA HCP |
Classic (standalone cluster) | |
Hosted control planes | yes |
Multi node, Compact (three node), or Single node (SNO), or all | MNO |
Connected / Restricted Network | Connected |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_x64, ARM (aarch64) |
Operator compatibility | |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | yes - console |
Other (please specify) | rosa-cli |
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
<your text here>
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Goal Summary
This feature aims to make sure that the HyperShift operator and the control-plane it deploys uses Managed Service Identities (MSI) and have access to scoped credentials (also via access to AKS's image gallery potentially). Additionally, for operators deployed in customers account (system components), they would be scoped with Azure workload identities.
The Cluster Network Operator can authenticate with Service Principal backed by a certificate stored in an Azure Key Vault. The Secrets CSI driver will be used to mount the certificate as a volume on the image registry deployment in a hosted control plane.
Azure SDK
Which degree of coverage should run on AKS e2e vs on existing e2es
CI - Existing CI is running, tests are automated and merged.
CI - AKS CI is running, tests are automated and merged.
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>
Today, Azure installation requires a manually created service principal, which involves relations, permission granting, credential setting, credential storage, credential rotation, credential cleanup, and service principal deletion. This is not only mundane and time-consuming but also less secure; the lack of credential rotation risks access to resources by adversaries.
Employ Azure managed credentials, which drastically reduce the required steps to just managed identity creation, permission granting, and resource deletion.
Ideally, this should be a HyperShift-native functionality. I.e., HyperShift should use managed identities for the control plane, the kubelet, and any add-on that needs access to Azure resources.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so we don't forget to improve it in a safer and more maintainable way.
Maintainability and debuggability, and fighting technical debt in general, are critical to keeping velocity and ensuring overall high quality.
https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
https://issues.redhat.com/browse/CNF-9566
Capture the incidental work necessary to get CI / Konflux unstuck during the 4.19 cycle.
Due to capacity problems on the s390x environment, the Konflux team recommended disabling the s390x platform from the PR pipeline.
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
When running ./build-frontend.sh, I am getting the following warnings in the build log:
 cypress-axe">
warning " > cypress-axe@0.12.0" has unmet peer dependency "axe-core@^3 || ^4".
warning " > cypress-axe@0.12.0" has incorrect peer dependency "cypress@^3 || ^4 || ^5 || ^6".
To fix:
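The original fix list is not included here; one possible approach, assuming yarn and that satisfying the peer ranges is acceptable, is:
# provide the missing peer dependency
yarn add --dev axe-core@^4
# and/or move to a cypress-axe release whose cypress peer range matches the installed cypress
yarn upgrade cypress-axe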
When adding the assignContributorRole to assign contributor roles for the appropriate scopes to existing SPs, we missed the assignment of the role over the DNS RG scope.
We are constantly bumping up against quotas when trying to create new ServicePrincipals per test. Example:
=== NAME TestCreateClusterV2
hypershift_framework.go:291: failed to create cluster, tearing down: failed to create infra: ERROR: The directory object quota limit for the Tenant has been exceeded. Please ask your administrator to increase the quota limit or delete objects to reduce the used quota.
We need to create a set of ServicePrincipals to use during testing, and we need to reuse them while executing the e2e-aks.
These are items that the team has prioritized to address in 4.18.
In https://issues.redhat.com/browse/MCO-1469, we are migrating my helper binaries into the MCO repository. I had to make changes to several of my helpers in the original repository to address bugs and other issues in order to unblock https://github.com/openshift/release/pull/58241. Because of the changes I requested during the PR review to make the integration easier, it may be a little tricky to incorporate all of my changes into the MCO repository, but it is still doable.
Done When:
In OCP 4.7 and before, you were able to see the MCD logs of the previous container post-upgrade. In newer versions it seems that we no longer do. I am not sure if this is a change in kube pod logging behaviour, how the pod gets shut down and brought up, or something in the MCO.
This however makes it relatively hard to debug in newer versions of the MCO, and in numerous bugs we could not pinpoint the source of the issue since we no longer have the necessary logs. We should find a way to properly save the previous-boot MCD logs if possible.
This epic has been repurposed for handling bugs and issues related to DataImage api ( see comments by Zane and slack discussion below ). Some issues have already been added, will add more issues to improve the stability and reliability of this feature.
Reference links :
Issue opened for IBIO : https://issues.redhat.com/browse/OCPBUGS-43330
Slack discussion threads :
https://redhat-internal.slack.com/archives/CFP6ST0A3/p1729081044547689?thread_ts=1728928990.795199&cid=CFP6ST0A3
https://redhat-internal.slack.com/archives/C0523LQCQG1/p1732110124833909?thread_ts=1731660639.803949&cid=C0523LQCQG1
Description of problem:
After deleting a BaremetalHost which has a related DataImage, the DataImage is still present. I'd expect that together with the bmh deletion the dataimage gets deleted as well.
Version-Release number of selected component (if applicable):
4.17.0-rc.0
How reproducible:
100%
Steps to Reproduce:
1. Create a BaremetalHost object as part of the installation process using the Image Based Install operator
2. The Image Based Install operator will create a dataimage as part of the install process
3. Delete the BaremetalHost object
4. Check the DataImage assigned to the BareMetalHost
Actual results:
While the BaremetalHost was deleted the DataImage is still present: oc -n kni-qe-1 get bmh No resources found in kni-qe-1 namespace. oc -n kni-qe-1 get dataimage -o yaml apiVersion: v1 items: - apiVersion: metal3.io/v1alpha1 kind: DataImage metadata: creationTimestamp: "2024-09-24T11:58:10Z" deletionGracePeriodSeconds: 0 deletionTimestamp: "2024-09-24T14:06:15Z" finalizers: - dataimage.metal3.io generation: 2 name: sno.kni-qe-1.lab.eng.rdu2.redhat.com namespace: kni-qe-1 ownerReferences: - apiVersion: metal3.io/v1alpha1 blockOwnerDeletion: true controller: true kind: BareMetalHost name: sno.kni-qe-1.lab.eng.rdu2.redhat.com uid: 0a8bb033-5483-4fe8-8e44-06bf43ae395f resourceVersion: "156761793" uid: 2358cae9-b660-40e6-9095-7daabb4d9e48 spec: url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso status: attachedImage: url: https://image-based-install-config.multicluster-engine.svc:8000/images/kni-qe-1/ec274bfe-a295-4cd4-8847-4fe4d232b255.iso error: count: 0 message: "" lastReconciled: "2024-09-24T12:03:28Z" kind: List metadata: resourceVersion: ""
Expected results:
The DataImage gets deleted when the BaremetalHost owner gets deleted.
Additional info:
This is impacting automated test pipelines which use ImageBasedInstall operator as the cleanup stage gets stuck waiting for the namespace deletion which still holds the DataImage. Also deleting the DataImage gets stuck and it can only be deleted by removing the finalizer. oc get namespace kni-qe-1 -o yaml apiVersion: v1 kind: Namespace metadata: annotations: openshift.io/sa.scc.mcs: s0:c33,c2 openshift.io/sa.scc.supplemental-groups: 1001060000/10000 openshift.io/sa.scc.uid-range: 1001060000/10000 creationTimestamp: "2024-09-24T11:40:03Z" deletionTimestamp: "2024-09-24T14:06:14Z" labels: app.kubernetes.io/instance: clusters cluster.open-cluster-management.io/managedCluster: kni-qe-1 kubernetes.io/metadata.name: kni-qe-1 name: kni-qe-1-namespace open-cluster-management.io/cluster-name: kni-qe-1 pod-security.kubernetes.io/audit: restricted pod-security.kubernetes.io/audit-version: v1.24 pod-security.kubernetes.io/enforce: restricted pod-security.kubernetes.io/enforce-version: v1.24 pod-security.kubernetes.io/warn: restricted pod-security.kubernetes.io/warn-version: v1.24 name: kni-qe-1 resourceVersion: "156764765" uid: ee984850-665a-4f5e-8f17-0c44b57eb925 spec: finalizers: - kubernetes status: conditions: - lastTransitionTime: "2024-09-24T14:06:23Z" message: All resources successfully discovered reason: ResourcesDiscovered status: "False" type: NamespaceDeletionDiscoveryFailure - lastTransitionTime: "2024-09-24T14:06:23Z" message: All legacy kube types successfully parsed reason: ParsedGroupVersions status: "False" type: NamespaceDeletionGroupVersionParsingFailure - lastTransitionTime: "2024-09-24T14:06:23Z" message: All content successfully deleted, may be waiting on finalization reason: ContentDeleted status: "False" type: NamespaceDeletionContentFailure - lastTransitionTime: "2024-09-24T14:06:23Z" message: 'Some resources are remaining: dataimages.metal3.io has 1 resource instances' reason: SomeResourcesRemain status: "True" type: NamespaceContentRemaining - lastTransitionTime: "2024-09-24T14:06:23Z" message: 'Some content in the namespace has finalizers remaining: dataimage.metal3.io in 1 resource instances' reason: SomeFinalizersRemain status: "True" type: NamespaceFinalizersRemaining phase: Terminating
Tracking all things Konflux related for the Metal Platform Team
Full enablement should happen during OCP 4.19 development cycle
Description of problem:
The host that gets used in production builds to download the iso will change soon. It would be good to allow this host to be set through configuration from the release team / ocp-build-data
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
List of component:
Adapt the current Dockerfiles and add them to the upstream repos.
List of components:
A lot of the time, our pipelines as well as other teams' pipelines are stuck because they are unable to provision hosts with different architectures to build the images.
Because we currently don't use the multi-arch images we build with Konflux, we will stop building multi-arch for now and re-add those architectures when we need them.
Currently, the monitoring stack is configured using a configmap. In OpenShift though the best practice is to configure operators using custom resources.
To start the effort we should create a feature gate behind which we can start implementing a CRD config approach. This allows us to iterate in smaller increments without having to support full feature parity with the config map from the start. We can start small and add features as they evolve.
One proposal for a minimal DoD was:
Feature parity should be planned in one or more separate epics.
This story covers the implementation of our initial CRD in CMO. When the feature gate is enabled, CMO watches a singleton CR (name tbd) and acts on changes. The initial feature could be a boolean flag (defaults to true) that tells CMO to merge the configmap settings. If a user sets this flag to false, the config map is ignored and default settings are applied.
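Purely as an illustration of the story above, a sketch of the singleton CR with the boolean merge flag; the API group, kind, and field name are hypothetical, since the card leaves the resource name tbd:
oc apply -f - <<'EOF'
apiVersion: monitoring.openshift.io/v1alpha1   # hypothetical group/version behind the feature gate
kind: ClusterMonitoring                        # hypothetical kind; the real name is tbd
metadata:
  name: cluster
spec:
  mergeConfigMapSettings: true                 # illustrative flag: false would ignore the configmap and apply defaults
EOF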
Thanos needs to be upgraded to support Prometheus 3.
All origin tests were failing.
The history of this epic starts with this PR which triggered a lengthy conversation around the workings of the image API with respect to importing imagestreams images as single vs manifestlisted. The imagestreams today by default have the `importMode` flag set to `Legacy` to avoid breaking behavior of existing clusters in the field. This makes sense for single arch clusters deployed with a single arch payload, but when users migrate to use the multi payload, more often than not, their intent is to add nodes of other architecture types. When this happens - it gives rise to problems when using imagestreams with the default behavior of importing a single manifest image. The oc commands do have a new flag to toggle the importMode, but this breaks functionality of existing users who just want to create an imagestream and use it with existing commands.
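For reference, the per-command toggle mentioned above looks like the following; the imagestream and image names are illustrative:
oc import-image example-app:latest \
  --from=quay.io/example/app:latest --confirm \
  --import-mode=PreserveOriginal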
There was a discussion with David Eads and other staff engineers and it was decided that the approach to be taken is to default imagestreams' importMode to `preserveOriginal` if the cluster is installed with/ upgraded to a multi payload. So a few things need to happen to achieve this:
Some open questions:
This is a container Epic for tasks which we know need to be done for Tech Preview but which we don't intend to do now. It needs to be groomed before it is useful for planning.
This task focuses on ensuring that all OpenStack resources automatically created by Hypershift for Hosted Control Planes are tagged with a unique identifier, such as the HostedCluster ID. These resources include, but are not limited to, servers, ports, and security groups. Proper tagging will enable administrators to clearly identify and manage resources associated with specific OpenShift clusters.
Acceptance Criteria:
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We need to send an enhancement proposal that would contain the design changes we suggest in openshift/api/config/v1/types_cluster_version.go to allow changing the log level of the CVO using an API configuration before implementing such changes in the API.
Definition of Done:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview due to the following reasons:
(1) Low customer interest of using Openshift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
Impacted areas based on CI:
alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml
Acceptance Criteria
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Test is failing due to oddness with oc adm logs.
We think it is related to PodLogsQuery feature that went into 1.32.
Description of problem:
In the ASH ARM template 06_workers.json [1], there is an unused variable "identityName" defined. This is harmless, but it is a little weird to have it present in the official UPI installation doc [2], and it might confuse users when installing a UPI cluster on ASH. Suggest removing it from the ARM template.
[1] https://github.com/openshift/installer/blob/master/upi/azurestack/06_workers.json#L52
[2] https://docs.openshift.com/container-platform/4.17/installing/installing_azure_stack_hub/upi/installing-azure-stack-hub-user-infra.html#installation-arm-worker_installing-azure-stack-hub-user-infra
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Initially, the clusters at version 4.16.9 were having issues with reconciling the IDP. The error which was found in Dynatrace was
"error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: Service Unavailable",
Initially it was assumed that the IDP service was unavailable, but the CU confirmed that they also have the GroupSync operator running inside all clusters, which can successfully connect to the customer IDP and sync User + Group information from the IDP into the cluster.
The CU was advised to upgrade to 4.16.18, keeping in mind a few of the other OCPBUGS which were related to proxy and would be resolved by upgrading to 4.16.15+.
However, after the upgrade the IDP is still failing to apply. It looks like the IDP reconciler isn't considering the Additional Trust Bundle for the customer proxy.
Checking DT Logs, it seems to fail to verify the certificate
"error": "failed to update control plane: failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority", "error": "failed to update control plane: [failed to reconcile openshift oauth apiserver: failed to reconcile oauth server config: failed to generate oauth config: failed to apply IDP AAD config: tls: failed to verify certificate: x509: certificate signed by unknown authority, failed to update status: Operation cannot be fulfilled on hostedcontrolplanes.hypershift.openshift.io \"rosa-staging\": the object has been modified; please apply your changes to the latest version and try again]",
Version-Release number of selected component (if applicable):
4.16.18
How reproducible:
Customer has a few clusters deployed and each of them has the same issue.
Steps to Reproduce:
1. Create a HostedCluster with a proxy configuration that specifies an additionalTrustBundle, and an OpenID idp that can be publicly verified (i.e. EntraID or Keycloak with LetsEncrypt certs)
2. Wait for the cluster to come up and try to use the IDP
3.
Actual results:
IDP is failing to work for HCP
Expected results:
IDP should be working for the clusters
Additional info:
The issue will happen only if the IDP does not require a custom trust bundle to be verified.
Description of problem:
The initial set of default endpoint overrides we specified in the installer is missing a v1 at the end of the DNS services override.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Created a service for the DNS server for secondary networks in OpenShift Virtualization, using MetalLB, but the IP is still pending; when accessing the service from the UI, it crashes.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Steps to Reproduce:
1. Create an IP pool (for example, 1 IP) for MetalLB and fully utilize the IP range (with another service); a sketch of such a pool appears after the Additional info below
2. Allocate a new IP using the oc expose command like below
3. Check the service status on the UI
Actual results:
UI crash
Expected results:
Should show the service status
Additional info:
oc expose -n openshift-cnv deployment/secondary-dns --name=dns-lb --type=LoadBalancer --port=53 --target-port=5353 --protocol='UDP'
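For step 1 above, a minimal MetalLB pool with a single address might look like the following sketch; the names and address are illustrative and it assumes the metallb.io/v1beta1 API (advertisement resources are omitted):
oc apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: single-ip-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.240/32
EOF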
Description of problem:
Improving tests to remove the issue in the following Helm test case: "Helm Release: Perform the helm chart upgrade for already upgraded helm chart (HR-08-TC02)" (37s). {The following error originated from your application code, not from Cypress. It was caused by an unhandled promise rejection. > Cannot read properties of undefined (reading 'repoName')} When Cypress detects uncaught errors originating from your application it will automatically fail the current test.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The Power VS Machine API provider ignores the authentication endpoint override.
Description of problem:
When the master MCP is paused, the alert below is triggered: "Failed to resync 4.12.35 because: Required MachineConfigPool 'master' is paused". The nodes have been rebooted to make sure there is no pending MC rollout.
Affects version
4.12
How reproducible:
Steps to Reproduce:
1. Create a MC and apply it to master
2. Use the MC below:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-cgroupsv2
spec:
  kernelArguments:
  - systemd.unified_cgroup_hierarchy=1
3. Wait until the nodes are rebooted and running
4. Pause the MCP

Actual results:

Pausing the MCP causes the alert
Expected results:
Alerts should not be fired

Additional info:
Description of problem:
The story is to track i18n upload/download routine tasks which are performed every sprint.
A.C.
- Upload strings to Memsource at the start of the sprint and reach out to the localization team
- Download translated strings from Memsource when it is ready
- Review the translated strings and open a pull request
- Open a followup story for next sprint
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
If the vCenter cluster has no ESXi hosts, importing the OVA fails. Add a saner error message.
Description of problem:
In CAPI, we use a random machineNetwork instead of using the one passed in by the user.
Description of problem:
Due to the recent changes, using oc 4.17 adm node-image commands on a 4.18 OCP cluster doesn't work.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. oc adm node-image create / monitor 2. 3.
Actual results:
The commands fail
Expected results:
The commands should work as expected
Additional info:
Description of problem:
Currently both the nodepool controller and capi controller set the updatingConfig condition on nodepool upgrades. We should only use one to set the condition to avoid constant switching between conditions and to ensure the logic used for setting this condition is the same.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
CAPI and Nodepool controller set a different status because their logic is not consistent.
Expected results:
CAPI and Nodepool controller set the same status because their logic is consolidated.
Additional info:
Description of problem:
We are currently using node 18, but our types are for node 10
Version-Release number of selected component (if applicable):
4.19.0
How reproducible:
Always
Steps to Reproduce:
1. Open frontend/package.json 2. Observe @types/node and engine version 3.
Actual results:
They are different
Expected results:
They are the same
Additional info:
Description of problem:
checked in 4.18.0-0.nightly-2024-12-05-103644/4.19.0-0.nightly-2024-12-04-031229, OCPBUGS-34533 is reproduced on 4.18+, no such issue with 4.17 and below.
steps: login admin console or developer console(admin console go to "Observe -> Alerting -> Silences" tab, developer console go to "Observe -> Silences" tab), to create silence, edit the "Until" option, even with a valid timestamp or invalid stamp, will get error "[object Object]" in the "Until" field. see screen recording: https://drive.google.com/file/d/14JYcNyslSVYP10jFmsTaOvPFZSky1eg_/view?usp=drive_link
checked 4.17 fix for OCPBUGS-34533 is already in 4.18+ code
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
always
Steps to Reproduce:
1. see the descriptions
Actual results:
Unable to edit "until" filed in silences
Expected results:
able to edit "until" filed in silences
Description of problem:
v1alpha1 schema is still present in the v1 ConsolePlugin CRD and should be removed manually since the generator is re-adding it automatically.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Integrate codespell into Make Verify so that things are spelled correctly in our upstream docs and codebase.
To do
Description of problem:
This function https://github.com/openshift/hypershift/blame/c34a1f6cef0cb41c8a1f83acd4ddf10a4b9e8532/support/util/util.go#L391 does not check the IDMS/ICSP overrides during reconciliation, which breaks disconnected deployments.
Description of problem:
"destroy cluster" doesn't delete the PVC disks which have the label "kubernetes-io-cluster-<infra-id>: owned"
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-11-27-162629
How reproducible:
Always
Steps to Reproduce:
1. include the step which sets the cluster default storageclass to the hyperdisk one before ipi-install (see my debug PR https://github.com/openshift/release/pull/59306) 2. "create cluster", and make sure it succeeds 3. "destroy cluster" Note: although we confirmed the issue with disk type "hyperdisk-balanced", we believe other disk types have the same issue.
Actual results:
The 2 PVC disks of hyperdisk-balanced type are not deleted during "destroy cluster", although the disks have the label "kubernetes-io-cluster-<infra-id>: owned".
Expected results:
The 2 PVC disks should be deleted during "destroy cluster", because they have the correct/expected labels according to which the uninstaller should be able to detect them.
Additional info:
FYI the PROW CI debug job: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/59306/rehearse-59306-periodic-ci-openshift-verification-tests-master-installer-rehearse-4.18-installer-rehearse-debug/1861958752689721344
Description of problem:
when the TechPreviewNoUpgrade feature gate is enabled, the console will show a customized 'Create Project' modal to all users. In the customized modal, the 'Display name' and 'Description' values the user typed do not take effect
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-16-065305
How reproducible:
Always when TechPreviewNoUpgrade feature gate is enabled
Steps to Reproduce:
1. Enable TechPreviewNoUpgrade feature gate $ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type merge 2. any user login to console and create a project from web, set 'Display name' and 'Description' then click on 'Create' 3. Check created project YAML $ oc get project ku-5 -o json | jq .metadata.annotations { "openshift.io/description": "", "openshift.io/display-name": "", "openshift.io/requester": "kube:admin", "openshift.io/sa.scc.mcs": "s0:c28,c17", "openshift.io/sa.scc.supplemental-groups": "1000790000/10000", "openshift.io/sa.scc.uid-range": "1000790000/10000" }
Actual results:
display-name and description are all empty
Expected results:
display-name and description should be set to the values the user configured
Additional info:
once TP is enabled, customized create project modal is looking like https://drive.google.com/file/d/1HmIlm0u_Ia_TPsa0ZAGyTloRmpfD0WYk/view?usp=drive_link
Description of problem:
When attempting to install a specific version of an operator from the web console, the install plan of the latest version of that operator is created if the operator version had a + in it.
Version-Release number of selected component (if applicable):
4.17.6 (Tested version)
How reproducible:
Easily reproducible
Steps to Reproduce:
1. Under Operators > Operator Hub, install an operator with a + character in the version. 2. On the next screen, note that the + in the version text box is missing. 3. Make no changes to the default options and proceed to install the operator. 4. An install plan is created to install the operator with the latest version from the channel.
Actual results:
The install plan is created for the latest version from the channel.
Expected results:
The install plan is created for the requested version.
Additional info:
Notes on the reproducer: - For step 1: the selected version shouldn't be the latest version from the channel for the purposes of this bug. - For step 1: The version will need to be selected from the version dropdown to reproduce the bug. If the default version that appears in the dropdown is used, then the bug won't reproduce. Other Notes: - This might also happen with other special characters in the version string other than +, but this is not something that I tested.
Description of problem:
The StaticPodOperatorStatus API validations permit:
- nodeStatuses[].currentRevision can be cleared and can decrease
- more than one entry in nodeStatuses can have a targetRevision > 0
Both of these signal a bug in one or more of the static pod controllers that write to them.
Version-Release number of selected component (if applicable):
This has been the case ~forever but we are aware of bugs in 4.18+ that are resulting in controllers trying to make these invalid writes. We also have more expressive validation mechanisms today that make it possible to plug the holes.
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
After the upgrade to OpenShift Container Platform 4.17, it's being observed that aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics is reporting target down state. When checking the newly created Container one can find the below logs, that may explain the effect seen/reported. $ oc logs aws-efs-csi-driver-controller-5b8d5cfdf4-zwh67 -c kube-rbac-proxy-8211 W1119 07:53:10.249934 1 deprecated.go:66] ==== Removed Flag Warning ====================== logtostderr is removed in the k8s upstream and has no effect any more. =============================================== I1119 07:53:10.250382 1 kube-rbac-proxy.go:233] Valid token audiences: I1119 07:53:10.250431 1 kube-rbac-proxy.go:347] Reading certificate files I1119 07:53:10.250645 1 kube-rbac-proxy.go:395] Starting TCP socket on 0.0.0.0:9211 I1119 07:53:10.250944 1 kube-rbac-proxy.go:402] Listening securely on 0.0.0.0:9211 I1119 07:54:01.440714 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:54:19.860038 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:54:31.432943 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:54:49.852801 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:55:01.433635 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:55:19.853259 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:55:31.432722 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:55:49.852606 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:56:01.432707 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:56:19.853137 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:56:31.440223 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:56:49.856349 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:57:01.432528 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:57:19.853132 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:57:31.433104 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:57:49.852859 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:58:01.433321 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused I1119 07:58:19.853612 1 log.go:245] http: proxy error: dial tcp 127.0.0.1:8211: connect: connection refused
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.17
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.17 2. Install aws-efs-csi-driver-operator 3. Create efs.csi.aws.com CSIDriver object and wait for the aws-efs-csi-driver-controller to roll out.
Actual results:
The below Target Down alert is being raised: 50% of the aws-efs-csi-driver-controller-metrics/aws-efs-csi-driver-controller-metrics targets in the openshift-cluster-csi-drivers namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.
Expected results:
The ServiceMonitor endpoint should be reachable and properly responding with the desired information to monitor the health of the component.
Additional info:
When deploying with an endpoint override for the resourceController, the Power VS machine API provider will ignore the override.
Creating clusters in which machines are created in a public subnet and use a public IP makes it possible to avoid creating NAT gateways (or proxies) for AWS clusters. While not applicable for every test, this configuration will save us money and cloud resources.
Description of problem:
If the install is performed with an AWS user missing the `ec2:DescribeInstanceTypeOfferings` permission, the installer will use a hardcoded instance type from the set of non-edge machine pools. This can potentially cause the edge node to fail during provisioning, since the instance type doesn't take into account edge/wavelength zones support. Because edge nodes are not needed for the installation to complete, the issue is not noticed by the installer, only by inspecting the status of the edge nodes.
Version-Release number of selected component (if applicable):
4.16+ (since edge nodes support was added)
How reproducible:
always
Steps to Reproduce:
1. Specify an edge machine pool in the install-config without an instance type 2. Run the install with a user without `ec2:DescribeInstanceTypeOfferings` 3.
Actual results:
In CI the `node-readiness` test step will fail and the edge nodes will show errorMessage: 'error launching instance: The requested configuration is currently not supported. Please check the documentation for supported configurations.' errorReason: InvalidConfiguration
Expected results:
Either 1. the permission is always required when instance type is not set for an edge pool; or 2. a better instance type default is used
Additional info:
Example CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1862140149505200128
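The missing-permission failure mode can be checked directly against the EC2 API. Below is a minimal Go sketch (not installer code; the region, zone, and instance type names are illustrative assumptions) of the DescribeInstanceTypeOfferings call that requires the permission discussed above, filtered to a single zone.
~~~
// Sketch: verify which candidate instance types are actually offered in a given
// Local/Wavelength zone before falling back to a hardcoded default.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	svc := ec2.New(sess)

	zone := "us-east-1-nyc-1a" // hypothetical Local Zone name
	candidates := []string{"m5.2xlarge", "c5d.2xlarge", "r5.2xlarge"}

	out, err := svc.DescribeInstanceTypeOfferings(&ec2.DescribeInstanceTypeOfferingsInput{
		LocationType: aws.String("availability-zone"),
		Filters: []*ec2.Filter{
			{Name: aws.String("location"), Values: []*string{aws.String(zone)}},
			{Name: aws.String("instance-type"), Values: aws.StringSlice(candidates)},
		},
	})
	if err != nil {
		// This is the call that fails without ec2:DescribeInstanceTypeOfferings;
		// surfacing the error is preferable to silently picking a default type.
		log.Fatalf("cannot verify instance type offerings in %s: %v", zone, err)
	}

	for _, offering := range out.InstanceTypeOfferings {
		fmt.Printf("%s is offered in %s\n", aws.StringValue(offering.InstanceType), zone)
	}
}
~~~
If this call fails with an authorization error, failing loudly (or always requiring the permission when no instance type is set for an edge pool) avoids the silent fallback described above.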
Description of problem:
When deploying with endpoint overrides, the block CSI driver will try to use the default endpoints rather than the ones specified.
Description of problem:
In order to test OCL we run e2e automated test cases in a cluster that has OCL enabled in the master and worker pools. We have seen that, rarely, a new MachineConfig is rendered but no MOSB resource is created.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Rare
Steps to Reproduce:
We don't have any steps to reproduce it. It happens eventually when we run a regression in a cluster with OCL enabled in master and worker pools.
Actual results:
We see that in some scenarios a new MC is created, then a new rendered MC is created too, but no MOSB is created and the pool is stuck forever.
Expected results:
Whenever a new rendered MC is created, a new MOSB should be created too to build the new image.
Additional info:
In the comments section we will add all the must-gather files that are related to this issue. In some scenarios we can see this error reported by the os-builder pod: 2024-12-03T16:44:14.874310241Z I1203 16:44:14.874268 1 request.go:632] Waited for 596.269343ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/secrets?labelSelector=machineconfiguration.openshift.io%2Fephemeral-build-object%2Cmachineconfiguration.openshift.io%2Fmachine-os-build%3Dmosc-worker-5fc70e666518756a629ac4823fc35690%2Cmachineconfiguration.openshift.io%2Fon-cluster-layering%2Cmachineconfiguration.openshift.io%2Frendered-machine-config%3Drendered-worker-7c0a57dfe9cd7674b26bc5c030732b35%2Cmachineconfiguration.openshift.io%2Ftarget-machine-config-pool%3Dworker Nevertheless, we only see this error in some of them, not in all of them.
Description of problem:
checked on 4.18.0-0.nightly-2024-12-07-130635/4.19.0-0.nightly-2024-12-07-115816, admin console, go to alert details page, "No datapoints found." on alert details graph. see picture for CannotRetrieveUpdates alert: https://drive.google.com/file/d/1RJCxUZg7Z8uQaekt39ux1jQH_kW9KYXd/view?usp=drive_link
issue exists in 4.18+, no such issue with 4.17
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
always on 4.18+
Steps to Reproduce:
1. see the description
Actual results:
"No datapoints found." on alert details graph
Expected results:
show correct graph
Description of problem:
AlertmanagerConfig with missing options causes Alertmanager to crash
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
A cluster administrator has enabled monitoring for user-defined projects. CMO ~~~ config.yaml: | enableUserWorkload: true prometheusK8s: retention: 7d ~~~ A cluster administrator has enabled alert routing for user-defined projects. UWM cm / CMO cm ~~~ apiVersion: v1 kind: ConfigMap metadata: name: user-workload-monitoring-config namespace: openshift-user-workload-monitoring data: config.yaml: | alertmanager: enabled: true enableAlertmanagerConfig: true ~~~ verify existing config: ~~~ $ oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093 global: resolve_timeout: 5m http_config: follow_redirects: true smtp_hello: localhost smtp_require_tls: true pagerduty_url: https://events.pagerduty.com/v2/enqueue opsgenie_api_url: https://api.opsgenie.com/ wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/ victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/ telegram_api_url: https://api.telegram.org route: receiver: Default group_by: - namespace continue: false receivers: - name: Default templates: [] ~~~ create alertmanager config without options "smtp_from:" and "smtp_smarthost" ~~~ apiVersion: monitoring.coreos.com/v1alpha1 kind: AlertmanagerConfig metadata: name: example namespace: example-namespace spec: receivers: - emailConfigs: - to: some.username@example.com name: custom-rules1 route: matchers: - name: alertname receiver: custom-rules1 repeatInterval: 1m ~~~ check logs for alertmanager: the following error is seen. ~~~ ts=2023-09-05T12:07:33.449Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="no global SMTP smarthost set" ~~~
Actual results:
Alertmanager fails to restart.
Expected results:
The CRD should be validated up front.
Additional info:
Reproducible with and without user workload Alertmanager.
When updating a 4.13 cluster to 4.14, the new-in-4.14 ImageRegistry capability will always be enabled, because capabilities cannot be uninstalled.
4.14 oc should learn about this, so it will appropriately extract registry CredentialsRequests when connecting to 4.13 clusters for 4.14 manifests. 4.15 oc will get OTA-1010 to handle this kind of issue automatically, but there's no problem with getting an ImageRegistry hack into 4.15 engineering candidates in the meantime.
100%
1. Connect your oc to a 4.13 cluster.
2. Extract manifests for a 4.14 release.
3. Check for ImageRegistry CredentialsRequests.
$ oc adm upgrade | head -n1 Cluster version is 4.13.12 $ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64 $ grep -r ImageRegistry credentials-requests ...no hits...
$ oc adm upgrade | head -n1 Cluster version is 4.13.12 $ oc adm release extract --included --credentials-requests --to credentials-requests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.0-x86_64 $ grep -r ImageRegistry credentials-requests credentials-requests/0000_50_cluster-image-registry-operator_01-registry-credentials-request.yaml: capability.openshift.io/name: ImageRegistry
We already do this for MachineAPI. The ImageRegistry capability landed later, and this is us catching the oc-extract hack up with that change.
The cluster-baremetal-operator sets up a number of watches for resources using Owns() that have no effect because the Provisioning CR does not (and should not) own any resources of the given type or using EnqueueRequestForObject{}, which similarly has no effect because the resource name and namespace are different from that of the Provisioning CR.
The commit https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e should be reverted as it adds considerable complexity to no effect whatsoever.
The correct way to trigger a reconcile of the provisioning CR is using EnqueueRequestsFromMapFunc(watchOCPConfigPullSecret) (note that the map function watchOCPConfigPullSecret() is poorly named - it always returns the name/namespace of the Provisioning CR singleton, regardless of the input, which is what we want). We should replace the ClusterOperator, Proxy, and Machine watches with ones of this form.
See https://github.com/openshift/cluster-baremetal-operator/pull/423/files#r1628777876 and https://github.com/openshift/cluster-baremetal-operator/pull/351/commits/d4e709bbfbae6d316f2e76bec18b0e10a45ac93e#r1628776168 for commentary.
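A minimal sketch of the suggested watch shape, assuming a recent controller-runtime (v0.15+) builder API; the import paths, reconciler name, and singleton name are indicative rather than copied from the operator:
~~~
package controllers

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	machinev1beta1 "github.com/openshift/api/machine/v1beta1"
	metal3iov1alpha1 "github.com/openshift/cluster-baremetal-operator/api/v1alpha1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type ProvisioningReconciler struct {
	client.Client
}

// mapToProvisioningSingleton plays the role described for watchOCPConfigPullSecret:
// it ignores the incoming object and always enqueues the Provisioning CR singleton.
func mapToProvisioningSingleton(ctx context.Context, _ client.Object) []reconcile.Request {
	return []reconcile.Request{
		{NamespacedName: types.NamespacedName{Name: "provisioning-configuration"}},
	}
}

func (r *ProvisioningReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// reconciliation logic elided
	return ctrl.Result{}, nil
}

func (r *ProvisioningReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&metal3iov1alpha1.Provisioning{}).
		// Resources that are not owned by (and not named like) the Provisioning CR
		// cannot be matched by Owns() or EnqueueRequestForObject{}; map them back
		// to the singleton instead.
		Watches(&configv1.ClusterOperator{}, handler.EnqueueRequestsFromMapFunc(mapToProvisioningSingleton)).
		Watches(&configv1.Proxy{}, handler.EnqueueRequestsFromMapFunc(mapToProvisioningSingleton)).
		Watches(&machinev1beta1.Machine{}, handler.EnqueueRequestsFromMapFunc(mapToProvisioningSingleton)).
		Complete(r)
}
~~~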
Description of problem:
Some bundles in the Catalog have been given the property in the FBC (and not in the bundle's CSV) which does not get propagated through to the helm chart annotations.
Version-Release number of selected component (if applicable):
How reproducible:
Install elasticsearch 5.8.13
Steps to Reproduce:
1. 2. 3.
Actual results:
cluster is upgradeable
Expected results:
cluster is not upgradeable
Additional info:
Description of problem:
For various reasons, Pods may get evicted. Once they are evicted, the owner of the Pod should recreate the Pod so it is scheduled again.
With OLM, we can see that evicted Pods owned by CatalogSources are not rescheduled. The outcome is that all subscriptions have a "ResolutionFailed=True" condition, which hinders an upgrade of the operator. Specifically, the customer is seeing an affected CatalogSource "multicluster-engine-CENSORED_NAME-redhat-operator-index" in the openshift-marketplace namespace, pod name: "multicluster-engine-CENSORED_NAME-redhat-operator-index-5ng9j"
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16.21
How reproducible:
Sometimes, when Pods are evicted on the cluster
Steps to Reproduce:
1. Set up an OpenShift Container Platform 4.16 cluster, install various Operators
2. Create a condition that a Node will evict Pods (for example by creating DiskPressure on the Node)
3. Observe if any Pods owned by CatalogSources are being evicted
Actual results:
If Pods owned by CatalogSources are being evicted, they are not recreated / rescheduled.
Expected results:
When Pods owned by CatalogSources are being evicted, they are being recreated / rescheduled.
Additional info:
In order to fix security issue https://github.com/openshift/assisted-service/security/dependabot/94
Description of problem:
A similar testing scenario to OCPBUGS-38719, but the pre-existing DNS private zone is not a peering zone; instead it is a normal DNS zone bound to another VPC network. The installation ultimately fails, because the DNS record-set "*.apps.<cluster name>.<base domain>" is added to the above DNS private zone, rather than the cluster's DNS private zone.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-10-24-093933
How reproducible:
Always
Steps to Reproduce:
Please refer to the steps told in https://issues.redhat.com/browse/OCPBUGS-38719?focusedId=25944076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25944076
Actual results:
The installation failed, due to the cluster operator "ingress" degraded
Expected results:
The installation should succeed.
Additional info:
Due to fundamental Kubernetes design, all OpenShift Container Platform updates between minor versions must be serialized. You must update from OpenShift Container Platform <4.y> to <4.y+1>, and then to <4.y+2>. You cannot update from OpenShift Container Platform <4.y> to <4.y+2> directly. However, administrators who want to update between two even-numbered minor versions can do so incurring only a single reboot of non-control plane hosts.
We should add a new precondition that enforces that policy, so cluster admins who run --to-image ... don't hop straight from 4.y.z to 4.(y+2).z' or similar without realizing that they were outpacing testing and policy.
The policy and current lack-of guard both date back to all OCP 4 releases, and since they're Kube-side constraints, they may date back to the start of Kube.
Every time.
1. Install a 4.y.z cluster.
2. Use --to-image to request an update to a 4.(y+2).z release.
3. Wait a few minutes for the cluster-version operator to consider the request.
4. Check with oc adm upgrade.
Update accepted.
Update rejected (unless it was forced), complaining about the excessively long hop.
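A minimal sketch of what such a precondition check could look like, assuming the blang/semver module; this is illustrative, not the CVO's actual precondition code:
~~~
package precondition

import (
	"fmt"

	"github.com/blang/semver/v4"
)

// CheckMinorSkew rejects updates that skip a minor version (4.y -> 4.y+2 or more),
// mirroring the policy described above. A forced update would bypass this check.
func CheckMinorSkew(currentVersion, targetVersion string) error {
	cur, err := semver.Parse(currentVersion)
	if err != nil {
		return fmt.Errorf("cannot parse current version %q: %w", currentVersion, err)
	}
	tgt, err := semver.Parse(targetVersion)
	if err != nil {
		return fmt.Errorf("cannot parse target version %q: %w", targetVersion, err)
	}
	if tgt.Major == cur.Major && tgt.Minor > cur.Minor+1 {
		return fmt.Errorf("update from %s to %s skips at least one minor version; update to 4.%d first",
			currentVersion, targetVersion, cur.Minor+1)
	}
	return nil
}
~~~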
Description of problem:
When setting up the "webhookTokenAuthenticator", the OAuth config "type" is set to "None". The controller then sets the console configmap with "authType=disabled", which causes the console pod to go into a crash loop due to the disallowed type: Error: validate.go:76] invalid flag: user-auth, error: value must be one of [oidc openshift], not disabled. This worked on 4.14 and stopped working on 4.15.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.15
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
The console can't start, seems like it is not allowed to change the console.
Expected results:
Additional info:
Description of problem:
In an effort to ensure all HA components are not degraded by design during normal e2e test or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run. This card captures machine-config operator that blips Degraded=True during some ci job runs. Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview-serial/1843561357304139776 Reasons associated with the blip: MachineConfigDaemonFailed or MachineConfigurationFailed For now, we put an exception in the test. But it is expected that teams take action to fix those and remove the exceptions after the fix go in. Exception is defined here: https://github.com/openshift/origin/blob/e5e76d7ca739b5699639dd4c500f6c076c697da6/pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go#L109 See linked issue for more explanation on the effort.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
During bootstrapping we're running into the following scenario: 4 members: master 0, 1, and 2 (full voting) and bootstrap (torn down/dead member). A revision rollout causes master 0 to restart and leaves you with 2/4 healthy members, which means quorum loss. This causes apiserver unavailability during the installation and should be avoided.
Version-Release number of selected component (if applicable):
4.17, 4.18 but is likely a longer standing issue
How reproducible:
rarely
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
apiserver should not return any errors
Additional info:
The following test is failing more than expected:
Undiagnosed panic detected in pod
See the sippy test details for additional context.
Observed in 4.18-e2e-vsphere-ovn-upi-serial/1861922894817267712
Undiagnosed panic detected in pod { pods/openshift-machine-config-operator_machine-config-daemon-4mzxf_machine-config-daemon_previous.log.gz:E1128 00:28:30.700325 4480 panic.go:261] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<}
Description of problem:
Pull support from upstream kubernetes (see KEP 4800: https://github.com/kubernetes/enhancements/issues/4800) for LLC alignment support in cpumanager
Version-Release number of selected component (if applicable):
4.19
How reproducible:
100%
Steps to Reproduce:
1. try to schedule a pod which requires exclusive CPU allocation and whose CPUs should be affine to the same LLC block 2. observe random and likely wrong (not LLC-aligned) allocation 3.
Actual results:
Expected results:
Additional info:
Description of problem:
DEBUG Creating ServiceAccount for control plane nodes DEBUG Service account created for XXXXX-gcp-r4ncs-m DEBUG Getting policy for openshift-dev-installer DEBUG adding roles/compute.instanceAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member DEBUG adding roles/compute.networkAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member DEBUG adding roles/compute.securityAdmin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member DEBUG adding roles/storage.admin role, added serviceAccount:XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com member ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: failed to add master roles: failed to set IAM policy, unexpected error: googleapi: Error 400: Service account XXXXX-gcp-r4ncs-m@openshift-dev-installer.iam.gserviceaccount.com does not exist., badRequest It appears that the Service account was created correctly. The roles are assigned to the service account. It is possible that there needs to be a "wait for action to complete" on the server side to ensure that this will all be ok.
Version-Release number of selected component (if applicable):
How reproducible:
Random. Appears to be a sync issue
Steps to Reproduce:
1. Run the installer for a normal GCP basic install 2. 3.
Actual results:
Installer fails saying that the Service Account that the installer created does not have the permissions to perform an action. Sometimes it takes numerous tries for this to happen (very intermittent).
Expected results:
Successful install
Additional info:
Description of problem:
When creating a kubevirt hosted cluster with the following apiserver publishing configuration - service: APIServer servicePublishingStrategy: type: NodePort nodePort: address: my.hostna.me port: 305030 the following error is shown: "failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address" and network policies are not properly deployed in the virtual machine namespaces.
Version-Release number of selected component (if applicable):
4.17
How reproducible:
Always
Steps to Reproduce:
1.Create a kubevirt hosted cluster with apiserver nodeport publish with a hostname 2. Wait for hosted cluster creation.
Actual results:
Following error pops up and network policies are not created "failed to reconcile virt launcher policy: could not determine if amy.hostna.me is an IPv4 or IPv6 address"
Expected results:
No error pops up and the network policies are created.
Additional info:
This is where the error originates -> https://github.com/openshift/hypershift/blob/ef8596d4d69a53eb60838ae45ffce2bca0bfa3b2/hypershift-operator/controllers/hostedcluster/network_policies.go#L644 That error is what prevents the network policies from being created.
Description of problem:
machine-approver logs
E0221 20:29:52.377443 1 controller.go:182] csr-dm7zr: Pending CSRs: 1871; Max pending allowed: 604. Difference between pending CSRs and machines > 100. Ignoring all CSRs as too many recent pending CSRs seen
.
oc get csr |wc -l
3818
oc get csr |grep "node-bootstrapper" |wc -l
2152
By approving the pending CSRs manually I can get the cluster to scale up.
We can increase the maxPending to a higher number https://github.com/openshift/cluster-machine-approver/blob/2d68698410d7e6239dafa6749cc454272508db19/pkg/controller/controller.go#L330
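For reference, a hedged client-go sketch of the manual bulk-approval workaround mentioned above; the username filter and the approval reason are assumptions for illustration, not product code:
~~~
package main

import (
	"context"
	"log"

	certificatesv1 "k8s.io/api/certificates/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	csrs, err := cs.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for i := range csrs.Items {
		csr := &csrs.Items[i]
		// A CSR with no Approved/Denied condition is still pending.
		if len(csr.Status.Conditions) != 0 ||
			csr.Spec.Username != "system:serviceaccount:openshift-machine-config-operator:node-bootstrapper" {
			continue
		}
		csr.Status.Conditions = append(csr.Status.Conditions, certificatesv1.CertificateSigningRequestCondition{
			Type:    certificatesv1.CertificateApproved,
			Status:  corev1.ConditionTrue,
			Reason:  "ManualBulkApproval",
			Message: "approved to unblock node scale-up",
		})
		if _, err := cs.CertificatesV1().CertificateSigningRequests().UpdateApproval(
			context.TODO(), csr.Name, csr, metav1.UpdateOptions{}); err != nil {
			log.Printf("failed to approve %s: %v", csr.Name, err)
		}
	}
}
~~~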
Description of problem:
"Cannot read properties of undefined (reading 'state')" Error in search tool when filtering Subscriptions while adding new Subscriptions
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. As an Administrator, go to Home -> Search and filter by Subscription component 2. Start creating subscriptions (bulk)
Actual results:
The filtered results will turn in "Oh no! Something went wrong" view
Expected results:
Get updated results every few seconds
Additional info:
Reloading the view fixes it.
Stack Trace:
TypeError: Cannot read properties of undefined (reading 'state') at L (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/subscriptions-chunk-89fe3c19814d1f6cdc84.min.js:1:3915) at na (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:58879) at Hs (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:111315) at Sc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98327) at Cc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98255) at _c (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:98118) at pc (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:95105) at https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44774 at t.unstable_runWithPriority (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:289:3768) at Uo (https://console-openshift-console.apps.ods-qe-psi-02.osp.rh-ods.com/static/vendors~main-chunk-c416c917452592bcdcba.min.js:281:44551)
Description of problem:
4.18 HyperShift operator's NodePool controller fails to serialize NodePool ConfigMaps that contain ImageDigestMirrorSet. Inspecting the code, it fails on NTO reconciliation logic, where only machineconfiguration API schemas are loaded into the YAML serializer: https://github.com/openshift/hypershift/blob/f7ba5a14e5d0cf658cf83a13a10917bee1168011/hypershift-operator/controllers/nodepool/nto.go#L415-L421
Version-Release number of selected component (if applicable):
4.18
How reproducible:
100%
Steps to Reproduce:
1. Install 4.18 HyperShift operator 2. Create NodePool with configuration ConfigMap that includes ImageDigestMirrorSet 3. HyperShift operator fails to reconcile NodePool
Actual results:
HyperShift operator fails to reconcile NodePool
Expected results:
HyperShift operator to successfully reconcile NodePool
Additional info:
Regression introduced by https://github.com/openshift/hypershift/pull/4717
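A minimal sketch of the kind of fix implied above: build the decoder over a scheme that installs both API groups so ImageDigestMirrorSet objects can be deserialized alongside MachineConfigs. The import paths and function name are assumptions for illustration, not the actual HyperShift patch:
~~~
package nto

import (
	configv1 "github.com/openshift/api/config/v1"
	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer"
)

func newNodePoolConfigDecoder() (runtime.Decoder, error) {
	scheme := runtime.NewScheme()
	if err := mcfgv1.Install(scheme); err != nil {
		return nil, err
	}
	// The missing piece described in the bug: ImageDigestMirrorSet lives in
	// config.openshift.io/v1, which was never added to the serializer's scheme.
	if err := configv1.Install(scheme); err != nil {
		return nil, err
	}
	return serializer.NewCodecFactory(scheme).UniversalDeserializer(), nil
}
~~~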
Description of problem:
Currently check-patternfly-modules.sh checks them serially, which could be improved by checking them in parallel. Since yarn why does not write to anything, this should be easily parallelizable as there is no race condition with writing back to the yarn.lock
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Missing metrics - example: cluster_autoscaler_failed_scale_ups_total
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
#curl the autoscalers metrics endpoint: $ oc exec deployment/cluster-autoscaler-default -- curl -s http://localhost:8085/metrics | grep cluster_autoscaler_failed_scale_ups_total
Actual results:
the metric does not return a value until an event has happened
Expected results:
The metric counter should be initialized at startup, providing a zero value
Additional info:
I have been through the file: https://raw.githubusercontent.com/openshift/kubernetes-autoscaler/master/cluster-autoscaler/metrics/metrics.go and checked off the metrics that do not appear when scraping the metrics endpoint straight after deployment. The following metrics are in metrics.go but are missing from the scrape ~~~ node_group_min_count node_group_max_count pending_node_deletions errors_total scaled_up_gpu_nodes_total failed_scale_ups_total failed_gpu_scale_ups_total scaled_down_nodes_total scaled_down_gpu_nodes_total unremovable_nodes_count skipped_scale_events_count ~~~
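A minimal client_golang sketch of the usual pattern for the expected behavior: register the counter and touch each known label value with Add(0) so a zero-valued series is exported from startup. The metric and label names here are illustrative, not the autoscaler's actual definitions:
~~~
package metrics

import "github.com/prometheus/client_golang/prometheus"

var failedScaleUps = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "cluster_autoscaler",
		Name:      "failed_scale_ups_total",
		Help:      "Number of times scale-up operation has failed.",
	},
	[]string{"reason"},
)

func RegisterAndInitialize(reg prometheus.Registerer, knownReasons []string) {
	reg.MustRegister(failedScaleUps)
	for _, reason := range knownReasons {
		// Creates each labeled series with value 0 so it shows up on the first scrape.
		failedScaleUps.WithLabelValues(reason).Add(0)
	}
}
~~~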
CVO manifests contain some feature-gated ones:
We observed HyperShift CI jobs to fail when adding DevPreview-gated deployment manifests to CVO, which was unexpected. Investigating further, we discovered that HyperShift applies them:
error: error parsing /var/payload/manifests/0000_00_update-status-controller_03_deployment-DevPreviewNoUpgrade.yaml: error converting YAML to JSON: yaml: invalid map key: map[interface {}]interface {}{".ReleaseImage":interface {}(nil)}
But even without these added manifests, this happens for existing ClusterVersion CRD manifests present in the payload:
$ ls -1 manifests/*clusterversions*crd.yaml manifests/0000_00_cluster-version-operator_01_clusterversions-CustomNoUpgrade.crd.yaml manifests/0000_00_cluster-version-operator_01_clusterversions-Default.crd.yaml manifests/0000_00_cluster-version-operator_01_clusterversions-DevPreviewNoUpgrade.crd.yaml manifests/0000_00_cluster-version-operator_01_clusterversions-TechPreviewNoUpgrade.crd.yaml
In a passing HyperShift CI job, the same log shows that all four manifests are applied instead of just one:
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
4.18
Always
1. inspect the cluster-version-operator-*-bootstrap.log of a HyperShift CI job
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io configured
= all four ClusterVersion CRD manifests are applied
customresourcedefinition.apiextensions.k8s.io/clusterversions.config.openshift.io created
= ClusterVersion CRD manifest is applied just once
I'm filing this card so that I can link it to the "easy" fix https://github.com/openshift/hypershift/pull/5093 which is not the perfect fix, but allows us to add featureset-gated manifests to CVO without breaking HyperShift. It is desirable to improve this even further and actually correctly select the manifests to be applied for CVO bootstrap, but that involves non-trivial logic similar to one used by CVO and it seems to be better approached as a feature to be properly assessed and implemented, rather than a bugfix, so I'll file a separate HOSTEDCP card for that.
Description of problem:
Some permissions are missing when edge zones are specified in the install-config.yaml, probably those related to Carrier Gateways (but maybe more)
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always with minimal permissions
Steps to Reproduce:
1. 2. 3.
Actual results:
time="2024-11-20T22:40:58Z" level=debug msg="\tfailed to describe carrier gateways in vpc \"vpc-0bdb2ab5d111dfe52\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-girt7h2j-4515a-minimal-perm is not authorized to perform: ec2:DescribeCarrierGateways because no identity-based policy allows the ec2:DescribeCarrierGateways action"
Expected results:
All required permissions are listed in pkg/asset/installconfig/aws/permissions.go
Additional info:
See https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9222/pull-ci-openshift-installer-master-e2e-aws-ovn-edge-zones/1859351015715770368 for a failed min-perms install
Description of problem:
When using PublicIPv4Pool, CAPA will try to allocate an IP address from the supplied pool, which requires the `ec2:AllocateAddress` permission
Version-Release number of selected component (if applicable):
4.16+
How reproducible:
always
Steps to Reproduce:
1. Minimal permissions and publicIpv4Pool set 2. 3.
Actual results:
time="2024-11-21T05:39:49Z" level=debug msg="E1121 05:39:49.352606 327 awscluster_controller.go:279] \"failed to reconcile load balancer\" err=<" time="2024-11-21T05:39:49Z" level=debug msg="\tfailed to allocate addresses to load balancer: failed to allocate address from Public IPv4 Pool \"ipv4pool-ec2-0768267342e327ea9\" to role lb-apiserver: failed to allocate Elastic IP for \"lb-apiserver\": UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::460538899914:user/ci-op-2cr41ill-663fd-minimal-perm is not authorized to perform: ec2:AllocateAddress on resource: arn:aws:ec2:us-east-1:460538899914:ipv4pool-ec2/ipv4pool-ec2-0768267342e327ea9 because no identity-based policy allows the ec2:AllocateAddress action. Encoded authorization failure message: Iy1gCtvfPxZ2uqo-SHei1yJQvNwaOBl5F_8BnfeEYCLMczeDJDdS4fZ_AesPLdEQgK7ahuOffqIr--PWphjOUbL2BXKZSBFhn3iN9tZrDCnQQPKZxf9WaQmSkoGNWKNUGn6rvEZS5KvlHV5vf5mCz5Bk2lk3w-O6bfHK0q_dphLpJjU-sTGvB6bWAinukxSYZ3xbirOzxfkRfCFdr7nDfX8G4uD4ncA7_D-XriDvaIyvevWSnus5AI5RIlrCuFGsr1_3yEvrC_AsLENZHyE13fA83F5-Abpm6-jwKQ5vvK1WuD3sqpT5gfTxccEqkqqZycQl6nsxSDP2vDqFyFGKLAmPne8RBRbEV-TOdDJphaJtesf6mMPtyMquBKI769GW9zTYE7nQzSYUoiBOafxz6K1FiYFoc1y6v6YoosxT8bcSFT3gWZWNh2upRJtagRI_9IRyj7MpyiXJfcqQXZzXkAfqV4nsJP8wRXS2vWvtjOm0i7C82P0ys3RVkQVcSByTW6yFyxh8Scoy0HA4hTYKFrCAWA1N0SROJsS1sbfctpykdCntmp9M_gd7YkSN882Fy5FanA" time="2024-11-21T05:39:49Z" level=debug msg="\t\tstatus code: 403, request id: 27752e3c-596e-43f7-8044-72246dbca486"
Expected results:
Additional info:
Seems to happen consistently with shared-vpc-edge-zones CI job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/9230/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-edge-zones/1860015198224519168
Description of problem:
The LB name should be yunjiang-ap55-sk6jl-ext-a6aae262b13b0580, rather than ending with ELB service endpoint (elb.ap-southeast-5.amazonaws.com): failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed provisioning resources after infrastructure ready: failed to find HostedZone ID for NLB: failed to list load balancers: ValidationError: The load balancer name 'yunjiang-ap55-sk6jl-ext-a6aae262b13b0580.elb.ap-southeast-5.amazonaws.com' cannot be longer than '32' characters\n\tstatus code: 400, request id: f8adce67-d844-4088-9289-4950ce4d0c83 Checking the tag value, the value of Name key is correct: yunjiang-ap55-sk6jl-ext
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-30-141716
How reproducible:
always
Steps to Reproduce:
1. Deploy a cluster on ap-southeast-5 2. 3.
Actual results:
The LB can not be created
Expected results:
Create a cluster successfully.
Additional info:
No such issues on other AWS regions.
Description of problem:
oc adm node-image create --pxe does not generate only the PXE artifacts, but copies everything from the node-joiner pod. Also, the names of the PXE artifacts are not correct (prefixed with agent instead of node)
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. oc adm node-image create --pxe
Actual results:
Everything from the node-joiner pod is copied. The PXE artifact names are wrong.
Expected results:
In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img * node.x86_64-rootfs.img * node.x86_64-vmlinuz
Additional info:
Description of problem:
As more systems have been added to Power VS, the assumption that every zone in a region has the same set of systypes has been broken. To properly represent what system types are available, the powervs_regions struct needed to be altered and parts of the installer referencing it needed to be updated.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Try to deploy with s1022 in dal10 2. SysType not available, even though it is a valid option in Power VS. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When running the delete command on oc-mirror after a mirrorToMirror, the graph-image is not being deleted.
Version-Release number of selected component (if applicable):
How reproducible:
With the following ImageSetConfiguration (use the same for the DeleteImageSetConfiguration only changing the kind and the mirror to delete)
kind: ImageSetConfiguration apiVersion: mirror.openshift.io/v2alpha1 mirror: platform: channels: - name: stable-4.13 minVersion: 4.13.10 maxVersion: 4.13.10 graph: true
Steps to Reproduce:
1. Run mirror to mirror ./bin/oc-mirror -c ./alex-tests/alex-isc/isc.yaml --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 docker://localhost:6000 --v2 --dest-tls-verify=false 2. Run the delete --generate ./bin/oc-mirror delete -c ./alex-tests/alex-isc/isc-delete.yaml --generate --workspace file:///home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230 --delete-id clid-230-delete-test docker://localhost:6000 --v2 --dest-tls-verify=false 3. Run the delete ./bin/oc-mirror delete --delete-yaml-file /home/aguidi/go/src/github.com/aguidirh/oc-mirror/alex-tests/clid-230/working-dir/delete/delete-images-clid-230-delete-test.yaml docker://localhost:6000 --v2 --dest-tls-verify=false
Actual results:
During the delete --generate the graph-image is not being included in the delete file 2024/10/25 09:44:21 [WARN] : unable to find graph image in local cache: SKIPPING. %!v(MISSING) 2024/10/25 09:44:21 [WARN] : reading manifest latest in localhost:55000/openshift/graph-image: manifest unknown Because of that the graph-image is not being deleted from the target registry [aguidi@fedora oc-mirror]$ curl http://localhost:6000/v2/openshift/graph-image/tags/list | jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 51 100 51 0 0 15577 0 --:--:-- --:--:-- --:--:-- 17000 { "name": "openshift/graph-image", "tags": [ "latest" ] }
Expected results:
graph-image should be deleted even after mirrorToMirror
Additional info:
{Failed === RUN TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations
util.go:1232:
the pod openstack-manila-csi-controllerplugin-676cc65ffc-tnnkb is not in the audited list for safe-eviction and should not contain the safe-to-evict-local-volume annotation
Expected
<string>: socket-dir
to be empty
--- FAIL: TestAutoscaling/EnsureHostedCluster/EnsurePodsWithEmptyDirPVsHaveSafeToEvictAnnotations (0.02s)
}
Description of problem:
We have an OKD 4.12 cluster which has persistent and increasing ingresswithoutclassname alerts with no ingresses normally present in the cluster. I believe the ingresswithoutclassname being counted is created as part of the ACME validation process managed by the cert-manager operator with its openshift route addon, which is torn down once the ACME validation is complete.
Version-Release number of selected component (if applicable):
4.12.0-0.okd-2023-04-16-041331
How reproducible:
Seems very consistent. It went away during an update but came back shortly after and continues to increase.
Steps to Reproduce:
1. create ingress w/o classname 2. see counter increase 3. delete classless ingress 4. counter does not decrease.
Additional info:
https://github.com/openshift/cluster-ingress-operator/issues/912
Description of problem:
Observed in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn/1866088107347021824/artifacts/e2e-gcp-ovn/ipi-install-install/artifacts/.openshift_install-1733747884.log Distinct issues occurring in this job caused the "etcd bootstrap member to be removed from cluster" gate to take longer than its 5 minute timeout, but there was plenty of time left to complete bootstrapping successfully. It doesn't make sense to have a narrow timeout here because progress toward removal of the etcd bootstrap member begins the moment the etcd cluster starts for the first time, not when the installer starts waiting to observe it.
Version-Release number of selected component (if applicable):
4.19.0
How reproducible:
Sometimes
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Incorrect capitalization: `Lightspeed` is capitalized as `LightSpeed` in the ja and zh languages
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This is part of the plan to improve stability of ipsec in ocp releases.
There are several regressions identified in libreswan-4.9 (default in 4.14.z and 4.15.z) which needs to be addressed in an incremental approach. The first step is to introduce libreswan-4.6-3.el9_0.3 which is the oldest major version(4.6) that can still be released in rhel9. It includes a libreswan crash fix and some CVE backports that are present in libreswan-4.9 but not in libreswan-4.5 (so that it can pass the internal CVE scanner check).
This pinning of libreswan-4.6-3.el9_0.3 is only needed for 4.14.z since containerized ipsec is used in 4.14. Starting 4.15, ipsec is moved to host and this CNO PR (about to merge as of writing) will allow ovnk to use host ipsec execs which only requires libreswan pkg update in rhcos extension.
Description of problem:
bump ovs version to openvswitch3.4-3.4.0-18.el9fdp for ocp 4.19 to include the ovs-monitor-ipsec improvement https://issues.redhat.com/browse/FDP-846
Description of problem:
This bug is filed as a result of https://access.redhat.com/support/cases/#/case/03977446 Although both node topologies are equivalent, PPC reported a false negative: Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1.TBD 2. 3.
Actual results:
Error: targeted nodes differ: nodes host1.development.lab and host2.development.lab have different topology: the CPU corres differ: processor core #20 (2 threads), logical processors [2 66] vs processor core #20 (2 threads), logical processors [2 66]
Expected results:
The topologies match; PPC should work fine
Additional info:
Description of problem:
`sts:AssumeRole` is required when creating a Shared-VPC cluster [1], otherwise the following error occurs: level=fatal msg=failed to fetch Cluster Infrastructure Variables: failed to fetch dependency of "Cluster Infrastructure Variables": failed to generate asset "Platform Provisioning Check": aws.hostedZone: Invalid value: "Z01991651G3UXC4ZFDNDU": unable to retrieve hosted zone: could not get hosted zone: Z01991651G3UXC4ZFDNDU: AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-1c2w7jv2-ef4fe-minimal-perm-installer is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::641733028092:role/ci-op-1c2w7jv2-ef4fe-shared-role level=fatal msg= status code: 403, request id: ab7160fa-ade9-4afe-aacd-782495dc9978 Installer exits with code 1 [1]https://docs.openshift.com/container-platform/4.17/installing/installing_aws/installing-aws-account.html
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-03-174639
How reproducible:
Always
Steps to Reproduce:
1. Create install-config for Shared-VPC cluster 2. Run openshift-install create permissions-policy 3. Create cluster by using the above installer-required policy.
Actual results:
See description
Expected results:
sts:AssumeRole is included in the policy file when Shared VPC is configured.
Additional info:
The configuration of Shared-VPC is like: platform: aws: hostedZone: hostedZoneRole:
Description of problem:
In an effort to ensure all HA components are not degraded by design during normal e2e test or upgrades, we are collecting all operators that are blipping Degraded=True during any payload job run. This card captures machine-config operator that blips Degraded=True during upgrade runs. Example Job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-azure-ovn-upgrade/1843023092004163584 Reasons associated with the blip: RenderConfigFailed For now, we put an exception in the test. But it is expected that teams take action to fix those and remove the exceptions after the fix go in. Exceptions are defined here: See linked issue for more explanation on the effort.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
OSD-26887: managed services taints several nodes as infrastructure. This taint appears to be applied after some of the platform DS are scheduled there, causing this alert to fire. Managed services rebalances the DS after the taint is added, and the alert clears, but origin fails this test. Allowing this alert to fire while we investigate why the taint is not added at node birth.
Description of problem:
Missing translations for "PodDisruptionBudget violated" string
Code:
"count PodDisruptionBudget violated_one": "count PodDisruptionBudget violated_one", "count PodDisruptionBudget violated_other": "count PodDisruptionBudget violated_other", | |
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
HyperShift CEL validation blocks ARM64 NodePool creation for non-AWS/Azure platforms, so a Bare Metal worker node can't be added to the hosted cluster. This was discussed on the #project-hypershift Slack channel.
Version-Release number of selected component (if applicable):
MultiClusterEngine v2.7.2 HyperShift Operator image: registry.redhat.io/multicluster-engine/hypershift-rhel9-operator@sha256:56bd0210fa2a6b9494697dc7e2322952cd3d1500abc9f1f0bbf49964005a7c3a
How reproducible:
Always
Steps to Reproduce:
1. Create a HyperShift HostedCluster on a non-AWS/non-Azure platform 2. Try to create a NodePool with ARM64 architecture specification
Actual results:
- CEL validation blocks creating NodePool with arch: arm64 on non-AWS/Azure platforms - Receive error: "The NodePool is invalid: spec: Invalid value: "object": Setting Arch to arm64 is only supported for AWS and Azure" - Additional validation in NodePool spec also blocks arm64 architecture
Expected results:
- Allow ARM64 architecture specification for NodePools on BareMetal platform - Remove or update the CEL validation to support this use case
Additional info:
NodePool YAML: apiVersion: hypershift.openshift.io/v1beta1 kind: NodePool metadata: name: nodepool-doca5-1 namespace: doca5 spec: arch: arm64 clusterName: doca5 management: autoRepair: false replace: rollingUpdate: maxSurge: 1 maxUnavailable: 0 strategy: RollingUpdate upgradeType: InPlace platform: agent: agentLabelSelector: {} type: Agent release: image: quay.io/openshift-release-dev/ocp-release:4.16.21-multi replicas: 1
Description of problem:
The ingress-to-route controller does not provide any information about failed conversions from ingress to route. This is a big issue in environments heavily dependent on ingress objects. The only way to find out why a route is not created is trial and error, since the only information available is that the route was not created.
Version-Release number of selected component (if applicable):
OCP 4.14
How reproducible:
100%
Steps to Reproduce:
apiVersion: networking.k8s.io/v1 kind: Ingress metadata: annotations: route.openshift.io/termination: passthrough name: hello-openshift-class namespace: test spec: ingressClassName: openshift-default rules: - host: ingress01-rhodain-test01.apps.rhodain03.sbr-virt.gsslab.brq2.redhat.com http: paths: - backend: service: name: myapp02 port: number: 8080 path: / pathType: Prefix tls: - {}
Actual results:
Route is not created and no error is logged
Expected results:
An error is provided in the events or at least in the controller's logs. Events are preferred, as the ingress objects are mainly created by users without cluster-admin privileges.
Additional info:
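A minimal sketch of how the controller could surface such failures as events with client-go's EventRecorder; the type, function, and reason names are illustrative, not the controller's actual code:
~~~
package ingressconvert

import (
	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/client-go/tools/record"
)

type converter struct {
	recorder record.EventRecorder
}

// reportConversionFailure attaches a warning event to the Ingress itself, so
// users without cluster-admin can see why no Route was created via `oc get events`.
func (c *converter) reportConversionFailure(ing *networkingv1.Ingress, err error) {
	c.recorder.Eventf(ing, corev1.EventTypeWarning, "FailedToCreateRoute",
		"could not create route for ingress %s/%s: %v", ing.Namespace, ing.Name, err)
}
~~~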
It looks like OLMv1 doesn't handle proxies correctly, aws-ovn-proxy job is permafailing https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-proxy/1861444783696777216
I suspect it's on the OLM operator side, are you looking at the cluster-wide proxy object and wiring it into your containers if set?
Description of problem:
The HorizontalNav component of @openshift-console/dynamic-plugin-sdk does not have the customData prop which is available in the console repo. This prop is needed to pass values between tabs on a details page.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Applying a PerformanceProfile with an invalid cpuset in one of the reserved/isolated/shared/offlined cpu fields causes webhook validation to panic instead of returning an informative error.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-22-231049
How reproducible:
Apply a PerformanceProfile with invalid cpu values
Steps to Reproduce:
Apply the following PerformanceProfile with invalid cpu values: apiVersion: performance.openshift.io/v2 kind: PerformanceProfile metadata: name: pp spec: cpu: isolated: 'garbage' reserved: 0-3 machineConfigPoolSelector: pools.operator.machineconfiguration.openshift.io/worker-cnf: "" nodeSelector: node-role.kubernetes.io/worker-cnf: ""
Actual results:
On OCP >= 4.18 the error is:
Error from server: error when creating "pp.yaml": admission webhook "vwb.performance.openshift.io" denied the request: panic: runtime error: invalid memory address or nil pointer dereference [recovered]
On OCP <= 4.17 the error is:
Validation webhook passes without any errors. Invalid configuration propogates to the cluster and breaks it.
Expected results:
We expect an informative error to be returned when an invalid cpuset is entered, without panicking or accepting it.
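A minimal sketch of the kind of up-front validation expected here, assuming the k8s.io/utils/cpuset parser; the field plumbing is illustrative, not the webhook's actual code:
~~~
package validation

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// validateCPUFields checks the reserved/isolated/offlined/shared cpuset strings.
// A nil pointer means the field was not set, which is allowed here.
func validateCPUFields(fields map[string]*string) error {
	for name, value := range fields {
		if value == nil {
			continue
		}
		if _, err := cpuset.Parse(*value); err != nil {
			return fmt.Errorf("invalid cpuset in spec.cpu.%s: %q: %v", name, *value, err)
		}
	}
	return nil
}
~~~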
The "oc adm pod-network" command for working with openshift-sdn multitenant mode is now totally useless in OCP 4.17 and newer clusters (since it's only useful with openshift-sdn, and openshift-sdn no longer exists as of OCP 4.17). Of course, people might use a new oc binary to talk to an older cluster, but probably the built-in documentation should make it clearer that this is not a command that will be useful to 99% of users.
If it's possible to make "pod-network" not show up as a subcommand in "oc adm -h" that would probably be good. If not, it should probably have a description like "Manage OpenShift-SDN Multitenant mode networking [DEPRECATED]", and likewise, the longer descriptions of the pod-network subcommands should talk about "OpenShift-SDN Multitenant mode" rather than "the redhat/openshift-ovs-multitenant network plugin" (which is OCP 3 terminology), and maybe should explicitly say something like "this has no effect when using the default OpenShift Networking plugin (OVN-Kubernetes)".
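For illustration, oc is built on cobra, which already supports both options mentioned above; a hedged sketch (not oc's actual wiring):
~~~
package admin

import "github.com/spf13/cobra"

func newPodNetworkCommand() *cobra.Command {
	cmd := &cobra.Command{
		Use:    "pod-network",
		Short:  "Manage OpenShift-SDN Multitenant mode networking [DEPRECATED]",
		Hidden: true, // do not list under `oc adm -h`
		Deprecated: "OpenShift-SDN was removed in OCP 4.17; this command has no effect " +
			"with the default OVN-Kubernetes network plugin.",
	}
	// subcommands (join-projects, isolate-projects, make-projects-global) would be added here
	return cmd
}
~~~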
Description of problem:
The release signature configmap file is invalid with no name defined
Version-Release number of selected component (if applicable):
oc-mirror version WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version. Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.18.0-202410011141.p0.g227a9c4.assembly.stream.el9-227a9c4", GitCommit:"227a9c499b6fd94e189a71776c83057149ee06c2", GitTreeState:"clean", BuildDate:"2024-10-01T20:07:43Z", GoVersion:"go1.22.5 (Red Hat 1.22.5-1.module+el8.10.0+22070+9237f38b) X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
1) With ISC:

   cat /test/yinzhou/config.yaml
   kind: ImageSetConfiguration
   apiVersion: mirror.openshift.io/v2alpha1
   mirror:
     platform:
       channels:
       - name: stable-4.16

2) Do mirror2disk + disk2mirror
3) Use the signature configmap to create the resource
Actual results:
3) Failed to create the resource with error:

   oc create -f signature-configmap.json
   The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required

   oc create -f signature-configmap.yaml
   The ConfigMap "" is invalid: metadata.name: Required value: name or generateName is required
Expected results:
No error
Description of problem:
If the serverless function is not running, nothing happens when the Test Serverless Function button is clicked.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Install the Serverless operator
2. Create a serverless function and make sure the status is false
3. Click on Test Serverless Function
Actual results:
No response
Expected results:
Maybe show an alert, or hide that option if the function is not ready?
Additional info:
Description of problem:
A node was created today with the worker label. It was then labeled as a loadbalancer to match the MCP selector. The MCP saw the selector and moved to Updating, but the machine-config-daemon pod isn't responding. We tried deleting the pod and it still didn't pick up that it needed to get a new config. Manually editing the desired config appears to work around the issue (a sketch of that workaround follows the evidence below), but it shouldn't be necessary.
Node created today: [dasmall@supportshell-1 03803880]$ oc get nodes worker-048.kub3.sttlwazu.vzwops.com -o yaml | yq .metadata.creationTimestamp '2024-04-30T17:17:56Z' Node has worker and loadbalancer roles: [dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com NAME STATUS ROLES AGE VERSION worker-048.kub3.sttlwazu.vzwops.com Ready loadbalancer,worker 1h v1.25.14+a52e8df MCP shows a loadbalancer needing Update and 0 nodes in worker pool: [dasmall@supportshell-1 03803880]$ oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE loadbalancer rendered-loadbalancer-1486d925cac5a9366d6345552af26c89 False True False 4 3 3 0 87d master rendered-master-47f6fa5afe8ce8f156d80a104f8bacae True False False 3 3 3 0 87d worker rendered-worker-a6be9fb3f667b76a611ce51811434cf9 True False False 0 0 0 0 87d workerperf rendered-workerperf-477d3621fe19f1f980d1557a02276b4e True False False 38 38 38 0 87d Status shows mcp updating: [dasmall@supportshell-1 03803880]$ oc get mcp loadbalancer -o yaml | yq .status.conditions[4] lastTransitionTime: '2024-04-30T17:33:21Z' message: All nodes are updating to rendered-loadbalancer-1486d925cac5a9366d6345552af26c89 reason: '' status: 'True' type: Updating Node still appears happy with worker MC: [dasmall@supportshell-1 03803880]$ oc get node worker-048.kub3.sttlwazu.vzwops.com -o yaml | grep rendered- machineconfiguration.openshift.io/currentConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9 machineconfiguration.openshift.io/desiredConfig: rendered-worker-a6be9fb3f667b76a611ce51811434cf9 machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9 machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-a6be9fb3f667b76a611ce51811434cf9 machine-config-daemon pod appears idle: [dasmall@supportshell-1 03803880]$ oc logs -n openshift-machine-config-operator machine-config-daemon-wx2b8 -c machine-config-daemon 2024-04-30T17:48:29.868191425Z I0430 17:48:29.868156 19112 start.go:112] Version: v4.12.0-202311220908.p0.gef25c81.assembly.stream-dirty (ef25c81205a65d5361cfc464e16fd5d47c0c6f17) 2024-04-30T17:48:29.871340319Z I0430 17:48:29.871328 19112 start.go:125] Calling chroot("/rootfs") 2024-04-30T17:48:29.871602466Z I0430 17:48:29.871593 19112 update.go:2110] Running: systemctl daemon-reload 2024-04-30T17:48:30.066554346Z I0430 17:48:30.066006 19112 rpm-ostree.go:85] Enabled workaround for bug 2111817 2024-04-30T17:48:30.297743470Z I0430 17:48:30.297706 19112 daemon.go:241] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 (412.86.202311271639-0) 828584d351fcb58e4d799cebf271094d5d9b5c1a515d491ee5607b1dcf6ebf6b 2024-04-30T17:48:30.324852197Z I0430 17:48:30.324543 19112 start.go:101] Copied self to /run/bin/machine-config-daemon on host 2024-04-30T17:48:30.325677959Z I0430 17:48:30.325666 19112 start.go:188] overriding kubernetes api to https://api-int.kub3.sttlwazu.vzwops.com:6443 2024-04-30T17:48:30.326381479Z I0430 17:48:30.326368 19112 metrics.go:106] Registering Prometheus metrics 2024-04-30T17:48:30.326447815Z I0430 17:48:30.326440 19112 metrics.go:111] Starting metrics listener on 127.0.0.1:8797 2024-04-30T17:48:30.327835814Z I0430 17:48:30.327811 19112 writer.go:93] NodeWriter initialized with credentials from /var/lib/kubelet/kubeconfig 2024-04-30T17:48:30.327932144Z I0430 17:48:30.327923 19112 update.go:2125] Starting 
to manage node: worker-048.kub3.sttlwazu.vzwops.com 2024-04-30T17:48:30.332123862Z I0430 17:48:30.332097 19112 rpm-ostree.go:394] Running captured: rpm-ostree status 2024-04-30T17:48:30.332928272Z I0430 17:48:30.332909 19112 daemon.go:1049] Detected a new login session: New session 1 of user core. 2024-04-30T17:48:30.332935796Z I0430 17:48:30.332926 19112 daemon.go:1050] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh 2024-04-30T17:48:30.368619942Z I0430 17:48:30.368598 19112 daemon.go:1298] State: idle 2024-04-30T17:48:30.368619942Z Deployments: 2024-04-30T17:48:30.368619942Z * ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 2024-04-30T17:48:30.368619942Z Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 2024-04-30T17:48:30.368619942Z Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z) 2024-04-30T17:48:30.368619942Z LayeredPackages: kernel-devel kernel-headers 2024-04-30T17:48:30.368619942Z 2024-04-30T17:48:30.368619942Z ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 2024-04-30T17:48:30.368619942Z Digest: sha256:20b4937e8d107af19d8e39329e1767471b78ba6abd07b5a3e328dafd7b146858 2024-04-30T17:48:30.368619942Z Version: 412.86.202311271639-0 (2024-04-30T17:05:27Z) 2024-04-30T17:48:30.368619942Z LayeredPackages: kernel-devel kernel-headers 2024-04-30T17:48:30.368907860Z I0430 17:48:30.368884 19112 coreos.go:54] CoreOS aleph version: mtime=2023-08-08 11:20:41.285 +0000 UTC build=412.86.202308081039-0 imgid=rhcos-412.86.202308081039-0-metal.x86_64.raw 2024-04-30T17:48:30.368932886Z I0430 17:48:30.368926 19112 coreos.go:71] Ignition provisioning: time=2024-04-30T17:03:44Z 2024-04-30T17:48:30.368938120Z I0430 17:48:30.368931 19112 rpm-ostree.go:394] Running captured: journalctl --list-boots 2024-04-30T17:48:30.372893750Z I0430 17:48:30.372884 19112 daemon.go:1307] journalctl --list-boots: 2024-04-30T17:48:30.372893750Z -2 847e119666d9498da2ae1bd89aa4c4d0 Tue 2024-04-30 17:03:13 UTC—Tue 2024-04-30 17:06:32 UTC 2024-04-30T17:48:30.372893750Z -1 9617b204b8b8412fb31438787f56a62f Tue 2024-04-30 17:09:06 UTC—Tue 2024-04-30 17:36:39 UTC 2024-04-30T17:48:30.372893750Z 0 3cbf6edcacde408b8979692c16e3d01b Tue 2024-04-30 17:39:20 UTC—Tue 2024-04-30 17:48:30 UTC 2024-04-30T17:48:30.372912686Z I0430 17:48:30.372891 19112 rpm-ostree.go:394] Running captured: systemctl list-units --state=failed --no-legend 2024-04-30T17:48:30.378069332Z I0430 17:48:30.378059 19112 daemon.go:1322] systemd service state: OK 2024-04-30T17:48:30.378069332Z I0430 17:48:30.378066 19112 daemon.go:987] Starting MachineConfigDaemon 2024-04-30T17:48:30.378121340Z I0430 17:48:30.378106 19112 daemon.go:994] Enabling Kubelet Healthz Monitor 2024-04-30T17:48:31.486786667Z I0430 17:48:31.486747 19112 daemon.go:457] Node worker-048.kub3.sttlwazu.vzwops.com is not labeled node-role.kubernetes.io/master 2024-04-30T17:48:31.491674986Z I0430 17:48:31.491594 19112 daemon.go:1243] Current+desired config: rendered-worker-a6be9fb3f667b76a611ce51811434cf9 2024-04-30T17:48:31.491674986Z I0430 17:48:31.491603 19112 daemon.go:1253] state: Done 2024-04-30T17:48:31.495704843Z I0430 17:48:31.495617 19112 daemon.go:617] Detected a login session before the daemon took over on first boot 2024-04-30T17:48:31.495704843Z I0430 17:48:31.495624 19112 daemon.go:618] Applying annotation: 
machineconfiguration.openshift.io/ssh 2024-04-30T17:48:31.503165515Z I0430 17:48:31.503052 19112 update.go:2110] Running: rpm-ostree cleanup -r 2024-04-30T17:48:32.232728843Z Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1 2024-04-30T17:48:35.755815139Z Freed: 92.3 MB (pkgcache branches: 0) 2024-04-30T17:48:35.764568364Z I0430 17:48:35.764548 19112 daemon.go:1563] Validating against current config rendered-worker-a6be9fb3f667b76a611ce51811434cf9 2024-04-30T17:48:36.120148982Z I0430 17:48:36.120119 19112 rpm-ostree.go:394] Running captured: rpm-ostree kargs 2024-04-30T17:48:36.179660790Z I0430 17:48:36.179631 19112 update.go:2125] Validated on-disk state 2024-04-30T17:48:36.182434142Z I0430 17:48:36.182406 19112 daemon.go:1646] In desired config rendered-worker-a6be9fb3f667b76a611ce51811434cf9 2024-04-30T17:48:36.196911084Z I0430 17:48:36.196879 19112 config_drift_monitor.go:246] Config Drift Monitor started
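For reference, a hedged sketch of the manual workaround mentioned in the description: setting the node's desired config annotation to the loadbalancer rendered config (normally the machine-config controller sets this; the render name is taken from the MCP output above):

apiVersion: v1
kind: Node
metadata:
  name: worker-048.kub3.sttlwazu.vzwops.com
  annotations:
    # normally managed by the machine-config controller; edited manually as a workaround
    machineconfiguration.openshift.io/desiredConfig: rendered-loadbalancer-1486d925cac5a9366d6345552af26c89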
Version-Release number of selected component (if applicable):
4.12.45
How reproducible:
They can reproduce in multiple clusters
Actual results:
Node stays with rendered-worker config
Expected results:
The MachineConfigPool moving to Updating should prompt a change to the node's desired config, which the machine-config-daemon pod then applies to the node.
Additional info:
Here is the latest must-gather where this issue is occurring: https://attachments.access.redhat.com/hydra/rest/cases/03803880/attachments/3fd0cf52-a770-4525-aecd-3a437ea70c9b?usePresignedUrl=true
Description of problem:
Destroying a private cluster doesn't delete the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-multi-2024-10-23-202329
How reproducible:
Always
Steps to Reproduce:
1. Pre-create the VPC network/subnets/router and a bastion host
2. "create install-config", and then insert the network settings under platform.gcp, along with "publish: Internal" (see [1])
3. "create cluster" (use the above bastion host as http proxy)
4. "destroy cluster" (see [2])
Actual results:
Although "destroy cluster" completes successfully, the forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator are not deleted (see [3]), which leads to deleting the vpc network/subnets failure.
Expected results:
The forwarding-rule/backend-service/health-check/firewall-rules created by ingress operator should also be deleted during "destroy cluster".
Additional info:
FYI one history bug https://issues.redhat.com/browse/OCPBUGS-37683
Managed services marks a couple of nodes as "infra" so user workloads don't get scheduled on them. However, platform daemonsets like iptables-alerter should run there – and the typical toleration for that purpose should be:
tolerations:
- operator: Exists
Instead, the toleration is:
tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"
Examples from other platform DS:
$ for ns in openshift-cluster-csi-drivers openshift-cluster-node-tuning-operator openshift-dns openshift-image-registry openshift-machine-config-operator openshift-monitoring openshift-multus openshift-multus openshift-multus openshift-network-diagnostics openshift-network-operator openshift-ovn-kubernetes openshift-security; do echo "NS: $ns"; oc get ds -o json -n $ns|jq '.items.[0].spec.template.spec.tolerations'; done
NS: openshift-cluster-csi-drivers
[ { "operator": "Exists" } ]
NS: openshift-cluster-node-tuning-operator
[ { "operator": "Exists" } ]
NS: openshift-dns
[ { "key": "node-role.kubernetes.io/master", "operator": "Exists" } ]
NS: openshift-image-registry
[ { "operator": "Exists" } ]
NS: openshift-machine-config-operator
[ { "operator": "Exists" } ]
NS: openshift-monitoring
[ { "operator": "Exists" } ]
NS: openshift-multus
[ { "operator": "Exists" } ]
NS: openshift-multus
[ { "operator": "Exists" } ]
NS: openshift-multus
[ { "operator": "Exists" } ]
NS: openshift-network-diagnostics
[ { "operator": "Exists" } ]
NS: openshift-network-operator
[ { "effect": "NoSchedule", "key": "node-role.kubernetes.io/master", "operator": "Exists" } ]
NS: openshift-ovn-kubernetes
[ { "operator": "Exists" } ]
NS: openshift-security
[ { "operator": "Exists" } ]
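For illustration, a minimal sketch of what the blanket toleration would look like on the iptables-alerter DaemonSet pod template (the name and namespace here are assumptions):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: iptables-alerter                   # assumed name
  namespace: openshift-network-operator    # assumed namespace
spec:
  template:
    spec:
      tolerations:
      - operator: Exists                   # tolerate all taints, including the "infra" ones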
The helper doesn't have all the namespaces in it, and we're getting some flakes in CI like this:
batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources
does not have a cpu request (rule: "batch/v1/Job/openshift-backplane-managed-scripts/<batch_job>/container/osd-delete-backplane-script-resources/request[cpu]")
Description of problem:
The machine-os-builder deployment manifest does not set the openshift.io/required-scc annotation, which appears to be required for the upgrade conformance suite to pass. The rest of the MCO components currently set this annotation, and we can probably use the same setting as the Machine Config Controller (which is restricted-v2). What I'm unsure of is whether this also needs to be set on the builder pods and, if so, what the appropriate setting would be for that case.
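A minimal sketch of the annotation in question on the machine-os-builder pod template, assuming the same restricted-v2 value used by the other MCO components:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: machine-os-builder
  namespace: openshift-machine-config-operator
spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: restricted-v2   # value assumed to match the Machine Config Controller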
Version-Release number of selected component (if applicable):
How reproducible:
This always occurs in the new CI jobs, e2e-aws-ovn-upgrade-ocb-techpreview and e2e-aws-ovn-upgrade-ocb-conformance-suite-techpreview. Here are two examples from rehearsal failures:
Steps to Reproduce:
Run either of the aforementioned CI jobs.
Actual results:
Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation fails.
Expected results:
Test [sig-auth] all workloads in ns/openshift-machine-config-operator must set the 'openshift.io/required-scc' annotation should pass.
Additional info:
Description of problem:
Under some circumstances (not clear exactly which ones), the OVN databases of 2 nodes ended up having 2 src-ip static routes in ovn_cluster_router instead of one: one of them points to the correct IP of the rtoj-GR_${NODE_NAME} LRP and one points to a wrong IP on the join subnet (that IP is not used in any other LRP or LSP).
Both static routes are taken into consideration while routing traffic out from the cluster, so packets that use the right route are able to egress while the packets that use the wrong route are dropped.
Version-Release number of selected component (if applicable):
Reproduced in 4.14.20
How reproducible:
Reproduced at least once, on only 2 nodes of the cluster.
Steps to Reproduce:
(Not sure, it was just found after investigation of strange packet drop)
Actual results:
Wrong static route to some non-existent IP in the join subnet. Intermittent packet drop.
Expected results:
No wrong static routes. No packet drop.
Additional info:
This can be worked around by wiping the OVN databases of the impacted node.
Our unit test runtime is slow. It seems to run anywhere from ~16-20 minutes locally. On CI it can take at least 30 minutes to run. Investigate whether or not any changes can be made to improve the unit test runtime.
This issue tracks updating k8s and related OpenShift APIs to a recent version, to keep in line with other MAPI providers.
Description of problem:
The console plugin details page throws an error for some specific YAML
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-30-141716
How reproducible:
Always
Steps to Reproduce:
1. Create a ConsolePlugin with the minimum required fields:

   apiVersion: console.openshift.io/v1
   kind: ConsolePlugin
   metadata:
     name: console-demo-plugin-two
   spec:
     backend:
       type: Service
     displayName: OpenShift Console Demo Plugin

2. Visit the ConsolePlugin details page at /k8s/cluster/console.openshift.io~v1~ConsolePlugin/console-demo-plugin
Actual results:
2. We will see an error page
Expected results:
2. We should not show an error page, since the ConsolePlugin YAML has every required field even though it is not complete
Additional info:
This is a clone of issue OCPBUGS-45859. The following is the description of the original issue:
—
The following test is failing more than expected:
Undiagnosed panic detected in pod
See the sippy test details for additional context.
Observed in 4.18-e2e-azure-ovn/1864410356567248896 as well as pull-ci-openshift-installer-master-e2e-azure-ovn/1864312373058211840
: Undiagnosed panic detected in pod { pods/openshift-cloud-controller-manager_azure-cloud-controller-manager-5788c6f7f9-n2mnh_cloud-controller-manager_previous.log.gz:E1204 22:27:54.558549 1 iface.go:262] "Observed a panic" panic="interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.EndpointSlice" panicGoValue="&runtime.TypeAssertionError{_interface:(*abi.Type)(0x291daa0), concrete:(*abi.Type)(0x2b73880), asserted:(*abi.Type)(0x2f5cc20), missingMethod:\"\"}" stacktrace=<}
Description of problem:
Previously, failed task runs did not emit results. Now they do, but the UI still shows "No TaskRun results available due to failure" even though the task run's status contains a result.
Version-Release number of selected component (if applicable):
4.14.3
How reproducible:
Always with a task run producing a result but failing afterwards
Steps to Reproduce:
1. Create the pipeline run from the Pipeline below 2. Have a look at its task run
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: hello-pipeline
spec:
  tasks:
    - name: hello
      taskSpec:
        results:
          - name: greeting1
        steps:
          - name: greet
            image: registry.access.redhat.com/ubi8/ubi-minimal
            script: |
              #!/usr/bin/env bash
              set -e
              echo -n "Hello World!" | tee $(results.greeting1.path)
              exit 1
  results:
    - name: greeting2
      value: $(tasks.hello.results.greeting1)
Actual results:
No results in UI
Expected results:
One result should be displayed even though task run failed
Additional info:
Pipelines 1.13.0
Description of problem:
While upgrading the Fusion operator, the IBM team is facing the following error in the operator's subscription:

error validating existing CRs against new CRD's schema for "fusionserviceinstances.service.isf.ibm.com": error validating service.isf.ibm.com/v1, Kind=FusionServiceInstance "ibm-spectrum-fusion-ns/odfmanager": updated validation is too restrictive: [].status.triggerCatSrcCreateStartTime: Invalid value: "number": status.triggerCatSrcCreateStartTime in body must be of type integer: "number"

Question here: "triggerCatSrcCreateStartTime" has been present in the operator for the past few releases, and its datatype (integer) hasn't changed in the latest release either. There was one "FusionServiceInstance" CR present in the cluster when this issue was hit, and the value of the "triggerCatSrcCreateStartTime" field was "1726856593000774400".
Version-Release number of selected component (if applicable):
It impacts upgrades between OCP 4.16.7 and OCP 4.16.14.
How reproducible:
Always
Steps to Reproduce:
1. Upgrade the Fusion operator (OCP version 4.16.7 to OCP 4.16.14) 2. 3.
Actual results:
Upgrade fails with error in description
Expected results:
The upgrade should not fail
Additional info:
The aks-e2e test keeps failing on the CreateClusterV2 test because the `ValidReleaseInfo` condition is not set. The patch that sets this status keeps failing. Investigate why & provide a fix.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
every time
Steps to Reproduce:
1. Create the dashboard with a bar chart and sort query result asc. 2. 3.
Actual results:
bar goes outside of the border
Expected results:
The bar should not go outside of the border
Additional info:
screenshot: https://drive.google.com/file/d/1xPRgenpyCxvUuWcGiWzmw5kz51qKLHyI/view?usp=drive_link
Description of problem:
Trying to set up a disconnected HCP cluster with a self-managed image registry. After the cluster is installed, all the imagestreams fail to import images, with the error:

```
Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client
```

The imagestream import talks to openshift-apiserver, which fetches the image target there. After logging in to the HCP namespace, we figured out that we cannot access any external network with the https protocol.
Version-Release number of selected component (if applicable):
4.14.35
How reproducible:
always
Steps to Reproduce:
1. Install the hypershift hosted cluster with the above setup
2. The cluster can be created successfully and all the pods on the cluster are running with the expected images pulled
3. Check the internal image-registry
4. Check the openshift-apiserver pod from the management cluster
Actual results:
All the imagestreams failed to sync from the remote registry. $ oc describe is cli -n openshift Name: cli Namespace: openshift Created: 6 days ago Labels: <none> Annotations: include.release.openshift.io/ibm-cloud-managed=true include.release.openshift.io/self-managed-high-availability=true openshift.io/image.dockerRepositoryCheck=2024-11-06T22:12:32Z Image Repository: image-registry.openshift-image-registry.svc:5000/openshift/cli Image Lookup: local=false Unique Images: 0 Tags: 1latest updates automatically from registry quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d ! error: Import failed (InternalError): Internal error occurred: [122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-1@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-2@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-3@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-4@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, 122610517469.dkr.ecr.us-west-2.amazonaws.com/ocp-mirror-5@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/": http: server gave HTTP response to HTTPS client, quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:49baeac68e90026799d0b62609e04adf285be5b83bdb5dbd372de2b14442be5d: Get "https://quay.io/v2/": http: server gave HTTP response to HTTPS client] Access the external network from the openshift-apiserver pod: sh-5.1$ curl --connect-timeout 5 https://quay.io/v2 curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received sh-5.1$ curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/ curl: (28) Operation timed out after 5001 milliseconds with 0 out of 0 bytes received sh-5.1$ env | grep -i http.*proxy HTTPS_PROXY=http://127.0.0.1:8090 HTTP_PROXY=http://127.0.0.1:8090
Expected results:
The openshift-apiserver should be able to talk to the remote https services.
Additional info:
It works after setting the registry in NO_PROXY:

sh-5.1$ NO_PROXY=122610517469.dkr.ecr.us-west-2.amazonaws.com curl --connect-timeout 5 https://122610517469.dkr.ecr.us-west-2.amazonaws.com/v2/
Not Authorized
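A hedged sketch of how that workaround might be expressed declaratively, assuming the hosted cluster's proxy configuration is carried in the HostedCluster resource (the resource name and namespace are placeholders; only the noProxy addition is the point):

apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: my-hosted-cluster   # placeholder
  namespace: clusters       # placeholder
spec:
  configuration:
    proxy:
      noProxy: 122610517469.dkr.ecr.us-west-2.amazonaws.com   # the self-managed registry host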
Refactor the name to Dockerfile.ocp as a better, version-independent alternative
Description of problem:
Additional network is not correctly configured on the secondary interface inside the masters and the workers.
With install-config.yaml with this section:
# This file is autogenerated by infrared openshift plugin
apiVersion: v1
baseDomain: "shiftstack.local"
compute:
- name: worker
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9']
      type: "worker"
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['26a751c3-c316-483c-91ed-615702bcbba9']
      type: "master"
  replicas: 3
metadata:
  name: "ostest"
networking:
  clusterNetworks:
  - cidr: fd01::/48
    hostPrefix: 64
  serviceNetwork:
  - fd02::/112
  machineNetwork:
  - cidr: "fd2e:6f44:5dd8:c956::/64"
  networkType: "OVNKubernetes"
platform:
  openstack:
    cloud: "shiftstack"
    region: "regionOne"
    defaultMachinePlatform:
      type: "master"
    apiVIPs: ["fd2e:6f44:5dd8:c956::5"]
    ingressVIPs: ["fd2e:6f44:5dd8:c956::7"]
    controlPlanePort:
      fixedIPs:
      - subnet:
          name: "subnet-ssipv6"
pullSecret: |
  {"auths": {"installer-host.example.com:8443": {"auth": "ZHVtbXkxMjM6ZHVtbXkxMjM="}}}
sshKey: <hidden>
additionalTrustBundle: <hidden>
imageContentSources:
- mirrors:
  - installer-host.example.com:8443/registry
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - installer-host.example.com:8443/registry
  source: registry.ci.openshift.org/ocp/release
The installation works. However, the additional network is not configured on the masters or the workers, which leads in our case to faulty manila integration.
In the journal of all OCP nodes, logs like the one below (from master-0) are observed repeatedly:
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9667] device (enp4s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <warn> [1731590504.9672] device (enp4s0): Activation: failed for connection 'Wired connection 1'
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9674] device (enp4s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9768] dhcp4 (enp4s0): canceled DHCP transaction
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9768] dhcp4 (enp4s0): activation: beginning transaction (timeout in 45 seconds)
Nov 14 13:21:44 ostest-kmmtt-master-0 NetworkManager[1126]: <info> [1731590504.9768] dhcp4 (enp4s0): state changed no lease
Where that server has specifically an interface connected to the subnet "StorageNFSSubnet":
$ openstack server list | grep master-0 | da23da4a-4af8-4e54-ac60-88d6db2627b6 | ostest-kmmtt-master-0 | ACTIVE | StorageNFS=fd00:fd00:fd00:5000::fb:d8; network-ssipv6=fd2e:6f44:5dd8:c956::2e4 | ostest-kmmtt-rhcos | master |
That subnet is defined in openstack as dhcpv6-stateful:
$ openstack subnet show StorageNFSSubnet
+----------------------+-------------------------------------------------------+
| Field                | Value                                                 |
+----------------------+-------------------------------------------------------+
| allocation_pools     | fd00:fd00:fd00:5000::fb:10-fd00:fd00:fd00:5000::fb:fe |
| cidr                 | fd00:fd00:fd00:5000::/64                              |
| created_at           | 2024-11-13T12:34:41Z                                  |
| description          |                                                       |
| dns_nameservers      |                                                       |
| dns_publish_fixed_ip | None                                                  |
| enable_dhcp          | True                                                  |
| gateway_ip           | None                                                  |
| host_routes          |                                                       |
| id                   | 480d7b2a-915f-4f0c-9717-90c55b48f912                  |
| ip_version           | 6                                                     |
| ipv6_address_mode    | dhcpv6-stateful                                       |
| ipv6_ra_mode         | dhcpv6-stateful                                       |
| name                 | StorageNFSSubnet                                      |
| network_id           | 26a751c3-c316-483c-91ed-615702bcbba9                  |
| prefix_length        | None                                                  |
| project_id           | 4566c393806c43b9b4e9455ebae1cbb6                      |
| revision_number      | 0                                                     |
| segment_id           | None                                                  |
| service_types        | None                                                  |
| subnetpool_id        | None                                                  |
| tags                 |                                                       |
| updated_at           | 2024-11-13T12:34:41Z                                  |
+----------------------+-------------------------------------------------------+
I also compared with an IPv4 installation, where the StorageNFSSubnet IP is successfully configured on enp4s0.
Version-Release number of selected component (if applicable):
How reproducible: Always
Additional info: must-gather and journal of the OCP nodes provided in private comment.
Description of problem:
The 'Plus' button in the 'Edit Pod Count' popup window overlaps the input field, which is incorrect.
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-05-103644
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Workloads -> ReplicaSets page, choose one resource, click the kebab menu button, and choose 'Edit Pod count' 2. 3.
Actual results:
The Layout is incorrect
Expected results:
The 'Plus' button in the 'Edit Pod Count' popup window should not overlap the input field
Additional info:
Snapshot: https://drive.google.com/file/d/1mL7xeT7FzkdsM1TZlqGdgCP5BG6XA8uh/view?usp=drive_link https://drive.google.com/file/d/1qmcal_4hypEPjmG6PTG11AJPwdgt65py/view?usp=drive_link
Description of problem:
The console shows 'View release notes' in several places, but the current link only points to the Y-stream release's main release notes
Version-Release number of selected component (if applicable):
4.17.2
How reproducible:
Always
Steps to Reproduce:
1. set up 4.17.2 cluster 2. navigate to Cluster Settings page, check 'View release note' link in 'Update history' table
Actual results:
The link only points the user to the Y-stream release's main release notes
Expected results:
The link should point to the release notes of the specific version. The correct link should be https://access.redhat.com/documentation/en-us/openshift_container_platform/${major}.${minor}/html/release_notes/ocp-${major}-${minor}-release-notes#ocp-${major}-${minor}-${patch}_release_notes
Additional info:
Description of problem:
Sippy complains about pathological events in ns/openshift-cluster-csi-drivers in vsphere-ovn-serial jobs. See this job as one example.
Jan noticed that the DaemonSet generation is 10-12, while in 4.17 it is 2. Why is our operator updating the DaemonSet so often?
I wrote a quick "one-liner" to generate json diffs from the vmware-vsphere-csi-driver-operator logs:
prev=''; grep 'DaemonSet "openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-node" changes' openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-operator-5b79c58f6f-hpr6g_vmware-vsphere-csi-driver-operator.log | sed 's/^.*changes: //' | while read -r line; do diff <(echo $prev | jq .) <(echo $line | jq .); prev=$line; echo "####"; done
It really seems to be only operator.openshift.io/spec-hash and operator.openshift.io/dep-* fields changing in the json diffs:
####
4,5c4,5
<     "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
<     "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
---
>     "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
>     "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
13c13
<     "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
---
>     "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
####
4,5c4,5
<     "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A==",
<     "operator.openshift.io/spec-hash": "27a1bab0c00ace8ac21d95a5fe9a089282e7b2b3ec042045951bd5e26ae01a09"
---
>     "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q==",
>     "operator.openshift.io/spec-hash": "fb274874404ad6706171c6774a369876ca54e037fcccc200c0ebf3019a600c36"
13c13
<     "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "AFeN-A=="
---
>     "operator.openshift.io/dep-1b5c921175cca7ab09ea7d1d58e35428291b8": "MZ-w-Q=="
####
The deployment is also changing in the same way. We need to find what is causing the spec-hash and dep-* fields to change and avoid the unnecessary churn that causes new daemonset / deployment rollouts.
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
~20% failure rate in 4.18 vsphere-ovn-serial jobs
Steps to Reproduce:
Actual results:
operator rolls out unnecessary daemonset / deployment changes
Expected results:
don't roll out changes unless there is a spec change
Additional info:
Please review the following PR: https://github.com/openshift/coredns/pull/130
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
A customer is trying to install a self-managed OCP cluster in AWS. This customer uses an AWS VPC DHCP option set which has a trailing dot (.) at the end of the domain name. Due to this setting, the master nodes' hostnames also have a trailing dot, and this causes the OpenShift installation to fail.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create an AWS VPC with a DHCP option set where the domain name has a trailing dot. 2. Try an IPI installation of the cluster.
Actual results:
The OpenShift installer should be allowed to create AWS master nodes where the domain has a trailing dot (.).
Expected results:
Additional info:
Description of problem:
Unit tests for openshift/builder are permanently failing for v4.18
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Always
Steps to Reproduce:
1. Run PR against openshift/builder
Actual results:
Test fails: --- FAIL: TestUnqualifiedClone (0.20s) source_test.go:171: unable to add submodule: "Cloning into '/tmp/test-unqualified335202210/sub'...\nfatal: transport 'file' not allowed\nfatal: clone of 'file:///tmp/test-submodule643317239' into submodule path '/tmp/test-unqualified335202210/sub' failed\n" source_test.go:195: unable to find submodule dir panic: runtime error: invalid memory address or nil pointer dereference [recovered] panic: runtime error: invalid memory address or nil pointer dereference
Expected results:
Tests pass
Additional info:
Example: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_builder/401/pull-ci-openshift-builder-master-unit/1853816128913018880
Description of problem:
We have recently enabled a few endpoint overrides, but ResourceManager was accidentally excluded.
Description of problem:
Installing 4.17 agent-based hosted cluster on bare-metal with IPv6 stack in disconnected environment. We cannot install MetalLB operator on the hosted cluster to expose openshift router and handle ingress because the openshift-marketplace pods that extract the operator bundle and the relative pods are in Error state. They try to execute the following command but cannot reach the cluster apiserver: opm alpha bundle extract -m /bundle/ -n openshift-marketplace -c b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1 -z INFO[0000] Using in-cluster kube client config Error: error loading manifests from directory: Get "https://[fd02::1]:443/api/v1/namespaces/openshift-marketplace/configmaps/b5a818607a7a162d7f9a13695046d44e47d8127a45cad69c0d8271b2da945b1": dial tcp [fd02::1]:443: connect: connection refused In our hosted cluster fd02::1 is the clusterIP of the kubernetes service and the endpoint associated to the service is [fd00::1]:6443. By debugging the pods we see that connection to clusterIP is refused but if we try to connect to its endpoint the connection is established and we get 403 Forbidden: sh-5.1$ curl -k https://[fd02::1]:443 curl: (7) Failed to connect to fd02::1 port 443: Connection refused sh-5.1$ curl -k https://[fd00::1]:6443 { "kind": "Status", "apiVersion": "v1", "metadata": {}, "status": "Failure", "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"", "reason": "Forbidden", "details": {}, "code": 403 This issue is happening also in other pods in the hosted cluster which are in Error or in CrashLoopBackOff, we have similar error in their logs, e.g.: F1011 09:11:54.129077 1 cmd.go:162] failed checking apiserver connectivity: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-service-ca-operator/leases/service-ca-operator-lock": dial tcp [fd02::1]:443: connect: connection refused IPv6 disconnected 4.16 hosted cluster with same configuration was installed successfully and didn't show this issue, and neither IPv4 disconnected 4.17. So the issue is with IPv6 stack only.
Version-Release number of selected component (if applicable):
Hub cluster: 4.17.0-0.nightly-2024-10-10-004834 MCE 2.7.0-DOWNANDBACK-2024-09-27-14-52-56 Hosted cluster: version 4.17.1 image: registry.ci.openshift.org/ocp/release@sha256:e16ac60ac6971e5b6f89c1d818f5ae711c0d63ad6a6a26ffe795c738e8cc4dde
How reproducible:
100%
Steps to Reproduce:
1. Install MCE 2.7 on a 4.17 IPv6 disconnected BM hub cluster
2. Install a 4.17 agent-based hosted cluster and scale up the nodepool
3. After the worker nodes are installed, attempt to install the MetalLB operator to handle ingress
Actual results:
MetalLB operator cannot be installed because pods cannot connect to the cluster apiserver.
Expected results:
Pods in the cluster can connect to apiserver.
Additional info:
Description of problem:
During the EUS-to-EUS upgrade of an MNO cluster from 4.14.16 to 4.16.11 on bare metal, we have seen that, depending on the custom configuration (for example a performance profile or container runtime config), one or more control plane nodes are rebooted multiple times. This seems to be a race condition: when the first rendered MachineConfig is generated, the first control plane node starts rebooting (maxUnavailable is set to 1 on the master MCP), and at this moment a new rendered MachineConfig is generated, which means a second reboot. Once this first node has rebooted the second time, the rest of the control plane nodes are rebooted just once, because no more new rendered MachineConfigs are generated.
Version-Release number of selected component (if applicable):
OCP 4.14.16 > 4.15.31 > 4.16.11
How reproducible:
Perform the upgrade of a Multi Node OCP with a custom configuration like a performance profile or container runtime configuration (like force cgroups v1, or update runc to crun)
Steps to Reproduce:
1. Deploy on bare metal an MNO OCP 4.14 with a custom manifest like the one below:

   apiVersion: config.openshift.io/v1
   kind: Node
   metadata:
     name: cluster
   spec:
     cgroupMode: v1

2. Upgrade the cluster to the next minor version available, for instance 4.15.31; make it a partial upgrade by pausing the worker Machine Config Pool.
3. Monitor the upgrade process (cluster operators, MachineConfigs, Machine Config Pools, and nodes).
Actual results:
You will see that once almost all the cluster operators are at version 4.15.31 (except the Machine Config Operator), if you review the MachineConfig renders generated for the master Machine Config Pool and monitor the nodes, a new MachineConfig render is generated after the first control plane node has already been rebooted.
Expected results:
What is expected is that in an upgrade only one MachineConfig render is generated per Machine Config Pool, and only one reboot per node is needed to finish the upgrade.
Additional info:
Description of problem:
1. The client cannot connect to the kube-apiserver via the kubernetes svc, as the kubernetes svc IP is not in the cert SANs.
2. The kube-apiserver-operator generates the apiserver certs and inserts the kubernetes svc IP from the network CR status.ServiceNetwork.
3. When the temporary control plane is down and the network CR is not ready yet, the client will not be able to connect to the apiserver again.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. I have only hit this under very rare conditions, especially when the machine performance is poor 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When deploying a disconnected cluster, creating the ISO with "openshift-install agent create image" fails (authentication required) when the release image resides in a secured local registry. The issue is this: openshift-install generates a registry config out of the install-config.yaml, which only contains the local registry credentials (disconnected deploy), but it does not create an ICSP file to get the image from the local registry.
Version-Release number of selected component (if applicable):
How reproducible:
Run an agent-based ISO image creation of a disconnected cluster. Choose a version (nightly) whose image is in a secured registry (such as registry.ci). It will fail with "authentication required".
Steps to Reproduce:
1. openshift-install agent create image 2. 3.
Actual results:
Fails with "authentication required"
Expected results:
The ISO should be created
Additional info:
Description of problem:
A dynamic plugin in Pending status will block the console plugins tab page from loading
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-27-162407
How reproducible:
Always
Steps to Reproduce:
1. Create a dynamic plugin which will be in Pending status; we can create it from the file https://github.com/openshift/openshift-tests-private/blob/master/frontend/fixtures/plugin/pending-console-demo-plugin-1.yaml
2. Enable the 'console-demo-plugin-1' plugin and navigate to the Console plugins tab at /k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins
Actual results:
2. page will always be loading
Expected results:
2. console plugins list table should be displayed
Additional info:
Description of problem:
'Channel' and 'Version' dropdowns do not collapse if the user does not select an option
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-12-04-113014
How reproducible:
Always
Steps to Reproduce:
1. Navigate to the Operator installation page OR the Operator install details page, e.g.:
   /operatorhub/ns/openshift-console?source=["Red+Hat"]&details-item=datagrid-redhat-operators-openshift-marketplace&channel=stable&version=8.5.4
   /operatorhub/subscribe?pkg=datagrid&catalog=redhat-operators&catalogNamespace=openshift-marketplace&targetNamespace=openshift-console&channel=stable&version=8.5.4&tokenizedAuth=
2. Click the Channel/Update channel OR 'Version' dropdown list
3. Click the dropdown again
Actual results:
The dropdown list does not collapse unless the user selects an option or clicks somewhere else
Expected results:
The dropdown should collapse after clicking it again
Additional info:
Description of problem:
The --report and --pxe flags were introduced in 4.18. They should be marked as experimental until 4.19.
Version-Release number of selected component (if applicable):
4.18
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We should expand upon our current pre-commit hooks.
This will help prevent errors before code makes it to GitHub and CI.
This is a clone of issue OCPBUGS-41727. The following is the description of the original issue:
—
Original bug title:
cert-manager [v1.15 Regression] Failed to issue certs with ACME Route53 dns01 solver in AWS STS env
Description of problem:
When using Route53 as the dns01 solver to create certificates, it fails in both automated and manual tests. For the full log, please refer to the "Actual results" section.
Version-Release number of selected component (if applicable):
cert-manager operator v1.15.0 staging build
How reproducible:
Always
Steps to Reproduce: also documented in gist
1. Install the cert-manager operator 1.15.0
2. Follow the doc to authenticate the operator with AWS STS using ccoctl: https://docs.openshift.com/container-platform/4.16/security/cert_manager_operator/cert-manager-authenticate.html#cert-manager-configure-cloud-credentials-aws-sts_cert-manager-authenticate
3. Create an ACME issuer with the Route53 dns01 solver (a sketch of such an issuer follows below)
4. Create a cert using the created issuer
OR:
Refer by running `/pj-rehearse pull-ci-openshift-cert-manager-operator-master-e2e-operator-aws-sts` on https://github.com/openshift/release/pull/59568
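For step 3, a minimal sketch of such an issuer, assuming the well-known cert-manager ACME/Route53 fields (the names and region below are placeholders, not values from this bug):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: acme-route53           # placeholder name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: acme-account-key   # placeholder secret name
    solvers:
    - dns01:
        route53:
          region: us-east-1    # placeholder region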
Actual results:
1. The certificate is not Ready.
2. The challenge of the cert is stuck in the pending status:

   PresentError: Error presenting challenge: failed to change Route 53 record set: operation error Route 53: ChangeResourceRecordSets, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region
Expected results:
The certificate should be Ready. The challenge should succeed.
Additional info:
The only way to get it working again seems to be injecting the "AWS_REGION" environment variable into the controller pod. See upstream discussion/change:
I couldn't find a way to inject the env var into our operator-managed operands, so I only verified this workaround using the upstream build v1.15.3. After applying the patch with the following command, the challenge succeeded and the certificate became Ready.
oc patch deployment cert-manager -n cert-manager \ --patch '{"spec": {"template": {"spec": {"containers": [{"name": "cert-manager-controller", "env": [{"name": "AWS_REGION", "value": "aws-global"}]}]}}}}'
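For readability, the same workaround expressed as the resulting Deployment env entry (a sketch of what the patch above produces, not something the operator currently manages):

spec:
  template:
    spec:
      containers:
      - name: cert-manager-controller
        env:
        - name: AWS_REGION
          value: aws-global   # the value used in the oc patch workaround above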
Open Questions
Description of problem:
The manila controller[1] defines labels that are not based on the asset prefix defined in the manila config[2]. Consequently, when assets that select this resource are generated, they use the asset prefix as a base to define the label, resulting in the controller not being selected, for example in the pod anti-affinity[3] and the controller PDB[4]. We need to change the labels used in the selectors to match the actual labels of the controller.
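To illustrate the mismatch (all label values here are hypothetical, not taken from the manila assets): the selectors generated from the asset prefix must be changed to match the labels the controller pods actually carry, e.g. in a PDB:

# Labels actually set on the controller pods (hypothetical value):
#   app: openstack-manila-csi-controllerplugin
# Selector derived from the asset prefix (hypothetical, does not match):
#   app: manila-csi-driver-controller
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: manila-csi-driver-controller-pdb   # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: openstack-manila-csi-controllerplugin   # must match the controller's real labels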
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
[Azure disk/file CSI driver] Volume provisioning fails on ARO HCP
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2024-12-13-083421
How reproducible:
Always
Steps to Reproduce:
1. Install an AKS cluster on Azure.
2. Install the hypershift operator on the AKS cluster.
3. Use the hypershift CLI to create a hosted cluster with the Client Certificate mode.
4. Check that the Azure disk/file CSI drivers work well on the hosted cluster.
Actual results:
In step 4: the the azure disk/file csi dirver provision volume failed on hosted cluster # azure disk pvc provision failed $ oc describe pvc mypvc ... Normal WaitForFirstConsumer 74m persistentvolume-controller waiting for first consumer to be created before binding Normal Provisioning 74m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073 External provisioner is provisioning volume for claim "default/mypvc" Warning ProvisioningFailed 74m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_2334468f-9d27-4bdd-a53c-27271ee60073 failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF Warning ProvisioningFailed 71m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8 failed to provision volume with StorageClass "managed-csi": rpc error: code = Unavailable desc = error reading from server: EOF Normal Provisioning 71m disk.csi.azure.com_azure-disk-csi-driver-controller-74d944bbcb-7zz89_28ba5ad9-c4f8-4dc8-be40-c80c546b7ef8 External provisioner is provisioning volume for claim "default/mypvc" ... $ oc logs azure-disk-csi-driver-controller-74d944bbcb-7zz89 -c csi-driver W1216 08:07:04.282922 1 main.go:89] nodeid is empty I1216 08:07:04.290689 1 main.go:165] set up prometheus server on 127.0.0.1:8201 I1216 08:07:04.291073 1 azuredisk.go:213] DRIVER INFORMATION: ------------------- Build Date: "2024-12-13T02:45:35Z" Compiler: gc Driver Name: disk.csi.azure.com Driver Version: v1.29.11 Git Commit: 4d21ae15d668d802ed5a35068b724f2e12f47d5c Go Version: go1.23.2 (Red Hat 1.23.2-1.el9) X:strictfipsruntime Platform: linux/amd64 Topology Key: topology.disk.csi.azure.com/zone I1216 08:09:36.814776 1 utils.go:77] GRPC call: /csi.v1.Controller/CreateVolume I1216 08:09:36.814803 1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":""}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":""}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","parameters":{"csi.storage.k8s.io/pv/name":"pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316","csi.storage.k8s.io/pvc/name":"mypvc","csi.storage.k8s.io/pvc/namespace":"default","skuname":"Premium_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":7}}]} I1216 08:09:36.815338 1 controllerserver.go:208] begin to create azure disk(pvc-d6af3900-ec5b-4e09-83d6-d0e112b02316) account type(Premium_LRS) rg(ci-op-zj9zc4gd-12c20-rg) location(centralus) size(1) diskZone() maxShares(0) panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x190c61d] goroutine 153 [running]: sigs.k8s.io/cloud-provider-azure/pkg/provider.(*ManagedDiskController).CreateManagedDisk(0x0, {0x2265cf0, 0xc0001285a0}, 0xc0003f2640) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/sigs.k8s.io/cloud-provider-azure/pkg/provider/azure_managedDiskController.go:127 +0x39d sigs.k8s.io/azuredisk-csi-driver/pkg/azuredisk.(*Driver).CreateVolume(0xc000564540, {0x2265cf0, 0xc0001285a0}, 0xc000272460) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/azuredisk/controllerserver.go:297 +0x2c59 github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1({0x2265cf0?, 0xc0001285a0?}, {0x1e5a260?, 0xc000272460?}) 
/go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6420 +0xcb sigs.k8s.io/azuredisk-csi-driver/pkg/csi-common.logGRPC({0x2265cf0, 0xc0001285a0}, {0x1e5a260, 0xc000272460}, 0xc00017cb80, 0xc00014ea68) /go/src/github.com/openshift/azure-disk-csi-driver/pkg/csi-common/utils.go:80 +0x409 github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x1f3e440, 0xc000564540}, {0x2265cf0, 0xc0001285a0}, 0xc00029a700, 0x2084458) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/github.com/container-storage-interface/spec/lib/go/csi/csi.pb.go:6422 +0x143 google.golang.org/grpc.(*Server).processUnaryRPC(0xc00059cc00, {0x2265cf0, 0xc000128510}, {0x2270d60, 0xc0004f5980}, 0xc000308480, 0xc000226a20, 0x31c8f80, 0x0) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1379 +0xdf8 google.golang.org/grpc.(*Server).handleStream(0xc00059cc00, {0x2270d60, 0xc0004f5980}, 0xc000308480) /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1790 +0xe8b google.golang.org/grpc.(*Server).serveStreams.func2.1() /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1029 +0x7f created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 16 /go/src/github.com/openshift/azure-disk-csi-driver/vendor/google.golang.org/grpc/server.go:1040 +0x125 # azure file pvc provision failed $ oc describe pvc mypvc Name: mypvc Namespace: openshift-cluster-csi-drivers StorageClass: azurefile-csi Status: Pending Volume: Labels: <none> Annotations: volume.beta.kubernetes.io/storage-provisioner: file.csi.azure.com volume.kubernetes.io/storage-provisioner: file.csi.azure.com Finalizers: [kubernetes.io/pvc-protection] Capacity: Access Modes: VolumeMode: Filesystem Used By: <none> Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal ExternalProvisioning 14s (x2 over 14s) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'file.csi.azure.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered. Normal Provisioning 7s (x4 over 14s) file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0 External provisioner is provisioning volume for claim "openshift-cluster-csi-drivers/mypvc" Warning ProvisioningFailed 7s (x4 over 14s) file.csi.azure.com_azure-file-csi-driver-controller-879f56577-5hjn8_38c8218e-e52c-4248-ada7-268742afaac0 failed to provision volume with StorageClass "azurefile-csi": rpc error: code = Internal desc = failed to ensure storage account: could not list storage accounts for account type Standard_LRS: StorageAccountClient is nil
Expected results:
In step 4, the Azure disk/file CSI driver should provision volumes successfully on the hosted cluster
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Delete the openshift-monitoring/monitoring-plugin-cert secret; SCO will re-create a new one with different content
Actual results:
- monitoring-plugin is still using the old cert content.
- If the cluster doesn't show much activity, the hash may take time to be updated.
Expected results:
CMO should detect that exact change and run a sync to recompute and set the new hash.
Additional info:
- We shouldn't rely on another change to trigger the sync loop.
- CMO should maybe watch that secret? (Its name isn't known in advance.)
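One common pattern (an illustration only, not necessarily how CMO implements this) is to stamp a hash of the mounted secret's content into the pod template annotations, so that any content change forces a new rollout:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: monitoring-plugin
  namespace: openshift-monitoring
spec:
  template:
    metadata:
      annotations:
        # recomputed by the operator whenever the secret content changes (illustrative key and value)
        checksum/monitoring-plugin-cert: 3f786850e387550fdab836ed7e6dc881de23001b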
Description of problem:
When updating cypress-axe, new changes and bugfixes in the axe-core accessibility auditing package have surfaced various accessibility violations that have to be addressed
Version-Release number of selected component (if applicable):
OpenShift 4.18.0
How reproducible:
always
Steps to Reproduce:
1. Update axe-core and cypress-axe to the latest versions
2. Run test-cypress-console and run a cypress test; I used other-routes.cy.ts
Actual results:
The tests fail with various accessibility violations
Expected results:
The tests pass without accessibility violations
Additional info:
As a maintainer of the SNO CI lane, I would like to ensure that the following test doesn't fail regularly as part of SNO CI.
[sig-architecture] platform pods in ns/openshift-e2e-loki should not exit an excessive amount of times
This issue is a symptom of a greater problem with SNO, where there is downtime resolving DNS after the upgrade reboot: the DNS operator has an outage while it is deploying the new DNS pods. During that time, loki exits after hitting the following error:
2024/10/23 07:21:32 OIDC provider initialization failed: Get "https://sso.redhat.com/auth/realms/redhat-external/.well-known/openid-configuration": dial tcp: lookup sso.redhat.com on 172.30.0.10:53: read udp 10.128.0.4:53104->172.30.0.10:53: read: connection refused
This issue is important because it can contribute to payload rejection in our blocking CI jobs.
Acceptance Criteria:
Description of problem:
Bare metal UPI cluster nodes lose communication with other nodes, and this affects the pod communication on these nodes as well. The issue can be fixed with an OVN rebuild of the DBs on the nodes hitting the issue, but eventually the nodes degrade and lose communication again. Note that despite an OVN rebuild fixing the issue temporarily, host networking is set to true, so the kernel routing table is being used. **Update: observed on vSphere with the routingViaHost: false, ipForwarding: global configuration as well.
Version-Release number of selected component (if applicable):
4.14.7, 4.14.30
How reproducible:
Can't reproduce locally but reproducible and repeatedly occurring in customer environment
Steps to Reproduce:
Identify a host node whose pods can't be reached from other hosts in default namespaces (tested via openshift-dns). Observe that curls to that peer pod consistently time out. TCP dumps to the target pod show that packets are arriving and are acknowledged, but never route back to the client pod successfully (SYN/ACK seen at the pod network layer, not at geneve; so dropped before hitting the geneve tunnel).
Actual results:
Nodes will repeatedly degrade and lose communication despite fixing the issue with an OVN db rebuild (the rebuild only provides hours/days of respite, not a permanent resolution).
Expected results:
Nodes should not be losing communication and even if they did it should not happen repeatedly
Additional info:
What's been tried so far
========================
- Multiple OVN rebuilds on different nodes (works but node will eventually hit issue again)
- Flushing the conntrack (doesn't work)
- Restarting nodes (doesn't work)

Data gathered
=============
- Tcpdump from all interfaces for dns-pods going to port 7777 (to segregate traffic)
- ovnkube-trace
- SOSreports of two nodes having communication issues before an OVN rebuild
- SOSreports of two nodes having communication issues after an OVN rebuild
- OVS trace dumps of br-int and br-ex

More data in nested comments below.
linking KCS: https://access.redhat.com/solutions/7091399
In 4.8's installer#4760, the installer began passing oc adm release new ... a manifest so the cluster-version operator would manage a coreos-bootimages ConfigMap in the openshift-machine-config-operator namespace. installer#4797 reported issues with the 0.0.1-snapshot placeholder not getting substituted, and installer#4814 attempted to fix that issue by converting the manifest from JSON to YAML to align with the replacement regexp. But for reasons I don't understand, that manifest still doesn't seem to be getting replaced.
From 4.8 through 4.15.
100%
With 4.8.0:
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64
$ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml
the output is:
releaseVersion: 0.0.1-snapshot
while the expected output would be:
releaseVersion: 4.8.0
or other output that matches the extracted release. We just don't want the 0.0.1-snapshot placeholder.
Reproducing in the latest 4.14 RC:
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.14.0-rc.2-x86_64
$ grep releaseVersion manifests/0000_50_installer_coreos-bootimages.yaml
releaseVersion: 0.0.1-snapshot
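For illustration only (this is not the actual oc adm release new code, and the exact replacement regexp it uses is not reproduced here), the substitution being discussed amounts to a byte-level rewrite of the placeholder in each manifest; a manifest whose serialization doesn't match the pattern simply keeps the placeholder:

package main

import (
	"fmt"
	"regexp"
)

// placeholder stands in for whatever pattern the release tooling uses to find
// the version stand-in; the exact expression used by oc is an assumption here.
var placeholder = regexp.MustCompile(`0\.0\.1-snapshot`)

func substituteVersion(manifest []byte, version string) []byte {
	return placeholder.ReplaceAll(manifest, []byte(version))
}

func main() {
	in := []byte("data:\n  releaseVersion: 0.0.1-snapshot\n")
	fmt.Printf("%s", substituteVersion(in, "4.14.0-rc.2"))
	// Prints the manifest with releaseVersion: 4.14.0-rc.2
}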
Description of problem:
When applying a profile whose isolated field contains a huge CPU list, the profile doesn't apply and no errors are reported
Version-Release number of selected component (if applicable):
4.18.0-0.nightly-2024-11-26-075648
How reproducible:
Everytime.
Steps to Reproduce:
1. Create a profile as specified below:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  annotations:
    kubeletconfig.experimental: '{"topologyManagerPolicy":"restricted"}'
  creationTimestamp: "2024-11-27T10:25:13Z"
  finalizers:
  - foreground-deletion
  generation: 61
  name: performance
  resourceVersion: "3001998"
  uid: 8534b3bf-7bf7-48e1-8413-6e728e89e745
spec:
  cpu:
    isolated: 25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,371,118,374,104,360,108,364,70,326,72,328,76,332,96,352,99,355,64,320,80,336,97,353,8,264,11,267,38,294,53,309,57,313,103,359,14,270,87,343,7,263,40,296,51,307,94,350,116,372,39,295,46,302,90,346,101,357,107,363,26,282,67,323,98,354,106,362,113,369,6,262,10,266,20,276,33,289,112,368,85,341,121,377,68,324,71,327,79,335,81,337,83,339,88,344,9,265,89,345,91,347,100,356,54,310,31,287,58,314,59,315,22,278,47,303,105,361,17,273,114,370,111,367,28,284,49,305,55,311,84,340,27,283,95,351,5,261,36,292,41,297,43,299,45,301,75,331,102,358,109,365,37,293,56,312,63,319,65,321,74,330,125,381,13,269,42,298,44,300,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,225,481,236,492,152,408,203,459,214,470,166,422,207,463,212,468,130,386,155,411,215,471,188,444,201,457,210,466,193,449,200,456,248,504,141,397,167,423,191,447,181,437,222,478,252,508,128,384,139,395,174,430,164,420,168,424,187,443,232,488,133,389,157,413,208,464,140,396,185,441,241,497,219,475,175,431,184,440,213,469,154,410,197,453,249,505,209,465,218,474,227,483,244,500,134,390,153,409,178,434,160,416,195,451,196,452,211,467,132,388,136,392,146,402,138,394,150,406,239,495,173,429,192,448,202,458,205,461,216,472,158,414,159,415,176,432,189,445,237,493,242,498,177,433,182,438,204,460,240,496,254,510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480
    reserved: 0,256,1,257
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  net:
    userLevelNetworking: true
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: restricted
  realTimeKernel:
    enabled: false
  workloadHints:
    highPowerConsumption: true
    perPodPowerManagement: false
    realTime: true

2. The worker-cnf node doesn't contain any kernel args associated with the above profile.
3.
Actual results:
System doesn't boot with kernel args associated with above profile
Expected results:
System should boot with Kernel args presented from Performance Profile.
Additional info:
We can see MCO gets the details and creates the mc:

Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: machine-config-daemon[9550]: "Running rpm-ostree [kargs --delete=systemd.unified_cgroup_hierarchy=1 --delete=cgroup_no_v1=\"all\" --delete=psi=0 --delete=skew_tick=1 --delete=tsc=reliable --delete=rcupda>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: cbs=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317,120,376,35,291,62,318,93,349,126,382,19,275,52,308,110,366,50,306,92,348,124,380,119,375,2,258,29,285,60,316,115,3>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 4,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,494,131,387,230,486,235,491,246,502,145,401,194,450,199,455,143,399,169,425,231,487,245,501,129,385,142,398,179,435,2>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: systemd.cpu_affinity=0,1,256,257 --append=iommu=pt --append=amd_pstate=guided --append=tsc=reliable --append=nmi_watchdog=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=is>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,78,334,122,378,4,260,16,272,34,290,123,379,18,274,48,304,69,325,82,338,24,280,32,288,73,329,86,342,220,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: 510,162,418,171,427,180,436,243,499,156,412,165,421,170,426,228,484,247,503,161,417,223,479,224,480 --append=nohz_full=25,281,117,373,127,383,3,259,30,286,77,333,23,279,21,277,66,322,12,268,15,271,61,317>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ,476,251,507,206,462,226,482,229,485,221,477,253,509,255,511,135,391,144,400,183,439,233,489,137,393,186,442,198,454,190,446,234,490,147,403,163,419,172,428,148,404,149,405,250,506,151,407,217,473,238,49>
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com root[18779]: ppend=nosoftlockup --append=skew_tick=1 --append=rcutree.kthread_prio=11 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=20]"
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: client(id:machine-config-operator dbus:1.336 unit:crio-36c845a9c9a58a79a0e09dab668f8b21b5e46e5734a527c269c6a5067faa423b.scope uid:0) added; new total=1
Dec 02 08:59:43 cnfdd11.t5g-dev.eng.rdu2.dc.redhat.com rpm-ostree[18750]: Loaded sysroot

Actual Kernel args:
BOOT_IMAGE=(hd1,gpt3)/boot/ostree/rhcos-854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/vmlinuz-5.14.0-427.44.1.el9_4.x86_64 rw ostree=/ostree/boot.0/rhcos/854dd632224b34d5f4df1884c4ba8c2f9527422b37744b83e7b1b98172586ff4/0 ignition.platform.id=metal ip=dhcp root=UUID=0068e804-432c-409d-aabc-260aa71e3669 rw rootflags=prjquota boot=UUID=7797d927-876e-426b-9a30-d1e600c1a382 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=0 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 nohz=on
Description of problem:
The create button on the MultiNetworkPolicies and NetworkPolicies list pages is in the wrong position; it should be at the top right.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
See https://github.com/kubernetes/kubernetes/issues/127352
Version-Release number of selected component (if applicable):
See https://github.com/kubernetes/kubernetes/issues/127352
How reproducible:
See https://github.com/kubernetes/kubernetes/issues/127352
Steps to Reproduce:
See https://github.com/kubernetes/kubernetes/issues/127352
Actual results:
See https://github.com/kubernetes/kubernetes/issues/127352
Expected results:
See https://github.com/kubernetes/kubernetes/issues/127352
Additional info:
See https://github.com/kubernetes/kubernetes/issues/127352
Description of problem:
Checked in 4.18.0-0.nightly-2024-12-05-103644 / 4.19.0-0.nightly-2024-12-04-03122. In the admin console, go to "Observe -> Metrics", execute a query that returns results (for example "cluster_version"), then click the kebab menu. "Show all series" appears under the list, which is wrong; it should be "Hide all series". Clicking "Show all series" unselects all series, and from then on "Hide all series" always appears under the menu. Clicking it toggles the series between selected and unselected, but the label always remains "Hide all series". See recording: https://drive.google.com/file/d/1kfwAH7FuhcloCFdRK--l01JYabtzcG6e/view?usp=drive_link
The same issue exists in the developer console for 4.18+; 4.17 and below do not have this issue.
Version-Release number of selected component (if applicable):
4.18+
How reproducible:
always with 4.18+
Steps to Reproduce:
see the description
Actual results:
The Hide/Show all series status under the "Observe -> Metrics" kebab menu is wrong
Expected results:
The kebab menu should show the label matching the current state: "Hide all series" when series are shown, "Show all series" after they have been hidden
Description of problem:
the go to arrow and new doc link icon are no longer aligned with their text
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2024-12-12-144418
How reproducible:
Always
Steps to Reproduce:
1. Go to the Home -> Overview page
2.
3.
Actual results:
the go to arrow and new doc link icon are no longer horizontally aligned with their text
Expected results:
icon and text should be aligned
Additional info:
screenshot https://drive.google.com/file/d/1S61XY-lqmmJgGbwB5hcR2YU_O1JSJPtI/view?usp=drive_link
Description of problem:
CAPI install got ImageReconciliationFailed when creating vpc custom image
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2024-12-06-101930
How reproducible:
always
Steps to Reproduce:
1. Add the following in install-config.yaml:
   featureSet: CustomNoUpgrade
   featureGates: [ClusterAPIInstall=true]
2. Create an IBM Cloud cluster with IPI
Actual results:
level=info msg=Done creating infra manifests
level=info msg=Creating kubeconfig entry for capi cluster ci-op-h3ykp5jn-32a54-xprzg
level=info msg=Waiting up to 30m0s (until 11:25AM UTC) for network infrastructure to become ready...
level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: infrastructure was not ready within 30m0s: client rate limiter Wait returned an error: context deadline exceeded
in IBMVPCCluster-openshift-cluster-api-guests log
reason: ImageReconciliationFailed
message: 'error failure trying to create vpc custom image: error unknown failure creating vpc custom image: The IAM token that was specified in the request has expired or is invalid. The request is not authorized to access the Cloud Object Storage resource.'
Expected results:
Cluster creation should succeed
Additional info:
The resources created when the install failed:
- ci-op-h3ykp5jn-32a54-xprzg-cos   dff97f5c-bc5e-4455-b470-411c3edbe49c   crn:v1:bluemix:public:cloud-object-storage:global:a/fdc2e14cf8bc4d53a67f972dc2e2c861:f648897a-2178-4f02-b948-b3cd53f07d85::
- ci-op-h3ykp5jn-32a54-xprzg-vpc   is.vpc   crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::vpc:r022-46c7932d-8f4d-4d53-a398-555405dfbf18
- copier-resurrect-panzer-resistant   is.security-group   crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::security-group:r022-2367a32b-41d1-4f07-b148-63485ca8437b
- deceiving-unashamed-unwind-outward   is.network-acl   crn:v1:bluemix:public:is:jp-tok:a/fdc2e14cf8bc4d53a67f972dc2e2c861::network-acl:r022-b50286f6-1052-479f-89bc-fc66cd9bf613
Description of problem:
node-joiner --pxe does not rename pxe artifacts
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. node-joiner --pxe
Actual results:
agent*.* artifacts are generated in the working dir
Expected results:
In the target folder, there should be only the following artifacts:
* node.x86_64-initrd.img
* node.x86_64-rootfs.img
* node.x86_64-vmlinuz
* node.x86_64.ipxe (if required)
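A minimal sketch of the expected renaming step (assuming the agent.<arch>-* naming shown under Actual results; this is not the actual node-joiner code):

package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"
)

// renamePXEArtifacts renames agent.<arch>-* artifacts to node.<arch>-* in dir.
func renamePXEArtifacts(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		name := e.Name()
		if e.IsDir() || !strings.HasPrefix(name, "agent.") {
			continue
		}
		newName := "node." + strings.TrimPrefix(name, "agent.")
		if err := os.Rename(filepath.Join(dir, name), filepath.Join(dir, newName)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The directory name is illustrative; node-joiner's real working dir differs.
	if err := renamePXEArtifacts("./pxe-artifacts"); err != nil {
		log.Fatal(err)
	}
}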
Additional info:
Today, when source images are by digest only, oc-mirror applies a default tag:
This should be unified.
Component Readiness has found a potential regression in the following test:
install should succeed: infrastructure
installer fails with:
time="2024-10-20T04:34:57Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: control-plane machines were not provisioned within 15m0s: client rate limiter Wait returned an error: context deadline exceeded"
Significant regression detected.
Fishers Exact probability of a regression: 99.96%.
Test pass rate dropped from 98.94% to 89.29%.
Sample (being evaluated) Release: 4.18
Start Time: 2024-10-14T00:00:00Z
End Time: 2024-10-21T23:59:59Z
Success Rate: 89.29%
Successes: 25
Failures: 3
Flakes: 0
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T23:59:59Z
Success Rate: 98.94%
Successes: 93
Failures: 1
Flakes: 0
Description of problem:
The period is placed inside the quotes of the missingKeyHandler i18n error
Version-Release number of selected component (if applicable):
4.18.0
How reproducible:
Always when there is a missingKeyHandler error
Steps to Reproduce:
1. Check the browser console
2. Observe the period is placed inside the quotes
3.
Actual results:
It is placed inside the quotes
Expected results:
It should be placed outside the quotes
Additional info:
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
One issue is that the ovnk build needed to test the PR takes close to an hour, and sometimes closer to two. A major contributor is that the 'dnf install' done in the ovnk Dockerfile(s) cannot reach some of the dnf repositories that are configured by default in the base openshift image. They may be behind the Red Hat VPN, which our test clusters don't have access to. If the failure is a timeout, the default dnf timeout is 30s, which can add 5+ minutes of delay. We do more than one dnf install in our Dockerfiles, so the problem is amplified.
Another issue is that the default job timeout is 4h0m; with a ~1h build time and other time-consuming steps like must-gather, watchers, and long e2e runs, getting to 4h is pretty easy.
As a developer looking to contribute to OCP BuildConfig I want contribution guidelines that make it easy for me to build and test all the components.
Much of the contributor documentation for openshift/builder is either extremely out of date or buggy. This hinders the ability for newcomers to contribute.
Description of problem:
s2i conformance test appears to fail permanently on OCP 4.16.z
Version-Release number of selected component (if applicable):
4.16.z
How reproducible:
Since 2024-11-04 at least
Steps to Reproduce:
Run OpenShift build test suite in PR
Actual results:
Test fails - root cause appears to be that a built/deployed pod crashloops
Expected results:
Test succeeds
Additional info:
Job history https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-openshift-controller-manager-release-4.16-e2e-gcp-ovn-builds
CMO should create and deploy a ConfigMap that contains the data for accelerators monitoring. When CMO creates the node-exporter DaemonSet, it should mount that ConfigMap into the node-exporter pods.
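A minimal sketch of what wiring such a ConfigMap into the node-exporter pods could look like (the ConfigMap name, volume name, and mount path are assumptions, not CMO's actual values):

package cmoconfig

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// addAcceleratorsConfig wires a ConfigMap volume into the node-exporter
// DaemonSet's first container. It assumes the DaemonSet already defines the
// node-exporter container; the volume name and mount path are made up.
func addAcceleratorsConfig(ds *appsv1.DaemonSet, configMapName string) {
	ds.Spec.Template.Spec.Volumes = append(ds.Spec.Template.Spec.Volumes, corev1.Volume{
		Name: "accelerators-config",
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: configMapName},
			},
		},
	})
	c := &ds.Spec.Template.Spec.Containers[0]
	c.VolumeMounts = append(c.VolumeMounts, corev1.VolumeMount{
		Name:      "accelerators-config",
		MountPath: "/etc/node-exporter/accelerators", // assumed path
		ReadOnly:  true,
	})
}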
The monitoring-plugin is still using Patternfly v4; it needs to be upgraded to Patternfly v5. This major version release deprecates components in the monitoring-plugin. These components will need to be replaced/removed to accommodate the version update.
We need to remove the deprecated components from the monitoring plugin, extending the work from CONSOLE-4124
Work to be done:
One of our customers observed this issue. To reproduce it in my test cluster, I intentionally increased the overall CPU limits to over 200% and monitored the cluster for more than 2 days. However, I did not see the KubeCPUOvercommit alert, which should ideally trigger after 10 minutes of overcommitment.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2654m (75%) 8450m (241%)
memory 5995Mi (87%) 12264Mi (179%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
In the OCP console go to Observe --> Alerting --> alert rules and select the `KubeCPUOvercommit` alert.
Expression:
sum by (cluster) (namespace_cpu:kube_pod_container_resource_requests:sum{job="kube-state-metrics"}) - (sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})) > 0 and (sum by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) - max by (cluster) (kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"})) > 0
The following Insights APIs use duration attributes:
The kubebuilder validation patterns are defined as
^0|([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
and
^([1-9][0-9]*(\\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
Unfortunately this is not enough, and it fails in the case of updating the resource, e.g. with the value "2m0s".
The validation pattern must allow these values.
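For example, a relaxed pattern along these lines (an assumption, not necessarily the final API change) accepts compound values such as "2m0s" by allowing zero-valued components after the first one, while still rejecting an empty string:

package main

import (
	"fmt"
	"regexp"
)

// A relaxed validation pattern (illustrative only): either a bare "0", or one
// or more numeric components each followed by a unit, allowing zero-valued
// components such as the "0s" in "2m0s". As a kubebuilder marker this would be
// written roughly as:
//   +kubebuilder:validation:Pattern=`^(0|([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+)$`
var durationPattern = regexp.MustCompile(`^(0|([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+)$`)

func main() {
	for _, v := range []string{"2m0s", "1h30m", "0", "1.5s", ""} {
		fmt.Printf("%q -> %v\n", v, durationPattern.MatchString(v))
	}
	// "2m0s", "1h30m", "0" and "1.5s" match; the empty string does not.
}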
User Story:
As an OpenShift Engineer I want Create PR for machine-api refactoring of feature gate parameters so We need to pull out the logic from Neil's PR that removes individual feature gate parameters to use the new FeatureGate mutable map.
Description:
< Record any background information >
Acceptance Criteria:
< Record how we'll know we're done >
Other Information:
< Record anything else that may be helpful to someone else picking up the card >
issue created by splat-bot
Description of problem:
The openstack-manila-csi-controllerplugin-csi-driver container is not functional on its first run; it needs to restart once and then it works. This causes HCP e2e to fail on the EnsureNoCrashingPods test.
Version-Release number of selected component (if applicable):
4.19, 4.18
How reproducible:
Deploy Shift on Stack with Manila available in the cloud.
Actual results:
The openstack-manila-csi-controllerplugin pod will restart once and then it'll be functional.
Expected results:
No restart should be needed. This is likely an orchestration issue.
Analyze the data from the new tests and determine what, if anything, we should do.
Security Tracking Issue
Do not make this issue public.
Flaw:
Non-linear parsing of case-insensitive content in golang.org/x/net/html
https://bugzilla.redhat.com/show_bug.cgi?id=2333122
An attacker can craft an input to the Parse functions that would be processed non-linearly with respect to its length, resulting in extremely slow parsing. This could cause a denial of service.
Description of problem:
We have two EAP application server clusters, and a service is created for each of them. We have a route configured to one of the services. When we update the route programmatically to point to the second service/cluster, the response shows it is still attached to the same service.
Steps to Reproduce:
1. Create two separate clusters of the EAP servers
2. Create one service for the first cluster (hsc1) and one for the second one (hsc2)
3. Create a route for the first service (hsc1)
4. Start both of the clusters and assure the replication works
5. Send a request to the first cluster using the route URL - response should contain identification of the first cluster (hsc-1-xxx)
[2024-08-29 11:30:44,544] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 11:30:44,654] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
6. Update the route programmatically to redirect to the second service (hsc2)
...
builder.editSpec().editTo().withName("hsc2").endTo().endSpec();
...
7. Send the request again using the same route - in the response there is the same identification of the first cluster
[2024-08-29 11:31:45,098] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-2-0c872b89-ef3e-48c6-8163-372e447e013d 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
although the service was updated in the route yaml:
...
    kind: Service
    name: hsc2
When creating a new route hsc2 for the service hsc2 and using it for the third request, we can see the second cluster was targeted correctly, with its own separate replication working
[2024-08-29 13:43:13,679] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:43:13,790] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-2-00594ca9-f70c-45de-94b8-354a6e1fc293 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-29 13:44:14,056] INFO - [ForkJoinPool-1-worker-1] responseString after second route for service hsc2 was used hsc-2-2-614582a9-3c71-4690-81d3-32a616ed8e54 1 with route hsc2-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
I also tried a different approach.
I stopped the test in debug mode after the two requests were executed
[2024-08-30 14:23:43,101] INFO - [ForkJoinPool-1-worker-1] 1st request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 1 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
[2024-08-30 14:23:43,210] INFO - [ForkJoinPool-1-worker-1] 2nd request responseString hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 2 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
Then I manually changed the route yaml to use the hsc2 service and sent the request manually:
curl http://hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com/Counter
hsc-2-2-84fa1d7e-4045-4708-b89e-7d7f3cd48541 1
responded correctly with the second service/cluster.
Then resumed the test run in debug mode and sent the request programmatically
[2024-08-30 14:24:59,509] INFO - [ForkJoinPool-1-worker-1] responseString after route update hsc-1-1-489d4239-be4f-4d5e-9343-3211ae479d51 3 with route hsc-cihak2.apps.cpqe037-cnid.eapqe.psi.redhat.com
responded with the wrong first service/cluster.
Actual results: Route directs to the same service and EAP cluster
Expected results: After the update the route should direct to the second service and EAP cluster
Additional info:
This issue started to occur with OCP 4.16. Going through the 4.16 release notes and the suggested route configuration didn't reveal any configuration changes that should have been applied.
The code of the MultipleClustersTest.twoClustersTest where this issue was discovered is available here.
All the logs as well as services and route yamls are attached to the EAPQE jira.
CBO-installed Ironic unconditionally has TLS, even though we don't do proper host validation just yet (see bug OCPBUGS-20412). Ironic in the installer does not use TLS (mostly for historical reasons). Now that OCPBUGS-36283 added a TLS certificate for virtual media, we can use the same for Ironic API. At least initially, it will involve disabling host validation for IPA.
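For context, "TLS without host validation" on the client side can look like the following (an illustrative Go sketch, not the IPA or installer code): the serving certificate chain is still verified against the expected CA, but the hostname check is skipped.

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

// newClientSkippingHostValidation verifies the server's certificate chain
// against caFile but does not check that the certificate matches the host
// name being dialed, which is the trade-off described above.
func newClientSkippingHostValidation(caFile string) (*http.Client, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no certificates found in %s", caFile)
	}

	cfg := &tls.Config{
		// Disable the default verification (which includes the hostname check)...
		InsecureSkipVerify: true,
		// ...and replace it with chain verification only.
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			if len(rawCerts) == 0 {
				return fmt.Errorf("server presented no certificates")
			}
			certs := make([]*x509.Certificate, 0, len(rawCerts))
			for _, raw := range rawCerts {
				c, err := x509.ParseCertificate(raw)
				if err != nil {
					return err
				}
				certs = append(certs, c)
			}
			opts := x509.VerifyOptions{Roots: pool, Intermediates: x509.NewCertPool()}
			for _, c := range certs[1:] {
				opts.Intermediates.AddCert(c)
			}
			_, err := certs[0].Verify(opts)
			return err
		},
	}
	return &http.Client{Transport: &http.Transport{TLSClientConfig: cfg}}, nil
}

func main() {
	client, err := newClientSkippingHostValidation("/etc/ironic/ca.crt") // path is illustrative
	if err != nil {
		panic(err)
	}
	_ = client
}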
Description of problem:
OpenShift Virtualization allows hotplugging block volumes into its pods, which relies on the fact that changing the cgroup corresponding to the pid of the container suffices. crun is test-driving some changes it integrated recently: it now configures two cgroups, `*.scope` and a sub-cgroup called `container`, while before, the parent existed as sort of a no-op (it wasn't configured, so all devices were allowed, for example). This breaks volume hotplug, since applying the device filter only to the sub-cgroup is not enough anymore.
Version-Release number of selected component (if applicable):
4.18.0 RC2
How reproducible:
100%
Steps to Reproduce:
1. Block volume hotplug to VM 2. 3.
Actual results:
Failure
Expected results:
Success
Additional info:
https://kubevirt.io/user-guide/storage/hotplug_volumes/
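For context, a minimal sketch of applying the device allow rule to both cgroups (assuming the cgroup v1 devices controller; on cgroup v2, device filtering is done with eBPF programs and this sketch does not apply). The paths and device numbers are illustrative only:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// allowBlockDevice writes a cgroup v1 devices allow rule for the given block
// device into the container's cgroup and, with the newer crun layout, into
// its parent *.scope cgroup as well, instead of only the sub-cgroup.
func allowBlockDevice(containerCgroup string, major, minor int64) error {
	rule := fmt.Sprintf("b %d:%d rwm", major, minor)
	// e.g. .../crio-<id>.scope/container and its parent .../crio-<id>.scope
	for _, p := range []string{containerCgroup, filepath.Dir(containerCgroup)} {
		allow := filepath.Join(p, "devices.allow")
		if err := os.WriteFile(allow, []byte(rule), 0o200); err != nil {
			return fmt.Errorf("writing %q to %s: %w", rule, allow, err)
		}
	}
	return nil
}

func main() {
	// Path and device numbers are illustrative only.
	err := allowBlockDevice("/sys/fs/cgroup/devices/kubepods.slice/crio-abc.scope/container", 259, 1)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}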