Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
1. Proposed title of this feature request
Add runbook_url to alerts in the OCP UI
2. What is the nature and description of the request?
If an alert includes a runbook_url label, then it should appear in the UI for the alert as a link.
3. Why does the customer need this? (List the business requirements here)
Customers can easily reach the alert runbook and address their issues.
4. List any affected packages or components.
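As a hedged illustration, an alerting rule carrying a runbook reference might look like the sketch below; the rule name and URL are made up, and note that OpenShift's built-in alerts conventionally carry runbook_url as an annotation (this card refers to a label):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: openshift-monitoring
spec:
  groups:
  - name: example
    rules:
    - alert: ExampleAlertWithRunbook
      expr: up == 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Example alert that links to its runbook.
        runbook_url: https://example.com/runbooks/ExampleAlertWithRunbook.md
```
The console would then render the runbook_url value as a clickable link on the alert details page.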
As a user, I should be able to configure the CSI driver to have a storage topology.
In the console-operator repo we need to add the `capability.openshift.io/console` annotation to all the manifests that the operator either contains or creates on the fly.
Manifests are currently present in /bindata and /manifest directories.
Here is an example of the insights-operator change.
Here is the overall enhancement doc.
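As a sketch of what such an annotated manifest could look like; the annotation key/value follow the `capability.openshift.io/name` convention from the capabilities enhancement (the card's `capability.openshift.io/console` is shorthand), and the resource shown is illustrative:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: console
  namespace: openshift-console
  annotations:
    # Tells the CVO this manifest belongs to the console capability and
    # should only be applied when that capability is enabled.
    capability.openshift.io/name: Console
```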
Feature Overview
Provide CSI drivers to replace all the in-tree cloud provider drivers we currently have. These drivers will probably be released as tech preview versions first, before being promoted to GA.
Goals
Requirements
Requirement | Notes | isMvp? |
---|---|---|
Framework for CSI driver | TBD | Yes |
Drivers should be available to install both in disconnected and connected mode | | Yes |
Drivers should upgrade from release to release without any impact | | Yes |
Drivers should be installable via CVO (when in-tree plugin exists) | | |
Out of Scope
This work will only cover the drivers themselves; it will not include:
Background and strategic fit
In a future Kubernetes release (currently 1.21), in-tree cloud provider drivers will be deprecated and replaced with CSI equivalents. We need the drivers created so that we can continue to support the ecosystems in an appropriate way.
Assumptions
Customer Considerations
Customers will need to be able to use the storage they want.
Documentation Considerations
This Epic is to track the GA of this feature
As an OCP user, I want images for GCP Filestore CSI Driver and Operator, so that I can install them on my cluster and utilize GCP Filestore shares.
We need to continue to maintain specific areas within storage; this card is to capture that effort and track it across releases.
Goals
Requirements
Requirement | Notes | isMvp? |
---|---|---|
Telemetry | | No |
Certification | | No |
API metrics | | No |
Out of Scope
n/a
Background and strategic fit
With the expected scale of our customer base, we want to keep the load of customer tickets/BZs low.
Assumptions
Customer Considerations
Documentation Considerations
Notes
In progress:
High prio:
Unsorted
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF). We are trying no-feature-freeze in 4.12. We will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
This includes ibm-vpc-node-label-updater!
(Using separate cards for each driver because these updates can be more complicated)
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
There is a new driver release 5.0.0 since the last rebase that includes snapshot support:
https://github.com/kubernetes-sigs/ibm-vpc-block-csi-driver/releases/tag/v5.0.0
Rebase the driver on v5.0.0 and update the deployments in ibm-vpc-block-csi-driver-operator.
There are no corresponding changes in ibm-vpc-node-label-updater since the last rebase.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all CSI sidecars to the latest upstream release.
This includes update of VolumeSnapshot CRDs in https://github.com/openshift/cluster-csi-snapshot-controller-operator/tree/master/assets
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
The end of general support for vSphere 6.7 is October 15, 2022, so vSphere 6.7 will be deprecated in 4.11.
We want to encourage vSphere customers to upgrade to vSphere 7 in OCP 4.11, since VMware is ending general support for vSphere 6.7 in October 2022.
We want to set the cluster to Upgradeable=false and have a strong alert pointing to our docs/requirements.
related slack: https://coreos.slack.com/archives/CH06KMDRV/p1647541493096729
On new installations, we should make the StorageClass created by the CSI operator the default one.
However, we shouldn't do that on an upgrade scenario. The main reason is that users might have set a different quota on the CSI driver Storage Class.
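Concretely, making the CSI StorageClass the default is a matter of setting the standard default-class annotation on it; a minimal sketch, with the class name and provisioner illustrative:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-csi
  annotations:
    # Set by the operator on fresh installs only; upgrades keep the existing default.
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.example.com
```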
Exit criteria:
This Epic tracks the GA of this feature
Epic Goal
On new installations, we should make the StorageClass created by the CSI operator the default one.
However, we shouldn't do that on an upgrade scenario. The main reason is that users might have set a different quota on the CSI driver Storage Class.
Exit criteria:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Rebase openshift-controller-manager to k8s 1.24
4.11 MVP Requirements
Out of scope use cases (that are part of the Kubeframe/factory project):
Questions to be addressed:
As an OpenShift infrastructure owner, I want to deploy a cluster zero with RHACM or MCE and have the required components installed when the installation is completed
BILLI makes it easier to deploy a cluster zero. BILLI users know at installation time what the purpose of their cluster is when they plan the installation. Day-2 steps are currently necessary to install operators, and users, especially when automating installations, want to finish the installation flow with their required components already installed.
As a customer, I want to be able to:
so that I can achieve
Description of criteria:
We are only allowing the user to provide extra manifests to install MCE at this time. We are not adding an option to "install mce" on the command line (or UI)
This requires/does not require a design proposal.
This requires/does not require a feature gate.
As an OpenShift infrastructure owner, I want to deploy OpenShift clusters with dual-stack IPv4/IPv6
As an OpenShift infrastructure owner, I want to deploy OpenShift clusters with single-stack IPv6
IPv6 and dual-stack clusters are requested often by customers, especially Telco customers. Working with dual-stack clusters is a requirement for many, but also a transition path toward single-stack IPv6 clusters, which for some of our users is the final destination.
Karim's work proving how agent-based can deploy IPv6: IPv6 deploy with agent based installer
For dual-stack installations, the agent-cluster-install.yaml must have both an IPv4 and an IPv6 subnet in networking.MachineNetwork, or assisted-service will throw an error. This field is in InstallConfig, but it must be added to agent-cluster-install in its Generate().
For single-stack IPv4 and IPv6 installs, setting the MachineNetwork is not needed, but it also does not cause problems if it is set, so it should be fine to set it at all times.
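A minimal sketch of the dual-stack networking stanza in agent-cluster-install.yaml, with illustrative CIDRs:
```yaml
networking:
  machineNetwork:
  - cidr: 192.168.111.0/24          # IPv4 subnet
  - cidr: fd2e:6f44:5dd8:c956::/64  # IPv6 subnet; both families are required for dual-stack
```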
Set the ClusterDeployment CRD to deploy OpenShift in FIPS mode and make sure that after deployment the cluster is set in that mode
In order to install FIPS-compliant clusters, we need to make sure that installconfig + agentconfig based deployments take into account the FIPS config in installconfig.
This task is about passing the config to agentclusterinstall so that it makes it into the ISO. Once there, AGENT-374 will give it to assisted-service.
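A hedged sketch of the relevant install-config.yaml input that needs to be propagated into agent-cluster-install (and from there into the ISO); the cluster name is illustrative:
```yaml
apiVersion: v1
metadata:
  name: example-cluster
fips: true   # must be reflected in agent-cluster-install so the deployed cluster runs in FIPS mode
```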
As a user I would like to see all the events that the autoscaler creates, even duplicates. Having the CAO set this flag will allow me to continue to see these events.
We have carried a patch for the autoscaler that would enable the duplication of events. This patch can now be dropped because the upstream added a flag for this behavior in https://github.com/kubernetes/autoscaler/pull/4921
Add GA support for deploying OpenShift to IBM Public Cloud
Complete the existing gaps to make OpenShift on IBM Cloud VPC (Next Gen2) General Available
This epic tracks the changes needed to the ingress operator to support IBM DNS Services for private clusters.
Currently in OpenShift we do not support distributing hotfix packages to cluster nodes. In time-sensitive situations, a RHEL hotfix package can be the quickest route to resolving an issue.
Before we ship OCP CoreOS layering in https://issues.redhat.com/browse/MCO-165 we need to switch the format of what is currently `machine-os-content` to be the new base image.
The overall plan is:
As an OCP CoreOS layering developer, having telemetry data about the number of clusters using osImageURL will help us understand how broadly this feature is getting used and improve it accordingly.
Acceptance Criteria:
After https://github.com/openshift/os/pull/763 is in the release image, teach the MCO how to use it. This is basically:
Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.
Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.
Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.
See slack working group: #wg-ctrl-plane-resize
We recently finished working on the CPMS main controller. We now need a way to generate the CPMS object for a cluster based on its current state.
We want a new controller that based on the cluster state generates a CPMS object and creates it against the API of the cluster. The CPMS object that will be created will have the `.spec.state` field set to `Inactive`, meaning that the main CPMS controller won't do any modification to the cluster based on that configuration until the user has reviewed/modified it and activated it by setting it to `Active`.
To catch changes in the cluster state, the CPMS generator controller will keep watching the state and performing CPMS object regenerations, checking whether the object needs updating. If that's the case, the generator will delete and recreate the CPMS object. This will happen continuously while the CPMS state is set to `Inactive`. As soon as the user activates it (sets `.spec.state` to `Active`), the generator controller will stop processing and its logic will become a no-op.
For now the generator controller will only be able to generate config for AWS clusters.
We should make sure to include envtest based testing to give us confidence when making changes to the code in the future.
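A sketch of the shape of a generated object, with provider details elided; the `state` field behaves as described above:
```yaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  state: Inactive   # regenerated by the controller until the user reviews it and sets Active
  replicas: 3
  # selector, template, and AWS failure domains omitted for brevity
```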
Assumption
Doc: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit
Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
Assumption
cluster-snapshot-controller-operator is running on the CP.
More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit
As an OpenShift developer I want cluster-csi-snapshot-controller-operator to use existing controllers in library-go, so I don't need to maintain yet more code that does the same thing as library-go.
Note: if this refactoring introduces any new conditions, we must make sure that 4.11 snapshot controller clears them to support downgrade! This will need 4.11 BZ + z-stream update!
Similarly, if some conditions become obsolete / not managed by any controller, they must be cleared by 4.12 operator.
Exit criteria:
As HyperShift Cluster Instance Admin, I want to run cluster-csi-snapshot-controller-operator in the management cluster, so the guest cluster runs just my applications.
Exit criteria:
Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
Assumption
Run cluster-storage-operator (CSO) + AWS EBS CSI driver operator + AWS EBS CSI driver control-plane Pods in the management cluster, run the driver DaemonSet in the hosted cluster.
More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit
As HyperShift Cluster Instance Admin, I want to run AWS EBS CSI driver operator + control plane of the CSI driver in the management cluster, so the guest cluster runs just my applications.
Exit criteria:
As an OCP support engineer I want the same guest cluster storage-related objects in the output of "hypershift dump cluster --dump-guest-cluster" as in "oc adm must-gather", so I can debug storage issues easily.
must-gather collects: storageclasses, persistentvolumes, volumeattachments, csidrivers, csinodes, volumesnapshotclasses, volumesnapshotcontents.
hypershift collects none of these; the relevant code is here: https://github.com/openshift/hypershift/blob/bcfade6676f3c344b48144de9e7a36f9b40d3330/cmd/cluster/core/dump.go#L276
Exit criteria:
As HyperShift Cluster Instance Admin, I want to run cluster-storage-operator (CSO) in the management cluster, so the guest cluster runs just my applications.
Exit criteria:
CNCC (cloud-network-config-controller) was moved to the management cluster, and it should use the proxy settings defined for the management cluster.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Questions to be addressed:
As a developer, I want to be able to:
so that I can achieve
Description of criteria:
All modifications to the openshift-installer are handled in other cards in the epic.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Questions to be addressed:
This description is based on the Google Doc by Rafael Fonseca dos Santos: https://docs.google.com/document/d/1yQt8sbknSmF_hriHyMAKPiztSoRIvntSX9i1wtObSYs
Microsoft is deprecating two APIs: the AD Graph API, which is used by the Installer's destroy code and by the CCO to mint credentials, and ADAL, which is also going EOL. ADAL is used by the installer and all cluster components that authenticate to Azure:
Azure Active Directory Authentication Library (ADAL) Retirement
ADAL end-of-life is December 31, 2022. While ADAL apps may continue to work, no support or security fixes will be provided past end-of-life. In addition, there are no ADAL releases planned prior to end-of-life for features or for support of new platform versions. We recommend prioritizing migration to Microsoft Authentication Library (MSAL).
Azure AD Graph API
Azure AD Graph will continue to function until June 30, 2023. This will be three years after the initial deprecation announcement (https://techcommunity.microsoft.com/t5/microsoft-entra-azure-ad-blog/update-your-applications-to-use-microsoft-authentication-library/ba-p/1257363). Based on Azure deprecation guidelines (https://docs.microsoft.com/en-us/lifecycle/), we reserve the right to retire Azure AD Graph at any time after June 30, 2023, without advance notice. Though we reserve the right to turn it off after June 30, 2023, we want to ensure all customers migrate off and discourage applications from taking production dependencies on Azure AD Graph. Investments in new features and functionalities will only be made in Microsoft Graph (https://docs.microsoft.com/en-us/graph/overview). Going forward, we will continue to support Azure AD Graph with security-related fixes. We recommend prioritizing migration to Microsoft Graph.
References:
ADAL is EOL. Update component to use azidentity.
Update ETCD datastore encryption to use AES-GCM instead of AES-CBC
2. What is the nature and description of the request?
The current etcd datastore encryption solution uses the aes-cbc cipher. This cipher is now considered "weak" and is susceptible to a padding oracle attack. Upstream recommends using the AES-GCM cipher. AES-GCM will require automation to rotate secrets every 200k writes.
The cipher used is hard coded.
3. Why is this needed? (List the business requirements here).
Security conscious customers will not accept the presence and use of weak ciphers in an OpenShift cluster. Continuing to use the AES-CBC cipher will create friction in sales and, for existing customers, may result in OpenShift being blocked from being deployed in production.
4. List any affected packages or components.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
The kube-apiserver is used to set the encryption of data stored in etcd. See https://docs.openshift.com/container-platform/4.11/security/encrypting-etcd.html
Today, with OpenShift 4.11 or earlier, only aescbc is allowed as the encryption field type.
RFE-3095 asks that aesgcm (which is an updated and more recent type) be supported. Furthermore, RFE-3338 asks for more customizability, which brings us to how we have implemented cipher customization with tlsSecurityProfile. See https://docs.openshift.com/container-platform/4.11/security/tls-security-profiles.html
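For illustration, once aesgcm is supported, the APIServer config would accept it in the same place where aescbc is set today:
```yaml
apiVersion: config.openshift.io/v1
kind: APIServer
metadata:
  name: cluster
spec:
  encryption:
    type: aesgcm   # today only aescbc (or unset) is accepted here
```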
Why is this important? (mandatory)
AES-CBC is considered a weak cipher.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
The new aesgcm encryption provider was added in 4.13 as tech preview, but as part of https://issues.redhat.com/browse/API-1509, the feature needs to be GA in OCP 4.13.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.
To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (eg phase 1 is done in 4.X, phase 2 is done in 4.X+1).
Phase 1
Phase 2
Phase 3
As defined in the External platform enhancement, a new platform is being added to OpenShift. To accommodate the phase 2 work, the CIO should be updated, if necessary, to react to the "External" platform in the same manner as it would for platform "None".
Please see the enhancement and the parent plan OCPBU-5 for more details about this process.
In phase 2 (planned for 4.13 release) of the external platform enhancement, the new platform type will be added to the openshift/api packages. As part of staging the release of this new platform we will need to ensure that all operators react in a neutral way to the platform, as if it were a "None" platform to ensure the continued normal operation of OpenShift.
We are working to create an External platform test which will exercise this mechanism, see OCPCLOUD-1782
As described in the epic, the CIO should be updated to react to the new "External" platform as it would for a "None" platform.
Goal:
API and implementation work to provide the cluster admin with an option in the IngressController API to use PROXY protocol with IBM Cloud load-balancers.
Description:
This epic extends the IngressController API, essentially by copying the option we added in NE-330. In that epic, we added a configuration option to use PROXY protocol when configuring an IngressController to use a NodePort service or host networking. With this epic (NE-1090), the same configuration option is added to use PROXY protocol when configuring an IngressController to use a LoadBalancer service on IBM Cloud.
This epic tracks the API and implementation work to provide the cluster admin with an option in the IngressController API to use PROXY protocol with IBM Cloud load-balancers.
This epic extends the IngressController API, essentially by copying the option we added in NE-330. In that epic, we added a configuration option to use PROXY protocol when configuring an IngressController to use a NodePort service or host networking. With this epic (NE-1090), the same configuration option is added to use PROXY protocol when configuring an IngressController to use a LoadBalancer service on IBM Cloud.
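A hedged sketch of what the new IngressController configuration could look like, with the field names assumed by analogy with the NE-330 options and the existing providerParameters structure:
```yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      scope: External
      providerParameters:
        type: IBM
        ibm:
          protocol: PROXY   # assumed new knob; tells the router to expect PROXY protocol from the IBM LB
```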
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release.
OLM would have to support a mechanism like podAffinity that allows multiple architecture values to be specified, enabling it to pin operators to worker nodes of the matching architectures.
Ref: https://github.com/openshift/enhancements/pull/1014
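The card names podAffinity; in upstream Kubernetes, pinning to nodes of specific architectures is usually expressed as nodeAffinity against the kubernetes.io/arch node label, roughly like this pod-spec sketch:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values:          # multiple values pin the operator to any matching architecture
          - amd64
          - arm64
```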
Cut a new release of the OLM API and update OLM API dependency version (go.mod) in OLM package; then
Bring the upstream changes from OLM-2674 to the downstream olm repo.
A/C:
- New OLM API version release
- OLM API dependency updated in OLM Project
- OLM Subscription API changes downstreamed
- OLM Controller changes downstreamed
- Changes manually tested on Cluster Bot
We have a set of images
that should become multiarch images. This should be done both in upstream and downstream.
As a reference, we have built internally those images as multiarch and made them available as
They can be consumed by the Assisted Service pod via the following env:
```yaml
- name: AGENT_DOCKER_IMAGE
  value: registry.redhat.io/rhai-tech-preview/assisted-installer-agent-rhel8:latest
- name: CONTROLLER_IMAGE
  value: registry.redhat.io/rhai-tech-preview/assisted-installer-reporter-rhel8:latest
- name: INSTALLER_IMAGE
  value: registry.redhat.io/rhai-tech-preview/assisted-installer-rhel8:latest
```
We drive OpenShift cross-market customer success and new customer adoption with constant improvements and feature additions to the existing capabilities of our OpenShift Core Networking (SDN and Network Edge). This feature captures that natural progression of the product.
There are definitely grey areas, but in general:
Questions to be addressed:
Goal: Provide queryable metrics and telemetry for cluster routes and sharding in an OpenShift cluster.
Problem: Today we test OpenShift performance and scale with best-guess or anecdotal evidence for the number of routes that our customers use. Best practice for a large number of routes in a cluster is to shard; however, we have no visibility into whether and how customers are using sharding.
Why is this important? These metrics will inform our performance and scale testing, documented cluster limits, and how customers are using sharding for best practice deployments.
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Not in scope:
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Acceptance criteria:
Epic Done Checklist:
Description:
As described in the "Metrics to be sent via telemetry" section of the Design Doc, the following metrics need to be sent from the OpenShift cluster to Red Hat premises:
The metrics should be allowlisted on the cluster side.
The steps described in "Sending metrics via telemetry" need to be followed, specifically step 5.
Depends on CFE-478.
Acceptance Criteria:
Description:
As described in the Design Doc, the following information needs to be exported from the Cluster Ingress Operator:
Design 2 will be implemented as part of this story.
Acceptance Criteria:
This is an epic bucket for all activities surrounding the creation of a declarative approach to releasing and maintaining OLM catalogs.
When working on this Epic, it's important to keep in mind this other potentially related Epic: https://issues.redhat.com/browse/OLM-2276
Jira Description
As an OPM maintainer, I want to downstream the PR for OCP 4.12 and backport it to OCP 4.11, so that IIB will NOT be impacted by the changes when it upgrades the OPM version to use the next/future opm upstream release (v1.25.0).
Summary / Background
IIB (the downstream service that manages the indexes) uses the upstream version. If they bump the OPM version to the next/future release (v1.25.0), which includes this change, before the downstream images are updated, then the process to manage the indexes downstream will face issues and it will impact the distributions.
Acceptance Criteria
Definition of Ready
Definition of Done
Enhance the veneer rendering to be able to read the input veneer data from stdin via a pipe, in a manner similar to https://dev.to/napicella/linux-pipes-in-golang-2e8j
Then the command could be used in a manner similar to many k8s examples, like:
```shell
opm alpha render-veneer semver -o yaml < infile > outfile
```
Upstream issue link: https://github.com/operator-framework/operator-registry/issues/1011
tldr: three basic claims, the rest is explanation and one example
While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Investing in fixing bugs will help alleviate immediate problems, but doesn't improve the ability to address future problems. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of one such outcome of focusing on bugs vs quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. In so doing, we chase many hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
We need more investment in our future selves. Saying, "teams should reserve this" doesn't seem to be universally effective. Perhaps an approach that directly asks for designs and impacts and then follows up by placing the items directly in planning and prioritizing against PM feature requests would give teams the confidence to invest in these areas and give broad exposure to systemic problems.
Relevant links:
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Refactor Test_desiredLoadBalancerService to match our unit test standards, remove extraneous test cases, and make it more readable/maintainable.
Use t.Run to run test cases for table-driven unit tests, for consistency.
Unit test names should be formatted as Test_functionName, so that the scope of the function (private or Public) can be preserved.
Test_getRR: fix typo: "excepted" -> "expected".
Test_desiredHttpErrorCodeConfigMap contains a section that has dead code when checking for expect == nil || actual == nil. Clean this up.
Also replace Ruby-style #{} syntax for string interpolation with Go string formatting.
Go 1.16 added the new embed directive to go. This embed directive lets you natively (and trivially) compile your binary with static asset files.
The current go-bindata dependency that's used in both the Ingress and DNS operators for yaml asset compilation could be dropped in exchange for the new go embed functionality. This would reduce our dependency count, remove the need for `bindata.go` (which is version controlled and constantly updated), and make our code easier to read. This switch would also reduce the overall lines of code in our repos.
Note that this may be applicable to OCP 4.8 if and when images are built with go 1.16.
Epic Template descriptions and documentation.
Enable the chaos plugin https://coredns.io/plugins/chaos/ in our CoreDNS configuration so that we can use a DNS query to easily identify what DNS pods are responding to our requests.
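A sketch of what the rendered CoreDNS config could look like with the plugin enabled (the dns-default ConfigMap is managed by the DNS operator; the stanza below is illustrative). A client could then identify the responding pod with a CH-class TXT query such as version.bind:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-default
  namespace: openshift-dns
data:
  Corefile: |
    .:5353 {
        chaos    # answers CH-class TXT queries (version.bind, hostname.bind) identifying the server
        forward . /etc/resolv.conf
    }
```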
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
As a developer, I want to make status.HostIP for Pods visible in the Pod details page of the OCP Web Console. Currently there is no way to view the node IP for a Pod in the OpenShift Web Console: when viewing a Pod in the console, the field status.HostIP is not visible.
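The data is already present in the Pod status; the console just needs to surface it. For example (illustrative values):
```yaml
status:
  hostIP: 10.0.128.4   # node IP to surface on the Pod details page
  podIP: 10.129.2.15
  phase: Running
```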
Acceptance criteria:
When OCP is performing a cluster upgrade, the user should be notified about this fact.
There are two possibilities for how to surface the cluster upgrade to users:
AC:
Note: We need to decide if we want to distinguish this particular notification with a different color. cc'ing Ali Mobrem
Created from: https://issues.redhat.com/browse/RFE-3024
As a console user I want to have the option to:
For Deployments we will add a 'Restart rollout' action button. This action will PATCH the Deployment object's 'spec.template.metadata.annotations' block by adding the 'openshift.io/restartedAt: <actual-timestamp>' annotation. This will restart the deployment by creating a new ReplicaSet.
For DeploymentConfigs we will add a 'Retry rollout' action button. This action will PATCH the latest revision of the ReplicationController object's 'metadata.annotations' block by setting 'openshift.io/deployment/phase: "New"' and removing openshift.io/deployment.cancelled and openshift.io/deployment.status-reason.
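For the Deployment case, the PATCH body the console sends would look roughly like this (timestamp illustrative), mirroring what oc rollout restart does with its own annotation:
```yaml
spec:
  template:
    metadata:
      annotations:
        openshift.io/restartedAt: "2022-10-20T13:45:00Z"   # changing the pod template triggers a new ReplicaSet
```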
Acceptance Criteria:
BACKGROUND:
OpenShift console will be updated to allow rollout restart deployment from the console itself.
Currently, from the OpenShift console, for the resource "deploymentconfigs" we can only start and pause the rollout, and for the resource "deployment" we can only resume the rollout. Neither resource (Deployment or DeploymentConfig) has an option to restart the rollout, which is why the customer wants to be able to perform the same action from the OpenShift console as from the CLI.
The customer wants developers who are not fluent with the oc tool and terminal utilities to be able to use the console instead of the terminal to restart a deployment, just as they would through the CLI with the command "oc rollout restart deploy/<deployment-name>".
Usually, when developers change the ConfigMap that a deployment uses, they have to restart the pods. Currently, the developers have to use the "oc rollout restart deployment" command. The customer wants a button/menu to perform the same action from the console as well.
Design
Doc: https://docs.google.com/document/d/1i-jGtQGaA0OI4CYh8DH5BBIVbocIu_dxNt3vwWmPZdw/edit
This feature is about reducing the complexity of the CAPI install system architecture, which is needed for using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Prerequisite work
Goals
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps
Once the new CAPI manifests generator tool is ready, we want to make use of that directly from the CAPI Providers repositories so we can avoid storing the generated configuration centrally and independently apply that based on the running platform.
As an OpenShift engineer I want to be able to install the new manifest generation tool as a standalone tool in my CAPI Infra Provider repo to generate the CAPI Provider transport ConfigMap(s)
Renaming of the CAPI Asset/Manifest generator from assets (generator) to manifest-gen, as it won't need to generate Go-embeddable assets anymore, but only manifests that will be referenced and applied by the CVO.
We are deprecating DeploymentConfig in favor of Deployment in OpenShift, because Deployment is the recommended way to deploy applications. Deployment is a more flexible and powerful resource that allows you to control the rollout of your applications more precisely. DeploymentConfig is a legacy resource that is no longer necessary. We will continue to support DeploymentConfig for a period of time, but we encourage you to migrate to Deployment as soon as possible.
Here are some of the benefits of using Deployment over DeploymentConfig:
We hope that you will migrate to Deployment as soon as possible. If you have any questions, please contact us.
Given the nature of this component (embedded into a shared api server and controller manager), this will likely require adding logic within those shared components to not enable specific bits of function when the build or DeploymentConfig capability is disabled, and watching the enabled capability set so that the components enable the functionality when necessary.
I would not expect us to split the components out of their existing location as part of this, though that is theoretically an option.
None
Make the list of enabled/disabled controllers in OAS reflect enabled/disabled capabilities.
Acceptance criteria:
QE:
oc-mirror is a GA product as of OpenShift 4.11.
The goal of this feature is to address any future customer requests for new features or capabilities in oc-mirror.
Description of problem:
Even though in 4.11 we introduced LegacyServiceAccountTokenNoAutoGeneration to be compatible with upstream K8s and stop generating secrets with tokens when service accounts are created, today OpenShift still creates secrets and tokens that are used for legacy usage of the openshift-controller as well as for image-pull secrets.
Customer issues:
Customers see auto-generated secrets for service accounts which is flagged as a security risk.
This feature is to track the implementation for removing legacy usage and image-pull secret generation as well, so that NO secrets are auto-generated when a Service Account is created on an OpenShift cluster.
NO Secrets to be auto-generated when creating service accounts
The following secrets need to NOT be generated automatically with every service account creation:
Use Cases (Optional):
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Concerns/Risks: Replacing the functionality of one of the openshift-controller controllers that has been in the code for a long time may impact behaviors that w
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Existing documentation needs to be clear on where we are today and why we are providing the above 2 credentials. Related Tracker: https://issues.redhat.com/browse/OCPBUGS-13226
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Create custom roles for GCP with minimal set of required permissions.
Enable customers to better scope credential permissions and create custom roles on GCP that only include the minimum subset of what is needed for OpenShift.
Some of the service accounts that CCO creates, e.g. the service account with the role roles/iam.serviceAccountUser, have elevated permissions that are not required/used by the requesting OpenShift components. This is because we use predefined roles for GCP that come with a bunch of additional permissions. The goal is to create custom roles with only the required permissions.
TBD
These are phase 2 items from CCO-188
Moving items from other teams that need to be committed to for 4.13 for this work to complete.
Evaluate whether any of the GCP predefined roles in the credentials request manifest of the Cluster Ingress Operator grant elevated permissions. Remove any such predefined role from the spec.predefinedRoles field and replace it with the required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
```go
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified
	// the service account will have the union of permissions from
	// both fields.
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
	// have the necessary services enabled
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
```
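As a hedged illustration of the outcome of this story, a CredentialsRequest could then carry fine-grained permissions instead of a predefined role; the permission entries below are illustrative, not the Ingress Operator's actual list:
```yaml
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: openshift-ingress-gcp
  namespace: openshift-cloud-credential-operator
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    predefinedRoles: []               # predefined role removed in favor of fine-grained permissions
    permissions:                      # illustrative entries only
    - dns.resourceRecordSets.create
    - dns.resourceRecordSets.update
    - dns.changes.create
```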
We can use the following command to check the permissions associated with a GCP predefined role:
```shell
gcloud iam roles describe <role_name>
```
The sample output for the role roles/iam.roleViewer is as follows; the permissions are listed in the "includedPermissions" field:
```
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
```
Allow configuring compute and control plane nodes across multiple subnets for on-premise IPI deployments. With nodes separated into subnets, also allow using an external load balancer, instead of the built-in one (keepalived/haproxy) that the IPI workflow installs, so that the customer can configure their own load balancer with the ingress and API VIPs pointing to nodes in the separate subnets.
I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and compute nodes across multiple subnets.
I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that come with the on-prem platforms.
Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.
Customers want the benefits of IPI and automated installations (and to avoid UPI), and at the same time, when they expect high traffic in their workloads, they design their clusters with external load balancers that hold the VIPs of the OpenShift clusters.
Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.
While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.
Epic | Control Plane with Multiple Subnets | Compute with Multiple Subnets | Doesn't need external LB | Built-in LB |
---|---|---|---|---|
NE-1069 (all platforms) | ✓ | ✓ | ✓ | ✓ |
NE-905 (all platforms) | ✓ | ✓ | ✓ | ✕ |
| ✓ | ✓ | ✓ | ✓ |
| ✓ | ✓ | ✓ | ✓ |
| ✓ | ✓ | ✓ | |
| ✓ | ✓ | ✓ | ✕ |
NE-905 (all platforms) | ✓ | ✓ | ✓ | ✕ |
| ✓ | ✓ | ✓ | ✓ |
| ✕ | ✓ | ✓ | ✓ |
| ✕ | ✓ | ✓ | ✓ |
| ✕ | ✓ | ✓ | ✓ |
Workers on separate subnets with IPI documentation
We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow
This procedure works on vSphere too, albeit with no QE CI and no documentation.
External load balancer with IPI documentation
As an OpenShift installation admin, I want to use the Assisted Installer, ZTP and IPI installation workflows to deploy a cluster that has remote worker nodes in subnets different from the local subnet, while keeping my VIPs with the built-in load-balancing services (haproxy/keepalived).
While this request is most common with OpenShift on bare metal, any platform using the ingress operator will benefit from this enhancement.
Customers using platform none run external load balancers, so they won't need this; this is specific to platforms deployed via AI, ZTP and IPI.
Customers and partners want to install remote worker nodes on day1. Due to the built-in network services we provide with Assisted Installer, ZTP and IPI that manage the VIP for ingress, we need to ensure that they remain in the local subnet where the VIPs are configured.
The bare metal IPI team added a workflow that allows placing the VIPs on the masters. While this isn't an ideal solution, it is the only option documented:
Configuring network components to run on the control plane
Make it possible to entirely disable the Ingress Operator by leveraging the OCPPLAN-9638 Composable OpenShift capability.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
RFE: https://issues.redhat.com/browse/RFE-3395
Enhancement PR: https://github.com/openshift/enhancements/pull/1415
API PR: https://github.com/openshift/api/pull/1516
Ingress Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/950
Feature Goal: Make it possible to entirely disable the Ingress Operator by leveraging the Composable OpenShift capability.
Implement the ingress capability focusing on the HyperShift users.
As described in the EP PR.
...
* Release Technical Enablement - Provide necessary release enablement details and documents.
* Ingress Operator can be disabled on HyperShift.
1. The install-config and ClusterVersion API have been updated with the capability feature.
2. The console operator.
* CI - CI is running, tests are automated and merged.
* Release Enablement <link to Feature Enablement Presentation>
* DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
* DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
* DEV - Downstream build attached to advisory: <link to errata>
* QE - Test plans in Polarion: <link or reference to Polarion>
* QE - Automated tests merged: <link or reference to automated tests>
* DOC - Downstream documentation merged: <link to meaningful PR>
Goal
The goal of this user story is to add the new (ingress) capability to the cluster operator's payload (manifests: CRDs, RBACs, deployment, etc.).
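For context, a cluster admin would ultimately opt out of the capability at install time; a hedged sketch of the install-config knob, assuming the new capability name is "Ingress":
```yaml
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:   # "Ingress" deliberately omitted, so the Ingress Operator is not installed
  - marketplace
```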
Out of scope
Acceptance criteria
Links
Support OpenShift installation in AWS Shared VPC [1] scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.
As a user, I need to use a Shared VPC [1] when installing OpenShift on AWS into an existing VPC. This will at least require the use of a preexisting Route53 hosted zone, since as the "participant" user of the shared VPC I am not allowed to automatically create Route53 private zones.
The Installer is able to successfully deploy OpenShift on AWS with a Shared VPC [1], and the cluster is able to successfully pass osde2e testing. This will include at least the scenario where the private hosted zone belongs to a different account (Account A) than the cluster resources (Account B).
[1] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-sharing.html
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
Enhancement PR: https://github.com/openshift/enhancements/pull/1397
API PR: https://github.com/openshift/api/pull/1460
Ingress Operator PR: https://github.com/openshift/cluster-ingress-operator/pull/928
Feature Goal: Support OpenShift installation in AWS Shared VPC scenario where AWS infrastructure resources (at least the Private Hosted Zone) belong to an account separate from the cluster installation target account.
The ingress operator is responsible for creating DNS records in AWS Route53 for cluster ingress. Prior to the implementation of this epic, the ingress operator doesn't have the capability to add DNS records into an existing Route 53 hosted zone in the shared VPC.
As described in the WIP PR https://github.com/openshift/cluster-ingress-operator/pull/928, the ingress operator will consume a new API field that contains the IAM Role ARN for configuring DNS records in the private hosted zone. If this field is present, then the ingress operator will use this account to create all private hosted zone records. The API fields will be described in the Enhancement PR.
The ingress operator code will accomplish this by defining a new provider implementation that wraps two other DNS providers, using one of them to publish records to the public zone and the other to publish records to the private zone.
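A hedged sketch of the new API surface on the cluster DNS config, with the field name taken from the API PR and an illustrative ARN:
```yaml
apiVersion: config.openshift.io/v1
kind: DNS
metadata:
  name: cluster
spec:
  platform:
    type: AWS
    aws:
      # role in the VPC-owning account (Account A) that the ingress operator
      # assumes to manage records in the private hosted zone
      privateZoneIAMRole: arn:aws:iam::123456789012:role/shared-vpc-dns
```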
See NE-1299
See NE-1299
Develop the implementation for supporting AWS Shared VPC pre-existing Route53 as it is described in the enhancement: https://github.com/openshift/enhancements/pull/1397
Goal: Control plane nodes in the cluster can be scaled up or down, lost and recovered, with no more importance or special procedure than that of a data plane node.
Problem: There is a lengthy special procedure to recover from a failed control plane node (or majority of nodes) and to add new control plane nodes.
Why is this important: Increased operational simplicity and scale flexibility of the cluster’s control plane deployment.
See slack working group: #wg-ctrl-plane-resize
Now that we have a CPMS generator controller, we need to document how the user will interact with the CPMS to activate it.
Not every platform needs to support failure domains, for example baremetal.
In order to support generic platforms, we could introduce a generic providerConfig abstraction that can handle the case when the failureDomains is empty.
In this case, we never manipulate the providerSpec anyway so there's no need to have a platform specific providerConfig abstraction.
Pre-Work Objectives
Since some of our requirements from the ACM team will not be available for the 4.12 timeframe, the team should work on anything we can get done in the scope of the console repo so that when the required items are available in 4.13, we can be more nimble in delivering GA content for the Unified Console Epic.
Overall GA Key Objective
Providing our customers with a single simplified User Experience (Hybrid Cloud Console) that is extensible, can run locally or in the cloud, and is capable of managing the fleet down to deep diving into a single cluster.
Why do customers want this?
Why do we want this?
Phase 2 Goal: Productization of the unified Console
As a developer I would like to disable clusters like *KS that we can't support for multi-cluster (for instance because we can't authenticate). The ManagedCluster resource has a vendor label that we can use to know whether the cluster is supported.
cc Ali Mobrem Sho Weimer Jakub Hadvig
UPDATE 9/20/22: we want an allow-list with OpenShift, ROSA, ARO, ROKS, and OpenShiftDedicated.
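For illustration, the check would key off the vendor label that ACM sets on each ManagedCluster, along the lines of:
```yaml
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: example-cluster
  labels:
    vendor: OpenShift   # allow-list: OpenShift, ROSA, ARO, ROKS, OpenShiftDedicated; anything else is disabled
```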
Acceptance criteria:
Feature Goal: Unify the management of cluster ingress with a common, open, expressive, and extensible API.
Why is this Important? Gateway API is the evolution of upstream Kubernetes Ingress APIs. The upstream project is part of Kubernetes, working under SIG-NETWORK. OpenShift is contributing to the development, building a leadership position, and preparing OpenShift to support Gateway API, with Istio as our supported implementation.
The pluggable nature of the implementation of Gateway API enables support for additional and optional 3rd-party Ingress technologies.
Functional Requirements
Non-Functional Requirements:
Open Questions:
Documentation Considerations:
User Story: As a cluster admin, I want to create a gatewayclass and a gateway, and OpenShift should configure Istio/Envoy with an LB and DNS, so that traffic can reach httproutes attached to the gateway.
The operator will be one of these (or some combination):
Functionality includes DNS (NE-1107), LoadBalancer (NE-1108), and other operations formerly performed by the cluster-ingress-operator for routers.
Requires design document or enhancement proposal, breakdown into more specific stories.
(This probably needs to be an Epic; we will move things around later to accommodate that.)
Out of scope for enhanced dev preview:
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for the deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than "audit" and "warn".
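These are the standard upstream Pod Security Admission namespace labels the synchronization mechanism manages; after the switch, "enforce" is synchronized too (namespace name illustrative):
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    pod-security.kubernetes.io/enforce: restricted   # synchronized only after the 4.15 change
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```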
Epic Goal
Get Pod Security admission to be run in "restricted" mode globally by default alongside with SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile), we must pin the required SCC to all workloads in platform namespaces (openshift-*, kube-*, default).
Each workload should pin the SCC with the least-privilege, except workloads in runlevel 0 namespaces that should pin the "privileged" SCC (SCC admission is not enabled on these namespaces, but we should pin an SCC for tracking purposes).
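Pinning is done with the openshift.io/required-scc annotation on the workload's pod template; a minimal sketch:
```yaml
spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: restricted-v2   # or "privileged" for runlevel 0 namespaces
```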
The following table tracks progress:
namespace | in review | merged |
---|---|---|
openshift-apiserver-operator | PR | |
openshift-authentication | PR | |
openshift-authentication-operator | PR | |
openshift-cloud-controller-manager | ||
openshift-cloud-controller-manager-operator | ||
openshift-cloud-credential-operator | PR | |
openshift-cloud-network-config-controller | PR | |
openshift-cluster-csi-drivers | PR1, PR2 | |
openshift-cluster-machine-approver | ||
openshift-cluster-node-tuning-operator | PR | |
openshift-cluster-samples-operator | PR | |
openshift-cluster-storage-operator | PR1, PR2 | |
openshift-cluster-version | PR | |
openshift-config-operator | PR | |
openshift-console | PR | |
openshift-console-operator | PR | |
openshift-controller-manager | PR | |
openshift-controller-manager-operator | PR | |
openshift-dns | ||
openshift-dns-operator | ||
openshift-etcd | ||
openshift-etcd-operator | ||
openshift-image-registry | PR | |
openshift-ingress | PR | |
openshift-ingress-canary | PR | |
openshift-ingress-operator | PR | |
openshift-insights | PR | |
openshift-kube-apiserver | ||
openshift-kube-apiserver-operator | ||
openshift-kube-controller-manager | ||
openshift-kube-controller-manager-operator | ||
openshift-kube-scheduler | ||
openshift-kube-scheduler-operator | ||
openshift-kube-storage-version-migrator | PR | |
openshift-kube-storage-version-migrator-operator | PR | |
openshift-machine-api | PR1, PR2, PR3, PR4, PR5, PR6 | |
openshift-machine-config-operator | PR | |
openshift-marketplace | ||
openshift-monitoring | ||
openshift-multus | ||
openshift-network-diagnostics | PR | |
openshift-network-node-identity | PR | |
openshift-network-operator | ||
openshift-oauth-apiserver | PR | |
openshift-operator-lifecycle-manager | PR | |
openshift-ovn-kubernetes | ||
openshift-route-controller-manager | PR | |
openshift-service-ca | PR | |
openshift-service-ca-operator | PR |
More details at ARO managed identity scope and impact.
This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non-MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Update the Azure Credentials Request manifest of the Cluster Ingress Operator to use the new API field for requesting permissions, to enable OCPBU-8.
This effort is dependent on the completion of work for CCO-187, and effort in dependent modules is planned to be worked on by the CCO team unless individual repo owners can help. Operator owners/teams will be expected to review merge requests and complete the appropriate QE effort for an OpenShift release.
RHEL CoreOS should be updated to RHEL 9.2 sources to take advantage of newer features, hardware support, and performance improvements.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Questions to be addressed:
PROBLEM
We would like to improve our signal for RHEL9 readiness by increasing internal engineering engagement and external partner engagement on our community OpenShift offering, OKD.
PROPOSAL
Moving OKD to run on SCOS (CentOS Stream CoreOS) brings the community offering closer to what a partner or an internal engineering team might expect on OCP.
ACCEPTANCE CRITERIA
Image has been switched/included:
DEPENDENCIES
The SCOS build payload.
RELATED RESOURCES
OKD+SCOS proposal: https://docs.google.com/presentation/d/1_Xa9Z4tSqB7U2No7WA0KXb3lDIngNaQpS504ZLrCmg8/edit#slide=id.p
OKD+SCOS work draft: https://docs.google.com/document/d/1cuWOXhATexNLWGKLjaOcVF4V95JJjP1E3UmQ2kDVzsA/edit
Acceptance Criteria
A stable OKD on SCOS is built and made available to the community every sprint.
This comes up when installing ipi-on-aws on arm64 with the custom payload build at quay.io/aleskandrox/okd-release:4.12.0-0.okd-centos9-full-rebuild-arm64, which uses SCOS as the machine-os-content image.
```
[root@ip-10-0-135-176 core]# crictl logs c483c92e118d8
2022-08-11T12:19:39+00:00 [cnibincopy] FATAL ERROR: Unsupported OS ID=scos
```
The probable fix has to land on https://github.com/openshift/cluster-network-operator/blob/master/bindata/network/multus/multus.yaml#L41-L53
Create an Azure cloud-specific spec.resourceTags entry in the Infrastructure CRD. This should create and update tags (or labels in Azure) on any OpenShift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet, and once the tags in the Infrastructure CRD are changed, all the resources should be updated accordingly.
Tag deletion continues to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
Due to the ongoing in-tree/out-of-tree split of the cloud and CSI providers, this should not apply to clusters with in-tree providers (!= "external").
Once we are confident that all components are updated, we should introduce an end-to-end test that makes sure we never create resources that are untagged.
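A sketch of what such an entry might look like, assuming the Azure entry mirrors the existing AWS resourceTags shape (the exact field name and placement are up to API review):
```
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
spec:
  platformSpec:
    type: Azure
    azure:
      resourceTags:        # proposed field; name/placement assumed
        - key: cost-center
          value: "1234"
        - key: environment
          value: test
```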
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
This epic covers the work to apply user defined tags to Azure created for openshift cluster available as tech preview.
The user should be able to define the Azure tags to be applied to resources created during cluster creation by the installer and by the other operators that manage those resources. The user will define the required tags in install-config.yaml while preparing the inputs for cluster creation. These tags will then be made available in the status sub-resource of the Infrastructure custom resource, which cannot be edited but is available for user reference and is used by the in-cluster operators for tagging when resources are created.
Updating/deleting tags added during cluster creation, or adding new tags as a Day-2 operation, is out of scope for this epic.
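A sketch of the install-config.yaml input, assuming the key is named userTags as in the related cloud tagging work (treat the exact key name as illustrative):
```
apiVersion: v1
baseDomain: example.com
metadata:
  name: mycluster
platform:
  azure:
    region: eastus
    userTags:              # assumed key name
      environment: test
      cost-center: "1234"
```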
List any affected packages or components.
Reference - https://issues.redhat.com/browse/RFE-2017
The enhancement proposed for Azure tags support in OCP requires cluster-ingress-operator to add the Azure userTags, available in the status sub-resource of the Infrastructure CR, to the Azure DNS resources it creates.
cluster-ingress-operator should add tags to the DNS records created.
Note: dnsrecords.ingress.operator.openshift.io and openshift-ingress CRD, usage to be identified.
Acceptance Criteria
To align with the 4.14 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
HyperShift came to life to serve multiple goals: some are primary near-term goals, and some are secondary goals that serve us well long-term.
HyperShift opens up doors to penetrate the market. HyperShift enables true hybrid (CP and Workers decoupled, mixed IaaS, mixed Arch,...). An architecture that opens up more options to target new opportunities in the cloud space. For more details on this one check: Hosted Control Planes (aka HyperShift) Strategy [Live Document]
To bring hosted control planes to our customers, we need the means to ship it. Today, MCE is how HyperShift is shipped and installed so that customers can use it. There are two main customers for hosted control planes:
As noted, MCE is the delivery mechanism for both management models. The difference between managed and self-managed is the consumer persona: for self-managed it's the customer SRE, for managed it's the RH SRE.
For us to ship HyperShift in the product (as hosted control planes) in either management model, there is a necessary readiness checklist that we need to satisfy. Below are the high-level requirements needed before GA:
Please also have a look at our What are we missing in Core HyperShift for GA Readiness? doc.
Multi-cluster is becoming an industry need today, not because that is where the trend is going, but because it's the only viable path today to solve many of our customers' use cases. Below is some reasoning why multi-cluster is a NEED:
As a result, multi-cluster management is a defining category in the market where Red Hat plays a key role. Today Red Hat solves for multi-cluster via RHACM and MCE. The goal is to simplify fleet management complexity by providing a single pane of glass to observe, secure, police, govern, configure a fleet. I.e., the operand is no longer one cluster but a set, a fleet of clusters.
HyperShift's logically centralized architecture, as well as its native separation of concerns and superior cluster lifecycle management experience, makes it a great fit as the foundation of our multi-cluster management story.
Thus the following stories are important for HyperShift:
Refs:
HyperShift is the core engine that will be used to provide hosted control-planes for consumption in managed and self-managed.
Main user story: When life cycling clusters as a cluster service consumer via HyperShift core APIs, I want to use a stable/backward compatible API that is less susceptible to future changes so I can provide availability guarantees.
Ref: What are we missing in Core HyperShift for GA Readiness?
Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
Assumptions:
HyperShift - proposed cuts from data plane
When operating OpenShift clusters (for any OpenShift form factor) from MCE/ACM/OCM/CLI as a Cluster Service Consumer (RH managed SRE, or self-manage SRE/admin) I want to be able to migrate CPs from one hosting service cluster to another:
More information:
To understand usage patterns and inform our decision making for the product. We need to be able to measure adoption and assess usage.
See Hosted Control Planes (aka HyperShift) Strategy [Live Document]
Whether it's managed or self-managed, it's pertinent to report health metrics so we can create meaningful Service Level Objectives (SLOs) and alert on failures to meet our availability guarantees. This is especially important for our managed services path.
https://issues.redhat.com/browse/OCPPLAN-8901
HyperShift for managed services is a strategic company goal as it improves usability, feature, and cost competitiveness against other managed solutions, and because managed services/consumption-based cloud services is where we see the market growing (customers are looking to delegate platform overhead).
We should make sure our SD milestones are unblocked by the core team.
This feature reflects HyperShift core readiness to be consumed. When all related epics and stories in this epic are complete, HyperShift can be considered ready to be consumed in GA form. This describes the readiness of core HyperShift to be consumed in GA form, not a date and not the GA itself.
- GA date for self-managed will be factoring in other inputs such as adoption, customer interest/commitment, and other factors.
- GA dates for ROSA-HyperShift are on track, tracked in milestones M1-7 (have a look at https://issues.redhat.com/browse/OCPPLAN-5771)
Epic Goal*
The goal is to split client certificate trust chains from the global Hypershift root CA.
Why is this important? (mandatory)
This is important to:
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
The Hypershift team needs to provide us with code reviews and merge the changes we deliver.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
The serviceaccount CA bundle automatically injected into all pods cannot be used to authenticate any client certificate generated by the control plane.
Drawbacks or Risk (optional)
Risk: time pressure is significant, as this should be delivered before the first stable Hypershift release.
Done - Checklist (mandatory)
AUTH-311 introduced an enhancement. Implement the signer separation described there.
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
Phases 1 & 2 cover implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
Phase 2 Goal:
For Phase 1, incorporate the assets from different repositories to simplify asset management.
There must be no negative effect on customers/users of the MAPI; this API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, remains open.
The cluster-capi-operator repository contains several CAPI E2E tests for specific providers.
We run these tests on every PR that lands in that repository.
In order to test rebases of the cluster-api providers, we want to run these tests there as well, to prove that rebase PRs do not break CAPI functionality.
DoD:
Currently, SCCs are part of the OpenShift API and are subject to modifications by customers. This leads to a constant stream of issues:
We need to find and implement schemes to protect core workloads while retaining the API guarantee for modifications of SCCs (unfortunately).
To align with the 4.15 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.15 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
Extend OpenShift on IBM Cloud integration with additional features to pair the capabilities offered for this provider integration to the ones available in other cloud platforms.
Extend the existing features while deploying OpenShift on IBM Cloud.
This top-level feature is going to be used as a placeholder for the IBM team, which is working on new features for this integration, in an effort to keep their existing internal backlog in sync with the corresponding Features/Epics in Red Hat's Jira.
A user is currently unable to create a Disconnected cluster on IBM Cloud using IPI.
Currently, support for BYON and Private clusters does exist on IBM Cloud, but support for overriding IBM Cloud service endpoints does not; that support is required for Disconnected clusters to function (reaching IBM Cloud private endpoints).
IBM-dependent components of OCP will need to add support for a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.
The Ingress Operator components will need to direct all API calls to IBM Cloud Services to these endpoint values, in order to communicate in environments where the public or default IBM Cloud service endpoints are not available.
The endpoint overrides are available via the infrastructure/cluster (.status.platformStatus.ibmcloud.serviceEndpoints) resource, which is how the majority of components consume cluster-specific configuration (Ingress, MAPI, etc.). It is structured as follows:
```
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
        - name: iam
          url: https://private.us-east.iam.cloud.ibm.com
        - name: vpc
          url: https://us-east.private.iaas.cloud.ibm.com/v1
        - name: resourcecontroller
          url: https://private.us-east.resource-controller.cloud.ibm.com
        - name: resourcemanager
          url: https://private.us-east.resource-controller.cloud.ibm.com
        - name: cis
          url: https://api.private.cis.cloud.ibm.com
        - name: dnsservices
          url: https://api.private.dns-svcs.cloud.ibm.com/v1
        - name: cis
          url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud
```
The CCM is currently relying on updates to the openshift-cloud-controller-manager/cloud-conf configmap, in order to override its required IBM Cloud Service endpoints, such as:
```
data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com
```
Installer validates and injects user provided endpoint overrides into cluster deployment process and the Ingress Operator components use specified endpoints and start up properly.
As an OpenShift admin who wants to make my OCP cluster more secure and stable, I want to prevent anyone from scheduling their workloads on master nodes, so that master nodes run only OCP management-related workloads.
Secure OCP master nodes by preventing customer workloads from being scheduled on them.
Anyone applying toleration(s) in a pod spec can unintentionally tolerate the master taints that protect master nodes from receiving application workloads. An admission plugin needs to be configured to protect master nodes from this scenario. Besides taints/tolerations, users can also set spec.nodeName directly, which this plugin should also protect master nodes against.
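To illustrate the problem, a blanket toleration with operator: Exists and no key matches every taint, including node-role.kubernetes.io/master:NoSchedule, so a pod like the following sketch would be allowed onto master nodes (names are illustrative):
```
apiVersion: v1
kind: Pod
metadata:
  name: sneaky-workload    # illustrative
spec:
  containers:
    - name: app
      image: example.com/app:latest
  tolerations:
    # No key plus operator Exists tolerates every taint,
    # including the master NoSchedule taint.
    - operator: Exists
```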
The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike.
Some customer cases have revealed scenarios where the MCO state reporting is misleading and therefore could be unreliable to base decisions and automation on.
In addition to correcting some incorrect states, the MCO will be enhanced for a more granular view of update rollouts across machines.
For this epic, "state" means "what is the MCO doing?" – so the goal here is to try to make sure that it's always known what the MCO is doing.
This includes:
While this probably crosses a little bit into the "status" portion of certain MCO objects, as some state is definitely recorded there, this probably shouldn't turn into a "better status reporting" epic. I'm interpreting "status" to mean "how is it going" so status is maybe a "detail attached to a state".
Exploration here: https://docs.google.com/document/d/1j6Qea98aVP12kzmPbR_3Y-3-meJQBf0_K6HxZOkzbNk/edit?usp=sharing
https://docs.google.com/document/d/17qYml7CETIaDmcEO-6OGQGNO0d7HtfyU7W4OMA6kTeM/edit?usp=sharing
The current property description is:
configuration represents the current MachineConfig object for the machine config pool.
But in a 4.12.0-ec.4 cluster, the actual semantics seem to be something closer to "the most recent rendered config that we completely leveled on". We should at least update the godocs to be more specific about the intended semantics. And perhaps consider adjusting the semantics?
A customer has escalated the following issues where ports don't have TLS support. This feature request lists all the component ports raised by the customer.
Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
Currently, we are serving the metrics as http on 9191, and via TLS on 9192.
We need to make sure the metrics are only available on 9192 via TLS.
Related to https://issues.redhat.com/browse/RFE-4665
CMA currently exposes metrics on two ports via the 0.0.0.0 all hosts binding. We need to make sure that only the TLS port is accessible from outside localhost.
To align with the 4.16 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
Goal:
As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.
Problem:
While cloud-based DNS services provide convenient hostname management, there are a number of regulatory (ITAR) and operational constraints customers face that prohibit the use of those DNS hosting services on public cloud providers.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing
This epic details the work required to augment this CoreDNS pod to also resolve the *.apps URLs. In addition, it includes changes to prevent the Ingress Operator from configuring the cloud DNS after the ingress LBs have been created.
Reduce the OpenShift platform and associated RH provided components to a single physical core on Intel Sapphire Rapids platform for vDU deployments on SingleNode OpenShift.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Provide a mechanism to tune the platform to use only one physical core. | Users need to be able to tune different platforms. | YES
Allow for full zero touch provisioning of a node with the minimal core budget configuration. | Node provisioned with SNO Far Edge provisioning method - i.e. ZTP via RHACM, using DU Profile. | YES
Platform meets all MVP KPIs | | YES
Questions to be addressed:
Description of problem:
After running tests on an SNO with the Telco DU profile for a couple of hours, kubernetes.io/kubelet-serving CSRs in Pending state start showing up and accumulate over time.
Version-Release number of selected component (if applicable):
4.14.0-ec.3
How reproducible:
So far on 2 different environments
Steps to Reproduce:
1. Deploy SNO with Telco DU profile
2. Run system tests
3. Check CSRs status
Actual results:
```
oc get csr | grep Pending | wc -l
34
```
Expected results:
No Pending CSRs
Additional info:
This issue blocks retrieving pod logs. Attaching must-gather and sosreport after manually approving CSRs.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
This is epic tracks "business as usual" requirements / enhancements / bug fixing of Insights Operator.
Today the links point at a rule-scoped page, but that page lacks information about recommended resolution. You can click through by cluster ID to your specific cluster and get that recommendation advice, but it would be more convenient and less confusing for customers if we linked directly to the cluster-scoped recommendation page.
We can implement by updating the template here to be:
```
fmt.Sprintf("https://console.redhat.com/openshift/insights/advisor/clusters/%s?first=%s%%7C%s", clusterID, ruleIDStr, rec.ErrorKey)
```
or something like that.
unknowns
request is clear, solution/implementation to be further clarified
This story only covers API components. We will create a separate story for other utility functions.
Today we are generating documentation for Console's Dynamic Plugin SDK in frontend/packages/dynamic-plugin-sdk. We are missing ts-doc for a set of hooks and components.
We are generating the markdown from the dynamic-plugin-sdk using `yarn generate-doc`.
Here is the list of the API that the dynamic-plugin-sdk is exposing:
https://gist.github.com/spadgett/0ddefd7ab575940334429200f4f7219a
Acceptance Criteria:
Out of Scope:
Move `frontend/public/components/nav` to `packages/console-app/src/components/nav` and address any issues resulting from the move.
There will be some expected lint errors relating to cyclical imports. These will require some refactoring to address.
Currently the ConsolePlugins API version is v1alpha1. Since we are going GA with dynamic plugins we should be creating a v1 version.
This would require updates in following repositories:
AC:
NOTE: This story does not include the conversion webhook change which will be created as a follow on story
`@openshift-console/plugin-shared` (NPM) is a package that will contain shared components that can be upversioned separately by the Plugins so they can keep core compatibility low but upversion and support more shared components as we need them.
This isn't documented today. We need to do that.
We neither use nor support static plugin nav extensions anymore so we should remove the API in the static plugin SDK and get rid of related cruft in our current nav components.
AC: Remove static plugin nav extensions code. Check the navigation code for any references to the old API.
The console has good error boundary components that are useful for dynamic plugins.
Exposing them will enable the plugins to get the same look and feel for handling React errors as the console.
The minimum requirement right now is to expose the ErrorBoundaryFallbackPage component from
https://github.com/openshift/console/blob/master/frontend/packages/console-shared/src/components/error/fallbacks/ErrorBoundaryFallbackPage.tsx
During the development of https://issues.redhat.com/browse/CONSOLE-3062, it was determined additional information is needed in order to assist a user when troubleshooting a Failed plugin (see https://github.com/openshift/console/pull/11664#issuecomment-1159024959). As it stands today, there is no data available to the console to relay to the user regarding why the plugin Failed. Presumably, a message should be added to NotLoadedDynamicPlugin to address this gap.
AC: Add `message` property to NotLoadedDynamicPluginInfo type.
We should have a global notification or the `Console plugins` page (e.g., k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins) should alert users when console operator `spec.managementState` is `Unmanaged` as changes to `enabled` for plugins will have no effect.
The extension `console.dashboards/overview/detail/item` doesn't constrain the content to fit the card.
The details-card has an expectation that a <dd> item will be the last item (for spacing between items). Our static details-card items use a component called 'OverviewDetailItem'. This isn't enforced in the extension and can cause undesired padding issues if they just do whatever they want.
I feel our approach here should be making the extension take the props of 'OverviewDetailItem' where 'children' is the new 'component'.
When defining two proxy endpoints,
```
apiVersion: console.openshift.io/v1alpha1
kind: ConsolePlugin
metadata:
  ...
  name: forklift-console-plugin
spec:
  displayName: Console Plugin Template
  proxy:
    service:
      basePath: /
```
I get two proxy endpoints
/api/proxy/plugin/forklift-console-plugin/forklift-inventory
and
/api/proxy/plugin/forklift-console-plugin/forklift-must-gather-api
but both proxy to the `forklift-must-gather-api` service
e.g.
curl to:
[server url]/api/proxy/plugin/forklift-console-plugin/forklift-inventory
will point to the `forklift-must-gather-api` service, instead of the `forklift-inventory` service
Based on API review CONSOLE-3145, we have decided to deprecate the following APIs:
cc Andrew Ballantyne Bryan Florkiewicz
Currently our `api.md` does not generate docs with "tags" (aka `@deprecated`) – we'll need to add that functionality to the `generate-doc.ts` script. See the code that works for `console-extensions.md`
Following https://coreos.slack.com/archives/C011BL0FEKZ/p1650640804532309, it would be useful for us (network observability team) to have access to ResourceIcon in dynamic-plugin-sdk.
Currently ResourceLink is exported but not ResourceIcon
AC:
To align with https://github.com/openshift/dynamic-plugin-sdk, plugin metadata field dependencies as well as the @console/pluginAPI entry contained within should be made optional.
If a plugin doesn't declare the @console/pluginAPI dependency, the Console release version check should be skipped for that plugin.
Acceptance Criteria: Add missing API docs for *Icon and *Status components in the API docs.
This enhancement Introduces support for provisioning and upgrading heterogenous architecture clusters in phases.
We need to scan through the compute nodes and build a set of supported architectures from them. Each node in the cluster has a label for its architecture, e.g. `kubernetes.io/arch=arm64`, `kubernetes.io/arch=amd64`, etc. Based on the set of supported architectures, the console will need to surface in the OperatorHub only those operators that are supported on our nodes. Each operator's PackageManifest contains labels that indicate the operator's supported architectures, e.g. `operatorframework.io/arch.s390x: supported`. An operator can be supported on multiple architectures.
AC:
OS and arch filtering: https://github.com/openshift/console/blob/2ad4e17d76acbe72171407fc1c66ca4596c8aac4/frontend/packages/operator-lifecycle-manager/src/components/operator-hub/operator-hub-items.tsx#L49-L86
@jpoulin is good to ask about heterogeneous clusters.
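For illustration, the architecture-support labels on an operator's PackageManifest follow the scheme quoted above; a sketch (names and values illustrative):
```
apiVersion: packages.operators.coreos.com/v1
kind: PackageManifest
metadata:
  name: example-operator   # illustrative
  labels:
    # An operator can declare support for multiple architectures
    operatorframework.io/arch.amd64: supported
    operatorframework.io/arch.arm64: supported
    operatorframework.io/os.linux: supported
```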
This enhancement Introduces support for provisioning and upgrading heterogenous architecture clusters in phases.
We need to scan through the compute nodes and build a set of supported architectures from those. Each node on the cluster has a label for architecture: e.g. kubernetes.io/arch=arm64, kubernetes.io/arch=amd64 etc. Based on the set of supported architectures console will need to surface only those operators in the Operator Hub, which are supported on our Nodes.
AC:
@jpoulin is good to ask about heterogeneous clusters.
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
As a developer, I want to be able to clean up the css markup after making the css / scss changes required for dark mode and remove any old unused css / scss content.
Acceptance criteria:
1. Proposed title of this feature request
Basic authentication for Helm Chart repository in helmchartrepositories.helm.openshift.io CRD.
2. What is the nature and description of the request?
As of v4.6.9, the HelmChartRepository CRD only supports client TLS authentication through spec.connectionConfig.tlsClientConfig.
3. Why do you need this? (List the business requirements here)
Basic authentication is widely used by many chart repository managers (Nexus OSS, Artifactory, etc.).
The Helm CLI also supports it with the helm repo add command.
https://helm.sh/docs/helm/helm_repo_add/
4. How would you like to achieve this? (List the functional requirements here)
Probably by extending the CRD:
```
spec:
  connectionConfig:
    username: username
    password:
      secretName: secret-name
```
The secret namespace should be openshift-config to align with the tlsClientConfig behavior.
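A sketch of the referenced secret under this proposal (shape illustrative; only the password lives in the secret, matching the fields above):
```
apiVersion: v1
kind: Secret
metadata:
  name: secret-name           # matches spec.connectionConfig.password.secretName
  namespace: openshift-config
type: Opaque
stringData:
  password: s3cr3t            # illustrative value
```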
5. For each functional requirement listed in question 4, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.
Trying to pull Helm charts from remote private chart repositories that have disabled anonymous access and offer basic authentication.
E.g.: https://github.com/sonatype/docker-nexus
As an OCP user I would like to be able to install Helm charts from repos added to ODC with the basic authentication fields populated.
We need to support helm installs for Repos that have the basic authentication secret name and namespace.
Updating the ProjectHelmChartRepository CRD, already done in diff story
Support the HelmChartRepository CR; this feature will be scoped first to project/namespace-scoped repos.
If the new fields for basic auth are set in the repo CR, then use those credentials when making API calls to Helm to install/upgrade charts. We will error out if the logged-in user does not have access to the secret referenced by the repo CR. If the basic auth fields are not present, we assume it is not an authenticated repo.
None
NA
I can list, install and update charts on authenticated repos from ODC
Needs Documentation both upstream and downstream
Needs new unit test covering repo auth
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
This is a follow-up epic to https://issues.redhat.com/browse/MCO-144, which aimed to get in-place upgrades working for Hypershift. This epic aims to capture additional work focused on using CoreOS/OCP layering in Hypershift, which has benefits such as:
- removing or reducing the need for ignition
- maintaining feature parity between self-driving and managed OCP models
- adding additional functionality such as hotfixes
Right now in https://github.com/openshift/hypershift/pull/1258 you can only perform one upgrade at a time. Multiple upgrades will break due to controller logic
Properly create logic to handle manifest creation/updates and deletion, so the logic is more bulletproof
Currently not implemented, and will require the MCD hypershift mode to be adjusted to handle disruptionless upgrades like regular MCD
We plan to build Ironic Container Images using RHEL9 as base image in OCP 4.12
This is required because the ironic components have abandoned support for CentOS Stream 8 and Python 3.6/3.7 upstream during the most recent development cycle that will produce the stable Zed release, in favor of CentOS Stream 9 and Python 3.8/3.9
More info on RHEL8 to RHEL9 transition in OCP can be found at https://docs.google.com/document/d/1N8KyDY7KmgUYA9EOtDDQolebz0qi3nhT20IOn4D-xS4
update ironic software to pick up latest bug fixes
This is an API change and we will consider this as a feature request.
https://issues.redhat.com/browse/NE-799 Please check this for more details
No
N/A
As a developer, I want to have the most recent version of the testing framework, with all the fancy features like JUnit reporting.
We are widely using Ginkgo across our components; v1 was deprecated some time ago, so we need to update.
Based on the updates in https://github.com/openshift/cluster-api-actuator-pkg/pull/258, we would like to update the test suites within this repository to use Ginkgo V2.
This will include updating the hack scripts to make sure that:
Make sure that the CSI driver automatically updates oVirt credentials when they are updated in OpenShift.
In the CSI driver operator we should add the `withSecretHashAnnotation` call from library-go, like this: https://github.com/openshift/aws-ebs-csi-driver-operator/blob/53ed27b2a0eaa655338da180a79897855b366ac7/pkg/operator/starter.go#L138
We need tests for the ovirt-csi-driver and the cluster-api-provider-ovirt. These tests help us to
Also, having dedicated tests on lower levels with a smaller scope (unit, integration, ...) has the following benefits:
Integration tests need to be implemented according to https://cluster-api.sigs.k8s.io/developer/testing.html#integration-tests using envtest.
As a user, I would like to be informed in an intuitive way, when quotas have been reached in a namespace
Refer below for more details
As a user, In the topology view, I would like to be updated intuitively if any of the deployments have reached quota limits
Refer below for more details
Provide a form driven experience to allow cluster admins to manage the perspectives to meet the ACs below.
We have heard the following requests from customers and developer advocates:
As an admin, I want to hide the admin perspective for non-privileged users or hide the developer perspective for all users
Based on the https://issues.redhat.com/browse/ODC-6730 enhancement proposal, it is required to extend the console configuration CRD to enable the cluster admins to configure this data in the console resource
Previous customization work:
As an admin, I should be able to see a code snippet that shows how to add user perspectives
Based on the https://issues.redhat.com/browse/ODC-6732 enhancement proposal, the cluster admin can add user perspectives
To support the cluster-admin in configuring the perspectives correctly, the developer console should provide a code snippet for the customization of the YAML resource (Console CRD); see the sketch below.
Customize Perspective Enhancement PR: https://github.com/openshift/enhancements/pull/1205
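A sketch of what that snippet could look like, assuming the enhancement adds a perspectives list under spec.customization (field names taken from the proposal and to be treated as illustrative):
```
apiVersion: operator.openshift.io/v1
kind: Console
metadata:
  name: cluster
spec:
  customization:
    perspectives:
      - id: dev            # hide the Developer perspective
        visibility:
          state: Disabled
```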
Previous work:
As an admin, I want to be able to use a form driven experience to hide user perspective(s)
As an admin, I want to hide user perspective(s) based on the customization.
Customers don't want their users to have access to some/all of the items which are available in the Developer Catalog. The request is to change access for the cluster, not per user or persona.
Provide a form driven experience to allow cluster admins easily disable the Developer Catalog, or one or more of the sub catalogs in the Developer Catalog.
Multiple customer requests.
We need to consider how this will work with subcatalogs which are installed by operators: VMs, Event Sources, Event Catalogs, Managed Services, Cloud based services
As a cluster-admin, I should be able to see a code snippet that shows how to enable sub-catalogs or the entire dev catalog.
Based on the https://issues.redhat.com/browse/ODC-6732 enhancement proposal, the cluster admin can add sub-catalog(s) from the Developer Catalog or the Dev catalog as a whole.
To support the cluster-admin in configuring the sub-catalog list correctly, the developer console should provide a code snippet for the customization YAML resource (Console CRD); see the sketch below.
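A sketch of what that snippet could look like, assuming the enhancement adds a developerCatalog.types field under spec.customization (field names from the proposal; illustrative):
```
apiVersion: operator.openshift.io/v1
kind: Console
metadata:
  name: cluster
spec:
  customization:
    developerCatalog:
      types:
        state: Disabled    # disable only the listed sub-catalogs
        disabled:
          - HelmChart      # illustrative sub-catalog types
          - EventSource
```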
Previous work:
As an admin, I want to hide/disable access to specific sub-catalogs in the developer catalog or the complete dev catalog for all users across all namespaces.
Based on the https://issues.redhat.com/browse/ODC-6732 enhancement proposal, it is required to extend the console configuration CRD to enable the cluster admins to configure this data in the console resource
Extend the "customization" spec type definition for the CRD in the openshift/api project
Previous customization work:
As an admin, I want to hide sub-catalogs in the developer catalog or hide the developer catalog completely based on the customization.
As an admin, I would like openshift-* namespaces with an operator to be labeled with security.openshift.io/scc.podSecurityLabelSync=true to ensure the continued functioning of operators without manual intervention. The label should only be applied to openshift-* namespaces that contain an operator (i.e., where a ClusterServiceVersion resource is present), and only if the label is not already present. This automation will help the smooth functioning of the cluster and avoid frivolous operational events.
Context: As part of the PSA migration period, Openshift will ship with the "label sync'er" - a controller that will automatically adjust PSA security profiles in response to the workloads present in the namespace. We can assume that not all operators (produced by Red Hat, the community or ISVs) will have successfully migrated their deployments in response to upstream PSA changes. The label sync'er will sync, by default, any namespace not prefixed with "openshift-", of which an explicit label (security.openshift.io/scc.podSecurityLabelSync=true) is required for sync.
A/C:
- OLM operator has been modified (downstream only) to label any unlabelled "openshift-" namespace in which a CSV has been created
- If a labeled namespace containing at least one non-copied csv becomes unlabelled, it should be relabelled
- The implementation should be done in a way to eliminate or minimize subsequent downstream sync work (it is ok to make slight architectural changes to the OLM operator in the upstream to enable this)
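The end result on a qualifying namespace would be the label from the description above; a sketch (namespace name illustrative):
```
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-example-operator   # illustrative operator namespace
  labels:
    security.openshift.io/scc.podSecurityLabelSync: "true"
```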
As an SRE, I want the hypershift operator to expose a metric when a hosted control plane is ready.
This should allow SREs to tune (or silence) alerts occurring while the hosted control plane is spinning up.
The Kube APIServer has a sidecar to output audit logs. We need similar sidecars for other APIServers that run on the control plane side. We also need to pass the same audit log policy that we pass to the KAS to these other API servers.
This epic tracks network tooling improvements for 4.12
A new framework and process should be developed to make sharing network tools with devs, support, and customers convenient. We are going to add some tools for OVN troubleshooting before OVN-Kubernetes goes default, some tools that came out of customer cases, and some more to help analyze and debug collected logs, based on the stable must-gather/sosreport format we now get thanks to the 4.11 epic.
Our estimation for this Epic is 1 engineer * 2 Sprints
WHY:
This epic is important to help improve the time it takes our customers and our team to understand an issue within the cluster.
A focus of this epic is to develop tools to quickly allow debugging of a problematic cluster. This is crucial for the engineering team to help us scale. We want to provide a tool to our customers to help lower the cognitive burden to get at a root cause of an issue.
Alert if any of the OVN controllers has been disconnected from the southbound database for a period of time, using the metric ovn_controller_southbound_database_connected.
The metric updates every 2 minutes, so please be mindful of this when creating the alert.
If the controller is disconnected for 10 minutes, fire an alert.
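A sketch of such an alerting rule, assuming the metric reports 1 when connected and 0 when disconnected (alert name and namespace are illustrative):
```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ovn-controller-sbdb           # illustrative
  namespace: openshift-ovn-kubernetes
spec:
  groups:
    - name: ovn-controller.rules
      rules:
        - alert: OVNControllerDisconnectedSouthboundDatabase
          # The metric updates every ~2 minutes, so `for: 10m` still spans
          # several samples before the alert fires.
          expr: ovn_controller_southbound_database_connected == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: ovn-controller has been disconnected from the southbound database for 10 minutes.
```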
DoD: Merged to CNO and tested by QE
Add a SOCKS proxy to cluster-network-operator so egress IP can use gRPC to reach worker nodes.
With the introduction of gRPC as the means for determining the state of a given egress node, hypershift should be able to leverage the SOCKS proxy and become able to know the state of each egress node.
References relevant to this work:
1281-network-proxy
https://coreos.slack.com/archives/C01C8502FMM/p1658427627751939
https://github.com/openshift/hypershift/pull/1131/commits/28546dc587dc028dc8bded715847346ff99d65ea
This Epic is here to track the rebase we need to do when kube 1.25 is GA https://www.kubernetes.dev/resources/release/
Keeping this in mind can help us plan our time better. At the time of writing, GA is planned for August 23.
https://docs.google.com/document/d/1h1XsEt1Iug-W9JRheQas7YRsUJ_NQ8ghEMVmOZ4X-0s/edit --> this is the link for rebase help
We need to rebase cloud network config controller to 1.25 when the kube 1.25 rebase lands.
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
In the current version, the router does not support loading secrets directly; it uses the route resource to load the private key and certificates, exposing these security artifacts.
Acceptance criteria :
Update cluster-ingress-operator to bootstrap router with required featuregates
The cluster-ingress-operator will propagate the relevant Tech-Preview feature gate down to the router. This feature gate will be added as a command-line argument called ROUTER_EXTERNAL_CERTIFICATE to the router and will not be user configurable.
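A sketch of the intended user-facing shape, assuming the Tech Preview API adds a tls.externalCertificate secret reference on the Route (field name from the proposal; illustrative):
```
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example              # illustrative
spec:
  host: example.apps.cluster.example.com
  to:
    kind: Service
    name: example
  tls:
    termination: edge
    externalCertificate:
      name: example-tls      # TLS secret in the route's namespace
```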
Refer:
Acceptance criteria
Placeholder epic to track spontaneous tasks that do not deserve their own epic.
Once the HostedCluster and NodePool are paused using the PausedUntil field, the awsprivatelink controller continues reconciling.
How to test this:
DoD:
At the moment, if the input etcd KMS encryption (key and role) is invalid, we fail silently.
We should check that both the key and the role are compatible/operational for a given cluster, and fail with a condition otherwise.
Changes made in METAL-1 open up opportunities to improve our handling of images by cleaning up redundant code that generates extra work for the user and extra load for the cluster.
We only need to run the image cache DaemonSet if there is a QCOW URL to be mirrored (effectively this means a cluster installed with 4.9 or earlier). We can stop deploying it for new clusters installed with 4.10 or later.
Currently, the image-customization-controller relies on the image cache running on every master to provide the shared hostpath volume containing the ISO and initramfs. The first step is to replace this with a regular volume and an init container in the i-c-c pod that extracts the images from machine-os-images. We can use the copy-metal -image-build flag (instead of -all used in the shared volume) to provide only the required images.
Once i-c-c has its own volume, we can switch the image extraction in the metal3 Pod's init container to use the -pxe flag instead of -all.
The machine-os-images init container for the image cache (not the metal3 Pod) can be removed. The whole image cache deployment is now optional and need only be started if provisioningOSDownloadURL is set (and in fact should be deleted if it is not).
Description of the problem:
When running assisted-installer on a machine with more than one volume group per physical volume, only the first volume group will be cleaned up. This leads to problems later and to errors such as:
```
Failed - failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- pvremove /dev/sda -y -ff], Error exit status 5, LastOutput "Can't open /dev/sda exclusively. Mounted filesystem?
```
How reproducible:
Set up a VM with more than one volume group per physical volume. As an example, look at the following sample from a customer cluster.
List block devices:
```
/usr/bin/lsblk -o NAME,MAJ:MIN,SIZE,TYPE,FSTYPE,KNAME,MODEL,UUID,WWN,HCTL,VENDOR,STATE,TRAN,PKNAME
NAME MAJ:MIN SIZE TYPE FSTYPE KNAME MODEL UUID WWN HCTL VENDOR STATE TRAN PKNAME
loop0 7:0 125.9G loop xfs loop0 c080b47b-2291-495c-8cc0-2009ebc39839
loop1 7:1 885.5M loop squashfs loop1
sda 8:0 894.3G disk sda INTEL SSDSC2KG96 0x55cd2e415235b2db 1:0:0:0 ATA running sas
|-sda1 8:1 250M part sda1 0x55cd2e415235b2db sda
|-sda2 8:2 750M part ext2 sda2 3aa73c72-e342-4a07-908c-a8a49767469d 0x55cd2e415235b2db sda
|-sda3 8:3 49G part xfs sda3 ffc3ccfe-f150-4361-8ae5-f87b17c13ac2 0x55cd2e415235b2db sda
|-sda4 8:4 394.2G part LVM2_member sda4 Ua3HOc-Olm4-1rma-q0Ug-PtzI-ZOWg-RJ63uY 0x55cd2e415235b2db sda
`-sda5 8:5 450G part LVM2_member sda5 W8JqrD-ZvaC-uNK9-Y03D-uarc-Tl4O-wkDdhS 0x55cd2e415235b2db sda
  `-nova-instance 253:0 3.1T lvm ext4 dm-0 d15e2de6-2b97-4241-9451-639f7b14594e running sda5
sdb 8:16 894.3G disk sdb INTEL SSDSC2KG96 0x55cd2e415235b31b 1:0:1:0 ATA running sas
`-sdb1 8:17 894.3G part LVM2_member sdb1 6ETObl-EzTd-jLGw-zVNc-lJ5O-QxgH-5wLAqD 0x55cd2e415235b31b sdb
  `-nova-instance 253:0 3.1T lvm ext4 dm-0 d15e2de6-2b97-4241-9451-639f7b14594e running sdb1
sdc 8:32 894.3G disk sdc INTEL SSDSC2KG96 0x55cd2e415235b652 1:0:2:0 ATA running sas
`-sdc1 8:33 894.3G part LVM2_member sdc1 pBuktx-XlCg-6Mxs-lddC-qogB-ahXa-Nd9y2p 0x55cd2e415235b652 sdc
  `-nova-instance 253:0 3.1T lvm ext4 dm-0 d15e2de6-2b97-4241-9451-639f7b14594e running sdc1
sdd 8:48 894.3G disk sdd INTEL SSDSC2KG96 0x55cd2e41521679b7 1:0:3:0 ATA running sas
`-sdd1 8:49 894.3G part LVM2_member sdd1 exVSwU-Pe07-XJ6r-Sfxe-CQcK-tu28-Hxdnqo 0x55cd2e41521679b7 sdd
  `-nova-instance 253:0 3.1T lvm ext4 dm-0 d15e2de6-2b97-4241-9451-639f7b14594e running sdd1
sr0 11:0 989M rom iso9660 sr0 Virtual CDROM0 2022-06-17-18-18-33-00 0:0:0:0 AMI running usb
```
Now run the assisted installer and try to install an SNO node on this machine, you will find that the installation will fail with a message that indicates that it could not exclusively access /dev/sda
Actual results:
The installation will fail with a message that indicates that it could not exclusively access /dev/sda
Expected results:
The installation should proceed and the cluster should start to install.
Suspected Cases
https://issues.redhat.com/browse/AITRIAGE-3809
https://issues.redhat.com/browse/AITRIAGE-3802
https://issues.redhat.com/browse/AITRIAGE-3810
Description of the problem:
Cluster Installation fail if installation disk has lvm on raid:
```
Host: test-infra-cluster-3cc862c9-master-0, reached installation stage Failed: failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- mdadm --stop /dev/md0], Error exit status 1, LastOutput "mdadm: Cannot get exclusive access to /dev/md0:Perhaps a running process, mounted filesystem or active volume group?"
```
How reproducible:
100%
Steps to reproduce:
1. Install a cluster whose master nodes have a disk with LVM on RAID (reproduces using test: https://gitlab.cee.redhat.com/ocp-edge-qe/kni-assisted-installer-auto/-/blob/master/api_tests/test_disk_cleanup.py#L97)
Actual results:
Installation failed
Expected results:
Installation success
Same as what we had in assisted-service: we sometimes fail to install golangci-lint by fetching release artifacts from GitHub directly. That is usually because the same IP address (the CI build cluster) tries to access GitHub at a high rate, leading to 429 (too many requests).
The way we fixed it for assisted-service is changing installation to use quay.io image that is already built with the binary.
Example for such a failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/30788/rehearse-30788-periodic-ci-openshift-assisted-installer-agent-release-ocm-2.6-subsystem-test-periodic/1551879759036682240
Filter for all recent failures: https://search.ci.openshift.org/?search=golangci%2Fgolangci-lint+crit+unable+to+find&maxAge=168h&context=1&type=build-log&name=.*assisted.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Section 5 of PRD: https://docs.google.com/document/d/1fF-Ajdzc9EDDg687FzTrX577hvY9NdK0/edit#heading=h.gjdgxs
Testing and collaboration with NVIDIA: https://docs.google.com/spreadsheets/d/1LHY-Af-2kQHVwtW4aVdHnmwZLTiatiyf-ySffC8O5NM/edit#gid=0
Deploying Nvidia Patches: https://docs.google.com/document/d/1yR4lphjPKd6qZ9sGzZITl0wH1r4ykfMKPjUnlzvWji4/edit#
This is the continuation of https://issues.redhat.com/browse/NHE-273, but now the focus is on the remaining flows.
Description of problem:
check_pkt_length cannot be offloaded without 1) sFlow offload patches in Open vSwitch and 2) hardware driver support. Since 1) will not be done anytime soon, we need a workaround for the check_pkt_length issue.
Version-Release number of selected component (if applicable):
4.11/4.12
How reproducible:
Always
Steps to Reproduce:
1. Any flow that has check_pkt_len():
5-b: Pod -> NodePort Service traffic (Pod Backend - Different Node)
6-b: Pod -> NodePort Service traffic (Host Backend - Different Node)
4-b: Pod -> Cluster IP Service traffic (Host Backend - Different Node)
10-b: Host Pod -> Cluster IP Service traffic (Host Backend - Different Node)
11-b: Host Pod -> NodePort Service traffic (Pod Backend - Different Node)
12-b: Host Pod -> NodePort Service traffic (Host Backend - Different Node)
Actual results:
Poor performance due to upcalls when check_pkt_len() is not supported.
Expected results:
Good performance.
Additional info:
https://docs.google.com/spreadsheets/d/1LHY-Af-2kQHVwtW4aVdHnmwZLTiatiyf-ySffC8O5NM/edit#gid=670206692
As a developer building container images on OpenShift
I want to specify that my build should run without elevated privileges
So that builds do not run as root from the host's perspective with elevated privileges
No QE required for Dev Preview. OpenShift regression testing will verify that existing behavior is not impacted.
We will need to document how to enable this feature, with sufficient warnings regarding Dev Preview.
This likely warrants an OpenShift blog post, potentially?
We have been running into a number of problems with configure-ovs and nodeip-configuration selecting different interfaces in OVNK deployments. This causes connectivity issues, so we need some way to ensure that everything uses the same interface/IP.
Currently configure-ovs runs before nodeip-configuration, but since nodeip-configuration is the source of truth for IP selection regardless of CNI plugin, I think we need to look at swapping that order. That way configure-ovs could look at what nodeip-configuration chose and not have to implement its own interface selection logic.
I'm targeting this at 4.12 because even though there's probably still time to get it in for 4.11, changing the order of boot services is always a little risky and I'd prefer to do it earlier in the cycle so we have time to tease out any issues that arise. We may need to consider backporting the change though since this has been an issue at least back to 4.10.
Goal
Provide an indication that advanced features are used
Problem
Today, customers and RH don't have the information on the actual usage of advanced features.
Why is this important?
Prioritized Scenarios
In Scope
1. Add a boolean variable in our telemetry to mark if the customer is using advanced features (PV encryption, encryption with KMS, external mode).
Not in Scope
Integrate with subscription watch - will be done by the subscription watch team with our help.
Customers
All
Customer Facing Story
As a compliance manager, I should be able to easily see if all my clusters are using the right amount of subscriptions
What does success look like?
A clear indication in subscription watch for ODF usage (either essential or advanced).
Link to main epic: https://issues.redhat.com/browse/RHSTOR-3173
We migrated most components as part of https://issues.redhat.com/browse/RHSTOR-2165.
We now have a few components remaining, roughly 15 to 20%. This epic targets:
1) Add support for the in-tree modal launcher
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of problem:
An intra-namespace allow network policy doesn't work after applying ingress and egress deny-all network policies.
Version-Release number of selected component (if applicable):
OpenShift 4.10.12
How reproducible:
Always
Steps to Reproduce:
1. Define a deny-all network policy for egress and ingress in a namespace:
```
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
2. Define the following network policy to allow the traffic between the pods in the namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-intra-namespace-001
spec:
  egress:
    - to:
        - podSelector: {}
  ingress:
    - from:
        - podSelector: {}
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
3. Test the connectivity between two pods from the namespace.
Actual results:
The connectivity is not allowed
Expected results:
The connectivity should be allowed between pods from the same namespace.
Additional info:
After performing a test and analyzing SDN flows for the namespace:
sh-4.4# ovs-ofctl dump-flows -O OpenFlow13 br0 | grep --color 0x964376
cookie=0x0, duration=99375.342s, table=20, n_packets=14, n_bytes=588, priority=100,arp,in_port=21,arp_spa=10.128.2.20,arp_sha=00:00:0a:80:02:14/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=1681.845s, table=20, n_packets=11, n_bytes=462, priority=100,arp,in_port=24,arp_spa=10.128.2.23,arp_sha=00:00:0a:80:02:17/00:00:ff:ff:ff:ff actions=load:0x964376->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=99375.342s, table=20, n_packets=135610, n_bytes=759239814, priority=100,ip,in_port=21,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=1681.845s, table=20, n_packets=2006, n_bytes=12684967, priority=100,ip,in_port=24,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=99375.342s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.20 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=1681.845s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.2.23 actions=load:0x964376->NXM_NX_REG0[],goto_table:27
cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
cookie=0x0, duration=99375.342s, table=70, n_packets=145260, n_bytes=11722173, priority=100,ip,nw_dst=10.128.2.20 actions=load:0x964376->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80
cookie=0x0, duration=1681.845s, table=70, n_packets=2336, n_bytes=191079, priority=100,ip,nw_dst=10.128.2.23 actions=load:0x964376->NXM_NX_REG1[],load:0x18->NXM_NX_REG2[],goto_table:80
cookie=0x0, duration=975.129s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=output:NXM_NX_REG2[]
We see that the following rule doesn't match because `reg1` hasn't been defined:
cookie=0x0, duration=975.129s, table=27, n_packets=0, n_bytes=0, priority=150,reg0=0x964376,reg1=0x964376 actions=goto_table:30
Description of problem:
Git icon shown in the repository details page should be based on the git provider.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Create a Repository with gitlab repo url
2. Navigate to the detail page.
Actual results:
github icon is displayed for the gitlab url.
Expected results:
gitlab icon should be displayed for the gitlab url.
Additional info:
use `GitLabIcon` and `BitBucketIcon` from patternfly react-icons.
[release-4.12] bp ovnkube-trace changes to 4.12
As mentioned in AITRIAGE-3520, multiple attempts to grab controller logs might fail at some point and override existing logs.
In the case of the ticket mentioned above, we were able to retrieve the controller logs from the log server. However, this might not always be the case for other clusters.
We need to find a way to preserve all logs, or to time out log collection differently.
One way to handle this is to write the logs to a file inside the container; if kube-api is not reachable, we can read the logs from that file (see the sketch below).
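A minimal Go sketch of that fallback, using only the standard library; the file path is illustrative, not the actual implementation:

package main

import (
    "io"
    "log"
    "os"
)

func main() {
    // Mirror everything written through the standard logger to both stdout
    // and a file inside the container, so the file survives even when log
    // retrieval through kube-api fails.
    f, err := os.OpenFile("/var/log/controller/controller.log",
        os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
    if err != nil {
        log.Fatalf("cannot open fallback log file: %v", err)
    }
    defer f.Close()
    log.SetOutput(io.MultiWriter(os.Stdout, f))
    log.Println("controller starting") // goes to both destinations
}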
Description of problem:
If you set a service's cluster IP to an IP with a leading zero (e.g. 192.168.0.011), ovn-k should normalise this and remove the leading zero before sending it to OVN.
I saw this on a CI run executing the k8s test at test/e2e/network/funny_ips.go +75.
You can reproduce it using that test.
Have a read of the text there:
// What are funny IPs:
// The adjective is because of the curl blog that explains the history and the problem of liberal
// parsing of IP addresses and the consequences and security risks caused the lack of normalization,
// mainly due to the use of different notations to abuse parsers misalignment to bypass filters.
// xref: https://daniel.haxx.se/blog/2021/04/19/curl-those-funny-ipv4-addresses/
//
// Since golang 1.17, IPv4 addresses with leading zeros are rejected by the standard library.
// xref: https://github.com/golang/go/issues/30999
//
// Because this change on the parsers can cause that previous valid data become invalid, Kubernetes
// forked the old parsers allowing leading zeros on IPv4 address to not break the compatibility.
//
// Kubernetes interprets leading zeros on IPv4 addresses as decimal, users must not rely on parser
// alignment to not being impacted by the associated security advisory: CVE-2021-29923 golang
// standard library "net" - Improper Input Validation of octal literals in golang 1.16.2 and below
// standard library "net" results in indeterminate SSRF & RFI vulnerabilities. xref:
// https://nvd.nist.gov/vuln/detail/CVE-2021-29923
northd is logging an error about this also:
|socket_util|ERR|172.30.0.011:7180: bad IP address "172.30.0.011"
...
2022-08-23T14:14:21.968Z|01839|ovn_util|WARN|bad ip address or port for load balancer key 172.30.0.011:7180
Also, I see the error:
E0823 14:14:34.135115 3284 gateway_shared_intf.go:600] Failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip: failed to delete conntrack entry for service e2e-funny-ips-8626/funny-ip with svcVIP 172.30.0.011, svcPort 7180, protocol TCP: value "<nil>" passed to DeleteConntrack is not an IP address
We should normalise the IPs before sending them to OVN. There is also a conntrack error when it tries to use this bad IP.
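A minimal Go sketch of the kind of normalisation meant here, interpreting leading-zero octets as decimal the way the Kubernetes forked parser does; this is illustrative, not the actual ovn-k code:

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// normalizeIPv4 rewrites a dotted-quad address so that leading-zero octets
// are interpreted as decimal (e.g. "172.30.0.011" -> "172.30.0.11").
func normalizeIPv4(s string) (string, error) {
    parts := strings.Split(s, ".")
    if len(parts) != 4 {
        return "", fmt.Errorf("not an IPv4 address: %q", s)
    }
    for i, p := range parts {
        n, err := strconv.Atoi(p) // Atoi treats "011" as decimal 11
        if err != nil || n < 0 || n > 255 {
            return "", fmt.Errorf("bad octet %q in %q", p, s)
        }
        parts[i] = strconv.Itoa(n)
    }
    return strings.Join(parts, "."), nil
}

func main() {
    out, _ := normalizeIPv4("172.30.0.011")
    fmt.Println(out) // 172.30.0.11
}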
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. See above k8 test
Actual results:
Leading zero IP sent to OVN
Expected results:
No leading zero IP sent to OVN
Additional info:
Description of problem:
The Alertmanager silence create / edit form got a new "Negative matcher" option in 4.12 (see https://issues.redhat.com/browse/OCPBUGSM-47734). However, there is nothing to explain what this option means and it will likely not be obvious from the label alone unless you are already quite familiar with Alertmanager. After discussion with the docs team, it was decided that adding some explanation in context in the UI would be much better than adding an explanation to the documentation.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Go to the Admin perspective
2. Go to the Observe > Alerting > Silences page
3. Click on the Create button ("Negative matcher" option is shown with no explanation)
Actual results:
Expected results:
Additional info:
We need to rebase openshift-sdn to kube 1.25's kube-proxy.
In particular, we need this to get https://github.com/kubernetes/kubernetes/pull/110334 into master because we will probably get asked to backport it.
This is a clone of issue OCPBUGS-860. The following is the description of the original issue:
—
Description of problem:
In GCP, once an external IP address is assigned to a master/infra node through the GCP console, the number of pending CSRs from kubernetes.io/kubelet-serving keeps increasing, and the following errors are reported:
I0902 10:48:29.254427 1 controller.go:121] Reconciling CSR: csr-q7bwd
I0902 10:48:29.365774 1 csr_check.go:157] csr-q7bwd: CSR does not appear to be client csr
I0902 10:48:29.371827 1 csr_check.go:545] retrieving serving cert from build04-c92hb-master-1.c.openshift-ci-build-farm.internal (10.0.0.5:10250)
I0902 10:48:29.375052 1 csr_check.go:188] Found existing serving cert for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
I0902 10:48:29.375152 1 csr_check.go:192] Could not use current serving cert for renewal: CSR Subject Alternate Name values do not match current certificate
I0902 10:48:29.375166 1 csr_check.go:193] Current SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5], CSR SAN Values: [build04-c92hb-master-1.c.openshift-ci-build-farm.internal 10.0.0.5 35.211.234.95]
I0902 10:48:29.375175 1 csr_check.go:202] Falling back to machine-api authorization for build04-c92hb-master-1.c.openshift-ci-build-farm.internal
E0902 10:48:29.375184 1 csr_check.go:420] csr-q7bwd: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.375193 1 csr_check.go:205] Could not use Machine for serving cert authorization: IP address '35.211.234.95' not in machine addresses: 10.0.0.5
I0902 10:48:29.379457 1 csr_check.go:218] Falling back to serving cert renewal with Egress IP checks
I0902 10:48:29.382668 1 csr_check.go:221] Could not use current serving cert and egress IPs for renewal: CSR Subject Alternate Names includes unknown IP addresses
I0902 10:48:29.382702 1 controller.go:233] csr-q7bwd: CSR not authorized
Version-Release number of selected component (if applicable):
4.11.2
Steps to Reproduce:
1. Assign external IPs to master/infra nodes in GCP
2. oc get csr | grep kubernetes.io/kubelet-serving
Actual results:
CSRs are not approved
Expected results:
CSRs are approved
Additional info:
This issue only happens in GCP; the same OpenShift installations on AWS do not have it. It looks like the CSRs are created using the external IP addresses once they are assigned. Ref: https://coreos.slack.com/archives/C03KEQZC1L2/p1662122007083059
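A minimal Go sketch of the machine-api authorization step the log points at: every IP SAN in the kubelet serving CSR must appear in the Machine's known addresses, so an externally assigned IP makes approval fail. This is an illustrative shape, not the actual machine-approver code:

package main

import (
    "fmt"
    "net"
)

// checkServingCSRIPs rejects a serving CSR whose IP SANs include an address
// the Machine does not know about (e.g. a GCP external IP added later).
func checkServingCSRIPs(csrIPs []net.IP, machineAddrs []string) error {
    known := make(map[string]bool, len(machineAddrs))
    for _, a := range machineAddrs {
        known[a] = true
    }
    for _, ip := range csrIPs {
        if !known[ip.String()] {
            return fmt.Errorf("IP address '%s' not in machine addresses", ip)
        }
    }
    return nil
}

func main() {
    err := checkServingCSRIPs(
        []net.IP{net.ParseIP("10.0.0.5"), net.ParseIP("35.211.234.95")},
        []string{"10.0.0.5"},
    )
    fmt.Println(err) // IP address '35.211.234.95' not in machine addresses
}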
This is a clone of issue OCPBUGS-10558. The following is the description of the original issue:
—
Description of problem:
When running a cluster on application credentials, this event appears repeatedly:
ns/openshift-machine-api machineset/nhydri0d-f8dcc-kzcwf-worker-0 hmsg/173228e527 - pathological/true reason/ReconcileError could not find information for "ci.m1.xlarge"
Version-Release number of selected component (if applicable):
How reproducible:
Happens in the CI (https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/33330/rehearse-33330-periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.13-e2e-openstack-ovn-serial/1633149670878351360).
Steps to Reproduce:
1. On a living cluster, rotate the OpenStack cloud credentials
2. Invalidate the previous credentials
3. Watch the machine-api events (`oc -n openshift-machine-api get event`). A `Warning` event of the form could not find information for "name-of-the-flavour" will appear.
If the cluster was installed using a password that you can't invalidate:
1. Rotate the cloud credentials to application credentials
2. Restart MAPO (`oc -n openshift-machine-api get pods -o NAME | xargs -r oc -n openshift-machine-api delete`)
3. Rotate cloud credentials again
4. Revoke the first application credentials you set
5. Finally watch the events (`oc -n openshift-machine-api get event`)
The event signals that MAPO wasn't able to update flavour information on the MachineSet status.
Actual results:
Expected results:
No issue detecting the flavour details
Additional info:
Offending code likely around this line: https://github.com/openshift/machine-api-provider-openstack/blob/bcb08a7835c08d20606d75757228fd03fbb20dab/pkg/machineset/controller.go#L116
Description of problem:
Tests failure when running dev-console tests locally.
Version-Release number of selected component (if applicable):
At least on 4.11 and 4.12
How reproducible:
Always
Steps to Reproduce:
1. Start cypress: yarn run test-cypress-dev-console
2. Run add-page
Actual results:
Fails
Expected results:
Should pass
Additional info:
This is a clone of issue OCPBUGS-6049. The following is the description of the original issue:
—
Description of problem:
We show the UpdateInProgress component (the progress bars) when the cluster update status is Failing, UpdatingAndFailing, or Updating. The inclusion of the Failing case results in a bug where the progress bars can display when an update is not occurring (see attached screenshot).
Steps to Reproduce:
1. Add the following overrides to the ClusterVersion config (/k8s/cluster/config.openshift.io~v1~ClusterVersion/version):
spec:
  overrides:
    - group: apps
      kind: Deployment
      name: console-operator
      namespace: openshift-console-operator
      unmanaged: true
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      name: console-operator
      namespace: ''
      unmanaged: true
2. Wait for ClusterVersion changes to roll out.
3. Visit /settings/cluster and note the progress bars are present and displaying 100% but the cluster is not updating.
Actual results:
Progress bars are displaying when not updating.
Expected results:
Progress bars should not display when an update is not occurring.
I saw the following while trying to debug the following "unexpectedly found multiple equivalent ACLs" error.
Add a generic networkpolicy:
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-same-namespace
  namespace: nbc9-demo-project
spec:
  podSelector: {}
  ingress:
$ kubectl get pod ovnkube-master-pk89w -o jsonpath='{range .spec.containers[]} {@.image}'
quay.io/openshift/okd-content@sha256:79ee71e045a7b224a132f6c75b4220ec35b9a06049061a6bd9ca9fc976c412e5
[root@dev-nkjpp-master-2 ~]# ovnkube -v
I0609 17:33:34.930787 58 ovs.go:93] Maximum command line arguments set to: 191102
Version: 0.3.0
Git commit: 7bf36eea28fe66365d0dfdf8c39e3311ea14d19b
Git branch: release-4.10
Go version: go1.16.6
Build date: 2022-05-27
OS/Arch: linux amd64
Which then fails to apply, retries, and when the networkpolicy is deleted, the ovnkube-master pod segfaults:
I0609 17:00:26.653710 1 policy.go:1092] Adding network policy allow-same-namespace in namespace nbc9-demo-project
E0609 17:00:26.656858 1 ovn.go:753] Failed to create network policy nbc9-demo-project/allow-same-namespace, error: failed to create default port groups and acls for policy: nbc9-demo-project/allow-same-namespace, error: unexpectedly found multiple equivalent ACLs: [
{UUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0010df390 Name:0xc0010df3d0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc0010df3e0}
]
I0609 17:00:51.437895 1 policy_retry.go:46] Network Policy Retry: nbc9-demo-project/allow-same-namespace retry network policy setup
I0609 17:00:51.437935 1 policy_retry.go:63] Network Policy Retry: Creating new policy for nbc9-demo-project/allow-same-namespace
I0609 17:00:51.437941 1 policy.go:1092] Adding network policy allow-same-namespace in namespace nbc9-demo-project
I0609 17:00:51.438174 1 policy_retry.go:65] Network Policy Retry create failed for nbc9-demo-project/allow-same-namespace, will try again later: failed to create default port groups and acls for policy: nbc9-demo-project/allow-same-namespace, error: unexpectedly found multiple equivalent ACLs: [
{UUID:7b55ba0c-150f-4a63-9601-cfde25f29408 Action:drop Direction:from-lport ExternalIDs:map[default-deny-policy-type:Egress] Label:0 Log:false Match:inport == @a7830797310894963783_egressDefaultDeny Meter:0xc0022b0310 Name:0xc0022b03a0 Options:map[apply-after-lb:true] Priority:1000 Severity:0xc000070ab0}
]
I0609 17:01:02.679219 1 policy.go:1174] Deleting network policy allow-same-namespace in namespace nbc9-demo-project
E0609 17:01:02.679407 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 249 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1c19c80, 0x2e9a810)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1c19c80, 0x2e9a810)
/usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc0022c2000, 0x0, 0xc000bb9000, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc0022c2000, 0xc002544f00, 0x0, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x43f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func4(0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:800 +0xae
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnDelete(0xc000f4c4c0, 0x2160f10, 0xc002f498c0, 0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:288 +0x6a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3.1(0xc00463dbf0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:340 +0x65
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc0002c61b0, 0x1e7e840, 0xc002544f00, 0xc003dc9d60)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:114 +0x156
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3(0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:339 +0x1b2
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:779 +0x166
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002367760)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003dc9f60, 0x2127a00, 0xc000229a70, 0x1bd5d01, 0xc000039740)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc002367760, 0x3b9aca00, 0x0, 0x1, 0xc000039740)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0004f3180)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x95
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0002bed80, 0xc000ed5850)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x65
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a021d5]
goroutine 249 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1c19c80, 0x2e9a810)
/usr/lib/golang/src/runtime/panic.go:965 +0x1b9
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).destroyNetworkPolicy(0xc0022c2000, 0x0, 0xc000bb9000, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1210 +0x55
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).deleteNetworkPolicy(0xc0022c2000, 0xc002544f00, 0x0, 0x0, 0x0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/policy.go:1198 +0x43f
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNetworkPolicy.func4(0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/ovn/ovn.go:800 +0xae
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnDelete(0xc000f4c4c0, 0x2160f10, 0xc002f498c0, 0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:288 +0x6a
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3.1(0xc00463dbf0)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:340 +0x65
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc0002c61b0, 0x1e7e840, 0xc002544f00, 0xc003dc9d60)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:114 +0x156
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).newFederatedHandler.func3(0x1e7e840, 0xc002544f00)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:339 +0x1b2
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:245
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:779 +0x166
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc002367760)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc003dc9f60, 0x2127a00, 0xc000229a70, 0x1bd5d01, 0xc000039740)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc002367760, 0x3b9aca00, 0x0, 0x1, 0xc000039740)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0004f3180)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771 +0x95
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0002bed80, 0xc000ed5850)
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73 +0x51
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
/go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x65
Please let me know if any further information is required. I have a must-gather for this cluster but the file attachment tool in bugzilla won't let me attach anything larger than 19.5MB (the must-gather is 212.1MB)
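One plausible reading of the trace: destroyNetworkPolicy at policy.go:1210 dereferences a nil internal policy object because deletion ran after a creation that never succeeded (the "multiple equivalent ACLs" failure above). A hedged Go sketch of the defensive shape one would expect there; the names are illustrative, and this is not the actual ovn-kubernetes fix:

package main

import "fmt"

// networkPolicy stands in for ovn-kubernetes' internal per-policy state.
type networkPolicy struct{ name string }

// destroyNetworkPolicySafe tears down internal state only if creation ever
// completed; a nil np means there is nothing to destroy, avoiding the
// nil-pointer dereference seen in the trace above.
func destroyNetworkPolicySafe(np *networkPolicy) error {
    if np == nil {
        // Creation failed earlier, so no internal state was ever recorded.
        return nil
    }
    fmt.Printf("destroying %s\n", np.name)
    return nil
}

func main() {
    _ = destroyNetworkPolicySafe(nil) // no panic
}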
job=pull-ci-openshift-origin-master-e2e-gcp-builds=all
This test has started permafailing on e2e-gcp-builds:
[sig-builds][Feature:Builds][Slow] s2i build with environment file in sources Building from a template should create a image from "test-env-build.json" template and run it in a pod [apigroup:build.openshift.io][apigroup:image.openshift.io]
The error in the test says
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:21 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulling: Pulling image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8"
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Successfully pulled image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" in 1.763914719s
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Created: Created container test
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:23 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Started: Started container test
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:24 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Pulled: Container image "image-registry.openshift-image-registry.svc:5000/e2e-test-build-sti-env-nglnt/test@sha256:262820fd1a94d68442874346f4c4024fdf556631da51cbf37ce69de094f56fe8" already present on machine
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:25 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} Unhealthy: Readiness probe failed: Get "http://10.129.2.63:8080/": dial tcp 10.129.2.63:8080: connect: connection refused
Sep 13 07:03:30.345: INFO: At 2022-09-13 07:00:26 +0000 UTC - event for build-test-pod: {kubelet ci-op-kg1t2x13-4e3c6-7hrm8-worker-a-66nwd} BackOff: Back-off restarting failed container
Description of problem:
In a 4.11 cluster with only openshift-samples enabled, the optional COs console and insights introduced in 4.12 are installed. While upgrading to 4.12, CVO considers them to be explicitly disabled and skips reconciling them, so these COs are not upgraded to 4.12. Installed COs cannot be disabled, so CVO is supposed to implicitly enable them.
$ oc get clusterversion -oyaml
{
  "apiVersion": "config.openshift.io/v1",
  "kind": "ClusterVersion",
  "metadata": {
    "creationTimestamp": "2022-09-30T05:02:31Z",
    "generation": 3,
    "name": "version",
    "resourceVersion": "134808",
    "uid": "bd95473f-ffda-402d-8fe3-74f852a9d6eb"
  },
  "spec": {
    "capabilities": {
      "additionalEnabledCapabilities": [ "openshift-samples" ],
      "baselineCapabilitySet": "None"
    },
    "channel": "stable-4.11",
    "clusterID": "8eda5167-a730-4b39-be1d-214a80506d34",
    "desiredUpdate": {
      "force": true,
      "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
      "version": ""
    }
  },
  "status": {
    "availableUpdates": null,
    "capabilities": {
      "enabledCapabilities": [ "openshift-samples" ],
      "knownCapabilities": [ "Console", "Insights", "Storage", "baremetal", "marketplace", "openshift-samples" ]
    },
    "conditions": [
      { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.11\" channel", "reason": "VersionNotFound", "status": "False", "type": "RetrievedUpdates" },
      { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Capabilities match configured spec", "reason": "AsExpected", "status": "False", "type": "ImplicitlyEnabledCapabilities" },
      { "lastTransitionTime": "2022-09-30T05:02:33Z", "message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"", "reason": "PayloadLoaded", "status": "True", "type": "ReleaseAccepted" },
      { "lastTransitionTime": "2022-09-30T05:23:18Z", "message": "Done applying 4.12.0-0.nightly-2022-09-28-204419", "status": "True", "type": "Available" },
      { "lastTransitionTime": "2022-09-30T07:05:42Z", "status": "False", "type": "Failing" },
      { "lastTransitionTime": "2022-09-30T07:41:53Z", "message": "Cluster version is 4.12.0-0.nightly-2022-09-28-204419", "status": "False", "type": "Progressing" }
    ],
    "desired": {
      "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc",
      "version": "4.12.0-0.nightly-2022-09-28-204419"
    },
    "history": [
      { "completionTime": "2022-09-30T07:41:53Z", "image": "registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc", "startedTime": "2022-09-30T06:42:01Z", "state": "Completed", "verified": false, "version": "4.12.0-0.nightly-2022-09-28-204419" },
      { "completionTime": "2022-09-30T05:23:18Z", "image": "registry.ci.openshift.org/ocp/release@sha256:5a6f6d1bf5c752c75d7554aa927c06b5ea0880b51909e83387ee4d3bca424631", "startedTime": "2022-09-30T05:02:33Z", "state": "Completed", "verified": false, "version": "4.11.0-0.nightly-2022-09-29-191451" }
    ],
    "observedGeneration": 3,
    "versionHash": "CSCJ2fxM_2o="
  }
}
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      93m
cloud-controller-manager                   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h56m
cloud-credential                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h59m
cluster-autoscaler                         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h53m
config-operator                            4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m
console                                    4.11.0-0.nightly-2022-09-29-191451   True        False         False      3h45m
control-plane-machine-set                  4.12.0-0.nightly-2022-09-28-204419   True        False         False      117m
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m
dns                                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h53m
etcd                                       4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h52m
image-registry                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h46m
ingress                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      151m
insights                                   4.11.0-0.nightly-2022-09-29-191451   True        False         False      3h48m
kube-apiserver                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h50m
kube-controller-manager                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h51m
kube-scheduler                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h51m
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-28-204419   True        False         False      91m
machine-api                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h50m
machine-approver                           4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m
machine-config                             4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h52m
monitoring                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h44m
network                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h55m
node-tuning                                4.12.0-0.nightly-2022-09-28-204419   True        False         False      113m
openshift-apiserver                        4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m
openshift-controller-manager               4.12.0-0.nightly-2022-09-28-204419   True        False         False      113m
openshift-samples                          4.12.0-0.nightly-2022-09-28-204419   True        False         False      116m
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h48m
service-ca                                 4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m
storage                                    4.12.0-0.nightly-2022-09-28-204419   True        False         False      3h54m
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-28-204419
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.11 cluster with only openshift-samples enabled 2. Upgrade to 4.12 3.
Actual results:
The 4.12 introduced optional CO console and insights are not upgraded to 4.12
Expected results:
All the installed COs get upgraded
Additional info:
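A minimal Go sketch of the expected CVO behavior, under the assumption stated above that installed capabilities cannot be disabled and must therefore be implicitly enabled; this is illustrative, not the actual cluster-version-operator code:

package main

import "fmt"

// effectiveCapabilities unions the configured capability set with whatever
// is already installed, since installed capabilities cannot be disabled.
func effectiveCapabilities(configured, installed []string) []string {
    seen := map[string]bool{}
    var out []string
    for _, c := range append(configured, installed...) {
        if !seen[c] {
            seen[c] = true
            out = append(out, c)
        }
    }
    return out
}

func main() {
    fmt.Println(effectiveCapabilities(
        []string{"openshift-samples"},
        []string{"openshift-samples", "Console", "Insights"},
    )) // [openshift-samples Console Insights]
}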
In the Known Issues section of the OpenStack-specific Installer docs, there is a point about control plane anti-affinity.
The known issue has several problems:
Description of problem:
During OCP multinode spoke cluster creation, agent provisioning is stuck on "configuring" because the machineConfig service is crashing on the node.
After restarting, the service still fails with:
Can't read link "/var/lib/containers/storage/overlay/l/V2OP2CCVMKSOHK2XICC546DUCG" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption.
Version-Release number of selected component (if applicable):
Podman 4.0.2 +
How reproducible:
sometimes
Steps to Reproduce:
1. Deploy a multinode spoke (iPXE + boot order)
Actual results:
4 agents in done state and 1 is in "configuring"
Expected results:
all agents are in "done" state
Additional info:
issue mentioned in https://github.com/containers/podman/issues/14003
Fix: https://github.com/containers/storage/issues/1136
Description of problem:
On the alert details page and alerting rule details page, clicking on a field that has a popover help throws an uncaught JavaScript error.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Go to Observe > Alerting pages 2. Click on an alert (or go to the rules tab then click on a rule) 3. Click on one of the underlined fields (those that have a popover help)
Actual results:
Expected results:
Additional info:
This bug is a backport clone of [Bugzilla Bug 1948666](https://bugzilla.redhat.com/show_bug.cgi?id=1948666). The following is the description of the original bug:
—
Description of problem:
When users try to deploy an application from the git method on the dev console, it throws a warning message for specific public repos: `URL is valid but cannot be reached. If this is a private repository, enter a source secret in Advanced Git Options.` If we ignore the warning and go ahead, the build is successful, although the warning message seems to be misleading.
Actual results:
Getting a warning for the URL while trying to deploy an application from a public repo using the git method on the dev console
Expected results:
It should show validated
Description of problem:
Since the decommissioning of the psi cluster, and the subsequent move of the rhcos release browser, product machine-os-images builds have been failing. See e.g. https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=47565717
Version-Release number of selected component (if applicable):
4.12, 4.11, 4.10.
How reproducible:
Have ART build the image
Steps to Reproduce:
1. Have ART build the image
Actual results:
Build failure
Expected results:
Build successful
Additional info:
Description of problem:
When using the OnDelete update method of the ControlPlaneMachineSet, it should not be possible to have multiple machines in the Running phase in the same machine index. E.g., if machine-1 is in the Running phase, we should not have a machine-replacement-1 also in the Running phase.
Version-Release number of selected component (if applicable):
4.12 / main branch
How reproducible:
unsure, this is currently not tested in the code and is difficult to produce
Steps to Reproduce:
1. Set up a cluster with CPMS in OnDelete update mode
2. Rename one of the master machines to have the same index as another, or manually create a machine to match (this step might be difficult to reproduce)
3. Observe the logs from the CPMS operator
Actual results:
No errors are emitted about the extra machine, although perhaps others are. The operator does not degrade.
Expected results:
an error should be produced and the operator should go degraded
Additional info:
This bug is slightly predictive: we have not observed this condition, but we have detected a gap in the code that might make it possible.
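A hedged Go sketch of the missing check described above: flag any machine index that holds more than one machine in the Running phase. The types and names are illustrative, not the actual CPMS operator code:

package main

import "fmt"

type machine struct {
    Index int
    Phase string
}

// duplicateRunningIndexes returns every index with more than one machine in
// the Running phase, which should make the operator error and go degraded.
func duplicateRunningIndexes(machines []machine) []int {
    running := map[int]int{}
    for _, m := range machines {
        if m.Phase == "Running" {
            running[m.Index]++
        }
    }
    var dups []int
    for idx, n := range running {
        if n > 1 {
            dups = append(dups, idx)
        }
    }
    return dups
}

func main() {
    ms := []machine{{1, "Running"}, {1, "Running"}, {2, "Running"}}
    fmt.Println(duplicateRunningIndexes(ms)) // [1]
}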
This is a clone of issue OCPBUGS-5523. The following is the description of the original issue:
Description of problem:
The catalog pod restarts frequently, with about one stack trace daily:
~~~
$ omc logs catalog-operator-f7477865d-x6frl -p
2023-01-04T13:05:15.175952229Z time="2023-01-04T13:05:15Z" level=info msg=syncing event=update reconciling="*v1alpha1.Subscription" selflink=
2023-01-04T13:05:15.175952229Z fatal error: concurrent map read and map write
2023-01-04T13:05:15.178587884Z
2023-01-04T13:05:15.178674833Z goroutine 669 [running]:
2023-01-04T13:05:15.179284556Z runtime.throw({0x1efdc12, 0xc000580000})
2023-01-04T13:05:15.179458107Z /usr/lib/golang/src/runtime/panic.go:1198 +0x71 fp=0xc00559d098 sp=0xc00559d068 pc=0x43bcd1
2023-01-04T13:05:15.179707701Z runtime.mapaccess1_faststr(0x7f39283dd878, 0x10, {0xc000894c40, 0xf})
2023-01-04T13:05:15.179932520Z /usr/lib/golang/src/runtime/map_faststr.go:21 +0x3a5 fp=0xc00559d100 sp=0xc00559d098 pc=0x418ca5
2023-01-04T13:05:15.180181245Z github.com/operator-framework/operator-lifecycle-manager/pkg/metrics.UpdateSubsSyncCounterStorage(0xc00545cfc0)
~~~
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Slack discussion: https://redhat-internal.slack.com/archives/C3VS0LV41/p1673120541153639 MG link - https://attachments.access.redhat.com/hydra/rest/cases/03396604/attachments/25f23643-2447-442b-ba26-4338b679b8cc?usePresignedUrl=true
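The fatal error points at an unsynchronized map read in the metrics counter. A hedged Go sketch of the standard remedy, guarding the shared map with a mutex; the types are illustrative, not the actual OLM code:

package main

import (
    "fmt"
    "sync"
)

// subSyncCounter guards its map so concurrent reconcile goroutines cannot
// trigger "concurrent map read and map write".
type subSyncCounter struct {
    mu     sync.RWMutex
    counts map[string]int
}

func (c *subSyncCounter) Inc(key string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.counts[key]++
}

func (c *subSyncCounter) Get(key string) int {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.counts[key]
}

func main() {
    c := &subSyncCounter{counts: map[string]int{}}
    c.Inc("default/my-subscription")
    fmt.Println(c.Get("default/my-subscription")) // 1
}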
Description of problem:
QE has one vSphere 6.7 U3 env where the privilege "InventoryService.Tagging.ObjectAttachable" does not exist, and the installer fails as below:
FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.vsphere.defaultDatastore: Internal error: privileges missing for vSphere vCenter Datastore: InventoryService.Tagging.ObjectAttachable
As vSphere 6.7 U3 is deprecated but not removed, it should still be supported; users may hit a similar issue on 6.7 U3 when fresh installing.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-30-142847
How reproducible:
always
Steps to Reproduce:
1. Create a role for each vSphere object and assign the listed privileges to it based on the installation doc, then add a permission to each object with the created role and user
2. Install an IPI cluster on the vSphere platform as this user
3. Installer fails and complains that the privilege "InventoryService.Tagging.ObjectAttachable" is missing
Actual results:
Installer fails and complains that missing privilege "InventoryService.Tagging.ObjectAttachable"
Expected results:
Installer should succeed.
Additional info:
`aws-ebs-csi-driver-operator` runs in the mgmt cluster and does not need to be configured with the guest cluster proxy
hypershift proxy conformance test currently fails because the `storage` CO never becomes `Available`
https://k8s-testgrid.appspot.com/redhat-hypershift#4.12-conformance-aws-ovn-proxy
This is a clone of issue OCPBUGS-7837. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Description of problem:
Users on a disconnected cluster with a proxy could not import a Devfile (from GitHub).
The API call /api/devfile/ takes 30 seconds until it fails with 504 Gateway timeout.
Version-Release number of selected component (if applicable):
This might have been happening since 4.8.
So far, tested only on 4.12.0-0.nightly-2022-09-07-112008.
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
The console Pod log contains this error:
E0909 10:28:18.448680 1 devfile-handler.go:74] Failed to parse devfile: failed to populateAndParseDevfile: Get "https://registry.devfile.io/devfiles/go": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
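A hedged Go sketch of the usual remedy: route outbound requests through the cluster proxy by honoring the standard proxy environment variables. This is illustrative; the console's actual devfile handler may be structured differently:

package main

import (
    "context"
    "net/http"
    "time"
)

func main() {
    // ProxyFromEnvironment picks up HTTPS_PROXY/HTTP_PROXY/NO_PROXY, so the
    // request can leave a disconnected cluster instead of hanging until the
    // 30s context deadline.
    client := &http.Client{
        Transport: &http.Transport{Proxy: http.ProxyFromEnvironment},
    }
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    req, _ := http.NewRequestWithContext(ctx, http.MethodGet,
        "https://registry.devfile.io/devfiles/go", nil)
    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
}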
This is a clone of issue OCPBUGS-3990. The following is the description of the original issue:
—
Description of problem:
This PR fails HyperShift CI with:
=== RUN TestAutoscaling/EnsureNoPodsWithTooHighPriority
util.go:411: pod csi-snapshot-controller-7bb4b877b4-q5457 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000
util.go:411: pod csi-snapshot-webhook-644b6dbfb-v4lj7 with priorityClassName system-cluster-critical has a priority of 2000000000 with exceeds the max allowed of 100002000
How reproducible:
always
Steps to Reproduce:
Alternatively, ci/prow/e2e-aws in https://github.com/openshift/hypershift/pull/1698 and https://github.com/openshift/hypershift/pull/1748 must pass.
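A hedged Go sketch of the invariant the test output above enforces, using client-go's core v1 types; the function shape is illustrative, not the actual HyperShift test util:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

const maxAllowedPriority = 100002000

// ensureNoPodsWithTooHighPriority mirrors the check: every pod with a
// priority set must stay at or below the allowed ceiling.
func ensureNoPodsWithTooHighPriority(pods []corev1.Pod) error {
    for _, pod := range pods {
        if pod.Spec.Priority != nil && *pod.Spec.Priority > maxAllowedPriority {
            return fmt.Errorf("pod %s with priorityClassName %s has a priority of %d which exceeds the max allowed of %d",
                pod.Name, pod.Spec.PriorityClassName, *pod.Spec.Priority, maxAllowedPriority)
        }
    }
    return nil
}

func main() { fmt.Println(ensureNoPodsWithTooHighPriority(nil)) }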
Description of problem:
Upgrade OCP 4.11 --> 4.12 fails with one 'NotReady,SchedulingDisabled' node and MachineConfigDaemonFailed.
Version-Release number of selected component (if applicable):
Upgrade from OCP 4.11.0-0.nightly-2022-09-19-214532 on top of OSP RHOS-16.2-RHEL-8-20220804.n.1 to 4.12.0-0.nightly-2022-09-20-040107. Network Type: OVNKubernetes
How reproducible:
Twice out of two attempts.
Steps to Reproduce:
1. Install OCP 4.11.0-0.nightly-2022-09-19-214532 (IPI) on top of OSP RHOS-16.2-RHEL-8-20220804.n.1. The cluster is up and running with three workers:
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-19-214532   True        False         51m     Cluster version is 4.11.0-0.nightly-2022-09-19-214532
2. Run the oc command to upgrade to 4.12.0-0.nightly-2022-09-20-040107:
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107 --allow-explicit-upgrade --force=true
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requesting update to release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-20-040107
3. The upgrade does not succeed: [0]
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-09-19-214532   True        True          17h     Unable to apply 4.12.0-0.nightly-2022-09-20-040107: wait has exceeded 40 minutes for these operators: network
One node degraded to 'NotReady,SchedulingDisabled' status:
$ oc get nodes
NAME                          STATUS                        ROLES    AGE   VERSION
ostest-9vllk-master-0         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-1         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-master-2         Ready                         master   19h   v1.24.0+07c9eb7
ostest-9vllk-worker-0-4x4pt   NotReady,SchedulingDisabled   worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-h6kcs   Ready                         worker   18h   v1.24.0+3882f8f
ostest-9vllk-worker-0-xhz9b   Ready                         worker   18h   v1.24.0+3882f8f
$ oc get pods -A | grep -v -e Completed -e Running
NAMESPACE                   NAME                                  READY   STATUS     RESTARTS   AGE
openshift-openstack-infra   coredns-ostest-9vllk-worker-0-4x4pt   0/2     Init:0/1   0          18h
$ oc get events
LAST SEEN   TYPE      REASON                                        OBJECT            MESSAGE
7m15s       Warning   OperatorDegraded: MachineConfigDaemonFailed   /machine-config   Unable to apply 4.12.0-0.nightly-2022-09-20-040107: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
7m15s       Warning   MachineConfigDaemonFailed                     /machine-config   Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
baremetal                                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
cloud-controller-manager                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
cloud-credential                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
cluster-autoscaler                         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
config-operator                            4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
console                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
control-plane-machine-set                  4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h
csi-snapshot-controller                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
dns                                        4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd                                       4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
image-registry                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     Progressing: The registry is ready...
ingress                                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
insights                                   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
kube-apiserver                             4.12.0-0.nightly-2022-09-20-040107   True        True          False      18h     NodeInstallerProgressing: 1 nodes are at revision 11; 2 nodes are at revision 13
kube-controller-manager                    4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
kube-scheduler                             4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
kube-storage-version-migrator              4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
machine-api                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
machine-approver                           4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
machine-config                             4.11.0-0.nightly-2022-09-19-214532   False       True          True       16h     Cluster not available for [{operator 4.11.0-0.nightly-2022-09-19-214532}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
monitoring                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
network                                    4.12.0-0.nightly-2022-09-20-040107   True        True          True       19h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-09-20T14:16:13Z...
node-tuning                                4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h
openshift-apiserver                        4.12.0-0.nightly-2022-09-20-040107   True        False         False      18h
openshift-controller-manager               4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h
openshift-samples                          4.12.0-0.nightly-2022-09-20-040107   True        False         False      17h
operator-lifecycle-manager                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
service-ca                                 4.12.0-0.nightly-2022-09-20-040107   True        False         False      19h
storage                                    4.12.0-0.nightly-2022-09-20-040107   True        True          False      19h     ManilaCSIDriverOperatorCRProgressing: ManilaDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
[0] http://pastebin.test.redhat.com/1074531
Actual results:
OCP 4.11 --> 4.12 upgrade fails.
Expected results:
OCP 4.11 --> 4.12 upgrade success.
Additional info:
Attached logs of the NotReady node - [^journalctl_ostest-9vllk-worker-0-4x4pt.log.tar.gz]
Description of problem:
a freshly installed 4.12 cluster should have stable-4.12 channel by default
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-02-154321
How reproducible:
100%
Steps to Reproduce:
install 4.12 cluster
Actual results:
oc get clusterversion/version -ojson | jq .spec.channel
"stable-4.11"
Expected results:
oc get clusterversion/version -ojson | jq .spec.channel
"stable-4.12"
Additional info:
Description of problem:
acquiring node lock for assigning ip address, node: %s, ip: %sci-ln-g470i52-1d09d-slz7m-worker-westus-6wt7k10.0.128.102
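The unsubstituted %s verbs suggest the message and its arguments were concatenated by a non-formatting log call rather than formatted. A hedged Go illustration of the difference, assuming klog (which ovn-kubernetes uses); the call site is hypothetical:

package main

import "k8s.io/klog/v2"

func main() {
    node, ip := "ci-ln-g470i52-1d09d-slz7m-worker-westus-6wt7k", "10.0.128.102"

    // Non-formatting call: string arguments are concatenated after the
    // literal format string, producing the garbled message seen above.
    klog.Info("acquiring node lock for assigning ip address, node: %s, ip: %s", node, ip)

    // Formatting call: the verbs are substituted as intended.
    klog.Infof("acquiring node lock for assigning ip address, node: %s, ip: %s", node, ip)
}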
This is a clone of issue OCPBUGS-5346. The following is the description of the original issue:
—
Description of problem:
The vSphere status health item is misleading.
More info: https://coreos.slack.com/archives/CUPJTHQ5P/p1672829660214369
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. Have OCP 4.12 on vSphere
2. On the Cluster Dashboard (landing page), check the vSphere Status Health (static plugin)
Actual results:
The icon shows progress, but nothing is progressing when the modal dialog is open
Expected results:
No misleading message and icon are rendered.
Additional info:
Since the problem detector is not a reliable source and modifying the HealthItem in the OCP Console is too complex a task for the current state of the release, a non-misleading text is good enough.
Description of problem:
When using the agent-based installer to zero-touch provision the cluster, if the network bandwidth is low and assisted-service-pod.service or assisted-service.service fails to pull the docker image within the timeout, the create-cluster-and-infraenv, apply-host-config, and start-cluster-installation services will be deactivated because a dependency failed. The process is then blocked and requires manually enabling and starting the services.
Version-Release number of selected component (if applicable):
openshift-install 4.11.0 built from commit 863cd1ea823559116e26de327705ed72ccdede8f release image quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4 release architecture amd64
How reproducible:
Install OpenShift with the agent-based installer and a local mirror.
Steps to Reproduce:
1. Stop the local registry or limit the network bandwidth so that assisted-service-pod.service or assisted-service.service fails to start within the 90s timeout.
2. Start the local registry or manually pull the image on node0.
Actual results:
When using the agent-based installer to zero-touch provision the cluster, if the network bandwidth is low and assisted-service-pod.service or assisted-service.service fails to pull the docker image within the timeout, the create-cluster-and-infraenv, apply-host-config, and start-cluster-installation services are deactivated because a dependency failed. The process is blocked and requires manually enabling and starting the services.
Expected results:
Provision start after the assisted-service started.
Additional info:
Given:
- assisted-service-pod.service requires assisted-service-db.service and assisted-service.service
- assisted-service.service has BindsTo=assisted-service-pod.service
- create-cluster-and-infraenv.service has Requires=assisted-service.service and PartOf=assisted-service-pod.service
- apply-host-config.service has Requires=create-cluster-and-infraenv.service
- start-cluster-installation.service has Requires=apply-host-config.service
Requires= "Configures requirement dependencies on other units. If this unit gets activated, the units listed here will be activated as well. If one of the other units gets deactivated or its activation fails, this unit will be deactivated."
When assisted-service-pod.service starts, assisted-service-db.service and assisted-service.service are also started. Once assisted-service-pod.service fails to start, assisted-service.service also fails to start due to "BindsTo=assisted-service-pod.service". The dependency then fails for create-cluster-and-infraenv.service (Requires=assisted-service.service, whose activation failed), so it is deactivated. The dependency then fails for apply-host-config.service (Requires=create-cluster-and-infraenv.service), so it is deactivated. The dependency then fails for start-cluster-installation.service (Requires=apply-host-config.service), so it is deactivated.
When assisted-service-pod.service restarts, assisted-service.service and assisted-service-db.service restart as well, since they are bound to assisted-service-pod.service. However, create-cluster-and-infraenv.service, apply-host-config.service and start-cluster-installation.service were deactivated and must be activated manually. Eventually, assisted-service starts and hangs waiting to create the infraenv. The provisioning is blocked.
This is a clone of issue OCPBUGS-1805. The following is the description of the original issue:
—
The vSphere CSI cloud.conf lists the single datacenter from platform workspace config but in a multi-zone setup (https://github.com/openshift/enhancements/pull/918 ) there may be more than the one datacenter.
This issue is resulting in PVs failing to attach because the virtual machines can't be found in any other datacenter. For example:
0s Warning FailedAttachVolume pod/image-registry-85b5d5db54-m78vp AttachVolume.Attach failed for volume "pvc-ab1a0611-cb3b-418d-bb3b-1e7bbe2a69ed" : rpc error: code = Internal desc = failed to find VirtualMachine for node:"rbost-zonal-ghxp2-worker-3-xm7gw". Error: virtual machine wasn't found
The machine above lives in datacenter-2 but the CSI cloud.conf is only aware of the datacenter IBMCloud.
$ oc get cm vsphere-csi-config -o yaml -n openshift-cluster-csi-drivers | grep datacenters
datacenters = "IBMCloud"
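A minimal Go sketch of what building that cloud.conf key from a multi-zone topology could look like, aggregating every datacenter from the failure domains instead of only the workspace one; this is illustrative, not the actual vsphere-csi-driver-operator code:

package main

import (
    "fmt"
    "strings"
)

// datacentersValue deduplicates the datacenters of all failure domains and
// joins them for the `datacenters = "..."` key in vsphere-csi-config.
func datacentersValue(failureDomainDCs []string) string {
    seen := map[string]bool{}
    var out []string
    for _, dc := range failureDomainDCs {
        if !seen[dc] {
            seen[dc] = true
            out = append(out, dc)
        }
    }
    return strings.Join(out, ",")
}

func main() {
    fmt.Println(datacentersValue([]string{"IBMCloud", "datacenter-2"}))
    // IBMCloud,datacenter-2
}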
This bug is a backport clone of [Bugzilla Bug 2100181](https://bugzilla.redhat.com/show_bug.cgi?id=2100181). The following is the description of the original bug:
—
Created attachment 1891950
log
Description of problem:
Prior to OCP 4.7.48, the configure-ovs script picked the correct bonded interface for br-ex. In OCP 4.7.48 it consistently fails: it picks one of the slave interfaces (ens3f0) instead.
Version-Release number of selected component (if applicable):
OCP Release > OCP 4.7.37
How reproducible:
100%
Steps to Reproduce:
1. Deploy an OCP cluster with bonding
Actual results:
Expected results:
configure-ovs should not fail and assign the correct interface to br-ex (bond1)
Additional info:
There appears to be a new default NM profile from 4.7.37 to 4.7.38 that was not there before
Description of problem:
Invalid documentation link in knative-plugin README https://github.com/openshift/console/blob/master/frontend/packages/knative-plugin/README.md
This is a clone of issue OCPBUGS-3612. The following is the description of the original issue:
—
Description of problem:
OCP 4.12 deployments making use of secondary bridge br-ex1 for CNI fail to start ovs-configuration service, with multiple failures.
Version-Release number of selected component (if applicable):
Openshift 4.12.0-rc.0 (2022-11-10)
How reproducible:
Until now, at least one node out of four workers always fails; it is not always the same node, and sometimes several nodes fail.
Steps to Reproduce:
1. Prepare to configure IPI on the provisioning node - RHEL 8 (haproxy, named, mirror registry, rhcos_cache_server, ...)
2. Configure the install-config.yaml (attached):
- provisioningNetwork: enabled
- machine network: single stack ipv4
- disconnected installation
- ovn-kubernetes with hybrid-networking setup
- LACP bonding setup using MC manifests at day1
  * bond0 -> baremetal 192.168.32.0/24 (br-ex)
  * bond0.662 -> interface for secondary bridge (br-ex1) 192.168.66.128/26
- secondary bridge defined in /etc/ovnk/extra_bridge using an MC manifest
3. Deploy the cluster
- Usually the deployment is completed
- Nodes show Ready status, but on some nodes ovs-configuration fails
- Subsequent MC changes fail because the MCP cannot roll out configurations on nodes with the failure
NOTE: This impacts testing of our partners Verizon and F5, because we are validating their CNFs before the OCP 4.12 release and we need a secondary bridge for CNI.
Actual results:
br-ex1 and all its related ovs-ports and interfaces fail to activate, ovs-configuration service fails.
Expected results:
br-ex1 and all its related ovs-ports and interfaces succeed to activate, ovs-configuration service starts successfully.
Additional info:
1. Nodes and MCP info
$ oc get nodes
NAME       STATUS   ROLES                  AGE     VERSION
master-0   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
master-1   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
master-2   Ready    control-plane,master   8h      v1.25.2+f33d98e
worker-0   Ready    worker                 7h26m   v1.25.2+f33d98e
worker-1   Ready    worker                 7h25m   v1.25.2+f33d98e
worker-2   Ready    worker                 7h25m   v1.25.2+f33d98e
worker-3   Ready    worker                 7h25m   v1.25.2+f33d98e
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-210a69a0b40162b2f349ea3a5b5819e5   True      False      False      3              3                   3                     0                      7h57m
worker   rendered-worker-e8a62c86ce16e98e45e3166847484cf0   False     True       True       4              2                   2                     1                      7h57m
2. When logging in to the nodes via SSH, we see that ovs-configuration fails, and the ovs-configuration service logs show the following error: (full log attached as worker-0-ovs-configuration.log)
```
$ ssh core@worker-0
---
Last login: Sat Nov 12 21:33:58 2022 from 192.168.62.10
[systemd]
Failed Units: 3
  NetworkManager-wait-online.service
  ovs-configuration.service
  stalld.service
[core@worker-0 ~]$ sudo journalctl -u ovs-configuration | less
...
Nov 12 15:27:54 worker-0 configure-ovs.sh[8237]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == vlan ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 178: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: ++ nmcli --get-values connection.type conn show
Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == bond ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 191: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: ++ nmcli --get-values connection.type conn show
Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == team ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 203: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + iface_type=802-3-ethernet
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' '!' '' = 0 ']'
```
3. We observe that the failed node (worker-0) has the ovs-if-phys1 connection as an ethernet type, but a working node (worker-1) shows a vlan type for the same connection, with the vlan info present:
```
[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection
[connection]
id=ovs-if-phys1
uuid=aea14dc9-2d0c-4320-9c13-ddf3e64747bf
type=ethernet
autoconnect=false
autoconnect-priority=100
autoconnect-slaves=1
interface-name=bond0.662
master=e61c56f7-f3ba-40f7-a1c1-37921fc6c815
slave-type=ovs-port

[ethernet]
cloned-mac-address=B8:83:03:91:C5:2C
mtu=1500

[ovs-interface]
type=system

[core@worker-1 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection
[connection]
id=ovs-if-phys1
uuid=9a019885-3cc1-4961-9dfa-6b7f996556c4
type=vlan
autoconnect-priority=100
autoconnect-slaves=1
interface-name=bond0.662
master=877acf53-87d7-4cdf-a078-000af4f962c3
slave-type=ovs-port
timestamp=1668265640

[ethernet]
cloned-mac-address=B8:83:03:91:C5:E8
mtu=9000

[ovs-interface]
type=system

[vlan]
flags=1
id=662
parent=bond0
```
4. Another problem we observe is that we specifically disable IPv6 in the bond0.662 connection, but the generated connection for br-ex1 has ipv6 method=auto when it should be disabled:
```
[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/bond0.662.nmconnection
[connection]
id=bond0.662
type=vlan
interface-name=bond0.662
autoconnect=true
autoconnect-priority=99

[vlan]
parent=bond0
id=662

[ethernet]
mtu=9000

[ipv4]
method=auto
dhcp-timeout=2147483647
never-default=true

[ipv6]
method=disabled
never-default=true

[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/br-ex1.nmconnection
[connection]
id=br-ex1
uuid=df67dcd9-4263-4707-9abc-eda16e75ea0d
type=ovs-bridge
autoconnect=false
autoconnect-slaves=1
interface-name=br-ex1

[ethernet]
mtu=1500

[ovs-bridge]

[ipv4]
method=auto

[ipv6]
addr-gen-mode=stable-privacy
method=auto

[proxy]
```
5. All journals, must-gather, some deployment files can be found in our CI console (Login with RedHat SSO) https://www.distributed-ci.io/jobs/46459571-900f-43df-8798-d36b322d26f4/files
We also attached some of the logs to facilitate the task: worker-0 files are from the node with the ovs issues, and worker-1 files are from a healthy worker in case you want to compare.
11_master-bonding.yaml 11_worker-bonding.yaml install-config.yaml journal-worker-0.log journal-worker-1.log must_gather.tar.gz sosreport-worker-0-2022-11-12-csbyqfe.tar.xz sosreport-worker-1-2022-11-12-ubltjdn.tar.xz worker-0-ip-nmcli-info.log worker-0-ovs-configuration.log worker-1-ip-nmcli-info.log worker-1-ovs-configuration.log
Please let us know if you need any additional information.
This is a clone of issue OCPBUGS-3668. The following is the description of the original issue:
—
Description of problem:
Installer fails to install 4.12.0-rc.0 on VMware IPI with the script that worked with prior OCP versions. Error happens during Terraform prepare step when gathering information in the "Platform Provisioning Check". It looks like a permission issue, but we're using the VCenter administrator account. I double checked and that account has all the necessary permissions.
Version-Release number of selected component (if applicable):
OCP installer 4.12.0-rc.0 VSphere & Vcenter 7.0.3 - no pending updates
How reproducible:
Always - we observed this already in the nightlies, but wanted to wait for an RC to confirm.
Steps to Reproduce:
1. Try to install using the openshift-install binary
Actual results:
Fails during the preparation step
Expected results:
Installs the cluster ;)
Additional info:
This runs in our CICD pipeline; let me know if you need access to the full run log: https://gitlab.consulting.redhat.com/cblum/storage-ocs-lab/-/jobs/219304 This includes the install-config.yaml, all component versions, and the full debug log output.
I haven't gone back to pin down all affected versions, but I wouldn't be surprised if we've had this exposure for a while. On a 4.12.0-ec.2 cluster, we have:
cluster:usage:resources:sum{resource="podnetworkconnectivitychecks.controlplane.operator.openshift.io"}
currently clocking in around 67983. I've gathered a dump with:
$ oc --as system:admin -n openshift-network-diagnostics get podnetworkconnectivitychecks.controlplane.operator.openshift.io | gzip >checks.gz
And many, many of these reference nodes which no longer exist (the cluster is aggressively autoscaled, with nodes coming and going all the time). We should fix garbage collection on this resource, to avoid consuming excessive amounts of memory in the Kube API server and etcd as they attempt to list the large resource set.
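A garbage-collection pass along these lines could keep the resource set bounded. This is a rough sketch only: it assumes the owning node can be recovered from the check's name, which is an illustrative shortcut rather than the operator's actual data model.

```go
package cleanup

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// pruneStaleChecks deletes PodNetworkConnectivityChecks whose name references
// a node that no longer exists in the cluster.
func pruneStaleChecks(ctx context.Context, cfg *rest.Config) error {
	gvr := schema.GroupVersionResource{
		Group:    "controlplane.operator.openshift.io",
		Version:  "v1alpha1",
		Resource: "podnetworkconnectivitychecks",
	}
	kube, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	nodes, err := kube.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	live := map[string]bool{}
	for _, n := range nodes.Items {
		live[n.Name] = true
	}
	checks, err := dyn.Resource(gvr).Namespace("openshift-network-diagnostics").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, c := range checks.Items {
		stale := true
		for name := range live {
			// illustrative shortcut: check names embed their target node
			if strings.Contains(c.GetName(), name) {
				stale = false
				break
			}
		}
		if stale {
			_ = dyn.Resource(gvr).Namespace(c.GetNamespace()).Delete(ctx, c.GetName(), metav1.DeleteOptions{})
		}
	}
	return nil
}
```

The real fix presumably belongs in the controller that creates these checks, so it can delete them as soon as their node is removed rather than sweeping periodically.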
Description of problem:
NPE on topology if a user creates a k8s Service and a Knative Service (KSVC) which has no metadata in its template.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a KSVC from admin -> Serving -> Create Service
2. Create a k8s svc from search service (create)
Actual results:
topology breaks (see attached screenshot)
Expected results:
topology shouldn't break
Additional info:
Description of problem:
The samples operator needs to update its imagestreams to use the Jenkins 4.12 release.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This bug is a backport clone of [Bugzilla Bug 2092811](https://bugzilla.redhat.com/show_bug.cgi?id=2092811). The following is the description of the original bug:
—
+++ This bug was initially created as a clone of Bug #1926943 +++
The customer is facing this issue:
I0530 05:19:11.481797 1 vsphere_check.go:220] CheckDefaultDatastore failed: defaultDatastore "FI-HML-DC2-CONT-1" in vSphere configuration: datastore FI-HML-DC2-CONT-1: datastore name is too long: escaped volume path "var-lib-kubelet-plugins-kubernetes.io-vsphere\\x2dvolume-mounts\\x5bFI\\x2dHML\\x2dDC2\\x2dCONT\\x2d1\\x5d\\x2000000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000-fi\\x2dhmy\\x2dsas\\x2dprod\\x2dnp868\\x2d\\x2dpvc\\x2d00000000\\x2d0000\\x2d0000\\x2d0000\\x2d000000000000.vmdk" must be under 255 characters, got 255
Looks like the bug has resurfaced.
This is a clone of issue OCPBUGS-4986. The following is the description of the original issue:
—
We should avoid errors like:
```
$ oc get -o json clusterversion version | jq -r '.status.history[0].acceptedRisks'
Forced through blocking failures: Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=True (), so the update from 4.13.0-0.okd-2022-12-11-064650 to 4.13.0-0.okd-2022-12-13-052859 is probably neither recommended nor supported.
```
Instead, tweak the logic from OCPBUGS-2727 and only append the "Forced through blocking failures: " prefix when the forcing was actually required.
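A minimal sketch of that tweak (names are hypothetical; the real logic lives in the CVO's precondition handling):

```go
package cvo

import "strings"

// preconditionFailure records one failed precondition and whether it would
// actually block the update without --force.
type preconditionFailure struct {
	message  string
	blocking bool
}

// acceptedRisksMessage only adds the prefix when forcing was genuinely
// required, i.e. at least one failure was blocking and --force was used.
func acceptedRisksMessage(failures []preconditionFailure, forced bool) string {
	msgs := make([]string, 0, len(failures))
	requiredForce := false
	for _, f := range failures {
		msgs = append(msgs, f.message)
		if f.blocking {
			requiredForce = true
		}
	}
	joined := strings.Join(msgs, "; ")
	if forced && requiredForce {
		return "Forced through blocking failures: " + joined
	}
	return joined
}
```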
Currently, the controller will not upload summary logs if there is a kube-api issue, but it should, since it has a file to read them from.
Description of problem:
When installing a 4.12 IPv6 single-stack disconnected cluster, an etcd member is in abnormal status:
E1026 03:35:58.409977 1 etcdmemberscontroller.go:73] Unhealthy etcd member found: openshift-qe-057.arm.eng.rdu2.redhat.com, took=, err=create client failure: failed to make etcd client for endpoints https://[26xx:52:0:1eb:3xx3:5xx:fxxe:7550]:2379: context deadline exceeded
How reproducible:
Not always
Steps to Reproduce:
As description
Actual results:
As title
Expected results:
etcd co status is normal
Description of problem:
When alert raised for vSphere privilege check which is reported by vsphere-problem-detector, we could only get the very simple info as below:
=======================================
Description
The vsphere-problem-detector monitors the health and configuration of OpenShift on VSphere. If problems are found which may prevent machine scaling, storage provisioning, and safe upgrades, the vsphere-problem-detector will raise alerts.
Summary
VSphere cluster health checks are failing
Message
VSphere cluster health checks are failing with CheckAccountPermissions
=======================================
(We could get the namespace/pod info from the metric, but adding it to the alert Description or Message would be clearer.)
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-12-152748
How reproducible:
Always
Steps to Reproduce:
See description
Actual results:
Alert info is not so clear
Expected results:
Add more Alert info
Description of problem:
We discovered an issue before code freeze that caused many CI issues. This is resolved with this PR: https://github.com/openshift/cluster-network-operator/pull/1579
Version-Release number of selected component (if applicable):
4.12
How reproducible:
NA
Steps to Reproduce:
1.NA 2. 3.
Actual results:
Severity is set too low for various OVN-K alerts
Expected results:
Alerts work as expected at the correct severity level and CI runs are clear including for hypershift clusters.
Additional info:
This is resolved with this PR: https://github.com/openshift/cluster-network-operator/pull/1579 Here is my testing with `e2e-all` and `e2e-serial` and there are no issues after 10 runs each: https://docs.google.com/spreadsheets/d/1FZON8-d3m7D_2-z3XetODA-ucbXKJzCioC-zRMArHlY/edit?usp=sharing
Description of problem:
csi-snapshot-controller-operator logs:
The operator misses some RBAC:
2022-09-07T12:13:09.457345845Z F0907 12:13:09.457292 1 cmd.go:138] unable to load configmap based request-header-client-ca-file: configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
Surprisingly, standalone OCP jobs work well.
Console should be using the v1 version of the ConsolePlugin model rather than the old v1alpha1.
CONSOLE-3077 was updating this version, but did not make the cut for the 4.12 release. Based on discussion with Samuel Padgett we should be backporting to 4.12.
The risk should be minimal since we are only updating the model itself + validation + README.
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/187
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-4168. The following is the description of the original issue:
—
Description of problem:
Prometheus continuously restarts due to slow WAL replay
Version-Release number of selected component (if applicable):
openshift - 4.11.13
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3414. The following is the description of the original issue:
—
Description of problem:
The current implementation of the new OCI FBC feature omits the creation of the ImageContentSourcePolicy and CatalogSource resources.
Description of problem:
The cluster cannot be installed when updating the join network CIDR using v6InternalSubnet fdxx::/64 in manifests/cluster-network-03-config.yml.
Version-Release number of selected component (if applicable):
v4.12
How reproducible:
Always
Steps to Reproduce:
Use v6InternalSubnet: fd66::/48 in manifests/cluster-network-03-config.yml to install a dual stack cluster:

```
cp manifests/cluster-network-02-config.yml manifests/cluster-network-03-config.yml
sed -i 's/config.openshift.io\/v1/operator.openshift.io\/v1/g' manifests/cluster-network-03-config.yml
cat > ovn_kube_config <<HEREDOC
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      v6InternalSubnet: fd66::/48
HEREDOC
sed -i $'/^status/{e cat ovn_kube_config\n}' manifests/cluster-network-03-config.yml
```
Actual results:
Installation fails
Expected results:
Installation passes
Additional info:
We are seeing Windows-to-Linux networking failures across all PRs.
This is occurring across all clouds.
Example test failure
It seems this could have been due to the downstream merge; the Windows jobs did not pass before the PR was merged.
Job that failed against the downstream merge, but did not prevent it from merging
This is blocking all PRs against the WMCO repo.
There is a capacity limit on egressIPs for each cloud provider; for example, on GCP the limit is 10.
If the number of egressIPs added to a hostsubnet exceeds the capacity limit, a message is expected to be emitted to the event log, which can be seen through "oc get event".
On GCP with the SDN plugin, we configured egressCIDRs on one worker node and configured 12 netnamespaces, each with 1 egressIP, so the total number of egressIPs for the hostsubnet exceeded its capacity limit of 10. No event log was seen to indicate that the number of egressIPs for the hostsubnet had exceeded the limit.
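The fix would presumably emit a Warning event when the assignment exceeds the provider limit. A minimal sketch, assuming access to an event recorder and the HostSubnet object (names are illustrative, not the SDN plugin's actual code):

```go
package egress

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// warnEgressIPOverCapacity emits a Warning event when the egress IPs assigned
// to a host subnet exceed the cloud provider's per-node capacity (e.g. 10 on GCP).
func warnEgressIPOverCapacity(recorder record.EventRecorder, hostSubnet runtime.Object, assigned, capacity int) {
	if assigned > capacity {
		recorder.Eventf(hostSubnet, corev1.EventTypeWarning, "EgressIPCapacityExceeded",
			"%d egress IPs assigned but node capacity is %d; excess IPs will not be usable",
			assigned, capacity)
	}
}
```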
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-08-02-014045 True False 160m Cluster version is 4.11.0-0.nightly-2022-08-02-014045
See attachment for more details.
This is a clone of issue OCPBUGS-723. The following is the description of the original issue:
—
Description of problem:
I have a customer who created clusterquota for one of the namespace, it got created but the values were not reflecting under limits or not displaying namespace details.
~~~
$ oc describe AppliedClusterResourceQuota
Name: test-clusterquota
Created: 19 minutes ago
Labels: size=custom
Annotations: <none>
Namespace Selector: []
Label Selector:
AnnotationSelector: map[openshift.io/requester:system:serviceaccount:application-service-accounts:test-sa]
Scopes: NotTerminating
Resource Used Hard
-------- ---- ----
~~~
WORKAROUND: They recreated the clusterquota object (cache it off, delete it, create new) after which it displayed values as expected.
In the past, they saw similar behavior on their test cluster; there it was heavily utilized, the etcd DB was much larger (>2.5Gi), and it had many more objects (at that time, Helm secrets were being cached for all deployments with a history of 10, so etcd was being bombarded).
On this cluster the same symptom was noticed; however, etcd was nowhere near that size, nor did it have that many etcd objects or cached Helm secrets.
Version-Release number of selected component (if applicable): OCP 4.9
How reproducible: Occurred only twice(once in test and in current cluster)
Steps to Reproduce:
1. Create ClusterQuota
2. Check AppliedClusterResourceQuota
3. The values and namespace is empty
Actual results: ClusterQuota does not display the values
Expected results: ClusterQuota should display the values
The test results in sippy look really bad on our less common platforms, but still pretty unacceptable even on core clouds. It's reasonably often the only test that fails. We need to decide what to do here, and we're going to need input from the etcd team.
As of Sep 13th:
Even on some major variant combos, a 4-8% failure rate is too high.
On Sep 13 arch call (no etcd present), Damien mentioned this might be an upstream alert that just isn't well suited for OpenShift's use cases, is this the case and it needs tuning?
Has the problem been getting worse?
I believe this link https://datastudio.google.com/s/urkKwmmzvgo indicates that this may be the case for 4.12: AWS and Azure are both getting worse in ways that I don't see if we change the release to 4.11, where it looks consistent. GCP seems fine on 4.12. We do not have data for vSphere for some reason.
This link shows the grpc_methods most commonly involved: https://search.ci.openshift.org/?search=etcdGRPCRequestsSlow+was+at+or+above&maxAge=48h&context=7&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
At a glance: LeaseGrant, MemberList, Txn, Status, Range.
Broken out of TRT-401
For linking with sippy:
[bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
[sig-arch][bz-etcd][Late] Alerts alert/etcdGRPCRequestsSlow should not be at or above info [Suite:openshift/conformance/parallel]
Description of problem:
When providing install-config as
```
platform:
  baremetal:
    apiVIP: 192.168.122.10
    ingressVIP: 192.168.122.11
```
agent installer fails with
bin/openshift-install agent create cluster-manifests
FATAL failed to fetch Agent Manifests: failed to load asset "Install Config": invalid install-config configuration: [Platform.Baremetal.ApiVips: Required value: apiVips must be set for baremetal platform, Platform.Baremetal.IngressVips: Required value: ingressVips must be set for baremetal platform]
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Git clone the latest installer https://github.com/openshift/installer and build it
2. Provide install-config.yaml for the baremetal platform with the deprecated apiVip and ingressVip set
3. Create the agent image with "bin/openshift-install agent create cluster-manifests"
Actual results:
```
bin/openshift-install agent create cluster-manifests
FATAL failed to fetch Agent Manifests: failed to load asset "Install Config": invalid install-config configuration: [Platform.Baremetal.ApiVips: Required value: apiVips must be set for baremetal platform, Platform.Baremetal.IngressVips: Required value: ingressVips must be set for baremetal platform]
```
Expected results:
The agent installer should upconvert the deprecated fields and should not error: apiVip and ingressVip should be upconverted into apiVips and ingressVips respectively.
Additional info:
This is a clone of issue OCPBUGS-4252. The following is the description of the original issue:
—
Description of problem: When visiting the Terminal tab of a Node details page, an error is displayed instead of the terminal
Steps to Reproduce:
1. Go to the Terminal tab of a Node details page (e.g., /k8s/cluster/nodes/ip-10-0-129-13.ec2.internal/terminal)
2. Note the error alert that appears on the page instead of the terminal.
This is a clone of issue OCPBUGS-1761. The following is the description of the original issue:
—
Description of problem:
When we configure a MC using an osImage that cannot be pulled, the machine config daemon pod spams logs saying that the node is set to "Degraded" state, but the node is not actually set to "Degraded". Only after a long time, like 20 or 30 minutes, does the node eventually become degraded.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-26-111919
How reproducible:
Always
Steps to Reproduce:
1. Create a MC using an osImage that cannot be pulled:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  creationTimestamp: "2022-09-27T12:48:13Z"
  generation: 1
  labels:
    machineconfiguration.openshift.io/role: worker
  name: not-pullable-image-tc54054-w75j1k67
  resourceVersion: "374500"
  uid: 7f828fbc-8da3-4f16-89e2-34e39ff830b3
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files: []
    systemd:
      units: []
  osImageURL: quay.io/openshifttest/tc54054fakeimage:latest
```

2. Check the logs in the machine config daemon pod; you can see this message being spammed, saying that the daemon is marking the node with "Degraded" status:

```
E0927 14:31:22.858546 1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found
E0927 14:34:10.698564 1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found
E0927 14:36:58.557340 1697 writer.go:200] Marking Degraded due to: Error checking type of update image: failed to run command podman (6 tries): [timed out waiting for the condition, running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshifttest/tc54054fakeimage:latest failed: Error: initializing source docker://quay.io/openshifttest/tc54054fakeimage:latest: reading manifest latest in quay.io/openshifttest/tc54054fakeimage: name unknown: repository not found
```
Actual results:
The node is not marked as degraded when it should be. Only after a long time, 20 minutes or so, does the node become degraded.
Expected results:
When the podman pull command fails and the machine config daemon sets the node state as "Degraded", the node should actually be marked as "Degraded".
Additional info:
Description of problem:
The cluster-version-operator pod crashloops during the bootstrap process, which might be leading to a longer bootstrap process, causing the installer to time out and fail. The cluster-version-operator pod is continuously restarting due to a Go panic. The bootstrap process fails due to the timeout, although it completes correctly after more time, once the cluster-version-operator pod runs correctly.

```
$ oc -n openshift-cluster-version logs -p cluster-version-operator-754498df8b-5gll8
I0919 10:25:05.790124       1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4
F0919 10:25:05.791580       1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc00017d5e0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc000089140, 0x1, ...})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
main.init.3.func1(0xc00012ac80?, {0x1b96f60?, 0x6?, 0x6?})
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
github.com/spf13/cobra.(*Command).execute(0xc00012ac80, {0xc0002fea20, 0x6, 0x6})
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
```
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-18-234318
How reproducible:
Most of the time, with any network type and installation type (IPI, UPI and proxy).
Steps to Reproduce:
1. Install OCP 4.12 IPI: $ openshift-install create cluster
2. Wait until bootstrap is completed
Actual results:
```
[...]
level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
```
```
NAMESPACE                              NAME                                                     READY   STATUS             RESTARTS        AGE
openshift-cluster-version              cluster-version-operator-754498df8b-5gll8                0/1     CrashLoopBackOff   7 (3m21s ago)   24m
openshift-image-registry               image-registry-94fd8b75c-djbxb                           0/1     Pending            0               6m44s
openshift-image-registry               image-registry-94fd8b75c-ft66c                           0/1     Pending            0               6m44s
openshift-ingress                      router-default-64fbb749b4-cmqgw                          0/1     Pending            0               13m
openshift-ingress                      router-default-64fbb749b4-mhtqx                          0/1     Pending            0               13m
openshift-monitoring                   prometheus-operator-admission-webhook-6d8cb95cf7-6jn5q   0/1     Pending            0               14m
openshift-monitoring                   prometheus-operator-admission-webhook-6d8cb95cf7-r6nnk   0/1     Pending            0               14m
openshift-network-diagnostics          network-check-source-8758bd6fc-vzf5k                     0/1     Pending            0               18m
openshift-operator-lifecycle-manager   collect-profiles-27726375-hlq89                          0/1     Pending            0               21m
```
```
$ oc -n openshift-cluster-version describe pod cluster-version-operator-754498df8b-5gll8
Name:                 cluster-version-operator-754498df8b-5gll8
Namespace:            openshift-cluster-version
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 ostest-4gtwr-master-1/10.196.0.68
Start Time:           Mon, 19 Sep 2022 10:17:41 +0000
Labels:               k8s-app=cluster-version-operator
                      pod-template-hash=754498df8b
Annotations:          openshift.io/scc: hostaccess
Status:               Running
IP:                   10.196.0.68
IPs:
  IP:           10.196.0.68
Controlled By:  ReplicaSet/cluster-version-operator-754498df8b
Containers:
  cluster-version-operator:
    Container ID:  cri-o://1e2879600c89baabaca68c1d4d0a563d4b664c507f0617988cbf9ea7437f0b27
    Image:         registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
    Image ID:      registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
    Port:          <none>
    Host Port:     <none>
    Args:
      start
      --release-image=registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
      --enable-auto-update=false
      --listen=0.0.0.0:9099
      --serving-cert-file=/etc/tls/serving-cert/tls.crt
      --serving-key-file=/etc/tls/serving-cert/tls.key
      --v=2
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   I0919 10:33:07.798614       1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4
F0919 10:33:07.800115       1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc000433ea0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc0002d6630, 0x1, ...})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
main.init.3.func1(0xc0003b4f00?, {0x1b96f60?, 0x6?, 0x6?})
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
github.com/spf13/cobra.(*Command).execute(0xc0003b4f00, {0xc000311980, 0x6, 0x6})
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
      Exit Code:    255
      Started:      Mon, 19 Sep 2022 10:33:07 +0000
      Finished:     Mon, 19 Sep 2022 10:33:07 +0000
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:     20m
      memory:  50Mi
    Environment:
      KUBERNETES_SERVICE_PORT:  6443
      KUBERNETES_SERVICE_HOST:  127.0.0.1
      NODE_NAME:                (v1:spec.nodeName)
      CLUSTER_PROFILE:          self-managed-high-availability
    Mounts:
      /etc/cvo/updatepayloads from etc-cvo-updatepayloads (ro)
      /etc/ssl/certs from etc-ssl-certs (ro)
      /etc/tls/service-ca from service-ca (ro)
      /etc/tls/serving-cert from serving-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  etc-ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs
    HostPathType:
  etc-cvo-updatepayloads:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cvo/updatepayloads
    HostPathType:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-version-operator-serving-cert
    Optional:    false
  service-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      openshift-service-ca.crt
    Optional:  false
  kube-api-access:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3600
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  25m                   default-scheduler  no nodes available to schedule pods
  Warning  FailedScheduling  21m                   default-scheduler  0/2 nodes are available: 2 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
  Normal   Scheduled         19m                   default-scheduler  Successfully assigned openshift-cluster-version/cluster-version-operator-754498df8b-5gll8 to ostest-4gtwr-master-1 by ostest-4gtwr-bootstrap
  Warning  FailedMount       17m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[service-ca kube-api-access etc-ssl-certs etc-cvo-updatepayloads serving-cert]: timed out waiting for the condition
  Warning  FailedMount       17m (x9 over 19m)     kubelet            MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found
  Normal   Pulling           15m                   kubelet            Pulling image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69"
  Normal   Pulled            15m                   kubelet            Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" in 7.481824271s
  Normal   Started           14m (x3 over 15m)     kubelet            Started container cluster-version-operator
  Normal   Created           14m (x4 over 15m)     kubelet            Created container cluster-version-operator
  Normal   Pulled            14m (x3 over 15m)     kubelet            Container image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" already present on machine
  Warning  BackOff           4m22s (x52 over 15m)  kubelet            Back-off restarting failed container
```
Expected results:
No panic?
Additional info:
Seen in most of OCP on OSP QE CI jobs.
Attached [^must-gather-install.tar.gz]
Description of problem:
oc-mirror shouldn't clean out the operator versions that are no longer referenced in the channel.

The customer has the following ImageSetConfiguration. They are running oc-mirror GitVersion: 4.11.0-2022082035.p0.g3c1c80c.assembly.stream-3c1c80c.

```
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
    targetName: bit-redhat-operator-catalog-platform-essentials-index
    packages:
    - name: elasticsearch-operator
      channels:
      - name: stable
        minVersion: 5.4.2
    - name: cluster-logging
      channels:
      - name: stable
        minVersion: 5.4.2
```

This works at first and makes 5.4.2 available in their internal catalog. However, after some time the version 5.4.2 disappears out of their catalog and we get the following error while syncing:

```
ERRO[0108] Operator elasticsearch-operator was not found, please check name, minVersion, maxVersion, and channels in the config file.
ERRO[0108] Operator cluster-logging was not found, please check name, minVersion, maxVersion, and channels in the config file.
```

The issue is that the originally configured version 5.4.2 is no longer in the catalog, which we can verify by querying it:

```
$ oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.10 --package elasticsearch-operator --channel stable
WARN[0022] DEPRECATION NOTICE: Sqlite-based catalogs and their related subcommands are deprecated. Support for them will be removed in a future release. Please migrate your catalog workflows to the new file-based catalog format.
VERSIONS
5.5.0

$ oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.10 --package elasticsearch-operator --channel stable-5.4
WARN[0019] DEPRECATION NOTICE: Sqlite-based catalogs and their related subcommands are deprecated. Support for them will be removed in a future release. Please migrate your catalog workflows to the new file-based catalog format.
VERSIONS
5.4.4
```

So: a) the version 5.4.2 completely disappeared, and b) the stable channel now starts with 5.5.0.

oc-mirror would clean out the versions that are no longer referenced, so we would assume that 5.4.2 would be cleaned from the mirror. The customer definitely does not want that to happen, since they still have that version on clusters in their environment. It is quite tedious to keep editing the image-set.yaml for all the versions that disappear out of the catalog.
Version-Release number of selected component (if applicable):
oc-mirror GitVersion: 4.11.0-2022082035.p0.g3c1c80c.assembly.stream-3c1c80c
How reproducible:
100%
Steps to Reproduce:
1. Create an ImageSetConfiguration to mirror a particular operator
2. Mirror the operator to the mirror registry using oc-mirror
3. The specified version of the operator disappears from the catalog after a few days when there are changes in the channel, and you start getting the mentioned error on sync.
Actual results:
Operator disappears from the catalog
Expected results:
The mentioned version of the operator to be available in mirror registry even after it's not referenced in the channel
Additional info:
Description of problem:
When providing the openshift-install agent create command with installconfig + agentconfig manifests that contain the InstallConfig Proxy section, the Proxy configuration does not get configured cluster-wide.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Define InstallConfig with a Proxy section
2. openshift-install agent create image
3. Boot the ISO
4. Check /etc/assisted/manifests for agent-cluster-install.yaml to contain the Proxy section
Actual results:
Missing proxy
Expected results:
Proxy should be present and match with the InstallConfig
Additional info:
This is a clone of issue OCPBUGS-4089. The following is the description of the original issue:
—
The kube-state-metrics pod inside the openshift-monitoring namespace is not running as expected.
On checking the logs I am able to see that there is a memory panic
~~~
2022-11-22T09:57:17.901790234Z I1122 09:57:17.901768 1 main.go:199] Starting kube-state-metrics self metrics server: 127.0.0.1:8082
2022-11-22T09:57:17.901975837Z I1122 09:57:17.901951 1 main.go:66] level=info msg="TLS is disabled." http2=false
2022-11-22T09:57:17.902389844Z I1122 09:57:17.902291 1 main.go:210] Starting metrics server: 127.0.0.1:8081
2022-11-22T09:57:17.903191857Z I1122 09:57:17.903133 1 main.go:66] level=info msg="TLS is disabled." http2=false
2022-11-22T09:57:17.906272505Z I1122 09:57:17.906224 1 builder.go:191] Active resources: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
2022-11-22T09:57:17.917758187Z E1122 09:57:17.917560 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2022-11-22T09:57:17.917758187Z goroutine 24 [running]:
2022-11-22T09:57:17.917758187Z k8s.io/apimachinery/pkg/util/runtime.logPanic(
)
2022-11-22T09:57:17.917758187Z /usr/lib/golang/src/runtime/panic.go:1038 +0x215
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.ingressMetricFamilies.func6(0x40)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/internal/store/ingress.go:136 +0x189
2022-11-22T09:57:17.917758187Z k8s.io/kube-state-metrics/v2/internal/store.wrapIngressFunc.func1(
)
2022-11-22T09:57:17.917758187Z /go/src/k8s.io/kube-state-metrics/pkg/metric_generator/generator.go:107 +0xd8
~~~
Logs are attached to the support case
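The trace points at a nil dereference while generating Ingress metrics. A guard of this shape is the usual fix; this is an illustration of the pattern, not the actual kube-state-metrics code at ingress.go:136:

```go
package metrics

import networkingv1 "k8s.io/api/networking/v1"

// ingressPathCount walks the rules defensively: rules without an HTTP section
// and backends without a Service are legal, so dereferencing them blindly panics.
func ingressPathCount(ing *networkingv1.Ingress) int {
	count := 0
	for _, rule := range ing.Spec.Rules {
		if rule.HTTP == nil { // e.g. a rule carrying only a host
			continue
		}
		for _, p := range rule.HTTP.Paths {
			if p.Backend.Service == nil { // resource backends have no Service
				continue
			}
			count++
		}
	}
	return count
}
```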
Description of problem:
ClusterOperator status gets updated when the conditions are re-ordered. There doesn't seem to be any change to the conditions except the reordering.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
kubectl get clusteroperator monitoring -oyaml --watch
Actual results:
```
status:
  conditions:
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    message: Successfully rolled out the stack.
    reason: RollOutDone
    status: "True"
    type: Available
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    status: "True"
    type: Upgradeable
```
Expected results:
I would have expected no update, since nothing changed.
```
status:
  conditions:
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    message: Successfully rolled out the stack.
    reason: RollOutDone
    status: "True"
    type: Available
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-08-25T23:39:59Z"
    message: 'Prometheus is running without persistent storage which can lead to data
      loss during upgrades and cluster disruptions. Please refer to the official documentation
      to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html'
    reason: PrometheusDataPersistenceNotConfigured
    status: "False"
    type: Degraded
```
Additional info:
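A minimal sketch of an order-insensitive comparison the operator could perform before issuing an update (assuming at most one condition per type; this is not the monitoring operator's actual code):

```go
package status

import (
	"reflect"
	"sort"

	configv1 "github.com/openshift/api/config/v1"
)

// conditionsEqualIgnoringOrder sorts copies of both slices by condition type
// so that a pure reordering is not treated as a status change worth an update.
func conditionsEqualIgnoringOrder(a, b []configv1.ClusterOperatorStatusCondition) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]configv1.ClusterOperatorStatusCondition(nil), a...)
	bs := append([]configv1.ClusterOperatorStatusCondition(nil), b...)
	sort.Slice(as, func(i, j int) bool { return as[i].Type < as[j].Type })
	sort.Slice(bs, func(i, j int) bool { return bs[i].Type < bs[j].Type })
	return reflect.DeepEqual(as, bs)
}
```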
As an OpenShift operator, I would like to be able to add labels with unique values to my MachineSets and nodes, while also using the cluster autoscaler's ability to balance similar node groups. Being able to specify additional labels to ignore through the ClusterAutoscaler CRD would allow me to do that.
Something that has arisen during the investigation of https://bugzilla.redhat.com/show_bug.cgi?id=2001027 is the notion that each CSI driver could create its own zone topology labels, and that they do not have to be consistent with the well known kubernetes label.
It is possible, although not entirely confirmed, that a CSI driver might add these labels even when not in use (although running in the cluster).
Additionally, users may need the option to specify more labels to ignore (as illustrated in the discussion of the bug).
Description of problem:
Event sources are not shown in topology.
Version-Release number of selected component (if applicable):
Have verified it on 4.12.0-0.nightly-2022-09-20-095559
How reproducible:
Steps to Reproduce:
1. Install the Serverless operator
2. Create CRs for knative-serving and knative-eventing respectively
3. Create/select a namespace -> go to the dev console -> Add -> Event Source
4. Create any event source
Actual results:
Can't see the created resource (event source) in topology.
Expected results:
Should be able to see the created resource in topology.
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
Customers have deployed OpenShift using CloudFormation following "Example 4.55. CloudFormation template for the VPC" in the document below.
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.8/html-single/installing/index#installing-restricted-networks-aws
CloudFormation uses Python 3.7 with Lambda.
Since Python 3.7 is reaching EOL, what kind of effect will it have if it becomes unusable?
Is there any immediate effect? Will there be any impact when adding worker nodes?
OCP Version & Channel: 4.10
Cloud Platform: AWS
This is a clone of issue OCPBUGS-3883. The following is the description of the original issue:
—
While doing a PerfScale test, we noticed that the ovnkube pods are not being spread out evenly among the available workers. Instead they all stack on a few nodes until those fill up the available allocatable EBS volumes (25 in the case of the m5 instances we see here).
An example from partway through our 80 hosted cluster test when there were ~30 hosted clusters created/in progress
There are 24 workers available:
```
$ for i in `oc get nodes -l node-role.kubernetes.io/worker=,node-role.kubernetes.io/infra!=,node-role.kubernetes.io/workload!= | egrep -v "NAME" | awk '{ print $1 }'`; do echo $i `oc describe node $i | grep -v openshift | grep ovnkube -c`; done
ip-10-0-129-227.us-west-2.compute.internal 0
ip-10-0-136-22.us-west-2.compute.internal 25
ip-10-0-136-29.us-west-2.compute.internal 0
ip-10-0-147-248.us-west-2.compute.internal 0
ip-10-0-150-147.us-west-2.compute.internal 0
ip-10-0-154-207.us-west-2.compute.internal 0
ip-10-0-156-0.us-west-2.compute.internal 0
ip-10-0-157-1.us-west-2.compute.internal 4
ip-10-0-160-253.us-west-2.compute.internal 0
ip-10-0-161-30.us-west-2.compute.internal 0
ip-10-0-164-98.us-west-2.compute.internal 0
ip-10-0-168-245.us-west-2.compute.internal 0
ip-10-0-170-103.us-west-2.compute.internal 0
ip-10-0-188-169.us-west-2.compute.internal 25
ip-10-0-188-194.us-west-2.compute.internal 0
ip-10-0-191-51.us-west-2.compute.internal 5
ip-10-0-192-10.us-west-2.compute.internal 0
ip-10-0-193-200.us-west-2.compute.internal 0
ip-10-0-193-27.us-west-2.compute.internal 7
ip-10-0-199-1.us-west-2.compute.internal 0
ip-10-0-203-161.us-west-2.compute.internal 0
ip-10-0-204-40.us-west-2.compute.internal 23
ip-10-0-220-164.us-west-2.compute.internal 0
ip-10-0-222-59.us-west-2.compute.internal 0
```
This is running quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64 for the hosted clusters and the hypershift operator is quay.io/hypershift/hypershift-operator:4.11 on a 4.11.9 management cluster
Description of problem:
With every pod update we execute a mutate operation to add the pod port to the port group or add the pod IP to an address set. Functionally this doesn't hurt, since mutate will not add duplicate values to the same set. However, it is bad for performance: for example, with 730 network policies affecting a pod, issuing 7 pod updates would result in over 5k transactions. A sketch of a remedy follows.
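One way to avoid the redundant transactions is to track membership locally and skip the mutate when the value is already present. This is a sketch with hypothetical types, not the actual ovn-kubernetes code:

```go
package ovn

// portGroupCache tracks which ports we have already added to each port group,
// so repeated pod updates do not generate redundant OVSDB mutate operations.
type portGroupCache struct {
	members map[string]map[string]bool // group -> port UUID -> present
}

// needsAdd reports whether a mutate is actually required, recording the
// membership so subsequent identical updates become no-ops locally.
func (c *portGroupCache) needsAdd(group, port string) bool {
	if c.members == nil {
		c.members = map[string]map[string]bool{}
	}
	if c.members[group] == nil {
		c.members[group] = map[string]bool{}
	}
	if c.members[group][port] {
		return false // already a member; skip the OVSDB transaction
	}
	c.members[group][port] = true
	return true
}
```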
This is a clone of issue OCPBUGS-4350. The following is the description of the original issue:
—
Steps to reproduce:
Release: 4.13.0-0.nightly-2022-11-30-183109 (latest 4.12 nightly as well)
Create a HyperShift cluster on AWS, wait til its completed rolling out
Upgrade the HostedCluster by updating its release image to a newer one
Observe the 'network' clusteroperator resource in the guest cluster as well as the 'version' clusterversion resource in the guest cluster.
When the clusteroperator resource reports the upgraded release and the clusterversion resource reports the new release as applied, take a look at the ovnkube-master statefulset in the control plane namespace of the management cluster. It is still not finished rolling out.
Expected: that the network clusteroperator reports the new version only when all components have finished rolling out.
4.12 tech-preview jobs are suffering:
```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=event+happened.*no+matches+for+kind.*InsightsDataGather&maxAge=48h&type=junit' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview-serial (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-techpreview (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-azure-sdn-techpreview-serial (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-sdn-techpreview (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-sdn-techpreview-serial (all) - 10 runs, 100% failed, 100% of failures match = 100% impact
```
with runs like this failing:
```
: [sig-arch] events should not repeat pathologically  0s
{ 1 events happened too frequently
event happened 138 times, something is wrong: ns/default namespace/default - reason/Unable to find REST mapping for %s/%s: %w InsightsDataGather.config.openshift.io%!(EXTRA string=v1, *meta.NoKindMatchError=no matches for kind "InsightsDataGather" in version "config.openshift.io/v1")}
```
based on events like:
```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview/1597393851226525696/artifacts/e2e-aws-sdn-techpreview/gather-extra/artifacts/events.json | jq -r '.items[] | select(.metadata.namespace == "default" and (.message | contains("InsightsDataGather")))'
{
  "apiVersion": "v1",
  "count": 145,
  "eventTime": null,
  "firstTimestamp": "2022-11-29T01:32:16Z",
  "involvedObject": {
    "apiVersion": "v1",
    "kind": "Namespace",
    "name": "default",
    "namespace": "default"
  },
  "kind": "Event",
  "lastTimestamp": "2022-11-29T02:19:36Z",
  "message": "InsightsDataGather.config.openshift.io%!(EXTRA string=v1, *meta.NoKindMatchError=no matches for kind \"InsightsDataGather\" in version \"config.openshift.io/v1\")",
  "metadata": {
    "creationTimestamp": "2022-11-29T01:32:16Z",
    "name": "default.172bea26177786ae",
    "namespace": "default",
    "resourceVersion": "237357",
    "uid": "187cf3a0-cf4b-4cd1-ae72-51b5d77b7e73"
  },
  "reason": "Unable to find REST mapping for %s/%s: %w",
  "reportingComponent": "",
  "reportingInstance": "",
  "source": {
    "component": "run-resourcewatch-config-observer-controller-configobservercontroller"
  },
  "type": "Warning"
}
```
4.12 tech-preview jobs are impacted.
100% for some job flavors, per the search CI output above.
1. Look at test results for any of the impacted job flavors.
Actual results:
Lots of NoKindMatchError events for v1 InsightsDataGather (it's only v1alpha1).
Expected results:
Passing test-cases.
Additional info:
The problematic REST-mapping client was removed from 4.13/dev as part of origin#27596.
This is a clone of issue OCPBUGS-6011. The following is the description of the original issue:
—
Description of problem:
The 4.12.0 openshift-client package has kubectl 1.24.1 bundled in it when it should have 1.25.x
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Very
Steps to Reproduce:
1. Download and unpack https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/stable/openshift-client-linux-4.12.0.tar.gz
2. ./kubectl version
Actual results:
```
# ./kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"1928ac4250660378a7d8c3430478dfe77977cb2a", GitTreeState:"clean", BuildDate:"2022-12-07T05:08:22Z", GoVersion:"go1.18.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
```
Expected results:
kubectl version 1.25.x
Additional info:
This is a clone of issue OCPBUGS-5734. The following is the description of the original issue:
—
Description of problem:
In https://issues.redhat.com/browse/OCPBUGSM-46450, the VIP was added to noProxy for StackCloud but it should also be added for all national clouds.
Version-Release number of selected component (if applicable):
4.10.20
How reproducible:
always
Steps to Reproduce:
1. Set up a proxy
2. Deploy a cluster in a national cloud using the proxy
Actual results:
Installation fails
Expected results:
Additional info:
The inconsistency was discovered when testing the cluster-network-operator changes https://issues.redhat.com/browse/OCPBUGS-5559
Description of problem:
We need to have admin-ack in 4.12 so that admins can check for deprecated APIs and approve before they move to 4.13. Refer to https://access.redhat.com/articles/6958394 for more information. As planned, we want to add the admin-ack around 4.13 feature freeze.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install a cluster in 4.12.
2. Run an application which uses a deprecated API. See https://access.redhat.com/articles/6958394 for more information.
3. Upgrade to 4.13
Actual results:
The upgrade happens without asking the admin to confirm that the workloads do not use the deprecated APIs.
Expected results:
Upgrade should wait for the admin-ack.
Additional info:
This was the PR for 4.11.z https://github.com/openshift/cluster-version-operator/pull/836
We should refactor the assisted-installer ops.go code to use an interface for exec commands, to allow mocking (see the sketch below).
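A rough sketch of what such an interface could look like (not an agreed design; names are illustrative):

```go
package ops

import "os/exec"

// Executor abstracts command execution so ops functions can be unit-tested
// against a mock instead of running commands on the real host.
type Executor interface {
	Execute(command string, args ...string) (stdout string, err error)
}

// hostExecutor is the production implementation backed by os/exec.
type hostExecutor struct{}

func (hostExecutor) Execute(command string, args ...string) (string, error) {
	out, err := exec.Command(command, args...).CombinedOutput()
	return string(out), err
}
```

Functions in ops.go would then accept an Executor instead of calling exec directly, and tests would supply a hand-written fake or a generated mock that records the invocations.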
Description of problem:
```
$ oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         45h     Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded

$ oc --context build02 get co kube-controller-manager
NAME                      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-ec.1   True        False         True       2y87d   GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address
```
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.
Hi,
Bare Metal IPI provisioning is failing to provision the worker nodes. The metal3-machine-os-downloader InitContainer is in CrashLoopBackOff state because it cannot find the virt-* commands in the container image.
```
> oc -n openshift-machine-api get pods | grep -v Running
NAME                       READY   STATUS
metal3-fc66f5846-gtq9m     0/7     Init:CrashLoopBackOff
metal3-image-cache-d4qcz   0/1     Init:1/2
metal3-image-cache-djzcf   0/1     Init:1/2
metal3-image-cache-p5mwg   0/1     Init:1/2
```
```
> oc -n openshift-machine-api logs deployment/metal3 -c metal3-machine-os-downloader
[omitted]
++ LIBGUESTFS_BACKEND=direct
++ virt-filesystems -a rhcos-412.86.202207142104-0-openstack.x86_64.qcow2 -l
/usr/local/bin/get-resource.sh: line 88: virt-filesystems: command not found
++ grep boot
++ cut -f1 '-d '
+ BOOT_DISK=
++ LIBGUESTFS_BACKEND=direct
++ virt-ls -a rhcos-412.86.202207142104-0-openstack.x86_64.qcow2 -m '' /boot/loader/entries
/usr/local/bin/get-resource.sh: line 90: virt-ls: command not found
+ BOOT_ENTRIES=
+ rm -fr /shared/tmp/tmp.CnCd2E3kxN
```
OpenShift 4.12.0-ec.0+
Since https://github.com/openshift/ocp-build-data/pull/1757, the ironic-machine-os-downloader container image is built using RHEL9 repositories.
However, following upstream move of guestfs tools to a dedicated repository [1], the libguestfs packaging differs between RHEL8 and RHEL9:
Since the Dockerfile specifies only the libguestfs-tools package, the virt-* commands are not installed when using RHEL9 repositories.
A trivial fix is to update the Dockerfile to install the guestfs-tools package instead of the libguestfs-tools package.
Regards,
Denis
4.2 AWS boot images such as ami-01e7fdcb66157b224 include the old ignition.platform.id=ec2 kernel command line parameter. When launched against 4.12.0-rc.3, new machines fail with:
coreos-assemblers used ignition.platform.id=ec2, but pivoted to =aws here. It's not clear when that made its way into new AWS boot images. Some time after 4.2 and before 4.6.
Afterburn dropped support for legacy command-line options like the ec2 slug in 5.0.0. But it's not clear when that shipped into RHCOS. The release controller points at this RHCOS diff, but that has afterburn-0-5.3.0-1 builds on both sides.
100%, given a sufficiently old AMI and a sufficiently new OpenShift release target.
The new Machine will get to Provisioned but fail to progress to Running. systemd journal logs will include unknown provider 'ec2' for Afterburn units.
Old boot-image AMIs can successfully update to 4.12.
Alternatively, we pin down the set of exposed boot images sufficiently that users with older clusters can audit for exposure and avoid the issue by updating to more modern boot images (although updating boot images is not trivial; see RFE-3001 and the Ignition spec 2 to 3 transition discussed in kcs#5514051).
When we get telemetry from connected clusters, we want to be able to tell when they were created with the agent installer vs. the hosted assisted service. Currently there is no way to distinguish them.
It's not clear whether any particular group owns the namespace of installation methods, or whom we need to notify when we create one.
As a developer, I would like to remove the random Terraform provider because it is essentially unnecessary, and removing it would improve our build process.
The random Terraform provider is used in Azure & Azure Stack to create a random string. This could easily be done in Go code and passed in as a variable.
Removing an extra provider would decrease our build time and improve our build stability, which is often failing due to timeouts.
The random string is used here in Azure (and similarly in Azure Stack):
https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L23-L27
One approach would be to generate the string in tfvars and pass it in as a terraform variable.
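As a rough illustration of that approach (the function name and charset are assumptions, not the installer's actual code), the suffix could be generated in Go and then handed to Terraform via tfvars:

package tfvars

import (
	"crypto/rand"
	"math/big"
)

// randomString returns a random lowercase alphanumeric string of length n.
// A hypothetical helper; the real tfvars wiring would pass the result in
// as a Terraform variable instead of invoking the random provider.
func randomString(n int) (string, error) {
	const charset = "abcdefghijklmnopqrstuvwxyz0123456789"
	b := make([]byte, n)
	for i := range b {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(charset))))
		if err != nil {
			return "", err
		}
		b[i] = charset[idx.Int64()]
	}
	return string(b), nil
}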
Currently the controller sets the status to Done each time it sees a host that is ready in k8s, without checking whether the status was already set; a guard like the sketch after the log lines below would avoid the repeated updates.
time="2022-09-13T19:03:45Z" level=info msg="Found new ready node ocp-2.cluster1.kpsalerno.us.ibm.com with inventory id 2da64d56-5057-78c6-ea6e-bf74a783bd79, kubernetes id 2da64d56-5057-78c6-ea6e-bf74a783bd79, updating its status to Done" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:255" request_id=6258e5a2-4e78-4148-a913-45d704a0fa1d
time="2022-09-13T19:04:05Z" level=info msg="Found new ready node ocp-2.cluster1.kpsalerno.us.ibm.com with inventory id 2da64d56-5057-78c6-ea6e-bf74a783bd79, kubernetes id 2da64d56-5057-78c6-ea6e-bf74a783bd79, updating its status to Done" func="github.com/openshift/assisted-installer/src/assisted_installer_controller.(*controller).waitAndUpdateNodesStatus" file="/remote-source/app/src/assisted_installer_controller/assisted_installer_controller.go:255" request_id=49e4e63f-cf4f-4b9f-b1f3-923c473c09dd
Description of problem:
Automatic ART PRs to update the build config are failing. Needs manual intervention.
This is a clone of issue OCPBUGS-3993. The following is the description of the original issue:
—
Description of problem:
On Openshift on Openstack CI, we are deploying an OCP cluster with an additional network on the workers in install-config.yaml for integration with Openstack Manila.
compute:
- name: worker
  platform:
    openstack:
      zones: []
      additionalNetworkIDs: ['0eeae16f-bbc7-4e49-90b2-d96419b7c30d']
  replicas: 3
As a result, the egressIP annotation includes two interfaces definition:
$ oc get node ostest-hp9ld-worker-0-gdp5k -o json | jq -r '.metadata.annotations["cloud.network.openshift.io/egress-ipconfig"]' | jq .
[
  {
    "interface": "207beb76-5476-4a05-b412-d0cc53ab00a7",
    "ifaddr": { "ipv4": "10.46.44.64/26" },
    "capacity": { "ip": 8 }
  },
  {
    "interface": "2baf2232-87f7-4ad5-bd80-b6586de08435",
    "ifaddr": { "ipv4": "172.17.5.0/24" },
    "capacity": { "ip": 10 }
  }
]
According to Huiran Wang, egressIP only works for the primary interface on the node.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-11-22-012345 RHOS-16.1-RHEL-8-20220804.n.1
How reproducible:
Always
Steps to Reproduce:
Deploy cluster with additional Network on the workers
Actual results:
It is possible to select an egressIP network for a secondary interface
Expected results:
Only the primary subnet can be chosen for egressIP
Additional info:
https://issues.redhat.com/browse/OCPQE-12968
Allow users to turn on PodSecurity admission in enforcement mode in 4.12 as TechPreviewNoUpgrade, so they can test the feature with their workloads and see whether anything needs fixing.
Description of problem:
This PR: https://github.com/openshift/cluster-network-operator/pull/1612/files removed the fallback logic of checking for the host's kubeconfig file when apiserver-url.env was not populated on the machine. In IBM Cloud ROKS (both public cloud and Satellite (Hypershift)) this file is not populated, which means any upgrade to 4.12 will cause the cluster network operator to fail and impact the cluster.

I am proposing the following plan. First, this PR is held until 4.13. Second, the IBM Cloud ROKS team will ensure from the initial release of 4.12 that this file is populated across its entire fleet of workers (4.12 and beyond). Holding this to 4.13 allows a seamless upgrade experience when the user upgrades the control plane to 4.12 while the workers are still on 4.11. When the user then upgrades to 4.13, their workers will all be at 4.12, which is guaranteed to have this file, and the logic to remove the check for the host kubeconfig can be removed.

For full disclosure, it was brought up that we could push a daemonset across our entire fleet of 16000+ ROKS clusters that just lays down the file, but that still introduces race conditions with the network-operator and results in a significant resource increase of cluster workload across our entire fleet, which the plan proposed above would avoid.

Example on a ROKS on Satellite worker showing that this file does not exist (yet):
[root@tyler-test-24 ~]# ls /etc/kubernetes/apiserver-url.env
ls: cannot access '/etc/kubernetes/apiserver-url.env': No such file or directory
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-5988. The following is the description of the original issue:
—
Description of problem:
Etcd operator is in a degraded state as one of the masters can't connect. The master that fails to connect was previously the bootstrap node and was pivoted to a master as part of the assisted-installer installation.

Etcd log:
2023-01-17T23:09:26.523562312Z 28dcf1b0a44481b0, started, test-infra-cluster-04bf4418-master-1, https://192.168.127.11:2380, https://192.168.127.11:2379, false
2023-01-17T23:09:26.523562312Z 30600b5b86e23c8e, started, etcd-bootstrap, https://192.168.127.12:2380, https://192.168.127.12:2379, false
2023-01-17T23:09:26.523562312Z 73f00626fee34a87, started, test-infra-cluster-04bf4418-master-0, https://192.168.127.10:2380, https://192.168.127.10:2379, false
2023-01-17T23:09:26.541214220Z #### attempt 0
2023-01-17T23:09:26.547811132Z member={name="test-infra-cluster-04bf4418-master-1", peerURLs=[https://192.168.127.11:2380}, clientURLs=[https://192.168.127.11:2379]
2023-01-17T23:09:26.547811132Z member={name="etcd-bootstrap", peerURLs=[https://192.168.127.12:2380}, clientURLs=[https://192.168.127.12:2379]
2023-01-17T23:09:26.547811132Z member={name="test-infra-cluster-04bf4418-master-0", peerURLs=[https://192.168.127.10:2380}, clientURLs=[https://192.168.127.10:2379]
2023-01-17T23:09:26.547811132Z target={name="etcd-bootstrap", peerURLs=[https://192.168.127.12:2380}, clientURLs=[https://192.168.127.12:2379]
2023-01-17T23:09:26.547846508Z member "https://192.168.127.12:2380" dataDir has been destroyed and must be removed from the cluster

There are a couple of problems that we see:

1. For an unknown reason the etcd operator's BootstrapTeardownController fails to start, as it fails to see the "openshift-etcd" namespace even though the logs show it is there.
2023-01-17T21:39:43.323928903Z E0117 21:39:43.323917 1 base_controller.go:272] BootstrapTeardownController reconciliation failed: failed to get bootstrap scaling strategy: failed to get openshift-etcd names

2. The DelayStrategy code was changed by https://github.com/openshift/cluster-etcd-operator/pull/964/files and currently requires 3 healthy members before a member can be removed. This can create issues, as etcd and cluster-bootstrap (bootkube) are not synchronized and nothing actually blocks bootstrap from stopping etcd, nor blocks removal of the bootstrap etcd (at least as I understand the flow).
Version-Release number of selected component (if applicable):
How reproducible:
It is a race as far as I understand, but it reproduces fairly reliably in our CI when installing 4.12 nightlies
Steps to Reproduce:
1. 2. 3.
Actual results:
Etcd is degraded because the third joined master's etcd can't start
Expected results:
Etcd is healthy
Additional info:
Description of problem:
`oc-mirror` hits an error when the docker destination has no namespace, for OCI format mirroring
How reproducible:
always
Steps to Reproduce:
Copy the operator image with OCI format to localhost;
cat copy.yaml
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
mirror:
  operators:
`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000`
Actual results:
2. Hit error:
`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000`
……
info: Mirroring completed in 30ms (0B/s)
error: mirroring images "localhost:5000//multicluster-engine/mce-operator-bundle@sha256:e7519948bbcd521390d871ccd1489a49aa01d4de4c93c0b6972dfc61c92e0ca2" is not a valid image reference: invalid reference format
Expected results:
2. No error
Additional info:
`oc-mirror --config mirror.yaml --use-oci-feature --oci-feature-action=mirror --dest-skip-tls docker://localhost:5000/ocmir` works well.
This is a clone of issue OCPBUGS-3426. The following is the description of the original issue:
—
Description of problem:
We need to update the operator to be synced with the K8s API version used by OCP 4.13. We also need to sync our samples libraries with the latest available libraries. Any deprecated libraries should be removed as well.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)
OCPBUGS-1677 is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.
This is about updating the code so that the test uses a mock response instead of the latest registry content, OR checks some specific attributes instead of comparing the full JSON response.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh
Actual results:
Unit tests fail
Expected results:
Unit tests should pass again
Additional info:
This is a clone of issue OCPBUGS-6053. The following is the description of the original issue:
—
Description of problem:
When a ClusterVersion's `status.availableUpdates` has a value of `null` and `Upgradeable=False`, a runtime error occurs on the Cluster Settings page because the UpdatesGraph component expects `status.availableUpdates` to have a non-empty value.
Steps to Reproduce:
1. Add the following overrides to the ClusterVersion config (/k8s/cluster/config.openshift.io~v1~ClusterVersion/version)

spec:
  overrides:
    - group: apps
      kind: Deployment
      name: console-operator
      namespace: openshift-console-operator
      unmanaged: true
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      name: console-operator
      namespace: ''
      unmanaged: true

2. Visit /settings/cluster and note the run-time error (see attached screenshot)
Actual results:
An error occurs.
Expected results:
The contents of the Cluster Settings page render.
Description of problem:
The function desiredIngressClass doesn't specify ingressClass.spec.parameters.scope, while the IngressClass API object defaults it to "Cluster". This causes unneeded updates to all IngressClasses when the CIO starts: the CIO fights with the API default any time an update triggers a change in an IngressClass.
Reference: https://github.com/kubernetes/api/blob/master/networking/v1/types.go#L640
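A minimal sketch of the fix idea (the wiring is illustrative, not the operator's actual code): populate the scope explicitly so the desired IngressClass already matches the API server's default and no diff is detected on startup:

package ingressclass

import (
	networkingv1 "k8s.io/api/networking/v1"
	"k8s.io/utils/pointer"
)

// withExplicitScope fills in the "Cluster" scope that the API server
// would otherwise default, so comparing desired vs. current no longer
// produces a spurious update on every operator start.
func withExplicitScope(ic *networkingv1.IngressClass) {
	if ic.Spec.Parameters != nil && ic.Spec.Parameters.Scope == nil {
		ic.Spec.Parameters.Scope = pointer.String(networkingv1.IngressClassParametersReferenceScopeCluster)
	}
}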
Version-Release number of selected component (if applicable):
4.8+
How reproducible:
Steps to Reproduce:
We really need https://issues.redhat.com/browse/OCPBUGS-6700 to be fixed before we can identify these spurious updates. But when it is fixed:
# Delete CIO
oc delete pod -n openshift-ingress-operator $(oc get -n openshift-ingress-operator pods --no-headers | head -1 | awk '{print $1}')
# Wait a minute for it to start back up
# Should be NO updates to IngressClasses
oc logs -n openshift-ingress-operator $(oc get -n openshift-ingress-operator pods --no-headers | head -1 | awk '{print $1}') -c ingress-operator | grep "updated IngressClass"
# Instead, we see this every time CIO starts up
2023-01-26T20:57:15.281Z INFO operator.ingressclass_controller ingressclass/ingressclass.go:63 updated IngressClass {"name": "openshift-default",
Actual results:
2023-01-26T20:57:15.281Z INFO operator.ingressclass_controller ingressclass/ingressclass.go:63 updated IngressClass {"name": "openshift-default", ...
Expected results:
No updates to IngressClasses upon CIO restart
Additional info:
Description of problem:
After IPI-installing a 3-node hub cluster and converting it to dual stack, an fd69::/125 address is seen on the baremetal br-ex interface
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Randomly reproduced; the IP is assigned on one of the 3 master hub cluster nodes
Steps to Reproduce:
1. IPI install 4.12.0
2. Use the convert to IPv4/IPv6 dual stack procedure.
Actual results:
Check for the IP fd69::/125 in the br-ex interface
OVN CrashLoopBackOff
Expected results:
The IP is an internal OVNKUBE IP, and it should not appear on the interface.
fd69::2/125 should be present on br-ex, but make sure fd69::2 does not:
Additional info:
This is one of the IPv6 issues that was discovered; the other issue is linked here as well.
Description of problem:
We have been investigating an issue with slow kube-apiserver rollout times. When a new revision is created, the current static pod is deleted and a new one created to pick up the revision. There is a 5 min timer on the creation; if this timeout is exceeded, the rollout will revert to the previous revision.
The customer has been seeing failed rollouts due to this 5 min timer being exceeded. There is load on the platform CPUs, with the biggest contributor being exec probe overhead, but there is still significant idle time (~50%).
While not able to reproduce to the same degree as the customer, I was able to reproduce slow rollout times with a similar platform cpu overhead.
From the logs, we see slow container creation times.
I added some instrumentation to the low_latency_hooks.sh script
snip
pid=$(jq '.pid' /dev/stdin 2>&1)
logger "Start low_latency_hooks ${pid}"
[[ $? -eq 0 && -n "${pid}" ]] || { logger "${0}: Failed to extract the pid: ${pid}"; exit 0; }
snip
if [ "${mode}" = "ro" ]; then
ip netns exec "${ns}" mount -o remount,ro /sys
[ $? -eq 0 ] || exit 1 # Error out so the pod will not start with a writable /sys
fi
logger "Stop low_latency_hooks ${pid}"
Analysing the logs for the five running containers in the apiserver we see that the bulk of the time is being spent in the hook.
| Container | Total container create time | Hook time |
| --- | --- | --- |
| insecure-readyz | 35s | 29s |
| cert-syncer | 41s | 32s |
| cert-regeneration-controller | 73s | 54s |
| kube-apiserver | 18s | 16s |
| check-endpoints | 31s | 31s |
I ran another test where I removed the oci hook and kept everything else the same; the results were dramatically different.
Container create times:
insecure-readyz - 1s
cert-syncer - 1s
cert-regeneration-controller - 1s
kube-apiserver - 1s
check-endpoints - 5s
I was then able to run the same test in the customer's lab. In some joint testing we did with the customer we originally saw 4-5 mins for a rollout. Without the hook in the exact same environment, the total rollout time dropped to <=2 mins.
Version-Release number of selected component (if applicable):
4.9.37
The issue is likely present in later releases as well, but has not been timed yet
How reproducible:
100%
Steps to Reproduce:
1. Force a rollout with a platform cpu load representative of the application
2.
3.
Actual results:
Slow rollout times sometimes exceeding the timeout
Expected results:
Rollout should fit into the timeout window
Additional info:
At runtime we know the version of OpenShift that we're installing, so we can dynamically generate the OS_IMAGES environment variable to point at the image for the current release. This will prevent having to add to the hard-coded list for every release.
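A rough sketch of the idea, assuming a JSON-array OS_IMAGES format; the entry fields and URL below are placeholders, not the real values:

package main

import (
	"fmt"
	"os"
)

// setOSImages derives the OS_IMAGES entry for the current release at
// runtime instead of extending a hard-coded per-release list.
func setOSImages(openshiftVersion, isoURL string) {
	entry := fmt.Sprintf(`[{"openshift_version":"%s","url":"%s"}]`, openshiftVersion, isoURL)
	os.Setenv("OS_IMAGES", entry)
}

func main() {
	// Hypothetical values for illustration.
	setOSImages("4.12", "https://example.com/rhcos-4.12-live.x86_64.iso")
	fmt.Println(os.Getenv("OS_IMAGES"))
}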
Description of problem:
When all Helm chart repositories are disabled, the Helm navigation item is disabled.
To re-enable the Helm charts, the user can search for HCRs or PHCRs, but the action menu doesn't work if no other Helm chart repo is enabled.
Version-Release number of selected component (if applicable):
Only 4.12 (4.11 is fine)
How reproducible:
Always
Steps to Reproduce:
1. Switch to developer perspective
2. Navigate to Helm > Repos > Edit the default repo and disable it
3. The Helm navigation item should disappear and the content area may switch to a 404; that's fine.
4. Navigate to Search and select HelmChartRepository as resource
5. Click on the action menu (kebab icon) to edit the HCR
Actual results:
The action menu is not shown
Expected results:
The action menu should be shown so that the user can edit or delete the HCR.
Additional info:
Manoj noticed that the cluster registration fails for SNO clusters when the network type is set to OpenShiftSDN. We should add some validation to prevent this combination.
Failed to register cluster with assisted-service: AssistedServiceError
Code: 400
Href:
ID: 400
Kind: Error
Reason: OpenShiftSDN network type is not allowed in single node mode
Documentation also indicates OpenShiftSDN is not compatible: https://docs.openshift.com/container-platform/4.11/installing/installing_sno/install-sno-preparing-to-install-sno.html
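A minimal sketch of such a validation (function and inputs are hypothetical), mirroring the assisted-service error quoted above:

package validation

import "fmt"

// validateNetworkType rejects the OpenShiftSDN network type when the
// control plane has a single node (SNO), before registration is attempted.
func validateNetworkType(controlPlaneCount int, networkType string) error {
	if controlPlaneCount == 1 && networkType == "OpenShiftSDN" {
		return fmt.Errorf("OpenShiftSDN network type is not allowed in single node mode")
	}
	return nil
}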
This bug is a backport clone of [Bugzilla Bug 2090680](https://bugzilla.redhat.com/show_bug.cgi?id=2090680). The following is the description of the original bug:
—
Description of problem:
Version-Release number of the following components:
4.11.0-0.nightly-2022-05-25-123329
How reproducible:
Always
Steps to Reproduce:
1. set up a cluster in a restricted network using 4.11.0-0.nightly-2022-05-25-123329
2. mirror 4.11.0-0.nightly-2022-05-25-193227 to private registry
3. upgrade the cluster to 4.11.0-0.nightly-2022-05-25-193227 without --force option
$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd
Actual results:
Waited for 3+ hours; there is no upgrade history info in clusterversion, and from the event log one can only see "Retrieving and verifying payload".
[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-05-25-123329 True False 160m Cluster version is 4.11.0-0.nightly-2022-05-25-123329
[root@preserve-jialiu-ansible ~]# oc get clusterversion -o yaml
apiVersion: v1
items:
[root@preserve-jialiu-ansible ~]# oc get event -n openshift-cluster-version
LAST SEEN TYPE REASON OBJECT MESSAGE
3h11m Warning FailedScheduling pod/cluster-version-operator-b4b6c5f9b-p7fjq no nodes available to schedule pods
3h9m Warning FailedScheduling pod/cluster-version-operator-b4b6c5f9b-p7fjq no nodes available to schedule pods
3h4m Normal Scheduled pod/cluster-version-operator-b4b6c5f9b-p7fjq Successfully assigned openshift-cluster-version/cluster-version-operator-b4b6c5f9b-p7fjq to jialiu411a-5nb8n-master-2 by jialiu411a-5nb8n-bootstrap
3h2m Warning FailedMount pod/cluster-version-operator-b4b6c5f9b-p7fjq MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found
3h1m Warning FailedMount pod/cluster-version-operator-b4b6c5f9b-p7fjq Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[etc-ssl-certs etc-cvo-updatepayloads serving-cert service-ca kube-api-access]: timed out waiting for the condition
3h1m Normal Pulling pod/cluster-version-operator-b4b6c5f9b-p7fjq Pulling image "registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h1m Normal Pulled pod/cluster-version-operator-b4b6c5f9b-p7fjq Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1" in 1.384468759s
3h1m Normal Created pod/cluster-version-operator-b4b6c5f9b-p7fjq Created container cluster-version-operator
3h1m Normal Started pod/cluster-version-operator-b4b6c5f9b-p7fjq Started container cluster-version-operator
3h11m Normal SuccessfulCreate replicaset/cluster-version-operator-b4b6c5f9b Created pod: cluster-version-operator-b4b6c5f9b-p7fjq
3h11m Normal ScalingReplicaSet deployment/cluster-version-operator Scaled up replica set cluster-version-operator-b4b6c5f9b to 1
3h12m Normal LeaderElection configmap/version jialiu411a-5nb8n-bootstrap_0a3ff57f-66cf-4f93-bbe0-484effcc4383 became leader
3h12m Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h12m Normal LoadPayload clusterversion/version Loading payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h12m Normal PayloadLoaded clusterversion/version Payload loaded version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m Normal LeaderElection configmap/version jialiu411a-5nb8n-master-2_83752e0b-1ef4-4c69-814f-8eeb54d50781 became leader
166m Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m Normal LoadPayload clusterversion/version Loading payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m Normal PayloadLoaded clusterversion/version Payload loaded version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
77m Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"
Expected results:
CVO and `oc adm upgrade` should clearly tell the user what issues happened there, not stay pending for a long time without any info.
Additional info:
Trying the same upgrade path against a connected cluster, the upgrade is kicked off soon and no such issues occur.
This is a clone of issue OCPBUGS-3096. The following is the description of the original issue:
—
While the installer binary is statically linked, the terraform binaries shipped with it are dynamically linked.
This can cause issues when running the installer on Linux, depending on the GLIBC version the specific Linux distribution has installed. It becomes a risk when switching the base image of the builders from ubi8 to ubi9 and trying to run the installer on cs8 or rhel8.
For example, building the installer on cs9 and trying to run it in a cs8 distribution leads to:
time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform version -json" time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)" time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)" time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform version -json" time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)" time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)" time="2022-10-31T14:31:47+01:00" level=debug msg="[INFO] running Terraform command: /root/test/terraform/bin/terraform init -no-color -force-copy -input=false -backend=true -get=true -upgrade=false -plugin-dir=/root/test/terraform/plugins" time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)" time="2022-10-31T14:31:47+01:00" level=error msg="/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)" time="2022-10-31T14:31:47+01:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failure applying terraform for \"cluster\" stage: failed to create cluster: failed doing terraform init: exit status 1\n/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /root/test/terraform/bin/terraform)\n/root/test/terraform/bin/terraform: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /root/test/terraform/bin/terraform)\n"
How reproducible:
Always
Steps to Reproduce:
1. Build the installer on cs9
2. Run the installer on cs8 until the terraform binaries are started
3. Look at the terraform binary with ldd or file: it is not a statically linked binary, and the error above may occur depending on the glibc version you are running on
Actual results:
Expected results:
The terraform and provider binaries have to be statically linked, as the installer is.
Additional info:
This comes from a build of OKD/SCOS that is happening outside of Prow on a cs9-based builder image. One can use the Dockerfile at images/installer/Dockerfile.ci and replace the builder image with one like https://github.com/okd-project/images/blob/main/okd-builder.Dockerfile
This is a clone of issue OCPBUGS-3761. The following is the description of the original issue:
—
Description of problem:
Events.Events: event view displays created pod https://search.ci.openshift.org/?search=event+view+displays+created+pod&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run event scenario tests and note the results below:
Actual results:
{Expected '' to equal 'test-vjxfx-event-test-pod'. toEqual
Error: Failed expectation
    at /go/src/github.com/openshift/console/frontend/integration-tests/tests/event.scenario.ts:65:72
    at Generator.next (<anonymous>:null:null)
    at fulfilled (/go/src/github.com/openshift/console/frontend/integration-tests/tests/event.scenario.ts:5:58)
    at runMicrotasks (<anonymous>:null:null)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
}
Expected results:
Additional info:
Description of problem:
pkg/devfile/sample_test.go fails after devfile registry was updated (https://github.com/devfile/registry/pull/126)
This issue is about updating our assertion so that the CI job runs successfully again. We might want to backport this as well.
OCPBUGS-1678 is about updating the code so that the test uses a mock response instead of the latest registry content, OR checks some specific attributes instead of comparing the full JSON response.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Clone openshift/console
2. Run ./test-backend.sh
Actual results:
Unit tests fail
Expected results:
Unit tests should pass again
Additional info:
We need to merge the new retry code from https://github.com/ovn-org/ovn-kubernetes/pull/3171 downstream on master, so that we can also merge the retry logic applied to the ovn-k node.
Description of problem:
prometheus-k8s-0 ends up in CrashLoopBackOff with level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0" on SNO after hard reboot tests
Version-Release number of selected component (if applicable):
4.11.6
How reproducible:
Not always, after ~10 attempts
Steps to Reproduce:
1. Deploy SNO with Telco DU profile applied
2. Hard reboot node via out of band interface
3. oc -n openshift-monitoring get pods prometheus-k8s-0
Actual results:
NAME               READY   STATUS             RESTARTS          AGE
prometheus-k8s-0   5/6     CrashLoopBackOff   125 (4m57s ago)   5h28m
Expected results:
Running
Additional info:
Attaching must-gather. The pod recovers successfully after deleting/re-creating.

[kni@registry.kni-qe-0 ~]$ oc -n openshift-monitoring logs prometheus-k8s-0
ts=2022-09-26T14:54:01.919Z caller=main.go:552 level=info msg="Starting Prometheus Server" mode=server version="(version=2.36.2, branch=rhaos-4.11-rhel-8, revision=0d81ba04ce410df37ca2c0b1ec619e1bc02e19ef)"
ts=2022-09-26T14:54:01.919Z caller=main.go:557 level=info build_context="(go=go1.18.4, user=root@371541f17026, date=20220916-14:15:37)"
ts=2022-09-26T14:54:01.919Z caller=main.go:558 level=info host_details="(Linux 4.18.0-372.26.1.rt7.183.el8_6.x86_64 #1 SMP PREEMPT_RT Sat Aug 27 22:04:33 EDT 2022 x86_64 prometheus-k8s-0 (none))"
ts=2022-09-26T14:54:01.919Z caller=main.go:559 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-09-26T14:54:01.919Z caller=main.go:560 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-09-26T14:54:01.921Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=127.0.0.1:9090
ts=2022-09-26T14:54:01.922Z caller=main.go:989 level=info msg="Starting TSDB ..."
ts=2022-09-26T14:54:01.924Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-09-26T14:54:01.926Z caller=main.go:848 level=info msg="Stopping scrape discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:862 level=info msg="Stopping notify discovery manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-09-26T14:54:01.926Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:899 level=info msg="Stopping scrape manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:858 level=info msg="Notify discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:891 level=info msg="Scrape manager stopped"
ts=2022-09-26T14:54:01.926Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:844 level=info msg="Scrape discovery manager stopped"
ts=2022-09-26T14:54:01.926Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-09-26T14:54:01.926Z caller=main.go:1120 level=info msg="Notifier manager stopped"
ts=2022-09-26T14:54:01.926Z caller=main.go:1129 level=error err="opening storage failed: /prometheus/chunks_head/000002: invalid magic number 0"
Description of problem:
Customer is running machine learning (ML) tasks on OpenShift Container Platform, for which large models need to be embedded in the container image. When building a new container image with large container image layers (>=10GB) and pushing it to the internal image registry, this fails with the following error message:
error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid
In the image registry Pod we can see the following error message:
time="2023-01-30T14:12:22.315726147Z" level=error msg="upload resumed at wrong offest: 10485760000 != 10738341637" [..] time="2023-01-30T14:12:22.338264863Z" level=error msg="response completed with error" err.code="blob upload invalid" err.message="blob upload invalid" [..]
Backend storage is AWS S3. We suspect that this could be the following upstream bug: https://github.com/distribution/distribution/issues/1698
Version-Release number of selected component (if applicable):
Customer encountered the issue on OCP 4.11.20. We reproduced the issue on OCP 4.11.21:
$ oc version
Client Version: 4.12.0
Kustomize Version: v4.5.7
Server Version: 4.11.21
Kubernetes Version: v1.24.6+5658434
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform cluster 4.11.21 on AWS
2. Confirm registry storage is on AWS S3
3. Create a new build including a 10GB file using the following command:
   `printf "FROM registry.fedoraproject.org/fedora:37\nRUN dd if=/dev/urandom of=/bigfile bs=1M count=10240" | oc new-build -D -`
4. Wait for some time for the build to run
Actual results:
Pushing the new build fails with the following error message:
error: build error: Failed to push image: writing blob: uploading layer to https://image-registry.openshift-image-registry.svc:5000/v2/example/example-image/blobs/uploads/b305b374-af79-4dce-afe0-afe6893b0ada?_state=[..]: blob upload invalid
Expected results:
Push of large container image layers succeeds
Additional info:
Description of problem:
When provisioningNetwork is changed from Disabled to Managed/Unmanaged, the ironic-proxy daemonset is not removed. This causes the metal3 pod to be stuck in Pending, since both pods are trying to use port 6385 on the host:
0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports. preemption: 0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports
Version-Release number of selected component (if applicable):
4.12rc.4
How reproducible:
Every time for me
Steps to Reproduce:
1. On a multinode cluster, change the provisioningNetwork from Disabled to Unmanaged (I didn't try Managed) 2. 3.
Actual results:
0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports. preemption: 0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports
Expected results:
I believe the ironic-proxy daemonset should be deleted when the provisioningNetwork is set to Managed/Unmanaged
Additional info:
If I manually delete the ironic-proxy Daemonset, the controller does not re-create it.
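A sketch of the expected reconcile behaviour (names are illustrative, not the cluster-baremetal-operator's actual code): when the provisioning network is no longer Disabled, the operator should delete the ironic-proxy DaemonSet so it stops holding port 6385 on the hosts:

package provisioning

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureIronicProxyAbsent removes the ironic-proxy DaemonSet, treating
// "already gone" as success so the call is safe to repeat on every reconcile.
func ensureIronicProxyAbsent(ctx context.Context, client kubernetes.Interface) error {
	err := client.AppsV1().DaemonSets("openshift-machine-api").
		Delete(ctx, "ironic-proxy", metav1.DeleteOptions{})
	if errors.IsNotFound(err) {
		return nil
	}
	return err
}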
This is a clone of issue OCPBUGS-4411. The following is the description of the original issue:
—
Description of problem:
Manually configuring ipv6 addresses and routes on an ipv4 OCP cluster to create a dual-stack cluster leaves newly created pods stuck in 'ContainerCreating' status
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. Enable ipv6 in network.
# more patch_dual.yaml
- op: add
  path: /spec/clusterNetwork/-
  value:
    cidr: fd01::/48
    hostPrefix: 64
- op: add
  path: /spec/serviceNetwork/-
  value: fd02::/112
# oc patch network.config.openshift.io cluster --type='json' --patch-file patch_dual.yaml

2. Configure ipv6 addresses and routes
PODS=$(oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned --field-selector=status.phase=Running --no-headers -o name)
i=10
for pod in $PODS; do
  oc exec -n openshift-cluster-node-tuning-operator $pod -- ip -6 addr add fd00:172:22::${i}/64 dev br-ex
  oc exec -n openshift-cluster-node-tuning-operator $pod -- ip -6 route add default via fd00:172:22::1 dev br-ex
  ((i=i+1))
done

3. Create pods; they will stay in ContainerCreating status.
4. If the ipv6 configuration is removed from the network, newly created pods can become ready.
Actual results:
Pod can not be running
Expected results:
Pod should be ready with both ipv4 and ipv6 address.
Additional info:
version:
# oc version
Client Version: 4.12.0-0.nightly-2022-11-30-182550
Kustomize Version: v4.5.7
Server Version: 4.12.0-0.nightly-2022-11-30-182550
Kubernetes Version: v1.25.2+5533733

Describe pods:
# oc describe pod iperf-rc-normal-qg6zd
Name:             iperf-rc-normal-qg6zd
Namespace:        offload-testing
Priority:         0
Service Account:  default
Node:             openshift-qe-025.lab.eng.rdu2.redhat.com/192.168.111.54
Start Time:       Thu, 01 Dec 2022 21:35:28 -0500
Labels:           name=iperf-pods-normal
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.129.2.7/23","fd01:0:0:6::3/64"],"mac_address":"0a:58:0a:81:02:07","gateway_ips":["10.129.2.1","fd01:0:0:6:...
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicationController/iperf-rc-normal
Containers:
  iperf:
    Container ID:
    Image:          quay.io/openshifttest/iperf3@sha256:440c59251338e9fcf0a00d822878862038d3b2e2403c67c940c7781297953614
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  340Mi
    Requests:
      memory:  340Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4266b (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-4266b:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                     From     Message
  ----     ------                  ----                    ----     -------
  Warning  FailedCreatePodSandBox  3m4s (x173 over 5h50m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_iperf-rc-normal-qg6zd_offload-testing_18673f13-37b4-40ea-aa5d-85654dfa5c85_0(4899f7150492fa4cd895c62d0ec25ac5c1507016037c31b6019849083b42cdb5): error adding pod offload-testing_iperf-rc-normal-qg6zd to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [offload-testing/iperf-rc-normal-qg6zd/18673f13-37b4-40ea-aa5d-85654dfa5c85:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[offload-testing/iperf-rc-normal-qg6zd 4899f7150492fa4cd895c62d0ec25ac5c1507016037c31b6019849083b42cdb5] [offload-testing/iperf-rc-normal-qg6zd 4899f7150492fa4cd895c62d0ec25ac5c1507016037c31b6019849083b42cdb5] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:02:07 [10.129.2.7/23 fd01:0:0:6::3/64] '
This is a clone of issue OCPBUGS-2281. The following is the description of the original issue:
—
Description of problem:
E2E test cases for the knative and pipeline packages have been disabled on CI due to operator installation issues. The tests have to be re-enabled once a new operator version is available or the issue is resolved.
References:
https://coreos.slack.com/archives/C6A3NV5J9/p1664545970777239
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Deployed an IPI cluster on a multi datacenter/cluster vsphere env; the installer failed for some reason, then we tried to destroy the cluster and found that one vm folder under one of the datacenters is not deleted.

When the installer exits, the following objects are attached with tag jima15b-cq7z7:
sh-4.4$ govc tags.attached.ls jima15b-cq7z7 | xargs govc ls -L
/IBMCloud/vm/jima15b-cq7z7
/datacenter-2/vm/jima15b-cq7z7
/datacenter-2/vm/jima15b-cq7z7/jima15b-cq7z7-rhcos-us-west-us-west-1a
/IBMCloud/vm/jima15b-cq7z7/jima15b-cq7z7-rhcos-us-east-us-east-2a
/IBMCloud/vm/jima15b-cq7z7/jima15b-cq7z7-rhcos-us-east-us-east-3a
/IBMCloud/vm/jima15b-cq7z7/jima15b-cq7z7-rhcos-us-east-us-east-1a
/IBMCloud/vm/jima15b-cq7z7/jima15b-cq7z7-bootstrap

sh-4.4$ ./openshift-install destroy cluster --dir ipi_missingzones/
INFO Destroyed VirtualMachine=jima15b-cq7z7-rhcos-us-west-us-west-1a
INFO Destroyed VirtualMachine=jima15b-cq7z7-rhcos-us-east-us-east-2a
INFO Destroyed VirtualMachine=jima15b-cq7z7-rhcos-us-east-us-east-3a
INFO Destroyed VirtualMachine=jima15b-cq7z7-rhcos-us-east-us-east-1a
INFO Destroyed VirtualMachine=jima15b-cq7z7-bootstrap
INFO Destroyed Folder=jima15b-cq7z7
INFO Deleted Tag=jima15b-cq7z7
INFO Deleted TagCategory=openshift-jima15b-cq7z7
INFO Time elapsed: 55s

After destroying the cluster, folder jima15b-cq7z7 is still there, not deleted:
sh-4.4$ govc ls /datacenter-2/vm/ | grep jima15b-cq7z7
/datacenter-2/vm/jima15b-cq7z7
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-18-141547
How reproducible:
Always, when the installer fails to create infrastructure; it works when installation is successful.
Steps to Reproduce:
1. Deploy IPI cluster on a vsphere env configured with multiple datacenters/clusters
2. Installer fails to create infrastructure for some reason
3. Destroy cluster
4. One folder is not deleted
Actual results:
one folder is not deleted
Expected results:
All infrastructures created by installer should be removed
Additional info:
https://github.com/openshift/origin/pull/27444 was intended to move the scaling test out of serial to its own test suite, but it added it to parallel, meaning it's running in all our normal upgrade jobs, causing them to frequently fail with repeating pathological events as well as greatly increasing their run time.
See https://github.com/openshift/origin/pull/27444#discussion_r991296925 for more info
Description of problem:
This is a follow up on OCPBUGSM-47202 (https://bugzilla.redhat.com/show_bug.cgi?id=2110570)
While OCPBUGSM-47202 fixes the issue specifically for Set Pod Count, many other actions aren't fixed. When the user updates a Deployment with one of these options and selects the action again, the old values are still shown.
Version-Release number of selected component (if applicable):
4.8-4.12 as well as master with the changes of OCPBUGSM-47202
How reproducible:
Always
Steps to Reproduce:
Actual results:
Old data (labels, annotations, etc.) was shown.
Expected results:
Latest data should be shown
Additional info:
Description of problem:
When the Insights operator is marked as disabled, the "Available" operator condition is updated every 2 minutes. This is not desired and gives the impression that the operator is restarted every 2 minutes.
Version-Release number of selected component (if applicable):
How reproducible:
No extra steps needed, just watch "oc get co insights --watch"
Steps to Reproduce:
1. 2. 3.
Actual results:
available condition transition time updated every 2 min
Expected results:
available condition is updated only when its status changed
Additional info:
Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000
Under some circumstances the static pod machinery fails to populate the node status in time to generate the correct env variables for ETCD_URL_HOST, ETCD_NAME, etc. The pods that come up will then fail to start with those variables.
This is particularly pronounced in SNO topologies, leading to installation failures.
The fix is to fail fast in the targetconfig/envvar controller to ensure the CEO goes degraded instead of silently failing on the rollout of an invalid static pod.
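A minimal sketch of the fail-fast shape (names are hypothetical, not the actual controller code): refuse to render env vars from an incomplete node status so the operator goes Degraded instead of rolling out a broken static pod:

package etcdenvvar

import "fmt"

// buildEnvVars returns an error when the node status needed to compute
// ETCD_URL_HOST / ETCD_NAME is not yet populated, rather than silently
// producing empty values.
func buildEnvVars(nodeInternalIP, nodeName string) (map[string]string, error) {
	if nodeInternalIP == "" || nodeName == "" {
		return nil, fmt.Errorf("node status not yet populated, refusing to render etcd env vars")
	}
	return map[string]string{
		"ETCD_URL_HOST": nodeInternalIP,
		"ETCD_NAME":     nodeName,
	}, nil
}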
This is a clone of issue OCPBUGS-4950. The following is the description of the original issue:
—
Description of problem:
A PR bumping OLM's k8s dependencies to 1.25 wasn't merged into openshift 4.12
Version-Release number of selected component (if applicable):
openshift-4.12
How reproducible:
Always
Steps to Reproduce:
1. Check OLM's repository for k8s dependencies in the 4.12 branch
Actual results:
Has 1.24 k8s dependencies
Expected results:
Has 1.25 k8s dependencies
Additional info:
Description of problem:
The console crashes when it is used with a user settings ConfigMap that was created with a 4.13+ console. That version saves "null" for the key "console.pinnedResources", which didn't happen before, and the old console version cannot handle this.
Version-Release number of selected component (if applicable):
4.8-4.12
How reproducible:
Always, but only in the edge case that someone used a newer console first and then downgraded.
This can happen only by manually applying the user settings ConfigMap or when downgrading a cluster.
Steps to Reproduce:
Open the user-settings ConfigMap and set "console.pinnedResources" to "null" (with quotes, as all ConfigMap values need to be strings)
Or run this patch command:
oc patch -n openshift-console-user-settings configmaps user-settings-kubeadmin --type=merge --patch '{"data":{"console.pinnedResources":"null"}}'
Open console...
Actual results:
Console crashes
Expected results:
Console should not crash
Node healthz server was added in 4.13 with https://github.com/openshift/ovn-kubernetes/commit/c8489e3ff9c321e77f265dc9d484ed2549df4a6b and https://github.com/openshift/ovn-kubernetes/commit/9a836e3a547f3464d433ce8b9eef336624d51858. We need to configure it by default on 0.0.0.0:10256 on CNO for ovnk, just like we do for sdn.
Description of problem:
openshift-install does not detect releaseImage mismatches between cluster-image-set.yaml and registries.conf
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Create ZTP inputs for image generation where registries.conf does not have any source matching the binary releaseImage (the binary image can be obtained by running "openshift-install version"; you can also set this value in the ZTP manifest cluster-image-set.yaml)
2. Run openshift-install agent create image
Actual results:
Image is generated with no warnings
Expected results:
Image is generated with warning message - "The ImageContentSources configuration in install-config.yaml should have at-least one source field matching the releaseImage value %s", releaseImagePath
Additional info:
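For illustration, a minimal Go sketch of the requested check (the function name and inputs are hypothetical, not the installer's actual code):

package agent

import (
	"fmt"
	"strings"
)

// releaseImageMismatchWarning returns the warning text when no mirror
// source from registries.conf matches the repository of the release image
// baked into the installer binary, and "" when at least one matches.
func releaseImageMismatchWarning(releaseImage string, mirrorSources []string) string {
	repo := releaseImage
	// Naive tag/digest stripping, for illustration only.
	if i := strings.Index(repo, "@"); i > 0 {
		repo = repo[:i]
	} else if i := strings.LastIndex(repo, ":"); i > 0 {
		repo = repo[:i]
	}
	for _, src := range mirrorSources {
		if strings.HasPrefix(repo, src) {
			return ""
		}
	}
	return fmt.Sprintf("The ImageContentSources configuration in install-config.yaml should have at-least one source field matching the releaseImage value %s", repo)
}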
This is a clone of issue OCPBUGS-5136. The following is the description of the original issue:
—
Description of problem:
Provisioning on ilo4-virtualmedia BMC driver fails with error: "Creating vfat image failed: Unexpected error while running command"
Version-Release number of selected component (if applicable):
4.13 (but will apply to older OpenShift versions too)
How reproducible:
Always
Steps to Reproduce:
1. Configure some nodes with ilo4-virtualmedia://
2. Attempt provisioning
Actual results:
Provisioning fails with an error similar to:
Failed to inspect hardware. Reason: unable to start inspection: Validation of image href https://10.1.235.67:6183/ilo/boot-9db13f93-861a-4d27-b20d-2c228559faa2.iso failed, reason: HTTPSConnectionPool(host='10.1.235.67', port=6183): Max retries exceeded with url: /ilo/boot-9db13f93-861a-4d27-b20d-2c228559faa2.iso (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1129)')))
Expected results:
Provisioning succeeds
Additional info:
This happens after a preceding issue with missing iLO driver configuration has been fixed (https://github.com/metal3-io/ironic-image/pull/402)
Description of problem:
To address: 'Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests'
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Already merged in https://github.com/openshift/cluster-kube-apiserver-operator/pull/1398
This is a clone of issue OCPBUGS-7729. The following is the description of the original issue:
—
Description of problem:
Etcd's liveness probe should be removed.
Version-Release number of selected component (if applicable):
4.11
Additional info:
When the master hosts hit CPU load, this can cause a cascading restart loop for etcd and kube-apiserver due to the etcd liveness probes failing. Because of this loop, load on the masters stays high as the api and controllers restart over and over again. There is no reason for etcd to have a liveness probe; we removed this probe in 3.11 due to issues like this.
We cache images by filename, which works when downloading from the Internet as the filename always includes the CoreOS version.
However, when extracting an image from the release payload, it always has the same name. Therefore, we will never update it to a newer image even when running different versions of the installer.
A possible solution:
An alternative might be to set the name of the cache file to something different. It's not clear how we'd guarantee a match between the release payload we've been given and the ISO unless the name was based on the release payload (which eliminates some of the point of the cache, since ordinarily most release payloads will point to a small number of images).
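A sketch of that alternative naming scheme (the scheme itself is an assumption, not the installer's actual behavior): hash the release payload pullspec into the cache filename, so installers extracting from different payloads never collide on one entry:

package cache

import (
	"crypto/sha256"
	"fmt"
)

// cacheFileName derives a stable, payload-specific filename for the
// extracted ISO; the "coreos-" prefix and 8-byte digest are illustrative.
func cacheFileName(releasePayload string) string {
	sum := sha256.Sum256([]byte(releasePayload))
	return fmt.Sprintf("coreos-%x.iso", sum[:8])
}

As noted above, this trades away some cache sharing: every distinct payload gets its own entry, even when many payloads point at the same underlying image.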
Description of problem:
The platform-operators-aggregated cluster operator wasn't created after enabling "TechPreviewNoUpgrade" featureGate, as follows,
MacBook-Pro:~ jianzhang$ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type=merge
featuregate.config.openshift.io/cluster patched
MacBook-Pro:~ jianzhang$ oc wait --for=condition=Available=True clusteroperators.config.openshift.io/platform-operators-aggregated
Error from server (NotFound): clusteroperators.config.openshift.io "platform-operators-aggregated" not found
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-20-095559
How reproducible:
always
Steps to Reproduce:
1. Install OCP 4.12 cluster.
2. Enable "TechPreviewNoUpgrade" feature gate.
MacBook-Pro:~ jianzhang$ oc patch featuregate cluster -p '{"spec": {"featureSet": "TechPreviewNoUpgrade"}}' --type=merge
featuregate.config.openshift.io/cluster patched
3. Check platform-operators-aggregated cluster operator.
Actual results:
MacBook-Pro:~ jianzhang$ oc wait --for=condition=Available=True clusteroperators.config.openshift.io/platform-operators-aggregated
Error from server (NotFound): clusteroperators.config.openshift.io "platform-operators-aggregated" not found
Expected results:
The platform-operators-aggregated cluster operator can be created successfully.
Additional info:
The openshift-platform-operators pods are running well.
MacBook-Pro:~ jianzhang$ oc get deploy -n openshift-platform-operators
NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
platform-operators-controller-manager   1/1     1            1           126m
platform-operators-rukpak-core          1/1     1            1           126m
platform-operators-rukpak-webhooks      2/2     2            2           126m
MacBook-Pro:~ jianzhang$ oc get co platform-operators-aggregated
Error from server (NotFound): clusteroperators.config.openshift.io "platform-operators-aggregated" not found
Description of problem:
This is an OCP clone of https://bugzilla.redhat.com/show_bug.cgi?id=2099794 In summary, NetworkManager reports the network as being up before the ipv6 address of the primary interface is ready and crio fails to bind to it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-7719. The following is the description of the original issue:
—
An update from 4.13.0-ec.2 to 4.13.0-ec.3 got stuck on:
$ oc get clusteroperator machine-config
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.13.0-ec.2   True        True          True       30h     Unable to apply 4.13.0-ec.3: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 105, ready 105, updated: 105, unavailable: 0)]
The worker MachineConfigPool status included:
- lastTransitionTime: "2023-02-16T14:29:21Z"
  message: 'Failed to render configuration for pool worker: Ignoring MC 99-worker-generated-containerruntime generated by older version 8276d9c1f574481043d3661a1ace1f36cd8c3b62 (my version: c06601510c0917a48912cc2dda095d8414cc5182)'
  type: NodeDegraded
4.13.0-ec.3. The behavior was apparently introduced as part of OCPBUGS-6018, which has been backported, so the following update targets are expected to be vulnerable: 4.10.52+, 4.11.26+, 4.12.2+, and 4.13.0-ec.3.
100%, when updating into a vulnerable release, if you happen to have leaked MachineConfig.
1. 4.12.0-ec.1 dropped cleanUpDuplicatedMC. Run a later release, like 4.13.0-ec.2.
2. Create more than one KubeletConfig or ContainerRuntimeConfig targeting the worker pool (or any pool other than master). The number of clusters who have had redundant configuration objects like this is expected to be small.
3. (Optionally?) delete the extra KubeletConfig and ContainerRuntimeConfig.
4. Update to 4.13.0-ec.3.
Update sticks on the machine-config ClusterOperator, as described above.
Update completes without issues.
Description of problem:
METAL-256 introduced Ironic API proxy pods. The pods start with IPv4, but crash loop if IPv6 is used. This blocks the Assisted ZTP flow (this was with converged flow DISABLED).
[root@ocp-edge34 opt]# oc get pods -n openshift-machine-api
NAME                                                  READY   STATUS             RESTARTS         AGE
cluster-autoscaler-operator-85b7c7c69b-2wdh9          2/2     Running            2 (14h ago)      15h
cluster-baremetal-operator-8555c9dc87-t5rm4           2/2     Running            0                15h
control-plane-machine-set-operator-6c4f7fff6f-fts4p   1/1     Running            0                15h
ironic-proxy-67wkh                                    0/1     CrashLoopBackOff   164 (108s ago)   13h
ironic-proxy-9qg6h                                    0/1     CrashLoopBackOff   163 (106s ago)   13h
ironic-proxy-hxft5                                    0/1     CrashLoopBackOff   164 (108s ago)   13h
machine-api-controllers-6b4f47899b-7xqb8              7/7     Running            0                14h
machine-api-operator-544587645d-9rv4m                 2/2     Running            0                15h
metal3-7688b65d7f-kc2mg                               5/5     Running            0                13h
metal3-image-cache-4w24m                              1/1     Running            0                14h
metal3-image-cache-q7p54                              1/1     Running            0                14h
metal3-image-cache-vhnkj                              1/1     Running            0                14h
metal3-image-customization-5dcd9f4fb7-lpmrq           1/1     Running            0                13h
Apache is used for the underlying proxy, and I believe the ipv6 address probably just needs to be surrounded in brackets to pass syntax.
+ python3 -c 'import os; import sys; import jinja2; sys.stdout.write(jinja2.Template(sys.stdin.read()).render(env=os.environ))'
+ exec /usr/sbin/httpd -DFOREGROUND
AH00526: Syntax error on line 8 of /etc/httpd/conf.d/ironic-proxy.conf:
ProxyPass Unable to parse URL: https://fd2e:6f44:5dd8::79:6388/
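The actual fix belongs in the Jinja template that renders ironic-proxy.conf; purely as an illustration of the bracketing, Go's net.JoinHostPort shows the form the URL needs:

package main

import (
	"fmt"
	"net"
)

func main() {
	// JoinHostPort wraps IPv6 literals in brackets, producing a URL
	// that Apache's ProxyPass can parse.
	upstream := net.JoinHostPort("fd2e:6f44:5dd8::79", "6388")
	fmt.Printf("https://%s/\n", upstream) // https://[fd2e:6f44:5dd8::79]:6388/
}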
Version-Release number of selected component (if applicable):
OCP hub 4.12.0-ec.3
2.2.0-DOWNANDBACK-2022-09-26-15-59-33
How reproducible: 100%
Steps to Reproduce:
1. Deploy ocp bm compact/HA cluster with ipv6
2. Deploy MCE + Assisted Service
3. Try to deploy a spoke via full ZTP
Actual results: Spoke BMH on Hub cluster do nothing:
mstat-0   mstat-master-0-0-bmh   true   10h
mstat-0   mstat-master-0-1-bmh   true   10h
mstat-0   mstat-master-0-2-bmh   true   10h
mstat-0   mstat-worker-0-0-bmh   true   10h
mstat-0   mstat-worker-0-1-bmh   true   10h
Expected results: ZTP flow happens and spoke cluster deployed
Additional info:
As an OpenShift user, I want ClusterCSIDriver.Spec.LogLevel to affect the vSphere CSI driver logs, so I can capture the logs with all details and send them to Red Hat for investigation.
As an OpenShift developer, I want ClusterCSIDriver.Spec.LogLevel to affect the vSphere CSI driver logs, so I can debug the driver with all logs.
Exit criteria:
2022-08-05T11:54:10.808Z DEBUG commonco/utils.go:102 Container Orchestrator init params: {InternalFeatureStatesConfigInfo:{...} ServiceMode:controller}
This is a clone of issue OCPBUGS-3633. The following is the description of the original issue:
—
I think something is wrong with the alerts refactor, or perhaps my sync to 4.12.
Failed: suite=[openshift-tests], [sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel] Passed 1 times, failed 0 times, skipped 0 times: we require at least 6 attempts to have a chance at success
We're not getting the passes - from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.12-micro-release-openshift-release-analysis-aggregator/1592021681235300352, the successful runs don't show any record of the test at all. We need to record successes and failures for aggregation to work right.
Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-2997.
This is a clone of issue OCPBUGS-881. The following is the description of the original issue:
—
Description of problem:
Creating the install-config file for vsphere IPI against 4.12.0-0.nightly-2022-09-02-194931 fails, as apiVIP and ingressVIP are not in the machine CIDR.

$ ./openshift-install create install-config --dir ipi
? Platform vsphere
? vCenter xxxxxxxx
? Username xxxxxxxx
? Password [? for help] ********************
INFO Connecting to xxxxxxxx
INFO Defaulting to only available datacenter: SDDC-Datacenter
INFO Defaulting to only available cluster: Cluster-1
INFO Defaulting to only available datastore: WorkloadDatastore
? Network qe-segment
? Virtual IP Address for API 172.31.248.137
? Virtual IP Address for Ingress 172.31.248.141
? Base Domain qe.devcluster.openshift.com
? Cluster Name jimavmc
? Pull Secret [? for help] ****************************************************************************************************************************************************************************************
FATAL failed to fetch Install Config: failed to generate asset "Install Config": invalid install config: [platform.vsphere.apiVIPs: Invalid value: "172.31.248.137": IP expected to be in one of the machine networks: 10.0.0.0/16, platform.vsphere.ingressVIPs: Invalid value: "172.31.248.141": IP expected to be in one of the machine networks: 10.0.0.0/16]

As the user cannot define a CIDR for machineNetwork when creating the install-config file interactively, the default value 10.0.0.0/16 is used, so creating the install-config fails when the apiVIP and ingressVIP entered lie outside the default machineNetwork.

The error is thrown from https://github.com/openshift/installer/blob/master/pkg/types/validation/installconfig.go#L655-L666; the new function seems to have been introduced by PR https://github.com/openshift/installer/pull/5798. The issue should also impact the Nutanix platform.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-02-194931
How reproducible:
Always
Steps to Reproduce:
1. Create the install-config.yaml file by running the command "./openshift-install create install-config --dir ipi"
2. It fails with the above error
Actual results:
fail to create install-config.yaml file
Expected results:
succeed to create install-config.yaml file
Additional info:
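A hedged workaround sketch: after the install-config is generated, set networking.machineNetwork so the machine network contains the VIPs (the CIDR below is illustrative):
```
# Fragment of install-config.yaml; adjust the existing networking stanza rather than appending a duplicate key
networking:
  machineNetwork:
  - cidr: 172.31.248.0/24   # illustrative: must contain 172.31.248.137 and 172.31.248.141
```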
Description of problem:
Currently openshift-installer and the ARO installer have diverged in code bases. In an effort from the ARO team to reduce/remove this divergence, we are patching openshift-installer. ARO uses a newer version of the Azure SDK, and we need to backport this change to previous versions of openshift-installer.
Version-Release number of selected component (if applicable):
See affected versions
How reproducible:
N/A
Steps to Reproduce:
N/A
Actual results:
N/A
Expected results:
N/A
Additional info:
Description of problem:
The latest implementation of the history pruner (PR 805 [1]) increased the max upgrade history in the CVO to 100 and implemented a weight-based pruning priority strategy in case the history grows any larger. This pruning, however, is not happening, letting the history grow uncontrollably and potentially reach the resource limits of etcd or Kubernetes.
Observed the following while running continuous upgrade-rollback cycles:
$ oc get clusterversion version -o json | jq '.status.history|length'
203
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-23-223922
4.12.0-0.nightly-2022-08-23-153511
How reproducible:
1/1
Steps to Reproduce:
Same as described in bz2097067 [2], with the addition of waiting a few minutes after the first rollback to allow it to reach the 'Completed' state.
Actual results:
History grows uncontrollably
Expected results:
History should be pruned to keep max size of 100
Additional info:
[1] https://github.com/openshift/cluster-version-operator/pull/805
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2097067#c4
This bug is a backport clone of [Bugzilla Bug 2050230](https://bugzilla.redhat.com/show_bug.cgi?id=2050230). The following is the description of the original bug:
—
Description of problem:
In a large cluster, sdn daemonset can DoS the kube-apiserver with un-paginated LIST calls on high count resources.
Version-Release number of selected component (if applicable):
How reproducible:
NA
Steps to Reproduce:
NA
Actual results:
The Kube API server and OpenShift API server in one of the clusters keep restarting, without a proper exception. The cluster is not accessible.
Expected results:
Kube API Server and Openshift API Server should be stable.
Additional info:
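For context, a hedged sketch of what paginated LIST calls look like against the API server, which is what a well-behaved client should be doing on high-count resources (the resource and chunk size are illustrative):
```
# First page: ask for at most 500 pods; the response carries metadata.continue when truncated
oc get --raw '/api/v1/pods?limit=500' | jq -r '.metadata.continue'
# Subsequent pages: pass the continue token back until it comes back empty
oc get --raw "/api/v1/pods?limit=500&continue=<token>"   # <token> from the previous page
# oc/kubectl clients get the same behaviour with --chunk-size
oc get pods -A --chunk-size=500
```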
The issue was found while testing HOSTEDCP-400 and HOSTEDCP-401.
Hypershift operator installed with flags:
--platform-monitoring=operator-only --enable-uwm-telemetry-remote-write=true --metrics-set=telemetry
Service monitors and pod monitors in the control plane:
[jiezhao@cube hypershift]$ oc get servicemonitor -n clusters-jz-test
NAME                                  AGE
catalog-operator                      45m
cluster-version-operator              45m
etcd                                  46m
kube-apiserver                        46m
kube-controller-manager               45m
monitor-multus-admission-controller   43m
monitor-ovn-master-metrics            43m
node-tuning-operator                  45m
olm-operator                          45m
openshift-apiserver                   45m
openshift-controller-manager          45m
[jiezhao@cube hypershift]$ oc get podmonitor -n clusters-jz-test
NAME                              AGE
cluster-image-registry-operator   46m
controlplane-operator             47m
hosted-cluster-config-operator    46m
ignition-server                   47m
In OCP management web console, go to Observe->Targets:
1. The status of service monitor 'monitor-multus-admission-controller' is Down, error: "Scraped failed: server returned HTTP status 401 Unauthorized". It doesn't have a cluster id in its target labels.
2. The target of pod monitor 'cluster-image-registry-operator' is missing, not shown.
If the status of the hosts in assisted-installer changes from preparing-for-installation back to ready, that means it failed to generate the ignition configs needed to install, and installation will not proceed. When we see this we should report a failure immediately from agent wait-for bootstrap-complete. Currently we just time out some time after reporting this log message:
level=info msg=Host master-2.ostest.test.metalkube.org: updated status from preparing-for-installation to known (Host is ready to be installed)
To catch the case where the user runs the command after this failure has already happened, perhaps we should institute a relatively short timeout for installation to begin after all of the hosts are in the known state.
Description of problem:
To address: "Static Pod is managed but errored" err="managed container xxx does not have Resource.Requests"
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Already merged in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/660
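For context, a hedged illustration of the Resource.Requests the error message refers to; the container name, image, and values here are made up:
```
# Fragment of a static pod manifest: the operator expects each managed container to set requests
containers:
- name: kube-controller-manager      # illustrative
  image: quay.io/example/kcm:tag     # hypothetical image reference
  resources:
    requests:
      cpu: 60m
      memory: 200Mi
```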
Description of problem:
Provisioning interface on master node not getting ipv4 dhcp ip address from bootstrap dhcp server on OCP 4.10.16 IPI BareMetal install.
Customer is performing an OCP 4.10.16 IPI BareMetal install; the bootstrap node provisions just fine, but when the master nodes are booted for provisioning, they do not get an IPv4 address via DHCP. As such, the install does not move forward at this point.
Version-Release number of selected component (if applicable):
OCP 4.10.16
How reproducible:
Perform OCP 4.10.16 IPI BareMetal install.
Actual results:
The provisioning interface comes up (as evidenced by its IPv6 address) but does not get an IPv4 address via DHCP. The OCP install / provisioning fails at this point.
Expected results:
The provisioning interface receives an IPv4 address, and the master nodes (and subsequently the worker nodes) are provisioned successfully.
Additional info:
As a troubleshooting measure, manually adding an IPv4 address did allow the CoreOS image on the bootstrap node to be reached via curl.
Further, the kernel boot line for the first master node was updated with a static IP address assignment; that master node imaged successfully this way, further confirming that the issue is the provisioning interface not receiving an IPv4 address from the DHCP server.
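A hedged sketch of that troubleshooting step; the interface name, addresses, and image URL are all hypothetical:
```
# On the stuck master, assign an IPv4 address manually on the provisioning NIC...
ip addr add 172.22.0.99/24 dev enp1s0
# ...then confirm the bootstrap's image server is reachable
curl -o /dev/null http://172.22.0.2:6180/images/rhcos-live-rootfs.img
```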
Description of problem:
Install a single-node cluster on AWS, then enable TechPreview; this breaks the cluster. The CMA and the CAPI CMA shouldn't be on the same port.
Version-Release number of selected component (if applicable):
4.11.9
How reproducible:
always
Steps to Reproduce:
1. Launch a 4.11.9 single-node cluster on AWS:
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.9    True        False         34m     Cluster version is 4.11.9
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.11.9    True        False         False      31m
baremetal                                  4.11.9    True        False         False      49m
cloud-controller-manager                   4.11.9    True        False         False      52m
cloud-credential                           4.11.9    True        False         False      53m
cluster-autoscaler                         4.11.9    True        False         False      48m
config-operator                            4.11.9    True        False         False      50m
console                                    4.11.9    True        False         False      37m
csi-snapshot-controller                    4.11.9    True        False         False      49m
dns                                        4.11.9    True        False         False      48m
etcd                                       4.11.9    True        False         False      47m
image-registry                             4.11.9    True        False         False      43m
ingress                                    4.11.9    True        False         False      86s
insights                                   4.11.9    True        False         False      43m
kube-apiserver                             4.11.9    True        False         False      43m
kube-controller-manager                    4.11.9    True        False         False      47m
kube-scheduler                             4.11.9    True        False         False      44m
kube-storage-version-migrator              4.11.9    True        False         False      50m
machine-api                                4.11.9    True        False         False      44m
machine-approver                           4.11.9    True        False         False      49m
machine-config                             4.11.9    True        False         False      49m
marketplace                                4.11.9    True        False         False      48m
monitoring                                 4.11.9    True        False         False      56s
network                                    4.11.9    True        False         False      52m
node-tuning                                4.11.9    True        False         False      49m
openshift-apiserver                        4.11.9    True        False         False      72s
openshift-controller-manager               4.11.9    True        False         False      39m
openshift-samples                          4.11.9    True        False         False      43m
operator-lifecycle-manager                 4.11.9    True        False         False      49m
operator-lifecycle-manager-catalog         4.11.9    True        False         False      49m
operator-lifecycle-manager-packageserver   4.11.9    True        False         False      104s
service-ca                                 4.11.9    True        False         False      50m
storage                                    4.11.9    True        False         False      49m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-137-222.us-east-2.compute.internal   Ready    master,worker   53m   v1.24.0+dc5a2fd
2. Enable TechPreview:
spec:
  featureSet: TechPreviewNoUpgrade
liuhuali@Lius-MacBook-Pro huali-test % oc edit featuregate
featuregate.config.openshift.io/cluster edited
3. Check the cluster:
liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cloud-controller-manager
NAME                                            READY   STATUS    RESTARTS       AGE
aws-cloud-controller-manager-5888c85fc6-28tgt   1/1     Running   12 (10m ago)   55m
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.9    True        False         111m    Error while reconciling 4.11.9: the workload openshift-cluster-machine-approver/machine-approver-capi has not yet successfully rolled out
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.9    False       False         False      9m44s   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.huliu-aws411arn2.qe.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
baremetal                                  4.11.9    True        False         False      128m
cloud-controller-manager                   4.11.9    True        False         False      131m
cloud-credential                           4.11.9    True        False         False      133m
cluster-api                                4.11.9    True        False         False      41m
cluster-autoscaler                         4.11.9    True        False         False      128m
config-operator                            4.11.9    True        False         False      129m
console                                    4.11.9    False       True          False      10m     DeploymentAvailable: 0 replicas available for console deployment...
csi-snapshot-controller                    4.11.9    True        False         False      4m52s
dns                                        4.11.9    True        False         False      128m
etcd                                       4.11.9    True        False         False      127m
image-registry                             4.11.9    True        False         False      123m
ingress                                    4.11.9    True        False         False      3m15s
insights                                   4.11.9    True        False         False      122m
kube-apiserver                             4.11.9    True        False         False      123m
kube-controller-manager                    4.11.9    True        False         False      126m
kube-scheduler                             4.11.9    True        False         False      124m
kube-storage-version-migrator              4.11.9    True        False         False      129m
machine-api                                4.11.9    True        False         False      124m
machine-approver                           4.11.9    True        False         False      128m
machine-config                             4.11.9    True        False         False      129m
marketplace                                4.11.9    True        False         False      128m
monitoring                                 4.11.9    True        False         False      5m1s
network                                    4.11.9    True        False         False      131m
node-tuning                                4.11.9    True        False         False      128m
openshift-apiserver                        4.11.9    True        False         False      23s
openshift-controller-manager               4.11.9    True        False         False      118m
openshift-samples                          4.11.9    True        False         False      122m
operator-lifecycle-manager                 4.11.9    True        False         False      128m
operator-lifecycle-manager-catalog         4.11.9    True        False         False      128m
operator-lifecycle-manager-packageserver   4.11.9    True        False         False      2m43s
service-ca                                 4.11.9    True        False         False      129m
storage                                    4.11.9    True        False         False      69m
liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
The cluster is broken; CMA is complaining with message: '0/1 nodes are available: 1 node(s) didn''t have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 node(s) didn''t have free ports for the requested pod ports.'
Expected results:
Cluster should be healthy
Additional info:
Talked with dev here: https://coreos.slack.com/archives/GE2HQ9QP4/p1666178083034159?thread_ts=1666176493.224399&cid=GE2HQ9QP4
Must-gather: https://drive.google.com/file/d/1Q7Ddnhbg3Cq4ptBA2ycJnGKK01As1JcF/view?usp=sharing
If TechPreview is enabled during installation on a single-node cluster, the cluster installation fails.
This is a clone of issue OCPBUGS-5458. The following is the description of the original issue:
—
reported in https://coreos.slack.com/archives/C027U68LP/p1673010878672479
Description of problem:
Hey guys, I have an OpenShift cluster that was upgraded from version 4.8 to version 4.9.58. After the upgrade was done, the etcd pod on master1 isn't coming up and is crashlooping, giving the following error: {"level":"fatal","ts":"2023-01-06T12:12:58.709Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded, recBytes: 13279, fileSize(313430016) - offset(313418480) - padBytes(1) = entryLimit(11535)","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/remote-source/cachito-gomod-with-deps/app/server/etcdmain/main.go:40\nmain.main\n\t/remote-source/cachito-gomod-with-deps/app/server/main.go:32\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225"}
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-10427. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-9969. The following is the description of the original issue:
—
Description of problem:
An OCP cluster born on 4.1 fails to scale up a node due to the older podman version 1.0.2 present in the 4.1 bootimage. This was observed while testing bug https://issues.redhat.com/browse/OCPBUGS-7559?focusedCommentId=21889975&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21889975
Journal log:
-- Unit machine-config-daemon-update-rpmostree-via-container.service has finished starting up.
-- The start-up result is RESULT.
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: flag provided but not defined: -authfile
Mar 10 10:41:29 ip-10-0-218-217 podman[18103]: See 'podman run --help'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Main process exited, code=exited, status=125/n/a
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Failed with result 'exit-code'.
Mar 10 10:41:29 ip-10-0-218-217 systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Consumed 24ms CPU time
Version-Release number of selected component (if applicable):
OCP 4.12 and later
Steps to Reproduce:
1. Upgrade a 4.1-based cluster to 4.12 or a later version
2. Try to scale up a node
3. The node will fail to join
Installation fails on AWS because the installer manifests include an invalid ingresses.config.openshift.io/cluster manifest.
4.12.
Seems to be a consistent failure.
1. Install a cluster on AWS without specifying lbType in the install-config.
The cluster bootstrap fails with the following error message:
"cluster-ingress-02-config.yml": failed to create ingresses.v1.config.openshift.io/cluster -n : Ingress.config.openshift.io "cluster" is invalid: spec.loadBalancer.platform.aws.type: Required value
Cluster bootstrap should succeed.
https://github.com/openshift/installer/pull/6478 introduced the problematic logic that sets spec.loadBalancer.platform.aws without setting spec.loadBalancer.platform.aws.type.
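Until the installer defect is fixed, a hedged workaround sketch is to set the load balancer type explicitly in install-config.yaml (the region value is illustrative):
```
platform:
  aws:
    region: us-east-1   # illustrative
    lbType: Classic     # or NLB; an explicit value avoids the empty spec.loadBalancer.platform.aws.type
```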
Description of problem:
This is a bug record to pin down dependency versions in CMO release 4.12 after the release-4.12 branch was detached from the master branch.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
N/A
Steps to Reproduce:
N/A
Actual results:
N/A
Expected results:
N/A
Additional info:
None.
Not all of the errors reported by the assisted API (and shown in the wait-for bootstrap complete output) actually require user action.
Some appear when the agents first register but resolve themselves relatively quickly in the natural course of events.
Some, like the availability of NTP, don't block the installation from proceeding at all.
We need to think about the best ways of exposing this information to the user.
Description of problem:
Custom manifest files can be placed in the /openshift folder so that they will be applied during cluster installation. However, if a file contains more than one manifest, all but the first are ignored.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create the following custom manifest file in the /openshift folder:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-test
  namespace: openshift-config
data:
  value: agent-test
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-test-2
  namespace: openshift-config
data:
  value: agent-test-2
```
2. Create the agent ISO image and deploy a cluster
Actual results:
ConfigMap agent-test-2 does not exist in the openshift-config namespace
Expected results:
ConfigMap agent-test-2 must exist in the openshift-config namespace
Additional info:
This is a clone of issue OCPBUGS-2841. The following is the description of the original issue:
—
Currently the agent installer supports only the x86_64 arch. The image creation command must fail if any arch other than x86_64 is configured.
We want to have an allowed list of architectures.
allowed = ['x86_64', 'amd64']
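A minimal sketch of the intended guard against that allowed list, assuming a hypothetical shell validation step (the variable name is made up):
```
ARCH="${AGENT_ARCH:-x86_64}"   # hypothetical variable carrying the configured architecture
case "$ARCH" in
  x86_64|amd64) ;;   # allowed
  *) echo "agent image creation supports only x86_64/amd64, got: $ARCH" >&2; exit 1 ;;
esac
```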
Description of problem:
Image registry pods panic while deploying OCP in me-central-1 AWS region
Version-Release number of selected component (if applicable):
4.11.2
How reproducible:
Deploy OCP in AWS me-central-1 region
Steps to Reproduce:
Deploy OCP in AWS me-central-1 region
Actual results:
panic: Invalid region provided: me-central-1
Expected results:
Image registry pods should come up with no errors
Additional info:
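A hedged workaround sketch, assuming the cluster image registry config accepts an explicit regional endpoint (the endpoint value follows the standard S3 URL pattern and should be treated as an assumption):
```
oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge \
  -p '{"spec":{"storage":{"s3":{"region":"me-central-1","regionEndpoint":"https://s3.me-central-1.amazonaws.com"}}}}'
```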
This is a clone of issue OCPBUGS-3924. The following is the description of the original issue:
—
The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. We'll need the components setting manifests for these deprecated APIs to move to modern APIs. And then we should drop our ability to reconcile the deprecated APIs, to avoid having other components leak back in to using them.
Specifically cluster-monitoring-operator touches:
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Full output of the test at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27560/pull-ci-openshift-origin-master-e2e-gcp-ovn/1593697975584952320/artifacts/e2e-gcp-ovn/openshift-e2e-test/build-log.txt:
[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
[AfterEach] [sig-arch][Late]
github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
github.com/openshift/origin/test/extended/util/client.go:159
flake: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Ginkgo exit error 4: exit with code 4
This is required to unblock https://github.com/openshift/origin/pull/27561
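For triage, one way to see which clients are still hitting the deprecated APIs is the APIRequestCount resource (object names follow the <resource>.<version>.<group> pattern):
```
# Per-user/per-verb counts for one deprecated API
oc get apirequestcounts horizontalpodautoscalers.v2beta2.autoscaling -o yaml
# Quick overview of the three APIs named above
oc get apirequestcounts | grep -E 'flowschemas.v1beta1|prioritylevelconfigurations.v1beta1|horizontalpodautoscalers.v2beta2'
```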
This is a clone of issue OCPBUGS-3253. The following is the description of the original issue:
—
It is very easy to accidentally use the traditional openshift-install wait-for <x>-complete commands instead of the equivalent openshift-install agent wait-for <x>-complete command. This will work in some stages of the install, but show much less information or fail altogether in other stages of the install.
If we can detect from the asset store that this was an agent-based install, we should issue a warning if the user uses the old command.
This is a clone of issue OCPBUGS-4955. The following is the description of the original issue:
—
Description of problem:
Customer needs "IfNotPresent" ImagePullPolicy set for bundle unpacker images which reference iamges by digest. Currently, policy is set to "Always" no matter what.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install an operator via a bundle referencing an image by digest
2. Check the bundle unpacker pod
Actual results:
Image pull policy will be set to "Always"
Expected results:
Image pull policy will be set to "IfNotPresent" when pulling via digest
Additional info:
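For illustration, the expected shape of the unpacker container after the fix; the container name and image digest are hypothetical:
```
containers:
- name: extract                                      # illustrative
  image: quay.io/example/bundle@sha256:0123abcd...   # hypothetical digest-pinned bundle image
  imagePullPolicy: IfNotPresent                      # digests are immutable, so Always buys nothing
```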
Description of problem:
When opening the Devfile sample developer catalog, switching the project in another browser tab, and then opening a Devfile sample link in a new tab, the current project context is lost.
Version-Release number of selected component (if applicable):
4.12, expecting that this happen also in older versions
How reproducible:
Always
Steps to Reproduce:
1. Switch to the developer perspective, navigate to Add > Samples
2. Open a new browser tab and create a new project
3. Ctrl+click a sample in the first tab.
Actual results:
The project has also changed in the "Import sample" page
Expected results:
The project should be used also for the new "Import sample" page
Additional info:
We had this issue earlier for other catalog entries. Other samples already work fine; only the Devfile sample links don't contain the current namespace.
This is a clone of issue OCPBUGS-3314. The following is the description of the original issue:
—
Description of problem:
triggers[].gitlab.secretReference[1] disappears when a BuildConfig is edited in ‘Form view’
Version-Release number of selected component (if applicable):
4.10.32
How reproducible:
Always
Steps to Reproduce:
1. Configure triggers[].gitlab.secretReference[1] as below:
~~~
spec:
  ..
  triggers:
    - type: ConfigChange
    - type: GitLab
      gitlab:
        secretReference:
          name: m24s40-githook
~~~
2. Open ‘Edit BuildConfig’ for the BuildConfig in ‘Form view’: Buildconfigs -> Actions -> Edit Buildconfig
3. Click ‘YAML view’ on top.
Actual results:
The 'secretReference' configured earlier has disappeared. You can click [Reload] button which will bring the configuration back.
Expected results:
'secretReference' configured in buildconfigs do not disappear.
Additional info:
[1]https://docs.openshift.com/container-platform/4.10/rest_api/workloads_apis/buildconfig-build-openshift-io-v1.html#spec-triggers-gitlab-secretreference
The DVO metrics gatherer in the Insights operator relies on the "deployment-validation-operator" namespace name, but this is not very good, because the DVO can be installed in other namespaces (e.g. it's installed in the "openshift-operators" namespace when installing through OperatorHub).
This is a clone of issue OCPBUGS-4973. The following is the description of the original issue:
—
Description of problem:
Configuring OAuth with htpasswd in the hosted cluster doesn't work as expected.
Version-Release number of selected component (if applicable):
How reproducible:
enable OAuth htpasswd in hostedcluster
Steps to Reproduce:
1. Create a passwd file for the user via htpasswd:
```
htpasswd -cbB .passwd helitest helitest
oc create secret generic testuser --from-file=htpasswd=.passwd -n clusters
```
2. Edit hostedcluster.yaml:
```
spec:
  configuration:
    oauth:
      identityProviders:
      - htpasswd:
          fileData:
            name: testuser
        mappingMethod: claim
        name: htpasswd
        type: HTPasswd
```
3. oc login to the hosted cluster apiserver:
$ oc login https://ac0be21b169ff4399b6a2044388c38cf-5789e1b174d7424b.elb.us-east-2.amazonaws.com:6443 --username=testuser --password=testuser
The server uses a certificate signed by an unknown authority.
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
Use insecure connections? (y/n): y
Login failed (401 Unauthorized)
Actual results:
oc login with error : "Login failed (401 Unauthorized) "
Expected results:
oc login successfully.
Additional info:
# check configmap of oauth
$ oc get cm -n clusters-demo-02 oauth-openshift -oyaml
...
oauthConfig:
  alwaysShowProviderSelection: false
  assetPublicURL: ""
  grantConfig:
    method: deny
    serviceAccountMethod: prompt
  identityProviders: []
  loginURL: https://ac0be21b169ff4399b6a2044388c38cf-5789e1b174d7424b.elb.us-east-2.amazonaws.com:6443
---> it seems `identityProviders` is not synced correctly?
Description of problem:
We got feedback from the support team that it is confusing to see a switch in the Notifications column for an alerting rule that has no alerts associated with it, as the user cannot silence the alerting rule.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. oc apply -f https://gist.githubusercontent.com/vikram-raj/727629797eb9d9bfcfa2721cae2ade86/raw/7c2305e14115a1a4f4f88ebb74cdad32cbec4132/Alerting%2520rule%2520without%2520alert
2. Navigate to the Developer perspective, Observe -> Alerts
3. Try to silence the VersionAlert alerting rule; nothing will happen
Actual results:
Silencing the alerting rule using the switch does nothing.
Expected results:
No silence switch should be visible if no alerts are associated with the alerting rule.
Additional info:
Description of problem:
While running scale tests with ACM provisioning 1200+ SNOs via ZTP, converged flow was enabled. With converged flow, the rate at which clusters begin installing is much slower than what was witnessed without converged flow. Example: without converged flow, 1250/1269 SNOs completed install in 3h11m; with converged flow, 487/1250 SNOs completed install in 10 hours. The test actually hit timeouts, so we don't know exactly how long it took, but you can see we only managed to provision 487 SNOs in 10 hours. The concurrency measurement scripts show that converged flow ran at a concurrency of 68 SNOs installing at a time vs non-converged flow peaking at 507. Something within the converged flow is bottlenecking the SNO installs.
Version-Release number of selected component (if applicable):
Hub/SNO OCP 4.11.8 ACM 2.6.1-DOWNSTREAM-2022-09-08-02-53-38
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
converged flow to match previous provisioning speeds/rates
Additional info:
Must gather will be provided.
Description of problem:
Customer is no longer able to provision new baremetal nodes in 4.10.35 using the same rootDeviceHints used in 4.10.10. Customer uses HP DL360 Gen10 with external SAN storage that is seen by the system as a multipath device. Latest IPA versions implement changes to avoid wiping shared disks, and this seems to affect what we should provide as rootDeviceHints. They used to put /dev/sda as rootDeviceHints; in 4.10.35 this no longer makes IPA write the image to the disk, because it sees the disk as part of a multipath device. We tried using the multipath device on top, /dev/dm-0: the system is then able to write the image to the disk, but it gets stuck when it tries to issue a partprobe command. Rebooting the systems to boot from the disk does not seem to help complete the provisioning. No workaround so far.
Version-Release number of selected component (if applicable):
How reproducible:
by trying to provisioning a baremetal node with a multipath device.
Steps to Reproduce:
1. Create a new BMH using a multipath device as rootDeviceHints 2. 3.
Actual results:
The node does not get provisioned
Expected results:
the node gets provisioned correctly
Additional info:
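For reference, a minimal sketch of the rootDeviceHints in question (host name illustrative):
```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-0           # illustrative
spec:
  rootDeviceHints:
    deviceName: /dev/sda   # worked in 4.10.10; in 4.10.35 IPA now skips it as a multipath member
```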
Description of problem:
The error message of "opm alpha render-veneer semver" is not correct, "semver &{%!q(*os.file=&{{{0 0 0} 3 {0} 0 1 true true true}" is meaningless, should not be printed.
Version-Release number of selected component (if applicable):
zhaoxia@xzha-mac operator-framework-olm % opm version
Version: version.Version{OpmVersion:"2149aebcc", GitCommit:"2149aebcc71367e6fba8f5416374917dae1e6a1c", BuildDate:"2022-09-08T04:31:47Z", GoOs:"darwin", GoArch:"amd64"}
How reproducible:
always
Steps to Reproduce:
1. Create the file:
zhaoxia@xzha-mac OCP-53915 % cat catalog-semver-veneer-1.yaml
Schema: olm.semver
Candidate:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-alpha
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-beta
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-alpha20220829
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-alpha20220830
Stable:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-beta
Fast:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.1-beta
2. Run "opm alpha render-veneer semver":
zhaoxia@xzha-mac operator-framework-olm % opm alpha render-veneer semver catalog-semver-veneer-1.yaml
2022/09/08 12:35:05 semver &{%!q(*os.file=&{{{0 0 0} 3 {0} <nil> 0 1 true true true} catalog-semver-veneer-1.yaml <nil> false false false})}: semver-render: unable to post-process bundle info: encountered bundle versions which differ only by build metadata, which cannot be ordered: [bundle version "1.0.1-alpha" cannot be compared to "1.0.1-alpha", bundle version "1.0.1-alpha+20220829" cannot be compared to "1.0.1-alpha"]
Actual results:
"semver &{%!q(*os.file=&{{{0 0 0} 3 {0} 0 1 true true true}" is meaningless, should not be printed.
Expected results:
no error message "semver &{%!q(*os.file=&{{{0 0 0} 3 {0} 0 1 true true true}"
Additional info:
Description of problem:
The name of the workload gets changed when the project and image stream are changed and the form is reloaded on the workload's Edit Deployment page.
Version-Release number of selected component (if applicable):
4.9 and above
How reproducible:
Always
Steps to Reproduce:
1. Create a deployment workload
2. Select the Edit Deployment option on the workload
3. Verify that initially the name is the same as the workload name and the field is not changeable.
4. Change the project to "openshift", the image stream to "golang" (or anything) and the tag to "latest"
5. Reload the form
6. Now check that the name also got changed to golang.
Actual results:
The workload name changes when the project and image stream name are changed on the Edit Deployment page.
Expected results:
The workload name shouldn't change when the image stream name is changed on the Edit Deployment page, as the name field is not changeable.
Additional info:
While performing automation, I can see the error "the name of the object (imageStreamName) does not match the name on the URL (workloadName)", but while performing this on the UI, there are no errors.
This is a clone of issue OCPBUGS-7102. The following is the description of the original issue:
—
Description of problem:
https://github.com/openshift/operator-framework-olm/blob/7ec6b948a148171bd336750fed98818890136429/staging/operator-lifecycle-manager/pkg/controller/operators/olm/plugins/downstream_csv_namespace_labeler_plugin_test.go#L309
has a dependency on creation of a next-version release branch.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. Clone operator-framework/operator-framework-olm
2. make unit/olm
3. Deal with a really bumpy first-time kubebuilder/envtest install experience
4. Profit
Actual results:
error
Expected results:
pass
Additional info:
Description of problem: The product name for Azure Red Hat OpenShift was incorrect in Customer Case Management (CCM). As a result, the console included this incorrect product name so that the support case link would route correctly. https://issues.redhat.com/browse/CPCCM-9926 fixed the incorrect product name, so the support case link for Azure now needs to be updated to reflect the correct product name.
Description of problem:
The egressip healthcheck through gRPC on a dualstack cluster only uses the v6 address when it is trying to re-connect to the egressIP node.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-04-081353
How reproducible:
Steps to Reproduce:
1. On a dualstack OVN cluster, label one node to be egressip assignable
2. Check the leader ovnkube-master pod's log for egressip health check messages
3. Set iptables to drop tcp port 9107 on the egress node, then check the leader ovnkube-master pod's log again:
$ oc -n openshift-ovn-kubernetes logs ovnkube-master-s8gl4 -c ovnkube-master | grep health
I1004 17:10:13.752545 1 egressip_healthcheck.go:168] Connected to master-01.jechen-1004d.qe.devcluster.openshift.com (10.129.0.2:9107)
I1004 17:10:13.754308 1 egressip_healthcheck.go:168] Connected to master-00.jechen-1004d.qe.devcluster.openshift.com (10.128.0.2:9107)
I1004 17:10:13.757856 1 egressip_healthcheck.go:168] Connected to worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 17:10:13.760742 1 egressip_healthcheck.go:168] Connected to worker-02.jechen-1004d.qe.devcluster.openshift.com (10.131.0.2:9107)
I1004 17:10:13.763491 1 egressip_healthcheck.go:168] Connected to master-02.jechen-1004d.qe.devcluster.openshift.com (10.130.0.2:9107)
I1004 17:10:13.766653 1 egressip_healthcheck.go:168] Connected to worker-01.jechen-1004d.qe.devcluster.openshift.com (10.128.2.2:9107)
I1004 17:10:18.749573 1 egressip_healthcheck.go:177] Closing connection with worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 17:10:18.749624 1 egressip_healthcheck.go:177] Closing connection with worker-01.jechen-1004d.qe.devcluster.openshift.com (10.128.2.2:9107)
I1004 17:10:18.749635 1 egressip_healthcheck.go:177] Closing connection with master-01.jechen-1004d.qe.devcluster.openshift.com (10.129.0.2:9107)
I1004 17:10:18.749645 1 egressip_healthcheck.go:177] Closing connection with master-00.jechen-1004d.qe.devcluster.openshift.com (10.128.0.2:9107)
I1004 17:10:18.749654 1 egressip_healthcheck.go:177] Closing connection with worker-02.jechen-1004d.qe.devcluster.openshift.com (10.131.0.2:9107)
I1004 17:10:18.749663 1 egressip_healthcheck.go:177] Closing connection with master-02.jechen-1004d.qe.devcluster.openshift.com (10.130.0.2:9107)
I1004 18:21:13.753154 1 egressip_healthcheck.go:168] Connected to worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 18:21:19.749592 1 egressip_healthcheck.go:177] Closing connection with worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
W1004 18:21:24.750727 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:29.750396 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:34.749900 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:39.750830 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:44.750599 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:49.750640 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:54.749998 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:21:59.750512 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:04.749911 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:09.750500 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:14.750400 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:19.750448 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:24.749497 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:22:29.750366 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
I1004 18:24:03.020413 1 egressip_healthcheck.go:168] Connected to worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 18:24:09.750273 1 egressip_healthcheck.go:177] Closing connection with worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
W1004 18:24:14.749580 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:19.750138 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:24.750291 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:29.750526 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:34.750725 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:39.750496 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:44.750182 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:49.750172 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:54.749791 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:24:59.749548 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:25:04.750806 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:25:09.750666 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:25:14.750602 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:25:19.750717 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
I1004 18:28:58.561054 1 egressip_healthcheck.go:168] Connected to worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
I1004 18:29:04.749940 1 egressip_healthcheck.go:177] Closing connection with worker-00.jechen-1004d.qe.devcluster.openshift.com (10.129.2.2:9107)
W1004 18:29:09.749710 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
W1004 18:29:14.749689 1 egressip_healthcheck.go:164] Could not connect to worker-00.jechen-1004d.qe.devcluster.openshift.com ([fd01:0:0:6::2]:9107): context deadline exceeded
Actual results:
It uses only the v6 mgmtIP address to try to reconnect.
Expected results:
It should use both the v4 and v6 addresses to try to reconnect.
Additional info:
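For completeness, a hedged sketch of the port drop used in step 3 of the repro (run on the egress node; the rules are illustrative):
```
# Drop the gRPC health-check port to force the master to reconnect
iptables  -A INPUT -p tcp --dport 9107 -j DROP
ip6tables -A INPUT -p tcp --dport 9107 -j DROP
```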
Description of problem:
If using ingresscontroller.spec.routeSelector.matchExpressions or ingresscontroller.spec.namespaceSelector.matchExpressions, the route will not count in the new route_metrics_controller_routes_per_shard prometheus metric. This is due to the logic only using "matchLabels". The logic needs to be updated to also use "matchExpressions".
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Create an IngressController with matchExpressions:
oc apply -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded
  namespace: openshift-ingress-operator
spec:
  domain: reproducer.$domain
  routeSelector:
    matchExpressions:
    - key: type
      operator: In
      values:
      - shard
  replicas: 1
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
EOF
2. Create the route:
oc apply -f - <<EOF
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: route-shard
  labels:
    type: shard
spec:
  to:
    kind: Service
    name: router-shard
EOF
3. Check route_metrics_controller_routes_per_shard{name="sharded"} in Prometheus; it's 0
Actual results:
route_metrics_controller_routes_per_shard{name="sharded"} has 0 routes
Expected results:
route_metrics_controller_routes_per_shard{name="sharded"} should have 1 route
Additional info:
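A hedged workaround sketch until matchExpressions is counted: express the shard selector with matchLabels, which the metric logic does handle (JSON merge patch; the null drops the old field):
```
oc -n openshift-ingress-operator patch ingresscontroller/sharded --type=merge \
  -p '{"spec":{"routeSelector":{"matchLabels":{"type":"shard"},"matchExpressions":null}}}'
```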
This is a clone of issue OCPBUGS-5428. The following is the description of the original issue:
—
Looking at the extension points documented at
and in our docs
https://docs.openshift.com/container-platform/4.11/web_console/dynamic-plug-ins.html
I see many are missing summaries. We should make sure every extension point has at least some documentation.
Description of problem:
With "createFirewallRules: Enabled", after successful "create cluster" and then "destroy cluster", the created firewall-rules in the shared VPC are not deleted.
Version-Release number of selected component (if applicable):
$ ./openshift-install version
./openshift-install 4.12.0-0.nightly-2022-09-28-204419
built from commit 9eb0224926982cdd6cae53b872326292133e532d
release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. Try IPI installation with "createFirewallRules: Enabled", which succeeded
2. Try destroying the cluster, which succeeded
3. Check the firewall-rules in the shared VPC
Actual results:
After destroying the cluster, its firewall-rules created by installer in the shared VPC are not deleted.
Expected results:
Those firewall-rules should be deleted during destroying the cluster.
Additional info:
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules list --filter='network=installer-shared-vpc'
NAME                                NETWORK               DIRECTION  PRIORITY  ALLOW                                                                                                                                                DENY  DISABLED
ci-op-xpn-ingress-common            installer-shared-vpc  INGRESS    60000     tcp:6443,tcp:22,tcp:80,tcp:443,icmp                                                                                                                        False
ci-op-xpn-ingress-health-checks     installer-shared-vpc  INGRESS    60000     tcp:30000-32767,udp:30000-32767,tcp:6080,tcp:6443,tcp:22624,tcp:32335                                                                                      False
ci-op-xpn-ingress-internal-network  installer-shared-vpc  INGRESS    60000     udp:4789,udp:6081,udp:500,udp:4500,esp,tcp:9000-9999,udp:9000-9999,tcp:10250,tcp:30000-32767,udp:30000-32767,tcp:10257,tcp:10259,tcp:22623,tcp:2379-2380   False
To show all fields of the firewall, please show in JSON format: --format=json
To show all fields in table format, please see the examples in --help.
$
$ yq-3.3.0 r test2/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
  computeSubnet: installer-shared-vpc-subnet-2
  controlPlaneSubnet: installer-shared-vpc-subnet-1
  createFirewallRules: Enabled
  network: installer-shared-vpc
  networkProjectID: openshift-qe-shared-vpc
$
$ yq-3.3.0 r test2/install-config.yaml metadata
creationTimestamp: null
name: jiwei-1013-01
$
$ openshift-install create cluster --dir test2
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Consuming Install Config from target directory
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 4:06AM) for the Kubernetes API at https://api.jiwei-1013-01.qe.gcp.devcluster.openshift.com:6443...
INFO API v1.24.0+8c7c967 up
INFO Waiting up to 30m0s (until 4:20AM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 4:42AM) for the cluster at https://api.jiwei-1013-01.qe.gcp.devcluster.openshift.com:6443 to initialize...
INFO Checking to see if there is a route at openshift-console/console...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/fedora/test2/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.jiwei-1013-01.qe.gcp.devcluster.openshift.com
INFO Login to the console with user: "kubeadmin", and password: "wWPkc-8G2Lw-xe2Vw-DgWha"
INFO Time elapsed: 39m14s
$
$ openshift-install destroy cluster --dir test2
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Stopped instance jiwei-1013-01-464st-worker-b-pmg5z
INFO Stopped instance jiwei-1013-01-464st-worker-a-csg2j
INFO Stopped instance jiwei-1013-01-464st-master-1
INFO Stopped instance jiwei-1013-01-464st-master-2
INFO Stopped instance jiwei-1013-01-464st-master-0
INFO Deleted 2 recordset(s) in zone qe
INFO Deleted 3 recordset(s) in zone jiwei-1013-01-464st-private-zone
INFO Deleted DNS zone jiwei-1013-01-464st-private-zone
INFO Deleted bucket jiwei-1013-01-464st-image-registry-us-central1-ulgxgjfqxbdnrhd
INFO Deleted instance jiwei-1013-01-464st-master-0
INFO Deleted instance jiwei-1013-01-464st-worker-a-csg2j
INFO Deleted instance jiwei-1013-01-464st-master-1
INFO Deleted instance jiwei-1013-01-464st-worker-b-pmg5z
INFO Deleted instance jiwei-1013-01-464st-master-2
INFO Deleted disk jiwei-1013-01-464st-master-2
INFO Deleted disk jiwei-1013-01-464st-master-1
INFO Deleted disk jiwei-1013-01-464st-worker-b-pmg5z
INFO Deleted disk jiwei-1013-01-464st-master-0
INFO Deleted disk jiwei-1013-01-464st-worker-a-csg2j
INFO Deleted address jiwei-1013-01-464st-cluster-public-ip
INFO Deleted address jiwei-1013-01-464st-cluster-ip
INFO Deleted forwarding rule a516d89f9a4f14bdfb55a525b1a12a91
INFO Deleted forwarding rule jiwei-1013-01-464st-api
INFO Deleted forwarding rule jiwei-1013-01-464st-api-internal
INFO Deleted target pool a516d89f9a4f14bdfb55a525b1a12a91
INFO Deleted target pool jiwei-1013-01-464st-api
INFO Deleted backend service jiwei-1013-01-464st-api-internal
INFO Deleted instance group jiwei-1013-01-464st-master-us-central1-a
INFO Deleted instance group jiwei-1013-01-464st-master-us-central1-c
INFO Deleted instance group jiwei-1013-01-464st-master-us-central1-b
INFO Deleted health check jiwei-1013-01-464st-api-internal
INFO Deleted HTTP health check a516d89f9a4f14bdfb55a525b1a12a91
INFO Deleted HTTP health check jiwei-1013-01-464st-api
INFO Time elapsed: 4m18s
$
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules list --filter='network=installer-shared-vpc'
NAME                                          NETWORK               DIRECTION  PRIORITY  ALLOW                                                                                                                                                DENY  DISABLED
ci-op-xpn-ingress-common                      installer-shared-vpc  INGRESS    60000     tcp:6443,tcp:22,tcp:80,tcp:443,icmp                                                                                                                        False
ci-op-xpn-ingress-health-checks               installer-shared-vpc  INGRESS    60000     tcp:30000-32767,udp:30000-32767,tcp:6080,tcp:6443,tcp:22624,tcp:32335                                                                                      False
ci-op-xpn-ingress-internal-network            installer-shared-vpc  INGRESS    60000     udp:4789,udp:6081,udp:500,udp:4500,esp,tcp:9000-9999,udp:9000-9999,tcp:10250,tcp:30000-32767,udp:30000-32767,tcp:10257,tcp:10259,tcp:22623,tcp:2379-2380   False
jiwei-1013-01-464st-api                       installer-shared-vpc  INGRESS    1000      tcp:6443                                                                                                                                                   False
jiwei-1013-01-464st-control-plane             installer-shared-vpc  INGRESS    1000      tcp:22623,tcp:10257,tcp:10259                                                                                                                              False
jiwei-1013-01-464st-etcd                      installer-shared-vpc  INGRESS    1000      tcp:2379-2380                                                                                                                                              False
jiwei-1013-01-464st-health-checks             installer-shared-vpc  INGRESS    1000      tcp:6080,tcp:6443,tcp:22624                                                                                                                                False
jiwei-1013-01-464st-internal-cluster          installer-shared-vpc  INGRESS    1000      tcp:30000-32767,udp:9000-9999,udp:30000-32767,udp:4789,udp:6081,tcp:9000-9999,udp:500,udp:4500,esp,tcp:10250                                               False
jiwei-1013-01-464st-internal-network          installer-shared-vpc  INGRESS    1000      icmp,tcp:22                                                                                                                                                False
k8s-a516d89f9a4f14bdfb55a525b1a12a91-http-hc  installer-shared-vpc  INGRESS    1000      tcp:30268                                                                                                                                                  False
k8s-fw-a516d89f9a4f14bdfb55a525b1a12a91       installer-shared-vpc  INGRESS    1000      tcp:80,tcp:443                                                                                                                                             False
To show all fields of the firewall, please show in JSON format: --format=json
To show all fields in table format, please see the examples in --help.
$
FYI manually deleting those firewall-rules in the shared VPC does work:
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-api
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-api].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-control-plane
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-control-plane].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-etcd
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-etcd].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-health-checks
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-health-checks].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-internal-cluster
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-internal-cluster].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q jiwei-1013-01-464st-internal-network
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/jiwei-1013-01-464st-internal-network].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q k8s-a516d89f9a4f14bdfb55a525b1a12a91-http-hc
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/k8s-a516d89f9a4f14bdfb55a525b1a12a91-http-hc].
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules delete -q k8s-fw-a516d89f9a4f14bdfb55a525b1a12a91
Deleted [https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/firewalls/k8s-fw-a516d89f9a4f14bdfb55a525b1a12a91].
$
$ gcloud --project openshift-qe-shared-vpc compute firewall-rules list --filter='network=installer-shared-vpc'
NAME                                NETWORK               DIRECTION  PRIORITY  ALLOW                                                                                                                                                DENY  DISABLED
ci-op-xpn-ingress-common            installer-shared-vpc  INGRESS    60000     tcp:6443,tcp:22,tcp:80,tcp:443,icmp                                                                                                                        False
ci-op-xpn-ingress-health-checks     installer-shared-vpc  INGRESS    60000     tcp:30000-32767,udp:30000-32767,tcp:6080,tcp:6443,tcp:22624,tcp:32335                                                                                      False
ci-op-xpn-ingress-internal-network  installer-shared-vpc  INGRESS    60000     udp:4789,udp:6081,udp:500,udp:4500,esp,tcp:9000-9999,udp:9000-9999,tcp:10250,tcp:30000-32767,udp:30000-32767,tcp:10257,tcp:10259,tcp:22623,tcp:2379-2380   False
To show all fields of the firewall, please show in JSON format: --format=json
To show all fields in table format, please see the examples in --help.
$
ovnkube-trace: ofproto/trace fails for IPv6
[akaris@linux go-controller (fix-ovnkube-trace-ipv6)]$ oc exec -ti ovn-trace-two -n ovn-tests-two -- ovnkube-trace -src-namespace ovn-tests-two -src ovn-trace-two -dst-ip 2404:6800:4003:c06::69 -tcp
I1021 12:16:56.478752 3356 ovs.go:90] Maximum command line arguments set to: 191102
ovn-trace from pod to IP indicates success from ovn-trace-two to 2404:6800:4003:c06::69
F1021 12:16:57.075803 3356 ovnkube-trace.go:601] ovs-appctl ofproto/trace pod to IP error command terminated with exit code 2
stdOut:
stdErr: Bad openflow flow syntax: in_port=73af56a18042ab9, tcp, dl_src=0a:58:17:2b:b6:42, dl_dst=0a:58:69:bd:ba:d8, nw_src=fd01:0:0:5::13, nw_dst=2404:6800:4003:c06::69, nw_ttl=64, tcp_dst=80, tcp_src=12345: bad value for nw_src (fd01:0:0:5::13: invalid IP address)
ovs-appctl: ovs-vswitchd: server returned an error
command terminated with exit code 1
[akaris@linux go-controller (fix-ovnkube-trace-ipv6)]$ oc exec -ti ovn-trace-two -n ovn-tests-two -- ovnkube-trace -src-namespace ovn-tests-two -src ovn-trace-two -dst-namespace ovn-tests -dst ovn-trace -udp
I1021 12:17:26.695325 3386 ovs.go:90] Maximum command line arguments set to: 191102
ovn-trace source pod to destination pod indicates success from ovn-trace-two to ovn-trace
ovn-trace destination pod to source pod indicates success from ovn-trace to ovn-trace-two
F1021 12:17:27.708822 3386 ovnkube-trace.go:601] ovs-appctl ofproto/trace source pod to destination pod error command terminated with exit code 2
stdOut:
stdErr: Bad openflow flow syntax: in_port=73af56a18042ab9, udp, dl_src=0a:58:17:2b:b6:42, dl_dst=0a:58:69:bd:ba:d8, nw_src=fd01:0:0:5::13, nw_dst=fd01:0:0:5::14, nw_ttl=64, udp_dst=80, udp_src=12345: bad value for nw_src (fd01:0:0:5::13: invalid IP address)
ovs-appctl: ovs-vswitchd: server returned an error
command terminated with exit code 1
Description of problem:
The cluster-dns-operator does not reconcile the openshift-dns namespace, which has been exposed as an issue in 4.12 due to the requirement for the namespace to have pod-security labels. If a cluster has been incrementally updated from a version less than or equal to 4.9, the openshift-dns namespace will most likely not contain the required pod-security labels since the namespace was statically created when the cluster was installed with old namespace configuration.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always if cluster originally installed with v4.9 or less
Steps to Reproduce:
1. Install v4.9
2. Upgrade to v4.12 (incrementally if required for upgrade path)
3. openshift-dns namespace will be missing pod-security labels
Actual results:
"oc get ns openshift-dns -o yaml" will show missing pod-security labels: apiVersion: v1 kind: Namespace metadata: annotations: openshift.io/node-selector: "" openshift.io/sa.scc.mcs: s0:c15,c0 openshift.io/sa.scc.supplemental-groups: 1000210000/10000 openshift.io/sa.scc.uid-range: 1000210000/10000 creationTimestamp: "2020-05-21T19:36:15Z" labels: kubernetes.io/metadata.name: openshift-dns olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a: "" openshift.io/cluster-monitoring: "true" openshift.io/run-level: "0" name: openshift-dns resourceVersion: "3127555382" uid: 0fb4571e-952f-4bea-bc45-461beec54369 spec: finalizers: - kubernetes
Expected results:
pod-security labels should exist:

labels:
  kubernetes.io/metadata.name: openshift-dns
  olm.operatorgroup.uid/3d42c0c1-01cd-4c55-bf88-864f041c7e7a: ""
  openshift.io/cluster-monitoring: "true"
  openshift.io/run-level: "0"
  pod-security.kubernetes.io/audit: privileged
  pod-security.kubernetes.io/enforce: privileged
  pod-security.kubernetes.io/warn: privileged
Additional info:
Issue found in CI during upgrade
https://coreos.slack.com/archives/C03G7REB4JV/p1663676443155839
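Until the operator reconciles the namespace, the missing labels can be applied by hand; a minimal mitigation sketch, assuming cluster-admin access (these are the same labels listed under expected results):

```
# Apply the pod-security labels the operator would set on a fresh install.
oc label namespace openshift-dns \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged --overwrite
```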
Description of problem:
When the user selects Serverless as an import strategy and tries to import a Devfile, the import fails because of an invalid Deployment.
This could already be reproduced in 4.11, but it's even more prominent in 4.12, where the console automatically selects the Serverless resource type when the Serverless operator is installed.
Version-Release number of selected component (if applicable):
Works on 4.10
Failed on 4.11 and 4.12 master
How reproducible:
Always
Steps to Reproduce:
1. Install and set up the Serverless operator
2. Switch to the dev perspective, navigate to Add > Import from Git
3. Enter a non-Devfile git URL like https://github.com/jerolimov/nodeinfo
4. On 4.11 select resource type Serverless (on 4.12 this should be selected automatically)
5. Update the git URL to a repo with a Devfile like https://github.com/nodeshift-starters/devfile-sample
6. Press create
Actual results:
Import fails with error:
Error "Invalid value: "": name part must be non-empty" for field "spec.template.labels".
Expected results:
Devfile should be imported
Additional info:
This is a clone of issue OCPBUGS-6213. The following is the description of the original issue:
—
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/3450
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When the user selects an installed operator (for example, OpenShift Elasticsearch) in OperatorHub and navigates to the installed operator page from the operator information page
via the "view it here" link, a "404: Not found" message is wrongly shown, although the navigation does reach the installed operator in the end.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-15-150248
How reproducible:
Always
Steps to Reproduce:
Actual results:
A wrong "404: Not found" message is shown while the user selects an installed operator and navigates from OperatorHub to the installed operator page.
The browser console log indicates the following:
main-chunk-525818b154a57a9b220a.min.js:1 unhandled error: Uncaught TypeError: Cannot read properties of undefined (reading 'firstElementChild') TypeError: Cannot read properties of undefined (reading 'firstElementChild')
    at c (https://console-openshift-console.apps.jmekkatt-dob.ibmcloud.qe.devcluster.openshift.com/static/vendors~main-chunk-40fab65853dff2fbc413.min.js:118:125992)
    at HTMLDivElement.l (https://console-openshift-console.apps.jmekkatt-dob.ibmcloud.qe.devcluster.openshift.com/static/vendors~main-chunk-40fab65853dff2fbc413.min.js:118:126387)
TypeError: Cannot read properties of undefined (reading 'firstElementChild')
    at c (vendors~main-chunk-40fab65853dff2fbc413.min.js:72303:1)
    at HTMLDivElement.l (vendors~main-chunk-40fab65853dff2fbc413.min.js:72303:1)
window.onerror @ main-chunk-525818b154a57a9b220a.min.js:1
vendors~main-chunk-40fab65853dff2fbc413.min.js:72303 Uncaught TypeError: Cannot read properties of undefined (reading 'firstElementChild')
    at c (vendors~main-chunk-40fab65853dff2fbc413.min.js:72303:1)
    at HTMLDivElement.l (vendors~main-chunk-40fab65853dff2fbc413.min.js:72303:1)
c @ vendors~main-chunk-40fab65853dff2fbc413.min.js:72303
l @ vendors~main-chunk-40fab65853dff2fbc413.min.js:72303
scroll (async)
componentWillUnmount @ vendor-patternfly-core-chunk-006bb1499791fa7cfea7.min.js:38397
hs @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
bs @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
hs @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
bs @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
Oc @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
t.unstable_runWithPriority @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171690
Hi @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
Ac @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
pc @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
(anonymous) @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
t.unstable_runWithPriority @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171690
Hi @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
Vi @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
qi @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
De @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
Yt @ vendors~main-chunk-40fab65853dff2fbc413.min.js:171377
main-chunk-525818b154a57a9b220a.min.js:1 GET https://console-openshift-console.apps.jmekkatt-dob.ibmcloud.qe.devcluster.openshift.com/api/kubernetes/apis/operators.coreos.com/v1alpha1/clusterserviceversions/elasticsearch-operator.5.5.0 404 (Not Found)
Expected results:
Installed operator details should show without any error when the user selects an installed operator and navigates from operator hub to installed operator page.
Additional info:
Reproduced in both chrome[103.0.5060.114 (Official Build) (64-bit)] and firefox[91.11.0esr (64-bit)] browsers
Attached screen share for the same issue InstalledOperatorNavigation404.mp4
Description of problem:
OCPBUGS-3499 and OCPBUGS-3501 both require a more recent version of openshift/library-go containing the shared validation and host-assignment logic.
This is a clone of issue OCPBUGS-2532. The following is the description of the original issue:
—
Description of problem:
Upgrades from OCP 4.11.9 to the latest OCP 4.12 nightly builds, including 4.12.0-ec.4, will fail. When the upgrade fails, there are typically two operators that never get upgraded (all others do upgrade to the targeted 4.12.x release):

dns              4.11.9   True   True    False   11h   DNS "default" reports Progressing=True: "Have 4 available DNS pods, want 5."...
machine-config   4.11.9   True   False   False   14h

The dns.operator details state it is waiting for 4/5 pods to become available:

# oc describe dns.operator/default
...
Status:
  Cluster Domain:  cluster.local
  Cluster IP:      172.30.0.10
  Conditions:
    Last Transition Time:  2022-10-18T03:21:44Z
    Message:               Enough DNS pods are available, and the DNS service has a cluster IP address.
    Reason:                AsExpected
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2022-10-18T03:21:44Z
    Message:               Have 4 available DNS pods, want 5.
    Reason:                Reconciling
    Status:                True
    Type:                  Progressing

The mcp reports everything is good:

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-87fd457ffdaf49d75e62b532c22a9f1d   True      False      False      3              3                   3                     0                      14h
worker   rendered-worker-7fc68009b1facf8724cd952cb08435ff   True      False      False      2              2                   2                     0                      14h

We have performed a large number of the same upgrades using the same configuration; while the upgrade sometimes succeeds, most runs fail. This seems to be a timing issue. As a current workaround, recycling the control-plane nodes lets the upgrade complete successfully. A must-gather log is attached for review.
Version-Release number of selected component (if applicable):
Tested upgrading to all the following releases: 4.12.0-ec.4 4.12.0-0.nightly-s390x-2022-10-10-005931 4.12.0-0.nightly-s390x-2022-10-15-144437
How reproducible:
Moderately to consistently reproducible
Steps to Reproduce:
1. Start with a working OCP 4.11.9 cluster.
2. Perform an upgrade to the latest OCP 4.12.x nightly build.
3. Monitor the upgrade status:
   # oc get clusterversion  -> will state % complete and waiting on dns - which never finishes.
   # oc get co  -> the dns and machine-config operators will remain at 4.11.9
4. Upgrade will never complete.
Actual results:
Upgrade will never complete.
Expected results:
Upgrade to the targeted release succeeds.
Additional info:
This upgrade issue occurs for both Connected and Disconnected Clusters.
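A diagnostic sketch for narrowing down the stuck DNS rollout (the dns-default daemonset name is the operator's default; treat it as an assumption):

```
# Compare desired vs. ready counts on the DNS daemonset...
oc -n openshift-dns get daemonset dns-default
# ...then see which node is missing its DNS pod.
oc -n openshift-dns get pods -o wide
oc get nodes
```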
Just like kube proxy, ovnk should expose port 10256 on every node, so that cloud LBs can send health checks and know which nodes are available. This is relevant for services with externalTrafficPolicy=Cluster.
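A quick way to verify the endpoint once it exists, probing it the way a cloud LB health check would (the node IP is a placeholder):

```
# Expect HTTP 200 from the node-level healthz endpoint on 10256.
NODE_IP=10.0.0.5   # illustrative
curl -s -o /dev/null -w '%{http_code}\n' "http://${NODE_IP}:10256/healthz"
```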
This is a clone of issue OCPBUGS-3621. The following is the description of the original issue:
—
Description of problem:
EUS-to-EUS upgrade (4.10.38 -> 4.11.13 -> 4.12.0-rc.0): after the control-plane nodes are upgraded to 4.12 successfully, the worker pool is unpaused to get the worker nodes updated. But the worker nodes fail to update and the worker pool is degraded:
```
# ./oc get node
NAME                                                   STATUS                     ROLES    AGE     VERSION
jliu410-6hmkz-master-0.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-master-1.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-master-2.c.openshift-qe.internal         Ready                      master   4h40m   v1.25.2+f33d98e
jliu410-6hmkz-worker-a-xdwvv.c.openshift-qe.internal   Ready,SchedulingDisabled   worker   4h31m   v1.23.12+6b34f32
jliu410-6hmkz-worker-b-9hnb8.c.openshift-qe.internal   Ready                      worker   4h31m   v1.23.12+6b34f32
jliu410-6hmkz-worker-c-bdv4f.c.openshift-qe.internal   Ready                      worker   4h31m   v1.23.12+6b34f32
...
# ./oc get co machine-config
machine-config   4.12.0-rc.0   True   False   True   3h41m   Failed to resync 4.12.0-rc.0 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)]
...
# ./oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-b81233204496767f2fe32fbb6cb088e1   True      False      False      3              3                   3                     0                      4h10m
worker   rendered-worker-a2caae543a144d94c17a27e56038d4c4   False     True       True       3              0                   0                     1                      4h10m
...
# ./oc describe mcp worker
  Message:
  Reason:
  Status:                True
  Type:                  Degraded
  Last Transition Time:  2022-11-14T07:19:42Z
  Message:               Node jliu410-6hmkz-worker-a-xdwvv.c.openshift-qe.internal is reporting: "Error checking type of update image: error running skopeo inspect --no-tags --retry-times 5 --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c01b0ae9870dbee5609c52b4d649334ce6854fff1237f1521929d151f6876daa: exit status 1\ntime=\"2022-11-14T07:42:47Z\" level=fatal msg=\"unknown flag: --no-tags\"\n"
  Reason:                1 nodes are reporting degraded status on sync
  Status:                True
  Type:                  NodeDegraded
...
# ./oc logs machine-config-daemon-mg2zn
E1114 08:11:27.115577 192836 writer.go:200] Marking Degraded due to: Error checking type of update image: error running skopeo inspect --no-tags --retry-times 5 --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c01b0ae9870dbee5609c52b4d649334ce6854fff1237f1521929d151f6876daa: exit status 1
time="2022-11-14T08:11:25Z" level=fatal msg="unknown flag: --no-tags"
```
Version-Release number of selected component (if applicable):
4.12.0-rc.0
How reproducible:
Steps to Reproduce:
1. EUS upgrade with path 4.10.38 -> 4.11.13 -> 4.12.0-rc.0 with a paused worker pool
2. After the master pool upgrade succeeds, unpause the worker pool
Actual results:
Worker pool upgrade failed
Expected results:
Worker pool upgrade succeed
Additional info:
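A hedged check of the suspected skopeo version mismatch on the degraded node (node name taken from the output above):

```
# On the node: does the installed skopeo know --no-tags?
oc debug node/jliu410-6hmkz-worker-a-xdwvv.c.openshift-qe.internal -- chroot /host \
  sh -c 'skopeo --version; skopeo inspect --help | grep -q -- --no-tags || echo "no --no-tags support"'
```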
Description of problem:
SYN packets for new TCP connections from inside the cluster to an external destination are dropped at random. After a few seconds (i.e., a few retries) they eventually succeed and no more packet drops happen. This is perceived as an excessively long TCP connection establishment delay.
Version-Release number of selected component (if applicable):
4.10.0
How reproducible:
Frequently, on one specific cluster. Other clusters with apparently similar configuration don't show the issue.
Steps to Reproduce:
1. Establish a TCP connection from a pod to an external destination.
Actual results:
SYN packets dropped, long TCP establishment time, leading to timeouts.
Expected results:
No drops
Additional info:
This becomes especially harmful because it impacts communication between openshift-apiserver (not to be confused with kube-apiserver) and etcd, since the former is inside the SDN and etcd isn't. More details will follow in comments.
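A diagnostic sketch to confirm the drops from a node; the interface name and port are illustrative:

```
# Repeated SYNs with no SYN-ACK indicate the drop is between this node and the target.
tcpdump -ni br-ex 'tcp[tcpflags] & (tcp-syn) != 0 and port 2379'
```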
Description of problem:
Installer fails due to Neutron policy error when creating Openstack servers for OCP master nodes.

$ oc get machines -A
NAMESPACE               NAME                          PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ostest-kwtf8-master-0         Running                               23h
openshift-machine-api   ostest-kwtf8-master-1         Running                               23h
openshift-machine-api   ostest-kwtf8-master-2         Running                               23h
openshift-machine-api   ostest-kwtf8-worker-0-g7nrw   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-lrkvb   Provisioning                          23h
openshift-machine-api   ostest-kwtf8-worker-0-vwrsk   Provisioning                          23h

$ oc -n openshift-machine-api logs machine-api-controllers-7454f5d65b-8fqx2 -c machine-controller
[...]
E1018 10:51:49.355143 1 controller.go:317] controller/machine_controller "msg"="Reconciler error" "error"="error creating Openstack instance: Failed to create port err: Request forbidden: [POST https://overcloud.redhat.local:13696/v2.0/ports], error message: {\"NeutronError\": {\"type\": \"PolicyNotAuthorized\", \"message\": \"(rule:create_port and (rule:create_port:allowed_address_pairs and (rule:create_port:allowed_address_pairs:ip_address and rule:create_port:allowed_address_pairs:ip_address))) is disallowed by policy\", \"detail\": \"\"}}" "name"="ostest-kwtf8-worker-0-lrkvb" "namespace"="openshift-machine-api"
Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2022-10-14-023020
How reproducible:
Always
Steps to Reproduce:
1. Install 4.10 within provider networks (in primary or secondary interface)
Actual results:
Installation failure: 4.10.0-0.nightly-2022-10-14-023020: some cluster operators have not yet rolled out
Expected results:
Successful installation
Additional info:
Please find must-gather for installation on primary interface link here and for installation on secondary interface link here.
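The same policy denial can be reproduced without the installer by requesting what machine-api requests, a port with an allowed-address pair; a sketch with illustrative network name and IP:

```
# Fails with PolicyNotAuthorized when the policy forbids allowed_address_pairs.
openstack port create --network provider-net \
  --allowed-address ip-address=192.168.122.10 test-aap-port
```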
Description of problem:
opm serve fails with message: Error: compute digest: compute hash: write tar: stat .: os: DirFS with empty root
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
(The easiest reproducer involves serving an empty catalog)
1. mkdir /tmp/catalog
2. Use a Dockerfile /tmp/catalog.Dockerfile based on the 4.12 docs (https://access.redhat.com/documentation/en-us/openshift_container_platform/4.12/html-single/operators/index#olm-creating-fb-catalog-image_olm-managing-custom-catalogs):

# The base image is expected to contain
# /bin/opm (with a serve subcommand) and /bin/grpc_health_probe
FROM registry.redhat.io/openshift4/ose-operator-registry:v4.12

# Configure the entrypoint and command
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs"]

# Copy declarative config root into image at /configs
ADD catalog /configs

# Set DC-specific label for the location of the DC root directory
# in the image
LABEL operators.operatorframework.io.index.configs.v1=/configs

3. Build the image: `cd /tmp/ && docker build -f catalog.Dockerfile .`
4. Execute an instance of the container in docker/podman: `docker run --name cat-run [image-file]`
5. Error
Using a dockerfile generated from opm (`opm generate dockerfile [dir]`) works, but includes precache and cachedir options to opm.
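For contrast, a sketch of a Dockerfile in the shape `opm generate dockerfile` emits, pre-building the cache so serve does not compute it at runtime; the --cache-dir/--cache-only spellings are the precache and cachedir options referenced above, so treat the exact flags as an assumption for the opm version in use:

```
cat > /tmp/catalog.Dockerfile <<'EOF'
FROM registry.redhat.io/openshift4/ose-operator-registry:v4.12
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs", "--cache-dir=/tmp/cache"]
ADD catalog /configs
# Build the serve cache at image-build time so opm serve starts with a valid cache.
RUN ["/bin/opm", "serve", "/configs", "--cache-dir=/tmp/cache", "--cache-only"]
LABEL operators.operatorframework.io.index.configs.v1=/configs
EOF
```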
Actual results:
Error: compute digest: compute hash: write tar: stat .: os: DirFS with empty root
Expected results:
opm generates cache in default /tmp/cache location and serves without error
Additional info:
This is a clone of issue OCPBUGS-10213. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-8468. The following is the description of the original issue:
—
Description of problem:
RHCOS is being published to new AWS regions (https://github.com/openshift/installer/pull/6861) but aws-sdk-go need to be bumped to recognize those regions
Version-Release number of selected component (if applicable):
master/4.14
How reproducible:
always
Steps to Reproduce:
1. openshift-install create install-config
2. Try to select ap-south-2 as a region
Actual results:
New regions are not found. New regions are: ap-south-2, ap-southeast-4, eu-central-2, eu-south-2, me-central-1.
Expected results:
Installer supports and displays the new regions in the Survey
Additional info:
See https://github.com/openshift/installer/blob/master/pkg/asset/installconfig/aws/regions.go#L13-L23
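A quick check of whether a given environment's AWS tooling already knows the new regions (a sketch; assumes the AWS CLI and credentials):

```
# The new regions should appear once the SDK behind the tooling is recent enough.
aws ec2 describe-regions --all-regions \
  --query 'Regions[].RegionName' --output text | tr '\t' '\n' \
  | grep -E 'ap-south-2|ap-southeast-4|eu-central-2|eu-south-2|me-central-1'
```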
Description of problem: After running the golang script for OCP-53608, the created
ingress-controller cannot be deleted
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-15-150248
How reproducible: Run the script and try to delete the custom ingress-controller
Steps to Reproduce:
1.
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-2022-08-15-150248 True False 43m Cluster version is 4.12.0-0.nightly-2022-08-15-150248
shudi@Shudis-MacBook-Pro openshift-tests-private %
2. Run the script
shudi@Shudis-MacBook-Pro openshift-tests-private % ./bin/extended-platform-tests run all --dry-run | grep 53608 | ./bin/extended-platform-tests run -f -
...
---------------------------------------------------------
Received interrupt. Running AfterSuite...
^C again to terminate immediately
Aug 18 10:35:51.087: INFO: Running AfterSuite actions on all nodes
Aug 18 10:35:51.088: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-router-tunning-77627" for this suite.
Aug 18 10:35:54.654: INFO: Running AfterSuite actions on node 1
failed: (15m4s) 2022-08-18T02:35:54 "[sig-network-edge] Network_Edge should Author:shudili-Low-53608-Negative Test of Expose a Configurable Reload Interval in HAproxy [Suite:openshift/conformance/parallel]"
Failing tests:
[sig-network-edge] Network_Edge should Author:shudili-Low-53608-Negative Test of Expose a Configurable Reload Interval in HAproxy [Suite:openshift/conformance/parallel]
error: 1 fail, 0 pass, 0 skip (15m4s)
shudi@Shudis-MacBook-Pro openshift-tests-private %
3. show the ingress-controllers
shudi@Shudis-MacBook-Pro openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller
NAME AGE
default 113m
ocp53608 42m
shudi@Shudis-MacBook-Pro openshift-tests-private %
4. Try to delete the ingress-controller ocp53608; after the message "ingresscontroller.operator.openshift.io "ocp53608" deleted" appears, the command hangs for a long time until the error message appears.
shudi@Shudis-MacBook-Pro openshift-tests-private % oc -n openshift-ingress-operator delete ingresscontroller ocp53608
ingresscontroller.operator.openshift.io "ocp53608" deleted
error: An error occurred while waiting for the object to be deleted: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Unable to connect to the server: dial tcp 35.194.1.60:6443: i/o timeout
shudi@Shudis-MacBook-Pro openshift-tests-private %
5. After "ingresscontroller.operator.openshift.io "ocp53608" deleted" message appears, show the ingress-controller, ocp53608 isn't deleted
shudi@Shudis-MacBook-Pro golang % oc -n openshift-ingress-operator get ingresscontroller
NAME AGE
default 3h
ocp53608 109m
shudi@Shudis-MacBook-Pro golang %
6. After the error message (error: An error occurred while waiting for the object to be deleted) appears, try to show the ingresscontroller
shudi@Shudis-MacBook-Pro openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller
E0818 12:21:57.272967 4168 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E0818 12:21:57.273379 4168 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
E0818 12:21:57.274306 4168 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
Unable to connect to the server: dial tcp 35.194.1.60:6443: i/o timeout
shudi@Shudis-MacBook-Pro openshift-tests-private %
Actual results: ingress-controller ocp53608 is still there after executing the oc delete command
Expected results:
ingress-controller ocp53608 should be deleted soon after executing the oc delete command
Additional info:
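When a delete hangs like this, a finalizer that cannot complete is the usual suspect; a diagnostic sketch to run once the API server is reachable again:

```
# A deletionTimestamp plus a non-empty finalizer list means deletion is blocked,
# not lost; the operator still has cleanup to finish.
oc -n openshift-ingress-operator get ingresscontroller ocp53608 \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```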
Description of problem:
When enabling OvS HWOL on 4.12.0 nightly, traffic does not pass between pods.
Version-Release number of selected component (if applicable):
4.12.0 nightly
How reproducible:
Always
Steps to Reproduce:
1. Create 2 pods with sriov and try to ping between them (same node or different node)
Actual results:
No Traffic Passes (Ping or other)
Expected results:
Traffic Passes (Ping or other)
Additional info:
Missing this commit in 4.12 branch https://github.com/openshift/ovn-kubernetes/commit/37c6c1d7039fd4c8f3cca560691a254e720172de
Description of the problem:
During install, we assume all PVs on a host have been added to a volume group, and we only remove PVs that are. This can leave behind PVs that were never attached to a volume group and prevent CoreOS from installing properly.
Relevant assisted installer links:
Found while investigating triage issue https://issues.redhat.com/browse/AITRIAGE-4017
See slack thread for more details https://coreos.slack.com/archives/C02CP89N4VC/p1663263128420489
How reproducible:
100%
Steps to reproduce:
1. Create a host with a PV w/o a volume group
2. Add host to cluster and install
3. Observe the install fail
Actual results:
Installation fails with:

Error: checking for exclusive access to /dev/sda
Caused by:
    0: couldn't reread partition table: device is in use
    1: EBUSY: Device or resource busy
Expected results:
All PVs and VGs are removed so that the installation will succeed
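A hedged sketch of the more thorough cleanup, assuming it is acceptable to destroy all LVM state on the host (the target disk is illustrative):

```
# Remove every VG, then every PV, including PVs that never joined a VG,
# and finally clear remaining signatures from the target disk.
vgs --noheadings -o vg_name | xargs -r -n1 vgremove -ff -y
pvs --noheadings -o pv_name | xargs -r -n1 pvremove -ff -y
wipefs -a /dev/sda
```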
Test output:
=== RUN TestAll/serial/TestCanaryRoute
canary_test.go:78: failed to create pod openshift-ingress-canary/canary-route-check: pods "canary-route-check" is forbidden: violates PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "curl" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "curl" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "curl" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "curl" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
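The probe error spells out exactly which securityContext fields the "restricted" profile demands; a minimal compliant pod sketch (image, user ID, and names are illustrative):

```
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: canary-route-check
  namespace: openshift-ingress-canary
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000          # illustrative non-root UID
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: curl
    image: registry.access.redhat.com/ubi8/ubi-minimal   # illustrative
    command: ["sleep", "3600"]
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
EOF
```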
Description of problem:
hypershift pull secret update failed on 4.12
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
When multi-cluster is enabled, it is possible to get into a situation where you can't cancel login. If you select a cluster you don't know the credentials for, the console will remember the last cluster and repeatedly send you to the login page with no way to cancel or go back. If we decide to store the last cluster in the user's preferences, it might be possible to get stuck even after clearing cookies and localStorage.
There are similar issues logging into cluster that are hibernating. See attached video.
cc Scott Berens
This is a clone of issue OCPBUGS-2083. The following is the description of the original issue:
—
Description of problem:
Currently we are running the VMware CSI Operator in OpenShift 4.10.33. After running vulnerability scans, the operator was discovered to be using a known weak cipher, 3DES. We are attempting to upgrade or modify the operator to customize the available ciphers. We were looking at performing a manual upgrade via Quay.io but can't seem to pull the image, and were trying to steer away from performing a custom install from scratch. Looking for any suggestions for mitigating the weak cipher in the kube-rbac-proxy under the VMware CSI Operator.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This is just a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2105570 for purposes of cherry-picking.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
For example, "openshift-install explain installconfig.platform.gcp.publicDNSZone" tells "PublicDNSZone contains the zone ID and project where the Public DNS zone will be created", but in fact it's for specifying an existing zone where the Public DNS zone records will be put in.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-10-015203
How reproducible:
Always
Steps to Reproduce:
1. openshift-install explain installconfig.platform.gcp.publicDNSZone
2. openshift-install explain installconfig.platform.gcp.privateDNSZone
Actual results:
For example, it tells "PublicDNSZone contains the zone ID and project where the Public DNS zone will be created."
Expected results:
It should be like "PublicDNSZone contains the zone ID and project where the Public DNS zone records will be created."
Additional info:
$ openshift-install version
openshift-install 4.12.0-0.nightly-2022-10-10-015203
built from commit 02102a96b3f7c78337b32dcafe2e28be6fb67a0f
release image registry.ci.openshift.org/ocp/release@sha256:00806cf7faaa86981e73b478a72c1b7a838cd08b215f3a9ab9b278ae94d9a794
release architecture amd64
$
$ openshift-install explain installconfig.platform.gcp.publicDNSZone
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <object>
  PublicDNSZone Technology Preview. PublicDNSZone contains the zone ID and project where the Public DNS zone will be created.

FIELDS:
    id <string>
      ID Technology Preview. ID or name of the zone.

    project <string>
      ProjectID Technology Preview When the ProjectID is provided, the zone will be created in this project. When the ProjectID is empty, the DNS zone with this ID will be created and managed in the Service Project (GCP.ProjectID).
$
$ openshift-install explain installconfig.platform.gcp.privateDNSZone
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <object>
  PrivateDNSZone Technology Preview. PrivateDNSZone contains the zone ID and project where the Private DNS zone will be created.

FIELDS:
    id <string>
      ID Technology Preview. ID or name of the zone.

    project <string>
      ProjectID Technology Preview When the ProjectID is provided, the zone will be created in this project. When the ProjectID is empty, the DNS zone with this ID will be created and managed in the Service Project (GCP.ProjectID).
$
This is a clone of issue OCPBUGS-8035. The following is the description of the original issue:
—
Description of problem:
When installing a disconnected private cluster, SSH to the master/bootstrap nodes from the bastion on the VPC fails.
Version-Release number of selected component (if applicable):
Pre-merge build https://github.com/openshift/installer/pull/6836 registry.build05.ci.openshift.org/ci-ln-5g4sj02/release:latest Tag: 4.13.0-0.ci.test-2023-02-27-033047-ci-ln-5g4sj02-latest
How reproducible:
always
Steps to Reproduce:
1. Create bastion instance maxu-ibmj-p1-int-svc
2. Create vpc on the bastion host
3. Install a private disconnected cluster on the bastion host with a mirror registry
4. ssh to the bastion
5. ssh to the master/bootstrap nodes from the bastion
Actual results:
[core@maxu-ibmj-p1-int-svc ~]$ ssh -i ~/openshift-qe.pem core@10.241.0.5 -v
OpenSSH_8.8p1, OpenSSL 3.0.5 5 Jul 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 10.241.0.5 [10.241.0.5] port 22.
debug1: connect to address 10.241.0.5 port 22: Connection timed out
ssh: connect to host 10.241.0.5 port 22: Connection timed out
Expected results:
ssh succeed.
Additional info:
$ ibmcloud is sg-rules r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 --vpc maxu-ibmj-p1-vpc
Listing rules of security group r014-5a6c16f4-8a4c-4c02-ab2d-626c14f72a77 under account OpenShift-QE as user ServiceId-dff277a9-b608-410a-ad24-c544e59e3778...
ID                                          Direction   IP version   Protocol                        Remote
r014-6739d68f-6827-41f4-b51a-5da742c353b2   outbound    ipv4         all                             0.0.0.0/0
r014-06d44c15-d3fd-4a14-96c4-13e96aa6769c   inbound     ipv4         all                             shakiness-perfectly-rundown-take
r014-25b86956-5370-4925-adaf-89dfca9fb44b   inbound     ipv4         tcp Ports:Min=22,Max=22         0.0.0.0/0
r014-e18f0f5e-c4e5-44a5-b180-7a84aa59fa97   inbound     ipv4         tcp Ports:Min=3128,Max=3129     0.0.0.0/0
r014-7e79c4b7-d0bb-4fab-9f5d-d03f6b427d89   inbound     ipv4         icmp Type=8,Code=0              0.0.0.0/0
r014-03f23b04-c67a-463d-9754-895b8e474e75   inbound     ipv4         tcp Ports:Min=5000,Max=5000     0.0.0.0/0
r014-8febe8c8-c937-42b6-b352-8ae471749321   inbound     ipv4         tcp Ports:Min=6001,Max=6002     0.0.0.0/0
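If the cluster nodes' security group simply lacks an SSH rule, a hedged mitigation sketch (the security group ID and remote CIDR are illustrative placeholders):

```
# Allow inbound SSH from the bastion's subnet to the cluster nodes.
ibmcloud is security-group-rule-add r014-CLUSTER-NODES-SG-ID inbound tcp \
  --port-min 22 --port-max 22 --remote 10.241.0.0/18
```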
Description of problem:
This bugs purpose is to enable a feature backport of https://issues.redhat.com/browse/MON-1949.
Description of problem:
CI failure when installing a devfile, see also attached screenshot.
You can find the error by looking for Object.verifyTopologyPage
For example:
We rely on the user providing accurate information about the MAC addresses in the agent-config, because at the point we read it we haven't seen the hosts yet. However, if the user gets this wrong then chaos may ensue.
Once inventory is available, we should validate that the user has not:
and fail the install if they have.
This is a clone of issue OCPBUGS-4913. The following is the description of the original issue:
—
Description of problem:
Currently the Terraform code waits for 45 seconds, but anecdotal data suggest we should actually wait for 3 minutes in order to avoid "failures" due to occasional slow boots of a new VM in PowerVS.
Version-Release number of selected component (if applicable):
How reproducible:
often enough
Steps to Reproduce:
1. Run the IPI installer against PowerVS
2. Look for "empty tuple" in the error message when it fails to reach `bootstrap-complete`
Actual results:
Expected results:
VMs to always have IP address assigned by DHCP after a certain wait
Additional info:
The change has already been merged into master/4.13, but 4.12 also needs this for planned PowerVS IPI GA on the z-stream.
A related slack thread: here
The error:
which: no kustomize in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/go/bin:/go/bin)
+ curl -L --retry 5 https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv4.3.0/kustomize_v4.3.0_linux_amd64.tar.gz
+ tar -zx -C /usr/bin/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  1523    0  1523    0     0  27196      0 --:--:-- --:--:-- --:--:-- 26719
Warning: Problem : HTTP error. Will retry in 300 seconds. 5 retries left.
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
gzip: stdin: not in gzip format
tar: Child died with signal 13
tar: Error is not recoverable: exiting now
A related job search: https://search.ci.openshift.org/?search=gzip%3A+stdin%3A+not+in+gzip+format&maxAge=336h&context=1&type=junit&name=assisted&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
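The transcript shows curl retrying on an HTTP error while the HTML error body still reaches tar through the pipe. A hardened variant of the download step (a sketch; --retry-all-errors requires curl 7.71 or newer):

```
# -f makes curl fail on HTTP errors instead of emitting the error page to stdout,
# so tar never sees non-gzip input.
curl -fL --retry 5 --retry-all-errors -o /tmp/kustomize.tgz \
  "https://github.com/kubernetes-sigs/kustomize/releases/download/kustomize%2Fv4.3.0/kustomize_v4.3.0_linux_amd64.tar.gz"
tar -zxf /tmp/kustomize.tgz -C /usr/bin/
```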
Description of problem:
Disconnected IPI OCP 4.10.22 cluster install on baremetal fails when hostname of master nodes does not include "master"
Version-Release number of selected component (if applicable): 4.10.22
How reproducible: Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"
Steps to Reproduce:
Perform disconnected IPI install of OCP 4.10.22 on bare metal with master nodes that do not contain the text "master"
Actual results: master nodes do not come up.
Expected results: master nodes should come up despite that the text "master" is not in their hostname.
Additional info:
The code for the cluster-baremetal-operator at the following link:
The following condition is concerning:
if strings.Contains(bmh.Name, "master") && len(bmh.Spec.BootMACAddress) > 0
The packages reveal that bmh.Name references the name inside the metadata of the BMH object.
Should a customer have masters with names that do not include the text "master", the above condition can never become true, and so, the following slice is never created :
macs = append(macs, bmh.Spec.BootMACAddress)
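A quick way to see which hosts that condition will match on a live cluster (a diagnostic sketch):

```
# Hosts whose names lack the substring "master" are skipped by the condition,
# even if they carry a BootMACAddress.
oc -n openshift-machine-api get bmh \
  -o custom-columns=NAME:.metadata.name,BOOTMAC:.spec.bootMACAddress
```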
Description of problem:
A cluster hit a panic in etcd operator in bootstrap:
I0829 14:46:02.736582 1 controller_manager.go:54] StaticPodStateController controller terminated
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e940ab]
goroutine 2701 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.checkSingleMemberHealth({0x29374c0, 0xc00217d920}, 0xc0021fb110)
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:135 +0x34b
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0x7f
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2ac
Version-Release number of selected component (if applicable):
How reproducible:
Pulled up a 4.12 cluster and hit panic during bootstrap
Steps to Reproduce:
1. 2. 3.
Actual results:
panic as above
Expected results:
no panic
Additional info:
Description of problem:
AWS CCM install fails in the CI of the WMCB and WMCO repositories; the install uses the prow workflow
openshift-e2e-aws-ccm-ovn-hybrid
Version-Release number of selected component: master/4.12
How reproducible: Always
Additional info:
Ongoing slack thread: https://coreos.slack.com/archives/CBZHF4DHC/p1664998753931669
This is a clone of issue OCPBUGS-3278. The following is the description of the original issue:
—
Description of problem:
When doing openshift-install agent create image, one should not need to provide platform specific data like boot MAC addresses.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Create an install-config with only VIPs in the baremetal platform section:

apiVersion: v1
metadata:
  name: foo
baseDomain: test.metalkube.org
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.122.0/23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 0
controlPlane:
  name: master
  replicas: 3
  hyperthreading: Enabled
  architecture: amd64
platform:
  baremetal:
    apiVIPs:
    - 192.168.122.10
    ingressVIPs:
    - 192.168.122.11
---
apiVersion: v1beta1
metadata:
  name: foo
rendezvousIP: 192.168.122.14

2. openshift-install agent create image
Actual results:
ERROR failed to write asset (Agent Installer ISO) to disk: cannot generate ISO image due to configuration errors ERROR failed to fetch Agent Installer ISO: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: [platform.baremetal.hosts: Invalid value: []*baremetal.Host(nil): bare metal hosts are missing, platform.baremetal.Hosts: Required value: not enough hosts found (0) to support all the configured ControlPlane replicas (3)]
Expected results:
Image gets generated
Additional info:
We should go into the install-config validation code, detect that we are doing an agent-based installation, and skip the host checks.
Description of problem:
The "Add Git Repository" has a "Show configuration options" expandable section that shows the required permissions for a webhook setup, and provides a link to "read more about setting up webhook".
But the permission section shows nothing when open this second expandable section, and the link doesn't do anything until the user enters a "supported" GitHub, GitLab or BitBucket URL.
Version-Release number of selected component (if applicable):
4.11-4.13
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-672. The following is the description of the original issue:
—
Description of problem:
The redhat-operators pod, part of the marketplace, is failing regularly because the startup probe times out while connecting to the registry-server container in the same pod within 1s, which in turn increases CPU/memory usage on master nodes:

62m   Normal    Scheduled        pod/redhat-operators-zb4j7   Successfully assigned openshift-marketplace/redhat-operators-zb4j7 to ip-10-0-163-212.us-west-2.compute.internal by ip-10-0-149-93
62m   Normal    AddedInterface   pod/redhat-operators-zb4j7   Add eth0 [10.129.1.112/23] from ovn-kubernetes
62m   Normal    Pulling          pod/redhat-operators-zb4j7   Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.11"
62m   Normal    Pulled           pod/redhat-operators-zb4j7   Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.11" in 498.834447ms
62m   Normal    Created          pod/redhat-operators-zb4j7   Created container registry-server
62m   Normal    Started          pod/redhat-operators-zb4j7   Started container registry-server
62m   Warning   Unhealthy        pod/redhat-operators-zb4j7   Startup probe failed: timeout: failed to connect service ":50051" within 1s
62m   Normal    Killing          pod/redhat-operators-zb4j7   Stopping container registry-server

Increasing the threshold of the probe might fix the problem:

livenessProbe:
  exec:
    command:
    - grpc_health_probe
    - -addr=:50051
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
name: registry-server
ports:
- containerPort: 50051
  name: grpc
  protocol: TCP
readinessProbe:
  exec:
    command:
    - grpc_health_probe
    - -addr=:50051
  failureThreshold: 3
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install an OSD cluster using the 4.11.0-0.nightly-2022-08-26-162248 payload
2. Inspect the redhat-operators pod in the openshift-marketplace namespace
3. Observe the resource usage (CPU and memory) of the pod
Actual results:
The redhat-operators pod failing regularly during startup leads to increased CPU and memory usage on the master nodes.
Expected results:
Redhat-operator startup probe succeeding and no spikes in resource on master nodes
Additional info:
Attached cpu, memory and event traces.
Description of problem:
The networks field is unset in the topology of each failureDomain, while platform.vsphere.vcenters is defined,
in install-config.yaml:

vcenters:
- server: xxx
  user: xxx
  password: xxx
  datacenters:
  - IBMCloud
  - datacenter-2
failureDomains:
- name: us-east-1
  region: us-east
  zone: us-east-1a
  topology:
    datacenter: IBMCloud
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
    datastore: multi-zone-ds-shared
    server: ibmvcenter.vmc-ci.devcluster.openshift.com
- name: us-east-2
  region: us-east
  zone: us-east-2a
  topology:
    datacenter: IBMCloud
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
    datastore: multi-zone-ds-shared
    server: ibmvcenter.vmc-ci.devcluster.openshift.com
- name: us-east-3
Launching the installer to create the cluster hits a panic:
sh-4.4$ ./openshift-install create cluster --dir ipi --log-level debug
DEBUG OpenShift Installer 4.12.0-0.nightly-2022-09-25-071630
DEBUG Built from commit 1fb1397635c89ff8b3645fed4c4c264e4119fa84
DEBUG Fetching Metadata...
...
DEBUG Reusing previously-fetched Master Ignition Config
DEBUG Generating Master Machines...
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/machines/vsphere.getDefinedZones(0xc0003bec80)
	/go/src/github.com/openshift/installer/pkg/asset/machines/vsphere/machinesets.go:122 +0x4f8
github.com/openshift/installer/pkg/asset/machines/vsphere.Machines({0xc0011ca0b0, 0xd}, 0xc001080c80, 0xc0005cad50, {0xc000651d10, 0x13}, {0x4ab5773, 0x6}, {0x4ad49bb, 0x10})
	/go/src/github.com/openshift/installer/pkg/asset/machines/vsphere/machines.go:37 +0x250
github.com/openshift/installer/pkg/asset/machines.(*Master).Generate(0xc001118bd0, 0x5?)
The field platform.vsphere.failureDomains.topology.networks is not required according to the documentation:
sh-4.4$ ./openshift-install explain installconfig.platform.vsphere.failureDomains.topology
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <object>
  Topology describes a given failure domain using vSphere constructs

FIELDS:
    computeCluster <string> -required-
      computeCluster as the failure domain This is required to be a path

    datacenter <string> -required-
      datacenter is the vCenter datacenter in which virtual machines will be located and defined as the failure domain.

    datastore <string> -required-
      datastore is the name or inventory path of the datastore in which the virtual machine is created/located.

    folder <string>
      folder is the name or inventory path of the folder in which the virtual machine is created/located.

    networks <[]string>
      networks is the list of networks within this failure domain

    resourcePool <string>
      resourcePool is the absolute path of the resource pool where virtual machines will be created. The absolute path is of the form /<datacenter>/host/<cluster>/Resources/<resourcepool>.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-25-071630
How reproducible:
Always, when platform.vsphere.vcenters is set and platform.vsphere.failureDomains.topology.networks is unset. It works if platform.vsphere.vcenters is not set and platform.vsphere.failureDomains.topology.networks is set.
Steps to Reproduce:
1. Configure zones in install-config.yaml: set platform.vsphere.vcenters and unset platform.vsphere.failureDomains.topology.networks
2. Install an IPI cluster
Actual results:
The installer panics with the error above.
Expected results:
installation is successful.
Additional info:
Description of problem:
Running a command in a pod locally with the network-tools script pod-run-netns-command fails.
Version-Release number of selected component (if applicable):
Client Version: 4.12.0-0.nightly-2022-07-25-055755
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-09-28-204419
Kubernetes Version: v1.24.0+8c7c967
How reproducible:
100%
Steps to Reproduce:
1. Configure KUBECONFIG

[cloud-user@preserved-qiowang debug-scripts]$ export | grep kube
declare -x KUBECONFIG="/var/tmp/kubeconfig412"
[cloud-user@preserved-qiowang debug-scripts]$ oc get nodes
NAME                                                         STATUS   ROLES                  AGE     VERSION
qiowang-09291-chllb-master-0.c.openshift-qe.internal         Ready    control-plane,master   7h16m   v1.24.0+8c7c967
qiowang-09291-chllb-master-1.c.openshift-qe.internal         Ready    control-plane,master   7h16m   v1.24.0+8c7c967
qiowang-09291-chllb-master-2.c.openshift-qe.internal         Ready    control-plane,master   7h16m   v1.24.0+8c7c967
qiowang-09291-chllb-worker-a-2zq28.c.openshift-qe.internal   Ready    worker                 6h59m   v1.24.0+8c7c967
qiowang-09291-chllb-worker-b-226ft.c.openshift-qe.internal   Ready    worker                 6h59m   v1.24.0+8c7c967
qiowang-09291-chllb-worker-c-wq52c.c.openshift-qe.internal   Ready    worker                 6h59m   v1.24.0+8c7c967

2. Clone the openshift/network-tools repo to local

3. Create project test, create pod hello-world

[cloud-user@preserved-qiowang debug-scripts]$ oc project
Using project "test" on server "https://api.qiowang-09291.qe.gcp.devcluster.openshift.com:6443".
[cloud-user@preserved-qiowang debug-scripts]$ oc get pods
NAME                READY   STATUS    RESTARTS   AGE
hello-world-j9v9g   1/1     Running   0          68s
hello-world-rrwjf   1/1     Running   0          68s

4. Run the ping command in the pod hello-world-j9v9g with the script pod-run-netns-command locally

[cloud-user@preserved-qiowang debug-scripts]$ ./network-tools pod-run-netns-command test hello-world-j9v9g ping 8.8.8.8 -c 5
ERROR: Command returned non-zero exit code, check output or logs.
Actual results:
failed to run command in pod hello-world-j9v9g with script pod-run-netns-command locally
Expected results:
can run ping 8.8.8.8 -c 5 in pod hello-world-j9v9g with script pod-run-netns-command locally
Additional info:
Description of problem:
A Machine cannot go into the Failed phase when an invalid vmSize is provided; it is stuck in Provisioning, and the error message is not accurate. The case works well in 4.11 and previous versions; it's a regression issue in 4.12 and seems to have been introduced here: https://github.com/openshift/machine-api-provider-azure/pull/32/files#diff-af805e1e45f03df0b5b56ff4413e5ad52cd31904a94d37e8e916751953e4687dR565
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-28-204419
How reproducible:
always
Steps to Reproduce:
1. Create a machineset with an invalid vmSize:

   vmSize: invalid

liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml
machineset.machine.openshift.io/huliu-azure02pr-jmvl2-1 created
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                                                 PHASE          TYPE              REGION           ZONE   AGE
huliu-azure02pr-jmvl2-1-6gbdw                        Provisioning                                             4m58s
huliu-azure02pr-jmvl2-master-0                       Running        Standard_D8s_v3   southcentralus   1      5h11m
huliu-azure02pr-jmvl2-master-1                       Running        Standard_D8s_v3   southcentralus   2      5h11m
huliu-azure02pr-jmvl2-master-2                       Running        Standard_D8s_v3   southcentralus   3      5h11m
huliu-azure02pr-jmvl2-worker-southcentralus1-9hgmk   Running        Standard_D4s_v3   southcentralus   1      4h56m
huliu-azure02pr-jmvl2-worker-southcentralus2-44mf6   Running        Standard_D4s_v3   southcentralus   2      4h56m
huliu-azure02pr-jmvl2-worker-southcentralus3-4m9b7   Running        Standard_D4s_v3   southcentralus   3      4h56m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine huliu-azure02pr-jmvl2-1-6gbdw -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  creationTimestamp: "2022-09-29T06:36:03Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: huliu-azure02pr-jmvl2-1-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: huliu-azure02pr-jmvl2
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: huliu-azure02pr-jmvl2-1
  name: huliu-azure02pr-jmvl2-1-6gbdw
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: huliu-azure02pr-jmvl2-1
    uid: f729cb01-274a-4c6e-8f69-808cff412fe3
  resourceVersion: "174604"
  uid: 2c4b9dd4-5666-47cd-8fc5-38bac0b9cad1
spec:
  lifecycleHooks: {}
  metadata: {}
  providerSpec:
    value:
      acceleratedNetworking: true
      apiVersion: machine.openshift.io/v1beta1
      credentialsSecret:
        name: azure-cloud-credentials
        namespace: openshift-machine-api
      diagnostics: {}
      image:
        offer: ""
        publisher: ""
        resourceID: /resourceGroups/huliu-azure02pr-jmvl2-rg/providers/Microsoft.Compute/images/huliu-azure02pr-jmvl2-gen2
        sku: ""
        version: ""
      kind: AzureMachineProviderSpec
      location: southcentralus
      managedIdentity: huliu-azure02pr-jmvl2-identity
      metadata:
        creationTimestamp: null
        name: huliu-azure02pr-jmvl2
      networkResourceGroup: huliu-azure02pr-jmvl2-rg
      osDisk:
        diskSettings: {}
        diskSizeGB: 128
        managedDisk:
          storageAccountType: Premium_LRS
        osType: Linux
      publicIP: false
      publicLoadBalancer: huliu-azure02pr-jmvl2
      resourceGroup: huliu-azure02pr-jmvl2-rg
      subnet: huliu-azure02pr-jmvl2-worker-subnet
      userDataSecret:
        name: worker-user-data
      vmSize: invalid
      vnet: huliu-azure02pr-jmvl2-vnet
      zone: "1"
status:
  conditions:
  - lastTransitionTime: "2022-09-29T06:36:03Z"
    status: "True"
    type: Drainable
  - lastTransitionTime: "2022-09-29T06:36:03Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  - lastTransitionTime: "2022-09-29T06:36:03Z"
    status: "True"
    type: Terminable
  lastUpdated: "2022-09-29T06:36:03Z"
  phase: Provisioning
  providerStatus:
    conditions:
    - lastTransitionTime: "2022-09-29T06:36:03Z"
      message: 'failed to create nic huliu-azure02pr-jmvl2-1-6gbdw-nic for machine huliu-azure02pr-jmvl2-1-6gbdw: failed to find sku invalid'
      reason: MachineCreationFailed
      status: "True"
      type: MachineCreated
    metadata: {}

machine-controller log:
...
W0929 11:38:25.817887 1 controller.go:382] huliu-azure02pr-jmvl2-invalid-lzzb2: failed to create machine: requeue in: 20s
I0929 11:38:25.817905 1 controller.go:412] Actuator returned requeue-after error: requeue in: 20s
I0929 11:38:25.817984 1 logr.go:252] events "msg"="Warning" "message"="CreateError: failed to reconcile machine \"huliu-azure02pr-jmvl2-invalid-lzzb2\"s: failed to create nic huliu-azure02pr-jmvl2-invalid-lzzb2-nic for machine huliu-azure02pr-jmvl2-invalid-lzzb2: failed to find sku invalid" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"huliu-azure02pr-jmvl2-invalid-lzzb2","uid":"bab43f44-7da9-4b62-bbdc-01a180cc1de7","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"316506"} "reason"="FailedCreate"
I0929 11:38:25.817989 1 controller.go:187] huliu-azure02pr-jmvl2-invalid-lzzb2: reconciling Machine
I0929 11:38:25.818015 1 actuator.go:213] huliu-azure02pr-jmvl2-invalid-lzzb2: actuator checking if machine exists
W0929 11:38:25.916417 1 virtualmachines.go:99] vm huliu-azure02pr-jmvl2-invalid-lzzb2 not found: %!w(string=compute.VirtualMachinesClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Compute/virtualMachines/huliu-azure02pr-jmvl2-invalid-lzzb2' under resource group 'huliu-azure02pr-jmvl2-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix")
I0929 11:38:25.916463 1 controller.go:380] huliu-azure02pr-jmvl2-invalid-lzzb2: reconciling machine triggers idempotent create
I0929 11:38:25.916476 1 actuator.go:85] Creating machine huliu-azure02pr-jmvl2-invalid-lzzb2
I0929 11:38:25.917540 1 machine_scope.go:176] huliu-azure02pr-jmvl2-invalid-lzzb2: status unchanged
I0929 11:38:25.917596 1 machine_scope.go:192] huliu-azure02pr-jmvl2-invalid-lzzb2: patching machine
E0929 11:38:25.941083 1 actuator.go:79] Machine error: failed to reconcile machine "huliu-azure02pr-jmvl2-invalid-lzzb2"s: failed to create nic huliu-azure02pr-jmvl2-invalid-lzzb2-nic for machine huliu-azure02pr-jmvl2-invalid-lzzb2: failed to find sku invalid
Actual results:
The Machine is stuck in Provisioning, and the error message is not accurate.
Expected results:
The Machine goes into the Failed phase and gives an InvalidConfiguration error, as in previous versions.
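The invalid size itself can also be pre-validated before creating the machineset; a hedged sketch using the Azure CLI (location taken from the example above):

```
# An empty result means the requested size does not exist in the region,
# so Azure would reject the machineset's vmSize.
az vm list-sizes --location southcentralus --query "[?name=='invalid'].name" -o tsv
```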
Additional info:
Test result on a previous version:

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME                               PHASE     TYPE              REGION   ZONE   AGE
jfan49-jn66b-master-0              Running   Standard_D8s_v3   westus          6h27m
jfan49-jn66b-master-1              Running   Standard_D8s_v3   westus          6h27m
jfan49-jn66b-master-2              Running   Standard_D8s_v3   westus          6h27m
jfan49-jn66b-worker-1-tdpdt        Failed                                      61s
jfan49-jn66b-worker-westus-2fz6b   Running   Standard_D4s_v3   westus          6h21m
jfan49-jn66b-worker-westus-6fkgb   Running   Standard_D4s_v3   westus          6h21m
jfan49-jn66b-worker-westus-k74gf   Running   Standard_D4s_v3   westus          6h21m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine jfan49-jn66b-worker-1-tdpdt -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    machine.openshift.io/instance-state: Unknown
  creationTimestamp: "2022-09-29T08:50:13Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: jfan49-jn66b-worker-1-
  generation: 2
  labels:
    machine.openshift.io/cluster-api-cluster: jfan49-jn66b
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: jfan49-jn66b-worker-1
  name: jfan49-jn66b-worker-1-tdpdt
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: jfan49-jn66b-worker-1
    uid: 4319d2e2-3ee2-4cb2-a7b4-5a0d4e1ea3d7
  resourceVersion: "128119"
  uid: 7d9e4bbe-7c37-416e-a133-577476937b7a
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: azureproviderconfig.openshift.io/v1beta1
      credentialsSecret:
        name: azure-cloud-credentials
        namespace: openshift-machine-api
      image:
        offer: ""
        publisher: ""
        resourceID: /resourceGroups/jfan49-jn66b-rg/providers/Microsoft.Compute/images/jfan49-jn66b
        sku: ""
        version: ""
      kind: AzureMachineProviderSpec
      location: westus
      managedIdentity: jfan49-jn66b-identity
      metadata:
        creationTimestamp: null
        name: jfan49-jn66b
      networkResourceGroup: jfan49-jn66b-rg
      osDisk:
        diskSizeGB: 128
        managedDisk:
          storageAccountType: Premium_LRS
        osType: Linux
      publicIP: false
      publicLoadBalancer: jfan49-jn66b
      resourceGroup: jfan49-jn66b-rg
      subnet: jfan49-jn66b-worker-subnet
      userDataSecret:
        name: worker-user-data
      vmSize: invalid
      vnet: jfan49-jn66b-vnet
      zone: ""
status:
  conditions:
  - lastTransitionTime: "2022-09-29T08:50:13Z"
    message: Instance has not been created
    reason: InstanceNotCreated
    severity: Warning
    status: "False"
    type: InstanceExists
  errorMessage: 'failed to reconcile machine "jfan49-jn66b-worker-1-tdpdt": failed to create vm jfan49-jn66b-worker-1-tdpdt: failure sending request for machine jfan49-jn66b-worker-1-tdpdt: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="The value invalid provided for the VM size is not valid.
The valid sizes in the current region are: Standard_B1ls,Standard_B1ms,Standard_B1s,Standard_B2ms,Standard_B2s,Standard_B4ms,Standard_B8ms,Standard_B12ms,Standard_B16ms,Standard_B20ms,Standard_E2_v4,Standard_E4_v4,Standard_E8_v4,Standard_E16_v4,Standard_E20_v4,Standard_E32_v4,Standard_E2d_v4,Standard_E4d_v4,Standard_E8d_v4,Standard_E16d_v4,Standard_E20d_v4,Standard_E32d_v4,Standard_E2s_v4,Standard_E4-2s_v4,Standard_E4s_v4,Standard_E8-2s_v4,Standard_E8-4s_v4,Standard_E8s_v4,Standard_E16-4s_v4,Standard_E16-8s_v4,Standard_E16s_v4,Standard_E20s_v4,Standard_E32-8s_v4,Standard_E32-16s_v4,Standard_E32s_v4,Standard_E2ds_v4,Standard_E4-2ds_v4,Standard_E4ds_v4,Standard_E8-2ds_v4,Standard_E8-4ds_v4,Standard_E8ds_v4,Standard_E16-4ds_v4,Standard_E16-8ds_v4,Standard_E16ds_v4,Standard_E20ds_v4,Standard_E32-8ds_v4,Standard_E32-16ds_v4,Standard_E32ds_v4,Standard_D2d_v4,Standard_D4d_v4,Standard_D8d_v4,Standard_D16d_v4,Standard_D32d_v4,Standard_D48d_v4,Standard_D64d_v4,Standard_D2_v4,Standard_D4_v4,Standard_D8_v4,Standard_D16_v4,Standard_D32_v4,Standard_D48_v4,Standard_D64_v4,Standard_D2ds_v4,Standard_D4ds_v4,Standard_D8ds_v4,Standard_D16ds_v4,Standard_D32ds_v4,Standard_D48ds_v4,Standard_D64ds_v4,Standard_D2s_v4,Standard_D4s_v4,Standard_D8s_v4,Standard_D16s_v4,Standard_D32s_v4,Standard_D48s_v4,Standard_D64s_v4,Standard_D1_v2,Standard_D2_v2,Standard_D3_v2,Standard_D4_v2,Standard_D5_v2,Standard_D11_v2,Standard_D12_v2,Standard_D13_v2,Standard_D14_v2,Standard_D15_v2,Standard_D2_v2_Promo,Standard_D3_v2_Promo,Standard_D4_v2_Promo,Standard_D5_v2_Promo,Standard_D11_v2_Promo,Standard_D12_v2_Promo,Standard_D13_v2_Promo,Standard_D14_v2_Promo,Standard_F1,Standard_F2,Standard_F4,Standard_F8,Standard_F16,Standard_DS1_v2,Standard_DS2_v2,Standard_DS3_v2,Standard_DS4_v2,Standard_DS5_v2,Standard_DS11-1_v2,Standard_DS11_v2,Standard_DS12-1_v2,Standard_DS12-2_v2,Standard_DS12_v2,Standard_DS13-2_v2,Standard_DS13-4_v2,Standard_DS13_v2,Standard_DS14-4_v2,Standard_DS14-8_v2,Standard_DS14_v2,Standard_DS15_v2,Standard_DS2_v2_Promo,Standard_DS3_v2_Promo,Standard_DS4_v2_Promo,Standard_DS5_v2_Promo,Standard_DS11_v2_Promo,Standard_DS12_v2_Promo,Standard_DS13_v2_Promo,Standard_DS14_v2_Promo,Standard_F1s,Standard_F2s,Standard_F4s,Standard_F8s,Standard_F16s,Standard_A1_v2,Standard_A2m_v2,Standard_A2_v2,Standard_A4m_v2,Standard_A4_v2,Standard_A8m_v2,Standard_A8_v2,Standard_D2_v3,Standard_D4_v3,Standard_D8_v3,Standard_D16_v3,Standard_D32_v3,Standard_D48_v3,Standard_D64_v3,Standard_D2s_v3,Standard_D4s_v3,Standard_D8s_v3,Standard_D16s_v3,Standard_D32s_v3,Standard_D48s_v3,Standard_D64s_v3,Standard_E2_v3,Standard_E4_v3,Standard_E8_v3,Standard_E16_v3,Standard_E20_v3,Standard_E32_v3,Standard_E2s_v3,Standard_E4-2s_v3,Standard_E4s_v3,Standard_E8-2s_v3,Standard_E8-4s_v3,Standard_E8s_v3,Standard_E16-4s_v3,Standard_E16-8s_v3,Standard_E16s_v3,Standard_E20s_v3,Standard_E32-8s_v3,Standard_E32-16s_v3,Standard_E32s_v3,Standard_F2s_v2,Standard_F4s_v2,Standard_F8s_v2,Standard_F16s_v2,Standard_F32s_v2,Standard_F48s_v2,Standard_F64s_v2,Standard_F72s_v2,Standard_E48_v4,Standard_E64_v4,Standard_E48d_v4,Standard_E64d_v4,Standard_E48s_v4,Standard_E64-16s_v4,Standard_E64-32s_v4,Standard_E64s_v4,Standard_E80is_v4,Standard_E48ds_v4,Standard_E64-16ds_v4,Standard_E64-32ds_v4,Standard_E64ds_v4,Standard_E80ids_v4,Standard_E48_v3,Standard_E64_v3,Standard_E48s_v3,Standard_E64-16s_v3,Standard_E64-32s_v3,Standard_E64s_v3,Standard_A0,Standard_A1,Standard_A2,Standard_A3,Standard_A5,Standard_A4,Standard_A6,Standard_A7,Basic_A0,Basic_A1,Basic_A2,Basic_A3,Basic_A4,Standard_NC4as_T4_v3,
Standard_NC8as_T4_v3,Standard_NC16as_T4_v3,Standard_NC64as_T4_v3,Standard_M64,Standard_M64m,Standard_M128,Standard_M128m,Standard_M8-2ms,Standard_M8-4ms,Standard_M8ms,Standard_M16-4ms,Standard_M16-8ms,Standard_M16ms,Standard_M32-8ms,Standard_M32-16ms,Standard_M32ls,Standard_M32ms,Standard_M32ts,Standard_M64-16ms,Standard_M64-32ms,Standard_M64ls,Standard_M64ms,Standard_M64s,Standard_M128-32ms,Standard_M128-64ms,Standard_M128ms,Standard_M128s,Standard_M32ms_v2,Standard_M64ms_v2,Standard_M64s_v2,Standard_M128ms_v2,Standard_M128s_v2,Standard_M192ims_v2,Standard_M192is_v2,Standard_M32dms_v2,Standard_M64dms_v2,Standard_M64ds_v2,Standard_M128dms_v2,Standard_M128ds_v2,Standard_M192idms_v2,Standard_M192ids_v2,Standard_E64i_v3,Standard_E64is_v3,Standard_D1,Standard_D2,Standard_D3,Standard_D4,Standard_D11,Standard_D12,Standard_D13,Standard_D14,Standard_DS1,Standard_DS2,Standard_DS3,Standard_DS4,Standard_DS11,Standard_DS12,Standard_DS13,Standard_DS14,Standard_DC8_v2,Standard_DC1s_v2,Standard_DC2s_v2,Standard_DC4s_v2,Standard_L8s_v2,Standard_L16s_v2,Standard_L32s_v2,Standard_L48s_v2,Standard_L64s_v2,Standard_L80s_v2,Standard_NV4as_v4,Standard_NV8as_v4,Standard_NV16as_v4,Standard_NV32as_v4,Standard_G1,Standard_G2,Standard_G3,Standard_G4,Standard_G5,Standard_GS1,Standard_GS2,Standard_GS3,Standard_GS4,Standard_GS4-4,Standard_GS4-8,Standard_GS5,Standard_GS5-8,Standard_GS5-16,Standard_L4s,Standard_L8s,Standard_L16s,Standard_L32s,Standard_DC2as_v5,Standard_DC4as_v5,Standard_DC8as_v5,Standard_DC16as_v5,Standard_DC32as_v5,Standard_DC48as_v5,Standard_DC64as_v5,Standard_DC96as_v5,Standard_DC2ads_v5,Standard_DC4ads_v5,Standard_DC8ads_v5,Standard_DC16ads_v5,Standard_DC32ads_v5,Standard_DC48ads_v5,Standard_DC64ads_v5,Standard_DC96ads_v5,Standard_EC2as_v5,Standard_EC4as_v5,Standard_EC8as_v5,Standard_EC16as_v5,Standard_EC20as_v5,Standard_EC32as_v5,Standard_EC48as_v5,Standard_EC64as_v5,Standard_EC96as_v5,Standard_EC96ias_v5,Standard_EC2ads_v5,Standard_EC4ads_v5,Standard_EC8ads_v5,Standard_EC16ads_v5,Standard_EC20ads_v5,Standard_EC32ads_v5,Standard_EC48ads_v5,Standard_EC64ads_v5,Standard_EC96ads_v5,Standard_EC96iads_v5,Standard_D2ds_v5,Standard_D4ds_v5,Standard_D8ds_v5,Standard_D16ds_v5,Standard_D32ds_v5,Standard_D48ds_v5,Standard_D64ds_v5,Standard_D96ds_v5,Standard_D2d_v5,Standard_D4d_v5,Standard_D8d_v5,Standard_D16d_v5,Standard_D32d_v5,Standard_D48d_v5,Standard_D64d_v5,Standard_D96d_v5,Standard_D2s_v5,Standard_D4s_v5,Standard_D8s_v5,Standard_D16s_v5,Standard_D32s_v5,Standard_D48s_v5,Standard_D64s_v5,Standard_D96s_v5,Standard_D2_v5,Standard_D4_v5,Standard_D8_v5,Standard_D16_v5,Standard_D32_v5,Standard_D48_v5,Standard_D64_v5,Standard_D96_v5,Standard_E2ds_v5,Standard_E4-2ds_v5,Standard_E4ds_v5,Standard_E8-2ds_v5,Standard_E8-4ds_v5,Standard_E8ds_v5,Standard_E16-4ds_v5,Standard_E16-8ds_v5,Standard_E16ds_v5,Standard_E20ds_v5,Standard_E32-8ds_v5,Standard_E32-16ds_v5,Standard_E32ds_v5,Standard_E48ds_v5,Standard_E64-16ds_v5,Standard_E64-32ds_v5,Standard_E64ds_v5,Standard_E96-24ds_v5,Standard_E96-48ds_v5,Standard_E96ds_v5,Standard_E104ids_v5,Standard_E2d_v5,Standard_E4d_v5,Standard_E8d_v5,Standard_E16d_v5,Standard_E20d_v5,Standard_E32d_v5,Standard_E48d_v5,Standard_E64d_v5,Standard_E96d_v5,Standard_E104id_v5,Standard_E2s_v5,Standard_E4-2s_v5,Standard_E4s_v5,Standard_E8-2s_v5,Standard_E8-4s_v5,Standard_E8s_v5,Standard_E16-4s_v5,Standard_E16-8s_v5,Standard_E16s_v5,Standard_E20s_v5,Standard_E32-8s_v5,Standard_E32-16s_v5,Standard_E32s_v5,Standard_E48s_v5,Standard_E64-16s_v5,Standard_E64-32s_v5,Standard_E64s_v5,Standard_E96-24s_v5,
Standard_E96-48s_v5,Standard_E96s_v5,Standard_E104is_v5,Standard_E2_v5,Standard_E4_v5,Standard_E8_v5,Standard_E16_v5,Standard_E20_v5,Standard_E32_v5,Standard_E48_v5,Standard_E64_v5,Standard_E96_v5,Standard_E104i_v5,Standard_E2bs_v5,Standard_E4bs_v5,Standard_E8bs_v5,Standard_E16bs_v5,Standard_E32bs_v5,Standard_E48bs_v5,Standard_E64bs_v5,Standard_E2bds_v5,Standard_E4bds_v5,Standard_E8bds_v5,Standard_E16bds_v5,Standard_E32bds_v5,Standard_E48bds_v5,Standard_E64bds_v5,Standard_D2a_v4,Standard_D4a_v4,Standard_D8a_v4,Standard_D16a_v4,Standard_D32a_v4,Standard_D48a_v4,Standard_D64a_v4,Standard_D96a_v4,Standard_D2as_v4,Standard_D4as_v4,Standard_D8as_v4,Standard_D16as_v4,Standard_D32as_v4,Standard_D48as_v4,Standard_D64as_v4,Standard_D96as_v4,Standard_E2a_v4,Standard_E4a_v4,Standard_E8a_v4,Standard_E16a_v4,Standard_E20a_v4,Standard_E32a_v4,Standard_E48a_v4,Standard_E64a_v4,Standard_E96a_v4,Standard_E2as_v4,Standard_E4-2as_v4,Standard_E4as_v4,Standard_E8-2as_v4,Standard_E8-4as_v4,Standard_E8as_v4,Standard_E16-4as_v4,Standard_E16-8as_v4,Standard_E16as_v4,Standard_E20as_v4,Standard_E32-8as_v4,Standard_E32-16as_v4,Standard_E32as_v4,Standard_E48as_v4,Standard_E64-16as_v4,Standard_E64-32as_v4,Standard_E64as_v4,Standard_E96-24as_v4,Standard_E96-48as_v4,Standard_E96as_v4,Standard_E96ias_v4,Standard_NC6s_v3,Standard_NC12s_v3,Standard_NC24rs_v3,Standard_NC24s_v3,Standard_NV6s_v2,Standard_NV12s_v2,Standard_NV24s_v2,Standard_NV12s_v3,Standard_NV24s_v3,Standard_NV48s_v3,Standard_H8,Standard_H8_Promo,Standard_H16,Standard_H16_Promo,Standard_H8m,Standard_H8m_Promo,Standard_H16m,Standard_H16m_Promo,Standard_H16r,Standard_H16r_Promo,Standard_H16mr,Standard_H16mr_Promo,Standard_M208ms_v2,Standard_M208s_v2,Standard_M416-208s_v2,Standard_M416s_v2,Standard_M416-208ms_v2,Standard_M416ms_v2,Standard_DC1s_v3,Standard_DC2s_v3,Standard_DC4s_v3,Standard_DC8s_v3,Standard_DC16s_v3,Standard_DC24s_v3,Standard_DC32s_v3,Standard_DC48s_v3,Standard_DC1ds_v3,Standard_DC2ds_v3,Standard_DC4ds_v3,Standard_DC8ds_v3,Standard_DC16ds_v3,Standard_DC24ds_v3,Standard_DC32ds_v3,Standard_DC48ds_v3,Standard_NC24ads_A100_v4,Standard_NC48ads_A100_v4,Standard_NC96ads_A100_v4,Standard_D2as_v5,Standard_D4as_v5,Standard_D8as_v5,Standard_D16as_v5,Standard_D32as_v5,Standard_D48as_v5,Standard_D64as_v5,Standard_D96as_v5,Standard_E2as_v5,Standard_E4-2as_v5,Standard_E4as_v5,Standard_E8-2as_v5,Standard_E8-4as_v5,Standard_E8as_v5,Standard_E16-4as_v5,Standard_E16-8as_v5,Standard_E16as_v5,Standard_E20as_v5,Standard_E32-8as_v5,Standard_E32-16as_v5,Standard_E32as_v5,Standard_E48as_v5,Standard_E64-16as_v5,Standard_E64-32as_v5,Standard_E64as_v5,Standard_E96-24as_v5,Standard_E96-48as_v5,Standard_E96as_v5,Standard_E112ias_v5,Standard_D2ads_v5,Standard_D4ads_v5,Standard_D8ads_v5,Standard_D16ads_v5,Standard_D32ads_v5,Standard_D48ads_v5,Standard_D64ads_v5,Standard_D96ads_v5,Standard_E2ads_v5,Standard_E4-2ads_v5,Standard_E4ads_v5,Standard_E8-2ads_v5,Standard_E8-4ads_v5,Standard_E8ads_v5,Standard_E16-4ads_v5,Standard_E16-8ads_v5,Standard_E16ads_v5,Standard_E20ads_v5,Standard_E32-8ads_v5,Standard_E32-16ads_v5,Standard_E32ads_v5,Standard_E48ads_v5,Standard_E64-16ads_v5,Standard_E64-32ads_v5,Standard_E64ads_v5,Standard_E96-24ads_v5,Standard_E96-48ads_v5,Standard_E96ads_v5,Standard_E112iads_v5,Standard_L8s_v3,Standard_L16s_v3,Standard_L32s_v3,Standard_L48s_v3,Standard_L64s_v3,Standard_L80s_v3. Find out more on the valid VM sizes in each region at https://aka.ms/azure-regionservices." 
Target="vmSize"' errorReason: InvalidConfiguration lastUpdated: "2022-09-29T08:50:19Z" phase: Failed providerStatus: conditions: - lastProbeTime: "2022-09-29T08:50:19Z" lastTransitionTime: "2022-09-29T08:50:19Z" message: 'failed to create vm jfan49-jn66b-worker-1-tdpdt: failure sending request for machine jfan49-jn66b-worker-1-tdpdt: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="The value invalid provided for the VM size is not valid. The valid sizes in the current region are: Standard_B1ls,Standard_B1ms,Standard_B1s,Standard_B2ms,Standard_B2s,Standard_B4ms,Standard_B8ms,Standard_B12ms,Standard_B16ms,Standard_B20ms,Standard_E2_v4,Standard_E4_v4,Standard_E8_v4,Standard_E16_v4,Standard_E20_v4,Standard_E32_v4,Standard_E2d_v4,Standard_E4d_v4,Standard_E8d_v4,Standard_E16d_v4,Standard_E20d_v4,Standard_E32d_v4,Standard_E2s_v4,Standard_E4-2s_v4,Standard_E4s_v4,Standard_E8-2s_v4,Standard_E8-4s_v4,Standard_E8s_v4,Standard_E16-4s_v4,Standard_E16-8s_v4,Standard_E16s_v4,Standard_E20s_v4,Standard_E32-8s_v4,Standard_E32-16s_v4,Standard_E32s_v4,Standard_E2ds_v4,Standard_E4-2ds_v4,Standard_E4ds_v4,Standard_E8-2ds_v4,Standard_E8-4ds_v4,Standard_E8ds_v4,Standard_E16-4ds_v4,Standard_E16-8ds_v4,Standard_E16ds_v4,Standard_E20ds_v4,Standard_E32-8ds_v4,Standard_E32-16ds_v4,Standard_E32ds_v4,Standard_D2d_v4,Standard_D4d_v4,Standard_D8d_v4,Standard_D16d_v4,Standard_D32d_v4,Standard_D48d_v4,Standard_D64d_v4,Standard_D2_v4,Standard_D4_v4,Standard_D8_v4,Standard_D16_v4,Standard_D32_v4,Standard_D48_v4,Standard_D64_v4,Standard_D2ds_v4,Standard_D4ds_v4,Standard_D8ds_v4,Standard_D16ds_v4,Standard_D32ds_v4,Standard_D48ds_v4,Standard_D64ds_v4,Standard_D2s_v4,Standard_D4s_v4,Standard_D8s_v4,Standard_D16s_v4,Standard_D32s_v4,Standard_D48s_v4,Standard_D64s_v4,Standard_D1_v2,Standard_D2_v2,Standard_D3_v2,Standard_D4_v2,Standard_D5_v2,Standard_D11_v2,Standard_D12_v2,Standard_D13_v2,Standard_D14_v2,Standard_D15_v2,Standard_D2_v2_Promo,Standard_D3_v2_Promo,Standard_D4_v2_Promo,Standard_D5_v2_Promo,Standard_D11_v2_Promo,Standard_D12_v2_Promo,Standard_D13_v2_Promo,Standard_D14_v2_Promo,Standard_F1,Standard_F2,Standard_F4,Standard_F8,Standard_F16,Standard_DS1_v2,Standard_DS2_v2,Standard_DS3_v2,Standard_DS4_v2,Standard_DS5_v2,Standard_DS11-1_v2,Standard_DS11_v2,Standard_DS12-1_v2,Standard_DS12-2_v2,Standard_DS12_v2,Standard_DS13-2_v2,Standard_DS13-4_v2,Standard_DS13_v2,Standard_DS14-4_v2,Standard_DS14-8_v2,Standard_DS14_v2,Standard_DS15_v2,Standard_DS2_v2_Promo,Standard_DS3_v2_Promo,Standard_DS4_v2_Promo,Standard_DS5_v2_Promo,Standard_DS11_v2_Promo,Standard_DS12_v2_Promo,Standard_DS13_v2_Promo,Standard_DS14_v2_Promo,Standard_F1s,Standard_F2s,Standard_F4s,Standard_F8s,Standard_F16s,Standard_A1_v2,Standard_A2m_v2,Standard_A2_v2,Standard_A4m_v2,Standard_A4_v2,Standard_A8m_v2,Standard_A8_v2,Standard_D2_v3,Standard_D4_v3,Standard_D8_v3,Standard_D16_v3,Standard_D32_v3,Standard_D48_v3,Standard_D64_v3,Standard_D2s_v3,Standard_D4s_v3,Standard_D8s_v3,Standard_D16s_v3,Standard_D32s_v3,Standard_D48s_v3,Standard_D64s_v3,Standard_E2_v3,Standard_E4_v3,Standard_E8_v3,Standard_E16_v3,Standard_E20_v3,Standard_E32_v3,Standard_E2s_v3,Standard_E4-2s_v3,Standard_E4s_v3,Standard_E8-2s_v3,Standard_E8-4s_v3,Standard_E8s_v3,Standard_E16-4s_v3,Standard_E16-8s_v3,Standard_E16s_v3,Standard_E20s_v3,Standard_E32-8s_v3,Standard_E32-16s_v3,Standard_E32s_v3,Standard_F2s_v2,Standard_F4s_v2,Standard_F8s_v2,Standard_F16s_v2,Standard_F32s_v2,Standard_F48s_v2,Standa
rd_F64s_v2,Standard_F72s_v2,Standard_E48_v4,Standard_E64_v4,Standard_E48d_v4,Standard_E64d_v4,Standard_E48s_v4,Standard_E64-16s_v4,Standard_E64-32s_v4,Standard_E64s_v4,Standard_E80is_v4,Standard_E48ds_v4,Standard_E64-16ds_v4,Standard_E64-32ds_v4,Standard_E64ds_v4,Standard_E80ids_v4,Standard_E48_v3,Standard_E64_v3,Standard_E48s_v3,Standard_E64-16s_v3,Standard_E64-32s_v3,Standard_E64s_v3,Standard_A0,Standard_A1,Standard_A2,Standard_A3,Standard_A5,Standard_A4,Standard_A6,Standard_A7,Basic_A0,Basic_A1,Basic_A2,Basic_A3,Basic_A4,Standard_NC4as_T4_v3,Standard_NC8as_T4_v3,Standard_NC16as_T4_v3,Standard_NC64as_T4_v3,Standard_M64,Standard_M64m,Standard_M128,Standard_M128m,Standard_M8-2ms,Standard_M8-4ms,Standard_M8ms,Standard_M16-4ms,Standard_M16-8ms,Standard_M16ms,Standard_M32-8ms,Standard_M32-16ms,Standard_M32ls,Standard_M32ms,Standard_M32ts,Standard_M64-16ms,Standard_M64-32ms,Standard_M64ls,Standard_M64ms,Standard_M64s,Standard_M128-32ms,Standard_M128-64ms,Standard_M128ms,Standard_M128s,Standard_M32ms_v2,Standard_M64ms_v2,Standard_M64s_v2,Standard_M128ms_v2,Standard_M128s_v2,Standard_M192ims_v2,Standard_M192is_v2,Standard_M32dms_v2,Standard_M64dms_v2,Standard_M64ds_v2,Standard_M128dms_v2,Standard_M128ds_v2,Standard_M192idms_v2,Standard_M192ids_v2,Standard_E64i_v3,Standard_E64is_v3,Standard_D1,Standard_D2,Standard_D3,Standard_D4,Standard_D11,Standard_D12,Standard_D13,Standard_D14,Standard_DS1,Standard_DS2,Standard_DS3,Standard_DS4,Standard_DS11,Standard_DS12,Standard_DS13,Standard_DS14,Standard_DC8_v2,Standard_DC1s_v2,Standard_DC2s_v2,Standard_DC4s_v2,Standard_L8s_v2,Standard_L16s_v2,Standard_L32s_v2,Standard_L48s_v2,Standard_L64s_v2,Standard_L80s_v2,Standard_NV4as_v4,Standard_NV8as_v4,Standard_NV16as_v4,Standard_NV32as_v4,Standard_G1,Standard_G2,Standard_G3,Standard_G4,Standard_G5,Standard_GS1,Standard_GS2,Standard_GS3,Standard_GS4,Standard_GS4-4,Standard_GS4-8,Standard_GS5,Standard_GS5-8,Standard_GS5-16,Standard_L4s,Standard_L8s,Standard_L16s,Standard_L32s,Standard_DC2as_v5,Standard_DC4as_v5,Standard_DC8as_v5,Standard_DC16as_v5,Standard_DC32as_v5,Standard_DC48as_v5,Standard_DC64as_v5,Standard_DC96as_v5,Standard_DC2ads_v5,Standard_DC4ads_v5,Standard_DC8ads_v5,Standard_DC16ads_v5,Standard_DC32ads_v5,Standard_DC48ads_v5,Standard_DC64ads_v5,Standard_DC96ads_v5,Standard_EC2as_v5,Standard_EC4as_v5,Standard_EC8as_v5,Standard_EC16as_v5,Standard_EC20as_v5,Standard_EC32as_v5,Standard_EC48as_v5,Standard_EC64as_v5,Standard_EC96as_v5,Standard_EC96ias_v5,Standard_EC2ads_v5,Standard_EC4ads_v5,Standard_EC8ads_v5,Standard_EC16ads_v5,Standard_EC20ads_v5,Standard_EC32ads_v5,Standard_EC48ads_v5,Standard_EC64ads_v5,Standard_EC96ads_v5,Standard_EC96iads_v5,Standard_D2ds_v5,Standard_D4ds_v5,Standard_D8ds_v5,Standard_D16ds_v5,Standard_D32ds_v5,Standard_D48ds_v5,Standard_D64ds_v5,Standard_D96ds_v5,Standard_D2d_v5,Standard_D4d_v5,Standard_D8d_v5,Standard_D16d_v5,Standard_D32d_v5,Standard_D48d_v5,Standard_D64d_v5,Standard_D96d_v5,Standard_D2s_v5,Standard_D4s_v5,Standard_D8s_v5,Standard_D16s_v5,Standard_D32s_v5,Standard_D48s_v5,Standard_D64s_v5,Standard_D96s_v5,Standard_D2_v5,Standard_D4_v5,Standard_D8_v5,Standard_D16_v5,Standard_D32_v5,Standard_D48_v5,Standard_D64_v5,Standard_D96_v5,Standard_E2ds_v5,Standard_E4-2ds_v5,Standard_E4ds_v5,Standard_E8-2ds_v5,Standard_E8-4ds_v5,Standard_E8ds_v5,Standard_E16-4ds_v5,Standard_E16-8ds_v5,Standard_E16ds_v5,Standard_E20ds_v5,Standard_E32-8ds_v5,Standard_E32-16ds_v5,Standard_E32ds_v5,Standard_E48ds_v5,Standard_E64-16ds_v5,Standard_E64-32ds_v5,Standard_E64ds_v5,Standard_E96-24ds_v5,Sta
ndard_E96-48ds_v5,Standard_E96ds_v5,Standard_E104ids_v5,Standard_E2d_v5,Standard_E4d_v5,Standard_E8d_v5,Standard_E16d_v5,Standard_E20d_v5,Standard_E32d_v5,Standard_E48d_v5,Standard_E64d_v5,Standard_E96d_v5,Standard_E104id_v5,Standard_E2s_v5,Standard_E4-2s_v5,Standard_E4s_v5,Standard_E8-2s_v5,Standard_E8-4s_v5,Standard_E8s_v5,Standard_E16-4s_v5,Standard_E16-8s_v5,Standard_E16s_v5,Standard_E20s_v5,Standard_E32-8s_v5,Standard_E32-16s_v5,Standard_E32s_v5,Standard_E48s_v5,Standard_E64-16s_v5,Standard_E64-32s_v5,Standard_E64s_v5,Standard_E96-24s_v5,Standard_E96-48s_v5,Standard_E96s_v5,Standard_E104is_v5,Standard_E2_v5,Standard_E4_v5,Standard_E8_v5,Standard_E16_v5,Standard_E20_v5,Standard_E32_v5,Standard_E48_v5,Standard_E64_v5,Standard_E96_v5,Standard_E104i_v5,Standard_E2bs_v5,Standard_E4bs_v5,Standard_E8bs_v5,Standard_E16bs_v5,Standard_E32bs_v5,Standard_E48bs_v5,Standard_E64bs_v5,Standard_E2bds_v5,Standard_E4bds_v5,Standard_E8bds_v5,Standard_E16bds_v5,Standard_E32bds_v5,Standard_E48bds_v5,Standard_E64bds_v5,Standard_D2a_v4,Standard_D4a_v4,Standard_D8a_v4,Standard_D16a_v4,Standard_D32a_v4,Standard_D48a_v4,Standard_D64a_v4,Standard_D96a_v4,Standard_D2as_v4,Standard_D4as_v4,Standard_D8as_v4,Standard_D16as_v4,Standard_D32as_v4,Standard_D48as_v4,Standard_D64as_v4,Standard_D96as_v4,Standard_E2a_v4,Standard_E4a_v4,Standard_E8a_v4,Standard_E16a_v4,Standard_E20a_v4,Standard_E32a_v4,Standard_E48a_v4,Standard_E64a_v4,Standard_E96a_v4,Standard_E2as_v4,Standard_E4-2as_v4,Standard_E4as_v4,Standard_E8-2as_v4,Standard_E8-4as_v4,Standard_E8as_v4,Standard_E16-4as_v4,Standard_E16-8as_v4,Standard_E16as_v4,Standard_E20as_v4,Standard_E32-8as_v4,Standard_E32-16as_v4,Standard_E32as_v4,Standard_E48as_v4,Standard_E64-16as_v4,Standard_E64-32as_v4,Standard_E64as_v4,Standard_E96-24as_v4,Standard_E96-48as_v4,Standard_E96as_v4,Standard_E96ias_v4,Standard_NC6s_v3,Standard_NC12s_v3,Standard_NC24rs_v3,Standard_NC24s_v3,Standard_NV6s_v2,Standard_NV12s_v2,Standard_NV24s_v2,Standard_NV12s_v3,Standard_NV24s_v3,Standard_NV48s_v3,Standard_H8,Standard_H8_Promo,Standard_H16,Standard_H16_Promo,Standard_H8m,Standard_H8m_Promo,Standard_H16m,Standard_H16m_Promo,Standard_H16r,Standard_H16r_Promo,Standard_H16mr,Standard_H16mr_Promo,Standard_M208ms_v2,Standard_M208s_v2,Standard_M416-208s_v2,Standard_M416s_v2,Standard_M416-208ms_v2,Standard_M416ms_v2,Standard_DC1s_v3,Standard_DC2s_v3,Standard_DC4s_v3,Standard_DC8s_v3,Standard_DC16s_v3,Standard_DC24s_v3,Standard_DC32s_v3,Standard_DC48s_v3,Standard_DC1ds_v3,Standard_DC2ds_v3,Standard_DC4ds_v3,Standard_DC8ds_v3,Standard_DC16ds_v3,Standard_DC24ds_v3,Standard_DC32ds_v3,Standard_DC48ds_v3,Standard_NC24ads_A100_v4,Standard_NC48ads_A100_v4,Standard_NC96ads_A100_v4,Standard_D2as_v5,Standard_D4as_v5,Standard_D8as_v5,Standard_D16as_v5,Standard_D32as_v5,Standard_D48as_v5,Standard_D64as_v5,Standard_D96as_v5,Standard_E2as_v5,Standard_E4-2as_v5,Standard_E4as_v5,Standard_E8-2as_v5,Standard_E8-4as_v5,Standard_E8as_v5,Standard_E16-4as_v5,Standard_E16-8as_v5,Standard_E16as_v5,Standard_E20as_v5,Standard_E32-8as_v5,Standard_E32-16as_v5,Standard_E32as_v5,Standard_E48as_v5,Standard_E64-16as_v5,Standard_E64-32as_v5,Standard_E64as_v5,Standard_E96-24as_v5,Standard_E96-48as_v5,Standard_E96as_v5,Standard_E112ias_v5,Standard_D2ads_v5,Standard_D4ads_v5,Standard_D8ads_v5,Standard_D16ads_v5,Standard_D32ads_v5,Standard_D48ads_v5,Standard_D64ads_v5,Standard_D96ads_v5,Standard_E2ads_v5,Standard_E4-2ads_v5,Standard_E4ads_v5,Standard_E8-2ads_v5,Standard_E8-4ads_v5,Standard_E8ads_v5,Standard_E16-4ads_v5,Standard_E16-8ads_v5,Standard
_E16ads_v5,Standard_E20ads_v5,Standard_E32-8ads_v5,Standard_E32-16ads_v5,Standard_E32ads_v5,Standard_E48ads_v5,Standard_E64-16ads_v5,Standard_E64-32ads_v5,Standard_E64ads_v5,Standard_E96-24ads_v5,Standard_E96-48ads_v5,Standard_E96ads_v5,Standard_E112iads_v5,Standard_L8s_v3,Standard_L16s_v3,Standard_L32s_v3,Standard_L48s_v3,Standard_L64s_v3,Standard_L80s_v3. Find out more on the valid VM sizes in each region at https://aka.ms/azure-regionservices." Target="vmSize"' reason: MachineCreationFailed status: "True" type: MachineCreated metadata: {}
Description of problem:
Follow-up of: https://issues.redhat.com/browse/SDN-2988
This failure is perma-failing in the e2e-metal-ipi-ovn-dualstack-local-gateway jobs.
Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway/1597574181430497280
Search CI: https://search.ci.openshift.org/?search=when+using+openshift+ovn-kubernetes+should+ensure+egressfirewall+is+created&maxAge=336h&context=1&type=junit&name=e2e-metal-ipi-ovn-dualstack-local-gateway&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Sippy: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-dualstack-local-gateway%22%7D%5D%7D
Version-Release number of selected component (if applicable):
4.12,4.13
How reproducible:
Every time
Steps to Reproduce:
1. Set up a dualstack KinD cluster
2. Create an egress firewall policy with a Deny rule for CIDR 0.0.0.0/0 (see the manifest sketch below)
3. Create a pod and ping 1.1.1.1
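For step 2, a minimal sketch of such a policy, assuming the standard OVN-Kubernetes EgressFirewall CRD (OVN-Kubernetes requires the object name "default"; the namespace here is an illustrative placeholder):
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default      # OVN-Kubernetes expects the name "default"
  namespace: test    # illustrative namespace
spec:
  egress:
  - type: Deny
    to:
      cidrSelector: 0.0.0.0/0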
Actual results:
Egress policy does not block flows to external IP
Expected results:
Egress policy blocks flows to external IP
Additional info:
It seems that mixing IPv4 and IPv6 operands in ACL matches doesn't work.
Our Prometheus alerts are inconsistent with both upstream and sometimes our own vendor folder. Let's do a clean update run before the next release is branched off.
This is a clone of issue OCPBUGS-3508. The following is the description of the original issue:
—
Exposed via the fact that the periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4 job is at 0% for at least the past two weeks over approximately 65 runs.
Testgrid shows that this job started failing in a very consistent way on Oct 25th at about 8am UTC: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.12-informing#periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4
6 disruption tests fail, all with alarming consistency, virtually always claiming exactly 8s of disruption against a max allowed of 1s.
And then openshift-tests.[sig-arch] events should not repeat pathologically fails with an odd signature:
{ 6 events happened too frequently
event happened 35 times, something is wrong: node/master-2 - reason/NodeHasNoDiskPressure roles/control-plane,master Node master-2 status is now: NodeHasNoDiskPressure
event happened 35 times, something is wrong: node/master-2 - reason/NodeHasSufficientMemory roles/control-plane,master Node master-2 status is now: NodeHasSufficientMemory
event happened 35 times, something is wrong: node/master-2 - reason/NodeHasSufficientPID roles/control-plane,master Node master-2 status is now: NodeHasSufficientPID
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasNoDiskPressure roles/control-plane,master Node master-1 status is now: NodeHasNoDiskPressure
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasSufficientMemory roles/control-plane,master Node master-1 status is now: NodeHasSufficientMemory
event happened 35 times, something is wrong: node/master-1 - reason/NodeHasSufficientPID roles/control-plane,master Node master-1 status is now: NodeHasSufficientPID }
The two types of tests started failing together exactly, and the disruption measurements are bizarrely consistent: every single time we see precisely 8s for kube-api, cache-kube-api, openshift-api, cache-openshift-api, oauth-api, and cache-oauth-api. It's always these 6, and it seems to be always exactly 8 seconds. I cannot state enough how strange this is. It almost implies that something is happening on a very consistent schedule.
Occasionally these are accompanied by 1-2s of disruption for those backends with new connections, but sometimes not.
It looks like all of the disruption consistently happens within two very long tests:
4s within: [sig-network] services when running openshift ipv4 cluster ensures external ip policy is configured correctly on the cluster [Serial] [Suite:openshift/conformance/serial]
4s within: [sig-network] services when running openshift ipv4 cluster on bare metal [apigroup:config.openshift.io] ensures external auto assign cidr is configured correctly on the cluster [Serial] [Suite:openshift/conformance/serial]
Both tests appear to have run prior to Oct 25, so I don't think it's a matter of new tests breaking something or getting unskipped. Both tests also always pass, yet they appear to be impacting the cluster.
The masters going NotReady also appears to fall within the above two tests, though it does not seem to directly match when we measure disruption; bear in mind there's a 40s delay before a node goes NotReady.
Focusing on https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-ipv4/1590640492373086208 where the above are from:
Two of the three master nodes appear to be going NodeNotReady a couple of times throughout the run, as visible in the spyglass chart under the node state row on the left. master-0 does not appear here, but it does exist. (I suspect it holds the leader lease and is thus the node reporting the others going NotReady.)
From the master-0 kubelet log in must-gather we can see one of these examples where it reports that master-2 has not checked in:
2022-11-10T10:38:35.874090961Z I1110 10:38:35.873975 1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.00700561s. Last Ready is: &NodeCondition{Type:Ready,Status:True,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletReady,Message:kubelet is posting ready status,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874056 1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007097549s. Last MemoryPressure is: &NodeCondition{Type:MemoryPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasSufficientMemory,Message:kubelet has sufficient memory available,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874067 1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007110285s. Last DiskPressure is: &NodeCondition{Type:DiskPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasNoDiskPressure,Message:kubelet has no disk pressure,}
2022-11-10T10:38:35.874090961Z I1110 10:38:35.874076 1 node_lifecycle_controller.go:1137] node master-2 hasn't been updated for 40.007119541s. Last PIDPressure is: &NodeCondition{Type:PIDPressure,Status:False,LastHeartbeatTime:2022-11-10 10:36:10 +0000 UTC,LastTransitionTime:2022-11-10 10:29:11 +0000 UTC,Reason:KubeletHasSufficientPID,Message:kubelet has sufficient PID available,}
2022-11-10T10:38:35.881749410Z I1110 10:38:35.881705 1 controller_utils.go:181] "Recording status change event message for node" status="NodeNotReady" node="master-2"
2022-11-10T10:38:35.881749410Z I1110 10:38:35.881733 1 controller_utils.go:120] "Update ready status of pods on node" node="master-2"
2022-11-10T10:38:35.881820988Z I1110 10:38:35.881799 1 controller_utils.go:138] "Updating ready status of pod to false" pod="metal3-b7b69fdbb-rfbdj"
2022-11-10T10:38:35.881893234Z I1110 10:38:35.881858 1 topologycache.go:179] Ignoring node master-2 because it has an excluded label
2022-11-10T10:38:35.881893234Z W1110 10:38:35.881886 1 topologycache.go:199] Can't get CPU or zone information for worker-0 node
2022-11-10T10:38:35.881903023Z I1110 10:38:35.881892 1 topologycache.go:215] Insufficient node info for topology hints (0 zones, %!s(int64=0) CPU, false)
2022-11-10T10:38:35.881932172Z I1110 10:38:35.881917 1 controller.go:271] Node changes detected, triggering a full node sync on all loadbalancer services
2022-11-10T10:38:35.882290428Z I1110 10:38:35.882270 1 event.go:294] "Event occurred" object="master-2" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="NodeNotReady" message="Node master-2 status is now: NodeNotReady"
Now from master-2's kubelet log around that time, 40 seconds earlier puts us at 10:37:55, so we'd be looking for something odd around there.
A few potential lines:
Nov 10 10:37:55.232537 master-2 kubenswrapper[1930]: I1110 10:37:55.232495 1930 patch_prober.go:29] interesting pod/kube-controller-manager-guard-master-2 container/guard namespace/openshift-kube-controller-manager: Readiness probe status=failure output="Get \"https://192.168.111.22:10257/healthz\": dial tcp 192.168.111.22:10257: connect: connection refused" start-of-body=
Nov 10 10:37:55.232537 master-2 kubenswrapper[1930]: I1110 10:37:55.232549 1930 prober.go:114] "Probe failed" probeType="Readiness" pod="openshift-kube-controller-manager/kube-controller-manager-guard-master-2" podUID=8be2c6c1-f8f6-4bf0-b26d-53ce487354bd containerName="guard" probeResult=failure output="Get \"https://192.168.111.22:10257/healthz\": dial tcp 192.168.111.22:10257: connect: connection refused"
Nov 10 10:38:12.238273 master-2 kubenswrapper[1930]: E1110 10:38:12.238229 1930 controller.go:187] failed to update lease, error: Put "https://api-int.ostest.test.metalkube.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master-2?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Nov 10 10:38:13.034109 master-2 kubenswrapper[1930]: E1110 10:38:13.034077 1930 kubelet_node_status.go:487] "Error updating node status, will retry" err="error getting node \"master-2\": Get \"https://api-int.ostest.test.metalkube.org:6443/api/v1/nodes/master-2?resourceVersion=0&timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
At 10:38:40 all kinds of master-2 watches time out with messages like:
Nov 10 10:38:40.244399 master-2 kubenswrapper[1930]: W1110 10:38:40.244272 1930 reflector.go:347] object-"openshift-oauth-apiserver"/"kube-root-ca.crt": watch of *v1.ConfigMap ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
And then suddenly we're back online:
Nov 10 10:38:40.252149 master-2 kubenswrapper[1930]: I1110 10:38:40.252131 1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasSufficientMemory"
Nov 10 10:38:40.252149 master-2 kubenswrapper[1930]: I1110 10:38:40.252156 1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasNoDiskPressure"
Nov 10 10:38:40.252268 master-2 kubenswrapper[1930]: I1110 10:38:40.252165 1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeHasSufficientPID"
Nov 10 10:38:40.252268 master-2 kubenswrapper[1930]: I1110 10:38:40.252177 1930 kubelet_node_status.go:590] "Recording event message for node" node="master-2" event="NodeReady"
Nov 10 10:38:47.904430 master-2 kubenswrapper[1930]: I1110 10:38:47.904373 1930 kubelet.go:2229] "SyncLoop (probe)" probe="readiness" status="" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:47.904842 master-2 kubenswrapper[1930]: I1110 10:38:47.904662 1930 kubelet.go:2229] "SyncLoop (probe)" probe="startup" status="unhealthy" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:47.907900 master-2 kubenswrapper[1930]: I1110 10:38:47.907872 1930 kubelet.go:2229] "SyncLoop (probe)" probe="startup" status="started" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:48.431448 master-2 kubenswrapper[1930]: I1110 10:38:48.431414 1930 kubelet.go:2229] "SyncLoop (probe)" probe="readiness" status="ready" pod="openshift-kube-controller-manager/kube-controller-manager-master-2"
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764029 1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-scheduler/openshift-kube-scheduler-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764059 1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/keepalived-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764077 1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/coredns-master-2" status=Running
Nov 10 10:38:54.764069 master-2 kubenswrapper[1930]: I1110 10:38:54.764086 1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/haproxy-master-2" status=Running
Nov 10 10:38:54.764492 master-2 kubenswrapper[1930]: I1110 10:38:54.764106 1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-etcd/etcd-master-2" status=Running
Nov 10 10:38:54.764492 master-2 kubenswrapper[1930]: I1110 10:38:54.764113 1930 kubelet_getters.go:182] "Pod status updated" pod="openshift-kube-controller-manager/kube-controller-manager-master-2" status=Running
Also curious:
Nov 10 10:37:50.318237 master-2 ovs-vswitchd[1324]: ovs|00251|connmgr|INFO|br0<->unix#468: 2 flow_mods in the last 0 s (2 deletes)
Nov 10 10:37:50.342965 master-2 ovs-vswitchd[1324]: ovs|00252|connmgr|INFO|br0<->unix#471: 4 flow_mods in the last 0 s (4 deletes)
Nov 10 10:37:50.364271 master-2 ovs-vswitchd[1324]: ovs|00253|bridge|INFO|bridge br0: deleted interface vethcb8d36e6 on port 41
Nov 10 10:37:53.579562 master-2 NetworkManager[1336]: <info> [1668076673.5795] dhcp4 (enp2s0): state changed new lease, address=192.168.111.22
These look like they could be related to the tests these problems appear to coincide with?
Description of problem:
After editing a MachineSet on AWS (just changing an annotation), a warning is shown:
[~] $ oc -n openshift-machine-api edit machineset.machine.openshift.io/ci-ln-hlf4lft-76ef8-p7rc4-worker-us-west-1b
W1111 16:06:32.385856 88719 warnings.go:70] incorrect GroupVersionKind for AWSMachineProviderConfig object: machine.openshift.io/v1beta1, Kind=AWSMachineProviderConfig
machineset.machine.openshift.io/ci-ln-hlf4lft-76ef8-p7rc4-worker-us-west-1b edited
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Add an annotation or label to a machine
Actual results:
There is a warning about incorrect GroupVersionKind for AWSMachineProviderConfig object
Expected results:
No warnings shown
Additional info:
This is a clone of issue OCPBUGS-5287. The following is the description of the original issue:
—
Description of problem:
See https://issues.redhat.com/browse/THREESCALE-9015. A problem with the Red Hat Integration - 3scale - Managed Application Services operator prevents it from installing correctly, which results in the failure of the operator-install-single-namespace.spec.ts integration test.
Description of problem:
Seeing intermittently during cluster installs: the Network operator is stuck in Progressing:
network 4.12.0-0.nightly-2022-10-25-210451 True True False 117m DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
MG: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.5450303633101217331/
iptables-save on master-2 node - http://shell.lab.bos.redhat.com/~anusaxen/iptables-save
Pod events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 129m default-scheduler Successfully assigned openshift-network-diagnostics/network-check-target-gnld6 to qe-anurag114e-9xkz4-master-2.c.openshift-qe.internal
Warning FailedMount 128m (x7 over 129m) kubelet MountVolume.SetUp failed for volume "kube-api-access-kfg5s" : [object "openshift-network-diagnostics"/"kube-root-ca.crt" not registered, object "openshift-network-diagnostics"/"openshift-service-ca.crt" not registered]
Warning NetworkNotReady 128m (x18 over 129m) kubelet network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
Warning ErrorAddingLogicalPort 127m (x2 over 127m) controlplane addLogicalPort failed for openshift-network-diagnostics/network-check-target-gnld6: unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "qe-anurag114e-9xkz4-master-2.c.openshift-qe.internal"
Normal AddedInterface 127m multus Add eth0 [10.130.0.3/23] from ovn-kubernetes
Warning ProbeError 9m (x16 over 71m) kubelet Readiness probe error: Get "http://10.130.0.3:8080/": dial tcp 10.130.0.3:8080: i/o timeout (Client.Timeout exceeded while awaiting headers) body:
Warning ProbeError 4m (x717 over 126m) kubelet Readiness probe error: Get "http://10.130.0.3:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers) body:
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-25-210451
How reproducible:
rare
Steps to Reproduce:
1. Install OCP with OVNKubernetes with hybrid overlay (HO) enabled (see the sketch below):
defaultNetwork:
  type: OVNKubernetes
  ovnKubernetesConfig:
    hybridOverlayConfig:
      hybridClusterNetwork: []
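A minimal sketch of where this fragment lives, assuming the standard cluster Network operator CR (operator.openshift.io/v1); the surrounding fields are illustrative:
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      hybridOverlayConfig:
        # empty list: hybrid overlay enabled without additional hybrid cluster networks
        hybridClusterNetwork: []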
Actual results:
Installation stuck due to network-check-target issue
Expected results:
Installation should succeed
Additional info:
Will add additional logs
1. Proposed title of this feature request
Allow the Ingress log length to be modified when using a sidecar
2. What is the nature and description of the request?
In the past we had RFE-1794, where an option was created to specify the length of the HAProxy log; however, this option was only available when redirecting the log to an external syslog. We need this option to be available when using a sidecar to collect the logs.
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 2
  logging:
    access:
      destination:
        type: Container
        container: {}
Unlike the Syslog type, the Container type does not have any sub-parameters, which makes it impossible to configure the log length.
As we can see in RFE-1794, the option to change the log length already exists in the HAProxy configuration, but when using the sidecar, only the default value (1024) is used.
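As a hedged sketch, the requested knob on the manifest above might look like the following; the maxLength field under container is hypothetical (it is what this RFE asks for), mirroring the maxLength field that RFE-1794 already added under the Syslog destination:
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 2
  logging:
    access:
      destination:
        type: Container
        container:
          maxLength: 8192   # hypothetical: the sidecar log-length option this RFE requests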
3. Why does the customer need this? (List the business requirements here)
The default log length of HAProxy is 1024. When clients communicate with the application using long URI arguments, the full access log and parameter info cannot be captured. An option to set 8192 or higher is required.
4. List any affected packages or components.
When a HostedCluster is configured as `Private`, annotate the necessary hosted CP components (API and OAuth) so that External DNS can still create public DNS records (pointing to private IP resources).
The External DNS record should be pointing to the resource for the PrivateLink VPC Endpoint. "We need to specify the IP of the A record. We can do that with a cluster IP service."
Context: https://redhat-internal.slack.com/archives/C01C8502FMM/p1675432805760719
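A rough sketch of that idea, assuming external-dns's standard hostname annotation; the service name, namespace, hostname, and port wiring are illustrative placeholders, not the actual HyperShift manifests:
apiVersion: v1
kind: Service
metadata:
  name: kube-apiserver-private   # illustrative name
  namespace: clusters-example    # illustrative namespace
  annotations:
    # External DNS can publish an A record for this hostname using the service's cluster IP
    external-dns.alpha.kubernetes.io/hostname: api.example.hypershift.example.com
spec:
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 6443
    targetPort: 6443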
Description of problem:
The health_statuses_insights metric is showing disabled rules in "total". In other fields, it shows the correct amount. In the code linked below, we can see that the "Disabled" rules are only skipped during the value assignment of TotalRisk.
How reproducible:
Always
Steps to Reproduce:
1. Upload a fake archive to trigger health checks (for example with rule CVE_2020_8555_kubernetes)
2. Disable one of the rules through https://console.redhat.com/api/insights-results-aggregator/v1/clusters/{cluster.id}/rules/{rule}/error_key/{error_key}/disable
3. Create the support secret and set endpoint="https://httpstat.us/200" (see the sketch below)
4. Restart the insights operator
5. Wait for alerts to trigger
6. Check the health_statuses_insights metrics
rule:
ccx_rules_ocp.external.rules.ocp_version_end_of_life.report
error_key:
OCP4X_BEYOND_EOL
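For step 3, a minimal sketch of the support secret, assuming the Insights Operator convention of a secret named "support" in openshift-config with an "endpoint" key:
apiVersion: v1
kind: Secret
metadata:
  name: support
  namespace: openshift-config
type: Opaque
stringData:
  # overrides the Insights upload endpoint, per step 3
  endpoint: https://httpstat.us/200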
Actual results:
"moderate" health_statuses_insights shows 2 triggers, while "total" shows 3. Therefore, it is accounting for the deactivated rule.
Expected results:
"moderate" health_statuses_insights shows 2 triggers; "total" health_statuses_insights shows 2 triggers (does not account for the deactivated rule)
Additional info:
If there is any issue in triggering these events, you may contact me and I can help with the steps.
Description of problem:
Each LB created for a Service of type LoadBalancer results in 1 client rule and <# of public subnets> health rules being created. The rules-per-SG quota in AWS is quite small: 60 by default, and 200 hard max. OCP uses about 40 rules OOTB. Assuming an HA cluster in 3 AZs, that is 4 rules per LB. With the default AWS quota, only ~5 LBs can be created ((60 - 40) / 4), and with the hard max of 200, only ~40 LBs ((200 - 40) / 4).
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Create a Service of type LoadBalancer and observe the increase in the master-sg and worker-sg rule sets
Actual results:
4 rules are created
Expected results:
1 rule is created when the client rule is a superset of the per-subnet health rules
Additional info:
This would allow ~4x the number of Services of type LoadBalancer. This is required for HyperShift.
Description of problem:
The metal3 pod does not come up on SNO when creating a Provisioning with provisioningNetwork set to Disabled.
The issue is that on SNO there is no Machine and no BareMetalHost, and the operator looks for Machine objects to populate the provisioningMacAddresses field. However, when provisioningNetwork is Disabled, provisioningMacAddresses is not used anyway.
You can work around this issue by populating provisioningMacAddresses with a dummy address, like this:
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningMacAddresses:
  - aa:aa:aa:aa:aa:aa
  provisioningNetwork: Disabled
  watchAllNamespaces: true
Version-Release number of selected component (if applicable):
4.11.17
How reproducible:
Try to bring up Provisioning on SNO in 4.11.17 with provisioningNetwork set to Disabled:
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  name: provisioning-configuration
spec:
  provisioningNetwork: Disabled
  watchAllNamespaces: true
Steps to Reproduce:
Actual results:
controller/provisioning "msg"="Reconciler error" "error"="machines with cluster-api-machine-role=master not found" "name"="provisioning-configuration" "namespace"="" "reconciler group"="metal3.io" "reconciler kind"="Provisioning"
Expected results:
metal3 pod should be deployed
Additional info:
This issue is a result of this change: https://github.com/openshift/cluster-baremetal-operator/pull/307 See this Slack thread: https://coreos.slack.com/archives/CFP6ST0A3/p1670530729168599
This is a clone of issue OCPBUGS-4969. The following is the description of the original issue:
—
Description of problem:
A ROSA machinepool is created and the label k8s.ovn.org/egress-assignable is added during creation. The newly created nodes are not discovered as egressIP nodes and no egressIP addresses are assigned. It was discovered that removing the k8s.ovn.org/egress-assignable label from the nodes, by editing the machinepool, and subsequently reapplying the label causes the nodes to be discovered as egressIP capable. While it is possible to work around the issue by removing and reapplying the label, this will likely not work with node auto-scaling.
Version-Release number of selected component (if applicable):
4.11.18
How reproducible:
Always
Steps to Reproduce:
1. Create a machinepool and label it for egressIP
$ rosa create machinepool -c brosenbe --name mp-1 --labels k8s.ovn.org/egress-assignable="" --replicas=3
I: Machine pool 'mp-1' created successfully on cluster 'brosenbe'
I: To view all machine pools, run 'rosa list machinepools -c brosenbe'
2. Wait for nodes to be instantiated
$ watch -n 60 oc get nodes -l k8s.ovn.org/egress-assignable
Every 60.0s: oc get nodes -l k8s.ovn.org/egress-assignable brosenbe.syd.csb: Fri Dec 16 15:20:47 2022
NAME STATUS ROLES AGE VERSION
ip-10-0-136-123.ap-southeast-2.compute.internal Ready worker 7m55s v1.24.6+5658434
ip-10-0-178-34.ap-southeast-2.compute.internal Ready worker 7m59s v1.24.6+5658434
ip-10-0-192-110.ap-southeast-2.compute.internal Ready worker 8m v1.24.6+5658434
3. Create the egressip object
$ cat << EOF >egressip.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-group1
spec:
  egressIPs:
  - 10.0.128.152
  - 10.0.160.152
  - 10.0.192.152
  namespaceSelector:
    matchLabels:
      env: dev
EOF
4. Apply the egressip object
$ oc apply -f egressip.yaml
egressip.k8s.ovn.org/egress-group1 created
5. Note that no IP addresses from egressip/egress-group1 have been assigned
$ oc get egressip
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egress-group1 10.0.128.152
$ oc get event -n default | egrep egressip | tail -1
34s Warning NoMatchingNodeFound egressip/egress-group1 no assignable nodes for EgressIP: egress-group1, please tag at least one node with label: k8s.ovn.org/egress-assignable
$ ns=openshift-ovn-kubernetes; for pod in $(oc get pods -n $ns -l app=ovnkube-master -o name); do pod=${pod##*/}; echo $pod; oc logs -n $ns $pod -c ovnkube-master | grep 'No assignable nodes found for EgressIP' | tail -1; done
ovnkube-master-bgz84
ovnkube-master-kzgpc
ovnkube-master-pbtn9
E1216 04:21:50.578203 1 egressip.go:1567] No assignable nodes found for EgressIP: egress-group1 and requested IPs: [10.0.128.152 10.0.160.152 10.0.192.152]
6. Remove the egressIP labels
$ rosa edit machinepool -c brosenbe mp-1 --replicas 3 --labels ''
I: Updated machine pool 'mp-1' on cluster 'brosenbe'
7. Wait a bit for the labels to be removed
$ watch -n 60 oc get nodes -l k8s.ovn.org/egress-assignable
Every 60.0s: oc get nodes -l k8s.ovn.org/egress-assignable brosenbe.syd.csb: Fri Dec 16 15:51:57 2022
No resources found
8. Reapply the label k8s.ovn.org/egress-assignable
$ rosa edit machinepool -c brosenbe mp-1 --replicas 3 --labels k8s.ovn.org/egress-assignable=''
I: Updated machine pool 'mp-1' on cluster 'brosenbe'
9. Wait a while for the labels to be applied
$ watch -n 60 oc get nodes -l k8s.ovn.org/egress-assignable
Every 60.0s: oc get nodes -l k8s.ovn.org/egress-assignable brosenbe.syd.csb: Fri Dec 16 16:00:03 2022
NAME STATUS ROLES AGE VERSION
ip-10-0-136-123.ap-southeast-2.compute.internal Ready worker 47m v1.24.6+5658434
ip-10-0-178-34.ap-southeast-2.compute.internal Ready worker 47m v1.24.6+5658434
ip-10-0-192-110.ap-southeast-2.compute.internal Ready worker 47m v1.24.6+5658434
10. Note that egressIP addresses have now been assigned to nodes
$ oc get egressip egress-group1
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egress-group1 10.0.128.152 ip-10-0-167-202.ap-southeast-2.compute.internal 10.0.160.152
$ oc get egressip egress-group1 -o yaml | yq -y '.status'
items:
- egressIP: 10.0.128.152
  node: ip-10-0-136-123.ap-southeast-2.compute.internal
- egressIP: 10.0.192.152
  node: ip-10-0-192-110.ap-southeast-2.compute.internal
- egressIP: 10.0.160.152
  node: ip-10-0-178-34.ap-southeast-2.compute.internal
Actual results:
EgressIP addresses not applied to nodes with k8s.ovn.org/egress-assignable label
Expected results:
EgressIP addresses are applied to nodes with k8s.ovn.org/egress-assignable label
Additional info:
This is a clone of issue OCPBUGS-4049. The following is the description of the original issue:
—
Description of problem:
In the case of CRC we provision the cluster first, then create the disk image out of it, and that is what we share with our users. Until now we have always removed the pull secret from the cluster after provisioning it, using https://github.com/crc-org/snc/blob/master/snc.sh#L241-L258, and that worked without any issue up to 4.11.x, but with 4.12.0-rc.1 we are seeing that MCO is not able to reconcile.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a single node cluster using cluster bot: `launch 4.12.0-rc.1 aws,single-node`
2. Once the cluster is provisioned, update the pull secret from the config:
$ cat pull-secret.yaml
apiVersion: v1
data:
  .dockerconfigjson: e30K
kind: Secret
metadata:
  name: pull-secret
  namespace: openshift-config
type: kubernetes.io/dockerconfigjson
$ oc replace -f pull-secret.yaml
3. Wait for MCO to reconcile and you will see the failure to reconcile MCO
Actual results:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-66086aa249a9f92b773403f7c3745ea4 False True True 1 0 0 1 94m
worker rendered-worker-0c07becff7d3c982e24257080cc2981b True False False 0 0 0 0 94m
$ oc get co machine-config
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
machine-config 4.12.0-rc.1 True False True 93m Failed to resync 4.12.0-rc.1 because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 0)]
$ oc logs machine-config-daemon-nf9mg -n openshift-machine-config-operator
[...]
I1123 15:00:37.864581 10194 run.go:19] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba
Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: (Mirrors also failed: [quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
W1123 15:00:39.186103 10194 run.go:45] podman failed: running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba failed: Error: initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: (Mirrors also failed: [quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quayio-pull-through-cache-us-west-2-ci.apps.ci.l2s4.p1.openshiftapps.com/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: authentication required]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba: reading manifest sha256:ffa3568233298408421ff7da60e5c594fb63b2551c6ab53843eb51c8cf6838ba in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized : exit status 125; retrying...
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3084. The following is the description of the original issue:
—
Upstream Issue: https://github.com/kubernetes/kubernetes/issues/77603
Long log lines get corrupted by the kubelet when using '--timestamps'.
The root cause is that the buffer reads up to a new line. If the line is greater than 4096 bytes and '--timestamps' is turned on, the kubelet will write the timestamp and the partial log line. We will need to refactor the ReadLogs function to allow for a partial line read.
apiVersion: v1
kind: Pod
metadata:
  name: logs
spec:
  restartPolicy: Never
  containers:
  - name: logs
    image: fedora
    args:
    - bash
    - -c
    - 'for i in `seq 1 10000000`; do echo -n $i; done'
kubectl logs logs --timestamps
Description of problem:
For some reason, some of the packets in a DNS conversation to the openshift-dns/dns-default service cluster IP don't get properly de-NATted, i.e. the reply packet has the pod IP as source IP instead of the service IP.
Version-Release number of selected component (if applicable):
4.10.25
How reproducible:
Sometimes
Steps to Reproduce:
1. Try to resolve DNS with cluster DNS
Actual results:
DNS timeout. Reply packets have the pod IP instead of the service IP the request was sent to.
Expected results:
DNS working.
Additional info:
I'll elaborate about this in the attachments, but I could find nothing wrong in nbdb or any OVN-Kubernetes or OVN logs that rang a bell. The only interesting thing I could see was that `conntrack -L` had no reference to this conversation, so it makes kind of sense that the reply packet address is not translated back to the service IP one, but I have not been able to find the reason of this. The query/response packets can be correlated via DNS transaction ID.
The job was in terrible shape even before, but it looks like upgrades started more consistently failing around Oct 2-4.
It looks like we fully lose the API (service unavailable), no artifacts get gathered, and mass disruption is reported.
Description of the problem:
I installed a cluster with OCS and CNV.
The issue is that cluster events contain repeated messages:
1/9/2022, 6:17:31 PM Operator ocs status: available message: install strategy completed with no errors
1/9/2022, 6:17:30 PM Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:17:30 PM Operator cnv status: available message: install strategy completed with no errors
1/9/2022, 6:17:06 PM Successfully completed installing cluster
1/9/2022, 6:17:06 PM Updated status of the cluster to installed
1/9/2022, 6:17:01 PM Operator ocs status: available message: install strategy completed with no errors
1/9/2022, 6:17:00 PM Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:17:00 PM Operator cnv status: available message: install strategy completed with no errors
1/9/2022, 6:16:31 PM Operator ocs status: progressing message: installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
1/9/2022, 6:16:30 PM Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:16:30 PM Operator cnv status: available message: install strategy completed with no errors
1/9/2022, 6:16:01 PM Operator ocs status: progressing message: installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
1/9/2022, 6:16:00 PM Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:16:00 PM Operator cnv status: available message: install strategy completed with no errors
1/9/2022, 6:15:31 PM Operator ocs status: progressing message: installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
1/9/2022, 6:15:31 PM Operator lso status: available message: install strategy completed with no errors
1/9/2022, 6:15:30 PM Operator cnv status: available message: install strategy completed with no errors
How reproducible:
100%
Steps to reproduce:
1. Install cluster with OCS and CNV
2. Watch cluster events
Actual results:
Repeated messages when an OLM operator has completed installation
Expected results:
A single event record when an OLM operator finishes successfully
Description of problem:
This bug is a copy of https://bugzilla.redhat.com/show_bug.cgi?id=2137616, as the fix needs to go in on the OCP side. For the must-gather and attached screenshots, please refer to the Bugzilla.
The Add Capacity button does not exist after upgrading the OCP version [OCP 4.11 -> OCP 4.12]
Version-Release number of selected component (if applicable):
ODF Version: 4.11.3-3
OCP Version: 4.12.0-0.nightly-2022-10-24-103753
Provider: AWS
How reproducible:
Steps to Reproduce:
1. Install ODF 4.11 + OCP 4.11
2. Upgrade OCP 4.11 to OCP 4.12
3. Log in to the OpenShift Web Console.
4. Click Operators → Installed Operators.
5. Click OpenShift Data Foundation Operator.
6. Click the Storage Systems tab.
7. Click the Action Menu (⋮) on the far right of the storage system name to extend the options menu.
The "Add Capacity" button does not exist in the menu. (Screenshot attached.)
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4166. The following is the description of the original issue:
—
Description of problem:
This is a wrapper bug for the library sync of 4.12.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Description of problem: Upon attempting to install OCP 4.10 UPI on bare metal ppc64le, the openshift-install gather command returns `panic: unsupported platform "none"`.
Version-Release number of selected component (if applicable):
OCP 4.10.16
openshift-install 4.10.24
How reproducible:
easily
Steps to Reproduce:
1. create install config
2. create manifests
3. create ignition configs
4. openshift-install gather bootstrap --log-level "debug"
Actual results:
DEBUG OpenShift Installer 4.10.24
DEBUG Built from commit d63a12ba0ec33d492093a8fc0e268a01a075f5da
DEBUG Fetching Bootstrap SSH Key Pair...
DEBUG Loading Bootstrap SSH Key Pair...
DEBUG Using Bootstrap SSH Key Pair loaded from state file
DEBUG Reusing previously-fetched Bootstrap SSH Key Pair
DEBUG Fetching Install Config...
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Networking...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
DEBUG Loading Install Config from both state file and target directory
DEBUG On-disk Install Config matches asset in state file
DEBUG Using Install Config loaded from state file
DEBUG Reusing previously-fetched Install Config
panic: unsupported platform "none"
goroutine 1 [running]:
github.com/openshift/installer/pkg/terraform/stages/platform.StagesForPlatform({0x146f2d0a, 0x1619aa08})
/go/src/github.com/openshift/installer/pkg/terraform/stages/platform/stages.go:55 +0x2ff
main.runGatherBootstrapCmd({0x14d8e028, 0x1})
/go/src/github.com/openshift/installer/cmd/openshift-install/gather.go:115 +0x2d6
main.newGatherBootstrapCmd.func1(0xc001364500, {0xc0005a0b40, 0x2, 0x2})
/go/src/github.com/openshift/installer/cmd/openshift-install/gather.go:65 +0x59
github.com/spf13/cobra.(*Command).execute(0xc001364500, {0xc0005a0b20, 0x2, 0x2})
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:860 +0x5f8
github.com/spf13/cobra.(*Command).ExecuteC(0xc001334c80)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:902
main.installerMain()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:72 +0x29e
main.main()
/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:50 +0x125
Expected results:
I'm not really sure what I expected to happen, as I've never used this gather command before.
I would assume at least no panicking.
Additional info:
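For illustration, a hedged sketch of the kind of change that would avoid the panic, assuming the switch in pkg/terraform/stages/platform/stages.go maps platform names to stage lists (names simplified; `Stage` stands in for the installer's stage type). Returning an error would let `gather bootstrap` print an actionable message for platform "none" clusters instead of crashing.
~~~
package platform

import "fmt"

// Stage is a simplified stand-in for the installer's Terraform stage type.
type Stage struct{ Name string }

var awsStages []Stage // placeholder for the AWS stage list; other platforms elided

// StagesForPlatform returns the Terraform stages for a platform. Returning an
// error for unknown or unsupported platforms, instead of panicking, keeps
// `openshift-install gather bootstrap` usable on platform "none".
func StagesForPlatform(platform string) ([]Stage, error) {
	switch platform {
	case "aws":
		return awsStages, nil
	case "none":
		// UPI/platform-none clusters have no Terraform stages to gather from.
		return nil, fmt.Errorf("bootstrap gather is not supported for platform %q", platform)
	default:
		return nil, fmt.Errorf("unsupported platform %q", platform)
	}
}
~~~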
Description of problem:
The current version of openshift/cluster-ingress-operator vendors Kubernetes 1.24 packages. OpenShift 4.12 is based on Kubernetes 1.25.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.12/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.24.0.
Expected results:
Kubernetes packages are at version v0.25.0 or later.
Additional info:
Using old Kubernetes API and client packages brings a risk of API compatibility issues.
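For illustration, the expected pins would look roughly like this in go.mod (an excerpt under the assumption that only the core Kubernetes modules need bumping; the real dependency graph is larger):
~~~
// go.mod (excerpt): Kubernetes dependencies bumped to the 1.25 line.
require (
	k8s.io/api v0.25.0
	k8s.io/apimachinery v0.25.0
	k8s.io/client-go v0.25.0
)
~~~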
During a normal installation, there are hundreds of debug logs reading:
bootstrap configmap not found: configmaps "bootstrap" not found
and dozens of the form:
Still waiting for cluster to initialize: ...
with duplicate data.
We should only log when we have some new information to report, not every time we poll.
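A minimal sketch of the suggested behavior, using a hypothetical wrapper around the installer's debug logging: remember the last message and stay quiet while it repeats.
~~~
package main

import (
	"fmt"
	"log"
)

// changeLogger suppresses consecutive duplicate messages so a polling loop
// only logs when it has something new to report.
type changeLogger struct{ last string }

func (c *changeLogger) Debugf(format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	if msg == c.last {
		return // same message as the previous poll; stay quiet
	}
	c.last = msg
	log.Print(msg)
}

func main() {
	var l changeLogger
	for i := 0; i < 5; i++ {
		l.Debugf("Still waiting for cluster to initialize: %s", "working towards 4.12.0")
	}
	// Only one "Still waiting..." line is printed instead of five.
}
~~~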
Description of problem:
We have an ODF bug for this here: https://bugzilla.redhat.com/show_bug.cgi?id=2169779. It was discussed with Hemant in the forum-storage Slack channel here: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1677085216391669, and we were asked to open a bug for it. This is currently blocking ODF 4.13 deployment on vSphere.
Version-Release number of selected component (if applicable):
How reproducible:
YES
Steps to Reproduce:
1. Deploy ODF 4.13 on vSphere with the `thin-csi` StorageClass.
Actual results:
Expected results:
Additional info:
Description of problem:
See https://github.com/metal3-io/baremetal-operator/issues/1045
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description:
I was testing the DHCP scenario where only rendezvousIP is specified in agent-config.yaml and no NMStateConfig is embedded. pre-network-manager-config.service fails on node0 when networkConfig is missing from agent-config.yaml, because /usr/local/bin/pre-network-manager-config.sh does not exist on node0.
If NMStateConfig is not provided, then perhaps the service should not be included and activated in the Ignition config.
agent-config.yaml used:
metadata:
  name: ostest
  namespace: cluster0
spec:
  rendezvousIP: 192.168.122.2
Steps to reproduce:
1. Create agent.iso using install-config.yaml and agent-config.yaml
2. Deploy cluster using agent.iso
3. Log into node0 and pre-network-manager-config.service will be displayed as a failed unit.
Expected:
pre-network-manager-config.service in success state
Actual:
pre-network-manager-config.service in failed state
Aug 05 08:27:18 localhost systemd[1]: Starting Prepare network manager config content...
Aug 05 08:27:18 localhost systemd[1]: pre-network-manager-config.service: Main process exited, code=exited, status=203/EXEC
Aug 05 08:27:18 localhost systemd[1]: pre-network-manager-config.service: Failed with result 'exit-code'.
Aug 05 08:27:18 localhost systemd[1]: Failed to start Prepare network manager config content.
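A minimal sketch of the suggested fix, with a simplified `Unit` type standing in for the agent installer's Ignition structures (hypothetical names): only include and enable the service when an NMStateConfig/networkConfig is actually present.
~~~
package main

import "fmt"

// Unit is a simplified stand-in for an Ignition systemd unit entry.
type Unit struct {
	Name    string
	Enabled bool
}

// networkConfigUnits returns the systemd units to embed in the agent ISO's
// Ignition. The sketch only includes pre-network-manager-config.service when
// at least one host actually carries a networkConfig, matching the
// suggestion above.
func networkConfigUnits(hasNMStateConfig bool) []Unit {
	units := []Unit{
		{Name: "create-cluster-and-infraenv.service", Enabled: true},
	}
	if hasNMStateConfig {
		units = append(units, Unit{Name: "pre-network-manager-config.service", Enabled: true})
	}
	return units
}

func main() {
	for _, u := range networkConfigUnits(false) {
		fmt.Println(u.Name) // pre-network-manager-config.service is omitted
	}
}
~~~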
Description:
I was testing the DHCP scenario where only rendezvousIP is specified in agent-config.yaml and no NMStateConfig is embedded. create-cluster-and-infraenv.service fails on node0 when networkConfig is missing from agent-config.yaml: /etc/assisted/manifests/nmstateconfig.yaml is generated as an empty file.
agent-config.yaml used:
metadata:
  name: ostest
  namespace: cluster0
spec:
  rendezvousIP: 192.168.122.2
Steps to reproduce:
1. Create agent.iso using install-config.yaml and agent-config.yaml
2. Deploy cluster using agent.iso
3. Log into node0 and create-cluster-and-infraenv.service will be displayed as a failed unit.
Expected:
create-cluster-and-infraenv.service in success state
Actual:
create-cluster-and-infraenv.service in failed state
Aug 05 08:27:59 control1 podman[2681]: time="2022-08-05T08:27:59Z" level=info msg="releaseImage version 4.11.0-0.okd-2022-08-04-074610 cpuarch x86_64"
Aug 05 08:27:59 control1 create-cluster-and-infraenv[2693]: time="2022-08-05T08:27:59Z" level=info msg="Registered cluster with id: 1cc3ea1a-5bbc-4c4d-ad66-6e052800fb0c"
Aug 05 08:27:59 control1 create-cluster-and-infraenv[2693]: time="2022-08-05T08:27:59Z" level=info msg="Registering infraenv"
Aug 05 08:27:59 control1 podman[2681]: time="2022-08-05T08:27:59Z" level=info msg="Registered cluster with id: 1cc3ea1a-5bbc-4c4d-ad66-6e052800fb0c"
Aug 05 08:27:59 control1 podman[2681]: time="2022-08-05T08:27:59Z" level=info msg="Registering infraenv"
Aug 05 08:27:59 control1 create-cluster-and-infraenv[2693]: time="2022-08-05T08:27:59Z" level=fatal msg="Failed to register infraenv with assisted-service: nmstateconfig should have at least one label set matching the infra-env label selector"
Aug 05 08:27:59 control1 podman[2681]: time="2022-08-05T08:27:59Z" level=fatal msg="Failed to register infraenv with assisted-service: nmstateconfig should have at least one label set matching the infra-env label selector"
Aug 05 08:27:59 control1 systemd[1]: create-cluster-and-infraenv.service: Main process exited, code=exited, status=1/FAILURE
Aug 05 08:27:59 control1 systemd[1]: create-cluster-and-infraenv.service: Failed with result 'exit-code'.
Aug 05 08:27:59 control1 systemd[1]: Failed to start Service that creates initial cluster and infraenv.
/etc/assisted/manifests/nmstateconfig.yaml is an empty file.
[core@control1 ~]$ sudo cat /etc/assisted/manifests/nmstateconfig.yaml
[core@control1 ~]$
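One way the service could guard against this, as a sketch (the path is real, the helper is hypothetical): treat an empty nmstateconfig.yaml the same as an absent one and skip registering a static-network config.
~~~
package main

import (
	"bytes"
	"fmt"
	"os"
)

// loadNMStateConfig reads the generated manifest and treats an empty file the
// same as "no NMStateConfig provided", so the agent does not try to register
// an infraenv with an unusable static-network config.
func loadNMStateConfig(path string) ([]byte, bool, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	if len(bytes.TrimSpace(data)) == 0 {
		return nil, false, nil // file exists but is empty: skip it
	}
	return data, true, nil
}

func main() {
	_, ok, err := loadNMStateConfig("/etc/assisted/manifests/nmstateconfig.yaml")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("have NMStateConfig:", ok)
}
~~~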
Description of problem:
When installing 1000+ SNOs via ACM/MCE using ZTP with GitOps, a small percentage of clusters never complete the install because the monitoring operator does not reconcile to Available.
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False True 16h Unable to apply 4.11.0: the cluster operator monitoring has not yet successfully rolled out
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig get co monitoring
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
monitoring False True True 15h Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Version-Release number of selected component (if applicable):
How reproducible:
Additional info:
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig get po -n openshift-monitoring NAME READY STATUS RESTARTS AGE alertmanager-main-0 0/6 ContainerCreating 0 15h cluster-monitoring-operator-54dd78cc74-l5w24 2/2 Running 0 15h kube-state-metrics-b6455c4dc-8hcfn 3/3 Running 0 15h node-exporter-k7899 2/2 Running 0 15h openshift-state-metrics-7984888fbd-cl67v 3/3 Running 0 15h prometheus-adapter-785bf4f975-wgmnh 1/1 Running 0 15h prometheus-k8s-0 0/6 Init:0/1 0 15h prometheus-operator-74d8754ff7-9zrgw 2/2 Running 0 15h prometheus-operator-admission-webhook-6665fb687d-c5jgv 1/1 Running 0 15h thanos-querier-575496c665-jcc8l 6/6 Running 0 15h # oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig describe po -n openshift-monitoring alertmanager-main-0 Name: alertmanager-main-0 Namespace: openshift-monitoring Priority: 2000000000 Priority Class Name: system-cluster-critical Node: sno01219/fc00:1001::8aa Start Time: Mon, 15 Aug 2022 23:53:39 +0000 Labels: alertmanager=main app.kubernetes.io/component=alert-router app.kubernetes.io/instance=main app.kubernetes.io/managed-by=prometheus-operator app.kubernetes.io/name=alertmanager app.kubernetes.io/part-of=openshift-monitoring app.kubernetes.io/version=0.24.0 controller-revision-hash=alertmanager-main-fcf8dd5fb statefulset.kubernetes.io/pod-name=alertmanager-main-0 Annotations: kubectl.kubernetes.io/default-container: alertmanager openshift.io/scc: nonroot Status: Pending IP: IPs: <none> Controlled By: StatefulSet/alertmanager-main Containers: alertmanager: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:91308d35c1e56463f55c1aaa519ff4de7335d43b254c21abdb845fc8c72821a1 Image ID: Ports: 9094/TCP, 9094/UDP Host Ports: 0/TCP, 0/UDP Args: --config.file=/etc/alertmanager/config/alertmanager.yaml --storage.path=/alertmanager --data.retention=120h --cluster.listen-address= --web.listen-address=127.0.0.1:9093 --web.external-url=https:/console-openshift-console.apps.sno01219.rdu2.scalelab.redhat.com/monitoring --web.route-prefix=/ --cluster.peer=alertmanager-main-0.alertmanager-operated:9094 --cluster.reconnect-timeout=5m State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Requests: cpu: 4m memory: 40Mi Environment: POD_IP: (v1:status.podIP) Mounts: /alertmanager from alertmanager-main-db (rw) /etc/alertmanager/certs from tls-assets (ro) /etc/alertmanager/config from config-volume (rw) /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy from secret-alertmanager-kube-rbac-proxy (ro) /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy-metric from secret-alertmanager-kube-rbac-proxy-metric (ro) /etc/alertmanager/secrets/alertmanager-main-proxy from secret-alertmanager-main-proxy (ro) /etc/alertmanager/secrets/alertmanager-main-tls from secret-alertmanager-main-tls (ro) /etc/pki/ca-trust/extracted/pem/ from alertmanager-trusted-ca-bundle (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro) config-reloader: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209e20410ec2d3d7a502f568d2b7fe1cd1beadcb36fff2d1e6f59d77be3200e3 Image ID: Port: <none> Host Port: <none> Command: /bin/prometheus-config-reloader Args: --listen-address=localhost:8080 --reload-url=http://localhost:9093/-/reload --watched-dir=/etc/alertmanager/config --watched-dir=/etc/alertmanager/secrets/alertmanager-main-tls --watched-dir=/etc/alertmanager/secrets/alertmanager-main-proxy --watched-dir=/etc/alertmanager/secrets/alertmanager-kube-rbac-proxy 
--watched-dir=/etc/alertmanager/secrets/alertmanager-kube-rbac-proxy-metric State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Requests: cpu: 1m memory: 10Mi Environment: POD_NAME: alertmanager-main-0 (v1:metadata.name) SHARD: -1 Mounts: /etc/alertmanager/config from config-volume (ro) /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy from secret-alertmanager-kube-rbac-proxy (ro) /etc/alertmanager/secrets/alertmanager-kube-rbac-proxy-metric from secret-alertmanager-kube-rbac-proxy-metric (ro) /etc/alertmanager/secrets/alertmanager-main-proxy from secret-alertmanager-main-proxy (ro) /etc/alertmanager/secrets/alertmanager-main-tls from secret-alertmanager-main-tls (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro) alertmanager-proxy: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:140f8947593d92e1517e50a201e83bdef8eb965b552a21d3caf346a250d0cf6e Image ID: Port: 9095/TCP Host Port: 0/TCP Args: -provider=openshift -https-address=:9095 -http-address= -email-domain=* -upstream=http://localhost:9093 -openshift-sar=[{"resource": "namespaces", "verb": "get"}, {"resource": "alertmanagers", "resourceAPIGroup": "monitoring.coreos.com", "namespace": "openshift-monitoring", "verb": "patch", "resourceName": "non-existant"}] -openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}, "/": {"resource":"alertmanagers", "group": "monitoring.coreos.com", "namespace": "openshift-monitoring", "verb": "patch", "name": "non-existant"}} -tls-cert=/etc/tls/private/tls.crt -tls-key=/etc/tls/private/tls.key -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token -cookie-secret-file=/etc/proxy/secrets/session_secret -openshift-service-account=alertmanager-main -openshift-ca=/etc/pki/tls/cert.pem -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Requests: cpu: 1m memory: 20Mi Environment: HTTP_PROXY: HTTPS_PROXY: NO_PROXY: Mounts: /etc/pki/ca-trust/extracted/pem/ from alertmanager-trusted-ca-bundle (ro) /etc/proxy/secrets from secret-alertmanager-main-proxy (rw) /etc/tls/private from secret-alertmanager-main-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro) kube-rbac-proxy: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5e1c69d005727e3245604cfca7a63e4f9bc6e15128c7489e41d5e967305089e Image ID: Port: 9092/TCP Host Port: 0/TCP Args: --secure-listen-address=0.0.0.0:9092 --upstream=http://127.0.0.1:9096 --config-file=/etc/kube-rbac-proxy/config.yaml --tls-cert-file=/etc/tls/private/tls.crt --tls-private-key-file=/etc/tls/private/tls.key --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 --logtostderr=true --tls-min-version=VersionTLS12 State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Requests: cpu: 1m memory: 15Mi Environment: <none> Mounts: /etc/kube-rbac-proxy from secret-alertmanager-kube-rbac-proxy (rw) /etc/tls/private from secret-alertmanager-main-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro) kube-rbac-proxy-metric: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5e1c69d005727e3245604cfca7a63e4f9bc6e15128c7489e41d5e967305089e Image ID: Port: 9097/TCP Host 
Port: 0/TCP Args: --secure-listen-address=0.0.0.0:9097 --upstream=http://127.0.0.1:9093 --config-file=/etc/kube-rbac-proxy/config.yaml --tls-cert-file=/etc/tls/private/tls.crt --tls-private-key-file=/etc/tls/private/tls.key --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 --client-ca-file=/etc/tls/client/client-ca.crt --logtostderr=true --allow-paths=/metrics --tls-min-version=VersionTLS12 State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Requests: cpu: 1m memory: 15Mi Environment: <none> Mounts: /etc/kube-rbac-proxy from secret-alertmanager-kube-rbac-proxy-metric (ro) /etc/tls/client from metrics-client-ca (ro) /etc/tls/private from secret-alertmanager-main-tls (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro) prom-label-proxy: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2550b2cbdf864515b1edacf43c25eb6b6f179713c1df34e51f6e9bba48d6430a Image ID: Port: <none> Host Port: <none> Args: --insecure-listen-address=127.0.0.1:9096 --upstream=http://127.0.0.1:9093 --label=namespace --error-on-replace State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Requests: cpu: 1m memory: 20Mi Environment: <none> Mounts: /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hl77l (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: config-volume: Type: Secret (a volume populated by a Secret) SecretName: alertmanager-main-generated Optional: false tls-assets: Type: Projected (a volume that contains injected data from multiple sources) SecretName: alertmanager-main-tls-assets-0 SecretOptionalName: <nil> secret-alertmanager-main-tls: Type: Secret (a volume populated by a Secret) SecretName: alertmanager-main-tls Optional: false secret-alertmanager-main-proxy: Type: Secret (a volume populated by a Secret) SecretName: alertmanager-main-proxy Optional: false secret-alertmanager-kube-rbac-proxy: Type: Secret (a volume populated by a Secret) SecretName: alertmanager-kube-rbac-proxy Optional: false secret-alertmanager-kube-rbac-proxy-metric: Type: Secret (a volume populated by a Secret) SecretName: alertmanager-kube-rbac-proxy-metric Optional: false alertmanager-main-db: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: <unset> metrics-client-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: metrics-client-ca Optional: false alertmanager-trusted-ca-bundle: Type: ConfigMap (a volume populated by a ConfigMap) Name: alertmanager-trusted-ca-bundle-2rsonso43rc5p Optional: true kube-api-access-hl77l: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: kubernetes.io/os=linux Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedCreatePodSandBox 2m25s (x409 over 15h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown 
desc = failed to create pod network sandbox k8s_alertmanager-main-0_openshift-monitoring_1c367a83-24e3-4249-861a-a107a6beaee2_0(dff5f302f774d060728261b3c86841ebdbd7ba11537ec9f4d90d57be17bdf44b): error adding pod openshift-monitoring_alertmanager-main-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-monitoring/alertmanager-main-0/1c367a83-24e3-4249-861a-a107a6beaee2:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/alertmanager-main-0 dff5f302f774d060728261b3c86841ebdbd7ba11537ec9f4d90d57be17bdf44b] [openshift-monitoring/alertmanager-main-0 dff5f302f774d060728261b3c86841ebdbd7ba11537ec9f4d90d57be17bdf44b] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded oc --kubeconfig=/root/hv-vm/sno/manifests/sno01219/kubeconfig describe po -n openshift-monitoring prometheus-k8s-0 Name: prometheus-k8s-0 Namespace: openshift-monitoring Priority: 2000000000 Priority Class Name: system-cluster-critical Node: sno01219/fc00:1001::8aa Start Time: Mon, 15 Aug 2022 23:53:39 +0000 Labels: app.kubernetes.io/component=prometheus app.kubernetes.io/instance=k8s app.kubernetes.io/managed-by=prometheus-operator app.kubernetes.io/name=prometheus app.kubernetes.io/part-of=openshift-monitoring app.kubernetes.io/version=2.36.2 controller-revision-hash=prometheus-k8s-546b544f8b operator.prometheus.io/name=k8s operator.prometheus.io/shard=0 prometheus=k8s statefulset.kubernetes.io/pod-name=prometheus-k8s-0 Annotations: kubectl.kubernetes.io/default-container: prometheus openshift.io/scc: nonroot Status: Pending IP: IPs: <none> Controlled By: StatefulSet/prometheus-k8s Init Containers: init-config-reloader: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209e20410ec2d3d7a502f568d2b7fe1cd1beadcb36fff2d1e6f59d77be3200e3 Image ID: Port: 8080/TCP Host Port: 0/TCP Command: /bin/prometheus-config-reloader Args: --watch-interval=0 --listen-address=:8080 --config-file=/etc/prometheus/config/prometheus.yaml.gz --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0 State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Requests: cpu: 1m memory: 10Mi Environment: POD_NAME: prometheus-k8s-0 (v1:metadata.name) SHARD: 0 Mounts: /etc/prometheus/config from config (rw) /etc/prometheus/config_out from config-out (rw) /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro) Containers: prometheus: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7df53b796e81ba8301ba74d02317226329bd5752fd31c1b44d028e4832f21c3 Image ID: Port: <none> Host Port: <none> Args: --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheus/console_libraries --storage.tsdb.retention.time=15d --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --web.enable-lifecycle --web.external-url=https:/console-openshift-console.apps.sno01219.rdu2.scalelab.redhat.com/monitoring --web.route-prefix=/ --web.listen-address=127.0.0.1:9090 --web.config.file=/etc/prometheus/web_config/web-config.yaml State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Requests: cpu: 70m memory: 1Gi Liveness: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail 
http://localhost:9090/-/healthy; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/healthy; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=6 Readiness: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=3 Startup: exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl --fail http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=15s #success=1 #failure=60 Environment: <none> Mounts: /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro) /etc/prometheus/certs from tls-assets (ro) /etc/prometheus/config_out from config-out (ro) /etc/prometheus/configmaps/kubelet-serving-ca-bundle from configmap-kubelet-serving-ca-bundle (ro) /etc/prometheus/configmaps/metrics-client-ca from configmap-metrics-client-ca (ro) /etc/prometheus/configmaps/serving-certs-ca-bundle from configmap-serving-certs-ca-bundle (ro) /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw) /etc/prometheus/secrets/kube-etcd-client-certs from secret-kube-etcd-client-certs (ro) /etc/prometheus/secrets/kube-rbac-proxy from secret-kube-rbac-proxy (ro) /etc/prometheus/secrets/metrics-client-certs from secret-metrics-client-certs (ro) /etc/prometheus/secrets/prometheus-k8s-proxy from secret-prometheus-k8s-proxy (ro) /etc/prometheus/secrets/prometheus-k8s-thanos-sidecar-tls from secret-prometheus-k8s-thanos-sidecar-tls (ro) /etc/prometheus/secrets/prometheus-k8s-tls from secret-prometheus-k8s-tls (ro) /etc/prometheus/web_config/web-config.yaml from web-config (ro,path="web-config.yaml") /prometheus from prometheus-k8s-db (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro) config-reloader: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:209e20410ec2d3d7a502f568d2b7fe1cd1beadcb36fff2d1e6f59d77be3200e3 Image ID: Port: <none> Host Port: <none> Command: /bin/prometheus-config-reloader Args: --listen-address=localhost:8080 --reload-url=http://localhost:9090/-/reload --config-file=/etc/prometheus/config/prometheus.yaml.gz --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0 State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Requests: cpu: 1m memory: 10Mi Environment: POD_NAME: prometheus-k8s-0 (v1:metadata.name) SHARD: 0 Mounts: /etc/prometheus/config from config (rw) /etc/prometheus/config_out from config-out (rw) /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro) thanos-sidecar: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:36fc214537c763b3a3f0a9dc7a1bd4378a80428c31b2629df8786a9b09155e6d Image ID: Ports: 10902/TCP, 10901/TCP Host Ports: 0/TCP, 0/TCP Args: sidecar --prometheus.url=http://localhost:9090/ --tsdb.path=/prometheus --http-address=127.0.0.1:10902 --grpc-server-tls-cert=/etc/tls/grpc/server.crt --grpc-server-tls-key=/etc/tls/grpc/server.key --grpc-server-tls-client-ca=/etc/tls/grpc/ca.crt State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Requests: cpu: 1m memory: 25Mi Environment: <none> 
Mounts: /etc/tls/grpc from secret-grpc-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro) prometheus-proxy: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:140f8947593d92e1517e50a201e83bdef8eb965b552a21d3caf346a250d0cf6e Image ID: Port: 9091/TCP Host Port: 0/TCP Args: -provider=openshift -https-address=:9091 -http-address= -email-domain=* -upstream=http://localhost:9090 -openshift-service-account=prometheus-k8s -openshift-sar={"resource": "namespaces", "verb": "get"} -openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}} -tls-cert=/etc/tls/private/tls.crt -tls-key=/etc/tls/private/tls.key -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token -cookie-secret-file=/etc/proxy/secrets/session_secret -openshift-ca=/etc/pki/tls/cert.pem -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Requests: cpu: 1m memory: 20Mi Environment: HTTP_PROXY: HTTPS_PROXY: NO_PROXY: Mounts: /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro) /etc/proxy/secrets from secret-prometheus-k8s-proxy (rw) /etc/tls/private from secret-prometheus-k8s-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro) kube-rbac-proxy: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5e1c69d005727e3245604cfca7a63e4f9bc6e15128c7489e41d5e967305089e Image ID: Port: 9092/TCP Host Port: 0/TCP Args: --secure-listen-address=0.0.0.0:9092 --upstream=http://127.0.0.1:9090 --allow-paths=/metrics --config-file=/etc/kube-rbac-proxy/config.yaml --tls-cert-file=/etc/tls/private/tls.crt --tls-private-key-file=/etc/tls/private/tls.key --client-ca-file=/etc/tls/client/client-ca.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 --logtostderr=true --tls-min-version=VersionTLS12 State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Requests: cpu: 1m memory: 15Mi Environment: <none> Mounts: /etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw) /etc/tls/client from configmap-metrics-client-ca (ro) /etc/tls/private from secret-prometheus-k8s-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro) kube-rbac-proxy-thanos: Container ID: Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b5e1c69d005727e3245604cfca7a63e4f9bc6e15128c7489e41d5e967305089e Image ID: Port: 10902/TCP Host Port: 0/TCP Args: --secure-listen-address=[$(POD_IP)]:10902 --upstream=http://127.0.0.1:10902 --tls-cert-file=/etc/tls/private/tls.crt --tls-private-key-file=/etc/tls/private/tls.key --client-ca-file=/etc/tls/client/client-ca.crt --config-file=/etc/kube-rbac-proxy/config.yaml --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 --allow-paths=/metrics --logtostderr=true --tls-min-version=VersionTLS12 --client-ca-file=/etc/tls/client/client-ca.crt State: Waiting Reason: PodInitializing Ready: False Restart Count: 0 Requests: cpu: 1m memory: 10Mi Environment: POD_IP: (v1:status.podIP) Mounts: /etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw) 
/etc/tls/client from metrics-client-ca (ro) /etc/tls/private from secret-prometheus-k8s-thanos-sidecar-tls (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-85zlc (ro) Conditions: Type Status Initialized False Ready False ContainersReady False PodScheduled True Volumes: config: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s Optional: false tls-assets: Type: Projected (a volume that contains injected data from multiple sources) SecretName: prometheus-k8s-tls-assets-0 SecretOptionalName: <nil> config-out: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: <unset> prometheus-k8s-rulefiles-0: Type: ConfigMap (a volume populated by a ConfigMap) Name: prometheus-k8s-rulefiles-0 Optional: false web-config: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-web-config Optional: false secret-kube-etcd-client-certs: Type: Secret (a volume populated by a Secret) SecretName: kube-etcd-client-certs Optional: false secret-prometheus-k8s-tls: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-tls Optional: false secret-prometheus-k8s-proxy: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-proxy Optional: false secret-prometheus-k8s-thanos-sidecar-tls: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-thanos-sidecar-tls Optional: false secret-kube-rbac-proxy: Type: Secret (a volume populated by a Secret) SecretName: kube-rbac-proxy Optional: false secret-metrics-client-certs: Type: Secret (a volume populated by a Secret) SecretName: metrics-client-certs Optional: false configmap-serving-certs-ca-bundle: Type: ConfigMap (a volume populated by a ConfigMap) Name: serving-certs-ca-bundle Optional: false configmap-kubelet-serving-ca-bundle: Type: ConfigMap (a volume populated by a ConfigMap) Name: kubelet-serving-ca-bundle Optional: false configmap-metrics-client-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: metrics-client-ca Optional: false prometheus-k8s-db: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: <unset> metrics-client-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: metrics-client-ca Optional: false secret-grpc-tls: Type: Secret (a volume populated by a Secret) SecretName: prometheus-k8s-grpc-tls-crdkohb1gb92n Optional: false prometheus-trusted-ca-bundle: Type: ConfigMap (a volume populated by a ConfigMap) Name: prometheus-trusted-ca-bundle-2rsonso43rc5p Optional: true kube-api-access-85zlc: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3607 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true ConfigMapName: openshift-service-ca.crt ConfigMapOptional: <nil> QoS Class: Burstable Node-Selectors: kubernetes.io/os=linux Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedCreatePodSandBox 4m19s (x409 over 15h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-0_openshift-monitoring_debda4d2-6914-4b36-92e0-78f68d539ab3_0(86af91d4e64ab0fbad95352b029762e9856ff24005445b458bccb22e0ee9b655): error adding pod openshift-monitoring_prometheus-k8s-0 to CNI network 
"multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-monitoring/prometheus-k8s-0/debda4d2-6914-4b36-92e0-78f68d539ab3:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-monitoring/prometheus-k8s-0 86af91d4e64ab0fbad95352b029762e9856ff24005445b458bccb22e0ee9b655] [openshift-monitoring/prometheus-k8s-0 86af91d4e64ab0fbad95352b029762e9856ff24005445b458bccb22e0ee9b655] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded
Both pods in error state seem to be waiting on this issue "failed to get pod annotation: timed out waiting for annotations: context deadline exceeded"
Description of problem:
"Failed to open directory, disabling udev device properties" in node-exporter logs
$ for i in $(oc -n openshift-monitoring get pod | grep node-exporter | awk '{print $1}'); do echo $i; oc -n openshift-monitoring logs -c node-exporter $i | grep "Failed to open directory, disabling udev device properties"; echo -e "\n"; done
node-exporter-4279b
ts=2022-10-17T01:16:05.833Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data
node-exporter-9tq64
ts=2022-10-17T01:16:04.642Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data
node-exporter-dwtwh
ts=2022-10-17T01:16:04.936Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data
node-exporter-nrznc
ts=2022-10-17T01:16:05.601Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data
node-exporter-q87s4
ts=2022-10-17T01:16:05.228Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data
node-exporter-twtxj
ts=2022-10-17T01:16:05.249Z caller=diskstats_linux.go:264 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data
Debugging on the node shows that /run/udev/data is readable:
# oc debug node/ip-10-0-138-107.us-east-2.compute.internal Temporary namespace openshift-debug-dhvqv is created for debugging node... Starting pod/ip-10-0-138-107us-east-2computeinternal-debug ... To use host binaries, run `chroot /host` Pod IP: 10.0.138.107 If you don't see a command prompt, try pressing enter. sh-4.4# chroot /host sh-4.4# ls -l /run/udev/ total 0 srw-------. 1 root root 0 Oct 17 01:04 control drwxr-xr-x. 2 root root 3780 Oct 17 01:26 data drwxr-xr-x. 40 root root 800 Oct 17 01:04 links drwxr-xr-x. 3 root root 60 Oct 17 01:04 static_node-tags drwxr-xr-x. 5 root root 100 Oct 17 01:04 tags drwxr-xr-x. 2 root root 140 Oct 17 01:04 watch sh-4.4# ls -l /run/udev/data total 304 -rw-r--r--. 1 root root 55 Oct 17 01:04 +acpi:AMZN0000:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXCPU:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXCPU:01 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXCPU:02 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXCPU:03 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXPWRBN:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXSLPBN:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXSYBUS:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXSYBUS:01 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:LNXSYSTM:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0103:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0303:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0400:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0501:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0A03:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0B00:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0C0F:00 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0C0F:01 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0C0F:02 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0C0F:03 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0C0F:04 -rw-r--r--. 1 root root 57 Oct 17 01:04 +acpi:PNP0F13:00 -rw-r--r--. 1 root root 142 Oct 17 01:04 +input:input0 -rw-r--r--. 1 root root 142 Oct 17 01:04 +input:input1 -rw-r--r--. 1 root root 218 Oct 17 01:04 +input:input2 -rw-r--r--. 1 root root 198 Oct 17 01:04 +input:input4 -rw-r--r--. 1 root root 143 Oct 17 01:04 +input:input5 -rw-r--r--. 1 root root 60 Oct 17 01:04 +module:configfs -rw-r--r--. 1 root root 66 Oct 17 01:04 +module:fuse -rw-r--r--. 1 root root 188 Oct 17 01:04 +pci:0000:00:00.0 -rw-r--r--. 1 root root 195 Oct 17 01:04 +pci:0000:00:01.0 -rw-r--r--. 1 root root 213 Oct 17 01:04 +pci:0000:00:01.3 -rw-r--r--. 1 root root 207 Oct 17 01:04 +pci:0000:00:03.0 -rw-r--r--. 1 root root 259 Oct 17 01:04 +pci:0000:00:04.0 -rw-r--r--. 1 root root 208 Oct 17 01:04 +pci:0000:00:05.0 -rw-r--r--. 1 root root 55 Oct 17 01:04 +platform:AMZN0000:00 -rw-r--r--. 1 root root 825 Oct 17 01:04 b259:0 -rw-r--r--. 1 root root 1357 Oct 17 01:04 b259:1 -rw-r--r--. 1 root root 1568 Oct 17 01:04 b259:2 -rw-r--r--. 1 root root 1619 Oct 17 01:04 b259:3 -rw-r--r--. 1 root root 1602 Oct 17 01:04 b259:4 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:144 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:183 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:227 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:228 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:229 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:231 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:235 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:236 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:62 -rw-r--r--. 1 root root 0 Oct 17 01:04 c10:63 -rw-r--r--. 1 root root 193 Oct 17 01:04 c13:32 -rw-r--r--. 
1 root root 0 Oct 17 01:04 c13:63 -rw-r--r--. 1 root root 113 Oct 17 01:04 c13:64 -rw-r--r--. 1 root root 113 Oct 17 01:04 c13:65 -rw-r--r--. 1 root root 232 Oct 17 01:04 c13:66 -rw-r--r--. 1 root root 199 Oct 17 01:04 c13:67 -rw-r--r--. 1 root root 143 Oct 17 01:04 c13:68 -rw-r--r--. 1 root root 0 Oct 17 01:04 c162:0 -rw-r--r--. 1 root root 0 Oct 17 01:04 c1:1 -rw-r--r--. 1 root root 0 Oct 17 01:04 c1:11 -rw-r--r--. 1 root root 0 Oct 17 01:04 c1:3 -rw-r--r--. 1 root root 0 Oct 17 01:04 c1:4 -rw-r--r--. 1 root root 0 Oct 17 01:04 c1:5 -rw-r--r--. 1 root root 0 Oct 17 01:04 c1:7 -rw-r--r--. 1 root root 0 Oct 17 01:04 c1:8 -rw-r--r--. 1 root root 0 Oct 17 01:04 c1:9 -rw-r--r--. 1 root root 0 Oct 17 01:04 c202:0 -rw-r--r--. 1 root root 0 Oct 17 01:04 c202:1 -rw-r--r--. 1 root root 0 Oct 17 01:04 c202:2 -rw-r--r--. 1 root root 0 Oct 17 01:04 c202:3 -rw-r--r--. 1 root root 0 Oct 17 01:04 c203:0 -rw-r--r--. 1 root root 0 Oct 17 01:04 c203:1 -rw-r--r--. 1 root root 0 Oct 17 01:04 c203:2 -rw-r--r--. 1 root root 0 Oct 17 01:04 c203:3 -rw-r--r--. 1 root root 0 Oct 17 01:04 c241:0 -rw-r--r--. 1 root root 259 Oct 17 01:04 c242:0 -rw-r--r--. 1 root root 0 Oct 17 01:04 c246:0 -rw-r--r--. 1 root root 23 Oct 17 01:04 c251:0 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:0 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:1 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:10 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:11 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:12 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:13 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:14 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:15 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:16 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:17 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:18 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:19 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:2 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:20 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:21 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:22 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:23 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:24 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:25 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:26 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:27 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:28 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:29 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:3 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:30 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:31 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:32 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:33 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:34 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:35 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:36 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:37 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:38 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:39 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:4 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:40 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:41 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:42 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:43 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:44 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:45 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:46 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:47 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:48 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:49 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:5 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:50 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:51 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:52 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:53 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:54 -rw-r--r--. 
1 root root 0 Oct 17 01:04 c4:55 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:56 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:57 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:58 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:59 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:6 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:60 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:61 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:62 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:63 -rw-r--r--. 1 root root 20 Oct 17 01:04 c4:64 -rw-r--r--. 1 root root 20 Oct 17 01:04 c4:65 -rw-r--r--. 1 root root 20 Oct 17 01:04 c4:66 -rw-r--r--. 1 root root 20 Oct 17 01:04 c4:67 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:7 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:8 -rw-r--r--. 1 root root 0 Oct 17 01:04 c4:9 -rw-r--r--. 1 root root 0 Oct 17 01:04 c5:0 -rw-r--r--. 1 root root 0 Oct 17 01:04 c5:1 -rw-r--r--. 1 root root 0 Oct 17 01:04 c5:2 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:0 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:1 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:128 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:129 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:130 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:131 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:132 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:133 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:134 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:2 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:3 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:4 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:5 -rw-r--r--. 1 root root 0 Oct 17 01:04 c7:6 -rw-r--r--. 1 root root 87 Oct 17 01:04 n1 -rw-r--r--. 1 root root 360 Oct 17 01:06 n10 -rw-r--r--. 1 root root 360 Oct 17 01:06 n11 -rw-r--r--. 1 root root 360 Oct 17 01:06 n13 -rw-r--r--. 1 root root 360 Oct 17 01:07 n14 -rw-r--r--. 1 root root 595 Oct 17 01:04 n2 -rw-r--r--. 1 root root 360 Oct 17 01:09 n25 -rw-r--r--. 1 root root 360 Oct 17 01:10 n29 -rw-r--r--. 1 root root 195 Oct 17 01:04 n3 -rw-r--r--. 1 root root 360 Oct 17 01:10 n30 -rw-r--r--. 1 root root 360 Oct 17 01:11 n31 -rw-r--r--. 1 root root 360 Oct 17 01:14 n35 -rw-r--r--. 1 root root 360 Oct 17 01:14 n37 -rw-r--r--. 1 root root 360 Oct 17 01:14 n39 -rw-r--r--. 1 root root 188 Oct 17 01:04 n4 -rw-r--r--. 1 root root 360 Oct 17 01:15 n41 -rw-r--r--. 1 root root 193 Oct 17 01:04 n5 -rw-r--r--. 1 root root 360 Oct 17 01:18 n50 -rw-r--r--. 1 root root 362 Oct 17 01:26 n54 -rw-r--r--. 1 root root 189 Oct 17 01:04 n6 -rw-r--r--. 1 root root 357 Oct 17 01:05 n7 -rw-r--r--. 1 root root 357 Oct 17 01:05 n8 -rw-r--r--. 1 root root 359 Oct 17 01:05 n9
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-15-094115 node-exporter version=1.4.0
How reproducible:
always
Steps to Reproduce:
1. Check the node-exporter logs.
Actual results:
"Failed to open directory, disabling udev device properties" in node-exporter logs
Expected results:
no error logs
Additional info:
There is no functional impact on the cluster. Relevant code: https://github.com/prometheus/node_exporter/blob/release-1.4/collector/diskstats_linux.go#L262-L270
Description of problem: This is a follow-up to OCPBUGS-2795 and OCPBUGS-2941.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses. This can happen on responses with HTTP status code 204, where a reverse proxy truncates content-related headers (see the linked nginx bug report). In such cases, the installer errors with:
level=error msg=Bulk deleting of container "5ifivltb-ac890-chr5h-image-registry-fnxlmmhiesrfvpuxlxqnkoxdbl" objects failed: Cannot extract names from response with content-type: []
Listing container objects suffers from the same issue as listing the containers, and this one isn't fixed in the latest versions of gophercloud. I've reported https://github.com/gophercloud/gophercloud/issues/2509 and am fixing it with https://github.com/gophercloud/gophercloud/issues/2510; however, we likely won't be able to backport the bump to gophercloud master back to release-4.8, so we'll have to look for alternatives.
I'm setting the priority to critical as it's causing all our jobs to fail in master.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
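For illustration, a sketch of the workaround discussed above (not the gophercloud API itself): when the Content-Type header is missing, fall back to parsing the body as Swift's default plain-text, newline-separated listing instead of erroring out.
~~~
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// namesFromListResponse extracts object names from a Swift container listing.
// Swift returns one name per line for plain-text listings; when a reverse
// proxy strips the Content-Type header, we fall back to treating the body as
// plain text instead of failing on the empty content type.
func namesFromListResponse(header http.Header, body string) ([]string, error) {
	ct := header.Get("Content-Type")
	switch {
	case ct == "" || strings.HasPrefix(ct, "text/plain"):
		var names []string
		for _, line := range strings.Split(strings.TrimSpace(body), "\n") {
			if line != "" {
				names = append(names, line)
			}
		}
		return names, nil
	default:
		return nil, fmt.Errorf("cannot extract names from response with content-type: %q", ct)
	}
}

func main() {
	names, _ := namesFromListResponse(http.Header{}, "obj-1\nobj-2\n")
	fmt.Println(names) // [obj-1 obj-2]
}
~~~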
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/862
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The service project and the host project both have a private DNS zone named "ipi-xpn-private-zone". Although platform.gcp.privateDNSZone.project is set to the host project, the installer checks the zone in the service project and complains that the DNS name does not match.
Version-Release number of selected component (if applicable):
$ openshift-install version
openshift-install 4.12.0-0.nightly-2022-10-25-210451
built from commit 14d496fdaec571fa97604a487f5df6a0433c0c68
release image registry.ci.openshift.org/ocp/release@sha256:d6cc07402fee12197ca1a8592b5b781f9f9a84b55883f126d60a3896a36a9b74
release architecture amd64
How reproducible:
Always, if both the service project and the host project have a private DNS zone with the same name.
Steps to Reproduce:
1. try IPI installation to a shared VPC, using "privateDNSZone" of the host project
Actual results:
$ openshift-install create cluster --dir test7
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.gcp.privateManagedZone: Invalid value: "ipi-xpn-private-zone": dns zone jiwei-1026a.qe1.gcp.devcluster.openshift.com. did not match expected jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com
$
Expected results:
The installer should check the private zone in the specified project (i.e. the host project).
Additional info:
$ yq-3.3.0 r test7/install-config.yaml platform gcp: projectID: openshift-qe region: us-central1 computeSubnet: installer-shared-vpc-subnet-2 controlPlaneSubnet: installer-shared-vpc-subnet-1 createFirewallRules: Disabled publicDNSZone: id: qe-shared-vpc project: openshift-qe-shared-vpc privateDNSZone: id: ipi-xpn-private-zone project: openshift-qe-shared-vpc network: installer-shared-vpc networkProjectID: openshift-qe-shared-vpc $ yq-3.3.0 r test7/install-config.yaml baseDomain qe-shared-vpc.qe.gcp.devcluster.openshift.com $ yq-3.3.0 r test7/install-config.yaml metadata creationTimestamp: null name: jiwei-1027a $ $ openshift-install create cluster --dir test7 INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: platform.gcp.privateManagedZone: Invalid value: "ipi-xpn-private-zone": dns zone jiwei-1026a.qe1.gcp.devcluster.openshift.com. did not match expected jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com $ $ gcloud --project openshift-qe-shared-vpc dns managed-zones list --filter='name=qe-shared-vpc' NAME DNS_NAME DESCRIPTION VISIBILITY qe-shared-vpc qe-shared-vpc.qe.gcp.devcluster.openshift.com. public $ gcloud --project openshift-qe-shared-vpc dns managed-zones list --filter='name=ipi-xpn-private-zone' NAME DNS_NAME DESCRIPTION VISIBILITY ipi-xpn-private-zone jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com. Preserved private zone for IPI XPN private $ gcloud dns managed-zones list --filter='name=ipi-xpn-private-zone' NAME DNS_NAME DESCRIPTION VISIBILITY ipi-xpn-private-zone jiwei-1026a.qe1.gcp.devcluster.openshift.com. Preserved private zone for IPI XPN private $ $ gcloud --project openshift-qe-shared-vpc dns managed-zones describe qe-shared-vpc cloudLoggingConfig: kind: dns#managedZoneCloudLoggingConfig creationTime: '2020-04-26T02:50:25.172Z' description: '' dnsName: qe-shared-vpc.qe.gcp.devcluster.openshift.com. id: '7036327024919173373' kind: dns#managedZone name: qe-shared-vpc nameServers: - ns-cloud-b1.googledomains.com. - ns-cloud-b2.googledomains.com. - ns-cloud-b3.googledomains.com. - ns-cloud-b4.googledomains.com. visibility: public $ $ gcloud --project openshift-qe-shared-vpc dns managed-zones describe ipi-xpn-private-zone cloudLoggingConfig: kind: dns#managedZoneCloudLoggingConfig creationTime: '2022-10-27T08:05:18.332Z' description: Preserved private zone for IPI XPN dnsName: jiwei-1027a.qe-shared-vpc.qe.gcp.devcluster.openshift.com. id: '5506116785330943369' kind: dns#managedZone name: ipi-xpn-private-zone nameServers: - ns-gcp-private.googledomains.com. privateVisibilityConfig: kind: dns#managedZonePrivateVisibilityConfig networks: - kind: dns#managedZonePrivateVisibilityConfigNetwork networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc visibility: private $ $ gcloud dns managed-zones describe ipi-xpn-private-zone cloudLoggingConfig: kind: dns#managedZoneCloudLoggingConfig creationTime: '2022-10-26T06:42:52.268Z' description: Preserved private zone for IPI XPN dnsName: jiwei-1026a.qe1.gcp.devcluster.openshift.com. id: '7663537481778983285' kind: dns#managedZone name: ipi-xpn-private-zone nameServers: - ns-gcp-private.googledomains.com. 
privateVisibilityConfig: kind: dns#managedZonePrivateVisibilityConfig networks: - kind: dns#managedZonePrivateVisibilityConfigNetwork networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe-shared-vpc/global/networks/installer-shared-vpc visibility: private $
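A sketch of the expected lookup using google.golang.org/api/dns/v1 (function and parameter names are hypothetical): the key point is that ManagedZones.Get receives the project from platform.gcp.privateDNSZone.project rather than the service project.
~~~
package gcpdns

import (
	"context"
	"fmt"
	"strings"

	dns "google.golang.org/api/dns/v1"
)

// validatePrivateZone looks up the managed zone in the project the user
// configured for it (platform.gcp.privateDNSZone.project), not the service
// project, and checks that its DNS name matches the cluster domain.
func validatePrivateZone(ctx context.Context, zoneProject, zoneName, clusterDomain string) error {
	svc, err := dns.NewService(ctx)
	if err != nil {
		return err
	}
	zone, err := svc.ManagedZones.Get(zoneProject, zoneName).Do()
	if err != nil {
		return fmt.Errorf("failed to get managed zone %q in project %q: %w", zoneName, zoneProject, err)
	}
	if strings.TrimSuffix(zone.DnsName, ".") != clusterDomain {
		return fmt.Errorf("dns zone %s did not match expected %s", zone.DnsName, clusterDomain)
	}
	return nil
}
~~~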
Description of problem:
The OpenShift installer hits an error when a topology section is missing from a failureDomain in install-config.yaml, as in this example:
- name: us-east-1
  region: us-east
  zone: us-east-1a
- name: us-east-2
  region: us-east
  zone: us-east-2a
  topology:
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
    networks:
    - ci-segment-154
    datastore: workload_share_vcsmdcncworkload2_vyC6a
Version-Release number of selected component (if applicable):
Build from latest master (4.12)
How reproducible:
Each time
Steps to Reproduce:
1. Create install-config.yaml for vSphere multi-zone.
2. Leave out a topology section (under failureDomains).
3. Attempt to create the cluster.
Actual results:
FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": platform.vsphere.failureDomains.topology.resourcePool: Invalid value: "//Resources": resource pool '//Resources' not found
Expected results:
Validation of topology before attempting to create any resources
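A minimal sketch of such an up-front validation, with simplified types mirroring the failureDomains schema (not the installer's actual validation code):
~~~
package main

import "fmt"

// failureDomain mirrors the relevant part of the install-config schema;
// simplified for illustration.
type failureDomain struct {
	Name     string
	Region   string
	Zone     string
	Topology *topology
}

type topology struct {
	ComputeCluster string
	Networks       []string
	Datastore      string
}

// validateFailureDomains rejects entries with no topology section up front,
// before the installer ever resolves resource pools or creates resources.
func validateFailureDomains(domains []failureDomain) error {
	for _, fd := range domains {
		if fd.Topology == nil {
			return fmt.Errorf("platform.vsphere.failureDomains[%s].topology: Required value: a topology section is required for each failure domain", fd.Name)
		}
	}
	return nil
}

func main() {
	fds := []failureDomain{
		{Name: "us-east-1", Region: "us-east", Zone: "us-east-1a"}, // missing topology
	}
	fmt.Println(validateFailureDomains(fds))
}
~~~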
Description of problem:
IPI installation failed with master nodes being NotReady and CCM error "alicloud: unable to split instanceid and region from providerID".
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-05-053337
How reproducible:
Always
Steps to Reproduce:
1. Try IPI installation on Alibaba Cloud with credentialsMode set to "Manual".
Actual results:
Installation failed.
Expected results:
Installation should succeed.
Additional info:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          34m     Unable to apply 4.12.0-0.nightly-2022-10-05-053337: an unknown error has occurred: MultipleErrors
$
$ oc get nodes
NAME                           STATUS     ROLES                  AGE   VERSION
jiwei-1012-02-9jkj4-master-0   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
jiwei-1012-02-9jkj4-master-1   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
jiwei-1012-02-9jkj4-master-2   NotReady   control-plane,master   30m   v1.25.0+3ef6ef3
$
CCM logs:
E1012 03:46:45.223137 1 node_controller.go:147] node-controller "msg"="fail to find ecs" "error"="cloud instance api fail, alicloud: unable to split instanceid and region from providerID, error unexpected providerID=" "providerId"="alicloud://"
E1012 03:46:45.223174 1 controller.go:317] controller/node-controller "msg"="Reconciler error" "error"="find ecs: cloud instance api fail, alicloud: unable to split instanceid and region from providerID, error unexpected providerID=" "name"="jiwei-1012-02-9jkj4-master-0" "namespace"=""
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/145768/ (Finished: FAILURE)
10-12 10:55:15.987 ./openshift-install 4.12.0-0.nightly-2022-10-05-053337
10-12 10:55:15.987 built from commit 84aa8222b622dee71185a45f1e0ba038232b114a
10-12 10:55:15.987 release image registry.ci.openshift.org/ocp/release@sha256:41fe173061b00caebb16e2fd11bac19980d569cd933fdb4fab8351cdda14d58e
10-12 10:55:15.987 release architecture amd64
FYI the installation could succeed with 4.12.0-0.nightly-2022-09-28-204419:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/145756/ (Finished: SUCCESS)
10-12 09:59:19.914 ./openshift-install 4.12.0-0.nightly-2022-09-28-204419
10-12 09:59:19.914 built from commit 9eb0224926982cdd6cae53b872326292133e532d
10-12 09:59:19.914 release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
10-12 09:59:19.914 release architecture amd64
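For illustration, a small Go sketch of the parsing that fails here, assuming the conventional alicloud://<region>.<instance-id> form; the nodes report a bare "alicloud://", so the region/instance split has nothing to work with. This only illustrates the symptom; the underlying issue is that the providerID is never populated.
~~~
package main

import (
	"fmt"
	"strings"
)

// parseProviderID splits an Alibaba Cloud providerID of the assumed form
// "alicloud://<region>.<instance-id>". A bare "alicloud://" with nothing
// after the scheme is rejected with a clear message.
func parseProviderID(providerID string) (region, instanceID string, err error) {
	rest := strings.TrimPrefix(providerID, "alicloud://")
	parts := strings.SplitN(rest, ".", 2)
	if rest == "" || len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("unable to split instanceid and region from providerID %q", providerID)
	}
	return parts[0], parts[1], nil
}

func main() {
	_, _, err := parseProviderID("alicloud://") // what the master nodes report
	fmt.Println(err)
	r, id, _ := parseProviderID("alicloud://us-east-1.i-0xabc123")
	fmt.Println(r, id)
}
~~~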
Sample archive with both resources:
archives/compressed/3c/3cc4318d-e564-450b-b16e-51ef279b87fa/202209/30/200617.tar.gz
Sample query to find more archives:
with t as (
    select cluster_id, file_path, json_extract_scalar(content, '$.kind') as kind
    from raw_io_archives
    where date = '2022-09-30' and file_path like 'config/storage/%'
)
select cluster_id, count(*) as cnt
from t
group by cluster_id
order by cnt desc;
This is a clone of issue OCPBUGS-5559. The following is the description of the original issue:
—
Description of problem:
Azure VIP 168.63.129.16 needs to be in noProxy to let a VM report back about its creation status [1]. A similar thing needs to be done for the armEndpoint of Azure Stack Hub (ASH), to make sure that future cluster nodes do not communicate with the Stack Hub API through the proxy.
[1] https://docs.microsoft.com/en-us/azure/virtual-network/what-is-ip-address-168-63-129-16
Version-Release number of selected component (if applicable):
4.10.20
How reproducible:
Need to have a proxy server in ASH and run the installer
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected these two entries to be auto-added, as they are very specific and difficult to troubleshoot.
Expected results:
Additional info:
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2104997 against the cluster-network-operator since the fix involves changing both the operator and the installer.
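A minimal sketch of what auto-adding these entries could look like (hypothetical helper; the real change spans the installer and the cluster-network-operator):
~~~
package main

import (
	"fmt"
	"strings"
)

// defaultNoProxy sketches auto-adding the two Azure Stack Hub specific
// entries discussed above: the 168.63.129.16 wire-server VIP and the Hub's
// ARM endpoint. The endpoint value in main is a placeholder, not a real URL.
func defaultNoProxy(userNoProxy, armEndpoint string) string {
	entries := []string{"168.63.129.16"}
	if armEndpoint != "" {
		entries = append(entries, armEndpoint)
	}
	if userNoProxy != "" {
		entries = append(entries, userNoProxy)
	}
	return strings.Join(entries, ",")
}

func main() {
	fmt.Println(defaultNoProxy(".cluster.local,.svc", "management.local.azurestack.external"))
}
~~~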
Description of problem:
Currently, when installing OpenShift on OpenStack, the cluster name length is limited to 14 characters. The customer wants to know whether it is possible to override this validation when installing OpenShift on OpenStack and create a cluster name longer than 14 characters.
Version: OCP 4.8.5 UPI, disconnected
Environment: OpenStack 16
Issue: The user reports that they get an error for an OCP cluster on OpenStack UPI where the name of the cluster is > 14 characters.
Error events:
~~~
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/openshift-install", "create", "manifests", "--dir=/home/gitlab-runner/builds/WK8mkokN/0/CPE/SKS/pipelines/non-prod/ocp4-openstack-build/ocpinstaller/install-upi"], "delta": "0:00:00.311397", "end": "2022-09-03 21:38:41.974608", "msg": "non-zero return code", "rc": 1, "start": "2022-09-03 21:38:41.663211", "stderr": "level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters", "stderr_lines": ["level=fatal msg=failed to fetch Master Machines: failed to load asset \"Install Config\": invalid \"install-config.yaml\" file: metadata.name: Invalid value: \"sks-osp-inf-cpe-1-cbr1a\": cluster name is too long, please restrict it to 14 characters"], "stdout": "", "stdout_lines": []}
~~~
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Users are getting the error "cluster name is too long" when the cluster name contains more than 14 characters for OCP on OpenStack.
Expected results:
The 14-character limit should be changed for the OCP cluster name on OpenStack.
Additional info:
Description of problem:
The pod count for maxUnavailable of 2 or more is displayed as singular
Version-Release number of selected component (if applicable):
4.12.0-ec.2
How reproducible:
Steps to Reproduce:
1. Create a Deployment.
2. Add a PDB to the Deployment and set the maxUnavailable to 2.
3. Go to the Deployment details page.
Actual results:
The Max unavailable 6 of 3 pod
Expected results:
Should be Max unavailable 6 of 3 pods
Additional info:
Description of problem:
When running node-density (245 pods/node) on a 120 node cluster, we see that there is a huge spike (~22s) in Avg pod-latency. When the spike occurs we see all the ovnkube-master pods go through a restart.
The restart happens because of the following panic in the ovnkube-master pods:
2022-08-10T04:04:44.494945179Z panic: reflect: call of reflect.Value.Len on ptr Value
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-09-114621
How reproducible:
Steps to Reproduce:
1. Run node-density on a 120 node cluster
Actual results:
Spike observed in pod-latency graph ~22s
Expected results:
Steady pod-latency graph ~4s
Additional info:
Description of problem:
According to OCP 4.11 doc (https://docs.openshift.com/container-platform/4.11/installing/installing_gcp/installing-gcp-account.html#installation-gcp-enabling-api-services_installing-gcp-account), the Service Usage API (serviceusage.googleapis.com) is an optional API service to be enabled. But, the installation cannot succeed if this API is disabled.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-25-071630
How reproducible:
Always, if the Service Usage API is disabled in the GCP project.
Steps to Reproduce:
1. Make sure the Service Usage API (serviceusage.googleapis.com) is disabled in the GCP project.
2. Try IPI installation in the GCP project.
Actual results:
The installation would fail finally, without any worker machines launched.
Expected results:
Installation should succeed, or the OCP doc should be updated.
Additional info:
Please see the attached must-gather logs (http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-0926-03-cnxn5/) and the sanity check results. FYI if enabling the API, and without changing anything else, the installation could succeed.
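Until the installer or docs change, a minimal workaround sketch is to enable the API up front (assuming gcloud access to the project and permission to enable services; the project ID is a placeholder):

$ gcloud services enable serviceusage.googleapis.com --project <project-id>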
This is a clone of issue OCPBUGS-2290. The following is the description of the original issue:
—
Description of problem:
If you try to deploy with the Internal publishing strategy, and you either already have a public gateway or have already permitted the VPC subnet to the DNS service, the deploy will always fail.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Add a public gateway to the VPC network and/or add the VPC subnet to permitted DNS networks.
2. Set the publish strategy to Internal.
3. Deploy.
Actual results:
Deploy fails
Expected results:
If the resources exist simply skip trying to create them.
Additional info:
Fix here https://github.com/openshift/installer/pull/6481
This bug is here to help cherry pick PR https://github.com/openshift/ironic-image/pull/320 into 4.12 branch
For a disconnected installation, we should not be able to provision machines successfully with publicIP:true. This was the behavior until 4.11 and the ~17th Aug 4.12 nightlies, but 4.12 has since started allowing creation of machines with publicIP:true set in the machineset.
Issue reproduced on - Cluster version - 4.12.0-0.nightly-2022-08-23-223922
It is always reproducible.
Steps :
Create machineset using yaml with
{"spec":{"providerSpec":{"value":{"publicIP": true}}}}
The machineset is created successfully and the machine is provisioned successfully.
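For illustration only, a sketch of applying that providerSpec change to an existing machineset (the machineset name is a placeholder; note that in a MachineSet the providerSpec sits under spec.template.spec):

$ oc -n openshift-machine-api patch machineset <machineset-name> --type merge \
    -p '{"spec":{"template":{"spec":{"providerSpec":{"value":{"publicIP":true}}}}}}'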
This seems to be a regression bug; refer to https://bugzilla.redhat.com/show_bug.cgi?id=1889620
Here is the must gather log - https://drive.google.com/file/d/1UXjiqAx7obISTxkmBsSBuo44ciz9HD1F/view?usp=sharing
Here is the test successfully run for 4.11, for exactly the same profile, where machine creation failed with an InvalidConfiguration error: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Runner/575822/console
We can confirm it is a disconnected cluster using the ImageContentSourcePolicy below; there are a lot of mirrors in use:
oc get ImageContentSourcePolicy image-policy-aosqe -o yaml

apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  creationTimestamp: "2022-08-24T09:08:47Z"
  generation: 1
  name: image-policy-aosqe
  resourceVersion: "34648"
  uid: 20e45d6d-e081-435d-b6bb-16c4ca21c9d6
spec:
  repositoryDigestMirrors:
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6001/olmqe
    source: quay.io/olmqe
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6001/openshifttest
    source: quay.io/openshifttest
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6001/openshift-qe-optional-operators
    source: quay.io/openshift-qe-optional-operators
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6002
    source: registry.redhat.io
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6002
    source: registry.stage.redhat.io
  - mirrors:
    - miyadav-2408a.mirror-registry.qe.azure.devcluster.openshift.com:6002
    source: brew.registry.redhat.io
Description of problem:
Setting up the GitHub App from the console lacks the required permissions.
Events and Permissions: https://pipelinesascode.com/docs/install/github_apps/
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Setup Github App from administrator perspective.
2. Create Repository and configure it to use the Github App method.
Actual results:
Creates Github App with limited permission.
Expected results:
Created Github App should contain all the required permission and should trigger the pipelinerun successfully on git events.
Additional info:
The console needs to update the default_events and default_permissions here; they have to match the CLI (refer this).
We need to update the "See GitHub permissions" section in the UI as well.
This is a clone of issue OCPBUGS-1453. The following is the description of the original issue:
—
Description of problem:
The TargetDown alert fired when it shouldn't have. Prometheus endpoints are not always properly unregistered, and the alert will therefore think that some Kube service endpoints are down.
Version-Release number of selected component (if applicable):
The problem has always been there.
How reproducible:
Not reproducible. Most of the time Prometheus endpoints are properly unregistered. The aim here is to make the TargetDown Prometheus expression more resilient; this can be tested on past metrics data in which the unregistration issue was encountered.
Steps to Reproduce:
N/A
Actual results:
TargetDown alert triggered while Kube service endpoints are all up & running.
Expected results:
TargetDown alert should not have been triggered.
Description of problem:
Whereabouts reconciliation is not launched when the net-attach-def that references Whereabouts is defined in a conflist via the networks object.
How reproducible:
Always
Steps to Reproduce:
1. oc edit the networks object and create a net-attach-def that references whereabouts – in a conflist.
Actual results:
The reconciler is not launched.
Expected results:
The reconciler is launched.
This fix contains the following changes coming from updated version of kubernetes up to v1.25.7:
Changelog:
v1.25.7: https://github.com/kubernetes/kubernetes/blob/release-1.25/CHANGELOG/CHANGELOG-1.25.md#changelog-since-v1256
Description of problem:
revert "force cert rotation every couple days for development" in 4.12 We want short expiry times during development and long expiry times when we ship. --- Additional comment from Eric Paris on 2020-04-02 19:57:29 CEST --- This bug has been set to target the 4.5.0 release without specifying a severity. As part of triage when determining the priority of bugs a severity should be specified. Since these bugs have no been properly triaged I am removing the target release. Teams will need to add a severity before deferring these bugs again. --- Additional comment from Michal Fojtik on 2020-05-12 12:45:25 CEST --- This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity. If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. --- Additional comment from Standa Laznicka on 2020-05-12 14:53:12 CEST --- you don't really want to close this --- Additional comment from Stefan Schimanski on 2020-05-19 13:11:00 CEST --- Waiting for master to open. We will fix it then on the release branch. --- Additional comment from Stefan Schimanski on 2020-06-18 12:23:34 CEST --- Will be done when 4.6 branches from master. --- Additional comment from Michal Fojtik on 2020-07-09 14:46:02 CEST --- Stefan is PTO, adding UpcomingSprint to his bugs to fulfill the duty. --- Additional comment from Michal Fojtik on 2020-08-24 15:12:08 CEST --- This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. --- Additional comment from Michal Fojtik on 2020-08-31 15:59:33 CEST --- This bug hasn't had any activity 7 days after it was marked as LifecycleStale, so we are closing this bug as WONTFIX. If you consider this bug still valuable, please reopen it or create new bug. --- Additional comment from Michal Fojtik on 2020-08-31 17:00:25 CEST --- The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified. --- Additional comment from Stefan Schimanski on 2020-09-11 13:00:27 CEST --- This is waiting for Eric Paris to stop fast forwarding release-4.6 from master. --- Additional comment from Michal Fojtik on 2020-10-30 11:12:07 CET --- This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. 
If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that. --- Additional comment from Nick Stielau on 2021-01-20 18:49:09 CET --- Can we get some context on why this is blocker+? Would we further delay the release if we don't get a fix in for this? --- Additional comment from Stefan Schimanski on 2021-03-16 17:28:08 CET --- --- Additional comment from Eric Paris on 2021-06-08 14:00:16 CEST --- This bug sets blocker+ without setting a Target Release. This is an invalid state as it is impossible to determine what is being blocked. Please be sure to set Priority, Severity, and Target Release before you attempt to set blocker+ --- Additional comment from Michal Fojtik on 2021-06-10 10:49:36 CEST --- This is a blocker? until we have Target Release 4.9 (it is a blocker+ for 4.9). --- Additional comment from Wally on 2021-06-11 15:14:26 CEST --- Setting blocker- until next week to clear reports heading to code freeze. Will reset once 4.9 opens. --- Additional comment from Wally on 2021-08-31 19:26:13 UTC --- Setting blocker- until next week to clear reports heading to code freeze. Will reset once 4.10 opens. --- Additional comment from Michal Fojtik on 2022-02-03 21:53:15 UTC --- ** A NOTE ABOUT USING URGENT ** This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority Engineers are asked to stop whatever they are doing, putting everything else on hold. Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility. NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity. ** INFORMATION REQUIRED ** Please answer these questions before escalation to engineering: 1. Has a link to must-gather output been provided in this BZ? We cannot work without. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather. 2. Give the output of "oc get clusteroperators -o yaml". 3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed [yes/no] 4. List the top 5 relevant errors from the logs of the operators and operands in (3). 5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the other, root-cause at the top. 6. Explain why (5) is likely the right order and list the information used for that assessment. 7. Explain why Engineering is necessary to make progress. --- Additional comment from Wally on 2022-02-09 20:11:25 UTC --- Setting blocker- for now but will add reminder and keep in my queue for visibility. --- Additional comment from Red Hat Bugzilla on 2022-05-09 08:32:21 UTC --- Account disabled by LDAP Audit for extended failure --- Additional comment from OpenShift Automated Release Tooling on 2022-06-24 01:06:13 UTC --- Elliott changed bug status from MODIFIED to ON_QA. 
This bug is expected to ship in the next 4.11 release. --- Additional comment from Ke Wang on 2022-06-24 15:24:03 UTC --- To verify the bug, refer to https://bugzilla.redhat.com/show_bug.cgi?id=1921139#c6 --- Additional comment from OpenShift BugZilla Robot on 2022-06-25 12:40:12 UTC --- Bugfix included in accepted release 4.11.0-0.nightly-2022-06-25-081133 Bug will not be automatically moved to VERIFIED for the following reasons: - PR openshift/cluster-kube-apiserver-operator#1307 not approved by QA contact This bug must now be manually moved to VERIFIED by dpunia@redhat.com --- Additional comment from Deepak Punia on 2022-06-27 08:20:33 UTC --- Below is the steps to verify this bug: # oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator cluster-kube-apiserver-operator https://github.com/openshift/cluster-kube-apiserver-operator 7764681777edfa3126981a0a1d390a6060a840a3 # git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307" 08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation # oc get clusterversions.config.openshift.io NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-06-25-081133 True False 64m Cluster version is 4.11.0-0.nightly-2022-06-25-081133 $ cat scripts/check_secret_expiry.sh FILE="$1" if [ ! -f "$1" ]; then echo "must provide \$1" && exit 0 fi export IFS=$'\n' for i in `cat "$FILE"` do if `echo "$i" | grep "^#" > /dev/null`; then continue fi NS=`echo $i | cut -d ' ' -f 1` SECRET=`echo $i | cut -d ' ' -f 2` rm -f tls.crt; oc extract secret/$SECRET -n $NS --confirm > /dev/null echo "Check cert dates of $SECRET in project $NS:" openssl x509 -noout --dates -in tls.crt; echo done $ cat certs.txt openshift-kube-controller-manager-operator csr-signer-signer openshift-kube-controller-manager-operator csr-signer openshift-kube-controller-manager kube-controller-manager-client-cert-key openshift-kube-apiserver-operator aggregator-client-signer openshift-kube-apiserver aggregator-client openshift-kube-apiserver external-loadbalancer-serving-certkey openshift-kube-apiserver internal-loadbalancer-serving-certkey openshift-kube-apiserver service-network-serving-certkey openshift-config-managed kube-controller-manager-client-cert-key openshift-config-managed kube-scheduler-client-cert-key openshift-kube-scheduler kube-scheduler-client-cert-key Checking the Certs, they are with one day expiry times, this is as expected. 
# ./check_secret_expiry.sh certs.txt Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator: notBefore=Jun 27 04:41:38 2022 GMT notAfter=Jun 28 04:41:38 2022 GMT Check cert dates of csr-signer in project openshift-kube-controller-manager-operator: notBefore=Jun 27 04:52:21 2022 GMT notAfter=Jun 28 04:41:38 2022 GMT Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator: notBefore=Jun 27 04:41:37 2022 GMT notAfter=Jun 28 04:41:37 2022 GMT Check cert dates of aggregator-client in project openshift-kube-apiserver: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jun 28 04:41:37 2022 GMT Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:49 2022 GMT notAfter=Jul 27 04:52:50 2022 GMT Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver: notBefore=Jun 27 04:52:28 2022 GMT notAfter=Jul 27 04:52:29 2022 GMT Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed: notBefore=Jun 27 04:52:26 2022 GMT notAfter=Jul 27 04:52:27 2022 GMT Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed: notBefore=Jun 27 04:52:47 2022 GMT notAfter=Jul 27 04:52:48 2022 GMT Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler: notBefore=Jun 27 04:52:47 2022 GMT notAfter=Jul 27 04:52:48 2022 GMT # # cat check_secret_expiry_within.sh #!/usr/bin/env bash # usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year WITHIN=${1:-24hours} echo "Checking validity within $WITHIN ..." oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before") \(.metadata.annotations."auth.openshift.io/certificate-not-after") \(.metadata.namespace)\t\(.metadata.name)"' # ./check_secret_expiry_within.sh 1day Checking validity within 1day ... 2022-06-27T04:41:37Z 2022-06-28T04:41:37Z openshift-kube-apiserver-operator aggregator-client-signer 2022-06-27T04:52:26Z 2022-06-28T04:41:37Z openshift-kube-apiserver aggregator-client 2022-06-27T04:52:21Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer 2022-06-27T04:41:38Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer-signer
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4101. The following is the description of the original issue:
—
Description of problem:
We experienced two separate upgrade failures relating to the introduction of the SYSTEM_RESERVED_ES node sizing parameter, causing kubelet to stop running.

One cluster (clusterA) upgraded from 4.11.14 to 4.11.17. It experienced an issue whereby /etc/node-sizing.env on its master nodes contained an empty SYSTEM_RESERVED_ES value:

---
cat /etc/node-sizing.env
SYSTEM_RESERVED_MEMORY=5.36Gi
SYSTEM_RESERVED_CPU=0.11
SYSTEM_RESERVED_ES=
---

causing the kubelet to not start up. To restore service, this file was manually updated to set a value (1Gi), and kubelet was restarted. We are uncertain what conditions led to this occurring on the clusterA master nodes as part of the upgrade.

A second cluster (clusterB) upgraded from 4.11.16 to 4.11.17. It experienced an issue whereby worker nodes were impacted by a similar problem; however, this was because of a custom node-sizing-enabled.env MachineConfig which did not set SYSTEM_RESERVED_ES. This caused existing worker nodes to go into a NotReady state after the upgrade, and additionally new nodes did not join the cluster as their kubelet would become impacted.

For clusterB the conditions that left the value empty are better understood. However, for both clusters, if SYSTEM_RESERVED_ES ends up as empty on a node it can cause the kubelet to not start.

We have some asks as a result:
- Can MCO be made to recover from this situation if it occurs, perhaps through application of a safe default if none exists, such that kubelet would start correctly?
- Can there possibly be alerting that could indicate and draw attention to the misconfiguration?
Version-Release number of selected component (if applicable):
4.11.17
How reproducible:
Have not been able to reproduce it on a fresh cluster upgrading from 4.11.16 to 4.11.17
Expected results:
If SYSTEM_RESERVED_ES is empty in /etc/node-sizing*env, then a default should be applied and/or kubelet should be able to continue running.
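A minimal sketch of the kind of guard being asked for, assuming it would run wherever /etc/node-sizing.env is sourced before kubelet starts (the 1Gi fallback mirrors the manual recovery described above; the exact hook point is an assumption):

# hypothetical guard before kubelet reads the sizing values
. /etc/node-sizing.env
if [ -z "${SYSTEM_RESERVED_ES}" ]; then
  SYSTEM_RESERVED_ES=1Gi   # safe default, same value used in the manual fix
fi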
Additional info:
Description of problem:
The cluster-ingress-operator's updateIngressClass function logs "updated IngressClass" on a failure, when it should be logging that on a success.
Version-Release number of selected component (if applicable):
4.8+
How reproducible:
Easily
Steps to Reproduce:
# Simulate a change in an ingressclass that will be reconciled
oc apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: openshift-default
spec:
  controller: openshift.io/ingress-to-route
  parameters:
    apiGroup: operator.openshift.io
    kind: IngressController
    name: default
    scope: Namespace
    namespace: "test"
EOF

# Look at logs
oc logs -n openshift-ingress-operator $(oc get -n openshift-ingress-operator pods --no-headers | head -1 | awk '{print $1}') -c ingress-operator | grep "updated IngressClass"
# No output
Actual results:
<none>
Expected results:
2023-01-26T20:37:19.210Z INFO operator.ingressclass_controller ingressclass/ingressclass.go:63 updated IngressClass ...
Additional info:
Grafana has been removed in 4.11 and we can safely remove any logic in CMO that deals with Grafana (except dashboards since they are used by OCP console).
Another point to clarify is to communicate to ProdSec and ART that Grafana isn't part of OCP anymore.
Description of problem
`oc-mirror` does not work as expected with a relative path for an OCI format copy.
How reproducible:
always
Steps to Reproduce:
Copy the operator image with OCI format to localhost with relative path my-oci-catalog;
cat imageset-copy.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
operators:
`oc mirror --config imageset-copy.yaml --use-oci-feature --oci-feature-action=copy oci://my-oci-catalog`
Actual results:
2. Creates a dir with the name my-file-catalog, but does not use the user-specified dir my-oci-catalog:

ls -tl
total 20
drwxr-xr-x. 3 root root 4096 Dec  6 13:58 oc-mirror-workspace
drwxr-xr-x. 3 root root 4096 Dec  6 13:58 olm_artifacts
drwxr-x---. 3 root root 4096 Dec  6 13:58 my-file-catalog
drwxr-xr-x. 2 root root 4096 Dec  6 13:58 my-oci-catalog
-rw-r--r--. 1 root root  206 Dec  6 12:39 imageset-copy.yaml
Expected results:
2. Use the user-specified directory.
Additional info:
`oc-mirror --config config-operator.yaml oci:///home/ocmirrortest/noo --use-oci-feature --oci-feature-action=copy` with a full path works well.
Assisted Installer doesn't pivot the bootstrap node to machine-os; instead it creates a temporary overlay with OKD RPMs and starts bootkube. This makes it miss manifests from bootstrap/manifests, and MCO is unable to find the initial rendered-master- MC.
Bootstrap manifests need to be included in `okd-rpms` (possibly we should rename the image), and Assisted Service would copy manifests to `/opt/openshift/openshift` on the bootstrap node, so as to stay as close as possible to the bootstrap process.
Description of problem:
The IBM VPC block CSI driver was rebased to v5.0.0 in this PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/26

However, we're missing the manifest changes from this PR in 4.12 (delayed by CI issues): https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/45

That includes some important changes:
- add csi-snapshotter sidecar and snapshotter manifests
- only deploy volumesnapshotclass if CRD exists
- set consistent imagePullPolicy in deployment manifests
- enable topology tests
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-4656. The following is the description of the original issue:
—
Description of problem:
`/etc/hostname` may exist but be empty. The `vsphere-hostname` service should check that the file is not empty instead of just that it exists. OKD's machine-os-content starting from F37 has an empty /etc/hostname file, which breaks joining workers in vSphere IPI.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install OKD w/ workers on vsphere 2. 3.
Actual results:
Workers get hostname resolved using NM
Expected results:
Workers get hostname resolved using vmtoolsd
Additional info:
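A minimal sketch of the suggested check, in shell form (the vmtoolsd call mirrors how VMware Tools exposes the guest hostname, but treat the exact command and key as assumptions):

# -s is true only if the file exists AND is non-empty; -f only checks existence
if [ -s /etc/hostname ]; then
  echo "static hostname already set; nothing to do"
else
  # fall back to asking VMware Tools for the guest hostname
  hostnamectl set-hostname "$(vmtoolsd --cmd 'info-get guestinfo.hostname')"
fi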
Description of problem:
In ZTP input, we can put AdditionalNTPSources in order to have assisted-service mix the provided sources with those the nodes receive from DHCP. AdditionalNTPSources in AgentConfig needs to be generated in InfraEnv in order for it to be applied in the installation
Version-Release number of selected component (if applicable):
4.11 MVP patch 2
How reproducible:
100%
Steps to Reproduce:
1. Create an AgentConfig with AdditionalNTPSources, for example "0.fedora.pool.ntp.org".
2. Generate the ISO.
3. Deploy.
4. Check /etc/chrony.conf on the resulting cluster nodes.
Actual results:
chrony.conf only contains DHCP-provided NTP sources (if not a static network deployment).
Expected results:
/etc/chrony.conf in all the cluster nodes should have at least a server listed: server 0.fedora.pool.ntp.org iburst
Additional info:
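For reference, a sketch of the described input (field placement per agent-config.yaml conventions; treat the exact keys as assumptions):

cat agent-config.yaml (excerpt):

apiVersion: v1alpha1
kind: AgentConfig
metadata:
  name: example-cluster
additionalNTPSources:
- 0.fedora.pool.ntp.org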
This is a clone of issue OCPBUGS-266. The following is the description of the original issue:
—
Description of problem: I am working with a customer who uses the web console. From the Developer Perspective's Project Access tab, they cannot differentiate between users and groups and furthermore cannot add groups from this web console. This has led to confusion whether existing resources were in fact users or groups, and furthermore they have added users when they intended to add groups instead. What we really need is a third column in the Project Access tab that says whether a resource is a user or group.
Version-Release number of selected component (if applicable): This is an issue in OCP 4.10 and 4.11, and I presume future versions as well
How reproducible: Every time. My customer is running on ROSA, but I have determined this issue to be general to OpenShift.
Steps to Reproduce:
From the oc cli, I create a group and add a user to it.
$ oc adm groups new techlead
group.user.openshift.io/techlead created
$ oc adm groups add-users techlead admin
group.user.openshift.io/techlead added: "admin"
$ oc get groups
NAME USERS
cluster-admins
dedicated-admins admin
techlead admin
I create a new namespace so that I can assign a group project level access:
$ oc new-project my-namespace
$ oc adm policy add-role-to-group edit techlead -n my-namespace
I then went to the web console -> Developer perspective -> Project -> Project Access. I verified the rolebinding named 'edit' is bound to a group named 'techlead'.
$ oc get rolebinding
NAME ROLE AGE
admin ClusterRole/admin 15m
admin-dedicated-admins ClusterRole/admin 15m
admin-system:serviceaccounts:dedicated-admin ClusterRole/admin 15m
dedicated-admins-project-dedicated-admins ClusterRole/dedicated-admins-project 15m
dedicated-admins-project-system:serviceaccounts:dedicated-admin ClusterRole/dedicated-admins-project 15m
edit ClusterRole/edit 2m18s
system:deployers ClusterRole/system:deployer 15m
system:image-builders ClusterRole/system:image-builder 15m
system:image-pullers ClusterRole/system:image-puller 15m
$ oc get rolebinding edit -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
creationTimestamp: "2022-08-15T14:16:56Z"
name: edit
namespace: my-namespace
resourceVersion: "108357"
uid: 4abca27d-08e8-43a3-b9d3-d20d5c294bbe
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: edit
subjects:
Now back to the CLI, I view the newly created rolebinding named 'developer-view-c15b720facbc8deb', and find that the "View" role is assigned to a user named 'developer', rather than a group.
$ oc get rolebinding
NAME ROLE AGE
admin ClusterRole/admin 17m
admin-dedicated-admins ClusterRole/admin 17m
admin-system:serviceaccounts:dedicated-admin ClusterRole/admin 17m
dedicated-admins-project-dedicated-admins ClusterRole/dedicated-admins-project 17m
dedicated-admins-project-system:serviceaccounts:dedicated-admin ClusterRole/dedicated-admins-project 17m
edit ClusterRole/edit 4m25s
developer-view-c15b720facbc8deb ClusterRole/view 90s
system:deployers ClusterRole/system:deployer 17m
system:image-builders ClusterRole/system:image-builder 17m
system:image-pullers ClusterRole/system:image-puller 17m
$ oc get rolebinding developer-view-c15b720facbc8deb -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
creationTimestamp: "2022-08-15T14:19:51Z"
name: developer-view-c15b720facbc8deb
namespace: my-namespace
resourceVersion: "113298"
uid: cc2d1b37-922b-4e9b-8e96-bf5e1fa77779
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: view
subjects:
So in conclusion, from the Project Access tab, we're unable to add groups and unable to differentiate between users and groups. This is in essence our ask for this RFE.
Actual results:
Developer perspective -> Project -> Project Access tab shows a list of resources which can be users or groups, but does not differentiate between them. Furthermore, when we add resources, they are only users and there is no way to add a group from this tab in the web console.
Expected results:
Should have the ability to add groups and differentiate between users and groups. Ideally, we're looking at a third column for user or group.
Additional info:
This is a clone of issue OCPBUGS-3987. The following is the description of the original issue:
—
Description of problem:
When the user supplies nmstateConfig in agent-config.yaml invalid configurations may not be detected
Version-Release number of selected component (if applicable):
4.12
How reproducible:
every time
Steps to Reproduce:
1. Create an invalid NM config. In this case an interface was defined with a route but no IP address.
2. The ISO can be generated with no errors.
3. At run time the invalid config was detected by assisted-service; create-cluster-and-infraenv.service logged the error "failed to validate network yaml for host 0, invalid yaml, error:"
Actual results:
Installation failed
Expected results:
Invalid configuration would be detected when ISO is created
Additional info:
It looks like the ValidateStaticConfigParams check is ONLY done when the nmstateconfig is provided in nmstateconfig.yaml, not when the file is generated (supplied in agent-config.yaml). https://github.com/openshift/installer/blob/master/pkg/asset/agent/manifests/nmstateconfig.go#L188
Description of problem:
Clusters created with platform 'vsphere' in the install-config end up as type 'BareMetal' in the infrastructure CR.
Version-Release number of selected component (if applicable):
4.12.3
How reproducible:
100%
Steps to Reproduce:
1. Create a cluster through the agent installer with platform: vsphere in the install-config 2. oc get infrastructure cluster -o jsonpath='{.status.platform}'
Actual results:
BareMetal
Expected results:
VSphere
Additional info:
The platform type is not being case converted ("vsphere" -> "VSphere") when constructing the AgentClusterInstall CR. When read by the assisted-service client, the platform reads as unknown and therefore the platform field is left blank when the Cluster object is created in the assisted API. Presumably that results in the correct default platform for the topology: None for SNO, BareMetal for everything else, but never VSphere. Since the platform VIPs are passed through a non-platform-specific API in assisted, everything worked but the resulting cluster would have the BareMetal platform.
Description of problem:
This is the original bug: https://bugzilla.redhat.com/show_bug.cgi?id=2098054

It was fixed in https://github.com/openshift/kubernetes/pull/1340 but was reverted, as it introduced a bug that meant we did not register instances on create for NLB services. We need to fix the issue and reintroduce the fix.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3458. The following is the description of the original issue:
—
Description of problem:
Since way back in 4.8, we've had a banner with "To request update recommendations, configure a channel that supports your version" when ClusterVersion has RetrievedUpdates=False. But that's only one of several reasons we could be RetrievedUpdates=False. Can we pivot to passing through the ClusterVersion condition message?
Version-Release number of selected component (if applicable):
4.8 and later.
How reproducible:
100%
Steps to Reproduce:
1. Launch a cluster-bot cluster like 4.11.12.
2. Set a channel with oc adm upgrade channel stable-4.11.
3. Scale down the CVO with oc scale --replicas 0 -n openshift-cluster-version deployments/cluster-version-operator.
4. Patch in a RetrievedUpdates condition with:
$ CONDITIONS="$(oc get -o json clusterversion version | jq -c '[.status.conditions[] | if .type == "RetrievedUpdates" then .status = "False" | .message = "Testing" else . end]')"
$ oc patch --subresource status clusterversion version --type json -p "[{\"op\": \"add\", \"path\": \"/status/conditions\", \"value\": ${CONDITIONS}}]"
5. View the admin console at /settings/cluster.
Actual results:
Advice about configuring the channel (but it's already configured).
Expected results:
See the message you patched into the RetrievedUpdates condition.
This is a clone of issue OCPBUGS-3085. The following is the description of the original issue:
—
Description of problem:
IPI on BareMetal Dual stack deployment failed and Bootstrap timed out before completion
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-25-210451
How reproducible:
Always
Steps to Reproduce:
1. Deploy IPI on BM using Dual stack 2. 3.
Actual results:
Deployment failed
Expected results:
Should pass
Additional info:
Same deployment works fine on 4.11
Description of problem:
Ingress Controller is missing a required AWS resource permission for SC2S region us-isob-east-1.

During the OpenShift 4 installation in SC2S region us-isob-east-1, the ingress operator degrades due to a missing "route53:ListTagsForResources" permission from the "openshift-ingress" CredentialsRequest, for which the customer proactively raised a PR: https://github.com/openshift/cluster-ingress-operator/pull/868

The code disables part of the logic for C2S isolated regions here:
https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L167-L168

By not setting tagConfig, it results in the m.tags field being set nil:
https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L212-L222

This then drives the logic in the getZoneID method to use either lookupZoneID or lookupZoneIDWithoutResourceTagging:
https://github.com/openshift/cluster-ingress-operator/blob/d9d1a2b44cc7955a18fbedfdc973daddba67bccd/pkg/dns/aws/dns.go#L280-L284

BLAB: the lookupZoneIDWithoutResourceTagging method is only ever called for endpoints.AwsIsoPartitionID, endpoints.AwsIsoBPartitionID regions.
Version-Release number of selected component (if applicable):
How reproducible:
Everytime
Steps to Reproduce:
1. Create an IPI cluster in SC2S region us-isob-east-1.
Actual results:
The ingress operator degrades due to the missing "route53:ListTagsForResources" permission with the following error:
~~~
The DNS provider failed to ensure the record: failed to find hosted zone for record: failed to get tagged resources: AccessDenied: User ....... rye... is not authorized to perform: route53:ListTagsForResources on resource.... hostedzone/.. because no identity-based policy allows the route53:ListTagsForResources
~~~
Expected results:
Ingress operator should be in available state for new installation.
Additional info:
This is a clone of issue OCPBUGS-5466. The following is the description of the original issue:
—
Description of problem:
It is possible to change some of the fields in default catalogSource specs and the Marketplace Operator will not revert the changes
Version-Release number of selected component (if applicable):
4.13.0 and back
How reproducible:
Always
Steps to Reproduce:
1. Create a 4.13.0 OpenShift cluster.
2. Set the redhat-operator catalogSource.spec.grpcPodConfig.securityContextConfig field to `legacy`.
Actual results:
The field remains set to `legacy` mode.
Expected results:
The field is reverted to `restricted` mode.
Additional info:
This code needs to be updated to account for new fields in the catalogSource spec.
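For reference, a sketch of the step-2 change from the CLI (the catalog source name follows the default catalogs; the field path comes from the description above, so treat specifics as assumptions):

$ oc -n openshift-marketplace patch catalogsource redhat-operators --type merge \
    -p '{"spec":{"grpcPodConfig":{"securityContextConfig":"legacy"}}}'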
Description of problem:
In a completely disconnected cluster, the dev catalog takes too much time to load.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. A complete disconnected cluster
2. In the Add page, go to the All services page.
3.
Actual results:
Taking too much time to load.
Expected results:
Time taken should be reduced
Additional info:
Attached a gif for reference
Description of problem:
Scaling up more worker nodes does not add them to the Load Balancer instances (backend pool); if the router pods move to the new worker nodes, co/ingress becomes degraded.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-23-204408
How reproducible:
100%
Steps to Reproduce:
1. Ensure the fresh install cluster works well.
2. Scale up worker nodes:

$ oc -n openshift-machine-api get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
hongli-1024-hnkrm-worker-us-east-2a   1         1         1       1           5h21m
hongli-1024-hnkrm-worker-us-east-2b   1         1         1       1           5h21m
hongli-1024-hnkrm-worker-us-east-2c   1         1         1       1           5h21m

$ oc -n openshift-machine-api scale machineset hongli-1024-hnkrm-worker-us-east-2a --replicas=2
machineset.machine.openshift.io/hongli-1024-hnkrm-worker-us-east-2a scaled
$ oc -n openshift-machine-api scale machineset hongli-1024-hnkrm-worker-us-east-2b --replicas=2
machineset.machine.openshift.io/hongli-1024-hnkrm-worker-us-east-2b scaled

(about 5 minutes later)

$ oc -n openshift-machine-api get machineset
NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
hongli-1024-hnkrm-worker-us-east-2a   2         2         2       2           5h29m
hongli-1024-hnkrm-worker-us-east-2b   2         2         2       2           5h29m
hongli-1024-hnkrm-worker-us-east-2c   1         1         1       1           5h29m

3. Delete the router pods so that new ones run on the new workers:

$ oc get node
NAME                                         STATUS   ROLES                  AGE     VERSION
ip-10-0-128-45.us-east-2.compute.internal    Ready    worker                 71m     v1.25.2+4bd0702
ip-10-0-131-192.us-east-2.compute.internal   Ready    control-plane,master   6h35m   v1.25.2+4bd0702
ip-10-0-139-51.us-east-2.compute.internal    Ready    worker                 6h29m   v1.25.2+4bd0702
ip-10-0-162-228.us-east-2.compute.internal   Ready    worker                 71m     v1.25.2+4bd0702
ip-10-0-172-216.us-east-2.compute.internal   Ready    control-plane,master   6h35m   v1.25.2+4bd0702
ip-10-0-190-82.us-east-2.compute.internal    Ready    worker                 6h25m   v1.25.2+4bd0702
ip-10-0-196-26.us-east-2.compute.internal    Ready    control-plane,master   6h35m   v1.25.2+4bd0702
ip-10-0-199-158.us-east-2.compute.internal   Ready    worker                 6h28m   v1.25.2+4bd0702

$ oc -n openshift-ingress get pod -owide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                         NOMINATED NODE   READINESS GATES
router-default-86444dcd84-cm96l   1/1     Running   0          65m   10.130.2.7   ip-10-0-128-45.us-east-2.compute.internal    <none>           <none>
router-default-86444dcd84-vpnjz   1/1     Running   0          65m   10.131.2.7   ip-10-0-162-228.us-east-2.compute.internal   <none>           <none>
Actual results:
$ oc get co ingress console authentication
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress          4.12.0-0.nightly-2022-10-23-204408   True        False         True       66m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
console          4.12.0-0.nightly-2022-10-23-204408   False       False         False      66m     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.hongli-1024.qe.devcluster.openshift.com): Get "https://console-openshift-console.apps.hongli-1024.qe.devcluster.openshift.com": EOF
authentication   4.12.0-0.nightly-2022-10-23-204408   False       False         True       66m     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.hongli-1024.qe.devcluster.openshift.com/healthz": EOF

Checked the Load Balancer on the AWS console and found that the newly created nodes are not added to the load balancer. See the snapshot attached.
Expected results:
The LB should add newly created instances automatically, and ingress should work with the new workers.
Additional info:
1. This is also reproducible with a common user-created LoadBalancer service.
2. If the LB service is created after adding the new nodes, then it works well; we can see that all nodes are added to the LB on the AWS console.
This is a clone of issue OCPBUGS-5547. The following is the description of the original issue:
—
Description of problem:
This is a follow-up on https://bugzilla.redhat.com/show_bug.cgi?id=2083087 and https://github.com/openshift/console/pull/12390
When creating a Knative Service and deleting it again with the enabled option "Delete other resources created by console" (only available on 4.13+ with the PR above), the secret "$name-github-webhook-secret" is not deleted.
When the user tries to create the same Knative Service again this fails with an error:
An error occurred
secrets "nodeinfo-github-webhook-secret" already exists
Version-Release number of selected component (if applicable):
4.13
(we might want to backport this together with https://github.com/openshift/console/pull/12390 and OCPBUGS-5548)
How reproducible:
Always
Steps to Reproduce:
Actual results:
Deleted resources:
Expected results:
Should also remove this resource
Additional info:
When deleting the whole application, all the resources are deleted correctly (and just once)!
Specifying spec.nodePlacement.nodeSelector.matchExpressions on an IngressController API object causes cluster-ingress-operator to log error messages instead of configuring a node selector.
All versions of OpenShift from 4.1 to 4.12.
100%.
1. Create an IngressController object with the following:
spec:
  nodePlacement:
    nodeSelector:
      matchExpressions:
      - key: node.openshift.io/remotenode
        operator: DoesNotExist
2. Check the cluster-ingress-operator logs: oc -n openshift-ingress-operator logs -c ingress-operator deploy/ingress-operator
The cluster-ingress-operator logs show the following error message:
2022-01-19T13:25:22.242Z ERROR operator.init.controller-runtime.manager.controller.ingress_controller controller/controller.go:253 Reconciler error {"name": "default", "namespace": "openshift-ingress-operator", "error": "failed to ensure deployment: failed to build router deployment: ingresscontroller \"default\" has invalid spec.nodePlacement.nodeSelector: operator \"NotIn\" cannot be converted into the old label selector format", "errorCauses": [{"error": "failed to ensure deployment: failed to build router deployment: ingresscontroller \"default\" has invalid spec.nodePlacement.nodeSelector: operator \"DoesNotExist\" cannot be converted into the old label selector format"}]}
Ideally, router pods should be configured with the specified node selector, and cluster-ingress-operator should not log an error. Unfortunately, this result cannot be implemented (see "Additional info").
Alternatively, we should document that using the spec.nodePlacement.nodeSelector.matchExpressions is unsupported.
Although it is possible to put a complex match expression in the IngressController.spec.nodePlacement.nodeSelector API field, it is impossible for the operator to convert this into a node selector for the router deployment's pod template spec because the latter requires the node selector be in a string form, and the string form for node selectors does not support complex expressions. This is an unfortunate oversight in the design of the API. We cannot make complex expressions work, and we cannot make a breaking API change, so the only feasible option here is to change the API godoc to warn users that using matchExpressions is not supported.
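For contrast, the simple matchLabels form does convert to the string node-selector format, so a sketch like the following (the label key is illustrative) works today:

spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""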
Related discussion: https://github.com/openshift/api/pull/870#discussion_r601577395.
This Jira issue is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2043573 to placate automation.
Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-5960.
This is a clone of issue OCPBUGS-4701. The following is the description of the original issue:
—
Description of problem:
In at least 4.12.0-rc.0, a user with read-only access to ClusterVersion can see a "Control plane is hosted" banner (despite the control plane not being hosted), because hasPermissionsToUpdate is false, so canPerformUpgrade is false.
Version-Release number of selected component (if applicable):
4.12.0-rc.0. Likely more. I haven't traced it out.
How reproducible:
Always.
Steps to Reproduce:
1. Install 4.12.0-rc.0
2. Create a user with cluster-wide read-only permissions. For me, it's via binding to a sudoer ClusterRole. I'm not sure where that ClusterRole comes from, but it's:
$ oc get -o yaml clusterrole sudoer
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: "2020-05-21T19:39:09Z"
  name: sudoer
  resourceVersion: "7715"
  uid: 28eb2ffa-dccd-47e8-a2d5-6a95e0e8b1e9
rules:
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:admin
  resources:
  - systemusers
  - users
  verbs:
  - impersonate
- apiGroups:
  - ""
  - user.openshift.io
  resourceNames:
  - system:masters
  resources:
  - groups
  - systemgroups
  verbs:
  - impersonate
3. View /settings/cluster
Actual results:
See the "Control plane is hosted" banner.
Expected results:
Possible cases:
There is a bug where creating OLM subscription manifests early in the installation process results in those OLM operators not being installed.
This is because the OLM installation Jobs fail when they are tried early in the installation process, and OLM does not retry those jobs sufficiently and eventually gives up on them.
This should be solved starting OCP 4.12, but until then, we should solve this using Assisted.
A way to solve this is to delay the installation of OLM operators to only occur after the cluster is up and healthy.
This can be done by creating the subscriptions with "installPlanApproval" set to "Manual" instead of "Automatic". Then, once the cluster is up and healthy, the assisted-controller should approve the InstallPlans that OLM will create for the operators. This will then trigger the installation, which is more likely to succeed since the cluster is up and healthy at this point.
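A minimal sketch of the flow (the operator, namespace, and channel are illustrative; the InstallPlan patch uses the standard spec.approved field):

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: openshift-local-storage
spec:
  channel: stable
  name: local-storage-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual
EOF

# later, once the cluster is up and healthy, approve the InstallPlan OLM created
plan=$(oc -n openshift-local-storage get installplan -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-local-storage patch installplan "$plan" --type merge -p '{"spec":{"approved":true}}'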
In multinode we can check the nodes object in the kube API, as we can't really validate hosts that are not part of the cluster, only the one the controller is running on.
And we should validate the IP of the host the controller is running on.
In case the IP was changed, log it.
This is a clone of issue OCPBUGS-1061. The following is the description of the original issue:
—
Description of problem:
grant monitoring-alertmanager-edit role to user
# oc adm policy add-cluster-role-to-user cluster-monitoring-view testuser-11
# oc adm policy add-role-to-user monitoring-alertmanager-edit testuser-11 -n openshift-monitoring --role-namespace openshift-monitoring
As the monitoring-alertmanager-edit user in the administrator console, the "Observe - Alerting - Silences" page hangs pending while listing silences; debugging in the console showed no findings.
Creating a silence for the Watchdog alert with the monitoring-alertmanager-edit user also leaves the silence page pending. Checked with the kubeadmin user: the "Observe - Alerting - Silences" page shows the Watchdog alert is silenced, but checked with the monitoring-alertmanager-edit user, the Watchdog alert is not silenced.
This should be a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1947005 since 4.9; there was no such issue then, but there is a similar issue with 4.9.0-0.nightly-2022-09-05-125502 now.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-08-114806
How reproducible:
always
Steps to Reproduce:
1. see the description 2. 3.
Actual results:
In the administrator console, when the monitoring-alertmanager-edit user lists or creates a silence, the "Observe - Alerting - Silences" page is pending.
Expected results:
should not be pending
Additional info:
This is a clone of issue OCPBUGS-2141. The following is the description of the original issue:
—
Description of problem:
On a 4.12 cluster with no PV for Prometheus, the doc link still points to 4.8.

# oc get co monitoring -o jsonpath='{.status.conditions}' | jq 'map(select(.type=="Degraded"))'
[
  {
    "lastTransitionTime": "2022-10-09T02:36:16Z",
    "message": "Prometheus is running without persistent storage which can lead to data loss during upgrades and cluster disruptions. Please refer to the official documentation to see how to configure storage for Prometheus: https://docs.openshift.com/container-platform/4.8/monitoring/configuring-the-monitoring-stack.html",
    "reason": "PrometheusDataPersistenceNotConfigured",
    "status": "False",
    "type": "Degraded"
  }
]
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-05-053337
How reproducible:
always
Steps to Reproduce:
1. no PVs for prometheus, check the monitoring operator status 2. 3.
Actual results:
The doc still links to 4.8.
Expected results:
links to the latest doc
Additional info:
slack thread: https://coreos.slack.com/archives/G79AW9Q7R/p1665283462123389
Description of problem:
When trying to enable Hardware Backed Management Ports (e.g. Virtual Functions) on BF2 in NIC mode, or on any other MLX NICs (CX-6, CX-5), by setting the node_mgmt_port_netdev_flags flags to a VF in the CNO, the OVN-K node will crash.
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Always
Steps to Reproduce:
Start by enabling OvS HWOL and setting sriovnetworknodepolicy: https://docs.openshift.com/container-platform/4.11/networking/hardware_networks/configuring-hardware-offloading.html

1. Scale down CNO:
   oc scale --replicas=0 deploy/network-operator -n openshift-network-operator
2. Make changes to OVN-K node:
   oc edit daemonsets ovnkube-node -n openshift-ovn-kubernetes
   a. Find "node_mgmt_port_netdev_flags=" and replace it with something like this:
      node_mgmt_port_netdev_flags=
      if [[ ${K8S_NODE} != *"master"* ]]; then
        node_mgmt_port_netdev_flags="--ovnkube-node-mgmt-port-netdev=ens1f0v0"
      fi
   b. Additionally you have to add "node_mgmt_port_netdev_flags" to the "exec /usr/bin/ovnkube --init-node "${K8S_NODE}"" call in the same script, since this is missing.
3. Save the edit.
4. Observe OVN-K node on baremetal worker nodes.
Actual results:
I0822 14:21:56.250285  496356 ovs.go:204] Exec(3): stderr: ""
I0822 14:21:56.250290  496356 node.go:310] Detected support for port binding with external IDs
I0822 14:21:56.250516  496356 management-port-dpu.go:181] Setup management port dpu host: ens1f0v0
F0822 14:21:56.250568  496356 ovnkube.go:133] failed to set management port name. file exists

Workaround is to go to the node and run this command:
sudo ovs-vsctl del-port br-int ovn-k8s-mp0
Expected results:
There should not be any errors when changing node_mgmt_port_netdev_flags to a valid value.
Additional info:
Reported here: https://github.com/ovn-org/ovn-kubernetes/pull/3160 Discussed briefly here: https://issues.redhat.com/browse/OCPBUGS-4098 Fixed Upstream here: https://github.com/ovn-org/ovn-kubernetes/pull/3251
Description of problem:
The current KafkaSink description in ODC is `Kafka Sink is Addressable, it receives events and send them to a Kafka topic.` and this should be `A KafkaSink takes a CloudEvent, and sends it to an Apache Kafka Topic. Events can be specified in either Structured or Binary mode.` as provided by the Serverless team.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install the Serverless operator.
2. Create a CR for KnativeKafka in the knative-eventing ns.
3. Go to dev perspective -> add -> event sink.
4. Check the description of Kafka sink.
Actual results:
Expected results:
Update the description to the one provided by the Serverless team.
Additional info:
Description of problem:
The cloud-network-config-controller pod crashloops in proxy deployments as it tries to reach the OpenStack Keystone API directly (not through the proxy) and there is no connectivity.

NAMESPACE                                    NAME                                              READY   STATUS             RESTARTS          AGE
openshift-cloud-network-config-controller   cloud-network-config-controller-c4867b748-vlq9h   0/1     CrashLoopBackOff   158 (2m10s ago)   13h

$ oc -n openshift-cloud-network-config-controller logs -p cloud-network-config-controller-c4867b748-vlq9h
W0927 05:48:18.678947       1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0927 05:48:18.680269       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
I0927 05:48:26.754377       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0927 05:48:26.755413       1 openstack.go:121] Custom CA bundle found at location '/kube-cloud-config/ca-bundle.pem' - reading certificate information
F0927 05:48:28.233519       1 main.go:101] Error building cloud provider client, err: Get "https://10.46.44.10:13000/": dial tcp 10.46.44.10:13000: connect: no route to host
goroutine 51 [running]:
k8s.io/klog/v2.stacks(0x1)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x37696c0, 0x3, 0x0, 0xc000636000, 0x1, {0x2cbcbd8?, 0x1?}, 0xc000438400?, 0x0)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x37696c0, 0x237798a?, 0x0, {0x0, 0x0}, 0x7fff81041af7?, {0x23a20d0, 0x2d}, {0xc00052c050, 0x1, ...})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/klog/v2/klog.go:1516
main.main.func1({0x26e5638, 0xc00016c040})
	/go/src/github.com/openshift/cloud-network-config-controller/cmd/cloud-network-config-controller/main.go:101 +0x26d
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:211 +0x11b

goroutine 1 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00052bb60?, {0x26cee20, 0xc000581740}, 0x1, 0xc00052bb60)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x135
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00016c080?, 0x60db88400, 0x0, 0x20?, 0x7fea470ec108?)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew(0xc0000a8120, {0x26e5638?, 0xc00016c040?})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:268 +0xd0
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0000a8120, {0x26e5638, 0xc00025fcc0})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:212 +0x12f
k8s.io/client-go/tools/leaderelection.RunOrDie({0x26e5638, 0xc00025fcc0}, {{0x26e7430, 0xc00062afa0}, 0x1fe5d61a00, 0x18e9b26e00, 0x60db88400, {0xc00065e630, 0xc000634810, 0x0}, ...})
	/go/src/github.com/openshift/cloud-network-config-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:226 +0x94
main.main()
	/go/src/github.com/openshift/cloud-network-config-controller/cmd/cloud-network-config-controller/main.go:86 +0x450
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-26-050728
How reproducible:
Always
Steps to Reproduce:
1. Install OCP with proxy
Actual results:
Bootstrap failure and pod crashloop
Expected results:
Successful installation
Additional info:
Please find the must-gather here.
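The likely fix direction is to route the controller's cloud client through the cluster-wide proxy instead of dialing keystone directly. A minimal sketch, assuming the OpenStack client can be handed a custom *http.Client (names here are illustrative, not the controller's actual code):

package cloudclient

import (
	"crypto/tls"
	"crypto/x509"
	"net/http"
)

// newProxyAwareClient builds an HTTP client that honors the cluster-wide proxy
// via the standard HTTP_PROXY/HTTPS_PROXY/NO_PROXY environment variables and
// trusts an optional custom CA bundle, so requests to the keystone endpoint go
// through the proxy instead of being dialed directly.
func newProxyAwareClient(caPEM []byte) *http.Client {
	pool, err := x509.SystemCertPool()
	if err != nil || pool == nil {
		pool = x509.NewCertPool()
	}
	if len(caPEM) > 0 {
		pool.AppendCertsFromPEM(caPEM)
	}
	return &http.Client{
		Transport: &http.Transport{
			Proxy:           http.ProxyFromEnvironment, // the piece missing in the crashing pod
			TLSClientConfig: &tls.Config{RootCAs: pool},
		},
	}
}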
With the CSISnapshot capability disabled, the Azure Disk CSI Driver Operator goes Degraded.
The reason is that cluster-csi-snapshot-controller-operator does not create the VolumeSnapshotClass CRD, which the driver operator expects to exist.
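A plausible direction for the fix is to make the driver operator tolerate the missing CRD instead of reporting Degraded. A minimal sketch, assuming access to an apiextensions clientset (the function name is illustrative):

package snapshotcheck

import (
	"context"

	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// volumeSnapshotClassCRDExists reports whether the VolumeSnapshotClass CRD is
// installed. With the CSISnapshot capability disabled the CRD is absent, and
// the operator should skip snapshot-class reconciliation instead of going
// Degraded.
func volumeSnapshotClassCRDExists(ctx context.Context, c apiextclient.Interface) (bool, error) {
	_, err := c.ApiextensionsV1().CustomResourceDefinitions().Get(
		ctx, "volumesnapshotclasses.snapshot.storage.k8s.io", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	return err == nil, err
}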
This is a clone of issue OCPBUGS-8381. The following is the description of the original issue:
—
Description of problem:
On a hypershift cluster that has public certs for OAuth configured, the console reports a x509 certificate error when attempting to display a token
Version-Release number of selected component (if applicable):
4.12.z
How reproducible:
always
Steps to Reproduce:
1. Create a hosted cluster configured with a letsencrypt certificate for the oauth endpoint. 2. Go to the console of the hosted cluster. Click on the user icon and get token.
Actual results:
The console displays an oauth cert error
Expected results:
The token displays
Additional info:
The hcco reconciles the oauth cert into the console namespace. However, it is only reconciling the self-signed one and not the one that was configured through .spec.configuration.apiserver of the hostedcluster. It needs to detect the actual cert used for oauth and send that one.
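A sketch of the detection logic the HCCO would need, assuming `.spec.configuration.apiserver` on the HostedCluster is the standard `configv1.APIServerSpec` (exact-name matching only, wildcard handling omitted; the function name is illustrative):

package oauthcert

import (
	configv1 "github.com/openshift/api/config/v1"
)

// servingCertSecretFor returns the name of the user-configured serving
// certificate secret whose named certificates cover the OAuth hostname,
// falling back to the self-signed default when none is configured.
func servingCertSecretFor(apiserver *configv1.APIServerSpec, oauthHost, selfSignedSecret string) string {
	if apiserver != nil {
		for _, nc := range apiserver.ServingCerts.NamedCertificates {
			for _, name := range nc.Names {
				if name == oauthHost {
					return nc.ServingCertificate.Name
				}
			}
		}
	}
	return selfSignedSecret
}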
Description of problem:
The storageclass "thin-csi" is created by vsphere-CSI-Driver-Operator, after deleting it manually, it should be re-created immediately.
Version-Release number of selected component (if applicable):
4.11.4
How reproducible:
Always
Steps to Reproduce:
1. Check storageclass in running cluster, thin-csi is present:
$ oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
thin (default) kubernetes.io/vsphere-volume Delete Immediate false 41m
thin-csi csi.vsphere.vmware.com Delete WaitForFirstConsumer true 38m
2. Delete thin-csi storageclass:
$ oc delete sc thin-csi
storageclass.storage.k8s.io "thin-csi" deleted
3. Check storageclass again, thin-csi is not present:
$ oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
thin (default) kubernetes.io/vsphere-volume Delete Immediate false 50m
4. Check vmware-vsphere-csi-driver-operator log:
......
I0909 03:47:42.172866 1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1662695014\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1662695014\" (2022-09-09 02:43:34 +0000 UTC to 2023-09-09 02:43:34 +0000 UTC (now=2022-09-09 03:47:42.172853123 +0000 UTC))"
I0909 03:49:38.294962 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0909 03:49:38.295468 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
I0909 03:49:38.295765 1 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
5. The StorageClass creation appears only once in the vmware-vsphere-csi-driver-operator log:
$ oc -n openshift-cluster-csi-drivers logs vmware-vsphere-csi-driver-operator-7cc6d44b5c-c8czw | grep -i "storageclass"
I0909 03:46:31.865926 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"9e0c3e2d-d403-40a1-bf69-191d7aec202b", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'StorageClassCreated' Created StorageClass.storage.k8s.io/thin-csi because it was missing
Actual results:
The storageclass "thin-csi" could not be re-created after deleting
Expected results:
The storageclass "thin-csi" should be re-created after deleting
Additional info:
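The fix presumably requires the operator to treat the StorageClass as a continuously reconciled resource rather than create-once. A minimal sketch of such a self-healing sync, assuming a plain client-go clientset (names illustrative):

package scensure

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureStorageClass re-creates the operator-managed StorageClass when it is
// missing. Calling this on every sync iteration (or from a StorageClass
// informer's delete handler) makes a manual `oc delete sc` self-healing
// instead of permanent.
func ensureStorageClass(ctx context.Context, c kubernetes.Interface, desired *storagev1.StorageClass) error {
	_, err := c.StorageV1().StorageClasses().Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = c.StorageV1().StorageClasses().Create(ctx, desired, metav1.CreateOptions{})
	}
	return err
}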
Description of problem:
When the log line number is too large, it overlaps with the cut-off line in the log viewer.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-15-150248
How reproducible:
Always
Steps to Reproduce:
1. Go to a pod log page with lots of logs, such as a pod in the openshift-cluster-version namespace. Check the log line numbers.
2.
3.
Actual results:
1. When the line number is too large, it overlaps with the cut-off line.
Expected results:
1. There should be no overlap in the logs
Additional info:
This is a clone of issue OCPBUGS-3440. The following is the description of the original issue:
—
Description of problem:
https://github.com/openshift/cluster-authentication-operator/pull/587 addresses an issue in which the auth operator goes degraded when the console capability is not enabled. The result is that the console publicAssetURL is not configured when the console is disabled. However, if the console capability is later enabled on the cluster, there is no logic in place to ensure the auth operator detects this and performs the configuration. Manually restarting the auth operator will address this, but we should have a solution that handles it automatically.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster w/o the console cap
2. Inspect the auth configmap, see that assetPublicURL is empty
3. Enable the console capability, wait for console to start up
4. Inspect the auth configmap and see it is still empty
Actual results:
assetPublicURL does not get populated
Expected results:
assetPublicURL is populated once the console is enabled
Additional info:
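A sketch of the capability check the operator could re-evaluate from a ClusterVersion informer, so it reacts when the console capability is enabled after install (names illustrative):

package capwatch

import (
	configv1 "github.com/openshift/api/config/v1"
)

// consoleCapabilityEnabled reports whether the Console capability is in the
// cluster's enabled set. Re-checking this on ClusterVersion changes lets the
// auth operator notice the capability being turned on post-install and then
// populate assetPublicURL without a manual restart.
func consoleCapabilityEnabled(cv *configv1.ClusterVersion) bool {
	for _, c := range cv.Status.Capabilities.EnabledCapabilities {
		if c == configv1.ClusterVersionCapabilityConsole {
			return true
		}
	}
	return false
}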
Description of problem:
We need to include the `openshift_apps_deploymentconfigs_strategy_total` metrics to the IO archive file.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a cluster
2. Download the IO archive
3. Check the file `config/metrics`
4. You must find `openshift_apps_deploymentconfigs_strategy_total` inside it
Actual results:
Expected results:
You should see `openshift_apps_deploymentconfigs_strategy_total` in the `config/metrics` file.
Additional info:
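The insights-operator builds `config/metrics` by federating a fixed list of series from Prometheus, so conceptually the change is one more match expression. A sketch under that assumption (the variable name is illustrative, not the operator's actual identifier):

package metricsgather

// gatheredSeries is an illustrative stand-in for the operator's list of
// federate match[] expressions; the fix adds the deployment-config strategy
// counter so it lands in config/metrics inside the IO archive.
var gatheredSeries = []string{
	// ... existing series kept as-is ...
	`{__name__="openshift_apps_deploymentconfigs_strategy_total"}`,
}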
This is a placeholder for backporting fixes related to checking the serving field to 4.12.z. This code is already in 4.14 and 4.13: https://github.com/openshift/ovn-kubernetes/commit/f6ef43368cc79b85bf0de535ff3b79a5856ab481
Description of problem:
The SQL-based index image created by old opm fails to run on 4.12 even after the `privileged` pod security profile is added to the namespace.
MacBook-Pro:~ jianzhang$ oc get pods
NAME READY STATUS RESTARTS AGE
jian-operators-4g5ln 0/1 CrashLoopBackOff 1 (2s ago) 11s
MacBook-Pro:~ jianzhang$ oc logs jian-operators-4g5ln
Error: open /etc/nsswitch.conf: permission denied
PS: the SQL-based index created by the new opm version doesn't have this issue.
opm version
Version: version.Version{OpmVersion:"e41024eb3", GitCommit:"e41024eb37c721bc43e8b3df226dd30c0589aee7", BuildDate:"2022-08-16T01:50:17Z", GoOs:"darwin", GoArch:"amd64"}
Version-Release number of selected component (if applicable):
OCP 4.12
MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-2022-08-15-150248 True False 3h25m Cluster version is 4.12.0-0.nightly-2022-08-15-150248
How reproducible:
always
Steps to Reproduce:
1. Deploy OCP 4.12
2. Deploy a CatalogSource in the `openshift-marketplace` namespace.
MacBook-Pro:~ jianzhang$ oc get ns openshift-marketplace -o yaml apiVersion: v1 kind: Namespace metadata: annotations: capability.openshift.io/name: marketplace include.release.openshift.io/ibm-cloud-managed: "true" include.release.openshift.io/self-managed-high-availability: "true" include.release.openshift.io/single-node-developer: "true" openshift.io/node-selector: "" openshift.io/sa.scc.mcs: s0:c16,c10 openshift.io/sa.scc.supplemental-groups: 1000260000/10000 openshift.io/sa.scc.uid-range: 1000260000/10000 workload.openshift.io/allowed: management creationTimestamp: "2022-08-15T23:15:27Z" labels: kubernetes.io/metadata.name: openshift-marketplace olm.operatorgroup.uid/1b776321-2714-4c1f-95ba-2ddff49c4efe: "" openshift.io/cluster-monitoring: "true" pod-security.kubernetes.io/audit: baseline pod-security.kubernetes.io/enforce: baseline pod-security.kubernetes.io/warn: baseline name: openshift-marketplace ownerReferences: - apiVersion: config.openshift.io/v1 kind: ClusterVersion name: version uid: cd81594b-4f6c-46d6-9369-75deef542ec8 resourceVersion: "8617" uid: 1c35352e-3636-4f2b-a3b1-c84ebc6681e0 spec: finalizers: - kubernetes status: phase: Active
3. Check the CatalogSource pod status: crashed.
MacBook-Pro:~ jianzhang$ oc get catalogsource -n openshift-marketplace jian-operators -o yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: creationTimestamp: "2022-08-16T02:24:20Z" generation: 1 name: jian-operators namespace: openshift-marketplace resourceVersion: "106145" uid: 6a75ecc9-7b88-4411-bcf5-e34618f9b3cd spec: displayName: Jian Operators image: quay.io/olmqe/etcd-index:v1 priority: -100 publisher: Jian sourceType: grpc updateStrategy: registryPoll: interval: 10m0s status: connectionState: address: jian-operators.openshift-marketplace.svc:50051 lastConnect: "2022-08-16T03:12:28Z" lastObservedState: TRANSIENT_FAILURE latestImageRegistryPoll: "2022-08-16T02:34:21Z" registryService: createdAt: "2022-08-16T02:24:20Z" port: "50051" protocol: grpc serviceName: jian-operators serviceNamespace: openshift-marketplace MacBook-Pro:~ jianzhang$ oc get pods -n openshift-marketplace NAME READY STATUS RESTARTS AGE 28bb83ea022e9728d25570ab0adbe09a31d6a0a606917488e0ddb00f925mnfw 0/1 Completed 0 3h23m 7049ea48beb27a712fa506b76ad672be201ce5d3a6a93d627a0091e0fesvdlj 0/1 Completed 0 3h23m certified-operators-ftt2n 1/1 Running 0 3h49m community-operators-27dx9 1/1 Running 0 3h49m jian-operators-5zq7d 0/1 CrashLoopBackOff 12 (71s ago) 38m jian-operators-gpg4v 0/1 CrashLoopBackOff 14 (57s ago) 48m marketplace-operator-9c8496b58-2jfmv 1/1 Running 0 3h56m qe-app-registry-rqrrv 1/1 Running 0 141m redhat-marketplace-s6zrj 1/1 Running 0 3h49m redhat-operators-54cqr 1/1 Running 0 3h49m MacBook-Pro:~ jianzhang$ oc -n openshift-marketplace logs jian-operators-gpg4v Error: open /etc/nsswitch.conf: permission denied Usage: opm registry serve [flags] Flags: -d, --database string relative path to sqlite db (default "bundles.db") --debug enable debug logging -h, --help help for serve -p, --port string port number to serve on (default "50051") --skip-migrate do not attempt to migrate to the latest db revision when starting -t, --termination-log string path to a container termination log file (default "/dev/termination-log") --timeout-seconds string Timeout in seconds. This flag will be removed later. (default "infinite") Global Flags: --skip-tls skip TLS certificate verification for container image registries while pulling bundles or index
4. Create a namespace with the `privileged` permission.
MacBook-Pro:~ jianzhang$ oc get ns debug -o yaml apiVersion: v1 kind: Namespace metadata: annotations: openshift.io/sa.scc.mcs: s0:c30,c10 openshift.io/sa.scc.supplemental-groups: 1000890000/10000 openshift.io/sa.scc.uid-range: 1000890000/10000 creationTimestamp: "2022-08-16T02:46:41Z" labels: kubernetes.io/metadata.name: debug pod-security.kubernetes.io/audit: privileged pod-security.kubernetes.io/enforce: privileged pod-security.kubernetes.io/warn: privileged security.openshift.io/scc.podSecurityLabelSync: "false" name: debug resourceVersion: "95718" uid: bdf93839-6c42-4365-a65c-d9c0b9fe0504 spec: finalizers: - kubernetes status: phase: Active
5. Deploy a CatalogSource as in step 2 above. It still crashes.
MacBook-Pro:~ jianzhang$ oc get pods -n debug NAME READY STATUS RESTARTS AGE jian-operators-4g5ln 0/1 CrashLoopBackOff 10 (114s ago) 28m jian-operators-wn766 0/1 CrashLoopBackOff 8 (2m25s ago) 18m MacBook-Pro:~ jianzhang$ oc -n debug logs jian-operators-wn766 Error: open /etc/nsswitch.conf: permission denied Usage: opm registry serve [flags] Flags: -d, --database string relative path to sqlite db (default "bundles.db") --debug enable debug logging -h, --help help for serve -p, --port string port number to serve on (default "50051") --skip-migrate do not attempt to migrate to the latest db revision when starting -t, --termination-log string path to a container termination log file (default "/dev/termination-log") --timeout-seconds string Timeout in seconds. This flag will be removed later. (default "infinite") Global Flags: --skip-tls skip TLS certificate verification for container image registries while pulling bundles or index
Actual results:
The sql-based index image created by the old opm version cannot be run.
MacBook-Pro:~ jianzhang$ oc -n debug logs jian-operators-wn766
Error: open /etc/nsswitch.conf: permission denied
Expected results:
The old SQL-based index image runs well, or a workaround is available for it.
Additional info:
I tried another old SQL-based image and hit a different permission issue.
MacBook-Pro:~ jianzhang$ oc get catalogsource NAME DISPLAY TYPE PUBLISHER AGE jian-operators Jian Operators grpc Jian 37m xia-operators Xia Operators grpc Xia 101s MacBook-Pro:~ jianzhang$ oc get catalogsource xia-operators -o yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: creationTimestamp: "2022-08-16T03:22:38Z" generation: 1 name: xia-operators namespace: debug resourceVersion: "110629" uid: 8be42e68-43be-4fd4-9b67-c74edc5e6353 spec: displayName: Xia Operators image: quay.io/olmqe/ditto-index:test-xzha-1 priority: -100 publisher: Xia sourceType: grpc updateStrategy: registryPoll: interval: 10m0s status: connectionState: address: xia-operators.debug.svc:50051 lastConnect: "2022-08-16T03:24:18Z" lastObservedState: CONNECTING registryService: createdAt: "2022-08-16T03:22:38Z" port: "50051" protocol: grpc serviceName: xia-operators serviceNamespace: debug MacBook-Pro:~ jianzhang$ oc project Using project "debug" on server "https://api.qe-daily-412-0816.ibmcloud.qe.devcluster.openshift.com:6443". MacBook-Pro:~ jianzhang$ oc get pods NAME READY STATUS RESTARTS AGE jian-operators-4g5ln 0/1 CrashLoopBackOff 11 (3m41s ago) 35m jian-operators-wn766 0/1 CrashLoopBackOff 9 (4m13s ago) 25m xia-operators-6wgjt 0/1 CrashLoopBackOff 1 (8s ago) 13s MacBook-Pro:~ jianzhang$ oc logs xia-operators-6wgjt time="2022-08-16T03:22:43Z" level=warning msg="\x1b[1;33mDEPRECATION NOTICE:\nSqlite-based catalogs and their related subcommands are deprecated. Support for\nthem will be removed in a future release. Please migrate your catalog workflows\nto the new file-based catalog format.\x1b[0m" Error: open ./db-609956243: permission denied Usage: opm registry serve [flags] Flags: -d, --database string relative path to sqlite db (default "bundles.db") --debug enable debug logging
Even though that namespace is `privileged`.
MacBook-Pro:~ jianzhang$ oc get ns debug -o yaml apiVersion: v1 kind: Namespace metadata: annotations: openshift.io/sa.scc.mcs: s0:c30,c10 openshift.io/sa.scc.supplemental-groups: 1000890000/10000 openshift.io/sa.scc.uid-range: 1000890000/10000 creationTimestamp: "2022-08-16T02:46:41Z" labels: kubernetes.io/metadata.name: debug pod-security.kubernetes.io/audit: privileged pod-security.kubernetes.io/enforce: privileged pod-security.kubernetes.io/warn: privileged security.openshift.io/scc.podSecurityLabelSync: "false" name: debug resourceVersion: "95718" uid: bdf93839-6c42-4365-a65c-d9c0b9fe0504 spec: finalizers: - kubernetes status: phase: Active
But both of them work well on a 4.11 cluster, as follows:
MacBook-Pro:~ jianzhang$ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.11.0-0.nightly-2022-08-15-152346 True False 91m Cluster version is 4.11.0-0.nightly-2022-08-15-152346 MacBook-Pro:~ jianzhang$ oc get catalogsource NAME DISPLAY TYPE PUBLISHER AGE certified-operators Certified Operators grpc Red Hat 106m community-operators Community Operators grpc Red Hat 106m jian-operators Jian Operators grpc Jian 48m redhat-marketplace Red Hat Marketplace grpc Red Hat 106m redhat-operators Red Hat Operators grpc Red Hat 106m xia-operators Xia Operators grpc Xia 6s MacBook-Pro:~ jianzhang$ oc get pods NAME READY STATUS RESTARTS AGE certified-operators-fsjc8 1/1 Running 0 107m community-operators-9qvzt 1/1 Running 0 107m jian-operators-n5s8c 1/1 Running 0 48m marketplace-operator-7b777f747-22rwq 1/1 Running 0 109m redhat-marketplace-2mgrl 1/1 Running 0 107m redhat-operators-72q6z 1/1 Running 0 107m xia-operators-ngq86 1/1 Running 0 23s MacBook-Pro:~ jianzhang$ oc get catalogsource jian-operators -o yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: creationTimestamp: "2022-08-16T02:39:52Z" generation: 1 name: jian-operators namespace: openshift-marketplace resourceVersion: "58565" uid: 481a6fbe-00a5-4af5-86f7-d7413c658db3 spec: displayName: Jian Operators image: quay.io/olmqe/etcd-index:v1 priority: -100 publisher: Jian sourceType: grpc updateStrategy: registryPoll: interval: 10m0s status: connectionState: address: jian-operators.openshift-marketplace.svc:50051 lastConnect: "2022-08-16T02:44:45Z" lastObservedState: READY latestImageRegistryPoll: "2022-08-16T03:24:54Z" registryService: createdAt: "2022-08-16T02:39:52Z" port: "50051" protocol: grpc serviceName: jian-operators serviceNamespace: openshift-marketplace MacBook-Pro:~ jianzhang$ oc get catalogsource xia-operators -o yaml apiVersion: operators.coreos.com/v1alpha1 kind: CatalogSource metadata: creationTimestamp: "2022-08-16T03:28:07Z" generation: 1 name: xia-operators namespace: openshift-marketplace resourceVersion: "59886" uid: a270f665-ee0b-49a5-badb-d3394c7a9344 spec: displayName: Xia Operators image: quay.io/olmqe/ditto-index:test-xzha-1 priority: -100 publisher: Xia sourceType: grpc updateStrategy: registryPoll: interval: 10m0s status: connectionState: address: xia-operators.openshift-marketplace.svc:50051 lastConnect: "2022-08-16T03:28:27Z" lastObservedState: READY registryService: createdAt: "2022-08-16T03:28:07Z" port: "50051" protocol: grpc serviceName: xia-operators serviceNamespace: openshift-marketplace MacBook-Pro:~ jianzhang$ oc get ns openshift-marketplace -o yaml apiVersion: v1 kind: Namespace metadata: annotations: capability.openshift.io/name: marketplace include.release.openshift.io/ibm-cloud-managed: "true" include.release.openshift.io/self-managed-high-availability: "true" include.release.openshift.io/single-node-developer: "true" openshift.io/node-selector: "" openshift.io/sa.scc.mcs: s0:c16,c5 openshift.io/sa.scc.supplemental-groups: 1000250000/10000 openshift.io/sa.scc.uid-range: 1000250000/10000 workload.openshift.io/allowed: management creationTimestamp: "2022-08-16T01:38:10Z" labels: kubernetes.io/metadata.name: openshift-marketplace olm.operatorgroup.uid/24dae571-2843-445b-b09f-5a4631cb25ba: "" openshift.io/cluster-monitoring: "true" pod-security.kubernetes.io/audit: baseline pod-security.kubernetes.io/warn: baseline name: openshift-marketplace ownerReferences: - apiVersion: config.openshift.io/v1 kind: ClusterVersion name: version uid: 
470d072e-37d9-4203-bc5a-c675800d593c resourceVersion: "6981" uid: 554a5ceb-8343-46f4-ae69-af36ee45d7fe spec: finalizers: - kubernetes status: phase: Active
Description of problem:
When running `oc adm must-gather`, we do not download controlplanemachineset resources. Must-gather uses `oc adm inspect` under the hood, which looks at cluster operator related resources to identify what it should download. In must-gather, it does this across all namespaces. When gathering across all namespaces, you cannot set a name for the related object, else an error is produced: error: errors occurred while gathering data: skipping gathering controlplanemachinesets.machine.openshift.io/cluster due to error: a resource cannot be retrieved by name across all namespaces. We currently set the name of the CPMS in our related objects, so it is not included in the gather.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. oc adm must-gather
2. Look in the openshift-machine-api folder for a controlplanemachineset
3.
Actual results:
It's not there
Expected results:
It's there
Additional info:
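The fix, then, is to declare the related object without a name so cluster-wide inspection can pick it up. A sketch of what the related-objects entry could look like, assuming the standard `configv1.ObjectReference` used in ClusterOperator status:

package relatedobjects

import (
	configv1 "github.com/openshift/api/config/v1"
)

// cpmsRelatedObject references every ControlPlaneMachineSet in the namespace.
// The Name field is intentionally left empty: named related objects are
// skipped when `oc adm inspect` gathers across all namespaces.
var cpmsRelatedObject = configv1.ObjectReference{
	Group:     "machine.openshift.io",
	Resource:  "controlplanemachinesets",
	Namespace: "openshift-machine-api",
}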
This is a clone of issue OCPBUGS-6651. The following is the description of the original issue:
—
Description of problem:
When running a hypershift HostedCluster with a publicAndPrivate / private setup behind a proxy, Nodes never go ready. ovn-kubernetes pods fail to run because the init container fails:

[root@ip-10-0-129-223 core]# crictl logs cf142bb9f427d
+ [[ -f /env/ ]]
++ date -Iseconds
2023-01-25T12:18:46+00:00 - checking sbdb
+ echo '2023-01-25T12:18:46+00:00 - checking sbdb'
+ echo 'hosts: dns files'
+ proxypid=15343
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ sbdb_ip=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645
+ retries=0
+ ovn-sbctl --no-leader-only --timeout=5 --db=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645 -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt get-connection
+ exec socat TCP-LISTEN:9645,reuseaddr,fork PROXY:10.0.140.167:ovnkube-sbdb.apps.agl-proxy.hypershift.local:443,proxyport=3128
ovn-sbctl: ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645: database connection failed ()
+ (( retries += 1 ))
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always.
Steps to Reproduce:
1. Create a publicAndPrivate hypershift HostedCluster behind a proxy, e.g.:
➜ hypershift git:(main) ✗ ./bin/hypershift create cluster \
aws --pull-secret ~/www/pull-secret-ci.txt \
--ssh-key ~/.ssh/id_ed25519.pub \
--name agl-proxy \
--aws-creds ~/www/config/aws-osd-hypershift-creds \
--node-pool-replicas=3 \
--region=us-east-1 \
--base-domain=agl.hypershift.devcluster.openshift.com \
--zones=us-east-1a \
--endpoint-access=PublicAndPrivate \
--external-dns-domain=agl-services.hypershift.devcluster.openshift.com --enable-proxy=true
2. Get the kubeconfig for the guest cluster, e.g.:
kubectl get secret -nclusters agl-proxy-admin-kubeconfig -oyaml
3. Get pods in the guest cluster. See the ovnkube-node pods' init container failing with:
[root@ip-10-0-129-223 core]# crictl logs cf142bb9f427d
+ [[ -f /env/ ]]
++ date -Iseconds
2023-01-25T12:18:46+00:00 - checking sbdb
+ echo '2023-01-25T12:18:46+00:00 - checking sbdb'
+ echo 'hosts: dns files'
+ proxypid=15343
+ ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
+ sbdb_ip=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645
+ retries=0
+ ovn-sbctl --no-leader-only --timeout=5 --db=ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645 -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt get-connection
+ exec socat TCP-LISTEN:9645,reuseaddr,fork PROXY:10.0.140.167:ovnkube-sbdb.apps.agl-proxy.hypershift.local:443,proxyport=3128
ovn-sbctl: ssl:ovnkube-sbdb.apps.agl-proxy.hypershift.local:9645: database connection failed ()
+ (( retries += 1 ))
To create a bastion and SSH into the Nodes, see https://hypershift-docs.netlify.app/how-to/debug-nodes/
Actual results:
Nodes unready
Expected results:
Nodes go ready
Additional info:
Description of problem:
co/storage is not available because the CSI driver does not have the proxy setting on IBM Cloud.
Version-Release number of selected component (if applicable):
4.12.0-0.ci-2022-10-13-233744
How reproducible:
Always
Steps to Reproduce:
1. Install ocp cluster on ibm disconnected env with http proxy. Template: private-templates/functionality-testing/aos-4_12/ipi-on-ibmcloud/versioned-installer-customer_vpc-http_proxy
2. Check co/storage:
oc get co/storage
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
storage 4.12.0-0.ci-2022-10-13-233744 False True False 6h55m IBMVPCBlockCSIDriverOperatorCRAvailable: IBMBlockDriverControllerServiceControllerAvailable: Waiting for Deployment...
3. oc get pods
NAME READY STATUS RESTARTS AGE
ibm-vpc-block-csi-controller-6c4bfc9fc-6dmz7 4/5 CrashLoopBackOff 83 (113s ago) 6h55m
ibm-vpc-block-csi-driver-operator-7bd6fb5cdc-rktk2 1/1 Running 1 (6h44m ago) 6h55m
ibm-vpc-block-csi-node-8s6dj 0/3 Init:0/1 77 (5m34s ago) 6h52m
ibm-vpc-block-csi-node-9msld 0/3 Init:Error 76 (5m49s ago) 6h47m
ibm-vpc-block-csi-node-fgs76 0/3 Init:CrashLoopBackOff 76 (5m ago) 6h52m
ibm-vpc-block-csi-node-jd9fl 0/3 Init:CrashLoopBackOff 75 (4m16s ago) 6h47m
ibm-vpc-block-csi-node-qkjxs 0/3 Init:CrashLoopBackOff 77 (2m53s ago) 6h52m
ibm-vpc-block-csi-node-xbzm8 0/3 Init:0/1 76 (5m13s ago) 6h47m
4. oc -n openshift-cluster-csi-drivers logs -c vpc-node-label-updater ibm-vpc-block-csi-node-xbzm8
{"level":"info","timestamp":"2022-10-14T09:18:32.436Z","caller":"nodeupdater/utils.go:57","msg":"Fetching secret configuration.","watcher-name":"vpc-node-label-updater"}
{"level":"info","timestamp":"2022-10-14T09:18:32.436Z","caller":"nodeupdater/utils.go:158","msg":"parsing conf file","watcher-name":"vpc-node-label-updater","confpath":"/etc/storage_ibmc/slclient.toml"}
{"level":"error","timestamp":"2022-10-14T09:19:02.437Z","caller":"nodeupdater/utils.go:96","msg":"Failed to Get IAM access token","watcher-name":"vpc-node-label-updater","error":"Post \"https://iam.cloud.ibm.com/oidc/token\": dial tcp 23.203.93.6:443: i/o timeout"}
{"level":"fatal","timestamp":"2022-10-14T09:19:02.437Z","caller":"cmd/main.go:140","msg":"Failed to read secret configuration from storage secret present in the cluster ","watcher-name":"vpc-node-label-updater","error":"Post \"https://iam.cloud.ibm.com/oidc/token\": dial tcp 23.203.93.6:443: i/o timeout"}
5. oc -n openshift-cluster-csi-drivers describe pod ibm-vpc-block-csi-node-xbzm8
Environment:
ADDRESS: /csi/csi.sock
DRIVER_REGISTRATION_SOCK: /var/lib/kubelet/plugins/vpc.block.csi.ibm.io/csi.sock
KUBE_NODE_NAME: (v1:spec.nodeName)

Actual results:
Expected results:
Additional info:
Description of problem:
Using a daemonset causes failures during draining: leases are not gracefully released and instead age out while pods are killed, after potentially losing network access due to daemonset pods not being terminated. As pointed out in https://github.com/openshift/origin/pull/27394#discussion_r964002900 this should be fixed by moving to a deployment, and it is also tracked here: https://issues.redhat.com/browse/BUILD-495
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The current version of openshift/cluster-dns-operator vendors Kubernetes 1.24 packages. OpenShift 4.12 is based on Kubernetes 1.25.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/cluster-dns-operator/blob/release-4.12/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.24.0.
Expected results:
Kubernetes packages are at version v0.25.0 or later.
Additional info:
Using old Kubernetes API and client packages brings risk of API compatibility issues.
Description of problem:
`console.openshift.io/use-i18n: false` in the v1alpha API is converted to "" in the v1 API, which is not a valid value for the enum type declared in the code.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-25-071630
How reproducible:
Always
Steps to Reproduce:
1. Load a dynamic plugin with the v1alpha API console.openshift.io/use-i18n set to 'false'
2. In the v1 API the loadType is set to the empty string ({"spec":{"i18n":{"loadType":""}}}), which is not a valid value defined here: https://github.com/jhadvig/api/blob/22d69793277ffeb618d642724515f249262959a5/console/v1/types_console_plugin.go#L46 https://github.com/openshift/api/pull/1186/files#
Actual results:
{"spec":{"i18n":{"loadType":""}}}
Expected results:
{"spec":{"i18n":{"loadType":"Lazy"}}}
Additional info:
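A sketch of the defaulting the v1alpha1 -> v1 conversion needs so the enum never ends up as the empty string. The "false" -> "Lazy" mapping follows the expected output above; the "true" branch and the function name are illustrative assumptions:

package pluginconvert

// loadTypeFromUseI18n maps the v1alpha1 console.openshift.io/use-i18n value
// onto the v1 i18n loadType enum. Returning an explicit value for "false"
// avoids emitting the invalid empty string seen in this bug.
func loadTypeFromUseI18n(useI18n string) string {
	if useI18n == "false" {
		return "Lazy" // per the expected results above: load on demand, not preloaded
	}
	return "Preload" // assumed mapping for "true"; the bug only pins down the "false" case
}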
Description of problem:
Network policy code has some problems, most of them races, so they can be difficult to reproduce and verify. Here is the list:
1. All kinds of add/delete port to/from default deny port group failures. Possible symptoms:
- port should've been added to default deny port group, but wasn't: connections that should've been dropped are allowed
- port should've been deleted from default deny port group, but wasn't: connections that should be allowed are dropped
- db ops failures when an attempt to add/delete port to/from default deny port group fails, e.g. because this operation was already done
2. Default deny port group was overwritten when 2 network policies are created in a namespace at the same time. Can lead to ports not being added to the default deny port group => denied connections will be allowed.
3. Handle error when getting local pod from the cache fails. Possible symptoms:
- "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy" log message
- pod is not added to netpol port groups, network policy is not applied
4. Creating deleted namespace via ensureNamespaceLocked. Symptoms:
- namespace was deleted, but address set is present in the db
5. Policy acl loglevel update wasn't applied. Possible symptoms:
- netpol acl log level isn't set/updated to namespace loglevel
6. Netpol cleanup failures. Symptoms:
- network policy failed to be deleted, something is still left in the db, with error messages like "failed to destroy network policy" and "Rollback of default port groups and acls for policy: %s/%s failed, Unable to ensure namespace for network policy"
7. Concurrent write to sets.String; this will panic, you won't miss it.
8. Retry for network policy handler after network policy was deleted; you should see failures saying that some network policy related object is nil or doesn't exist, e.g. "peer AddressSet is nil, cannot add <object>"
9. Host network and completed pods selected by network policy can produce error logs, no real harm: "Failed to get LSP for pod <namespace>/<name> for networkPolicy %s refetching err"
10. Namespace pod handlers are never stopped; can affect memory usage and look like a memory leak.
11. Add local pod failure, since netpol port group is not committed to db yet; the error looks like "Failed to create *factory.localPodSelector <name>, error: object not found"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Example 1
1. Create a network policy with an [in/e]gress selector that applies to a namespace labeled project: myproject

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          project: myproject

2. Use oc apply to delete the network policy and create a pod in a project: myproject namespace at the same time
3. Check ovnkube-master logs for "peer AddressSet is nil, cannot add peer pod(s)"; this should retry with the same error 15 times
4. This may not work on the first try, since we need to hit a specific order of network policy delete and pod add handling
5. With the new version no error messages should be present

Example 2
1. Create a network policy that applies to namespace test

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: test
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:

2. Create a host network pod in namespace test
3. Check for 15 logs saying "Failed to get LSP for pod %s/%s for networkPolicy %s refetching err: "
4. Check the final log "Failed to get LSP after multiple retries for pod %s/%s for networkPolicy"
5. With the new version no error message should be present

All the other cases are difficult to reproduce; running the standard network policy tests and making sure everything works is a good verification.
Actual results:
Expected results:
Additional info:
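Several of the items above are classic shared-state races; item 7, for example, is an unsynchronized map write. A generic sketch of the locking pattern that fixes that class of bug (illustrative, not the actual ovn-kubernetes code):

package syncset

import (
	"sync"

	"k8s.io/apimachinery/pkg/util/sets"
)

// lockedSet wraps sets.String, which is not safe for concurrent use; every
// access from multiple handler goroutines must hold the mutex, otherwise a
// concurrent map write panics (item 7 above).
type lockedSet struct {
	mu  sync.Mutex
	set sets.String
}

func newLockedSet() *lockedSet {
	return &lockedSet{set: sets.NewString()}
}

func (s *lockedSet) Insert(items ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.set.Insert(items...)
}

func (s *lockedSet) Has(item string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.set.Has(item)
}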
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Go to the detail page of some Deployments with PDB connected to it
2. Click Edit PDB from the kebab menu
3. Inspect the second input box under the `Availability requirement `
Actual results:
The name and aria-label attributes always show minAvailable
Expected results:
They should be consistent with the first input box
Additional info:
This is a clone of issue OCPBUGS-4700. The following is the description of the original issue:
—
Description of problem:
In at least 4.12.0-rc.0, a user with read-only access to ClusterVersion can see an "Update blocked" pop-up talking about "...alert above the visualization...". It is referencing a banner about "This cluster should not be updated to the next minor version...", but that banner is not displayed because hasPermissionsToUpdate is false, so canPerformUpgrade is false.
Version-Release number of selected component (if applicable):
4.12.0-rc.0. Likely more. I haven't traced it out.
How reproducible:
Always.
Steps to Reproduce:
1. Install 4.12.0-rc.0
2. Create a user with cluster-wide read-only permissions. For me, it's via binding to a sudoer ClusterRole. I'm not sure where that ClusterRole comes from, but it's:
$ oc get -o yaml clusterrole sudoer apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: annotations: rbac.authorization.kubernetes.io/autoupdate: "true" creationTimestamp: "2020-05-21T19:39:09Z" name: sudoer resourceVersion: "7715" uid: 28eb2ffa-dccd-47e8-a2d5-6a95e0e8b1e9 rules: - apiGroups: - "" - user.openshift.io resourceNames: - system:admin resources: - systemusers - users verbs: - impersonate - apiGroups: - "" - user.openshift.io resourceNames: - system:masters resources: - groups - systemgroups verbs: - impersonate
3. View /settings/cluster
Actual results:
See the "Update blocked" pop-up talking about "...alert above the visualization...".
Expected results:
Something more internally consistent. E.g. having the referenced banner "...alert above the visualization..." show up, or not having the "Update blocked" pop-up reference the non-existent banner.
This is a clone of issue OCPBUGS-9968. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-8692. The following is the description of the original issue:
—
Description of problem:
In the hypershift context: operands managed by operators running in the hosted control plane namespace in the management cluster do not honour affinity opinions:
https://hypershift-docs.netlify.app/how-to/distribute-hosted-cluster-workloads/
https://github.com/openshift/hypershift/blob/main/support/config/deployment.go#L263-L265
These operands running management side should honour the same affinity, tolerations, node selector and priority rules as the operator. This could be done by looking at the operator deployment itself or at the HCP resource. Affected operands:
- multus-admission-controller
- cloud-network-config-controller
- ovnkube-master
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a hypershift cluster. 2. Check affinity rules and node selector of the operands above. 3.
Actual results:
Operands are missing the affinity rules and node selector
Expected results:
Operands have the same affinity rules and node selector as the operator
Additional info:
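One way the fix could look: copy the scheduling-related fields from the operator's own deployment onto each managed operand. A minimal sketch (not the actual hypershift helper):

package placement

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// copySchedulingDirectives propagates the operator's placement opinions
// (affinity, tolerations, node selector, priority) onto an operand deployment
// it manages, so both land on the same management-cluster nodes.
func copySchedulingDirectives(operator, operand *appsv1.Deployment) {
	src := operator.Spec.Template.Spec
	dst := &operand.Spec.Template.Spec
	dst.Affinity = src.Affinity.DeepCopy() // DeepCopy on a nil *Affinity yields nil
	dst.Tolerations = append([]corev1.Toleration(nil), src.Tolerations...)
	dst.NodeSelector = make(map[string]string, len(src.NodeSelector))
	for k, v := range src.NodeSelector {
		dst.NodeSelector[k] = v
	}
	dst.PriorityClassName = src.PriorityClassName
}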
Description of problem:
InstanceMetadataTags are not supported in the AWS C2S region (us-iso-x)
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. OCP4.11 IPI Installation on AWS C2S regions 2. 3.
Actual results:
Expected results:
Additional info:
Actual Error: "Error launching resource Instance. Unsupported Operation Specifying InstanceMetadataTags is not yet supported" There is a related fix on upstream: resource/aws_instance: Handle regions where instance metadata tags are unsupported https://github.com/hashicorp/terraform-provider-aws/pull/26631
Description of problem:
When solving flakiness of a test in IO tests, we found that there are some issues in the cluster_version_matches condition for the conditional gatherer. First, the character limit should be increased, as 32 characters does not cover every possible release version; some exceed that limit. Furthermore, there is an error in the schema:
there is no `name`; it should be `version`.
How reproducible:
Sometimes
Steps to Reproduce:
1. Spin up a cluster from a PR
2. If the version exceeds 32 characters, we get in the pod logs: 'Could not get version from string: "<"'
Actual results:
'Could not get version from string: "<"'
Expected results:
Metadata should contain "Metadata should contain invalid range error"
Additional info:
However, since there's the possibility for versions to exceed 32 characters, we shouldn't expect an error in this situation. Therefore, there might be more than one issue.
Catastrophic job runs where high numbers of tests fail are common. There are likely many root causes, but let's try to find one. This is a hard task because it's not "this one test failed, figure out why."
Clusters of failures are more common on certain platforms; it may be fruitful to start with the worst.
NURPs that average > 5 openshift-tests or openshift-tests-upgrade failures:
variants | avg
-----------------------------------------------------+------------------------
{azure,amd64,ovn,upgrade,upgrade-micro,single-node} | 124.5294117647058824
{azure,amd64,ovn,upgrade,upgrade-minor,single-node} | 92.9090909090909091
{openstack,amd64,ovn,ha} | 49.2105263157894737
{azure,amd64,sdn,ha,fips} | 25.6666666666666667
{metal-ipi,amd64,ovn,ha} | 24.6000000000000000
{openstack,amd64,ovn,ha,fips} | 23.5000000000000000
{azure,amd64,ovn,ha,hypershift} | 22.6666666666666667
{s390x,sdn,ha} | 22.5454545454545455
{gcp,amd64,ovn,ha} | 21.5714285714285714
{ppc64le,sdn,ha} | 17.9545454545454545
{metal-ipi,amd64,sdn,ha} | 17.6000000000000000
{openstack,amd64,ovn,ha,serial} | 15.3333333333333333
{azure,amd64,ovn,ha} | 15.1627906976744186
{promote} | 15.0000000000000000
{aws,amd64,ovn,ha} | 14.2558139534883721
{metal-ipi,amd64,ovn,upgrade,upgrade-minor,ha} | 13.9375000000000000
{gcp,amd64,ovn,upgrade,upgrade-minor,ha,realtime} | 11.2000000000000000
{azure,amd64,sdn,upgrade,upgrade-minor,ha} | 9.6842105263157895
{never-stable} | 9.0740740740740741
{aws,amd64,ovn,single-node} | 8.8666666666666667
{metal-ipi,amd64,sdn,upgrade,upgrade-micro,ha} | 7.9090909090909091
{azure,amd64,sdn,upgrade,upgrade-micro,ha} | 6.4000000000000000
{aws,amd64,sdn,ha} | 5.7800000000000000
{vsphere-ipi,amd64,ovn,ha} | 5.6458333333333333
{openstack,amd64,ovn,upgrade,upgrade-minor,ha} | 5.6250000000000000
{metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha} | 5.5882352941176471
{aws,amd64,sdn,upgrade,upgrade-micro,ha} | 5.5789473684210526
Here's a sippy link for 4.12 job runs with > 50 failures: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.12/runs?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522test_failures%2522%252C%2522operatorValue%2522%253A%2522%253E%2522%252C%2522value%2522%253A%252250%2522%257D%252C%257B%2522columnField%2522%253A%2522overall_result%2522%252C%2522operatorValue%2522%253A%2522equals%2522%252C%2522value%2522%253A%2522F%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=desc&sortField=timestamp
Description of problem:
In a single-zone (us-east-2a) cluster, modify the controlplanemachineset to use the OnDelete strategy and add another two zones, us-east-2b and us-east-2c, whose backend subnets aren't actually configured. If one master machine is deleted, even though the machine is up to date and not in need of replacement, a new master is created in us-east-2a and the old master is deleted.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-12-152748
How reproducible:
always
Steps to Reproduce:
1. Set up a single-zone cluster
2. Modify the controlplanemachineset to use the OnDelete strategy and add another two zones, us-east-2b and us-east-2c:

$ oc edit controlplanemachineset cluster
  strategy:
    type: OnDelete
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        platform: AWS
        aws:
        - placement:
            availabilityZone: us-east-2a
          subnet:
            type: Filters
            filters:
            - name: tag:Name
              values:
              - zhsunaws991-9n7r7-private-us-east-2a
        - placement:
            availabilityZone: us-east-2b
          subnet:
            type: Filters
            filters:
            - name: tag:Name
              values:
              - zhsunaws991-9n7r7-private-us-east-2b
        - placement:
            availabilityZone: us-east-2c
          subnet:
            type: Filters
            filters:
            - name: tag:Name
              values:
              - zhsunaws991-9n7r7-private-us-east-2c

3. Delete a master machine that is up to date and not in need of replacement
Actual results:
A new master will be created in us-east-2a instead of us-east-2b or us-east-2c, and the old master will be deleted.

$ oc get machine
NAME PHASE TYPE REGION ZONE AGE
zhsunaws991-9n7r7-master-0 Deleting m6i.xlarge us-east-2 us-east-2a 71m
zhsunaws991-9n7r7-master-1 Running m6i.xlarge us-east-2 us-east-2a 71m
zhsunaws991-9n7r7-master-2 Running m6i.xlarge us-east-2 us-east-2a 71m
zhsunaws991-9n7r7-master-9cwsg-0 Running m6i.xlarge us-east-2 us-east-2a 4m43s
zhsunaws991-9n7r7-worker-us-east-2a-cgrcf Running m6i.xlarge us-east-2 us-east-2a 68m
zhsunaws991-9n7r7-worker-us-east-2a-jslhj Running m6i.xlarge us-east-2 us-east-2a 68m
zhsunaws991-9n7r7-worker-us-east-2a-xgh8l Running m6i.xlarge us-east-2 us-east-2a 68m

$ oc get machine
NAME PHASE TYPE REGION ZONE AGE
zhsunaws991-9n7r7-master-1 Running m6i.xlarge us-east-2 us-east-2a 94m
zhsunaws991-9n7r7-master-2 Running m6i.xlarge us-east-2 us-east-2a 94m
zhsunaws991-9n7r7-master-9cwsg-0 Running m6i.xlarge us-east-2 us-east-2a 27m
zhsunaws991-9n7r7-worker-us-east-2a-cgrcf Running m6i.xlarge us-east-2 us-east-2a 91m
zhsunaws991-9n7r7-worker-us-east-2a-jslhj Running m6i.xlarge us-east-2 us-east-2a 91m
zhsunaws991-9n7r7-worker-us-east-2a-xgh8l Running m6i.xlarge us-east-2 us-east-2a 91m
Expected results:
The current machine is up to date and not in need of replacement. The other two failure domains should be reported as needing replacement, and we would then expect those replacements to fail because the backend infrastructure isn't there.
Additional info:
https://issues.redhat.com/browse/OCPCLOUD-1503?focusedCommentId=20945295&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-20945295
Description of problem:
It is a disconnected cluster on AWS. There is an issue configuring Egress IP where the cluster uses STS. Looking into the cloud-network-config-controller pod, it is trying to connect to the global STS service "https://sts.amazonaws.com/" rather than the regional one "https://sts.ap-southeast-1.amazonaws.com".
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a disconnected OCP cluster on AWS.
$ oc get netnamespace | grep egress egress-ip-test 2689387 ["172.16.1.24"]
$ oc get hostsubnet NAME HOST HOST IP SUBNET EGRESS CIDRS EGRESS IPS ip-172-16-1-151.ap-southeast-1.compute.internal ip-172-16-1-151.ap-southeast-1.compute.internal 172.16.1.151 10.130.0.0/23 ip-172-16-1-53.ap-southeast-1.compute.internal ip-172-16-1-53.ap-southeast-1.compute.internal 172.16.1.53 10.131.0.0/23 ["172.16.1.24"] ip-172-16-2-15.ap-southeast-1.compute.internal ip-172-16-2-15.ap-southeast-1.compute.internal 172.16.2.15 10.128.0.0/23 ip-172-16-2-77.ap-southeast-1.compute.internal ip-172-16-2-77.ap-southeast-1.compute.internal 172.16.2.77 10.128.2.0/23 ip-172-16-3-111.ap-southeast-1.compute.internal ip-172-16-3-111.ap-southeast-1.compute.internal 172.16.3.111 10.129.0.0/23 ip-172-16-3-79.ap-southeast-1.compute.internal ip-172-16-3-79.ap-southeast-1.compute.internal 172.16.3.79 10.129.2.0/23
$ oc logs sdn-controller-6m5kb -n openshift-sdn I0922 04:09:53.348615 1 vnids.go:105] Allocated netid 2689387 for namespace "egress-ip-test" E0922 04:24:00.682018 1 egressip.go:254] Ignoring invalid HostSubnet ip-172-16-1-53.ap-southeast-1.compute.internal (host: "ip-172-16-1-53.ap-southeast-1.compute.internal", ip: "172.16.1.53", subnet: "10.131.0.0/23"): related node object "ip-172-16-1-53.ap-southeast-1.compute.internal" has an incomplete annotation "cloud.network.openshift.io/egress-ipconfig", CloudEgressIPConfig: <nil>
$ oc logs cloud-network-config-controller-5c7556db9f-x78bs -n openshift-cloud-network-config-controller E0922 04:26:59.468726 1 controller.go:165] error syncing 'ip-172-16-2-77.ap-southeast-1.compute.internal': error retrieving the private IP configuration for node: ip-172-16-2-77.ap-southeast-1.compute.internal, err: error: cannot list ec2 instance for node: ip-172-16-2-77.ap-southeast-1.compute.internal, err: WebIdentityErr: failed to retrieve credentials caused by: RequestError: send request failed caused by: Post "https://sts.amazonaws.com/": dial tcp 54.239.29.25:443: i/o timeout, requeuing in node workqueue
$ oc get Infrastructure -o yaml apiVersion: v1 items: - apiVersion: config.openshift.io/v1 kind: Infrastructure metadata: creationTimestamp: "2022-09-22T03:28:15Z" generation: 1 name: cluster resourceVersion: "598" uid: 994da301-2a96-43b7-b43b-4b7c18d4b716 spec: cloudConfig: name: "" platformSpec: aws: serviceEndpoints: - name: sts url: https://sts.ap-southeast-1.amazonaws.com - name: ec2 url: https://ec2.ap-southeast-1.amazonaws.com - name: elasticloadbalancing url: https://elasticloadbalancing.ap-southeast-1.amazonaws.com type: AWS status: apiServerInternalURI: https://api-int.openshiftyy.ocpaws.sadiqueonline.com:6443 apiServerURL: https://api.openshiftyy.ocpaws.sadiqueonline.com:6443 controlPlaneTopology: HighlyAvailable etcdDiscoveryDomain: "" infrastructureName: openshiftyy-wfrpf infrastructureTopology: HighlyAvailable platform: AWS platformStatus: aws: region: ap-southeast-1 serviceEndpoints: - name: ec2 url: https://ec2.ap-southeast-1.amazonaws.com - name: elasticloadbalancing url: https://elasticloadbalancing.ap-southeast-1.amazonaws.com - name: sts url: https://sts.ap-southeast-1.amazonaws.com type: AWS kind: List metadata: resourceVersion: ""
$ oc get secret aws-cloud-credentials -n openshift-machine-api -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-machine-api-aws-cloud-credentials web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credential-operator-iam-ro-creds -n openshift-cloud-credential-operator -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cloud-credential-operator-cloud-creden web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret installer-cloud-credentials -n openshift-image-registry -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-image-registry-installer-cloud-credent web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credentials -n openshift-ingress-operator -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-ingress-operator-cloud-credentials web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret cloud-credentials -n openshift-cloud-network-config-controller -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cloud-network-config-controller-cloud- web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token [ec2-user@ip-172-17-1-229 ~]$ oc get secret ebs-cloud-credentials -n openshift-cluster-csi-drivers -o json |jq -r .data.credentials |base64 -d [default] sts_regional_endpoints = regional role_arn = arn:aws:iam::015719942846:role/sputhenp-sts-yy-openshift-cluster-csi-drivers-ebs-cloud-credenti web_identity_token_file = /var/run/secrets/openshift/serviceaccount/token
Actual results:
Egress IP is not configured properly, and cloud-network-config-controller tries to connect to the global STS service.
Expected results:
Egress IP should get configured, and cloud-network-config-controller should connect to the regional STS service instead of the global one.
Additional info:
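With aws-sdk-go v1, the fix amounts to asking the SDK for regional STS resolution when building the session. A minimal sketch, assuming the controller constructs its own session:

package stsregional

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newRegionalSTSSession returns a session whose STS calls resolve in-region
// (e.g. sts.ap-southeast-1.amazonaws.com) instead of the global
// sts.amazonaws.com endpoint, which is unreachable from a disconnected VPC.
func newRegionalSTSSession(region string) (*session.Session, error) {
	return session.NewSession(&aws.Config{
		Region:              aws.String(region),
		STSRegionalEndpoint: endpoints.RegionalSTSEndpoint,
	})
}

This mirrors the `sts_regional_endpoints = regional` setting already present in the credentials secrets above.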
Description of problem:
OVN-Kubernetes master is crashing during upgrade from 4.11.5 to 4.11.6
Version-Release number of selected component (if applicable):
4.11.5 to 4.11.6
cannot clean up egress default deny ACL name: cannot update old NetworkPolicy ACLs for namespace ocm-myuser-1urk47c6ti1n94n1spdvo9902as3klar-sd6: error in transact with ops [{Op:update Table:ACL Row:map[action:drop direction:from-lport external_ids:{GoMap:map[default-deny-policy-type:Egress]} log:false match:inport == @a12995145443578534523_egressDefaultDeny meter:{GoSet:[acl-logging]} name:{GoSet:[ocm-myuser-1urk47c6ti1n94n1spdvo9902as3klar-sd6_egressDefaultDeny]} options:{GoMap:map[apply-after-lb:true]} priority:1000 severity:{GoSet:[info]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {5277db54-dd96-4c4d-bbed-99142cab91e7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:0 Error:constraint violation Details:"ocm-myuser-1urk47c6ti1n94n1spdvo9902as3klar-sd6_egressDefaultDeny" length 65 is greater than maximum allowed length 63 UUID:{GoUUID:} Rows:[]}] and errors
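The failing constraint is OVN's 63-character limit on ACL names, which long namespace names overflow. A sketch of a length-safe name builder that keeps uniqueness by hashing the overflow (illustrative; the actual ovn-kubernetes fix may trim differently):

package aclname

import (
	"crypto/sha256"
	"fmt"
)

// truncateACLName trims a generated ACL name to OVN's 63-character limit,
// replacing the tail with a short hash of the full name so two long
// namespaces cannot collide after truncation.
func truncateACLName(name string) string {
	const maxLen = 63
	if len(name) <= maxLen {
		return name
	}
	sum := sha256.Sum256([]byte(name))
	suffix := fmt.Sprintf("-%x", sum[:4]) // 1 dash + 8 hex chars
	return name[:maxLen-len(suffix)] + suffix
}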
Description of problem:
When providing the openshift-install agent create command with installconfig + agentconfig manifests that contain the InstallConfig Proxy section, the Proxy configuration does not get applied.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Define InstallConfig with Proxy section
2. openshift-install agent create image
3. Boot ISO
4. Check /etc/assisted/manifests for InfraEnv to contain its Proxy section
Actual results:
Missing proxy
Expected results:
Proxy present and matching InstallConfig's
Additional info:
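The fix direction is a straight copy from the InstallConfig into the generated InfraEnv manifest. A sketch, assuming the installer's `types.InstallConfig` and the assisted-service `aiv1beta1.InfraEnv` types; the wiring and function name are illustrative:

package agentproxy

import (
	aiv1beta1 "github.com/openshift/assisted-service/api/v1beta1"
	"github.com/openshift/installer/pkg/types"
)

// applyInstallConfigProxy copies the InstallConfig proxy settings onto the
// generated InfraEnv so the booted agent ISO picks them up.
func applyInstallConfigProxy(ic *types.InstallConfig, infraEnv *aiv1beta1.InfraEnv) {
	if ic.Proxy == nil {
		return
	}
	infraEnv.Spec.Proxy = &aiv1beta1.Proxy{
		HTTPProxy:  ic.Proxy.HTTPProxy,
		HTTPSProxy: ic.Proxy.HTTPSProxy,
		NoProxy:    ic.Proxy.NoProxy,
	}
}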
This is a clone of issue OCPBUGS-5505. The following is the description of the original issue:
—
The upgradeability check in CVO is throttled (essentially cached) for a nondeterministic period of time, the same as the minimal sync period computed at runtime. The period can be up to 4 minutes, determined at CVO start time as 2 minutes * (rand(0..1) + 1). We agreed with Trevor that such throttling is unnecessarily aggressive (the check is not that expensive). It also causes CI flakes, because the matching test only has a 3-minute timeout. Additionally, the non-determinism and longer throttling make the UX worse: actions done in the cluster may have their observable effect delayed.
discovered in 4.10 -> 4.11 upgrade jobs
The test seems to flake ~10% of 4.10->4.11 Azure jobs (sippy). There does not seem to be that much impact on non-Azure jobs though which is a bit weird.
Inspect the CVO log and E2E logs from failing jobs with the provided [^check-cvo.py] helper:
$ ./check-cvo.py cvo.log && echo PASS || echo FAIL
Preferably, inspect CVO logs of clusters that just underwent an upgrade (upgrades make the original problematic behavior more likely to surface):
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-5b6966c474-g4kwk_cluster-version-operator.log && echo PASS || echo FAIL
FAIL: Cache hit at 11:59:55.332339 0:03:13.665006 after check at 11:56:41.667333
FAIL: Cache hit at 12:06:22.663215 0:03:13.664964 after check at 12:03:08.998251
FAIL: Cache hit at 12:12:49.997119 0:03:13.665598 after check at 12:09:36.331521
FAIL: Cache hit at 12:19:17.328510 0:03:13.664906 after check at 12:16:03.663604
FAIL: Cache hit at 12:25:44.662290 0:03:13.666759 after check at 12:22:30.995531
Upgradeability checks: 5
Upgradeability check cache hits: 12
FAIL
Note that the bug is probabilistic, so not all unfixed clusters will exhibit the behavior. My guess of the incidence rate is about 30-40%.
$ ./check-cvo.py openshift-cluster-version_cluster-version-operator-7b8f85d455-mk9fs_cluster-version-operator.log && echo PASS || echo FAIL
Upgradeability checks: 12
Upgradeability check cache hits: 11
PASS
The actual numbers are not relevant, lack of failure is (unless the upgradeability check count is zero, which means the test is not conclusive; the script warns about that).
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b7d4b5bbd-zjqdt_cluster-version-operator.log | grep upgradeable.go ... I1227 06:50:59.023190 1 upgradeable.go:122] Cluster current version=4.10.46 I1227 06:50:59.042735 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:51:14.024345 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:53:23.080768 1 upgradeable.go:42] Upgradeable conditions were recently checked, will try later. I1227 06:56:59.366010 1 upgradeable.go:122] Cluster current version=4.11.0-0.ci-2022-12-26-193640 $ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1607602927633960960/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Kubernetes 1.25 and therefore OpenShift 4.12' Dec 27 06:51:15.319: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ... Dec 27 06:54:15.413: FAIL: Error while waiting for Upgradeable to complain about AdminAckRequired with message "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions.": timed out waiting for the condition
The test passes. Also, the "Upgradeable conditions were recently checked, will try later." messages in CVO logs should never occur after a deterministic, short amount of time (I propose 1 minute) after upgradeability was checked.
I tested the throttling period in https://github.com/openshift/cluster-version-operator/pull/880. With a period of 15m, the test pass rate was 4 of 9. With a period of 1m, the test did not fail at all.
Some context in Slack thread
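The proposed remedy is simple: decouple the upgradeable throttle from the randomized minimal sync period and cap it at a short constant. A sketch (names illustrative, not the CVO's actual code):

package throttle

import "time"

// upgradeableThrottle bounds how long a recent upgradeability check may be
// served from cache. Capping it at one minute keeps the check cheap without
// delaying its observable effect for several minutes.
func upgradeableThrottle(minimumSyncPeriod time.Duration) time.Duration {
	const maxWait = time.Minute
	if minimumSyncPeriod > maxWait {
		return maxWait
	}
	return minimumSyncPeriod
}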
Our CMO e2e tests create several containers besides the standard CMO deployment. These pods currently do not set any security context capabilities, which produces a warning like so:
W0705 08:35:38.590283 15206 warnings.go:70] would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "alertmanager-webhook-e2e-testutil" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "alertmanager-webhook-e2e-testutil" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "alertmanager-webhook-e2e-testutil" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "alertmanager-webhook-e2e-testutil" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
We should be proactive and set security capability constraints. From this run, this seems to impact the following pods/containers:
Both are used more than once.
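A sketch of a helper the e2e utilities could share so every test container satisfies the restricted Pod Security profile (assuming k8s.io/utils/pointer for the bool pointers):

package e2eutil

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/pointer"
)

// restrictedSecurityContext returns a container security context that
// satisfies the "restricted" Pod Security profile, silencing the PodSecurity
// warnings for the e2e helper pods.
func restrictedSecurityContext() *corev1.SecurityContext {
	return &corev1.SecurityContext{
		AllowPrivilegeEscalation: pointer.Bool(false),
		RunAsNonRoot:             pointer.Bool(true),
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
		},
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
	}
}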
This is a clone of issue OCPBUGS-1627. The following is the description of the original issue:
—
Description of problem:
Two issues when setting user-defined folder in failureDomain.
1. The installer gets an error when folder is set to the path of a user-defined folder in failureDomains.
failureDomains setting in install-config.yaml:
failureDomains:
- name: us-east-1
  region: us-east
  zone: us-east-1a
  server: xxx
  topology:
    datacenter: IBMCloud
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-1
    networks:
    - multi-zone-qe-dev-1
    datastore: multi-zone-ds-1
    folder: /IBMCloud/vm/qe-jima
- name: us-east-2
  region: us-east
  zone: us-east-2a
  server: xxx
  topology:
    datacenter: IBMCloud
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
    networks:
    - multi-zone-qe-dev-1
    datastore: multi-zone-ds-2
    folder: /IBMCloud/vm/qe-jima
- name: us-east-3
  region: us-east
  zone: us-east-3a
  server: xxx
  topology:
    datacenter: IBMCloud
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-3
    networks:
    - multi-zone-qe-dev-1
    datastore: workload_share_vcsmdcncworkload3_joYiR
    folder: /IBMCloud/vm/qe-jima
- name: us-west-1
  region: us-west
  zone: us-west-1a
  server: ibmvcenter.vmc-ci.devcluster.openshift.com
  topology:
    datacenter: datacenter-2
    computeCluster: /datacenter-2/host/vcs-mdcnc-workload-4
    networks:
    - multi-zone-qe-dev-1
    datastore: workload_share_vcsmdcncworkload3_joYiR
Error message in terraform after completing ova image import:
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [1m40s elapsed]
DEBUG vsphereprivate_import_ova.import[3]: Creation complete after 1m40s [id=vm-367860]
DEBUG vsphereprivate_import_ova.import[1]: Creation complete after 1m49s [id=vm-367863]
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [1m50s elapsed]
DEBUG vsphereprivate_import_ova.import[2]: Still creating... [1m50s elapsed]
DEBUG vsphereprivate_import_ova.import[2]: Still creating... [2m0s elapsed]
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [2m0s elapsed]
DEBUG vsphereprivate_import_ova.import[2]: Creation complete after 2m2s [id=vm-367862]
DEBUG vsphereprivate_import_ova.import[0]: Still creating... [2m10s elapsed]
DEBUG vsphereprivate_import_ova.import[0]: Creation complete after 2m20s [id=vm-367861]
DEBUG data.vsphere_virtual_machine.template[0]: Reading...
DEBUG data.vsphere_virtual_machine.template[3]: Reading...
DEBUG data.vsphere_virtual_machine.template[1]: Reading...
DEBUG data.vsphere_virtual_machine.template[2]: Reading...
DEBUG data.vsphere_virtual_machine.template[3]: Read complete after 1s [id=42054e33-85d6-e310-7f4f-4c52a73f8338]
DEBUG data.vsphere_virtual_machine.template[1]: Read complete after 2s [id=42053e17-cc74-7c89-f5d1-059c9030ecc7]
DEBUG data.vsphere_virtual_machine.template[2]: Read complete after 2s [id=4205019f-26d8-f9b4-ac0c-2c073fd70b35]
DEBUG data.vsphere_virtual_machine.template[0]: Read complete after 2s [id=4205eaf2-c727-c647-ad44-bd9ad7023c56]
ERROR
ERROR Error: error trying to determine parent targetFolder: folder '/IBMCloud/vm//IBMCloud/vm' not found
ERROR
ERROR   with vsphere_folder.folder["IBMCloud-/IBMCloud/vm/qe-jima"],
ERROR   on main.tf line 61, in resource "vsphere_folder" "folder":
ERROR   61: resource "vsphere_folder" "folder" {
ERROR
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "pre-bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
ERROR
ERROR Error: error trying to determine parent targetFolder: folder '/IBMCloud/vm//IBMCloud/vm' not found
ERROR
ERROR   with vsphere_folder.folder["IBMCloud-/IBMCloud/vm/qe-jima"],
ERROR   on main.tf line 61, in resource "vsphere_folder" "folder":
ERROR   61: resource "vsphere_folder" "folder" {
ERROR
ERROR
2. The installer panics when folder is set to a user-defined folder name in failure domains.
failureDomains setting in install-config.yaml:
failureDomains:
- name: us-east-1
  region: us-east
  zone: us-east-1a
  server: xxx
  topology:
    datacenter: IBMCloud
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-1
    networks:
    - multi-zone-qe-dev-1
    datastore: multi-zone-ds-1
    folder: qe-jima
- name: us-east-2
  region: us-east
  zone: us-east-2a
  server: xxx
  topology:
    datacenter: IBMCloud
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-2
    networks:
    - multi-zone-qe-dev-1
    datastore: multi-zone-ds-2
    folder: qe-jima
- name: us-east-3
  region: us-east
  zone: us-east-3a
  server: xxx
  topology:
    datacenter: IBMCloud
    computeCluster: /IBMCloud/host/vcs-mdcnc-workload-3
    networks:
    - multi-zone-qe-dev-1
    datastore: workload_share_vcsmdcncworkload3_joYiR
    folder: qe-jima
- name: us-west-1
  region: us-west
  zone: us-west-1a
  server: xxx
  topology:
    datacenter: datacenter-2
    computeCluster: /datacenter-2/host/vcs-mdcnc-workload-4
    networks:
    - multi-zone-qe-dev-1
    datastore: workload_share_vcsmdcncworkload3_joYiR
panic error message in installer:
INFO Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/releases/rhcos-4.12/412.86.202208101039-0/x86_64/rhcos-412.86.202208101039-0-vmware.x86_64.ova?sha256='
INFO The file was found in cache: /home/user/.cache/openshift-installer/image_cache/rhcos-412.86.202208101039-0-vmware.x86_64.ova. Reusing...
panic: runtime error: index out of range [1] with length 1
goroutine 1 [running]:
github.com/openshift/installer/pkg/tfvars/vsphere.TFVars({{0xc0013bd068, 0x3, 0x3}, {0xc000b11dd0, 0x12}, {0xc000b11db8, 0x14}, {0xc000b11d28, 0x14}, {0xc000fe8fc0, ...}, ...})
/go/src/github.com/openshift/installer/pkg/tfvars/vsphere/vsphere.go:79 +0x61b
github.com/openshift/installer/pkg/asset/cluster.(*TerraformVariables).Generate(0x1d1ed360, 0x5?)
/go/src/github.com/openshift/installer/pkg/asset/cluster/tfvars.go:847 +0x4798
Based on the explanation of the folder field, it looks like a plain folder name should be acceptable. If a plain folder name is not allowed, the installer needs to validate the folder and update the explain output.
sh-4.4$ ./openshift-install explain installconfig.platform.vsphere.failureDomains.topology.folder
KIND:     InstallConfig
VERSION:  v1

RESOURCE: <string>
  folder is the name or inventory path of the folder in which the virtual machine is created/located.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-20-095559
How reproducible:
always
Steps to Reproduce:
see description
Actual results:
installation fails when a user-defined folder is set
Expected results:
installation succeeds when a user-defined folder is set
Additional info:
Description of problem:
In cluster-ingress-operator's ensureNodePortService, when there is a conflicting Ingress Controller load balancer, it states:

a conflicting load balancer service exists that is not owned by the ingress controller: openshift-ingress/router-loadbalancer

Technically that is the service name, not the ingress controller name; the IC name is openshift-ingress/loadbalancer in this example. So the error message wording is incorrect.
Version-Release number of selected component (if applicable):
4.13 4.12 4.11
How reproducible:
Easy
Steps to Reproduce:
# Create a service that will conflict with a new ingress controller
oc create svc nodeport router-nodeport-test --tcp=80 -n openshift-ingress
DOMAIN=$(oc get ingresses.config/cluster -o jsonpath={.spec.domain})
oc apply -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: test
  namespace: openshift-ingress-operator
spec:
  domain: reproducer.$DOMAIN
  endpointPublishingStrategy:
    type: NodePortService
  replicas: 1
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
EOF

# Look for the log message that is incorrect
oc logs -n openshift-ingress-operator $(oc get -n openshift-ingress-operator pods --no-headers | head -1 | awk '{print $1}') -c ingress-operator | grep conflicting

# The results provide the service name, not the ingress controller name
# "error": "a conflicting nodeport service exists that is not owned by the ingress controller: openshift-ingress/router-test"
Actual results:
"error": "a conflicting nodeport service exists that is not owned by the ingress controller: openshift-ingress/router-test"
Expected results:
"error": "a conflicting nodeport service exists that is not owned by the ingress controller: openshift-ingress/router-nodeport-test"
Additional info:
Both `[sig-devex][Feature:ImageEcosystem][mysql][Slow] openshift mysql image Creating from a template should instantiate the template [apigroup:apps.openshift.io]` and `[sig-devex][Feature:ImageEcosystem][mariadb][Slow] openshift mariadb image Creating from a template should instantiate the template [apigroup:image.openshift.io][apigroup:operator.openshift.io][apigroup:config.openshift.io][apigroup:apps.openshift.io]` are repeatedly failing over multiple PRs.
More links in https://github.com/openshift/origin/pull/27502#issuecomment-1304613482
Opening this issue to temporarily skip the broken tests to unblock merging PRs in openshift/origin:master.
More details in https://issues.redhat.com/browse/OCPBUGS-3339
This is a clone of issue OCPBUGS-3235. The following is the description of the original issue:
—
Frequently we see the loading state of the topology view, even when there aren't many resources in the project.
Including an example
topology will sometimes hang with the loading indicator showing indefinitely
topology should load consistently without fail
intermittent
4.9
Description of problem:
Currently in 4.11, the MAPI nutanix machine-controller does not provide the machine (VM)'s instance-type, region, zone, etc. labels to the Machine CR, so these columns are empty when viewing the Machine CRs via the CLI ("oc get machine") or from the OCP cluster web console.

$ oc -n openshift-machine-api get machine
NAME                                     PHASE     TYPE   REGION   ZONE   AGE
demo-ocp-cluster-g1-77nws-master-0       Running                          133m
demo-ocp-cluster-g1-77nws-master-1       Running                          133m
demo-ocp-cluster-g1-77nws-master-2       Running                          133m
demo-ocp-cluster-g1-77nws-worker-2bsxn   Running                          129m
demo-ocp-cluster-g1-77nws-worker-75hr5   Running                          129m
demo-ocp-cluster-g1-77nws-worker-rg7b9   Running                          129m

We can add something like the below labels to the Machine CR in mapi-nutanix when reconciling the Machine CRs:

machine.openshift.io/instance-type: AHV
machine.openshift.io/region: <prism-central-address>
machine.openshift.io/zone: <prism-element-name/uuid>
Version-Release number of selected component (if applicable):
How reproducible:
Run "oc get machine" or view the Machines resource from the OCP cluster web console.
Steps to Reproduce:
1. 2. 3.
Actual results:
The "Type", "Region", "Zone" columns are empty for each Machine CR.
Expected results:
The "Type", "Region", "Zone" columns showing data for each Machine CR.
Additional info:
This is a clone of issue OCPBUGS-4997. The following is the description of the original issue:
—
The fix for OCPBUGS-3382 ensures that we pass the proxy settings from the install-config through to the final cluster. However, nothing in the agent ISO itself uses proxy settings (at least until bootstrapping starts).
It is probably less likely for the agent-based installer that proxies will be needed than e.g. for assisted (where agents running on-prem need to call back to assisted-service in the cloud), but we should be consistent about using any proxy config provided. There may certainly be cases where the registry is only reachable via a proxy.
This can be easily set system-wide by configuring default environment variables in the systemd config. An example (from the bootstrap ignition) is: https://github.com/openshift/installer/blob/master/data/data/bootstrap/files/etc/systemd/system.conf.d/10-default-env.conf.template
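For illustration, a drop-in along these lines would set the variables system-wide; the proxy values here are placeholders, not values from the source:

# /etc/systemd/system.conf.d/10-default-env.conf
[Manager]
DefaultEnvironment="HTTP_PROXY=http://proxy.example.com:3128" "HTTPS_PROXY=http://proxy.example.com:3128" "NO_PROXY=localhost,127.0.0.1"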
Note that currently the agent service explicitly overrides these environment variables to be empty, so that will have to be cleared.
This is a clone of issue OCPBUGS-4954. The following is the description of the original issue:
—
Description of problem:
During the cluster destroy process for IBM Cloud IPI, failures can occur when COS Instances are deleted: Reclamations are created for the COS deletions and prevent cleanup of the ResourceGroup.
Version-Release number of selected component (if applicable):
4.13.0 (and 4.12.0)
How reproducible:
Sporadic, it depends on IBM Cloud COS
Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud
2. Delete the IPI cluster on IBM Cloud
3. COS Reclamation may be created, and can cause the destroy cluster to fail
Actual results:
time="2022-12-12T16:50:06Z" level=debug msg="Listing resource groups"
time="2022-12-12T16:50:06Z" level=debug msg="Deleting resource group \"eu-gb-reclaim-1-zc6xg\""
time="2022-12-12T16:50:07Z" level=debug msg="Failed to delete resource group eu-gb-reclaim-1-zc6xg: Resource groups with active or pending reclamation instances can't be deleted. Use the CLI commands \"ibmcloud resource service-instances --type all\" and \"ibmcloud resource reclamations\" to check for remaining instances, then delete the instances and try again."
Expected results:
Successful destroy cluster (including deletion of ResourceGroup)
Additional info:
IBM Cloud is testing a potential fix currently.
It was also identified that the destroy stages are not in the proper order.
https://github.com/openshift/installer/blob/9377cb3974986a08b531a5e807fd90a3a4e85ebf/pkg/destroy/ibmcloud/ibmcloud.go#L128-L155
Changes are being made in an attempt to resolve this along with a fix for this bug as well.
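As a manual-cleanup sketch using the commands the error message itself references (the reclamation ID is a placeholder, and the reclamation-delete subcommand is an assumption about the IBM Cloud CLI):

ibmcloud resource service-instances --type all
ibmcloud resource reclamations
ibmcloud resource reclamation-delete <reclamation-id>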
Description of problem:
For OVNK to become CNCF compliant, we need to support the session affinity timeout feature and enable the e2e tests on the OpenShift side. This bug tracks the effort to get this into 4.12 OCP.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
E2E CI feature files are failing because the Mocha version couldn't be determined.
Version-Release number of selected component (if applicable):
How reproducible:
CI Search : https://search.ci.openshift.org/?search=Couldn%27t+determine+Mocha+version&maxAge=336h&context=1&type=bug%2Bjunit&name=pull-ci-openshift-console-operator-master-e2e-aws-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
1. 2. 3.
Actual results:
E2E tests failing with `Couldn't determine Mocha version` error
Expected results:
E2E tests should pass without any failures
Additional info:
In order to have more info for debugging router issues in SNO, we want to check whether the router is healthy from the node network point of view, and enable router access logs.
Let's revert this once the root cause of https://bugzilla.redhat.com/show_bug.cgi?id=2097041 is found.
The relevant code in ironic-image was not updated to support TLS, so it still uses the old port and explicit http://
This is a clone of issue OCPBUGS-3405. The following is the description of the original issue:
—
In case it should be used for publishing artifacts in CI jobs.
Look into whether the following things are leaked:
This is a clone of issue OCPBUGS-4357. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The default catalogSources are not being run in restricted mode.
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Always
Steps to Reproduce:
1. Create a 4.12 OpenShift cluster
2. Check the securityContextConfig for the default catalogSources
Actual results:
$ k get catsrc -n openshift-marketplace -o yaml | grep securityContextConfig
      securityContextConfig: legacy
      securityContextConfig: legacy
      securityContextConfig: legacy
      securityContextConfig: legacy
Expected results:
$ k get catsrc -n openshift-marketplace -o yaml | grep securityContextConfig
      securityContextConfig: restricted
      securityContextConfig: restricted
      securityContextConfig: restricted
      securityContextConfig: restricted
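For reference, a sketch of switching one catalogSource by hand, assuming the spec.grpcPodConfig.securityContextConfig field and an illustrative catalog name:

oc patch catalogsource redhat-operators -n openshift-marketplace --type merge -p '{"spec":{"grpcPodConfig":{"securityContextConfig":"restricted"}}}'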
Additional info:
This is a clone of issue OCPBUGS-855. The following is the description of the original issue:
—
Description of problem:
When setting the allowedRegistries like the example below, the openshift-samples operator is degraded:

oc get image.config.openshift.io/cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Image
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2020-12-16T15:48:20Z"
  generation: 2
  name: cluster
  resourceVersion: "422284920"
  uid: d406d5a0-c452-4a84-b6b3-763abb51d7a5
spec:
  additionalTrustedCA:
    name: registry-ca
  allowedRegistriesForImport:
  - domainName: quay.io
    insecure: false
  - domainName: registry.redhat.io
    insecure: false
  - domainName: registry.access.redhat.com
    insecure: false
  - domainName: registry.redhat.io/redhat/redhat-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/redhat-marketplace-index
    insecure: true
  - domainName: registry.redhat.io/redhat/certified-operator-index
    insecure: true
  - domainName: registry.redhat.io/redhat/community-operator-index
    insecure: true
  registrySources:
    allowedRegistries:
    - quay.io
    - registry.redhat.io
    - registry.rijksapps.nl
    - registry.access.redhat.com
    - registry.redhat.io/redhat/redhat-operator-index
    - registry.redhat.io/redhat/redhat-marketplace-index
    - registry.redhat.io/redhat/certified-operator-index
    - registry.redhat.io/redhat/community-operator-index

oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.21   True        False         False      5d13h
baremetal                                  4.10.21   True        False         False      450d
cloud-controller-manager                   4.10.21   True        False         False      94d
cloud-credential                           4.10.21   True        False         False      624d
cluster-autoscaler                         4.10.21   True        False         False      624d
config-operator                            4.10.21   True        False         False      624d
console                                    4.10.21   True        False         False      42d
csi-snapshot-controller                    4.10.21   True        False         False      31d
dns                                        4.10.21   True        False         False      217d
etcd                                       4.10.21   True        False         False      624d
image-registry                             4.10.21   True        False         False      94d
ingress                                    4.10.21   True        False         False      94d
insights                                   4.10.21   True        False         False      104s
kube-apiserver                             4.10.21   True        False         False      624d
kube-controller-manager                    4.10.21   True        False         False      624d
kube-scheduler                             4.10.21   True        False         False      624d
kube-storage-version-migrator              4.10.21   True        False         False      31d
machine-api                                4.10.21   True        False         False      624d
machine-approver                           4.10.21   True        False         False      624d
machine-config                             4.10.21   True        False         False      17d
marketplace                                4.10.21   True        False         False      258d
monitoring                                 4.10.21   True        False         False      161d
network                                    4.10.21   True        False         False      624d
node-tuning                                4.10.21   True        False         False      31d
openshift-apiserver                        4.10.21   True        False         False      42d
openshift-controller-manager               4.10.21   True        False         False      22d
openshift-samples                          4.10.21   True        True          True       31d     Samples installation in error at 4.10.21: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
operator-lifecycle-manager                 4.10.21   True        False         False      624d
operator-lifecycle-manager-catalog         4.10.21   True        False         False      624d
operator-lifecycle-manager-packageserver   4.10.21   True        False         False      31d
service-ca                                 4.10.21   True        False         False      624d
storage                                    4.10.21   True        False         False      113d

After applying the fix as described here (https://access.redhat.com/solutions/6547281) it is resolved:

oc patch configs.samples.operator.openshift.io cluster --type merge --patch '{"spec": {"samplesRegistry": "registry.redhat.io"}}'

But according to the BZ this should be fixed in 4.10.3 (https://bugzilla.redhat.com/show_bug.cgi?id=2027745), yet the issue still occurs in our 4.10.21 cluster:

oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.21   True        False         31d     Error while reconciling 4.10.21: the cluster operator openshift-samples is degraded
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
And possibly other alerts. Declaring namespace labels on alerts makes it easy to find the source or affected resource, as described here. But because Insights alerts are based on metrics exported by the cluster-version operator, they inherit source information from the CVO, and end up looking like:
ALERTS{alertname="SimpleContentAccessNotAvailable", alertstate="firing", condition="SCAAvailable", endpoint="metrics", instance="10.58.57.116:9099", job="cluster-version-operator", name="insights", namespace="openshift-cluster-version", pod="cluster-version-operator-5d8579fb58-p5hfn", prometheus="openshift-monitoring/k8s", reason="NotFound", receive="true", service="cluster-version-operator", severity="info"}
Adding namespace: openshift-insights to the labels block for InsightsDisabled and SimpleContentAccessNotAvailable would avoid this confusion.
You might also want to clear the job and service labels as irrelevant source information. And you might want to clear the pod label to avoid churning alerts when the CVO rolls out a new pod. You can get the label clearing by wrapping the expr with max without (job, pod, service) (...) or similar.
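Putting both suggestions together, a sketch of the adjusted rule; the expr body is illustrative, and only the namespace label and the max without (job, pod, service) wrapper come from the suggestions above:

- alert: SimpleContentAccessNotAvailable
  expr: max without (job, pod, service) (cluster_operator_conditions{name="insights",condition="SCAAvailable",reason="NotFound"} == 0)
  labels:
    namespace: openshift-insights
    severity: info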
Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-7529.
Description of problem:
When spot instances with taints are added to the cluster on AWS, the machine-api-termination-handler daemonset pods do not launch on those instances because of the taints. machine-api-termination-handler watches for instance-termination notifications, so if it does not launch properly, application pods on spot instances can be stopped without the normal shutdown procedure. It is common to use taints and tolerations to pin workloads to spot instances, because that does not require changing the application manifests of other workloads.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a ROSA cluster
2. Add spot instances with taints on OCM
3. oc get daemonset machine-api-termination-handler -n openshift-machine-api
Actual results:
machine-api-termination-handler pods do not launch on spot instances
Expected results:
machine-api-termination-handler pods launch on spot instances
Additional info:
Adding the following to the machine-api-termination-handler daemonset could resolve the problem:

tolerations:
- operator: Exists
This is a clone of issue OCPBUGS-4850. The following is the description of the original issue:
—
Description of problem:
Kuryr might take a while to create Pods because it has to create Neutron ports for the pods. If a pod gets deleted while this is being processed, a warning Event will be generated, causing the "[sig-network] pods should successfully create sandboxes by adding pod to network" test to fail.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When adding new nodes to an existing cluster, the newly allocated node-subnet can overlap with an existing node's subnet.
Version-Release number of selected component (if applicable):
openshift 4.10.30
How reproducible:
It's quite hard to reproduce, but it can happen at any time.
Steps to Reproduce:
1. Create an OVN dual-stack cluster
2. Add nodes to the existing cluster
3. Check the allocated node subnets
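As a sketch for step 3, the allocated subnets can be read from the k8s.ovn.org/node-subnets node annotation (the annotation key appears in the logs below; the exact invocation is illustrative):

% oc get nodes -o custom-columns='NAME:.metadata.name,SUBNETS:.metadata.annotations.k8s\.ovn\.org/node-subnets'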
Actual results:
Some newly added nodes have the same node-subnet and ovn-k8s-mp0 IP as some existing nodes.
Expected results:
Nodes should not have duplicated node-subnets or ovn-k8s-mp0 IPs.
Additional info:
Additional info can be found at the case 03329155 and the must-gather attached (comment #1).

% omg logs ovnkube-master-v8crc -n openshift-ovn-kubernetes -c ovnkube-master | grep '2022-09-30T06:42:50.857'
2022-09-30T06:42:50.857031565Z W0930 06:42:50.857020 1 master.go:1422] Did not find any logical switches with other-config
2022-09-30T06:42:50.857112441Z I0930 06:42:50.857099 1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node worker01.ss1.samsung.local
2022-09-30T06:42:50.857122455Z I0930 06:42:50.857105 1 master.go:1003] Allocated Subnets [10.129.4.0/23 fd02:0:0:a::/64] on Node oam04.ss1.samsung.local
2022-09-30T06:42:50.857130289Z I0930 06:42:50.857122 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node worker01.ss1.samsung.local
2022-09-30T06:42:50.857140773Z I0930 06:42:50.857132 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.129.4.0/23","fd02:0:0:a::/64"]}] on node oam04.ss1.samsung.local
2022-09-30T06:42:50.857166726Z I0930 06:42:50.857156 1 master.go:1003] Allocated Subnets [10.128.2.0/23 fd02:0:0:5::/64] on Node oam01.ss1.samsung.local
2022-09-30T06:42:50.857176132Z I0930 06:42:50.857157 1 master.go:1003] Allocated Subnets [10.131.0.0/23 fd02:0:0:4::/64] on Node rhel01.ss1.samsung.local
2022-09-30T06:42:50.857176132Z I0930 06:42:50.857167 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.2.0/23","fd02:0:0:5::/64"]}] on node oam01.ss1.samsung.local
2022-09-30T06:42:50.857185257Z I0930 06:42:50.857157 1 master.go:1003] Allocated Subnets [10.128.6.0/23 fd02:0:0:d::/64] on Node call03.ss1.samsung.local
2022-09-30T06:42:50.857192996Z I0930 06:42:50.857183 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.0.0/23","fd02:0:0:4::/64"]}] on node rhel01.ss1.samsung.local
2022-09-30T06:42:50.857200017Z I0930 06:42:50.857190 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.6.0/23","fd02:0:0:d::/64"]}] on node call03.ss1.samsung.local
2022-09-30T06:42:50.857282717Z I0930 06:42:50.857258 1 master.go:1003] Allocated Subnets [10.130.2.0/23 fd02:0:0:7::/64] on Node call01.ss1.samsung.local
2022-09-30T06:42:50.857304886Z I0930 06:42:50.857293 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.130.2.0/23","fd02:0:0:7::/64"]}] on node call01.ss1.samsung.local
2022-09-30T06:42:50.857338896Z I0930 06:42:50.857314 1 master.go:1003] Allocated Subnets [10.128.4.0/23 fd02:0:0:9::/64] on Node f501.ss1.samsung.local
2022-09-30T06:42:50.857349485Z I0930 06:42:50.857329 1 master.go:1003] Allocated Subnets [10.131.2.0/23 fd02:0:0:8::/64] on Node call02.ss1.samsung.local
2022-09-30T06:42:50.857371344Z I0930 06:42:50.857354 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.128.4.0/23","fd02:0:0:9::/64"]}] on node f501.ss1.samsung.local
2022-09-30T06:42:50.857371344Z I0930 06:42:50.857361 1 kube.go:99] Setting annotations map[k8s.ovn.org/node-subnets:{"default":["10.131.2.0/23","fd02:0:0:8::/64"]}] on node call02.ss1.samsung.local
DVO metrics contain some sensitive data that should not be sent outside the cluster. For that, IO must remove this data from the metrics before saving them to the archive and uploading them to the pipeline.
Remove the name and namespace from DVO metrics before saving it to the IO archive.
Failures like:
$ oc login --token=...
Logged into "https://api..." as "..." using the token provided.
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get projects.project.openshift.io)
break login, which tries to gather information before saving the configuration, including a giant project list.
Ideally login would be able to save the successful login credentials, even when the informative gathering had difficulties. And possibly the informative gathering could be made conditional (--quiet or similar?) so expensive gathering could be skipped in use-cases where the context was not needed.
Description of problem:
There were 4 ingress controllers and 15 routes in total. On the web console, querying "route_metrics_controller_routes_per_shard" in the Observe >> Metrics page reported 15 for three of the ingress controllers and 1 for the last one.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-23-154914
How reproducible:
Create pods, services, ingress-controllers, routes, then check "route_metrics_controller_routes_per_shard" on web console
Steps to Reproduce:
1. Get the cluster's base domain:
% oc get dnses.config/cluster -oyaml | grep -i domain
  baseDomain: shudi-412gcpop36.qe.gcp.devcluster.openshift.com

2. Create 3 more ingress controllers:
% oc -n openshift-ingress-operator get ingresscontroller
NAME         AGE
default      7h5m
extertest3   120m
internal1    120m
internal2    120m

3. Check the spec of the 4 ingress controllers:
a. default
b. extertest3
spec:
  domain: extertest3.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: External
    type: LoadBalancerService
c. internal1
spec:
  domain: internal1.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal
    type: LoadBalancerService
d. internal2
spec:
  domain: internal2.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal
    type: LoadBalancerService
  routeSelector:
    matchLabels:
      shard: alpha

4. Check the routes; there are 15:
% oc get route -A | awk '{print $3}'
HOST/PORT
oauth-openshift.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
console-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
downloads-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
canary-openshift-ingress-canary.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
alertmanager-main-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
prometheus-k8s-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
prometheus-k8s-federate-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
thanos-querier-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
edge1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
int1reen2-test.internal1.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
pass1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
reen1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
service-unsecure-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
int1edge2-test.internal1.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
test.shudi.com

% oc get route -A | awk '{print $3}' | grep apps.shudi
oauth-openshift.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
console-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
downloads-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
canary-openshift-ingress-canary.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
alertmanager-main-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
prometheus-k8s-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
prometheus-k8s-federate-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
thanos-querier-openshift-monitoring.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
edge1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
pass1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
reen1-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com
service-unsecure-test.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com

% oc get route -A | awk '{print $3}' | grep apps.shudi | wc -l
12
% oc get route -A | awk '{print $3}' | grep internal1 | wc -l
2
% oc get route -A | awk '{print $3}' | grep shudi.com | wc -l
1

5. Only route unsvc5 had the shard=alpha label:
% oc get route unsvc5 -oyaml | grep labels: -A2
  labels:
    name: unsvc5
    shard: alpha
% oc get route unsvc5 -oyaml | grep spec: -A1
spec:
  host: test.shudi.com

6. Log in to the web console (https://console-openshift-console.apps.shudi-412gcpop36.qe.gcp.devcluster.openshift.com/monitoring/query-browser), then navigate to Observe >> Metrics.

7. Input "route_metrics_controller_routes_per_shard", then click the "Run queries" button. As the attached picture showed:
name         value
default      15
extertest3   15
internal1    15
internal2    1

8. Also there was a minor issue: as the attached picture showed, there were two name columns in the header line:
Name                                         name         value
route_metrics_controller_routes_per_shard    default      15
route_metrics_controller_routes_per_shard    extertest3   15
route_metrics_controller_routes_per_shard    internal1    15
route_metrics_controller_routes_per_shard    internal2    1
Actual results:
name         value
default      15
extertest3   15
internal1    15
internal2    1
Expected results:
name         value
default      12
extertest3   0
internal1    2
internal2    1
Additional info:
This is a clone of issue OCPBUGS-3018. The following is the description of the original issue:
—
Description of problem:
When running an overnight run in dev-scripts (COMPACT_IPV4) with repeated installs, I saw this panic in WaitForBootstrapComplete occur once.

level=debug msg=Agent Rest API Initialized
E1101 05:19:09.733309 1802865 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x4086520?, 0x1d875810})
	/home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00056fb00?})
	/home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x4086520, 0x1d875810})
	/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/openshift/installer/pkg/agent.(*NodeZeroRestClient).getClusterID(0xc0001341e0)
	/home/stack/go/src/github.com/openshift/installer/pkg/agent/rest.go:121 +0x53
github.com/openshift/installer/pkg/agent.(*Cluster).IsBootstrapComplete(0xc000134190)
	/home/stack/go/src/github.com/openshift/installer/pkg/agent/cluster.go:183 +0x4fc
github.com/openshift/installer/pkg/agent.WaitForBootstrapComplete.func1()
	/home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:31 +0x77
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x1d8fa901?)
	/home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0001958c0?, {0x1a53c7a0, 0xc0011d4a50}, 0x1, 0xc0001958c0)
	/home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0009ab860?, 0x77359400, 0x0, 0xa?, 0x8?)
	/home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/home/stack/go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92
github.com/openshift/installer/pkg/agent.WaitForBootstrapComplete({0x7ffd7fccb4e3?, 0x40d7e7?})
	/home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:30 +0x1bc
github.com/openshift/installer/pkg/agent.WaitForInstallComplete({0x7ffd7fccb4e3?, 0x5?})
	/home/stack/go/src/github.com/openshift/installer/pkg/agent/waitfor.go:73 +0x56
github.com/openshift/installer/cmd/openshift-install/agent.newWaitForInstallCompleteCmd.func1(0xc0003b6c80?, {0xc0004d67c0?, 0x2?, 0x2?})
	/home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/agent/waitfor.go:73 +0x126
github.com/spf13/cobra.(*Command).execute(0xc0003b6c80, {0xc0004d6780, 0x2, 0x2})
	/home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:876 +0x67b
github.com/spf13/cobra.(*Command).ExecuteC(0xc0013b0a00)
	/home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	/home/stack/go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:918
main.installerMain()
	/home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
	/home/stack/go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x33d3cd3]
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-25-210451
How reproducible:
Occurred on the 12th run; all previous installs were successful.
Steps to Reproduce:
1. Set up dev-scripts for AGENT_E2E_TEST_SCENARIO=COMPACT_IPV4, no mirroring
2. Run 'make clean; make agent' in a loop
3. After repeated installs, got the failure
Actual results:
Panic in WaitForBootstrapComplete
Expected results:
No failure
Additional info:
It looks like clusterResult is used here even on failure, which causes the dereference - https://github.com/openshift/installer/blob/master/pkg/agent/rest.go#L121
Clone of https://issues.redhat.com/browse/OCPBUGSM-44162.
Cannot use the original as the bot won't accept a security bug:
When the change merges, the Bugzilla associated with the CVE must be set to MODIFIED. Since the DPTP bugzilla bot is not permitted to scan bugs with the SECURITY group in Bugzilla, The REP will not be able to use the bot's public functionality of moving their bug to MODIFIED.
Description of problem:
Currently we are not gathering Machine objects. We received a nomination for a rule that will use this resource.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The cluster-ingress-operator log output is a little noisy when starting the operator's controllers, in part because of the way in which the configurable-route controller configures its watches.
4.10+.
Always.
1. Check the ingress-operator logs, and search for "configurable_route_controller":

oc -n openshift-ingress-operator logs -c ingress-operator deploy/ingress-operator | grep -e configurable_route_controller
The operator emits log messages like the following on startup:
2022-11-23T08:47:35.646-0600 INFO operator.init controller/controller.go:241 Starting EventSource {"controller": "configurable_route_controller", "source": "&{{%!s(*v1.Role=&{{ } { 0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] [] []} []}) %!s(*cache.multiNamespaceCache=&{map[openshift-config:0xc000712110 openshift-config-managed:0xc000712108 openshift-ingress:0xc0007120f8 openshift-ingress-canary:0xc000712100 openshift-ingress-operator:0xc0007120e8] 0xc000261ea0 0xc00010e190 0xc0007120e0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
2022-11-23T08:47:35.646-0600 INFO operator.init controller/controller.go:241 Starting EventSource {"controller": "configurable_route_controller", "source": "&{{%!s(*v1.RoleBinding=&{{ } { 0 {{0 0 <nil>}} <nil> <nil> map[] map[] [] [] []} [] { }}) %!s(*cache.multiNamespaceCache=&{map[openshift-config:0xc000712110 openshift-config-managed:0xc000712108 openshift-ingress:0xc0007120f8 openshift-ingress-canary:0xc000712100 openshift-ingress-operator:0xc0007120e8] 0xc000261ea0 0xc00010e190 0xc0007120e0}) %!s(chan error=<nil>) %!s(func()=<nil>)}}"}
2022-11-23T08:47:35.646-0600 INFO operator.init controller/controller.go:241 Starting Controller {"controller": "configurable_route_controller"}
The operator should emit log messages like the following on startup:
2022-11-23T08:48:43.076-0600 INFO operator.init controller/controller.go:241 Starting EventSource {"controller": "configurable_route_controller", "source": "kind source: *v1.Role"}
2022-11-23T08:48:43.078-0600 INFO operator.init controller/controller.go:241 Starting EventSource {"controller": "configurable_route_controller", "source": "kind source: *v1.RoleBinding"}
2022-11-23T08:48:43.078-0600 INFO operator.init controller/controller.go:241 Starting Controller {"controller": "configurable_route_controller"}
The cited noisiness results from two issues. First, the configurable-route controller needlessly uses source.NewKindWithCache() to configure its watches when it would be sufficient and slightly simpler to use source.Kind.
Second, recent versions of controller-runtime have excessively noisy logging for the kindWithCache source type. The configurable-route controller was introduced in OpenShift 4.8, which uses controller-runtime v0.9.0-alpha.1. OpenShift 4.9 has controller-runtime v0.9.0, OpenShift 4.10 has controller-runtime v0.11.0, and OpenShift 4.11 has controller-runtime v0.12.0. A change in controller-runtime v0.11.0 causes the noisiness. Before this change, the output was excessively quiet, for example:
2022-09-28T20:51:40.979Z INFO operator.init.controller-runtime.manager.controller.configurable_route_controller controller/controller.go:221 Starting EventSource {"source": {}}
2022-09-28T20:51:40.979Z INFO operator.init.controller-runtime.manager.controller.configurable_route_controller controller/controller.go:221 Starting EventSource {"source": {}}
I have filed an issue upstream to improve the logging for kindWithCache: https://github.com/kubernetes-sigs/controller-runtime/pull/2057
This is a clone of issue OCPBUGS-1557. The following is the description of the original issue:
—
Seen in an instance created recently by a 4.12.0-ec.2 GCP provider:
"scheduling": { "automaticRestart": false, "onHostMaintenance": "MIGRATE", "preemptible": false, "provisioningModel": "STANDARD" },
From GCP's docs, they may stop instances on hardware failures and other causes, and we'd need automaticRestart: true to auto-recover from that. Also from GCP docs, the default for automaticRestart is true. And on the Go provider side, we doc:
If omitted, the platform chooses a default, which is subject to change over time, currently that default is "Always".
But the implementing code does not actually pass the setting through. Seems like a regression here, which is part of 4.10:
$ git clone https://github.com/openshift/machine-api-provider-gcp.git
$ cd machine-api-provider-gcp
$ git log --oneline origin/release-4.10 | grep 'migrate to openshift/api'
44f0f958 migrate to openshift/api
But that's not where the 4.9 and earlier code is located:
$ git branch -a | grep origin/release
  remotes/origin/release-4.10
  remotes/origin/release-4.11
  remotes/origin/release-4.12
  remotes/origin/release-4.13
Hunting for 4.9 code:
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.9.48-x86_64 | grep gcp
  gcp-machine-controllers      https://github.com/openshift/cluster-api-provider-gcp      c955c03b2d05e3b8eb0d39d5b4927128e6d1c6c6
  gcp-pd-csi-driver            https://github.com/openshift/gcp-pd-csi-driver             48d49f7f9ef96a7a42a789e3304ead53f266f475
  gcp-pd-csi-driver-operator   https://github.com/openshift/gcp-pd-csi-driver-operator    d8a891de5ae9cf552d7d012ebe61c2abd395386e
So looking there:
$ git clone https://github.com/openshift/cluster-api-provider-gcp.git
$ cd cluster-api-provider-gcp
$ git log --oneline | grep 'migrate to openshift/api'
...no hits...
$ git grep -i automaticRestart origin/release-4.9 | grep -v '"description"\|compute-gen.go'
origin/release-4.9:vendor/google.golang.org/api/compute/v1/compute-api.json: "automaticRestart": {
Not actually clear to me how that code is structured. So 4.10 and later GCP machine-API providers are impacted, and I'm unclear on 4.9 and earlier.
When we create an HCP, the Root CA in the HCP namespace has the certificate and key named differently from what the cert manager expects.
Done criteria: The Root CA should have the certificate and key named as the cert manager expects.
In 4.12.0-rc.0 some API-server components declare flowcontrol/v1beta1 release manifests:
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.12.0-rc.0-x86_64
$ grep -r flowcontrol.apiserver.k8s.io manifests
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_etcd-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-controller-manager-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. This ticket tracks removing those manifests, replacing them with a more modern resource type, or some such. The definition of done is that new 4.13 (and, with backports, 4.12) nightlies no longer include flowcontrol.apiserver.k8s.io/v1beta1 manifests.
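As a sketch of the manifest change, bumping the apiVersion should be most of the work; targeting v1beta2 is an assumption (it is a flowcontrol version already served by the Kube 1.25 API server that 4.12 ships), and the FlowSchema name is illustrative:

apiVersion: flowcontrol.apiserver.k8s.io/v1beta2   # was v1beta1
kind: FlowSchema
metadata:
  name: openshift-etcd-operator   # illustrative name
# spec is unchanged from the v1beta1 manifest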
[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO:
api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Nov 18 21:59:06.261: INFO:
api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
[AfterEach] [sig-arch][Late]
github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
github.com/openshift/origin/test/extended/util/client.go:159
flake:
api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Ginkgo exit error 4: exit with code 4
This is required to unblock https://github.com/openshift/origin/pull/27561
Description of problem:
A duplicate "Getting started" notification is shown on the Search page.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-26-111919
How reproducible:
Always
Steps to Reproduce:
1. Log in to OCP as a normal user, switch to the Developer perspective, and create a new project
2. Delete the project (switch to the Administrator perspective, go to the Home -> Projects page)
3. Switch to the Developer perspective, go to the Search page, and check the "Getting Started" notification
Actual results:
Two notifications are shown on the page
Expected results:
Only one should exist
Additional info:
Description of problem:
https://github.com/openshift/api/pull/1186 - https://issues.redhat.com/browse/CONSOLE-3069 promoted the ConsolePlugin CRD to v1. The PR also introduces a conversion webhook from v1alpha1 to v1. In the new CRD version, i18n (ConsolePluginI18n) is marked as optional. The conversion webhook will not set a default valid value ("Lazy"/"Preload") when writing the v1 object, and a v1 object completely omitting spec.i18n will be accepted with no valid default value as well. On the other side, at garbage collection time the object will be stuck forever due to the lack of a valid value for spec.i18n.loadType.

Example: create a v1 ConsolePlugin object:

cat <<EOF | oc apply -f -
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: test472
spec:
  backend:
    service:
      basePath: /
      name: test472-service
      namespace: kubevirt-hyperconverged
      port: 9443
    type: Service
  displayName: Test 472 Plugin
EOF

Delete it in foreground mode:

stirabos@t14s:~$ oc delete consoleplugin test472 --timeout=30s --cascade='foreground' -v 7
I1011 18:20:03.255605 31610 loader.go:372] Config loaded from file: /home/stirabos/.kube/config
I1011 18:20:03.266567 31610 round_trippers.go:463] DELETE https://api.ci-ln-krdzphb-72292.gcp-2.ci.openshift.org:6443/apis/console.openshift.io/v1/consoleplugins/test472
I1011 18:20:03.266581 31610 round_trippers.go:469] Request Headers:
I1011 18:20:03.266588 31610 round_trippers.go:473] Accept: application/json
I1011 18:20:03.266594 31610 round_trippers.go:473] Content-Type: application/json
I1011 18:20:03.266600 31610 round_trippers.go:473] User-Agent: oc/4.11.0 (linux/amd64) kubernetes/fcf512e
I1011 18:20:03.266606 31610 round_trippers.go:473] Authorization: Bearer <masked>
I1011 18:20:03.688569 31610 round_trippers.go:574] Response Status: 200 OK in 421 milliseconds
consoleplugin.console.openshift.io "test472" deleted
I1011 18:20:03.688911 31610 round_trippers.go:463] GET https://api.ci-ln-krdzphb-72292.gcp-2.ci.openshift.org:6443/apis/console.openshift.io/v1/consoleplugins?fieldSelector=metadata.name%3Dtest472
I1011 18:20:03.688919 31610 round_trippers.go:469] Request Headers:
I1011 18:20:03.688928 31610 round_trippers.go:473] Authorization: Bearer <masked>
I1011 18:20:03.688935 31610 round_trippers.go:473] Accept: application/json
I1011 18:20:03.688941 31610 round_trippers.go:473] User-Agent: oc/4.11.0 (linux/amd64) kubernetes/fcf512e
I1011 18:20:03.840103 31610 round_trippers.go:574] Response Status: 200 OK in 151 milliseconds
I1011 18:20:03.840825 31610 round_trippers.go:463] GET https://api.ci-ln-krdzphb-72292.gcp-2.ci.openshift.org:6443/apis/console.openshift.io/v1/consoleplugins?fieldSelector=metadata.name%3Dtest472&resourceVersion=175205&watch=true
I1011 18:20:03.840848 31610 round_trippers.go:469] Request Headers:
I1011 18:20:03.840884 31610 round_trippers.go:473] Accept: application/json
I1011 18:20:03.840907 31610 round_trippers.go:473] User-Agent: oc/4.11.0 (linux/amd64) kubernetes/fcf512e
I1011 18:20:03.840928 31610 round_trippers.go:473] Authorization: Bearer <masked>
I1011 18:20:03.972219 31610 round_trippers.go:574] Response Status: 200 OK in 131 milliseconds
error: timed out waiting for the condition on consoleplugins/test472

and in kube-controller-manager logs we see:

2022-10-11T16:25:32.192864016Z I1011 16:25:32.192788 1 garbagecollector.go:501] "Processing object" object="test472" objectUID=0cc46a01-113b-4bbe-9c7a-829a97d6867c kind="ConsolePlugin" virtual=false
2022-10-11T16:25:32.282303274Z I1011 16:25:32.282161 1 garbagecollector.go:623] remove DeleteDependents finalizer for item [console.openshift.io/v1/ConsolePlugin, namespace: , name: test472, uid: 0cc46a01-113b-4bbe-9c7a-829a97d6867c]
2022-10-11T16:25:32.304835330Z E1011 16:25:32.304730 1 garbagecollector.go:379] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"console.openshift.io/v1", Kind:"ConsolePlugin", Name:"test472", UID:"0cc46a01-113b-4bbe-9c7a-829a97d6867c", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:""}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference(nil)}: ConsolePlugin.console.openshift.io "test472" is invalid: spec.i18n.loadType: Unsupported value: "": supported values: "Preload", "Lazy"
Version-Release number of selected component (if applicable):
OCP 4.12.0 ec4
How reproducible:
100%
Steps to Reproduce:
1. cat <<EOF | oc apply -f -
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: test472
spec:
  backend:
    service:
      basePath: /
      name: test472-service
      namespace: kubevirt-hyperconverged
      port: 9443
    type: Service
  displayName: Test 472 Plugin
EOF
2. oc delete consoleplugin test472 --timeout=30s --cascade='foreground' -v 7
Actual results:
2022-10-11T16:25:32.192864016Z I1011 16:25:32.192788 1 garbagecollector.go:501] "Processing object" object="test472" objectUID=0cc46a01-113b-4bbe-9c7a-829a97d6867c kind="ConsolePlugin" virtual=false
2022-10-11T16:25:32.282303274Z I1011 16:25:32.282161 1 garbagecollector.go:623] remove DeleteDependents finalizer for item [console.openshift.io/v1/ConsolePlugin, namespace: , name: test472, uid: 0cc46a01-113b-4bbe-9c7a-829a97d6867c]
2022-10-11T16:25:32.304835330Z E1011 16:25:32.304730 1 garbagecollector.go:379] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"console.openshift.io/v1", Kind:"ConsolePlugin", Name:"test472", UID:"0cc46a01-113b-4bbe-9c7a-829a97d6867c", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:""}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference(nil)}: ConsolePlugin.console.openshift.io "test472" is invalid: spec.i18n.loadType: Unsupported value: "": supported values: "Preload", "Lazy"
Expected results:
Object correctly deleted
Additional info:
The issue doesn't happen with --cascade='background', which is the default for the CLI client.
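A possible workaround sketch (not from the original report, but grounded in the validation error above): explicitly set spec.i18n.loadType to one of the accepted values so that foreground deletion never trips the validation. The plugin values reuse the reproducer above.
```
cat <<EOF | oc apply -f -
apiVersion: console.openshift.io/v1
kind: ConsolePlugin
metadata:
  name: test472
spec:
  i18n:
    loadType: Lazy   # "Lazy" or "Preload"; an empty value is what blocks garbage collection
  backend:
    service:
      basePath: /
      name: test472-service
      namespace: kubevirt-hyperconverged
      port: 9443
    type: Service
  displayName: Test 472 Plugin
EOF
```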
This is a clone of issue OCPBUGS-1428. The following is the description of the original issue:
—
Description of problem:
When using an OperatorGroup attached to a service account, and a secret is present in the namespace, the operator installation fails with the message:
the service account does not have any API secret sa=testx-ns/testx-sa
This issue seems similar to https://bugzilla.redhat.com/show_bug.cgi?id=2094303, which was resolved in 4.11.0; however, the new element now is that the presence of a secret in the namespace causes the issue. The name of the secret seems significant, suggesting something somewhere depends on the order in which secrets are listed. For example, if the secret in the namespace is called "asecret", the problem does not occur; if it is called "zsecret", the problem always occurs.
"zsecret" is not a "kubernetes.io/service-account-token". The issue I have raised here relates to Opaque secrets - zsecret is an Opaque secret. The issue may apply to other types of secrets, but specifically my issue is that when there is an Opaque secret present in the namespace, the operator install fails as described. I ought to be allowed to have an Opaque secret present in the namespace where I am installing the operator.
Version-Release number of selected component (if applicable):
4.11.0 & 4.11.1
How reproducible:
100% reproducible
Steps to Reproduce:
1. Create namespace: oc new-project testx-ns
2. oc apply -f api-secret-issue.yaml
Actual results:
Expected results:
Additional info:
API YAML:
cat api-secret-issue.yaml
apiVersion: v1
kind: Secret
metadata:
  name: zsecret
  namespace: testx-ns
  annotations:
    kubernetes.io/service-account.name: testx-sa
type: Opaque
stringData:
  mykey: mypass
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: testx-sa
  namespace: testx-ns
---
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: testx-og
  namespace: testx-ns
spec:
  serviceAccountName: "testx-sa"
  targetNamespaces:
  - testx-ns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: testx-role
  namespace: testx-ns
rules:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: testx-rolebinding
  namespace: testx-ns
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: testx-role
subjects:
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-operator
  namespace: testx-ns
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: community-operators
  sourceNamespace: openshift-marketplace
Description of problem:
When scaling down the machineSet for worker nodes, a PV(vmdk) file got deleted.
Version-Release number of selected component (if applicable):
4.10
How reproducible:
N/A
Steps to Reproduce:
1. Scale down worker nodes
2. Check VMware logs; the VM gets deleted with the vmdk still attached
Actual results:
After scaling down nodes, volumes still attached to the VM get deleted alongside the VM
Expected results:
Worker nodes scaled down without any accidental deletion
Additional info:
This is a clone of issue OCPBUGS-501. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
4.10.16
How reproducible:
Always
Steps to Reproduce:
1. Edit the apiserver resource and add spec.audit.customRules field
$ oc get apiserver cluster -o yaml
spec:
audit:
customRules:
2. Allow the kube-apiserver pods to rollout new revision.
3. Once the kube-apiserver pods are in new revision execute $ oc get dc
Actual results:
Error from server (InternalError): an error on the server ("This request caused apiserver to panic. Look in the logs for details.") has prevented the request from succeeding (get deploymentconfigs.apps.openshift.io)
Expected results:
The command "oc get dc" should display the deploymentconfig without any error.
Additional info:
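For illustration of step 1, a minimal customRules stanza can also be applied with a merge patch; the group and profile values here are examples, not taken from the original report:
```
# Hypothetical example: set an audit profile for a specific group
oc patch apiserver cluster --type=merge -p \
  '{"spec":{"audit":{"customRules":[{"group":"system:authenticated:oauth","profile":"WriteRequestBodies"}]}}}'
```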
Description of problem:
The API Explorer page layout is incorrect, please check the attachment for more details
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-15-150248
How reproducible:
Always
Steps to Reproduce:
1. Login OCP, Go to Home -> API Explorer page
2. Check if there is an extra blank line between the dropdown filter and the list
Actual results:
There is an extra blank line between the dropdown filter and the list
Expected results:
Use the right PatternFly package and remove the extra blank line
Additional info:
104.0.5112.79 (Official Build) (64-bit)
Description of problem:
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2074299 for backporting purposes.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3186. The following is the description of the original issue:
—
Description of problem:
Fails to produce a clear error message when the zones do not match the subnets in BYON
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Configure zones and control-plane subnets in install-config.yaml:
yq '.controlPlane.platform.ibmcloud.zones,.platform.ibmcloud.controlPlaneSubnets' install-config.yaml
["ca-tor-1", "ca-tor-2", "ca-tor-3"]
- ca-tor-existing-network-1-cp-ca-tor-2
- ca-tor-existing-network-1-cp-ca-tor-3
2. openshift-install create manifests --dir byon-az-test-1
Actual results:
FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: failed to create provider: no subnet found for ca-tor-1
Expected results:
A clearer error message in install-config.yaml validation pointing out the zone/subnet mismatch
Additional info:
Description of problem:
Create a LoadBalancer-type service within the OCP 4.11.x OVNKubernetes cluster to expose the api server endpoint; the service does not respond to normal oc requests, but some requests work, like "oc whoami" and "oc get --raw /api".
Version-Release number of selected component (if applicable):
4.11.8 with OVNKubernetes
How reproducible:
always
Steps to Reproduce:
1. Setup an openshift cluster 4.11 on AWS with OVNKubernetes as the default network
2. Create the following service under the openshift-kube-apiserver namespace to expose the api:
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1800"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  name: test-api
  namespace: openshift-kube-apiserver
spec:
  allocateLoadBalancerNodePorts: true
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerSourceRanges:
  - <my_ip>/32
  ports:
  - nodePort: 31248
    port: 6443
    protocol: TCP
    targetPort: 6443
  selector:
    apiserver: "true"
    app: openshift-kube-apiserver
  sessionAffinity: None
  type: LoadBalancer
3. Setup DNS resolution for the access: xxx.mydomain.com ---> <elb-auto-generated-dns>
4. Try to access the cluster api via the service above by updating the kubeconfig to use the custom dns name
Actual results:
No response from the server side. $ time oc get node -v8 I1025 08:29:10.284069 103974 loader.go:375] Config loaded from file: bmeng.kubeconfig I1025 08:29:10.294017 103974 round_trippers.go:420] GET https://rh-api.bmeng-ccs-ovn.3o13.s1.devshift.org:6443/api/v1/nodes?limit=500 I1025 08:29:10.294035 103974 round_trippers.go:427] Request Headers: I1025 08:29:10.294043 103974 round_trippers.go:431] Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json I1025 08:29:10.294052 103974 round_trippers.go:431] User-Agent: oc/openshift (linux/amd64) kubernetes/e40bd2d I1025 08:29:10.365119 103974 round_trippers.go:446] Response Status: 200 OK in 71 milliseconds I1025 08:29:10.365142 103974 round_trippers.go:449] Response Headers: I1025 08:29:10.365148 103974 round_trippers.go:452] Audit-Id: 83b9d8ae-05a4-4036-bff6-de371d5bec12 I1025 08:29:10.365155 103974 round_trippers.go:452] Cache-Control: no-cache, private I1025 08:29:10.365161 103974 round_trippers.go:452] Content-Type: application/json I1025 08:29:10.365167 103974 round_trippers.go:452] X-Kubernetes-Pf-Flowschema-Uid: 2abc2e2d-ada3-4cb8-a86f-235df3a4e214 I1025 08:29:10.365173 103974 round_trippers.go:452] X-Kubernetes-Pf-Prioritylevel-Uid: 02f7a188-43c7-4827-af58-5ebe861a1891 I1025 08:29:10.365179 103974 round_trippers.go:452] Date: Tue, 25 Oct 2022 08:29:10 GMT ^C real 17m4.840s user 0m0.567s sys 0m0.163s However, it has the correct response if using --raw to request, eg: $ oc get --raw /api/v1 --kubeconfig bmeng.kubeconfig {"kind":"APIResourceList","groupVersion":"v1","resources":[{"name":"bindings","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"componentstatuses","singularName":"","namespaced":false,"kind":"ComponentStatus","verbs":["get","list"],"shortNames":["cs"]},{"name":"configmaps","singularName":"","namespaced":true,"kind":"ConfigMap","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["cm"],"storageVersionHash":"qFsyl6wFWjQ="},{"name":"endpoints","singularName":"","namespaced":true,"kind":"Endpoints","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ep"],"storageVersionHash":"fWeeMqaN/OA="},{"name":"events","singularName":"","namespaced":true,"kind":"Event","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["ev"],"storageVersionHash":"r2yiGXH7wu8="},{"name":"limitranges","singularName":"","namespaced":true,"kind":"LimitRange","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["limits"],"storageVersionHash":"EBKMFVe6cwo="},{"name":"namespaces","singularName":"","namespaced":false,"kind":"Namespace","verbs":["create","delete","get","list","patch","update","watch"],"shortNames":["ns"],"storageVersionHash":"Q3oi5N2YM8M="},{"name":"namespaces/finalize","singularName":"","namespaced":false,"kind":"Namespace","verbs":["update"]},{"name":"namespaces/status","singularName":"","namespaced":false,"kind":"Namespace","verbs":["get","patch","update"]},{"name":"nodes","singularName":"","namespaced":false,"kind":"Node","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["no"],"storageVersionHash":"XwShjMxG9Fs="},{"name":"nodes/proxy","singularName":"","namespaced":false,"kind":"NodeProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"nodes/status","singularName":"","namespaced":false,"kind":"Node
","verbs":["get","patch","update"]},{"name":"persistentvolumeclaims","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pvc"],"storageVersionHash":"QWTyNDq0dC4="},{"name":"persistentvolumeclaims/status","singularName":"","namespaced":true,"kind":"PersistentVolumeClaim","verbs":["get","patch","update"]},{"name":"persistentvolumes","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["pv"],"storageVersionHash":"HN/zwEC+JgM="},{"name":"persistentvolumes/status","singularName":"","namespaced":false,"kind":"PersistentVolume","verbs":["get","patch","update"]},{"name":"pods","singularName":"","namespaced":true,"kind":"Pod","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["po"],"categories":["all"],"storageVersionHash":"xPOwRZ+Yhw8="},{"name":"pods/attach","singularName":"","namespaced":true,"kind":"PodAttachOptions","verbs":["create","get"]},{"name":"pods/binding","singularName":"","namespaced":true,"kind":"Binding","verbs":["create"]},{"name":"pods/ephemeralcontainers","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"pods/eviction","singularName":"","namespaced":true,"group":"policy","version":"v1","kind":"Eviction","verbs":["create"]},{"name":"pods/exec","singularName":"","namespaced":true,"kind":"PodExecOptions","verbs":["create","get"]},{"name":"pods/log","singularName":"","namespaced":true,"kind":"Pod","verbs":["get"]},{"name":"pods/portforward","singularName":"","namespaced":true,"kind":"PodPortForwardOptions","verbs":["create","get"]},{"name":"pods/proxy","singularName":"","namespaced":true,"kind":"PodProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"pods/status","singularName":"","namespaced":true,"kind":"Pod","verbs":["get","patch","update"]},{"name":"podtemplates","singularName":"","namespaced":true,"kind":"PodTemplate","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"LIXB2x4IFpk="},{"name":"replicationcontrollers","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["rc"],"categories":["all"],"storageVersionHash":"Jond2If31h0="},{"name":"replicationcontrollers/scale","singularName":"","namespaced":true,"group":"autoscaling","version":"v1","kind":"Scale","verbs":["get","patch","update"]},{"name":"replicationcontrollers/status","singularName":"","namespaced":true,"kind":"ReplicationController","verbs":["get","patch","update"]},{"name":"resourcequotas","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["quota"],"storageVersionHash":"8uhSgffRX6w="},{"name":"resourcequotas/status","singularName":"","namespaced":true,"kind":"ResourceQuota","verbs":["get","patch","update"]},{"name":"secrets","singularName":"","namespaced":true,"kind":"Secret","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"storageVersionHash":"S6u1pOWzb84="},{"name":"serviceaccounts","singularName":"","namespaced":true,"kind":"ServiceAccount","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["sa"],"storageVersionHash":"pbx9ZvyFpBE="
},{"name":"serviceaccounts/token","singularName":"","namespaced":true,"group":"authentication.k8s.io","version":"v1","kind":"TokenRequest","verbs":["create"]},{"name":"services","singularName":"","namespaced":true,"kind":"Service","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["svc"],"categories":["all"],"storageVersionHash":"0/CO1lhkEBI="},{"name":"services/proxy","singularName":"","namespaced":true,"kind":"ServiceProxyOptions","verbs":["create","delete","get","patch","update"]},{"name":"services/status","singularName":"","namespaced":true,"kind":"Service","verbs":["get","patch","update"]}]}
Expected results:
Normal oc requests should work.
Additional info:
There is no such issue for clusters with openshift-sdn on the same OpenShift version and the same LoadBalancer service. We suspected it might be related to the MTU setting, but this cannot explain why OpenShiftSDN works well. Another possibly related difference is that OpenShiftSDN uses iptables for service load balancing while OVN handles it within OVN services.
Please let me know if any debug log/info is needed.
Description of problem:
If a master fails and is drained, the old copy of the metal3 pod gets stuck in Terminating state for some (possibly long) time. While the new pod works correctly, CBO expects only one pod to exist and thus cannot determine the applicable Ironic IP address.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. On dev-scripts: virsh destroy <VM with metal3 pod>
2. Wait for drain to happen or trigger it manually
3. Check CBO logs
Actual results:
"unable to determine Ironic's IP to pass to the machine-image-customization-controller: there should be only one pod listed for the given label"
Expected results:
CBO reconfigures its pods with the new Ironic IP
Additional info:
I don't know how to filter out pods in Terminating state...
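One possible way to filter them (a sketch, not part of the original report): pods in Terminating state are exactly those with metadata.deletionTimestamp set, which field selectors cannot match, so they can be excluded client-side:
```
# List only pods that are NOT terminating; add the metal3 pod's label selector as needed
oc get pods -n openshift-machine-api -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp == null) | .metadata.name'
```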
Package golang.org/x/text/language has a vulnerability that could cause a denial of service (the vulnerable code is included in version v0.3.7 of golang.org/x/text).
This is a clone of issue OCPBUGS-3280. The following is the description of the original issue:
—
I have a script that does continuous installs using AGENT_E2E_TEST_SCENARIO=COMPACT_IPV4, just starting a new install after the previous one completes. What I'm seeing is that eventually I end up getting installation failures due to the container-images-available validation failure. What gets logged in wait-for bootstrap-complete is:
level=debug msg=Host master-0: New image status quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6. result: failure.
level=debug msg=Host master-0: validation 'container-images-available' that used to succeed is now failing
level=debug msg=Host master-0: updated status from preparing-for-installation to preparing-failed (Host failed to prepare for installation due to following failing validation(s): Failed to fetch container images needed for installation from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6. This may be due to a network hiccup. Retry to install again. If this problem persists, check your network settings to make sure you’re not blocked. ; Host couldn't synchronize with any NTP server)
Sometimes the image gets loaded onto the other masters OK and sometimes there are failures with more than one host. In either case the install stalls at this point.
When using a disconnected environment (MIRROR_IMAGES=true) I don't see this occurring.
Containers on host0
[core@master-0 ~]$ sudo podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00a0eebb989c localhost/podman-pause:4.2.0-1661537366 11 hours ago Up 11 hours ago cef65dd7f170-infra
5d0eced94979 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:caa73897dcb9ff6bc00a4165f4170701f4bd41e36bfaf695c00461ec65a8d589 /bin/bash start_d... 11 hours ago Up 11 hours ago assisted-db
813bef526094 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:caa73897dcb9ff6bc00a4165f4170701f4bd41e36bfaf695c00461ec65a8d589 /assisted-service 11 hours ago Up 11 hours ago service
edde1028a542 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e43558e28be8fbf6fe4529cf9f9beadbacbbba8c570ecf6cb81ae732ec01807f next_step_runner ... 11 hours ago Up 11 hours ago next-step-runner
Some relevant logs from assisted-service for this container image:
time="2022-11-03T01:48:44Z" level=info msg="Submitting step <container-image-availability> id <container-image-availability-b72665b1> to infra_env <17c8b837-0130-4b8c-ad06-19bcd2a61dbf> host <df170326-772b-43b5-87ef-3dfff91ba1a9> Arguments: <[{\"images\":[\"registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca122ab3a82dfa15d72a05f448c48a7758a2c7b0ecbb39011235bcf0666fbc15\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6\",\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e52a45b47cd9d70e7378811f4ba763fd43ec2580378822286c7115fbee6ef3a\"],\"timeout\":960}]>" func=github.com/openshift/assisted-service/internal/host/hostcommands.logSteps file="/src/internal/host/hostcommands/instruction_manager.go:285" go-id=841 host_id=df170326-772b-43b5-87ef-3dfff91ba1a9 infra_env_id=17c8b837-0130-4b8c-ad06-19bcd2a61dbf pkg=instructions request_id=47cc221f-4f47-4d0d-8278-c0f5af933567
time="2022-11-03T01:49:35Z" level=error msg="Received step reply <container-image-availability-9788cfa7> from infra-env <17c8b837-0130-4b8c-ad06-19bcd2a61dbf> host <845f1e3c-c286-4d2f-ba92-4c5cab953641> exit-code <2> stderr <> stdout <{\"images\":[
{\"name\":\"registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451\",\"result\":\"success\"},{\"download_rate\":159.65409925994226,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ca122ab3a82dfa15d72a05f448c48a7758a2c7b0ecbb39011235bcf0666fbc15\",\"result\":\"success\",\"size_bytes\":523130669,\"time\":3.276650405},{\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6\",\"result\":\"failure\"},{\"download_rate\":278.8962416008878,\"name\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e52a45b47cd9d70e7378811f4ba763fd43ec2580378822286c7115fbee6ef3a\",\"result\":\"success\",\"size_bytes\":402688178,\"time\":1.443863767}]}>" func=github.com/openshift/assisted-service/internal/bminventory.logReplyReceived file="/src/internal/bminventory/inventory.go:3287" go-id=845 host_id=845f1e3c-c286-4d2f-ba92-4c5cab953641 infra_env_id=17c8b837-0130-4b8c-ad06-19bcd2a61dbf pkg=Inventory request_id=3a571ba6-5175-4bbe-b89a-20cdde30b884
time="2022-11-03T01:49:35Z" level=info msg="Adding new image status for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f6ddae72f6d730ca07a265691401571a8d8f7e62546f1bcda26c9a01628f4d6 with status failure to host 845f1e3c-c286-4d2f-ba92-4c5cab953641" func="github.com/openshift/assisted-service/internal/host.(*Manager).UpdateImageStatus" file="/src/internal/host/host.go:805" pkg=host-state
This is a clone of issue OCPBUGS-5891. The following is the description of the original issue:
—
Description of problem:
When used in heads-only mode, oc-mirror does not record the operator bundle's minimum version if a target name is set. The recorded value ensures that bundles that still exist in the catalog are included in the generated catalog and that the associated images are not pruned. As a result, bundles are pruned when no minimum version is set in the imageset configuration, even though the bundles still exist in the source catalog.
Version-Release number of selected component (if applicable):
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.13.0-202212011938.p0.g8bf1402.assembly.stream-8bf1402", GitCommit:"8bf14023aa018e12425e29993e6f53f0ab07e6ab", GitTreeState:"clean", BuildDate:"2022-12-01T19:56:31Z", GoVersion:"go1.18.4", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
100%
Steps to Reproduce:
Using the advanced-cluster-management package as an example:
1. Find the latest bundle for acm in the release-2.6 channel with:
oc-mirror list operators --catalog registry.redhat.io/redhat/redhat-operator-index:v4.10-1663021232 --package advanced-cluster-management
2. Create an imageset configuration to mirror an operator from an older catalog version:
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  local:
    path: test
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10-1663021232
    targetName: test
    targetTag: test
    packages:
    - name: advanced-cluster-management
      channels:
      - name: release-2.6
3. Run oc-mirror --config config-with-operators.yaml file://
4. Check the bundle minimum version in the metadata using oc-mirror describe mirror_seq1_000000.tar; under the operators field, advanced-cluster-management should show the version found in step 1.
5. Create another ImageSetConfiguration for a later version of the catalog:
apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  local:
    path: test
mirror:
  operators:
  - catalog: registry.redhat.io/redhat/redhat-operator-index:v4.10
    targetName: test
    targetTag: test
    packages:
    - name: advanced-cluster-management
      channels:
      - name: release-2.6
6. Check the bundle minimum version in the metadata using oc-mirror describe mirror_seq2_000000.tar under the operators field.
Actual results:
The catalog entry in the metadata shows packages as null.
Expected results:
It should have the advanced-cluster-management package with the original minimum version, or an updated minimum version if the original bundle was pruned.
This is a clone of issue OCPBUGS-3713. The following is the description of the original issue:
—
This is essentially the same issue in OCPBUGS-3668 where we found the username must be fully qualified (e.g. "rbost@vsphere.local" not just "rbost").
This is a clone of issue OCPBUGS-6799. The following is the description of the original issue:
—
Description of problem:
The Pipelines -> Repositories list view in the Dev Console does not show the running pipelinerun as the last pipelinerun in the table.
Original BugZilla Link: https://bugzilla.redhat.com/show_bug.cgi?id=2016006
OCPBUGSM: https://issues.redhat.com/browse/OCPBUGSM-36408
Description of problem:
GCP XPN is in tech preview. There are two features which are affected:
1. Selecting a DNS zone from a different project should only be allowed if tech preview is enabled in the install config. (Using a DNS zone from a different project will fail to install due to outstanding work in the cluster ingress operator.)
2. GCP XPN passes through the installer host service account for control plane nodes. This should only happen if XPN (networkProjectID) is enabled. It should not happen during normal installs.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
For install-config fields:
1. Specify a project ID for a DNS zone without featureSet: TechPreviewNoUpgrade
2. Run openshift-install create manifests
For service accounts:
1. Perform a normal (not XPN) install
2. Check the service account on a control plane VM
Actual results:
For install-config fields: you can specify a project ID without an error.
For service accounts: the control plane VM will have the same service account used for the install.
Expected results:
For install-config fields: the installer should complain that tech preview is not enabled.
For service accounts: the VM should have a new service account, created during the install.
Additional info:
Not all information provided in the install-config gets passed through to assisted-service.
An example is that platform settings other than the VIPs are ignored. So are the "capabilities". There may be others - we need to do a thorough audit.
If the user supplies data that we then ignore, we should log a warning. However, we must not return an error, because this may prevent people from using their existing install-configs with the agent install method.
This is a clone of issue OCPBUGS-3032. The following is the description of the original issue:
—
If installation fails at an early stage (e.g. pulling release images, configuring hosts, waiting for agents to come up) there is no indication that anything has gone wrong, and the installer binary may not even be able to connect.
We should at least display what is happening on the console so that users have some avenue to figure out for themselves what is going on.
This is a clone of issue OCPBUGS-1125. The following is the description of the original issue:
—
(originally reported in BZ as https://bugzilla.redhat.com/show_bug.cgi?id=1983200)
test:
[sig-etcd][Feature:DisasterRecovery][Disruptive] [Feature:EtcdRecovery] Cluster should restore itself after quorum loss [Serial]
is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-etcd%5C%5D%5C%5BFeature%3ADisasterRecovery%5C%5D%5C%5BDisruptive%5C%5D+%5C%5BFeature%3AEtcdRecovery%5C%5D+Cluster+should+restore+itself+after+quorum+loss+%5C%5BSerial%5C%5D
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1413625606435770368
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.8/1415075413717159936
—
some brief triaging from Thomas Jungblut on:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.11/1568747321334697984
It seems the last guard pod doesn't come up; the etcd operator installs this properly and the revision installer also does not report any errors. It just doesn't progress to the latest revision. At first glance it doesn't look like an issue with etcd itself, but it needs a closer look for sure.
We should deprecate and eventually remove react-helmet as a shared plugin dependency. This dependency is small, and plugins can bring their own version if needed.
This requires updating our webpack plugin to allow dependency fallbacks when a shared dependency is not present.
AC:
Description of problem:
TestUnmanagedDNSToManagedDNSInternalIngressController E2E test is failing with the error: unmanaged_dns_test.go:272: failed to verify connectivity with workload with reqURL http://10.0.128.7 using external client: timed out waiting for the condition
Version-Release number of selected component (if applicable):
4.12
How reproducible:
75%
Steps to Reproduce:
1. Run CI E2E tests on cluster-ingress-operator or make test-e2e TEST=TestUnmanagedDNSToManagedDNSInternalIngressController
Actual results:
E2E test fails about 75% of the time
Expected results:
E2E should always pass
Additional info:
Description of problem:
Found during 1.25 rebase work; the test hit this panic in two runs of 4.12-e2e-vsphere-ovn-upi-serial:
Full error for reference:
```
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:107 +0x96
panic({0x766b520, 0xc183570})
runtime/panic.go:838 +0x207
k8s.io/kubernetes/test/e2e/network.glob..func15.4()
k8s.io/kubernetes@v1.24.0/test/e2e/network/ingressclass.go:97 +0x284
github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0x300000002?)
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113 +0xb1
github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0xc002466e40?)
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:64 +0x125
github.com/onsi/ginkgo/internal/leafnodes.(*ItNode).Run(0x7f72ca69cfff?)
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/it_node.go:26 +0x7b
github.com/onsi/ginkgo/internal/spec.(*Spec).runSample(0xc003305b30, 0xc00066b208?, {0x8faff00, 0xc00045edc0})
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/spec/spec.go:215 +0x28a
github.com/onsi/ginkgo/internal/spec.(*Spec).Run(0xc003305b30, {0x8faff00, 0xc00045edc0})
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/spec/spec.go:138 +0xe7
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec(0xc002480280, 0xc003305b30)
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:200 +0xe8
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs(0xc002480280)
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:170 +0x1a5
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run(0xc002480280)
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/specrunner/spec_runner.go:66 +0xc5
github.com/onsi/ginkgo/internal/suite.(*Suite).Run(0xc0004762d0, {0x8fb0260, 0xc002ba2690}, {0x0, 0x0}, {0xc002bb8600, 0x1, 0x1}, {0x8ff18e0, 0xc00045edc0}, ...)
github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/suite/suite.go:62 +0x4b2
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc0024b28c0, {0xc000311420, 0xc58c8b0?, 0x4f19d80?})
github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:448 +0x32
github.com/openshift/origin/test/extended/util.WithCleanup(0xc002527bb8)
github.com/openshift/origin/test/extended/util/test.go:168 +0xad
main.newRunTestCommand.func1(0xc0024cc780?, {0xc000311420, 0x1, 0x1})
github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:448 +0x325
github.com/spf13/cobra.(*Command).execute(0xc0024cc780, {0xc0003113a0, 0x1, 0x1})
github.com/spf13/cobra@v1.4.0/command.go:856 +0x67c
github.com/spf13/cobra.(*Command).ExecuteC(0xc000c3fb80)
github.com/spf13/cobra@v1.4.0/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
github.com/spf13/cobra@v1.4.0/command.go:902
main.main.func1(0xc000de1700?)
github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:94 +0x8a
main.main()
github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:95 +0x476
fail [runtime/panic.go:220]: Test Panicked: runtime error: invalid memory address or nil pointer dereference
Ginkgo exit error 1: exit with code 1
```
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Test hit panic
Expected results:
No panic
Additional info:
This is a clone of issue OCPBUGS-8342. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-8258. The following is the description of the original issue:
—
Invoking 'create cluster-manifests' fails when imageContentSources is missing in install-config yaml:
$ openshift-install agent create cluster-manifests
INFO Consuming Install Config from target directory
FATAL failed to write asset (Mirror Registries Config) to disk: failed to write file: open .: is a directory
install-config.yaml:
apiVersion: v1alpha1
metadata:
  name: appliance
rendezvousIP: 192.168.122.116
hosts:
  - hostname: sno
    installerArgs: '["--save-partlabel", "agent*", "--save-partlabel", "rhcos-*"]'
    interfaces:
      - name: enp1s0
        macAddress: 52:54:00:e7:05:72
    networkConfig:
      interfaces:
        - name: enp1s0
          type: ethernet
          state: up
          mac-address: 52:54:00:e7:05:72
          ipv4:
            enabled: true
            dhcp: true
Description of problem:
etcd and kube-apiserver pods get restarted due to failed liveness probes while deleting/re-creating pods on SNO
Version-Release number of selected component (if applicable):
4.10.32
How reproducible:
Not always, after ~10 attempts
Steps to Reproduce:
1. Deploy SNO with the Telco DU profile applied
2. Create multiple pods with local storage volumes attached (attaching yaml manifest)
3. Force delete and re-create the pods 10 times
Actual results:
etcd and kube-apiserver pods get restarted, making the cluster unavailable for a period of time
Expected results:
etcd and kube-apiserver do not get restarted
Additional info:
Attaching must-gather. Please let me know if any additional info is required. Thank you!
This is a clone of issue OCPBUGS-3316. The following is the description of the original issue:
—
Description of problem:
Branch name in repository pipelineruns list view should match the actual github branch name.
Version-Release number of selected component (if applicable):
4.11.z
How reproducible:
always
Steps to Reproduce:
1. Create a repository
2. Trigger pipelineruns via a push or pull request event on GitHub
Actual results:
The branch name contains a "refs-heads-" prefix in front of the actual branch name, e.g. "refs-heads-cicd-demo" (cicd-demo is the branch name).
Expected results:
The branch name should be the actual GitHub branch name; just `cicd-demo` should be shown in the branch column.
Additional info:
Ref: https://coreos.slack.com/archives/CHG0KRB7G/p1667564311865459
Description of problem:
[OVN][OSP] After reboot egress node, egress IP cannot be applied anymore.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-11-07-181244
How reproducible:
Happens frequently in automation, but could not be reproduced manually.
Steps to Reproduce:
1. Label one node as an egress node
2. Configure one egressIP object
STEP: Check one EgressIP assigned in the object.
Nov 8 15:28:23.591: INFO: egressIPStatus: [{"egressIP":"192.168.54.72","node":"huirwang-1108c-pg2mt-worker-0-2fn6q"}]
3. Reboot the node and wait for the node to become ready
Actual results:
The EgressIP cannot be applied anymore; waited more than 1 hour.
oc get egressip
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egressip-47031 192.168.54.72
Expected results:
The egressIP should be applied correctly.
Additional info:
Some logs:
E1108 07:29:41.849149 1 egressip.go:1635] No assignable nodes found for EgressIP: egressip-47031 and requested IPs: [192.168.54.72]
I1108 07:29:41.849288 1 event.go:285] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egressip-47031", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'NoMatchingNodeFound' no assignable nodes for EgressIP: egressip-47031, please tag at least one node with label: k8s.ovn.org/egress-assignable
W1108 07:33:37.401149 1 egressip_healthcheck.go:162] Could not connect to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107): context deadline exceeded
I1108 07:33:37.401348 1 master.go:1364] Adding or Updating Node "huirwang-1108c-pg2mt-worker-0-2fn6q"
I1108 07:33:37.437465 1 egressip_healthcheck.go:168] Connected to huirwang-1108c-pg2mt-worker-0-2fn6q (10.131.0.2:9107)
After this log entry, there seem to be no further logs related to "192.168.54.72".
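Per the event message above, re-applying the egress label is the obvious first check (a sketch using the node name from the reproducer, not a confirmed fix):
```
# Verify/re-apply the egress-assignable label named in the event, then re-check assignment
oc label node huirwang-1108c-pg2mt-worker-0-2fn6q k8s.ovn.org/egress-assignable="" --overwrite
oc get egressip
```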
This is a clone of issue OCPBUGS-3027. The following is the description of the original issue:
—
Description of problem:
When running the console in development mode per https://github.com/openshift/console#frontend-development, metrics do not load on the cluster overview, pods list page, pod details page (Metrics tab is missing), etc.
Samuel Padgett suspects the changes in https://github.com/openshift/console/commit/0bd839da219462ea585183de1c856fb60e9f96fb are related.
These tests are permafailing on webhook errors related to the CRD:
[sig-installer][Feature:baremetal][Serial] A baremetal deployment without a provisioning network should show the Provisioning Network as 'Disabled' [Suite:openshift/conformance/serial]
[sig-installer][Feature:baremetal][Serial] A baremetal deployment without a provisioning network should [apigroup:config.openshift.io] show the Provisioning Network as 'Disabled' [Suite:openshift/conformance/serial]
[sig-installer][Feature:baremetal][Serial] A baremetal deployment without a provisioning network should allow setting the ProvisioningNetwork to 'Managed' with valid settings [Suite:openshift/conformance/serial]
[sig-installer][Feature:baremetal][Serial] A baremetal deployment without a provisioning network should [apigroup:config.openshift.io] allow setting the ProvisioningNetwork to 'Managed' with valid settings [Suite:openshift/conformance/serial]
job=periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-virtualmedia=all
Sippy links:
Description of problem:
The setting of systemReserved: ephemeral-storage in KubeletConfig is not working as expected.
Version-Release number of selected component (if applicable):
4.10.z, may exist on other OCP versions as well.
How reproducible:
always
Steps to Reproduce:
1. Create a KubeletConfig on the node:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 500Mi
      ephemeral-storage: 10Gi
2. Check node allocatable storage with the command: oc describe node |grep -C 5 ephemeral-storage
Actual results:
The Allocatable:ephemeral-storage on the node is not capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)
Expected results:
The Allocatable:ephemeral-storage on the node should be capacity.ephemeral-storage - systemReserved.ephemeral-storage - eviction-thresholds (10% of the capacity.ephemeral-storage by default)
Additional info:
The root cause might be that the process argument '--system-reserved=cpu=500m,memory=500Mi' overrides the setting in /etc/kubernetes/kubelet.conf. One example:
root 6824 1 27 Sep30 ? 1-09:00:24 kubelet --config=/etc/kubernetes/kubelet.conf --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig --kubeconfig=/var/lib/kubelet/kubeconfig --container-runtime=remote --container-runtime-endpoint=/var/run/crio/crio.sock --runtime-cgroups=/system.slice/crio.service --node-labels=node-role.kubernetes.io/master,node.openshift.io/os_id=rhcos --node-ip=192.168.58.47 --minimum-container-ttl-duration=6m0s --cloud-provider= --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec --hostname-override= --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a7b6408460148cb73c59677dbc2c261076bc07226c43b0c9192cc70aef5ba62 --system-reserved=cpu=500m,memory=500Mi --v=2 --housekeeping-interval=30s
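To compare the two sides of the expected-results formula directly, a quick sketch (node name is a placeholder):
```
# Capacity vs. allocatable ephemeral storage on one node
oc get node <node-name> \
  -o jsonpath='{.status.capacity.ephemeral-storage}{"\n"}{.status.allocatable.ephemeral-storage}{"\n"}'
```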
Description of problem:
The E2E test "Installs Red Hat Integration - 3scale operator" is failing due to a change of the Operator name.
Description of problem:
During an upgrade from 4.12.0 to 4.12.1 a customer has observed crashlooping ovn-master pods with the following error message:
$ oc logs -n openshift-ovn-kubernetes ovnkube-master-bx99r -c ovnkube-master --tail=20 -p
: Transaction causes multiple rows in "IGMP_Group" table to have identical values (mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and []) for index on columns "address", "datapath", and "chassis". First row, with UUID 7e9a18fa-e58c-4547-a7cb-afa934b6cdc9, had the following index values before the transaction: mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and d9755997-e909-4d0c-8770-82a902d69a90. Second row, with UUID 84da3622-3ac7-41f0-a6b5-536a2d5f9137, had the following index values before the transaction: mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and 578d4dd9-cc02-4bcc-8a9c-08dcc3a94190. UUID:{GoUUID:} Rows:[]}] and errors []: constraint violation: Transaction causes multiple rows in "IGMP_Group" table to have identical values (mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and []) for index on columns "address", "datapath", and "chassis". First row, with UUID 7e9a18fa-e58c-4547-a7cb-afa934b6cdc9, had the following index values before the transaction: mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and d9755997-e909-4d0c-8770-82a902d69a90. Second row, with UUID 84da3622-3ac7-41f0-a6b5-536a2d5f9137, had the following index values before the transaction: mrouters, 038b16fa-6aba-4244-9d4f-00a1e2cbf9a2, and 578d4dd9-cc02-4bcc-8a9c-08dcc3a94190.
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Unknown
Steps to Reproduce:
1. Upgrade from 4.12.0 to 4.12.1
Actual results:
crashlooping ovnkube-master pods
Expected results:
functional ovnkube-master pods
Additional info:
This cluster was upgraded from 4.11 to 4.12.0 then to 4.12.1. The attached case has a must-gather.
Tracker bug for bootimage bump in 4.12. This bug should block bugs which need a bootimage bump to fix.
The previous tracker is OCPBUGS-561.
Description of problem:
Seems ART is having trouble building OLM images: https://redhat-internal.slack.com/archives/CB95J6R4N/p1676531421724929
I've already fixed master:
* https://github.com/openshift/cluster-policy-controller/pull/103
* https://github.com/openshift/cluster-policy-controller/pull/101
Need a bug to backport...
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Currently, the AWS actuator has a static list of instance types embedded in it. This means that as new instance types are added, we have to continually update this list.
Ideally, we could fetch this information from the AWS API as we do in GCP.
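For illustration, the data the static list encodes (vCPU and memory per instance type) is available from the API; a sketch with the AWS CLI, instance type chosen arbitrarily:
```
# Fetch vCPU and memory for an instance type instead of hardcoding it
aws ec2 describe-instance-types --instance-types m6i.large \
  --query 'InstanceTypes[].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' \
  --output table
```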
DoD:
This is a clone of issue OCPBUGS-4026. The following is the description of the original issue:
—
Description of problem:
There is an endless re-render loop, and the browser feels slow and eventually gets stuck when opening the add page or the topology.
Also saw endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
Version-Release number of selected component (if applicable):
1. Console UI 4.12-4.13 (master)
2. Service Binding Operator (tested with 1.3.1)
How reproducible:
Always with installed SBO
But the "stuck feeling" depends on the browser (Firefox feels more stuck) and your locale machine power
Steps to Reproduce:
1. Install Service Binding Operator
2. Create or update the BindableKinds resource "bindable-kinds"
apiVersion: binding.operators.coreos.com/v1alpha1
kind: BindableKinds
metadata:
  name: bindable-kinds
3. Open the browser console log
4. Open the console UI and navigate to the add page
Actual results:
1. Saw endless API calls to /api/kubernetes/apis/binding.operators.coreos.com/v1alpha1/bindablekinds/bindable-kinds
2. Browser feels slow and get stuck after some time
3. The page crashes after some time
Expected results:
1. The API call should be called just once
2. The add page should just work without feeling laggy
3. No crash
Additional info:
This was introduced after we started watching the bindable-kinds resource with https://github.com/openshift/console/pull/11161
It looks like this happens only if the SBO is installed and the bindable-kinds resource exists but doesn't contain any status.
The status lists all available bindable resource types. I could not reproduce this by installing and uninstalling an operator, but you can manually create or update this resource as mentioned above.
This is a clone of issue OCPBUGS-7780. The following is the description of the original issue:
—
Description of problem:
4.9 and 4.10 oc calls to oc adm upgrade channel ... for 4.11+ clusters would clear spec.capabilities. Not all that many clusters try to restrict capabilities, but folks will need to bump their channel at least every other minor (if they're using EUS channels), and while we recommend folks use an oc from the 4.y they're heading towards, we don't have anything in place to enforce that.
Version-Release number of selected component (if applicable):
4.9 and 4.10 oc are exposed vs. the new-in-4.11 spec.capabilities. Newer oc could theoretically be exposed vs. any new ClusterVersion spec capabilities.
How reproducible:
100%
Steps to Reproduce:
1. Install a 4.11+ cluster with None capabilities.
2. Set the channel with a 4.10.51 oc, like oc adm upgrade channel fast-4.11.
3. Check the capabilities with oc get -o json clusterversion version | jq -c .spec.capabilities.
Actual results:
null
Expected results:
{"baselineCapabilitySet":"None"}
This is a clone of issue OCPBUGS-8702. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-8523. The following is the description of the original issue:
—
Description of problem:
Due to an rpm-ostree regression (OKD-63), MCO was copying /var/lib/kubelet/config.json into /run/ostree/auth.json on FCOS and SCOS. This breaks the Assisted Installer flow, which starts with a live ISO and doesn't have /var/lib/kubelet/config.json.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Deployed a hypershift cluster with a recent multi-arch build. The storage cluster operator has become available but shows the warning message below:
PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/attacher_role.yaml" (string): clusterroles.rbac.authorization.k8s.io "ibm-powervs-block-external-attacher-role" is forbidden: user "system:serviceaccount:openshift-cluster-csi-drivers:powervs-block-csi-driver-operator" (groups=["system:serviceaccounts" "system:serviceaccounts:openshift-cluster-csi-drivers" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:
PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: {APIGroups:["csi.storage.k8s.io"], Resources:["csinodeinfos"], Verbs:["get" "list" "watch"]}
PowerVSBlockCSIDriverOperatorCRDegraded: PowerVSBlockCSIDriverStaticResourcesControllerDegraded: "rbac/attacher_binding.yaml" (string): clusterroles.rbac.authorization.k8s.io "ibm-powervs-block-external-attacher-role" not found
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy the 4.12.0-0.nightly-multi-2022-09-01-220105 nightly build
Actual results:
Expected results:
Additional info:
Description of problem:
The pod fails to mount the PVC using IBM Cloud VPC block storage.
Version-Release number of selected component (if applicable):
How reproducible:
The steps can be followed from this link:
https://cloud.ibm.com/docs/openshift?topic=openshift-vpc-block
The error occurs when the application pod tries to mount the VPC volume.
Steps to Reproduce:
Described above.
Actual results:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 26m default-scheduler Successfully assigned default/test to a100-huge-m25p7-worker-3-with-secondary-xdwvl
Normal SuccessfulAttachVolume 26m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba"
Warning FailedMount 26m (x2 over 26m) kubelet MountVolume.MountDevice failed for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba" : rpc error: code = Internal desc = {RequestID: ffbb97b4-e4d0-4016-87a9-dc46f80c5478 , Code: FormatAndMountFailed, Description: Failed to format '/dev/disk/by-id/virtio-0777-6872e22d-5c00-4' and mount it at '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount', BackendError: format of disk "/dev/disk/by-id/virtio-0777-6872e22d-5c00-4" failed: type"ext4") target"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount") options"defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)
The file /dev/disk/by-id/virtio-0777-6872e22d-5c00-4 does not exist and no size was specified.
) , Action: Please check if there is any error in POD describe related with volume attach}
Warning FailedMount 22m kubelet Unable to attach or mount volumes: unmounted volumes=[bs-pvc], unattached volumes=[kube-api-access-6bgvj bs-pvc]: timed out waiting for the condition
Warning FailedMount 4m11s (x9 over 24m) kubelet Unable to attach or mount volumes: unmounted volumes=[bs-pvc], unattached volumes=[bs-pvc kube-api-access-6bgvj]: timed out waiting for the condition
Warning FailedMount 3m51s (x17 over 26m) kubelet MountVolume.MountDevice failed for volume "pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba" : rpc error: code = Internal desc = {RequestID: 1a12a7c5-3bd0-41cf-b8a9-90dd3224c2fb , Code: FormatAndMountFailed, Description: Failed to format '/dev/disk/by-id/virtio-0777-6872e22d-5c00-4' and mount it at '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount', BackendError: format of disk "/dev/disk/by-id/virtio-0777-6872e22d-5c00-4" failed: type"ext4") target"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8721c341-739d-4607-bbcb-9dcf66ef6dba/globalmount") options"defaults") errcode:(exit status 1) output:(mke2fs 1.45.6 (20-Mar-2020)
Expected results:
The pod should successfully mount the PVC
Additional info:
Had a debugging session with Sameer Shaikh and Arashad Ahamad from the IBM VPC block storage team. The conclusion is that the udevadm utility is missing in the IPI image used by the IBM Cloud VPC block storage CSI driver.
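A quick way to confirm the missing binary on a worker (a sketch; the node name is a placeholder):
```
# Check whether udevadm exists in the node image
oc debug node/<worker-node> -- chroot /host sh -c 'command -v udevadm || echo "udevadm missing"'
```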
Description of problem:
When deleting a BYOH node in Platform:none, as well as in an Azure IPI cluster, the node gets reconciled correctly; however, when added back to the cluster it stays in Ready,SchedulingDisabled. When checking the WMCO logs, we can observe the following log:
{"level":"error","ts":"2022-12-14T16:14:31Z","msg":"Reconciler error","controller":"configmap","controllerGroup":"","controllerKind":"ConfigMap","configMap":{"name":"windows-instances","namespace":"openshift-windows-machine-config-operator"},"namespace":"openshift-windows-machine-config-operator","name":"windows-instances","reconcileID":"d66a3142-d52c-43f5-8a42-214ce9c88417","error":"error configuring host with address 10.0.55.21: configuring node network failed: error waiting for k8s.ovn.org/hybrid-overlay-node-subnet node annotation for byoh-2019: timeout waiting for k8s.ovn.org/hybrid-overlay-node-subnet node annotation: timed out waiting for the condition"
And when checking the node's annotations, the k8s.ovn.org/hybrid-overlay-node-subnet annotation is indeed missing:
$ oc get nodes byoh-2019 -o=jsonpath="{.metadata.annotations}"
{"volumes.kubernetes.io/controller-managed-attach-detach":"true","windowsmachineconfig.openshift.io/desired-version":"7.0.0-16f486a","windowsmachineconfig.openshift.io/pub-key-hash":"1df2c166b1c401180523270e9cf6bc2cd2724b9279ea65668a3b95298525a0f5","windowsmachineconfig.openshift.io/username":"wx4EBwMICL6qT+4RY8tgbx4hiRmQdHlwUsHgVGCTVY7S5gG/G5gb/Wzv0JBLhNP9\u003cwmcoMarker\u003ejlmI5ExHPYFrd2Fw6Lxe/6PKEE5/vYAhZ2n1Z2nBIoa1xN1/HEaXhqR2CuXNe7Ez\u003cwmcoMarker\u003eg2Hg+gA=\u003cwmcoMarker\u003e=ubWA"}
Tested in Azure IPI and Platform:None; in both cases the issue was reproduced.
Version-Release number of selected component (if applicable):
$ oc get cm -n openshift-windows-machine-config-operator
NAME DATA AGE
kube-root-ca.crt 1 10h
openshift-service-ca.crt 1 10h
windows-instances 2 9h
windows-machine-config-operator-lock 0 6h24m
windows-services-7.0.0-16f486a 2 6h23m
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-rc.4 True False 6h48m Cluster version is 4.12.0-rc.4
How reproducible:
Steps to Reproduce:
1. Deploy an OCP 4.11 cluster with WMCO 6.0.0
2. Add one or two byoh nodes to the cluster
3. Upgrade the cluster to OCP 4.12, and later WMCO to 7.0.0
4. Remove one of the byoh nodes using: oc delete node <byoh-node-id>
5. Wait for reconciliation to bring the node back
Actual results:
The deleted node gets re-added but stays in Ready,SchedulingDisabled, and the workloads are left in Pending state.
Expected results:
The node gets properly added to the cluster and stays in Ready.
Additional info:
This is a clone of issue OCPBUGS-7374. The following is the description of the original issue:
—
Originally reported by lance5890 in issue https://github.com/openshift/cluster-etcd-operator/issues/1000
The controllers sometimes get stuck listing members in failure scenarios; this is known and can be mitigated by simply restarting the CEO.
A similar BZ, 2093819, with stuck controllers was fixed in a slightly different way in https://github.com/openshift/cluster-etcd-operator/commit/4816fab709e11e0681b760003be3f1de12c9c103
This fix was contributed by lance5890, thanks a lot!
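The mitigation mentioned above (restarting the CEO) amounts to deleting the operator pod; the label selector here is an assumption about the operator deployment, not taken from the original report:
```
# Restart the cluster-etcd-operator pod (label assumed)
oc -n openshift-etcd-operator delete pod -l app=etcd-operator
```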
This is a clone of issue OCPBUGS-5542. The following is the description of the original issue:
—
Description of problem:
The project list orders projects by name and is smart enough to keep a "numerical order" like:
The more prominent project dropdown is not so smart and shows just a simple "ASCII-ordered" list:
Version-Release number of selected component (if applicable):
4.8-4.13 (master)
How reproducible:
Always
Steps to Reproduce:
1. Create some new projects called test-1, test-11, test-2
2. Check the project list page (in admin perspective)
3. Check the project dropdown (in dev perspective)
Actual results:
Order is
Expected results:
Order should be
Additional info:
none
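For comparison, the "numerical order" the project list already uses corresponds to a natural/version sort, e.g. with GNU sort:
```
printf 'test-1\ntest-11\ntest-2\n' | sort -V
# test-1
# test-2
# test-11
```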
Description of problem:
In OCP 4.9, the package-server-manager was introduced to manage the packageserver CSV. However, when OCP 4.8 is upgraded to 4.9, the packageserver stays stuck at v0.17.0, which is the version in OCP 4.8, and v0.18.3, which is the version in OCP 4.9, does not roll out.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.8
2. Upgrade to OCP 4.9
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2022-08-31-160214 True True 50m Working towards 4.9.47: 619 of 738 done (83% complete)
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.47 True False 4m26s Cluster version is 4.9.47
Actual results:
Check the packageserver CSV. It's at v0.17.0:
$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
packageserver Package Server 0.17.0 Succeeded
Expected results:
packageserver CSV is at 0.18.3
Additional info:
packageserver CSV version in 4.8: https://github.com/openshift/operator-framework-olm/blob/release-4.8/manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml#L12 packageserver CSV version in 4.9: https://github.com/openshift/operator-framework-olm/blob/release-4.9/pkg/manifests/csv.yaml#L8
Description of problem:
The user mirrored the 4.11.0 release and attempted to use it to generate the installation ISO in a completely disconnected environment. When it was the turn for extracting the OS image from machine-os-images, the agent-based installer ran:
oc adm release info --image-for=machine-os-images --insecure=true quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4 --registry-config=/tmp/registry-config1141450352
This does not include the --icsp-file, and thus the image reference cannot be retrieved to perform the extraction.
Version-Release number of selected component (if applicable):
https://github.com/openshift/installer/releases/tag/agent-installer-v4.11.0-dev-preview-2
How reproducible:
100%
Steps to Reproduce:
1. Mirror the 4.11.0 images to the local registry using the oc adm mirror command.
2. Create install-config.yaml with the mirror config.
3. Create agent-config.yaml.
4. openshift-install-sep1 agent create image --dir kni-22
Actual results:
INFO[0001] Start configuring static network for 3 hosts  pkg=manifests
INFO[0002] Adding NMConnection file <bond0.nmconnection>  pkg=manifests
INFO[0002] Adding NMConnection file <eno49.nmconnection>  pkg=manifests
INFO[0002] Adding NMConnection file <eno50.nmconnection>  pkg=manifests
INFO[0003] Adding NMConnection file <bond0.nmconnection>  pkg=manifests
INFO[0003] Adding NMConnection file <eno49.nmconnection>  pkg=manifests
INFO[0003] Adding NMConnection file <eno50.nmconnection>  pkg=manifests
INFO[0004] Adding NMConnection file <bond0.nmconnection>  pkg=manifests
INFO[0004] Adding NMConnection file <eno49.nmconnection>  pkg=manifests
INFO[0004] Adding NMConnection file <eno50.nmconnection>  pkg=manifests
DEBUG Fetching BaseIso Image...
DEBUG Fetching Agent Manifests...
DEBUG Reusing previously-fetched Agent Manifests
DEBUG Fetching Install Config...
DEBUG Reusing previously-fetched Install Config
DEBUG Fetching Mirror Registries Config...
DEBUG Reusing previously-fetched Mirror Registries Config
DEBUG Generating BaseIso Image...
INFO[0004] Extracting base ISO from release payload
ERRO[0014] command 'oc adm release info --image-for=machine-os-images --insecure=true quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4 --registry-config=/tmp/registry-config1141450352' exited with non-zero exit code 1: error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4: Get "http://quay.io/v2/": dial tcp: lookup quay.io on 10.92.86.56:53: server misbehaving
WARN[0014] Failed to extract base ISO from release payload - check registry configuration
INFO[0014] Downloading base ISO
DEBUG Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/releases/rhcos-4.11/411.86.202207150124-0/x86_64/rhcos-411.86.202207150124-0-live.x86_64.iso'
ERROR failed to write asset (Agent Installer ISO) to disk: image reader not available
FATAL failed to fetch Agent Installer ISO: failed to fetch dependency of "Agent Installer ISO": failed to generate asset "BaseIso Image": failed to get base ISO image: command 'oc adm release info --image-for=machine-os-images --insecure=true quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4 --registry-config=/tmp/registry-config1141450352' exited with non-zero exit code 1:
FATAL error: unable to read image quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4: Get "http://quay.io/v2/": dial tcp: lookup quay.io on 10.92.86.56:53: server misbehaving
FATAL
Expected results:
Image correctly generated
Additional info:
Host OS: RHEL 8.4
NMstate version: nmstate-1.0.2-5.el8.noarch
Description of problem:
AWS tagging: when applying user-defined tags, you cannot add more than 10.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Configure userTags for the aws platform with more than 8 tags.
2. The installer fails to add the tags, while AWS supports up to 50 tags.
Actual results:
Installer validation fails.
Expected results:
Installer should be able to add more than 8 tags.
Additional info:
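For reference, a minimal install-config excerpt of the kind used to hit this limit (the tag keys and values here are illustrative, not from the original report):
platform:
  aws:
    region: us-east-2
    userTags:
      team: storage
      costCenter: "1234"
      environment: dev
      owner: qe
      # ...more keys; AWS itself allows up to 50 tags per resource,
      # so the installer validation, not AWS, is the limiting factor here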
Description of problem:
The ingress operator emits a badly formatted log message from the handleSingleNode4Dot11Upgrade function during startup.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
100%
Steps to Reproduce:
1. Install a 4.10 single-node cluster
2. Upgrade to 4.11
Actual results:
Ingress operator prints badly formatted log message
Expected results:
Ingress operator prints correctly formatted log message
Additional info:
GitHub rate-limit failures in the UPI image when downloading govc.
This is a clone of issue OCPBUGS-7973. The following is the description of the original issue:
—
Description of problem:
After the private cluster is destroyed, the cluster's DNS records are left behind.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-02-26-022418 4.13.0-0.nightly-2023-02-26-081527
How reproducible:
always
Steps to Reproduce:
1. Create a private cluster.
2. Destroy the cluster.
3. Check the DNS records:
$ ibmcloud dns zones | grep private-ibmcloud.qe.devcluster.openshift.com (base_domain)
3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b private-ibmcloud.qe.devcluster.openshift.com PENDING_NETWORK_ADD
$ zone_id=3c7af30d-cc2c-4abc-94e1-3bcb36e01a9b
$ ibmcloud dns resource-records $zone_id
CNAME:520c532f-ca61-40eb-a04e-1a2569c14a0b api-int.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com CNAME 60 10a7a6c7-jp-tok.lb.appdomain.cloud
CNAME:751cf3ce-06fc-4daf-8a44-bf1a8540dc60 api.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com CNAME 60 10a7a6c7-jp-tok.lb.appdomain.cloud
CNAME:dea469e3-01cd-462f-85e3-0c1e6423b107 *.apps.ci-op-wkb4fgd6-eef7e.private-ibmcloud.qe.devcluster.openshift.com CNAME 120 395ec2b3-jp-tok.lb.appdomain.cloud
Actual results:
The DNS records of the cluster were left behind.
Expected results:
All DNS records created by the installer are deleted after the cluster is destroyed.
Additional info:
This blocks creating private clusters later, because the maximum limit of 5 wildcard records is easily reached (a QE account limitation). Checking the ingress-operator.log of the failed cluster shows the error: "createOrUpdateDNSRecord: failed to create the dns record: Reached the maximum limit of 5 wildcard records."
Description of problem:
catsrc is not ready due to "compute digest: compute hash: write tar: open /tmp/cache/cache: permission denied"
Version-Release number of selected component (if applicable):
zhaoxia@xzha-mac test % ../bin/opm version
Version: version.Version{OpmVersion:"b94e073b5", GitCommit:"b94e073b5187ecaa687c322beccf76f1d1f26d54", BuildDate:"2022-08-29T06:30:05Z", GoOs:"darwin", GoArch:"amd64"}
zhaoxia@xzha-mac test % oc exec catalog-operator-79d885b755-6cnbp -- olm --version
OLM version: 0.19.0
git commit: dfa7f0e70578432117e63867706630cda5366fb7
How reproducible:
always
Steps to Reproduce:
1. Generate the index image:
zhaoxia@xzha-mac test % mkdir catalog
zhaoxia@xzha-mac test % ../bin/opm generate dockerfile catalog
zhaoxia@xzha-mac test % cat catalog.Dockerfile
# The base image is expected to contain
# /bin/opm (with a serve subcommand) and /bin/grpc_health_probe
FROM quay.io/operator-framework/opm:latest

# Configure the entrypoint and command
ENTRYPOINT ["/bin/opm"]
CMD ["serve", "/configs", "--cache-dir=/tmp/cache"]

# Copy declarative config root into image at /configs and pre-populate serve cache
ADD catalog /configs
RUN ["/bin/opm", "serve", "/configs", "--cache-dir=/tmp/cache", "--cache-only"]

# Set DC-specific label for the location of the DC root directory
# in the image
LABEL operators.operatorframework.io.index.configs.v1=/configs
zhaoxia@xzha-mac test % docker build . -f catalog.Dockerfile -t quay.io/olmqe/nginxolm-operator-index:2726
zhaoxia@xzha-mac test % docker push quay.io/olmqe/nginxolm-operator-index:2726
2. Create the catsrc:
zhaoxia@xzha-mac test % cat catsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index
  namespace: test-1
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: quay.io/olmqe/nginxolm-operator-index:2726
  updateStrategy:
    registryPoll:
      interval: 10m
oc new-project test-1
oc apply -f catsrc.yaml
3. Check pod status:
zhaoxia@xzha-mac test % oc get pod
NAME               READY   STATUS             RESTARTS        AGE
test-index-hbqlv   0/1     Error              8 (5m13s ago)   16m
test-index-l6mzq   0/1     CrashLoopBackOff   10 (59s ago)    27m
zhaoxia@xzha-mac test % oc get pod test-index-hbqlv -o yaml
apiVersion: v1 kind: Pod metadata: annotations: cluster-autoscaler.kubernetes.io/safe-to-evict: "true" k8s.v1.cni.cncf.io/network-status: |- [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.131.0.84" ], "default": true, "dns": {} }] k8s.v1.cni.cncf.io/networks-status: |- [{ "name": "openshift-sdn", "interface": "eth0", "ips": [ "10.131.0.84" ], "default": true, "dns": {} }] kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"annotations":{},"name":"test-index","namespace":"test-1"},"spec":{"displayName":"Test","image":"quay.io/olmqe/nginxolm-operator-index:2726","publisher":"OLM-QE","sourceType":"grpc","updateStrategy":{"registryPoll":{"interval":"10m"}}}} openshift.io/scc: restricted-v2 seccomp.security.alpha.kubernetes.io/pod: runtime/default creationTimestamp: "2022-08-29T06:57:55Z" generateName: test-index- labels: catalogsource.operators.coreos.com/update: test-index olm.catalogSource: "" olm.pod-spec-hash: 777849c67c name: test-index-hbqlv namespace: test-1 ownerReferences: - apiVersion: operators.coreos.com/v1alpha1 blockOwnerDeletion: false controller: false kind: CatalogSource name: test-index uid: 5ef60ce9-6ade-43e1-bae4-7d69f6c9d5e0 resourceVersion: "218774" uid: 7606a54a-6a7d-4979-833a-97c2f87a88b8 spec: containers: - image: quay.io/olmqe/nginxolm-operator-index:2726 imagePullPolicy: Always livenessProbe: exec: command: - grpc_health_probe - -addr=:50051 failureThreshold: 3 initialDelaySeconds: 10 periodSeconds: 10 successThreshold: 1 timeoutSeconds: 5 name: registry-server ports: - containerPort: 50051 name: grpc protocol: TCP readinessProbe: exec: command: - grpc_health_probe - -addr=:50051 failureThreshold: 3 initialDelaySeconds: 5 periodSeconds: 10 successThreshold: 1 timeoutSeconds: 5 resources: requests: cpu: 10m memory: 50Mi securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: false runAsNonRoot: true 
runAsUser: 1001130000 startupProbe: exec: command: - grpc_health_probe - -addr=:50051 failureThreshold: 15 periodSeconds: 10 successThreshold: 1 timeoutSeconds: 1 terminationMessagePath: /dev/termination-log terminationMessagePolicy: FallbackToLogsOnError volumeMounts: - mountPath: /var/run/secrets/kubernetes.io/serviceaccount name: kube-api-access-bfzvh readOnly: true dnsPolicy: ClusterFirst enableServiceLinks: true imagePullSecrets: - name: test-index-dockercfg-wp8s4 nodeName: qe-daily-412-0829-qf9lx-worker-1-djpwq nodeSelector: kubernetes.io/os: linux preemptionPolicy: PreemptLowerPriority priority: 0 restartPolicy: Always schedulerName: default-scheduler securityContext: fsGroup: 1001130000 seLinuxOptions: level: s0:c34,c4 seccompProfile: type: RuntimeDefault serviceAccount: test-index serviceAccountName: test-index terminationGracePeriodSeconds: 30 tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 300 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 300 - effect: NoSchedule key: node.kubernetes.io/memory-pressure operator: Exists volumes: - name: kube-api-access-bfzvh projected: defaultMode: 420 sources: - serviceAccountToken: expirationSeconds: 3607 path: token - configMap: items: - key: ca.crt path: ca.crt name: kube-root-ca.crt - downwardAPI: items: - fieldRef: apiVersion: v1 fieldPath: metadata.namespace path: namespace - configMap: items: - key: service-ca.crt path: service-ca.crt name: openshift-service-ca.crt status: conditions: - lastProbeTime: null lastTransitionTime: "2022-08-29T06:57:55Z" status: "True" type: Initialized - lastProbeTime: null lastTransitionTime: "2022-08-29T06:57:55Z" message: 'containers with unready status: [registry-server]' reason: ContainersNotReady status: "False" type: Ready - lastProbeTime: null lastTransitionTime: "2022-08-29T06:57:55Z" message: 'containers with unready status: [registry-server]' reason: ContainersNotReady status: "False" type: ContainersReady - lastProbeTime: null lastTransitionTime: "2022-08-29T06:57:55Z" status: "True" type: PodScheduled containerStatuses: - containerID: cri-o://54d7a5ba94c061fb86ad056ad964dbda2824c864c6fdcd2d7d5a7ada515bc70e image: quay.io/olmqe/nginxolm-operator-index:2726 imageID: quay.io/olmqe/nginxolm-operator-index@sha256:d70f38fa773ea5030b5b80bfe34d9168aabff5039ead44b7f7e7cd76f8705eb1 lastState: terminated: containerID: cri-o://54d7a5ba94c061fb86ad056ad964dbda2824c864c6fdcd2d7d5a7ada515bc70e exitCode: 1 finishedAt: "2022-08-29T07:14:23Z" message: |+ Error: compute digest: compute hash: write tar: open /tmp/cache/cache: permission denied Usage: opm serve <source_path> [flags] Flags: --cache-dir string if set, sync and persist server cache directory --cache-only sync the serve cache and exit without serving --debug enable debug logging -h, --help help for serve -p, --port string port number to serve on (default "50051") --pprof-addr string address of startup profiling endpoint (addr:port format) -t, --termination-log string path to a container termination log file (default "/dev/termination-log") Global Flags: --skip-tls-verify skip TLS certificate verification for container image registries while pulling bundles --use-http use plain HTTP for container image registries while pulling bundles reason: Error startedAt: "2022-08-29T07:14:23Z" name: registry-server ready: false restartCount: 8 started: false state: waiting: message: back-off 5m0s restarting failed container=registry-server 
pod=test-index-hbqlv_test-1(7606a54a-6a7d-4979-833a-97c2f87a88b8) reason: CrashLoopBackOff hostIP: 10.242.0.4 phase: Running podIP: 10.131.0.84 podIPs: - ip: 10.131.0.84 qosClass: Burstable startTime: "2022-08-29T06:57:55Z"
Actual results:
The pod for the catsrc is not in Running status.
Expected results:
The pod for the catsrc is in Running status.
Additional info:
When using the openshift-marketplace project, the same error is raised: Error: compute digest: compute hash: write tar: open /tmp/cache/cache: permission denied
Description of the problem:
When installing a cluster using the kubeapi, the installer fails to send the logs due to a missing volume mount of the caCert.
time="2022-07-06T08:25:59Z" level=info msg="failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --rm --privileged --net=host --pid=host -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log quay.io/edge-infrastructure/assisted-installer-agent@sha256:20d9e31e37f881fcd34aed44b2ee9f143382f87cbf4b634325d2260f8dffe6c2 logs_sender -cluster-id 4d4be932-42a8-4d37-b5d2-41f42a487821 -url https://assisted-service-assisted-installer.apps.ostest.test.metalkube.org -host-id 17babad0-f2d0-419f-a69b-8c6895df26f4 -infra-env-id 37c26d69-6416-4888-bd2e-aec610f241b3 -pull-secret-token <SECRET> -insecure=false -bootstrap=true -cacert=/etc/assisted-service/service-ca-cert.crt], env vars [PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin TERM=xterm container=oci http_proxy= https_proxy= NO_PROXY= OPENSHIFT_BUILD_NAME=assisted-installer PULL_SECRET_TOKEN=<SECRET> no_proxy= HTTP_PROXY= HTTPS_PROXY= OPENSHIFT_BUILD_NAMESPACE=ci-op-8wiv6td6 BUILD_LOGLEVEL=0 HOME=/root HOSTNAME=extraworker-0], error exit status 1, waitStatus 1, Output \"time=\"06-07-2022 08:25:59\" level=fatal msg=\"Failed to initialize connection: &{%!e(string=open) %!e(string=/etc/assisted-service/service-ca-cert.crt) %!e(syscall.Errno=2)}\" file=\"send_logs.go:92\"\ntime=\"2022-07-06T08:25:59Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-8b070b62a9482fc0add228b77844b2c4e0a614e2b171ca87f76f56a4305a6ee7.scope: no such file or directory\"\"" time="2022-07-06T08:25:59Z" level=error msg="upload installation logs failed executing nsenter [--target 1 --cgroup --mount --ipc --pid -- podman run --rm --privileged --net=host --pid=host -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/var/log quay.io/edge-infrastructure/assisted-installer-agent@sha256:20d9e31e37f881fcd34aed44b2ee9f143382f87cbf4b634325d2260f8dffe6c2 logs_sender -cluster-id 4d4be932-42a8-4d37-b5d2-41f42a487821 -url https://assisted-service-assisted-installer.apps.ostest.test.metalkube.org -host-id 17babad0-f2d0-419f-a69b-8c6895df26f4 -infra-env-id 37c26d69-6416-4888-bd2e-aec610f241b3 -pull-secret-token <SECRET> -insecure=false -bootstrap=true -cacert=/etc/assisted-service/service-ca-cert.crt], Error exit status 1, LastOutput \"... :92\"\ntime=\"2022-07-06T08:25:59Z\" level=warning msg=\"lstat /sys/fs/cgroup/devices/machine.slice/libpod-8b070b62a9482fc0add228b77844b2c4e0a614e2b171ca87f76f56a4305a6ee7.scope: no such file or directory\"\""
How reproducible:
100%
Steps to reproduce:
1. Install a cluster using the kubeapi
2. Look for the host logs after the host reboots or the installation completes.
Actual results:
no host logs
Expected results:
...
Description of problem:
On the storageclass creation page, the dropdown items for "Reclaim policy" and "Volume binding type" are not marked for i18n.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-22-143022
How reproducible:
always
Steps to Reproduce:
1.Go to storageclass creation page, check if dropdown items for "Reclaim policy" and "Volume binding type" support i18n.
Actual results:
1. They are not marked for i18n.
Expected results:
1. Should support i18n.
Additional info:
This is a clone of issue OCPBUGS-2851. The following is the description of the original issue:
—
Description of problem:
The current implementation of registries.conf support is not working as expected. This bug report will outline the expectations of how we believe this should work.
The containers/image project defines a configuration file called registries.conf, which controls how image pulls can be redirected to another registry. Effectively the pull request for a given registry is redirected to another registry which can satisfy the image pull request instead. The specification for the registries.conf file is located here. For tools such as podman and skopeo, this configuration file allows those tools to indicate where images should be pulled from, and the containers/image project rewrites the image reference on the fly and tries to get the image from the first location it can, preferring these "alternate locations" and then falling back to the original location if one of the alternate locations can't satisfy the image request.
An important aspect of this redirection mechanism is that it allows both the "host:port" and "namespace" portions of the image reference to be redirected. To be clear on the nomenclature used in the registries.conf specification: a namespace refers to zero or more slash-separated sections leading up to the image name (which is called repo in the specification and has the tag or digest after it; see repo(:_tag|@digest) below), and the host[:port] refers to the domain where the image registry is hosted.
Example:
host[:port]/namespace[/namespace…]/repo(:_tag|@digest)
For example, if we have an image called myimage@sha:1234 and the image normally resides in quay.io/foo/myimage@sha:1234, you could redirect the image pull request to myregistry.com/bar/baz/myimage@sha:1234. Note that in this example the alternate registry location has a different host, and the namespace "path" is different too.
In a typical development scenario, image references within an OLM catalog should always point to a production location where the image is intended to be pulled from when a catalog is published publicly. Doing this prevents publishing a catalog which contains image references to internal repositories, which would never be accessible by a customer. By using the registries.conf redirection mechanism, we can perform testing even before the images are officially published to public locations, and we can redirect the image reference from a production location to an internal repository for testing purposes. Below is a simple example of a registries.conf file that redirects image pull requests away from prodlocation.io to preprodlocation.com:
[[registry]]
  location = "prodlocation.io/xx"
  insecure = false
  blocked = false
  mirror-by-digest-only = true
  prefix = ""
  [[registry.mirror]]
    location = "preprodlocation.com/xx"
    insecure = false
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
oc mirror -c ImageSetConfiguration.yaml --use-oci-feature --oci-feature-action mirror --oci-insecure-signature-policy --oci-registries-config registries.conf --dest-skip-tls docker://localhost:5000/example/test
Example registries.conf
[[registry]]
  prefix = ""
  insecure = false
  blocked = false
  location = "prod.com/abc"
  mirror-by-digest-only = true
  [[registry.mirror]]
    location = "internal.exmaple.io/cp"
    insecure = false

[[registry]]
  prefix = ""
  insecure = false
  blocked = false
  location = "quay.io"
  mirror-by-digest-only = true
  [[registry.mirror]]
    location = "internal.exmaple.io/abcd"
    insecure = false
Actual results:
images are not pulled from "internal" registry
Expected results:
images should be pulled from "internal" registry
Additional info:
The current implementation in oc mirror creates its own structs to approximate the ones provided by the containers/image project, but it might not be necessary to do that. Since the oc mirror project already uses containers/image as a dependency, it could leverage the FindRegistry function, which takes a image reference, loads the registries.conf information and returns the most appropriate [[registry]] reference (in the form of Registry struct) or nil if no match was found. Obviously custom processing will be necessary to do something useful with the Registry instance. Using this code is not a requirement, just a suggestion of another possible path to load the configuration.
This is a clone of issue OCPBUGS-2513. The following is the description of the original issue:
—
Description of problem:
Agent-based installation is failing in a disconnected env because a pull secret for registry.ci.openshift.org is required. As we are installing the cluster in a disconnected env, the mirror registry secrets alone should be enough for pulling the image.
Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-18-041406
How reproducible:
Always
Steps to Reproduce:
1. Set up a mirror registry with the registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-18-041406 release.
2. Add the ICSP information in the install-config file.
3. Create agent.iso using install-config.yaml and agent-config.yaml.
4. ssh to node zero to see the error in create-cluster-and-infraenv.service.
Actual results:
create-cluster-and-infraenv.service is failing with the below error:
time="2022-10-18T09:36:13Z" level=fatal msg="Failed to register cluster with assisted-service: AssistedServiceError Code: 400 Href: ID: 400 Kind: Error Reason: pull secret for new cluster is invalid: pull secret must contain auth for \"registry.ci.openshift.org\""
Expected results:
create-cluster-and-infraenv.service should be successfully started.
Additional info:
Refer to this similar bug: https://bugzilla.redhat.com/show_bug.cgi?id=1990659
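A sketch of the relevant install-config excerpt for such a disconnected setup; the mirror hostname is made up, and the intent is that auth for the mirror alone should suffice:
imageContentSources:
- mirrors:
  - mirror.example.com:5000/ocp/release
  source: quay.io/openshift-release-dev/ocp-release
- mirrors:
  - mirror.example.com:5000/ocp/release
  source: registry.ci.openshift.org/ocp/release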
This is a clone of issue OCPBUGS-4367. The following is the description of the original issue:
—
Description of problem:
The calls to log.Debugf() from image/baseiso.go and image/oc.go are not being output when the "image create" command is run.
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Every time
Steps to Reproduce:
1. Run ../bin/openshift-install agent create image --dir ./cluster-manifests/ --log-level debug
Actual results:
No debug log messages from log.Debugf() calls in pkg/asset/agent/image/oc.go
Expected results:
Debug log messages are output
Additional info:
Note from Zane: We should probably also use the real global logger instead of [creating a new one](https://github.com/openshift/installer/blob/2698cbb0ec7e96433a958ab6b864786c0c503c0b/pkg/asset/agent/image/baseiso.go#L109) with the default config that ignores the --log-level flag and prints weird `[0001]` stuff in the output for some reason. (The NMStateConfig manifests logging suffers from the same problem.)
This is a clone of issue OCPBUGS-6621. The following is the description of the original issue:
—
Description of problem:
Image registry pods panic while deploying OCP in ap-southeast-4 AWS region
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Deploy OCP in AWS ap-southeast-4 region
Steps to Reproduce:
Deploy OCP in AWS ap-southeast-4 region
Actual results:
panic: Invalid region provided: ap-southeast-4
Expected results:
Image registry pods should come up with no errors
Additional info:
Description of problem:
Not all rules are removed from iptables after disabling MultiNetworkPolicy.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Configure SR-IOV (nodepolicy + sriovnetwork)
2. Configure 2 pods
3. Enable MultiNetworkPolicy
4. Apply ~20 rules for pod1:
spec:
  podSelector:
    matchLabels:
      pod: pod1
  policyTypes:
  - Ingress
  ingress: []
5. Disable MultiNetworkPolicy
6. Send ping pod2 => pod1
Actual results:
Traffic is still blocked
Expected results:
Traffic should be passed
Additional info:
Before disabling multiNetworkPolicy:
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default net-attach-def:ns1/sriovnetwork2" -j MULTI-0-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default24 net-attach-def:ns1/sriovnetwork2" -j MULTI-1-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default17 net-attach-def:ns1/sriovnetwork2" -j MULTI-2-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default15 net-attach-def:ns1/sriovnetwork2" -j MULTI-3-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default14 net-attach-def:ns1/sriovnetwork2" -j MULTI-4-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default7 net-attach-def:ns1/sriovnetwork2" -j MULTI-5-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default5 net-attach-def:ns1/sriovnetwork2" -j MULTI-6-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default20 net-attach-def:ns1/sriovnetwork2" -j MULTI-7-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default19 net-attach-def:ns1/sriovnetwork2" -j MULTI-8-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default11 net-attach-def:ns1/sriovnetwork2" -j MULTI-9-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default10 net-attach-def:ns1/sriovnetwork2" -j MULTI-10-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default9 net-attach-def:ns1/sriovnetwork2" -j MULTI-11-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default6 net-attach-def:ns1/sriovnetwork2" -j MULTI-12-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default3 net-attach-def:ns1/sriovnetwork2" -j MULTI-13-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default16 net-attach-def:ns1/sriovnetwork2" -j MULTI-14-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default13 net-attach-def:ns1/sriovnetwork2" -j MULTI-15-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default2 net-attach-def:ns1/sriovnetwork2" -j MULTI-16-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default22 net-attach-def:ns1/sriovnetwork2" -j MULTI-17-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default21 net-attach-def:ns1/sriovnetwork2" -j MULTI-18-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default18 net-attach-def:ns1/sriovnetwork2" -j MULTI-19-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default12 net-attach-def:ns1/sriovnetwork2" -j MULTI-20-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default8 net-attach-def:ns1/sriovnetwork2" -j MULTI-21-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default4 net-attach-def:ns1/sriovnetwork2" -j MULTI-22-INGRESS
-A MULTI-0-INGRESS -j DROP
-A MULTI-1-INGRESS -j DROP
-A MULTI-2-INGRESS -j DROP
-A MULTI-3-INGRESS -j DROP
-A MULTI-4-INGRESS -j DROP
-A MULTI-5-INGRESS -j DROP
-A MULTI-6-INGRESS -j DROP
-A MULTI-7-INGRESS -j DROP
-A MULTI-8-INGRESS -j DROP
-A MULTI-9-INGRESS -j DROP
-A MULTI-10-INGRESS -j DROP
-A MULTI-11-INGRESS -j DROP
-A MULTI-12-INGRESS -j DROP
-A MULTI-13-INGRESS -j DROP
-A MULTI-14-INGRESS -j DROP
-A MULTI-15-INGRESS -j DROP
-A MULTI-16-INGRESS -j DROP
-A MULTI-17-INGRESS -j DROP
-A MULTI-18-INGRESS -j DROP
-A MULTI-19-INGRESS -j DROP
-A MULTI-20-INGRESS -j DROP
-A MULTI-21-INGRESS -j DROP
-A MULTI-22-INGRESS -j DROP
=============================================================
After disabling multiNetworkPolicy:
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default5 net-attach-def:ns1/sriovnetwork2" -j MULTI-0-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default24 net-attach-def:ns1/sriovnetwork2" -j MULTI-1-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default17 net-attach-def:ns1/sriovnetwork2" -j MULTI-2-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default15 net-attach-def:ns1/sriovnetwork2" -j MULTI-3-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default7 net-attach-def:ns1/sriovnetwork2" -j MULTI-4-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default3 net-attach-def:ns1/sriovnetwork2" -j MULTI-5-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default20 net-attach-def:ns1/sriovnetwork2" -j MULTI-6-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default19 net-attach-def:ns1/sriovnetwork2" -j MULTI-7-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default9 net-attach-def:ns1/sriovnetwork2" -j MULTI-8-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default6 net-attach-def:ns1/sriovnetwork2" -j MULTI-9-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default16 net-attach-def:ns1/sriovnetwork2" -j MULTI-10-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default2 net-attach-def:ns1/sriovnetwork2" -j MULTI-11-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default22 net-attach-def:ns1/sriovnetwork2" -j MULTI-12-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default21 net-attach-def:ns1/sriovnetwork2" -j MULTI-13-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default18 net-attach-def:ns1/sriovnetwork2" -j MULTI-14-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default8 net-attach-def:ns1/sriovnetwork2" -j MULTI-15-INGRESS
-A MULTI-INGRESS -i int1 -m comment --comment "policy:deny-by-default4 net-attach-def:ns1/sriovnetwork2" -j MULTI-16-INGRESS
-A MULTI-0-INGRESS -j DROP
-A MULTI-1-INGRESS -j DROP
-A MULTI-2-INGRESS -j DROP
-A MULTI-3-INGRESS -j DROP
-A MULTI-4-INGRESS -j DROP
-A MULTI-5-INGRESS -j DROP
-A MULTI-6-INGRESS -j DROP
-A MULTI-7-INGRESS -j DROP
-A MULTI-8-INGRESS -j DROP
-A MULTI-9-INGRESS -j DROP
-A MULTI-10-INGRESS -j DROP
-A MULTI-11-INGRESS -j DROP
-A MULTI-12-INGRESS -j DROP
-A MULTI-13-INGRESS -j DROP
-A MULTI-14-INGRESS -j DROP
-A MULTI-15-INGRESS -j DROP
-A MULTI-16-INGRESS -j DROP
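For context, a sketch of one of the ~20 policies applied in step 4, reconstructed from the spec fragment above and the net-attach-def name visible in the iptables comments; the apiVersion and policy-for annotation follow the upstream multi-networkpolicy conventions, so treat the exact names as assumptions:
apiVersion: k8s.cni.cncf.io/v1beta1
kind: MultiNetworkPolicy
metadata:
  name: deny-by-default
  namespace: ns1
  annotations:
    k8s.v1.cni.cncf.io/policy-for: ns1/sriovnetwork2
spec:
  podSelector:
    matchLabels:
      pod: pod1
  policyTypes:
  - Ingress
  ingress: []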
With the CSISnapshot capability disabled, all CSI driver operators are Degraded. For example, the AWS EBS CSI driver operator during installation:
18:12:16.895: Some cluster operators are not ready: storage (Degraded=True AWSEBSCSIDriverOperatorCR_AWSEBSDriverStaticResourcesController_SyncError: AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: "volumesnapshotclass.yaml" (string): the server could not find the requested resource AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: ) Ginkgo exit error 1: exit with code 1}
Version-Release number of selected component (if applicable):
4.12.nightly
The reason is that cluster-csi-snapshot-controller-operator does not create VolumeSnapshotClass CRD, which AWS EBS CSI driver operator expects to exist.
CSI driver operators must skip VolumeSnapshotClass creation if the CRD does not exist.
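For reference, the volumesnapshotclass.yaml asset that fails to sync is, roughly, a manifest like the following (the name is illustrative); without the VolumeSnapshotClass CRD installed, the API server cannot serve this resource and the sync degrades the operator:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
driver: ebs.csi.aws.com
deletionPolicy: Delete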
Description of problem:
A project viewer is able to see a 'Create Pod Disruption Budget' button on the Pods list page, but the creation will ultimately fail due to insufficient permissions. The console should not show a 'Create Pod Disruption Budget' button to a project viewer; other resource list pages don't have this issue.
Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-09-16-212009
How reproducible:
Always
Steps to Reproduce:
1. normal user has a project and workloads
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/example 3/3 3 3 79s
NAME DESIRED CURRENT READY AGE
replicaset.apps/example-787f749bb 3 3 3 79s
2. grant another user with view access to user project 'yapei1-project'
Actual results:
3. Project viewer 'uiauto1' can see the pods list successfully, but the console also shows a 'Create Pod Disruption Budget' button, while the creation will ultimately fail if the project viewer tries to create a PodDisruptionBudget.
Expected results:
3. The console should not show the 'Create Pod Disruption Budget' button for a project viewer.
Additional info:
For comparison: we don't show resource creation buttons ('Create xxx') on other workload list pages for a project viewer, such as the Deployments and DeploymentConfigs lists.
Hypershift does not use kubernetes.default.svc as the API audience on the KAS; it is set to the URL of the OIDC provider. ROSA also does this, so I don't imagine this test passes for it either at the moment.
Explicit setting of the Audiences on the TokenRequest is not required. If not set, it will just default to the audiences configured in the KAS.
Causing conformance failure for hypershift
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-hypershift-main-periodics-4.13-conformance-aws-ovn/1620240601058381824
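A sketch of the token request body this implies, POSTed to the serviceaccounts/<name>/token subresource; with audiences omitted, the API server fills in its own configured audiences (kubernetes.default.svc on standalone OCP, the OIDC issuer URL on HyperShift/ROSA):
apiVersion: authentication.k8s.io/v1
kind: TokenRequest
spec:
  # audiences intentionally left unset so the KAS default applies
  expirationSeconds: 3600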
Description of problem:
When the user runs:
openshift-install agent create image --dir cluster-manifests
But if the manifests are not in cluster-manifests or are missing, the error generated by the tool leads users to believe that they are missing some tool dependency:
ERROR failed to write asset (Agent Installer ISO) to disk: image reader not available
Version-Release number of selected component (if applicable): 4.11.0
How reproducible: 100%
Steps to Reproduce:
1. rm -fr /tmp/cluster-manifests && mkdir /tmp/cluster-manifests
2. openshift-install agent create image --dir cluster-manifests
Actual results:
ERROR failed to write asset (Agent Installer ISO) to disk: image reader not available
Expected results:
Error: Missing manifests in the specified cluster manifest directory: "/tmp/cluster-manifests"
Additional info:
This is a clone of issue OCPBUGS-3499. The following is the description of the original issue:
—
Description of problem:
On clusters serving Route via CRD (i.e. MicroShift), Route validation does not perform the same validation as on OCP.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
$ cat<<EOF | oc apply --server-side -f-
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
EOF
route.route.openshift.io/hello-microshift serverside-applied

$ oc get route hello-microshift -o yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  creationTimestamp: "2022-11-11T23:53:33Z"
  generation: 1
  name: hello-microshift
  namespace: default
  resourceVersion: "2659"
  uid: cd35cd20-b3fd-4d50-9912-f34b3935acfd
spec:
  host: hello-microshift-default.cluster.local
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: None

$ cat<<EOF | oc apply --server-side -f-
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: ""
EOF
Actual results:
route.route.openshift.io/hello-microshift serverside-applied
Expected results:
The Route "hello-microshift" is invalid: spec.wildcardPolicy: Invalid value: "": field is immutable
Additional info:
** This change will be inert on OCP, which already has the correct behavior. **
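One plausible way to get parity on CRD-served platforms is a CEL transition rule in the CRD's schema; a sketch, assuming the Route CRD adopts x-kubernetes-validations (the field path and message are modeled on the OCP error above):
properties:
  spec:
    properties:
      wildcardPolicy:
        type: string
        x-kubernetes-validations:
        - rule: "self == oldSelf"
          message: "field is immutable"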
Description of problem:
Disconnected IPI OCP 4.11.5 cluster install on baremetal fails when hostname of master nodes does not include "master"
Version-Release number of selected component (if applicable): 4.11.5
How reproducible: Perform disconnected IPI install of OCP 4.11.5 on bare metal with master nodes that do not contain the text "master"
Steps to Reproduce:
Perform disconnected IPI install of OCP 4.11.5 on bare metal with master nodes that do not contain the text "master"
Actual results: master nodes do not come up.
Expected results: master nodes should come up despite that the text "master" is not in their hostname.
Additional info:
My customer reinstalled a new cluster using the fix here, but they have the exact same issue: the metal3 pod has an empty PROVISIONING_MACS value. Can we work with them to understand why the new code fix https://github.com/openshift/cluster-baremetal-operator/commit/76bd6bc461b30a6a450f85a42e492a0933178aee is not working?
cat metal3-static-ip-set/metal3-static-ip-set/logs/current.log
2022-09-27T14:19:38.140662564Z + '[' -z 10.17.199.3/27 ']'
2022-09-27T14:19:38.140662564Z + '[' -z '' ']'
2022-09-27T14:19:38.140662564Z + '[' -n '' ']'
2022-09-27T14:19:38.140722345Z ERROR: Could not find suitable interface for "10.17.199.3/27"
2022-09-27T14:19:38.140726312Z + '[' -n '' ']'
2022-09-27T14:19:38.140726312Z + echo 'ERROR: Could not find suitable interface for "10.17.199.3/27"'
2022-09-27T14:19:38.140726312Z + exit 1
cat metal3-b9bf8d595-gv94k.yaml
...
initContainers:
- command:
  - /set-static-ip
  env:
  - name: PROVISIONING_IP
    value: 10.17.199.3/27
  - name: PROVISIONING_INTERFACE
  - name: PROVISIONING_MACS   <------------------------- missing MACS
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4f04793bd109ecba2dfe43be93dc990ac5299272482c150bd5f2eee0f80c983b
  imagePullPolicy: IfNotPresent
  name: metal3-static-ip-set
....
omc logs machine-api-controllers-6b9ffd96cd-grh6l -c nodelink-controller -n openshift-machine-api
2022-09-21T16:13:43.600517485Z I0921 16:13:43.600513 1 nodelink_controller.go:408] Finding machine from node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca"
2022-09-21T16:13:43.600521381Z I0921 16:13:43.600517 1 nodelink_controller.go:425] Finding machine from node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca" by ProviderID
2022-09-21T16:13:43.600525225Z W0921 16:13:43.600521 1 nodelink_controller.go:427] Node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca" has no providerID
2022-09-21T16:13:43.600528917Z I0921 16:13:43.600524 1 nodelink_controller.go:448] Finding machine from node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca" by IP
2022-09-21T16:13:43.600532711Z I0921 16:13:43.600529 1 nodelink_controller.go:453] Found internal IP for node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca": "10.17.192.33"
2022-09-21T16:13:43.600551289Z I0921 16:13:43.600544 1 nodelink_controller.go:477] Matching machine not found for node "blocp-1-106-m-0.c106-1.sc.evolhse.hydro.qc.ca" with internal IP "10.17.192.33"
From @dtantsur WIP PR: https://github.com/openshift/cluster-baremetal-operator/pull/299
The customer is waiting for this fix; the previous code change does not fix their situation.
Please refer to this Slack thread: https://coreos.slack.com/archives/CFP6ST0A3/p1664215102459219
This is a clone of issue OCPBUGS-4758. The following is the description of the original issue:
—
Description of problem:
See: https://issues.redhat.com/browse/CPSYN-143

tl;dr: Based on the previous direction that 4.12 was going to enforce PSA restricted by default, OLM had to make a few changes, because the way we run catalog pods (and we have to run them that way because of how the opm binary worked) was incompatible with running restricted:
1) We set openshift-marketplace to enforce restricted (this was our choice, we didn't have to do it, but we did).
2) We updated the opm binary so catalog images using a newer opm binary don't have to run privileged.
3) We added a field to catalogsource that allows you to choose whether to run the pod privileged (legacy mode) or restricted. The default is restricted. We made that the default so that users running their own catalogs in their own NSes (which would be default PSA enforcing) would be able to be successful without needing their NS upgraded to privileged.

Unfortunately this means:
1) Legacy catalog images (i.e. those using older opm binaries) won't run on 4.12 by default (the catalogsource needs to be modified to specify legacy mode).
2) Legacy catalog images cannot be run in the openshift-marketplace NS since that NS does not allow privileged pods. This means legacy catalogs can't contribute to the global catalog (since catalogs must be in that NS to be in the global catalog).

Before 4.12 ships we need to:
1) Remove the PSA restricted label on the openshift-marketplace NS.
2) Change the catalogsource securitycontextconfig mode default to use "legacy" as the default, not restricted.

This gives catalog authors another release to update to a newer opm binary that can run restricted, or to get their NSes explicitly labeled as privileged (4.12 will not enforce restricted, so in 4.12 using the legacy mode will continue to work). In 4.13 we will need to revisit what we want the default to be, since at that point catalogs will start breaking if they try to run in legacy mode in most NSes.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
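For illustration, a CatalogSource opting in to the pre-4.12 pod behavior via the new field described above (the image reference is a placeholder):
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: legacy-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/example/legacy-index:latest
  grpcPodConfig:
    securityContextConfig: legacy   # "restricted" is the other accepted value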
Description of the problem:
https://github.com/openshift/assisted-installer/issues/513
How reproducible:
https://github.com/openshift/assisted-installer/issues/513
Steps to reproduce:
1. Create a cluster with a proxy and an additional certificate bundle
2. Install
Actual results:
The controller failed to reach the service because of a self-signed certificate.
Expected results:
Installation succeeds
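A sketch of the InfraEnv shape involved, assuming the standard agent-install.openshift.io fields for proxy and trust bundle (names and certificate content are placeholders):
apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
metadata:
  name: myinfraenv
  namespace: mycluster
spec:
  clusterRef:
    name: mycluster
    namespace: mycluster
  pullSecretRef:
    name: pull-secret
  proxy:
    httpProxy: http://proxy.example.com:3128
    httpsProxy: http://proxy.example.com:3128
  additionalTrustBundle: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----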
Description of problem:
seeing test failure due to panic in cvo here:
Undiagnosed panic detected in pod:
pods/openshift-cluster-version_cluster-version-operator-96cf55b5-rffgt_cluster-version-operator_previous.log.gz:E0915 18:38:42.763315 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
pods/openshift-cluster-version_cluster-version-operator-96cf55b5-rffgt_cluster-version-operator_previous.log.gz:E0915 18:38:42.763418 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
full error from logs:
E0915 18:38:42.763315 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 187 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1934980?, 0x2bc6240})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4d2604?})
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1934980, 0x2bc6240})
	/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).calculateNext(0xc0015c6000, 0xc001df2000)
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:716 +0x14d
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:575 +0x2a9
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10000000000?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001df2000?, {0x1e44e60, 0xc002739f50}, 0x1, 0xc00058e0c0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x989680, 0x0, 0x60?, 0x0?)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc0015c6000?, {0x1e5eb30?, 0xc0000cacc0?}, 0x10?, {0x0?, 0x0?}, {0x0?, 0x0?})
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:556 +0x145
github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run.func2()
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:387 +0x83
created by github.com/openshift/cluster-version-operator/pkg/cvo.(*Operator).Run
	/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:385 +0x4af
E0915 18:38:42.763418 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Version-Release number of selected component (if applicable):
How reproducible:
Currently unsure; hit this in a test run, but the CVO shouldn't ever panic.
Steps to Reproduce:
Actual results:
panic in cvo pod
Expected results:
no panic in cvo pod
Additional info:
This is a clone of issue OCPBUGS-4305. The following is the description of the original issue:
—
Description of problem:
Please add an option to DISABLE debug in ironic-api. Presently it is enabled by default and there is no way to disable it or reduce the log level.
https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
Version-Release number of selected component (if applicable): none
How reproducible: Every time
Steps to Reproduce:
Please check source code here: https://github.com/metal3-io/ironic-image/blob/main/ironic-config/ironic.conf.j2#L3
It is enabled by default and there is no way to disable it or reduce log level
Actual results:
Please check Case 03371411; the log file grew to 409 GB.
Expected results: Need a way to disable debug
Additional info: Case 03371411. A cluster must-gather and log file can be found in the case.
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/87
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Assisted installations default to setting platform: baremetal. Using the ReST API, it is possible to select vsphere (or ovirt) as the platform type. In every case, the actual platform data is filled in by assisted-service, and cannot be specified by the user.
The ClusterDeployment resource (from Hive) contains a Platform field. We could look for a platform specified in this field and set that platform when creating the cluster in the create-cluster-and-infraenv service. If ZTP were ever to support other deployment methods, this would probably be a good choice for that also.
We should probably warn the user if they attempt to put any data inside the platform settings, as this will be ignored. This shouldn't be an error, though, as it would prevent users from using existing install configs. Perhaps it should be an error if they specify a platform we don't support.
[Pawan]: We can simply use the PlatformType from the ACI, and then no assisted-service client changes are required. We will throw an error if the user provides an unsupported platformType (aws, gcp, etc.)
Ignoring the unwanted Platform settings from install-config.yaml to be handled in https://issues.redhat.com/browse/AGENT-348
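A sketch of how this could look on the AgentClusterInstall, assuming the platformType field proposed above lands as described (the names are placeholders, and unsupported values such as aws or gcp would be rejected):
apiVersion: extensions.hive.openshift.io/v1beta1
kind: AgentClusterInstall
metadata:
  name: mycluster
  namespace: mycluster
spec:
  platformType: VSphere
  clusterDeploymentRef:
    name: mycluster
  imageSetRef:
    name: openshift-v4.12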
Description of problem:
The icon color of Alerts in the Topology list view should be based on alert type.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a deployment
2. Create a resource quota so that the quota alert will be visible on the topology list page
3. Navigate to the topology list page
Actual results:
Alert icon color is black and white. See the screenshots
Expected results:
Alert icon color should be based on alert type.
Additional info:
This is a clone of issue OCPBUGS-3287. The following is the description of the original issue:
—
Description of problem:
Configure both IPv4 and IPv6 addresses for api/ingress in install-config.yaml and install the cluster using the agent-based installer. The provisioned cluster has only an IPv4 stack for API/Ingress.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. As described above.
Actual results:
The cluster provisioned has only IPv4 stack for API/Ingress
Expected results:
The cluster provisioned has both IPv4 and IPv6 for API/Ingress
Additional info:
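For reference, a dual-stack install-config excerpt of the kind described in step 1 (the addresses are illustrative); in 4.12 the baremetal platform accepts lists for both fields:
platform:
  baremetal:
    apiVIPs:
    - 192.168.122.10
    - fd2e:6f44:5dd8:c956::16
    ingressVIPs:
    - 192.168.122.11
    - fd2e:6f44:5dd8:c956::17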
We should capture a metric for topology usage in OCP and report it via telemetry.
Description of problem:
The error returned by the egress firewall is overridden by the status update error and never returned.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an egress firewall with a bad CIDR:
kind: EgressFirewall
apiVersion: k8s.ovn.org/v1
metadata:
  name: default
  namespace: default
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 1.2.3.345/32
2. Before the fix: you should see the log "Creating *v1.EgressFirewall default/default took: 4.662942ms"
3. After the fix: you should see the log "Failed to create *v1.EgressFirewall default/default, error: cannot create EgressFirewall Rule to destination 1.2.3.345/32 for namespace default: invalid CIDR address: 1.2.3.345/32"
4. These logs are mutually exclusive; check that one of them is present and the other is not.
Actual results:
Expected results:
Additional info:
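For contrast, the same policy with a valid CIDR, which should be accepted without either error (1.2.3.345/32 is invalid because 345 is not a valid octet):
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
  namespace: default
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 1.2.3.0/24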
When using an install-config with missing VIP values set in the baremetal-platform section, we attempt to get defaults for them by doing a DNS lookup on the cluster domain name. If this lookup fails, we set the error message from DNS as the default value, resulting in a very confusing error message:
[platform.baremetal.apiVIPs: Invalid value: []string{"DNS lookup failure: lookup api.test-cluster.test-domain on 10.0.80.11:53: no such host"}: ip <nil> is invalid, platform.baremetal.apiVIPs: Invalid value: "DNS lookup failure: lookup api.test-cluster.test-domain on 10.0.80.11:53: no such host": "DNS lookup failure: lookup api.test-cluster.test-domain on 10.0.80.11:53: no such host" is not a valid IP, platform.baremetal.apiVIPs: Invalid value: "DNS lookup failure: lookup api.test-cluster.test-domain on 10.0.80.11:53: no such host": IP expected to be in one of the machine networks: 192.168.122.0/23]
This has been the case since the inception of baremetal IPI, but it has gotten considerably worse in 4.12 due to the VIP fields changing from a single string to a list.
If the user doesn't supply a value and we can't generate a sensible default, we should report that the error is that they didn't supply a value, not that they supplied an invalid value that they did not in fact supply:
[platform.baremetal.apiVIPs: Required value: must specify at least one VIP for the API, platform.baremetal.apiVIPs: Required value: must specify VIP for API, when VIP for ingress is set]
Derrick got an "old and new refs are equal" error on rebase; this is similar to OCPBUGS-1899, but I think it has a different root cause. In this case, when a manual rollback is performed via the bootloader, we've computed that there's an osimageurl diff between the expected and desired state, but actually the desired state is already set.
We just need to skip doing the rebase if we're already in the target state.
(A real root of this problem again is that the whole "current/desired config" thing is trying to track state independently of the bootloader...if we made node state == container image, all of that goes away. The MCO would understand that it got booted into a previous state)
In order to install OKD via the Assisted Installer, an additional configuration option, `OKD_RPMS`, is currently required. This image was previously built manually and uploaded to quay.
It would be useful to include it in the payload and teach Assisted Service to extract it automatically, so that this configuration change would not be required. As a result, the same Assisted Installer can be used to install both OCP and OKD versions. Implementing this would also simplify agent-based cluster 0 installation.
cc [~andrea.fasano]
OpenShift 4.12 is going to be built with Go 1.19, but automatic migration in our repository has failed. The migration should be done manually.
[1]: https://github.com/openshift/cluster-image-registry-operator/pull/802
Description of problem:
Pods in the openshift-marketplace namespace cause PodSecurityViolation alerts in a vanilla OpenShift cluster.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-01-04-203333
How reproducible:
100%
Steps to Reproduce:
1. Install a fresh new cluster
2. Check the alerts in the console
Actual results:
PodSecurityViolation alert is present
Expected results:
No alerts
Additional info:
I'll provide a filtered version of the audit logs containing the violations
Description of problem:
$ oc adm must-gather -- gather_ingress_node_firewall
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dec5a08681e11eedcd31f075941b74f777b9187f0e711a498a212f9d96adb2f
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 0ef60b50-4378-431d-8ca2-faa5af098274
ClusterVersion: Stable at "4.12.0-0.nightly-2022-09-26-111919"
ClusterOperators: clusteroperator/insights is not available (Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed) because Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed
[must-gather      ] OUT namespace/openshift-must-gather-fr7kc created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-xx2fh created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3dec5a08681e11eedcd31f075941b74f777b9187f0e711a498a212f9d96adb2f created
[must-gather-xvfj4] POD 2022-09-28T16:57:00.887445531Z /bin/bash: /usr/bin/gather_ingress_node_firewall: Permission denied
[must-gather-xvfj4] OUT waiting for gather to complete
[must-gather-xvfj4] OUT downloading gather output
[must-gather-xvfj4] OUT receiving incremental file list
[must-gather-xvfj4] OUT ./
[must-gather-xvfj4] OUT
[must-gather-xvfj4] OUT sent 27 bytes received 40 bytes 26.80 bytes/sec
[must-gather-xvfj4] OUT total size is 0 speedup is 0.00
[must-gather      ] OUT namespace/openshift-must-gather-fr7kc deleted
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-xx2fh deleted
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
ClusterID: 0ef60b50-4378-431d-8ca2-faa5af098274
ClusterVersion: Stable at "4.12.0-0.nightly-2022-09-26-111919"
ClusterOperators: clusteroperator/insights is not available (Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed) because Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired: UHC services authentication failed
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
– NOT A BUG –
This was a story, but CI is not working for the OLM project, so it was moved to OCPBUGS, where it is now tracked.
----------------------------
Upstream, the `opm alpha diff` functionality was moved to the `oc-mirror` team by a non-RH actor.
This story is to track downstreaming the two PRs.
The only thing to verify here is that there is no more `opm alpha diff` command.
The other changes in the PRs externalize some interfaces and implement an undocumented alpha-level internal channel-level property list.
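A quick way to perform that verification once the PRs are downstreamed:
~~~
$ opm alpha diff
Error: unknown command "diff" for "opm alpha"   # expected once removed; exact wording may differ
~~~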
Description of problem:
OCP cluster installation (SNO) using the assisted installer running on an ACM hub cluster. The hub cluster is OCP 4.10.33; ACM is 2.5.4.
When a cluster fails to install, we remove the installation CRs and cluster namespace from the hub cluster (to eventually redeploy). The termination of the namespace hangs indefinitely (14+ hours) with finalizers remaining.
To resolve the hang we can remove the finalizers by editing both the secret pointed to by BareMetalHost .spec.bmc.credentialsName and the BareMetalHost CR. When these finalizers are removed, the namespace termination completes within a few seconds.
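The workaround described above can be scripted; a minimal sketch (namespace from the report, everything else assumed):
~~~
NS=cnfocto1
BMH=$(oc -n "$NS" get bmh -o name)
SECRET=$(oc -n "$NS" get "$BMH" -o jsonpath='{.spec.bmc.credentialsName}')
oc -n "$NS" patch secret "$SECRET" --type=merge -p '{"metadata":{"finalizers":null}}'
oc -n "$NS" patch "$BMH" --type=merge -p '{"metadata":{"finalizers":null}}'
~~~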
Version-Release number of selected component (if applicable):
OCP 4.10.33 ACM 2.5.4
How reproducible:
Always
Steps to Reproduce:
1. Generate installation CRs (AgentClusterInstall, BMH, ClusterDeployment, InfraEnv, NMStateConfig, ...) with an invalid configuration parameter. Two scenarios validated to hit this issue:
   a. Invalid rootDeviceHint in BareMetalHost CR
   b. Invalid credentials in the secret referenced by BareMetalHost.spec.bmc.credentialsName
2. Apply installation CRs to hub cluster
3. Wait for cluster installation to fail
4. Remove cluster installation CRs and namespace
Actual results:
Cluster namespace remains in terminating state indefinitely:
$ oc get ns cnfocto1
NAME       STATUS        AGE
cnfocto1   Terminating   17h
Expected results:
Cluster namespace (and all installation CRs in it) are successfully removed.
Additional info:
The installation CRs are applied to and removed from the hub cluster using argocd. The CRs have the following waves applied to them, which affects the creation order (lowest to highest) and removal order (highest to lowest):
Namespace: 0
AgentClusterInstall: 1
ClusterDeployment: 1
NMStateConfig: 1
InfraEnv: 1
BareMetalHost: 1
HostFirmwareSettings: 1
ConfigMap: 1 (extra manifests)
ManagedCluster: 2
KlusterletAddonConfig: 2
Description of problem:
Since coreos-installer writes to stdout, its logs are not available for us.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The in-repo UPI docs point to the Terraform configs. If we remove those, we should update the docs to not use them.
Description of problem:
Create network LoadBalancer service, but always get Connection time out when accessing the LB
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-27-135134
How reproducible:
100%
Steps to Reproduce:
1. Create a custom ingresscontroller that uses a Network LB service:
$ Domain="nlb.$(oc get dns.config cluster -o=jsonpath='{.spec.baseDomain}')"
$ oc create -f - << EOF
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: nlb
  namespace: openshift-ingress-operator
spec:
  domain: ${Domain}
  replicas: 3
  endpointPublishingStrategy:
    loadBalancer:
      providerParameters:
        aws:
          type: NLB
        type: AWS
      scope: External
    type: LoadBalancerService
EOF
2. Wait for the ingress NLB service to be ready.
$ oc -n openshift-ingress get svc/router-nlb
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP                                                                     PORT(S)                      AGE
router-nlb   LoadBalancer   172.30.75.134   a765a5eb408aa4a68988e35b72672379-78a76c339ded64fa.elb.us-east-2.amazonaws.com   80:31833/TCP,443:32499/TCP   117s
3. curl the network LB:
$ curl a765a5eb408aa4a68988e35b72672379-78a76c339ded64fa.elb.us-east-2.amazonaws.com -I
<hang>
Actual results:
Connection time out
Expected results:
curl should return 503
Additional info:
the NLB service has the annotation: service.beta.kubernetes.io/aws-load-balancer-type: nlb
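The annotation can be confirmed directly on the service, e.g.:
~~~
$ oc -n openshift-ingress get svc router-nlb \
  -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-type}'
nlb
~~~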
This is a clone of issue OCPBUGS-3358. The following is the description of the original issue:
—
Description of problem:
Due to changes in BUILD-407 which merged into release-4.12, we have a permafailing test `e2e-aws-csi-driver-no-refreshresource` and are unable to merge subsequent pull requests.
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce:
1. Bring up a cluster using the release-4.12, release-4.13, or master branch
2. Run the `e2e-aws-csi-driver-no-refreshresource` test
Actual results:
I1107 05:18:31.131666 1 mount_linux.go:174] Cannot run systemd-run, assuming non-systemd OS
I1107 05:18:31.131685 1 mount_linux.go:175] systemd-run failed with: exit status 1
I1107 05:18:31.131702 1 mount_linux.go:176] systemd-run output: System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
Expected results:
Test should pass
Additional info:
This is a clone of issue OCPBUGS-3501. The following is the description of the original issue:
—
Description of problem:
On clusters serving Route via CRD (i.e. MicroShift), .spec.host values are not automatically assigned during Route creation, as they are on OCP.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
$ cat <<EOF | oc apply --server-side -f -
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  to:
    kind: Service
    name: hello-microshift
EOF
route.route.openshift.io/hello-microshift serverside-applied
$ oc get route hello-microshift -o yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  annotations:
    openshift.io/host.generated: "true"
  creationTimestamp: "2022-11-11T23:53:33Z"
  generation: 1
  name: hello-microshift
  namespace: default
  resourceVersion: "2659"
  uid: cd35cd20-b3fd-4d50-9912-f34b3935acfd
spec:
  host: hello-microshift-default.cluster.local
  to:
    kind: Service
    name: hello-microshift
  wildcardPolicy: None
Expected results:
...
metadata:
  annotations:
    openshift.io/host.generated: "true"
...
spec:
  host: hello-microshift-default.foo.bar.baz
...
Actual results:
Host and host.generated annotation are missing.
Additional info:
This change will be inert on OCP, which already has the correct behavior.
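Until the defaulting is implemented on the CRD path, a possible workaround is to set .spec.host explicitly (domain hypothetical):
~~~
$ cat <<EOF | oc apply --server-side -f -
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: hello-microshift
spec:
  host: hello-microshift-default.foo.bar.baz
  to:
    kind: Service
    name: hello-microshift
EOF
~~~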
This is a clone of issue OCPBUGS-2384. The following is the description of the original issue:
—
Version:
$ openshift-install version
openshift-install 4.10.0-0.nightly-2021-12-23-153012
built from commit 94a3ed9cbe4db66dc50dab8b85d2abf40fb56426
release image registry.ci.openshift.org/ocp/release@sha256:39cacdae6214efce10005054fb492f02d26b59fe9d23686dc17ec8a42f428534
release architecture amd64
Platform: alibabacloud
Please specify:
What happened?
Unexpected error: 'Internal publish strategy is not supported on "alibabacloud" platform'. The Internal publish strategy should be supported for "alibabacloud"; please clarify otherwise, thanks!
$ openshift-install create install-config --dir work
? SSH Public Key /home/jiwei/.ssh/openshift-qe.pub
? Platform alibabacloud
? Region us-east-1
? Base Domain alicloud-qe.devcluster.openshift.com
? Cluster Name jiwei-uu
? Pull Secret [? for help] *********
INFO Install-Config created in: work
$
$ vim work/install-config.yaml
$ yq e '.publish' work/install-config.yaml
Internal
$ openshift-install create cluster --dir work --log-level info
FATAL failed to fetch Metadata: failed to load asset "Install Config": invalid "install-config.yaml" file: publish: Invalid value: "Internal": Internal publish strategy is not supported on "alibabacloud" platform
$
What did you expect to happen?
"publish: Internal" should be supported for platform "alibabacloud".
How to reproduce it (as minimally and precisely as possible)?
Always
Description of problem:
When attempting to load an ISO onto the remote server, the InsertMedia request fails with `Base.1.5.PropertyMissing`. The system is a Mt. Jade Server / GIGABYTE G242-P36. The BMC is provided by MegaRAC.
Version-Release number of selected component (if applicable):
OCP 4.12
How reproducible:
Always
Steps to Reproduce:
1. Create a BMH against such a server
2. Create an InfraEnv and attempt provisioning
Actual results:
Image provisioning failed: Deploy step deploy.deploy failed with BadRequestError: HTTP POST https://192.168.53.149/redfish/v1/Managers/Self/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia returned code 400. Base.1.5.PropertyMissing: The property TransferProtocolType is a required property and must be included in the request. Extended information: [{'@odata.type': '#Message.v1_0_8.Message', 'Message': 'The property TransferProtocolType is a required property and must be included in the request.', 'MessageArgs': ['TransferProtocolType'], 'MessageId': 'Base.1.5.PropertyMissing', 'RelatedProperties': ['#/TransferProtocolType'], 'Resolution': 'Ensure that the property is in the request body and has a valid value and resubmit the request if the operation failed.', 'Severity': 'Warning'}].
Expected results:
Image provisioning to work
Additional info:
The following patch attempted to fix the problem: https://opendev.org/openstack/sushy/commit/ecf1bcc80bd14a1836d015c3dbdb4fd88f2bbd75 but the response code checked by the logic in the patch above is `Base.1.5.ActionParameterMissing`, which doesn't quite match the response code I'm getting, which is `Base.1.5.PropertyMissing`.
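For illustration, the request this BMC expects includes the property explicitly (credentials and image URL hypothetical):
~~~
$ curl -k -u admin:password -X POST \
  https://192.168.53.149/redfish/v1/Managers/Self/VirtualMedia/CD1/Actions/VirtualMedia.InsertMedia \
  -H 'Content-Type: application/json' \
  -d '{"Image": "http://example.com/agent.iso", "TransferProtocolType": "HTTP"}'
~~~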
This is a clone of issue OCPBUGS-4491. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The current version of openshift/router vendors Kubernetes 1.24 packages. OpenShift 4.12 is based on Kubernetes 1.25.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/router/blob/release-4.12/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.24.0.
Expected results:
Kubernetes packages are at version v0.25.0 or later.
Additional info:
Using old Kubernetes API and client packages brings risk of API compatibility issues.
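The bump itself is routine, roughly:
~~~
$ go get k8s.io/api@v0.25.0 k8s.io/apimachinery@v0.25.0 k8s.io/client-go@v0.25.0
$ go mod tidy && go mod vendor
~~~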
This is a clone of issue OCPBUGS-3744. The following is the description of the original issue:
—
Description of problem:
Egress router POD creation on Openshift 4.11 is failing with the below error.
~~~
Nov 15 21:51:29 pltocpwn03 hyperkube[3237]: E1115 21:51:29.467436 3237 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy(c965a287-28aa-47b6-9e79-0cc0e209fcf2)\\\": rpc error: code = Unknown desc = failed to create pod network sandbox k8s_stage-wfe-proxy-ext-qrhjw_stage-wfe-proxy_c965a287-28aa-47b6-9e79-0cc0e209fcf2_0(72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344): error adding pod stage-wfe-proxy_stage-wfe-proxy-ext-qrhjw to CNI network \\\"multus-cni-network\\\": plugin type=\\\"multus\\\" name=\\\"multus-cni-network\\\" failed (add): [stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw/c965a287-28aa-47b6-9e79-0cc0e209fcf2:openshift-sdn]: error adding container to network \\\"openshift-sdn\\\": CNI request failed with status 400: 'could not open netns \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": unknown FS magic on \\\"/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669\\\": 1021994\\n'\"" pod="stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw" podUID=c965a287-28aa-47b6-9e79-0cc0e209fcf2
~~~
I have checked the SDN POD log from the node where the egress router POD is failing, and I can see the below error message.
~~~
2022-11-15T21:51:29.283002590Z W1115 21:51:29.282954 181720 pod.go:296] CNI_ADD stage-wfe-proxy/stage-wfe-proxy-ext-qrhjw failed: could not open netns "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": unknown FS magic on "/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669": 1021994
~~~
CRI-O is logging the below event, and looking at the log it seems the namespace has been created on the node.
~~~
Nov 15 21:51:29 pltocpwn03 crio[3150]: time="2022-11-15 21:51:29.307184956Z" level=info msg="Got pod network &{Name:stage-wfe-proxy-ext-qrhjw Namespace:stage-wfe-proxy ID:72bcf9e52b199061d6e651e84b0892efc142601b2442c2d00b92a1ba23208344 UID:c965a287-28aa-47b6-9e79-0cc0e209fcf2 NetNS:/var/run/netns/8c5ca402-3381-4935-baed-ea454161d669 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
~~~
Version-Release number of selected component (if applicable):
4.11.12
How reproducible:
Not Sure
Steps to Reproduce:
1. 2. 3.
Actual results:
Egress router POD is failing to create. Sample application could be created without any issue.
Expected results:
Egress router POD should get created
Additional info:
Egress router POD is created following below document and it does contain pod.network.openshift.io/assign-macvlan: "true" annotation. https://docs.openshift.com/container-platform/4.11/networking/openshift_sdn/deploying-egress-router-layer3-redirection.html#nw-egress-router-pod_deploying-egress-router-layer3-redirection
This is a clone of issue OCPBUGS-3164. The following is the description of the original issue:
—
During the first bootstrap boot we need crio and kubelet on the disk, so we start the release-image-pivot systemd task. However, it is not blocking bootkube, so these two run in parallel.
release-image-pivot restarts the node to apply the new OS image, which may leave bootkube in an inconsistent state. This task should run before bootkube.
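In systemd terms this is an ordering/dependency fix; a minimal sketch, assuming the unit names from the description:
~~~
# drop-in for bootkube.service (unit names assumed)
[Unit]
Wants=release-image-pivot.service
After=release-image-pivot.service
~~~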
This ticket is linked with
https://issues.redhat.com/browse/SDA-8177
https://issues.redhat.com/browse/SDA-8178
As a summary, a base domain for a hosted cluster may already contain the "cluster-name".
But it seems that Hypershift also encodes it during some reconciliation step:
https://github.com/openshift/hypershift/blob/main/support/globalconfig/dns.go#L20
Then when using a DNS base domain like:
"rosa.lponce-prod-01.qtii.p3.openshiftapps.com"
we will have A records like:
"*.apps.lponce-prod-01.rosa.lponce-prod-01.qtii.p3.openshiftapps.com"
The expected behaviour would be that given a DNS base domain:
"rosa.lponce-prod-01.qtii.p3.openshiftapps.com"
The resulting wildcard for Ingress would be:
"*.apps.rosa.lponce-prod-01.qtii.p3.openshiftapps.com"
Note that trying to configure a specific IngressSpec for a hosted cluster didn't work for our case, as the wildcards records are not created.
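A minimal shell sketch of the guard the reconciliation presumably needs (logic assumed):
~~~
BASE_DOMAIN="rosa.lponce-prod-01.qtii.p3.openshiftapps.com"
CLUSTER_NAME="lponce-prod-01"
if [[ ".${BASE_DOMAIN}." == *".${CLUSTER_NAME}."* ]]; then
  echo "*.apps.${BASE_DOMAIN}"                  # desired: base domain already carries the cluster name
else
  echo "*.apps.${CLUSTER_NAME}.${BASE_DOMAIN}"  # current behavior duplicates the name
fi
~~~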
This is a clone of issue OCPBUGS-4724. The following is the description of the original issue:
—
Description of problem: Installing OCP 4.12 on top of Openstack 16.1 following the multi-availability-zone installation creates a cluster where the egressIP annotations ("cloud.network.openshift.io/egress-ipconfig") are created with an empty value for the workers:
$ oc get nodes
NAME                          STATUS   ROLES                  AGE   VERSION
ostest-kncvv-master-0         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-master-1         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-master-2         Ready    control-plane,master   9h    v1.25.4+86bd4ff
ostest-kncvv-worker-0-qxr5g   Ready    worker                 8h    v1.25.4+86bd4ff
ostest-kncvv-worker-1-bmvvv   Ready    worker                 8h    v1.25.4+86bd4ff
ostest-kncvv-worker-2-pbgww   Ready    worker                 8h    v1.25.4+86bd4ff
$ oc get node ostest-kncvv-worker-0-qxr5g -o json | jq -r '.metadata.annotations'
{
  "alpha.kubernetes.io/provided-node-ip": "10.196.2.156",
  "cloud.network.openshift.io/egress-ipconfig": "null",
  "csi.volume.kubernetes.io/nodeid": "{\"cinder.csi.openstack.org\":\"8327aef0-c6a7-4bf6-8f8f-d25c9abd9bce\",\"manila.csi.openstack.org\":\"ostest-kncvv-worker-0-qxr5g\"}",
  "k8s.ovn.org/host-addresses": "[\"10.196.2.156\",\"172.17.5.154\"]",
  "k8s.ovn.org/l3-gateway-config": "{\"default\":{\"mode\":\"shared\",\"interface-id\":\"br-ex_ostest-kncvv-worker-0-qxr5g\",\"mac-address\":\"fa:16:3e:7e:b5:70\",\"ip-addresses\":[\"10.196.2.156/16\"],\"ip-address\":\"10.196.2.156/16\",\"next-hops\":[\"10.196.0.1\"],\"next-hop\":\"10.196.0.1\",\"node-port-enable\":\"true\",\"vlan-id\":\"0\"}}",
  "k8s.ovn.org/node-chassis-id": "fd777b73-aa64-4fa5-b0b1-70c3bebc2ac6",
  "k8s.ovn.org/node-gateway-router-lrp-ifaddr": "{\"ipv4\":\"100.64.0.6/16\"}",
  "k8s.ovn.org/node-mgmt-port-mac-address": "42:e8:4f:42:9f:7d",
  "k8s.ovn.org/node-primary-ifaddr": "{\"ipv4\":\"10.196.2.156/16\"}",
  "k8s.ovn.org/node-subnets": "{\"default\":\"10.128.2.0/23\"}",
  "machine.openshift.io/machine": "openshift-machine-api/ostest-kncvv-worker-0-qxr5g",
  "machineconfiguration.openshift.io/controlPlaneTopology": "HighlyAvailable",
  "machineconfiguration.openshift.io/currentConfig": "rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/desiredConfig": "rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/desiredDrain": "uncordon-rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/lastAppliedDrain": "uncordon-rendered-worker-31323caf2b510e5b81179bb8ec9c150f",
  "machineconfiguration.openshift.io/reason": "",
  "machineconfiguration.openshift.io/state": "Done",
  "volumes.kubernetes.io/controller-managed-attach-detach": "true"
}
Furthermore, Below is observed on openshift-cloud-network-config-controller:
$ oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-5fcdb6fcff-6sddj | grep egress
I1212 00:34:14.498298 1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-2-pbgww
I1212 00:34:15.777129 1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-0-qxr5g
I1212 00:38:13.115115 1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-1-bmvvv
I1212 01:58:54.414916 1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-0-drd5l
I1212 02:01:03.312655 1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-1-h976w
I1212 02:04:11.656408 1 node_controller.go:146] Setting annotation: 'cloud.network.openshift.io/egress-ipconfig: null' on node: ostest-kncvv-worker-2-zxwrv
Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20221206.n.1 4.12.0-0.nightly-2022-12-09-063749
How reproducible:
Always
Steps to Reproduce:
1. Run AZ job on D/S CI (Openshift on Openstack QE CI)
2. Run conformance/serial tests
Actual results:
conformance/serial TCs are failing because it is not finding the egressIP annotation on the workers
Expected results:
Tests passing
Additional info:
Links provided on private comment.
The application dropdown menu uses a custom component with a configuration to favorite applications, like the Project selection menu favorites projects, but its UX is inconsistent in the way it looks and behaves.
The Project selection UI element uses the PatternFly Menu component. It would be better for the Application dropdown menu's look and behavior to be consistent with the PatternFly Menu component.
This is a clone of issue OCPBUGS-2992. The following is the description of the original issue:
—
Description of problem:
The metal3-ironic container image in OKD fails during steps in configure-ironic.sh that look for additional Oslo configuration entries as environment variables to configure the Ironic instance. The mechanism by which it fails in OKD but not OpenShift is that the image for OpenShift happens to have unrelated variables set which match the regex, because it is based on the builder image, but the OKD image is based only on a stream8 image without these unrelated OS_ prefixed variables set. The metal3 pod created in response to even a provisioningNetwork: Disabled Provisioning object will therefore crashloop indefinitely.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Deploy OKD to a bare metal cluster using the assisted-service, with the OKD ConfigMap applied to podman play kube, as in: https://github.com/openshift/assisted-service/tree/master/deploy/podman#okd-configuration
2. Observe the state of the metal3 pod in the openshift-machine-api namespace.
Actual results:
The metal3-ironic container repeatedly exits with nonzero, with the logs ending here:
++ export IRONIC_URL_HOST=10.1.1.21
++ IRONIC_URL_HOST=10.1.1.21
++ export IRONIC_BASE_URL=https://10.1.1.21:6385
++ IRONIC_BASE_URL=https://10.1.1.21:6385
++ export IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050
++ IRONIC_INSPECTOR_BASE_URL=https://10.1.1.21:5050
++ '[' '!' -z '' ']'
++ '[' -f /etc/ironic/ironic.conf ']'
++ cp /etc/ironic/ironic.conf /etc/ironic/ironic.conf_orig
++ tee /etc/ironic/ironic.extra
# Options set from Environment variables
++ echo '# Options set from Environment variables'
++ env
++ grep '^OS_'
++ tee -a /etc/ironic/ironic.extra
Expected results:
The metal3-ironic container starts and the metal3 pod is reported as ready.
Additional info:
This is the PR that introduced pipefail to the downstream ironic-image, which is not yet accepted in the upstream: https://github.com/openshift/ironic-image/pull/267/files#diff-ab2b20df06f98d48f232d90f0b7aa464704257224862780635ec45b0ce8a26d4R3
This is the line that's failing: https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/scripts/configure-ironic.sh#L57
This is the image base that OpenShift uses for ironic-image (before rewriting in ci-operator): https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.ocp#L9
Here is where the relevant environment variables are set in the builder images for OCP: https://github.com/openshift/builder/blob/973602e0e576d7eccef4fc5810ba511405cd3064/hack/lib/build/version.sh#L87
Here is the final FROM line in the OKD image build (just stream8): https://github.com/openshift/ironic-image/blob/4838a077d849070563b70761957178055d5d4517/Dockerfile.okd#L9
This results in the following differences between the two images:
$ podman run --rm -it --entrypoint bash quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:519ac06836d972047f311de5e57914cf842716e22a1d916a771f02499e0f235c -c 'env | grep ^OS_'
OS_GIT_MINOR=11
OS_GIT_TREE_STATE=clean
OS_GIT_COMMIT=97530a7
OS_GIT_VERSION=4.11.0-202210061001.p0.g97530a7.assembly.stream-97530a7
OS_GIT_MAJOR=4
OS_GIT_PATCH=0
$ podman run --rm -it --entrypoint bash quay.io/openshift/okd-content@sha256:6b8401f8d84c4838cf0e7c598b126fdd920b6391c07c9409b1f2f17be6d6d5cb -c 'env | grep ^OS_'
Here is what the OS_ prefixed variables should be used for:
https://github.com/metal3-io/ironic-image/blob/807a120b4ce5e1675a79ebf3ee0bb817cfb1f010/README.md?plain=1#L36
https://opendev.org/openstack/oslo.config/src/commit/84478d83f87e9993625044de5cd8b4a18dfcaf5d/oslo_config/sources/_environment.py
It's worth noting that ironic.extra is not consumed anywhere, and is simply being used here to save off the variables that Oslo _might_ be consuming (it won't consume the variables that are present in the OCP builder image, though they do get caught by this regex). With pipefail set, grep returns non-zero when it fails to find an environment variable that matches the regex, as in the case of the OKD ironic-image builds.
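The failure mode is easy to reproduce in isolation; a sketch of the failing pattern and one common fix:
~~~
set -euo pipefail
# failing pattern: grep returns 1 on no matches, and pipefail propagates it
env | grep '^OS_' | tee -a /etc/ironic/ironic.extra
# one common fix: tolerate grep's "no match" status
{ env | grep '^OS_' || true; } | tee -a /etc/ironic/ironic.extra
~~~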
This is a clone of issue OCPBUGS-95. The following is the description of the original issue:
—
In an OpenShift cluster with OpenShiftSDN network plugin with egressIP and NMstate operator configured, there are some conditions when the egressIP is deconfigured from the network interface.
The bug is 100% reproducible.
Steps for reproducing the issue are:
1. Install a cluster with OpenShiftSDN network plugin.
2. Configure egressip for a project.
3. Install NMstate operator.
4. Create a NodeNetworkConfigurationPolicy.
5. Identify on which node the egressIP is present.
6. Restart the nmstate-handler pod running on the identified node.
7. Verify that the egressIP is no more present.
Restarting the sdn pod related to the identified node will reconfigure the egressIP in the node.
This issue has a high impact since any changes triggered for the NMstate operator will prevent application traffic. For example, in the customer environment, the issue is triggered any time a new node is added to the cluster.
The expectation is that NMstate operator should not interfere with SDN configuration.
This is a clone of issue OCPBUGS-4874. The following is the description of the original issue:
—
OCPBUGS-3278 is supposed to fix the issue where the user was required to provide data about the baremetal hosts (including MAC addresses) in the install-config, even though this data is ignored.
However, we determine whether we should disable the validation by checking the second CLI arg to see if it is agent.
This works when the command is:
openshift-install agent create image --dir=whatever
But fails when the argument is e.g., as in dev-scripts:
openshift-install --log-level=debug --dir=whatever agent create image
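The check needs to locate the first non-flag argument rather than blindly reading the second CLI arg; the equivalent logic, sketched in shell (flags taking separate values not handled):
~~~
first_subcommand() {
  for arg in "$@"; do
    case "$arg" in
      --*) ;;                      # skip flags such as --log-level=debug
      *) echo "$arg"; return ;;    # first positional arg is the subcommand
    esac
  done
}
first_subcommand --log-level=debug --dir=whatever agent create image   # prints: agent
~~~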
Description of problem:
When you migrate a HostedCluster, the AWSEndpointService from the old management cluster conflicts with the new management cluster. The AWSPrivateLink controller does not have any validation for this case, which is needed to make Disaster Recovery HC migration work. The issue shows up when the nodes of the HostedCluster cannot join the new management cluster because the AWSEndpointServiceName still points to the old one.
Version-Release number of selected component (if applicable):
4.12 4.13 4.14
How reproducible:
Follow the migration procedure from the upstream documentation; the nodes in the destination HostedCluster will remain in NotReady state.
Steps to Reproduce:
1. Set up a management cluster with the 4.12/4.13/4.14/main version of the HyperShift operator.
2. Run the in-place node DR Migrate E2E test from this PR https://github.com/openshift/hypershift/pull/2138:
bin/test-e2e \
  -test.v \
  -test.timeout=2h10m \
  -test.run=TestInPlaceUpgradeNodePool \
  --e2e.aws-credentials-file=$HOME/.aws/credentials \
  --e2e.aws-region=us-west-1 \
  --e2e.aws-zones=us-west-1a \
  --e2e.pull-secret-file=$HOME/.pull-secret \
  --e2e.base-domain=www.mydomain.com \
  --e2e.latest-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.previous-release-image="registry.ci.openshift.org/ocp/release:4.13.0-0.nightly-2023-03-17-063546" \
  --e2e.skip-api-budget \
  --e2e.aws-endpoint-access=PublicAndPrivate
Actual results:
The nodes stay in NotReady state
Expected results:
The nodes should join the migrated HostedCluster
Additional info:
Description of problem:
Agent-based installation fails during the 3+1 deployment. I found that the machine-api operator is degraded because the minimum worker replica count is 2, while a 3+1 deployment defines only one worker node.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create agent.iso (openshift-install agent create image) using install-config.yaml and agent-config.yaml (PFA sample files)
2. Deploy a 3+1 cluster using agent.iso
3. Execute "openshift-install agent wait-for install-complete" command to wait for install complete.
Actual results:
Getting below error:
ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.12.0-0.nightly-2022-10-05-053337
ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.12.0-0.nightly-2022-10-05-053337 because minimum worker replica count (2) not yet met: current running replicas 1, waiting for []
INFO Cluster operator machine-api Available is False with Initializing: Operator is initializing
INFO Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
Expected results:
3+1 deployment should be successful.
Additional info:
I found that there is a condition in the machine-api-operator that requires the worker node count to be at least 2, which is preventing the 3+1 deployment: https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/sync.go#L322
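For context, the 3+1 topology is expressed in install-config.yaml roughly like this (a sketch, other fields omitted):
~~~
controlPlane:
  name: master
  replicas: 3
compute:
- name: worker
  replicas: 1
~~~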
Description of problem:
Alert actions are not triggering modal from where storage cluster can be expanded.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
1/1
Steps to Reproduce:
1. Fill up a storage cluster to 80%
2. Alert is seen in cluster dashboard.
3. Click the Add Capacity button
Actual results:
Modal is not launched.
Expected results:
Modal should be launched.
Additional info:
Description of problem:
In looking at jobs on an accepted payload at https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.ci/release/4.12.0-0.ci-2022-08-30-122201, I observed this job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1564589538850902016 with "Undiagnosed panic detected in pod":
pods/openshift-controller-manager-operator_openshift-controller-manager-operator-74bf985788-8v9qb_openshift-controller-manager-operator.log.gz:E0830 12:41:48.029165 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Version-Release number of selected component (if applicable):
4.12
How reproducible:
probably relatively easy to reproduce (but not consistently) given it's happened several times according to this search: https://search.ci.openshift.org/?search=Observed+a+panic%3A+%22invalid+memory+address+or+nil+pointer+dereference%22&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
1. Let nightly payloads run, or run one of the presubmit jobs mentioned in the search above.
Actual results:
Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Expected results:
no panics
Additional info:
Since 4.11, OCP comes with an OperatorHub definition which declares a capability and enables all catalog sources. For OKD we want to enable just community-operators, as users may not have a Red Hat pull secret set.
This commit ensures that the OKD version of the marketplace operator gets its own OperatorHub manifest with a custom set of operator catalogs enabled.
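A sketch of what that OKD-specific manifest could look like (the OperatorHub API fields are standard; the exact source set is assumed):
~~~
apiVersion: config.openshift.io/v1
kind: OperatorHub
metadata:
  name: cluster
spec:
  disableAllDefaultSources: true
  sources:
  - name: community-operators
    disabled: false
~~~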
This is a clone of issue OCPBUGS-186. The following is the description of the original issue:
—
Description of problem:
When resizing the browser window, the PipelineRun task status bar would overlap the status text that says "Succeeded" in the screenshot.
Actual results:
Status text is overlapped by the task status bar
Expected results:
Status text breaks to a newline or gets shortened by "..."
Description of problem:
The restore size in the volumesnapshot output is not the same as the PVC request size.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create IBM cluster. Flexy template: aos-4_12/ipi-on-ibmcloud/versioned-installer-private_cluster-ovn-fips-ci; Payload: 4.12.0-0.nightly-2022-11-29-131548
2. Create sc, pvc, dep
3. Create volumesnapshot from default volumesnapshotclass.
4. Check the volumesnapshot output restore size
sc_pvc_dep.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mysc
parameters:
  profile: 10iops-tier
provisioner: vpc.block.csi.ibm.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc-csi
  namespace: testropatil
spec:
  accessModes:
rohitpatil@ropatil-mac Downloads % oc get sc
NAME   PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
mysc   vpc.block.csi.ibm.io   Delete          WaitForFirstConsumer   true                   2m37s

rohitpatil@ropatil-mac Downloads % oc get pvc,pod -n testropatil
NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc-csi   Bound    pvc-1a014601-8176-4c55-93cf-d408460b9359   26Gi       RWO            mysc           27s

NAME                         READY   STATUS    RESTARTS   AGE
pod/mydep-5477fd946b-w77sw   1/1     Running   0          27s

rohitpatil@ropatil-mac Downloads % oc get volumesnapshot -n testropatil
NAME              READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS        SNAPSHOTCONTENT                                    CREATIONTIME   AGE
my-snapshot-new   true         mypvc-csi                           1Gi           vpc-block-snapshot   snapcontent-a40f3a17-8697-4215-8a2f-77d3d5592c60   29s            32s
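The mismatch can also be read directly from the objects:
~~~
$ oc -n testropatil get volumesnapshot my-snapshot-new -o jsonpath='{.status.restoreSize}'
1Gi
$ oc -n testropatil get pvc mypvc-csi -o jsonpath='{.spec.resources.requests.storage}'
26Gi
~~~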
Actual results:
The volumesnapshot RESTORESIZE is 1Gi, which is not the same as the PVC request size (26Gi).
Expected results:
The volumesnapshot restore size should be the same as the PVC request size.
Additional info:
In 4.12.0-rc.0 some API-server components declare flowcontrol/v1beta1 release manifests:
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.12.0-rc.0-x86_64
$ grep -r flowcontrol.apiserver.k8s.io manifests
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-authentication-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_etcd-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_20_kube-apiserver-operator_08_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-apiserver-operator_09_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
manifests/0000_50_cluster-openshift-controller-manager-operator_10_flowschema.yaml:apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
The APIs are scheduled for removal in Kube 1.26, which will ship with OpenShift 4.13. We want the 4.12 CVO to move to modern APIs in 4.12, so the APIRemovedInNext.*ReleaseInUse alerts are not firing on 4.12. This ticket tracks removing those manifests, or replacing them with a more modern resource type, or some such. Definition of done is that new 4.13 (and with backports, 4.12) nightlies no longer include flowcontrol.apiserver.k8s.io/v1beta1 manifests.
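Where a manifest is kept rather than removed, the mechanical change is to move it off the removed version, e.g. (v1beta2 is served on Kubernetes 1.25 / OCP 4.12):
~~~
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2   # was: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
~~~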
[It] clients should not use APIs that are removed in upcoming releases [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
github.com/openshift/origin/test/extended/apiserver/api_requests.go:27
Nov 18 21:59:06.261: INFO: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
Nov 18 21:59:06.261: INFO: api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
Nov 18 21:59:06.261: INFO: api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
Nov 18 21:59:06.261: INFO: user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
Nov 18 21:59:06.261: INFO: user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
[AfterEach] [sig-arch][Late]
github.com/openshift/origin/test/extended/util/client.go:158
[AfterEach] [sig-arch][Late]
github.com/openshift/origin/test/extended/util/client.go:159
flake: api flowschemas.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 254 times
api horizontalpodautoscalers.v2beta2.autoscaling, removed in release 1.26, was accessed 10 times
api prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io, removed in release 1.26, was accessed 22 times
user/system:admin accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 14 times
user/system:serviceaccount:openshift-cluster-version:default accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 224 times
user/system:serviceaccount:openshift-cluster-version:default accessed prioritylevelconfigurations.v1beta1.flowcontrol.apiserver.k8s.io 22 times
user/system:serviceaccount:openshift-kube-storage-version-migrator:kube-storage-version-migrator-sa accessed flowschemas.v1beta1.flowcontrol.apiserver.k8s.io 16 times
user/system:serviceaccount:openshift-monitoring:kube-state-metrics accessed horizontalpodautoscalers.v2beta2.autoscaling 10 times
Ginkgo exit error 4: exit with code 4
This is required to unblock https://github.com/openshift/origin/pull/27561
This is a clone of issue OCPBUGS-2988. The following is the description of the original issue:
—
Description of problem:
openshift-apiserver, openshift-oauth-apiserver and kube-apiserver pods cannot validate the certificate when trying to reach etcd, reporting certificate validation errors:
}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
W1018 11:36:43.523673 15 logging.go:59] [core] [Channel #186 SubChannel #187] grpc: addrConn.createTransport failed to connect to {
  "Addr": "[2620:52:0:198::10]:2379",
  "ServerName": "2620:52:0:198::10",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
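One way to inspect which SANs the etcd serving certificate actually carries (address taken from the log above; OpenSSL 1.1.1+ syntax):
~~~
$ echo | openssl s_client -connect '[2620:52:0:198::10]:2379' 2>/dev/null \
    | openssl x509 -noout -ext subjectAltName
~~~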
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-18-041406
How reproducible:
100%
Steps to Reproduce:
1. Deploy SNO with single stack IPv6 via ZTP procedure
Actual results:
Deployment times out and some of the operators aren't deployed successfully.
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-10-18-041406   False   False   True    124m   APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
baremetal                                  4.12.0-0.nightly-2022-10-18-041406   True    False   False   112m
cloud-controller-manager                   4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
cloud-credential                           4.12.0-0.nightly-2022-10-18-041406   True    False   False   115m
cluster-autoscaler                         4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
config-operator                            4.12.0-0.nightly-2022-10-18-041406   True    False   False   124m
console
control-plane-machine-set                  4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
csi-snapshot-controller                    4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
dns                                        4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
etcd                                       4.12.0-0.nightly-2022-10-18-041406   True    False   True    121m   ClusterMemberControllerDegraded: could not get list of unhealthy members: giving up getting a cached client after 3 tries
image-registry                             4.12.0-0.nightly-2022-10-18-041406   False   True    True    104m   Available: The registry is removed...
ingress                                    4.12.0-0.nightly-2022-10-18-041406   True    True    True    111m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
insights                                   4.12.0-0.nightly-2022-10-18-041406   True    False   False   118s
kube-apiserver                             4.12.0-0.nightly-2022-10-18-041406   True    False   False   102m
kube-controller-manager                    4.12.0-0.nightly-2022-10-18-041406   True    False   True    107m   GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp [fd02::3c5f]:9091: connect: connection refused
kube-scheduler                             4.12.0-0.nightly-2022-10-18-041406   True    False   False   107m
kube-storage-version-migrator              4.12.0-0.nightly-2022-10-18-041406   True    False   False   117m
machine-api                                4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
machine-approver                           4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
machine-config                             4.12.0-0.nightly-2022-10-18-041406   True    False   False   115m
marketplace                                4.12.0-0.nightly-2022-10-18-041406   True    False   False   116m
monitoring                                                                      False   True    True    98m    deleting Thanos Ruler Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, deleting UserWorkload federate Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, reconciling Alertmanager Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io alertmanager-main), reconciling Thanos Querier Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus API Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io prometheus-k8s), prometheuses.monitoring.coreos.com "k8s" not found
network                                    4.12.0-0.nightly-2022-10-18-041406   True    False   False   124m
node-tuning                                4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
openshift-apiserver                        4.12.0-0.nightly-2022-10-18-041406   True    False   False   104m
openshift-controller-manager               4.12.0-0.nightly-2022-10-18-041406   True    False   False   107m
openshift-samples                                                               False   True    False   103m   The error the server was unable to return a response in the time allotted, but may still be processing the request (get imagestreams.image.openshift.io) during openshift namespace cleanup has left the samples in an unknown state
operator-lifecycle-manager                 4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-10-18-041406   True    False   False   106m
service-ca                                 4.12.0-0.nightly-2022-10-18-041406   True    False   False   124m
storage                                    4.12.0-0.nightly-2022-10-18-041406   True    False   False   111m
Expected results:
Deployment succeeds without issues.
Additional info:
I was unable to run must-gather so attaching the pods logs copied from the host file system.
Description of problem:
OLM has a dependency on openshift/cluster-policy-controller. This project had dependencies with v0.0.0 versions, which, due to a bug in ART, was causing issues building the OLM image. To fix this, we have to update the dependencies in the cluster-policy-controller project to point to actual versions. This was already done:
* https://github.com/openshift/cluster-policy-controller/pull/103
* https://github.com/openshift/cluster-policy-controller/pull/101
These changes already made it to the 4.14 and 4.13 branches of cluster-policy-controller. The backport to 4.12 is: https://github.com/openshift/cluster-policy-controller/pull/102
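Spotting the offending pseudo-versions is straightforward:
~~~
$ go list -m all | grep ' v0.0.0'   # dependencies pinned to v0.0.0 pseudo-versions
$ go mod tidy                       # after bumping them to real tagged releases
~~~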
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When a user tries to run `oc debug`, they end up getting errors about pod security labels:
Ensure the target namespace has the appropriate security level set or consider creating a dedicated privileged namespace using:
"oc create ns <namespace> -o yaml | oc label -f - security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/audit=privileged pod-security.kubernetes.io/warn=privileged".
Original error: pods "ip-10-0-129-209ec2internal-debug" is forbidden: violates PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true, hostIPC=true), privileged (container "container-00" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "container-00" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "container-00" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volume "host" uses restricted volume type "hostPath"), runAsNonRoot != true (pod or container "container-00" must set securityContext.runAsNonRoot=true), runAsUser=0 (container "container-00" must not set runAsUser=0), seccompProfile (pod or container "container-00" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
command failed, 3 retries left
This happens since https://docs.openshift.com/container-platform/4.11/authentication/understanding-and-managing-pod-security-admission.html
Fixing it requires the user to run something like
oc create ns fips-check -o yaml | \
oc label -f - \
security.openshift.io/scc.podSecurityLabelSync=false \
pod-security.kubernetes.io/enforce=privileged \
pod-security.kubernetes.io/audit=privileged \
pod-security.kubernetes.io/warn=privileged
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Try to run `oc debug node/....` in a new namespace
Actual results:
Error message
Expected results:
oc debug works without the user having to perform additional steps. If namespace is omitted, perhaps oc debug could create a temporary one with the correct pod security labels?
Additional info:
Description of the problem:
Noticed there were no thread IDs in the assisted-installer logs when debugging a 240-node cluster deployment with MCE (slack thread), making it difficult to debug.
How reproducible: 100%
Steps to reproduce:
1. Create cluster using assisted service and start the install
2. Look at the assisted-installer logs
Actual results:
Logs look like
time="2022-07-14T16:17:31Z" level=info msg="Start complete installation step, with params success: true, error info: "
Expected results: Thread ID would also print so we can understand which thread it came from
Adding setReportCaller to true will also help
cloud-controller-manager does not react to changes to infrastructure secrets (in the OpenStack case: clouds.yaml).
As a consequence, if credentials are rotated (and the old ones are rendered useless), load balancer creation and deletion will no longer succeed. Restarting the controller fixes the issue on a live cluster.
Logs show that it couldn't find the application credentials:
Dec 19 12:58:58.909: INFO: At 2022-12-19 12:53:58 +0000 UTC - event for udp-lb-default-svc: {service-controller } EnsuringLoadBalancer: Ensuring load balancer
Dec 19 12:58:58.909: INFO: At 2022-12-19 12:53:58 +0000 UTC - event for udp-lb-default-svc: {service-controller } SyncLoadBalancerFailed: Error syncing load balancer: failed to ensure load balancer: failed to get subnet to create load balancer for service e2e-test-openstack-q9jnk/udp-lb-default-svc: Unable to re-authenticate: Expected HTTP response code [200 204 300] when accessing [GET https://compute.rdo.mtl2.vexxhost.net/v2.1/0693e2bb538c42b79a49fe6d2e61b0fc/servers/fbeb21b8-05f0-4734-914e-926b6a6225f1/os-interface], but got 401 instead {"error": {"code": 401, "title": "Unauthorized", "message": "The request you have made requires authentication."}}: Resource not found: [POST https://identity.rdo.mtl2.vexxhost.net/v3/auth/tokens], error message: {"error":{"code":404,"message":"Could not find Application Credential: 1b78233956b34c6cbe5e1c95445972a4.","title":"Not Found"}}
OpenStack CI has been instrumented to restart CCM after credentials rotation, so that we silence this particular issue and avoid masking any other. That workaround must be reverted once this bug is fixed.
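For reference, the restart that CI performs amounts to something like this (deployment name assumed for the OpenStack CCM):
~~~
$ oc -n openshift-cloud-controller-manager rollout restart \
    deployment/openstack-cloud-controller-manager
~~~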
This is a clone of issue OCPBUGS-3277. The following is the description of the original issue:
—
I saw this occur one time when running installs in a continuous loop. This was with COMPACT_IPV4 in a non-disconnected setup.
WaitForBootrapComplete shows it can't access the API
level=info msg=Unable to retrieve cluster metadata from Agent Rest API: no clusterID known for the cluster
level=debug msg=cluster is not registered in rest API
level=debug msg=infraenv is not registered in rest API
This is because create-cluster-and-infraenv.service failed
Failed Units: 2
create-cluster-and-infraenv.service
NetworkManager-wait-online.service
The agentbasedinstaller register command wasn't able to retrieve the image to get the version
Nov 03 23:03:24 master-0 create-cluster-and-infraenv[2702]: time="2022-11-03T23:03:24Z" level=error msg="command 'oc adm release info -o template --template '\{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 --registry-config=/tmp/registry-config3852044519' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451: Get \"https://registry.ci.openshift.org/v2/\": dial tcp: lookup registry.ci.openshift.org on 192.168.111.1:53: read udp 192.168.111.80:51315->192.168.111.1:53: i/o timeout\n"
Nov 03 23:03:24 master-0 create-cluster-and-infraenv[2702]: time="2022-11-03T23:03:24Z" level=error msg="failed to get image openshift version from release image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451" error="command 'oc adm release info -o template --template '\{{.metadata.version}}' --insecure=false registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451 --registry-config=/tmp/registry-config3852044519' exited with non-zero exit code 1: \nerror: unable to read image registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-10-25-210451: Get \"https://registry.ci.openshift.org/v2/\": dial tcp: lookup registry.ci.openshift.org on 192.168.111.1:53: read udp 192.168.111.80:51315->192.168.111.1:53: i/o timeout\n"
This occurs when attempting to get the release here:
https://github.com/openshift/assisted-service/blob/master/cmd/agentbasedinstaller/register.go#L58
We should add a retry mechanism or restart the service to handle spurious network failures like this.
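A minimal sketch of such a retry wrapper (attempt count, delay, and the REGISTRY_CONFIG/RELEASE_IMAGE variables are placeholders):
~~~
for attempt in 1 2 3 4 5; do
  version=$(oc adm release info -o template --template '{{.metadata.version}}' \
    --registry-config="$REGISTRY_CONFIG" "$RELEASE_IMAGE") && break
  echo "attempt $attempt to read release info failed, retrying..." >&2
  sleep 10
done
~~~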
This is a clone of issue OCPBUGS-5016. The following is the description of the original issue:
—
Description of problem:
When editing any pipeline in the OpenShift console, the correct content is not loaded (the editor shows the initial content instead).
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> YAML view -> Cancel -> Actions -> Edit Pipeline -> YAML view
Actual results:
The displayed content is incorrect.
Expected results:
Get the content of the current pipeline, not the "pipeline create" content.
Additional info:
If you cancel or save in the "Pipeline Builder" view after "Edit Pipeline", you get the expected content: Developer -> Pipeline -> select pipeline -> Details -> Actions -> Edit Pipeline -> Pipeline builder -> Cancel -> Actions -> Edit Pipeline -> YAML view: resource content displays normally.
Description of problem:
On Pod definitions gathering, the operator should obfuscate particular environment variables (HTTP_PROXY and HTTPS_PROXY) from containers by default. Pods from the control plane can have those variables injected from the cluster-wide proxy, and they may contain values such as "user:password@http://6.6.6.6:1234".
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. In order to change deployments, scale down:
   * cluster-version-operator
   * cluster-monitoring-operator
   * prometheus-operator
2. Introduce a new environment variable on the alertmanager-main StatefulSet with either or both of HTTP_PROXY, HTTPS_PROXY. Any non-empty value will do.
3. Run the insights-operator to gather the pod definitions.
4. Check in the archive (usually config/pod/openshift-monitoring/alertmanager-main-0.json) that the target environment variable values are obfuscated.
Actual results:
... "spec": { ... "containers": { ... "env": [ { "name": "HTTP_PROXY" "value": "jdow:1qa2wd@[http://8.8.8.8:8080|http://8.8.8.8:8080/]" } ] } } ...
Expected results:
... "spec": { ... "containers": { ... "env": [ { "name": "HTTP_PROXY" "value": "<obfuscated>" } ] } } ...
Additional info:
This is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2083087 (OCPBUGSM-44070) to backport this issue.
Description of problem:
"Delete dependent objects of this resource" is a bit of confusing for some users because when creating the Application in Dev console not only the deployment but also IS, route, svc, secret objects will be created as well. When deleting the Application (in fact it is deployment), there is an option called "Delete dependent objects of this resource" and some users might think this means the IS, route, svc and any other objects which are created alongside with the deployment will be deleted as well
Version-Release number of selected component (if applicable):
4.8
How reproducible:
Always
Steps to Reproduce:
1. Create Application in Dev console
2. Delete the deployment
3. Check "Delete dependent objects of this resource"
Actual results:
Only the deployment is deleted; the IS, svc, and route are not deleted.
Expected results:
We should either change the description of this option, or actually delete the IS, svc, route, and any other objects created under this Application.
Additional info:
This is a clone of issue OCPBUGS-6663. The following is the description of the original issue:
—
Description of problem:
When running openshift-install agent create image, and the install-config.yaml does not contain platform baremetal settings (except for VIPs), warnings are still generated as below:

DEBUG Loading Install Config...
WARNING Platform.Baremetal.ClusterProvisioningIP: 172.22.0.3 is ignored
DEBUG Platform.Baremetal.BootstrapProvisioningIP: 172.22.0.2 is ignored
WARNING Platform.Baremetal.ExternalBridge: baremetal is ignored
WARNING Platform.Baremetal.ExternalMACAddress: 52:54:00:12:e1:68 is ignored
WARNING Platform.Baremetal.ProvisioningBridge: provisioning is ignored
WARNING Platform.Baremetal.ProvisioningMACAddress: 52:54:00:82:91:8d is ignored
WARNING Platform.Baremetal.ProvisioningNetworkCIDR: 172.22.0.0/24 is ignored
WARNING Platform.Baremetal.ProvisioningDHCPRange: 172.22.0.10,172.22.0.254 is ignored
WARNING Capabilities: %!!(MISSING)s(*types.Capabilities=<nil>) is ignored

It looks like these fields are populated with values from libvirt, as shown in .openshift_install_state.json:

"platform": {
  "baremetal": {
    "libvirtURI": "qemu:///system",
    "clusterProvisioningIP": "172.22.0.3",
    "bootstrapProvisioningIP": "172.22.0.2",
    "externalBridge": "baremetal",
    "externalMACAddress": "52:54:00:12:e1:68",
    "provisioningNetwork": "Managed",
    "provisioningBridge": "provisioning",
    "provisioningMACAddress": "52:54:00:82:91:8d",
    "provisioningNetworkInterface": "",
    "provisioningNetworkCIDR": "172.22.0.0/24",
    "provisioningDHCPRange": "172.22.0.10,172.22.0.254",
    "hosts": null,
    "apiVIPs": [
      "10.1.101.7",
      "2620:52:0:165::7"
    ],
    "ingressVIPs": [
      "10.1.101.9",
      "2620:52:0:165::9"
    ]

The install-config.yaml used to generate this has the following snippet:

platform:
  baremetal:
    apiVIPs:
    - 10.1.101.7
    - 2620:52:0:165::7
    ingressVIPs:
    - 10.1.101.9
    - 2620:52:0:165::9
additionalTrustBundle: |
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Happens every time
Steps to Reproduce:
1. Use install-config.yaml with no platform baremetal fields except for the VIPs 2. run openshift-install agent create image
Actual results:
Warning messages are output
Expected results:
No warning messages
Additional info:
Description: agent.iso is created even when agent-config.yaml contains an invalid macAddress
Here is the content of agent-config.yaml
--------------------------------------------
kind: AgentConfig
metadata:
  name: sno-cluster
spec:
  rendezvousIP: 192.168.111.80
  hosts:
How reproducible:
always
Repro Steps:
1) Get the latest agent-installer and build
git clone -b agent-installer https://github.com/openshift/installer.git
cd installer/
hack/build.sh
2) Create agent.iso using agent-config and install-config files.
Expected: The installer should throw an error message, something like this: hosts.host[0].interfaces.macAddress: Invalid value: "0000": macAddress must be a valid MAC address.
And
hosts.host[0].networkConfig.interfaces.macAddress: Invalid value: "00000": macAddress must be a valid MAC address.
Actual: Able to create agent.iso image.
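A minimal sketch of the missing check, using only the Go standard library; the function name and the exact field paths in the error are illustrative, not the installer's actual validation code:

```go
// Package name is illustrative of where install-config validation lives.
package validation

import (
	"fmt"
	"net"
)

// validateMAC returns an error in the style the report suggests when the
// configured value is not a parseable MAC address. net.ParseMAC rejects
// values like "0000" and "00000".
func validateMAC(fieldPath, value string) error {
	if _, err := net.ParseMAC(value); err != nil {
		return fmt.Errorf("%s: Invalid value: %q: macAddress must be a valid MAC address", fieldPath, value)
	}
	return nil
}
```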
This is a clone of issue OCPBUGS-8741. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5889. The following is the description of the original issue:
—
Description of problem:
Customer running a cluster with the following config: 4.10.23, AWS/IPI, OVNKubernetes.

Observed that in a namespace with networkpolicy rules enabled, and a policy for allow-from-same-namespace, pods will have different behaviors when calling service IPs hosted in that same namespace.

Example:
Deployment1 with two pods (A/B) exists in namespace <EXAMPLE>.
Deployment2 with one pod hosting a service and route exists in the same namespace.
Pod A will unexpectedly stop being able to call the service IP of deployment2; pod B never loses access to the service IP of deployment2.
Pod A remains able to call out through the br-ex interface, tag the ROUTE address, and reach the deployment2 pod via haproxy (this never breaks).
Pod A remains able to reach the local gateway on the node.
The host node for pod A is able to reach the service IP of deployment2 and remains able to do so, even while pod A is impacted.

The issue can be mitigated by applying a label or annotation to pod A, which immediately allows it to reach internal service IPs again within the namespace. I suspect that the networkpolicy rules are failing to stay updated on the pod object, and the pod needs to be 'refreshed' (a label appended or some other update) to force the pod to 'remember' that it is allowed to call peers within the namespace.

Additional relevant data:
- pods affected throughout the cluster; no specific project/service/deployment/application
- pods ride on different nodes all the time (no one node affected)
- pods with the fail condition are on the same node as other pods without the issue
- multiple namespaces see this problem
- all namespaces are using similar networkpolicy isolation and allow-from-same-namespace rulesets (which match our documentation on syntax).
Version-Release number of selected component (if applicable):
4.10.23
How reproducible:
every time --> unclear what the trigger is that causes this; pods will be functional and several hours/days later, will stop being able to talk to peer services.
Steps to Reproduce:
1. Deploy a pod with at least two replicas in a namespace with an allow-from-same-namespace network policy.
2. Deploy a different service and route (for example, an httpd instance) in the same namespace.
3. Observe that one of the two pods may fail to reach the service IP after some time.
4. Apply an annotation to the pod and it is immediately able to reach services again.
Actual results:
pods intermittently fail to reach internal service addresses, but are able to be interacted with otherwise, and can reach upstream/external addresses including routes on cluster.
Expected results:
pods should not lose access to service network peers.
Additional info:
see next comments for relevant uploads/sosreports and inspects.
Description of problem:
TestEditUnmanagedPodDisruptionBudget flakes in the console-operator e2e
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Flake
Steps to Reproduce:
1. Check https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console-operator/665/pull-ci-openshift-console-operator-master-e2e-aws-operator/1562005782164148224
2.
3.
Actual results:
Expected results:
Additional info:
There is a chance that the PDB instance is not present, since prior to the Unmanaged* test cases the RemoveTest runs, which removes all the console resources (Pods, Services, PDBs, ...).
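A possible guard, sketched with client-go; the polling interval, timeout, and the PDB namespace/name parameters are assumptions, not the actual console-operator test code:

```go
// Package name is illustrative of an e2e test helper.
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPDB polls until the console PDB reappears after RemoveTest has
// deleted the console resources, so the edit test has something to edit.
func waitForPDB(client kubernetes.Interface, namespace, name string) error {
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		_, err := client.PolicyV1().PodDisruptionBudgets(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // not there yet; keep polling
		}
		return true, nil
	})
}
```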
This is a clone of issue OCPBUGS-2824. The following is the description of the original issue:
—
Description of problem:
When users resize their browser to a small size, the deployment details panel on the Topology page overlaps the drop-down list component, which prevents the user from using the drop-down list; all content in the drop-down list is covered.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-24-103753
How reproducible:
Always
Steps to Reproduce:
1. Log in to OCP, go to the developer perspective -> Topology page.
2. Click and open one resource (e.g. a deployment), making sure the resource sidebar is open.
3. Resize the browser window to a small size.
4. Check whether the drop-down list component is covered.
Actual results:
All drop-down list components are covered by the deployment details panel (see attachment for more details).
Expected results:
The drop-down list component should be displayed on top, and it should keep working even when the window is small.
Additional info:
This is a clone of issue OCPBUGS-1427. The following is the description of the original issue:
—
Description of problem:
The jump looks worst on GCP, but on closer inspection Azure and AWS jumped as well, just not as high.
Disruption data indicates that the image registry on GCP was averaging around 30-40 seconds of disruption during an upgrade, until Aug 27th when it jumped to 125-135 seconds and has remained there ever since.
We see similar spikes in ingress-to-console and ingress-to-oauth. NOTE: image registry backend is also behind ingress, so all three are ingress related disruption.
https://datastudio.google.com/s/uBC4zuBFdTE
These charts show the problem on Aug 27 for registry, ingress to console, and ingress to oauth.
sdn network type appears unaffected.
Something merged Aug 26-27 that caused a significant change for anything behind ingress using ovn on gcp.
Description of problem:
When the cluster install finished, the wait-for install-complete command didn't exit as expected.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Get the latest agent-installer and build the image:
   git clone https://github.com/openshift/installer.git
   cd installer/
   hack/build.sh
   Edit the agent-config and install-config yaml files, then create the agent.iso image:
   OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/openshift-release-dev/ocp-release:4.12.0-ec.3-x86_64 bin/openshift-install agent create image --log-level debug
2. Install the SNO cluster:
   virt-install --connect qemu:///system -n control-0 -r 33000 --vcpus 8 --cdrom ./agent.iso --disk pool=installer,size=120 --boot uefi,hd,cdrom --os-variant=rhel8.5 --network network=default,mac=52:54:00:aa:aa:aa --wait=-1
3. Run 'bin/openshift-install agent wait-for bootstrap-complete --log-level debug'; the command finished as expected.
4. After bootstrap completion, run 'bin/openshift-install agent wait-for install-complete --log-level debug'; the command didn't finish as expected.
Actual results:
Expected results:
Additional info:
Currently, we have this validation https://github.com/openshift/installer/blob/master/pkg/asset/agent/installconfig_test.go#L103 which checks if the platform is none then the number of control planes should be 1 and workers should be zero.
We need another validation: if the number of control planes is 1 and workers is zero, then in install-config.yaml the platform can only be set to none, and in agent-cluster-install.yaml the platformType should only be set to none. If we try to do SNO (i.e. control planes is 1 and workers are zero) with e.g. platform: baremetal, then assisted-service will reject it, so we should catch it as early as possible.
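A minimal sketch of that validation in Go, with simplified stand-in parameters rather than the installer's real install-config types:

```go
// Package name is illustrative of the agent-based install asset code.
package agent

import "fmt"

// validateSNOPlatform enforces that a 1 control-plane / 0 worker topology is
// only accepted with platform "none", catching the mismatch before
// assisted-service rejects it later in the flow.
func validateSNOPlatform(controlPlaneReplicas, workerReplicas int, platform, aciPlatformType string) error {
	if controlPlaneReplicas == 1 && workerReplicas == 0 {
		if platform != "none" {
			return fmt.Errorf("install-config platform must be \"none\" for a single-node topology, got %q", platform)
		}
		if aciPlatformType != "none" {
			return fmt.Errorf("agent-cluster-install platformType must be \"none\" for a single-node topology, got %q", aciPlatformType)
		}
	}
	return nil
}
```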
Description of problem:
Kebab menu for helm repository is showing inconsistent behavior
Version-Release number of selected component (if applicable): 4.12
How reproducible: Always
Steps to Reproduce:
1. Create some helm chart repository
2. Go to the Helm page and switch to the repositories tab
3. Open kebab menu for different repos
Actual results:
Menus are overlapping
Expected results:
The menu should work properly; one menu should close before opening a new one
Additional info:
A video has been added for reference.
Since openshift/cluster-ingress-operator#817 merged, the e2e-aws-operator CI job has been failing for multiple PRs in the cluster-ingress-operator repository. In particular, the TestScopeChange test has been consistently failing. Example failures:
The operator is repeatedly logging errors like the following in those failing CI jobs:
ERROR operator.dns_controller controller/controller.go:121 failed to delete dnsrecord; will retry {"dnsrecord": {"metadata":{"name":"scope-wildcard","namespace":"openshift-ingress-operator","uid":"2cb9936f-d6a0-4377-b3ed-c5167c5e9e4d","resourceVersion":"42217","generation":2,"creationTimestamp":"2022-10-13T16:19:23Z","deletionTimestamp":"2022-10-13T16:20:27Z","deletionGracePeriodSeconds":0,"labels":{"ingresscontroller.operator.openshift.io/owning-ingresscontroller":"scope"},"ownerReferences":[{"apiVersion":"operator.openshift.io/v1","kind":"IngressController","name":"scope","uid":"713ac1c5-451b-42d1-89fd-c3910eb80fe3","controller":true,"blockOwnerDeletion":true}],"finalizers":["operator.openshift.io/ingress-dns"],"managedFields":[{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:23Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:finalizers":{".":{},"v:\"operator.openshift.io/ingress-dns\"":{}},"f:labels":{".":{},"f:ingresscontroller.operator.openshift.io/owning-ingresscontroller":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"713ac1c5-451b-42d1-89fd-c3910eb80fe3\"}":{}}},"f:spec":{".":{},"f:dnsManagementPolicy":{},"f:dnsName":{},"f:recordTTL":{},"f:recordType":{},"f:targets":{}}}},{"manager":"ingress-operator","operation":"Update","apiVersion":"ingress.operator.openshift.io/v1","time":"2022-10-13T16:19:24Z","fieldsType":"FieldsV1","fieldsV1":{"f:status":{".":{},"f:observedGeneration":{},"f:zones":{}}},"subresource":"status"}]},"spec":{"dnsName":"*.scope.ci-op-x1j7dsgt-43abb.origin-ci-int-aws.dev.rhcloud.com.","targets":["af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30,"dnsManagementPolicy":"Managed"},"status":{"zones":[{"dnsZone":{"tags":{"Name":"ci-op-x1j7dsgt-43abb-45zhd-int","kubernetes.io/cluster/ci-op-x1j7dsgt-43abb-45zhd":"owned"}},"conditions":[{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:23Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]},{"dnsZone":{"id":"Z2GYOLTZHS5VK"},"conditions":[{"type":"Published","status":"True","lastTransitionTime":"2022-10-13T16:19:24Z","reason":"ProviderSuccess","message":"The DNS provider succeeded in ensuring the record"}]}],"observedGeneration":1}}, "error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com", "errorCauses": [{"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}, {"error": "failed to get hosted zone for load balancer target \"af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com\": couldn't find hosted zone ID of ELB af6e309caa14c41eabe69f3f9eb15cf1-1656133782.us-west-2.elb.amazonaws.com"}]}}}
The scope-wildcard dnsrecord is created for the TestScopeChange test.
Using search.ci, it seems that the failures occurred many times on #817 before it merged and then started occurring for the other PRs after #817 merged.
I filed a PR, openshift/cluster-ingress-operator#838, that reverts #817. I have run the e2e-aws-operator CI job on this PR twice. While the job has failed both times, the TestScopeChange test did not fail either time.
At this point, we have strong evidence that #817 is causing TestScopeChange to fail.
Grant Spence did some testing and determined that there is some interaction between TestAllowedSourceRangesStatus and TestScopeChange. It may suffice to serialize some tests (TestScopeChange is currently a parallel test, as are TestAllowedSourceRangesStatus and two other tests that #817 adds).
If the problem cannot be resolved by serializing tests, it may be necessary to revert #817 to unblock CI.
Note that this issue is blocking NE-942, NE-1072, and NE-682, as well as any bugfix PRs for the master branch in openshift/cluster-ingress-operator.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Consistently.
Steps to Reproduce:
1. Run CI on a PR against the master branch of cluster-ingress-operator.
Actual results:
The TestScopeChange test fails as described.
Expected results:
TestScopeChange should not fail.
Description of problem:
When an error happens continuously, such as a failure to create a machine because of an invalid provider spec, the operator sits idle and does not report to the end user that an issue is occurring. There are logs that reveal the errors, but these do not show up in the CPMS status.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an Azure cluster with a CPMS.
2. Ensure the subnet and vnet fields are set in the providerSpec.
3. Activate the CPMS; it should be healthy and running OK.
4. Drop the vnet field from the providerSpec and observe errors in the logs.
Actual results:
Errors are logged but nothing shows on CPMS -o yaml
Expected results:
The error should be shown on the CPMS object if it happens continuously
Additional info:
https://github.com/openshift/api/pull/1213 and https://github.com/openshift/api/pull/1202 have been merged, but the latest 4.12 OCP clusters do not show the changes.
According to https://github.com/openshift/console-operator/blob/bd2a7c9077ccf214dd8a725a7660e86d96e045b0/Dockerfile.rhel7#L18-L23, we need to vendor openshift/api in the console-operator repo so that the latest manifests get applied.
We added server groups for control plane and computes as part of OSASINFRA-2570, except for UPI that only creates server group for the control plane.
We need to update the UPI scripts to create a server group for computes, to be consistent with IPI, and to have the instructions at https://docs.openshift.com/container-platform/4.11/machine_management/creating_machinesets/creating-machineset-osp.html work out of the box in case customers want to create MachineSets on their UPI clusters.
Related to OCPCLOUD-1135.
Description of problem:
The name of "Role" on Compute -> Nodes page should update to "Roles" to match the name in the CLI
Compare with other resources, the title of the column should keep pace with the name in CLI
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-15-150248
How reproducible:
Always
Steps to Reproduce:
1. Login OCP with CLI, use below command to get nodes information
$ oc get nodes
2. Go to Compute -> nodes page, check the column name of "Role"
3.
Actual results:
CLI will return information as below shown, and the title of the column is "ROLES"
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-145-18.us-east-2.compute.internal    Ready    worker   9h    v1.24.0+4f0dd4d
ip-10-0-145-203.us-east-2.compute.internal   Ready    master   9h    v1.24.0+4f0dd4d
ip-10-0-163-205.us-east-2.compute.internal   Ready    master   9h    v1.24.0+4f0dd4d
ip-10-0-169-118.us-east-2.compute.internal   Ready    worker   9h    v1.24.0+4f0dd4d
ip-10-0-198-234.us-east-2.compute.internal   Ready    master   9h    v1.24.0+4f0dd4d
ip-10-0-212-34.us-east-2.compute.internal    Ready    worker   9h    v1.24.0+4f0dd4d
But in the UI, the title of the ROLES column is "Role", which is incorrect. (Attached)
Expected results:
The title of "Role" should update to "Roles"
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-21-135326
How reproducible:
Steps to Reproduce:
See https://bugzilla.redhat.com/show_bug.cgi?id=2118563#c5. The following messages are "normal" on startup, but as error statements they are very misleading; suggest suppressing them or updating them to clearer wording so that we know they are part of the normal process.

E0818 02:18:53.709223 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.715530 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.735885 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.775984 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.790449 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.856911 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:53.950782 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.017583 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.271967 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.338944 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.916988 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-c955q': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-c955q, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
E0818 02:18:54.982211 1 controller.go:165] error syncing 'br709bt-b5564-6jgdx-worker-0-sl9jn': error retrieving the private IP configuration for node: br709bt-b5564-6jgdx-worker-0-sl9jn, err: cannot parse valid nova server ID from providerId '', requeuing in node workqueue
Actual results:
Expected results:
Additional info:
Description of problem:
Using OLM descriptor components deletes the operand.
Steps to Reproduce:
Description of problem:
Egress IP is not being assigned to the primary interface of the node as per the hostsubnet definition. The issue is being observed on an OpenShift cluster hosted in a disconnected AWS environment. The following steps were performed on the AWS side:
- A disconnected VPC was created and OpenShift was installed as per the documentation.
- An Elastic IP could not be used as it is a disconnected environment. The customer identified a free IP from the same subnet as the node and modified the node's interface to add a secondary IP.
It seems the cloud.network.openshift.io/egress-ipconfig annotation is needed on the node to attach the IP to the primary interface, but it is missing. In the SDN pod log on the same node I could see it complaining about 'an incomplete annotation "cloud.network.openshift.io/egress-ipconfig"'. Will share more details in the comments.
Version-Release number of selected component (if applicable):
Openshift 4.10.28
How reproducible:
Always
Steps to Reproduce:
1. Create a disconnected environment on AWS.
2. Find a free IP from the subnet where a worker node is hosted and add it as a secondary IP to the NIC of that node.
3. Configure hostsubnet and netnamespace on the OpenShift cluster.
Actual results:
- Egress IP is not being attached to the primary interface of the node for which the hostsubnet has been configured
Expected results:
- Egress IP should get configured without any issue.
Additional info:
Description of problem:
CPMS failureDomains do not stay consistent with the master machines on a heterogeneous cluster after upgrade from 4.11 to 4.12
Version-Release number of selected component (if applicable):
4.11.9-multi -> 4.12.0-0.nightly-multi-2022-10-20-153503
How reproducible:
always
Steps to Reproduce:
1. Launch a 4.11 heterogeneous cluster on AWS. We use the automated template https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_11/ipi-on-aws/versioned-installer-x86_arm64_heterogeneous_workers

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.9-multi True False 25m Cluster version is 4.11.9-multi

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.9-multi True False False 25m
baremetal 4.11.9-multi True False False 43m
cloud-controller-manager 4.11.9-multi True False False 45m
cloud-credential 4.11.9-multi True False False 46m
cluster-api 4.11.9-multi True False False 44m
cluster-autoscaler 4.11.9-multi True False False 43m
config-operator 4.11.9-multi True False False 44m
console 4.11.9-multi True False False 31m
csi-snapshot-controller 4.11.9-multi True False False 43m
dns 4.11.9-multi True False False 43m
etcd 4.11.9-multi True False False 42m
image-registry 4.11.9-multi True False False 38m
ingress 4.11.9-multi True False False 38m
insights 4.11.9-multi True False False 37m
kube-apiserver 4.11.9-multi True False False 40m
kube-controller-manager 4.11.9-multi True False False 41m
kube-scheduler 4.11.9-multi True False False 41m
kube-storage-version-migrator 4.11.9-multi True False False 44m
machine-api 4.11.9-multi True False False 40m
machine-approver 4.11.9-multi True False False 43m
machine-config 4.11.9-multi True False False 42m
marketplace 4.11.9-multi True False False 43m
monitoring 4.11.9-multi True False False 35m
network 4.11.9-multi True False False 45m
node-tuning 4.11.9-multi True False False 43m
openshift-apiserver 4.11.9-multi True False False 38m
openshift-controller-manager 4.11.9-multi True False False 43m
openshift-samples 4.11.9-multi True False False 37m
operator-lifecycle-manager 4.11.9-multi True False False 43m
operator-lifecycle-manager-catalog 4.11.9-multi True False False 43m
operator-lifecycle-manager-packageserver 4.11.9-multi True False False 38m
service-ca 4.11.9-multi True False False 44m
storage 4.11.9-multi True False False 38m

2. Upgrade to 4.12.0-0.nightly-multi-2022-10-20-153503.

liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.12.0-0.nightly-multi-2022-10-20-153503 True False 15m Cluster version is 4.12.0-0.nightly-multi-2022-10-20-153503

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 120m
baremetal 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 137m
cloud-controller-manager 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 140m
cloud-credential 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 140m
cluster-api 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
cluster-autoscaler 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 137m
config-operator 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
console 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 125m
control-plane-machine-set 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 55m
csi-snapshot-controller 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
dns 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 137m
etcd 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 136m
image-registry 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
ingress 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
insights 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
kube-apiserver 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 134m
kube-controller-manager 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 135m
kube-scheduler 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 135m
kube-storage-version-migrator 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 27m
machine-api 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 134m
machine-approver 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
machine-config 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 76m
marketplace 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 137m
monitoring 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 130m
network 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 139m
node-tuning 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 52m
openshift-apiserver 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
openshift-controller-manager 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 52m
openshift-samples 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 55m
operator-lifecycle-manager 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
operator-lifecycle-manager-catalog 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
operator-lifecycle-manager-packageserver 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m
platform-operators-aggregated 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 27m
service-ca 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 138m
storage 4.12.0-0.nightly-multi-2022-10-20-153503 True False False 132m

3. Found there is a CPMS, but the failureDomains shows us-east-2a, us-east-2c, us-east-2a, us-east-2b, which is not consistent with the master machines (us-east-2a, us-east-2b, us-east-2c).

liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-aws411he2-gbt55-master-0 Running m6i.xlarge us-east-2 us-east-2a 113m
huliu-aws411he2-gbt55-master-1 Running m6i.xlarge us-east-2 us-east-2b 113m
huliu-aws411he2-gbt55-master-2 Running m6i.xlarge us-east-2 us-east-2c 113m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-nmkwf Running m6g.large us-east-2 us-east-2a 109m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-xw2df Running m6g.large us-east-2 us-east-2a 109m
huliu-aws411he2-gbt55-worker-us-east-2a-pbsxw Running m6i.xlarge us-east-2 us-east-2a 109m
huliu-aws411he2-gbt55-worker-us-east-2b-tpzn2 Running m6i.xlarge us-east-2 us-east-2b 109m
huliu-aws411he2-gbt55-worker-us-east-2c-bxchx Running m6i.xlarge us-east-2 us-east-2c 109m

liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 2 Inactive 44m

liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset cluster -o yaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  creationTimestamp: "2022-10-21T09:19:02Z"
  finalizers:
  - controlplanemachineset.machine.openshift.io
  generation: 1
  name: cluster
  namespace: openshift-machine-api
  resourceVersion: "63863"
  uid: c33d01d3-c7f3-411f-aaed-4c5339b166d3
spec:
  replicas: 3
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: huliu-aws411he2-gbt55
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  state: Inactive
  strategy:
    type: RollingUpdate
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      failureDomains:
        aws:
        - placement:
            availabilityZone: us-east-2a
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2a
            type: Filters
        - placement:
            availabilityZone: us-east-2c
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2c
            type: Filters
        - placement:
            availabilityZone: us-east-2a
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2a
            type: Filters
        - placement:
            availabilityZone: us-east-2b
          subnet:
            filters:
            - name: tag:Name
              values:
              - huliu-aws411he2-gbt55-private-us-east-2b
            type: Filters
        platform: AWS
      metadata:
        labels:
          machine.openshift.io/cluster-api-cluster: huliu-aws411he2-gbt55
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
      spec:
        lifecycleHooks: {}
        metadata: {}
        providerSpec:
          value:
            ami:
              id: ami-0abf0ec5cdd856934
            apiVersion: machine.openshift.io/v1beta1
            blockDevices:
            - ebs:
                encrypted: true
                iops: 0
                kmsKey:
                  arn: ""
                volumeSize: 120
                volumeType: gp3
            credentialsSecret:
              name: aws-cloud-credentials
            deviceIndex: 0
            iamInstanceProfile:
              id: huliu-aws411he2-gbt55-master-profile
            instanceType: m6i.xlarge
            kind: AWSMachineProviderConfig
            loadBalancers:
            - name: huliu-aws411he2-gbt55-int
              type: network
            - name: huliu-aws411he2-gbt55-ext
              type: network
            metadata:
              creationTimestamp: null
            metadataServiceOptions: {}
            placement:
              region: us-east-2
            securityGroups:
            - filters:
              - name: tag:Name
                values:
                - huliu-aws411he2-gbt55-master-sg
            subnet: {}
            tags:
            - name: kubernetes.io/cluster/huliu-aws411he2-gbt55
              value: owned
            userDataSecret:
              name: master-user-data

4. If I edit the CPMS and change state from Inactive to Active, it triggers an update immediately. But it seems no update is needed, as the three master machines are already in the CPMS failureDomains.

liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-aws411he2-gbt55-master-0 Running m6i.xlarge us-east-2 us-east-2a 138m
huliu-aws411he2-gbt55-master-1 Running m6i.xlarge us-east-2 us-east-2b 138m
huliu-aws411he2-gbt55-master-gcszr-2 Running m6i.xlarge us-east-2 us-east-2a 14m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-nmkwf Running m6g.large us-east-2 us-east-2a 134m
huliu-aws411he2-gbt55-worker-us-east-2a-additional-xw2df Running m6g.large us-east-2 us-east-2a 134m
huliu-aws411he2-gbt55-worker-us-east-2a-pbsxw Running m6i.xlarge us-east-2 us-east-2a 134m
huliu-aws411he2-gbt55-worker-us-east-2b-tpzn2 Running m6i.xlarge us-east-2 us-east-2b 134m
huliu-aws411he2-gbt55-worker-us-east-2c-bxchx Running m6i.xlarge us-east-2 us-east-2c 134m
Actual results:
CPMS failureDomains don't stay consistent with the master machines
Expected results:
CPMS failureDomains should stay consistent with the master machines
Additional info:
Must-gather: https://drive.google.com/file/d/1fnz22ay9wvXPwKirkSmjX7qCj2aIH8Wg/view?usp=sharing
Installing a 4.12 heterogeneous cluster: no such issue. Upgrading a non-heterogeneous cluster: no such issue. So it seems this only occurs on heterogeneous cluster upgrades.
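If the root cause is duplicated failure-domain entries, one fix could be to de-duplicate by availability zone when the CPMS spec is generated. A minimal Go sketch, with an illustrative stand-in type rather than the real machine API structs:

```go
// Package name is illustrative of CPMS generation code.
package cpms

// FailureDomain is a simplified stand-in for the AWS failure-domain entry,
// reduced to the fields that matter here.
type FailureDomain struct {
	AvailabilityZone string
	SubnetTagName    string
}

// dedupeByZone keeps the first entry per availability zone, so the CPMS
// lists each master zone exactly once (us-east-2a, us-east-2b, us-east-2c).
func dedupeByZone(fds []FailureDomain) []FailureDomain {
	seen := map[string]bool{}
	out := make([]FailureDomain, 0, len(fds))
	for _, fd := range fds {
		if seen[fd.AvailabilityZone] {
			continue
		}
		seen[fd.AvailabilityZone] = true
		out = append(out, fd)
	}
	return out
}
```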
This is a clone of issue OCPBUGS-2891. The following is the description of the original issue:
—
Deprovisioning can fail with the error:
level=warning msg=unrecognized elastic load balancing resource type listener arn=arn:aws:elasticloadbalancing:us-west-2:460538899914:listener/net/a9ac9f1b3019c4d1299e7ededc92b42b/a6f0655da877ddd4/45e05ee69d99bab0
Further background is available in this write up:
https://docs.google.com/document/d/1TsTqIVwHDmjuDjG7v06w_5AAbXSisaDX-UfUI9-GVJo/edit#
Incident channel:
incident-aws-leaking-tags-for-deleted-resources
The AWS CPMS changes made here cause single-node cluster installations to fail:
https://github.com/openshift/installer/pull/6172
We need to fix the issue by checking the installation type and not creating the CPMS manifest when it is single-node.
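A minimal sketch of that gate in Go; how the installer actually exposes the control-plane replica count is an assumption here:

```go
// Package name is illustrative of installer manifest generation.
package manifests

// shouldGenerateCPMS reports whether a ControlPlaneMachineSet manifest makes
// sense for this topology: a single-node cluster has a lone control-plane
// machine and nothing for CPMS to manage, so the manifest is skipped.
func shouldGenerateCPMS(controlPlaneReplicas int) bool {
	return controlPlaneReplicas > 1
}
```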
Description of problem:
This bug is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=2109140 on the odf-console side. The corresponding PR needs to be merged in console as well. Please verify this console bug and https://bugzilla.redhat.com/show_bug.cgi?id=2109140 simultaneously; the steps are exactly the same, no difference.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The Console Operator has a suite of tests responsible for ensuring that Console can successfully interact with Operators managed by OLM. The operator-hub.spec test references an operator that is no longer present in the 4.12 certified operators catalog source: https://github.com/openshift/console/blob/master/frontend/packages/operator-lifecycle-manager/integration-tests-cypress/tests/operator-hub.spec.ts#L64 OLM is unable to set the default catalog sources to the 4.12 image tag until the test is updated to reference an operator present in both the 4.11 and 4.12 images of the certified operators catalog source.
Version-Release number of selected component (if applicable):4.12
How reproducible: always
Steps to Reproduce:
1. Update the certified operators catalogSource images to the 4.12 tag.
2. Attempt to run the operator-hub.spec test suite.
Actual results:
The test fails
Expected results:
The test passes
Additional info:
This is a clone of issue OCPBUGS-3432. The following is the description of the original issue:
—
Description of problem:
E2E test cases for the knative and pipeline packages have been disabled in CI due to operator installation issues with the respective operators. The tests have to be re-enabled once a new operator version is available or the issue is resolved.
References:
https://coreos.slack.com/archives/C6A3NV5J9/p1664545970777239
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The system:openshift:openshift-controller-manager:leader-locking-ingress-to-route-controller role and role binding should not be present in the openshift-route-controller-manager namespace. They are not needed since the leader-locking responsibility was moved to the route-controller-manager, which is managed by leader-locking-openshift-route-controller-manager. This was added in, and is used by, https://github.com/openshift/openshift-controller-manager/pull/230/files#diff-2ddbbe8d5a13b855786852e6dc0c6213953315fd6e6b813b68dbdf9ffebcf112R20
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3706. The following is the description of the original issue:
—
Description of problem:
While running ./openshift-install agent wait-for install-complete --dir billi --log-level debug on a real bare-metal dual-stack compact cluster installation, it errors out with:

ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused

but the installation is still progressing:

DEBUG Uploaded logs for host openshift-master-1 cluster d8b0979d-3d69-4e65-874a-d1f7da79e19e
DEBUG Host: openshift-master-1, reached installation stage Rebooting
DEBUG Host: openshift-master-1, reached installation stage Configuring
DEBUG Host: openshift-master-2, reached installation stage Configuring
DEBUG Host: openshift-master-2, reached installation stage Joined
DEBUG Host: openshift-master-1, reached installation stage Joined
DEBUG Host: openshift-master-0, reached installation stage Waiting for bootkube
DEBUG Host openshift-master-1: updated status from installing-in-progress to installed (Done)
DEBUG Host: openshift-master-1, reached installation stage Done
DEBUG Host openshift-master-2: updated status from installing-in-progress to installed (Done)
DEBUG Host: openshift-master-2, reached installation stage Done
DEBUG Host: openshift-master-0, reached installation stage Waiting for controller: waiting for controller pod ready event
ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
Version-Release number of selected component (if applicable):
4.12.0-rc.0
How reproducible:
100%
Steps to Reproduce:
1. ./openshift-install agent create image --dir billi --log-level debug
2. Mount the resulting ISO image and reboot the nodes via iLO.
3. ./openshift-install agent wait-for install-complete --dir billi --log-level debug
Actual results:
ERROR Attempted to gather ClusterOperator status after wait failure: Listing ClusterOperator objects: Get "https://api.kni-qe-0.lab.eng.rdu2.redhat.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp [2620:52:0:11c::10]:6443: connect: connection refused
The cluster installation is not complete and needs more time to complete.
Expected results:
waits until the cluster installation completes
Additional info:
The cluster installation eventually completes fine if you keep waiting after the error. Attaching install-config.yaml and agent-config.yaml.
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/95
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
node_exporter collects network metrics for "virtual" interfaces like br-*. When OVN is used, it also reports metrics for ovs-*, ovn, and genev_sys_* interfaces.
Version-Release number of selected component (if applicable):
4.12 (and before)
How reproducible:
Always
Steps to Reproduce:
1. Launch a 4.12 cluster.
2. Run the following PromQL query: "group by(device) (node_network_info)"
Expected results:
Only real host interfaces should be present.
Additional info:
OVS 2.17+ introduced an optimization of "weak references" to substantially speed up database snapshots. In some cases weak references may leak memory; the aforementioned commit fixes that and has been pulled into ovs2.17-62 and later.
This relates to the recovery of a cluster following an etcd outage.
The ingress path to kube-apiserver is:
───────────> VIP ─────────────────> Local HAProxy ────┬─> kube-apiserver-master-0
     (managed by keepalived)                          │
                                                      ├─> kube-apiserver-master-1
                                                      │
                                                      └─> kube-apiserver-master-2
Each master is running an HAProxy which load balances between the 3 kube-apiservers. Each HAProxy is running health checks against each kube-apiserver, and will add or remove it from the available pool based on its health.
We only use keepalived to ensure that HAProxy is not a single point of failure. It is the job of keepalived to ensure that incoming traffic is being directed to an HAProxy which is functioning correctly.
The current health check we are using for keepalived involves polling /readyz against the local HAProxy. While this seems intuitively correct it is in fact testing the wrong thing. It is testing whether the kube-apiserver it connects to is functioning correctly. However, this is not the purpose of keepalived. HAProxy runs health checks against kube-apiserver backends. keepalived simply selects a correctly functioning HAProxy.
This becomes important during recovery from an outage. When none of the kube-apiservers are healthy this health check will fail continuously, and the API VIP will move uselessly between masters. However the situation is much worse when only one of the kube-apiservers is up. In this case there is a high probability that it is overloaded and at least rate limiting incoming connections. This may lead us to fail the keepalived health check and fail the VIP over to the next HAProxy. This will cause all open kube-apiserver connections to reset, even the established ones. This increases the load on the kube-apiserver and increases the probability that the health check will fail again.
Ideally the keepalived health check would check only the health of HAProxy itself, not the health of the pool of kube-apiservers. In practise it will probably never be necessary to move the VIP while the master is up, regardless of the health of the cluster. A network partition affecting HAProxy would already be handled by VRRP between the masters, so it may be that it would be sufficient to check that the local HAProxy pod is healthy.
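A minimal sketch of the narrower check in Go: succeed as soon as the local HAProxy accepts a TCP connection, regardless of backend health. The address/port and the use of a standalone binary as a keepalived check script are assumptions, not the current on-prem implementation:

```go
// A hypothetical keepalived check script: exits 0 if the local HAProxy is
// alive, non-zero otherwise. It deliberately does not probe /readyz.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// HAProxy completes the TCP handshake itself, so a successful connect
	// proves the proxy process is accepting traffic even when every
	// kube-apiserver backend is down or rate limiting.
	conn, err := net.DialTimeout("tcp", "127.0.0.1:9445", 2*time.Second)
	if err != nil {
		fmt.Fprintln(os.Stderr, "haproxy not accepting connections:", err)
		os.Exit(1) // keepalived treats a non-zero exit as check failure
	}
	conn.Close()
}
```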
Description of problem:
"opm alpha render-veneer semver" raise error when no "Candidate" in config yaml
Version-Release number of selected component (if applicable):
zhaoxia@xzha-mac semver % opm version
Version: version.Version{OpmVersion:"11644a543", GitCommit:"11644a5433442c33698d2eee8d3f865b0d9386c0", BuildDate:"2022-08-29T08:16:54Z", GoOs:"darwin", GoArch:"amd64"}
How reproducible:
always
Steps to Reproduce:
1. Prepare catalog-semver-veneer-wrong.yaml:

zhaoxia@xzha-mac semver % cat catalog-semver-veneer-wrong.yaml
Schema: olm.semver
GenerateMajorChannels: false
GenerateMinorChannels: true
Stable:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v1.0.2
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.1.0
Fast:
  Bundles:
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v0.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.0.1
  - Image: quay.io/olmqe/nginxolm-operator-bundle:v2.1.0

2. Run "opm alpha render-veneer semver":

zhaoxia@xzha-mac semver % opm alpha render-veneer semver catalog-semver-veneer-wrong.yaml
2022/08/29 16:48:56 semver "catalog-semver-veneer-wrong.yaml": semver-render: no bundles specified or no bundles could be rendered
Actual results:
error "no bundles specified or no bundles could be rendered" is raised.
Expected results:
no error
Additional info:
This is a clone of issue OCPBUGS-3476. The following is the description of the original issue:
—
Description of problem:
When we detect a refs/heads/branchname, we should show the label as we have it now:
- Branch: branchname
And when we detect a refs/tags/tagname, we should instead show the label as:
- Tag: tagname
I haven't implemented this in the CLI, but there is an old issue for that here: openshift-pipelines/pipelines-as-code#181
Version-Release number of selected component (if applicable):
4.11.z
How reproducible:
Steps to Reproduce:
1. Create a repository.
2. Trigger the pipelineruns by a push or pull request event on GitHub.
Actual results:
We show the branch name even when a tag is present; the tag name is never shown.
Expected results:
We should show the tag if a tag is detected and the branch if a branch is detected.
Additional info:
https://github.com/openshift/console/pull/12247#issuecomment-1306879310
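A minimal Go sketch of the proposed labeling logic; the function name and the assumption that the full ref string (refs/heads/..., refs/tags/...) is available at this point are illustrative:

```go
// Package name is illustrative of where ref handling might live.
package repository

import "strings"

// refLabel returns the label kind and name to display for a
// pipelines-as-code trigger ref: "Tag" for refs/tags/*, "Branch" otherwise.
func refLabel(ref string) (kind, name string) {
	switch {
	case strings.HasPrefix(ref, "refs/tags/"):
		return "Tag", strings.TrimPrefix(ref, "refs/tags/")
	case strings.HasPrefix(ref, "refs/heads/"):
		return "Branch", strings.TrimPrefix(ref, "refs/heads/")
	default:
		return "Branch", ref // fall back to the raw value
	}
}
```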
This is a clone of issue OCPBUGS-4900. The following is the description of the original issue:
—
The test:
test=[sig-storage] Volume limits should verify that all nodes have volume limits [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
Is hard failing on aws and gcp techpreview clusters:
The failure message is consistently:
fail [github.com/onsi/ginkgo/v2@v2.1.5-0.20220909190140-b488ab12695a/internal/suite.go:612]: Dec 15 09:07:51.278: Expected volume limits to be set Ginkgo exit error 1: exit with code 1
Sample failure:
A fix for this will bring several jobs back to life, but they do span 4.12 and 4.13.
job=periodic-ci-openshift-release-master-ci-4.12-e2e-gcp-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-techpreview=all
job=periodic-ci-openshift-release-master-ci-4.13-e2e-gcp-sdn-techpreview=all
job=periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-aws-ovn-arm64-techpreview=all
job=periodic-ci-openshift-multiarch-master-nightly-4.12-ocp-e2e-aws-ovn-arm64-techpreview=all
This is a clone of issue OCPBUGS-3069. The following is the description of the original issue:
—
Description of problem:
On the cluster settings page, available updates are shown. After the user chooses a target version and clicks "Update", even after waiting a long time there is no info about the upgrade status.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-25-210451
How reproducible:
Always
Steps to Reproduce:
1. Log in to the console with an available upgrade for the cluster, select a target version in the available version list, then click "Update". Check the upgrade progress on the cluster settings page.
2. Check the upgrade info from the client with "oc adm upgrade".
Actual results:
1. There is no information or upgrade progress shown on the page.
2. It shows that retrieving the target version failed:

$ oc adm upgrade
Cluster version is 4.12.0-0.nightly-2022-10-25-210451

ReleaseAccepted=False
  Reason: RetrievePayload
  Message: Retrieving payload failed version="4.12.0-0.nightly-2022-10-27-053332" image="registry.ci.openshift.org/ocp/release@sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1" failure=The update cannot be verified: unable to verify sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1 against keyrings: verifier-public-key-redhat

Upstream: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/graph
Channel: stable-4.12
Recommended updates:
  VERSION                              IMAGE
  4.12.0-0.nightly-2022-11-01-135441   registry.ci.openshift.org/ocp/release@sha256:f79d25c821a73496f4664a81a123925236d0c7818fd6122feb953bc64e91f5d0
  4.12.0-0.nightly-2022-10-31-232349   registry.ci.openshift.org/ocp/release@sha256:cb2d157805abc413394fc579776d3f4406b0a2c2ed03047b6f7958e6f3d92622
  4.12.0-0.nightly-2022-10-28-001703   registry.ci.openshift.org/ocp/release@sha256:c914c11492cf78fb819f4b617544cd299c3a12f400e106355be653c0013c2530
  4.12.0-0.nightly-2022-10-27-053332   registry.ci.openshift.org/ocp/release@sha256:fd4e9bec095b845c6f726f9ce17ee70449971b8286bb9b7478c06c5f697f05f1
Expected results:
1. The console should also show this kind of message on the page when retrieving the target payload fails, so that the user knows the actual result after trying to upgrade.
Additional info:
This is a clone of issue OCPBUGS-4490. The following is the description of the original issue:
—
Description of problem:
When a HyperShift HostedCluster has endpointAccess: Private, the csi-snapshot-controller is in CrashLoopBackOff because the guest APIServer URL in the admin-kubeconfig isn't reachable in Private mode.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3114. The following is the description of the original issue:
—
Description of problem:
When running a hosted cluster on HyperShift, the cluster-network-operator never progressed to Available despite all the components being up and running.
Version-Release number of selected component (if applicable):
Hosted clusters: quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64
HyperShift operator: quay.io/hypershift/hypershift-operator:4.11
Management cluster: 4.11.9
How reproducible:
Happened once
Steps to Reproduce:
1. 2. 3.
Actual results:
oc get co network reports False availability
Expected results:
oc get co network reports True availability
Additional info:
Description of problem:
The OVNKubernetesControllerDisconnectedSouthboundDatabase alert seems to fire in the e2e-aws-ovn-serial CI job. Note that something funny happens in the job itself: a set of ovnkube-node pods gets created, then deleted, then recreated, and the test runs. But the alert fires for the first set of pods that got deleted. From an initial screening of the artifacts alone it's not clear what happened to the old pods. This needs investigation.
Version-Release number of selected component (if applicable):
4.12 OCP
How reproducible:
Seems like always
Steps to Reproduce:
1. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27043/pull-ci-openshift-origin-master-e2e-aws-ovn-serial/1568166237639282688
2. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27043/pull-ci-openshift-origin-master-e2e-aws-ovn-serial/1567913444936519680
Actual results:
Alert is fired
Expected results:
The alert shouldn't fire. If it is expected in the serial job, then we need to silence that alert for that job, or make it a flake so the job does not fail hard when the alert fires.
Additional info:
Description of problem:
See: https://issues.redhat.com/browse/CPSYN-143

tl;dr: Based on the previous direction that 4.12 was going to enforce PSA restricted by default, OLM had to make a few changes, because the way we run catalog pods (and we have to run them that way because of how the opm binary worked) was incompatible with running restricted:
1) We set openshift-marketplace to enforce restricted (this was our choice, we didn't have to do it, but we did).
2) We updated the opm binary so catalog images using a newer opm binary don't have to run privileged.
3) We added a field to catalogsource that allows you to choose whether to run the pod privileged (legacy mode) or restricted. The default is restricted. We made that the default so that users running their own catalogs in their own namespaces (which would be default PSA enforcing) would be able to be successful without needing their namespace upgraded to privileged.

Unfortunately this means:
1) Legacy catalog images (i.e. those using older opm binaries) won't run on 4.12 by default (the catalogsource needs to be modified to specify legacy mode).
2) Legacy catalog images cannot be run in the openshift-marketplace namespace, since that namespace does not allow privileged pods. This means legacy catalogs can't contribute to the global catalog (since catalogs must be in that namespace to be in the global catalog).

Before 4.12 ships we need to:
1) Remove the PSA restricted label on the openshift-marketplace namespace.
2) Change the catalogsource securitycontextconfig mode default to "legacy", not restricted.

This gives catalog authors another release to update to a newer opm binary that can run restricted, or to get their namespaces explicitly labeled as privileged (4.12 will not enforce restricted, so in 4.12 using the legacy mode will continue to work).

In 4.13 we will need to revisit what we want the default to be, since at that point catalogs will start breaking if they try to run in legacy mode in most namespaces.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The Insights Operator gathers each clusteroperator's related objects from the operators.openshift.io group. Ingresscontrollers are now missing, because it is a namespaced resource and the "default" name is not provided in the related objects of the ingress clusteroperator.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This bug is a backport clone of [Bugzilla Bug 2113973](https://bugzilla.redhat.com/show_bug.cgi?id=2113973). The following is the description of the original bug:
—
If we define a custom scc like this:
allowHostDirVolumePlugin: true
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: false
allowPrivilegedContainer: false
allowedCapabilities: []
apiVersion: security.openshift.io/v1
defaultAddCapabilities: []
fsGroup:
  type: MustRunAs
groups:
we can see that the pod originally has this scc:
oc get pod machine-config-operator-7f57686f5c-g895k -o yaml | grep scc
openshift.io/scc: hostmount-anyuid
After applying the new SCC (even if we set a higher priority), the pod shows after restart:
oc get pod machine-config-operator-7f57686f5c-jg2jv -o yaml | grep scc
openshift.io/scc: vault-unsealer
Description of problem:
A normal user cannot open the debug container for pods (in CrashLoopBackOff) that they created, and gets the error message: pods "<pod name>" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-20-040107, 4.11.z, 4.10.z
How reproducible:
Always
Steps to Reproduce:
1. Login to OCP as a normal user, eg: flexy-htpasswd-provider
2. Create a project, go to Developer perspective -> +Add page
3. Click "Import from Git", and provide the data below to get a pod in CrashLoopBackOff state
   Git Repo URL: https://github.com/sclorg/nodejs-ex.git
   Name: nodejs-ex-git
   Run command: star a wktw
4. Navigate to the /k8s/ns/<project name>/pods page, find the pod with CrashLoopBackOff status, and go to its details page -> Logs Tab
5. Click the link of "Debug container"
6. Check if the Debug container can be opened
Actual results:
6. An error message is shown on the page and the user cannot open the debug container via the UI: pods "nodejs-ex-git-6dd986d8bd-9h2wj-debug-tkqk2" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
Expected results:
6. Normal user could use debug container without any error message
Additional info:
The debug container could be created for the normal user successfully via CommandLine $ oc debug <crashloopbackoff pod name> -n <project name>
This is a clone of issue OCPBUGS-2727. The following is the description of the original issue:
—
Description of problem:
CVO recently introduced a new precondition, RecommendedUpdate [1]. When we request an upgrade to a version which is not an available update, the precondition gets UnknownUpdate and blocks the upgrade.

# oc get clusterversion/version -ojson | jq -r '.status.availableUpdates'
null

# oc get clusterversion/version -ojson | jq -r '.status.conditions[]|select(.type == "ReleaseAccepted")'
{
  "lastTransitionTime": "2022-10-20T08:16:59Z",
  "message": "Preconditions failed for payload loaded version=\"4.12.0-0.nightly-multi-2022-10-18-153953\" image=\"quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695\": Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"UnknownUpdate\": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.12.0-0.nightly-multi-2022-10-18-091108 to 4.12.0-0.nightly-multi-2022-10-18-153953 is unknown.",
  "reason": "PreconditionChecks",
  "status": "False",
  "type": "ReleaseAccepted"
}

[1] https://github.com/openshift/cluster-version-operator/pull/841/
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-multi-2022-10-18-091108
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.12 cluster
2. Upgrade to a version which is not in the available updates:
# oc adm upgrade --allow-explicit-upgrade --to-image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
Requesting update to release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695
Actual results:
CVO precondition check fails and blocks upgrade
Expected results:
Upgrade proceeds
Additional info:
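For testing, one way to get past failing precondition checks is the update force flag (a hedged sketch; forcing skips safety checks and is not recommended outside test environments):

```
# Bypass precondition checks for an explicit, unverified update (test use only).
oc adm upgrade --allow-explicit-upgrade --force \
  --to-image=quay.io/openshift-release-dev/ocp-release-nightly@sha256:71c1912990db7933bcda1d6914228e8b9b0d36ddba265164ee33a1bca06fe695
```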
Description of problem:
If the cluster install fails and no tag is attached to the VMs, running ./openshift-install destroy cluster gets stuck; for details please see openshift-install.log:
...
time="2022-09-28T08:19:14-04:00" level=debug msg="Delete Folder"
time="2022-09-28T08:19:14-04:00" level=debug msg="Find attached Folder on tag"
time="2022-09-28T08:19:15-04:00" level=debug msg="Folder: Expected Folder sgao-rtf6v to be empty"
time="2022-09-28T08:19:25-04:00" level=debug msg="Power Off Virtual Machines"
time="2022-09-28T08:19:25-04:00" level=debug msg="Find attached VirtualMachine on tag"
time="2022-09-28T08:19:25-04:00" level=debug msg="Delete Virtual Machines"
time="2022-09-28T08:19:25-04:00" level=debug msg="Find attached VirtualMachine on tag"
time="2022-09-28T08:19:25-04:00" level=debug msg="Delete Folder"
time="2022-09-28T08:19:25-04:00" level=debug msg="Find attached Folder on tag"
time="2022-09-28T08:19:25-04:00" level=debug msg="Folder: Expected Folder sgao-rtf6v to be empty"
time="2022-09-28T08:19:35-04:00" level=debug msg="Power Off Virtual Machines"
time="2022-09-28T08:19:35-04:00" level=debug msg="Find attached VirtualMachine on tag"
time="2022-09-28T08:19:35-04:00" level=debug msg="Delete Virtual Machines"
time="2022-09-28T08:19:35-04:00" level=debug msg="Find attached VirtualMachine on tag"
time="2022-09-28T08:19:35-04:00" level=debug msg="Delete Folder"
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-25-071630
How reproducible:
Always, when the cluster install fails and no tag is attached to the VMs
Steps to Reproduce:
1. Cluster install fails and no tag is attached to the VMs
2. run ./openshift-install destroy cluster
3.
Actual results:
Installer destroy gets stuck
Expected results:
Installer destroy should set a timeout and be able to quit in such a situation
Additional info:
This is a clone of issue OCPBUGS-1327. The following is the description of the original issue:
—
See this comment for some updated information
—
Description of problem:
During IPI installation on IBM Cloud (x86_64), some of the worker machines have been seen to have no network connectivity during their initial bootup. Investigations were performed with IBM Cloud VPC to attempt to identify the issue, but in all appearances, all virtualization appears to be working.
Unfortunately due to this issue, no network traffic, no access to these worker machines is available to help identify the issue (Ignition is stuck without network traffic), so no SSH or console login is available to collect logs, or perform any testing on these machines.
The only content available is the console output, showing ignition is stuck due to the network issue.
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
About 60%
Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud
2. Wait for the worker machines to be provisioned, causing IPI to fail waiting on machine-api operator
3. Check console of worker machines failing to report in to cluster (in this case 2 of 3 failed)
Actual results:
IPI creation failed waiting on machine-api operator to complete all worker node deployment
Expected results:
Successful IPI creation on IBM Cloud
Additional info:
As stated, investigation was performed by IBM Cloud VPC, but no further investigation could be performed since no access to these worker machines is available. Any further details that could be provided to help identify the issue would be helpful.
This appears to have become more prominent recently as well, causing concern for IBM Cloud's IPI GA support on the 4.12 release.
The only solution to restore network connectivity is rebooting the machine, which loses the ignition bring-up (I assume it must be triggered manually now), and in the case of IPI, isn't a great mitigation.
This is a clone of issue OCPBUGS-3441. The following is the description of the original issue:
—
Update the cluster-authentication-operator to not go degraded when it can’t determine the console url. This risks masking certain cases where we would want to raise an error to the admin, but the expectation is that this failure mode is rare.
Risk could be avoided by looking at ClusterVersion's enabledCapabilities to decide if missing Console was expected or not (unclear if the risk is high enough to be worth this amount of effort).
AC: Update the cluster-authentication-operator to not go degraded when console config CRD is missing and ClusterVersion config has Console in enabledCapabilities.
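A hedged sketch of the capability check described above, from the CLI (the jsonpath follows the ClusterVersion status API):

```
# Show which capabilities are enabled; if "Console" is absent, a missing
# console config is expected and should not degrade the operator.
oc get clusterversion version \
  -o jsonpath='{.status.capabilities.enabledCapabilities}'
```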
Description of problem:
On the Make Serverless page, the values of the minpod, maxpod and concurrency input fields can only be changed by clicking '+' or '-'; they cannot be changed by typing.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
always
Steps to Reproduce:
1. Create a deployment workload from import from git
2. Right click on the workload and select the Make Serverless option
3. Check the functioning of the minpod, maxpod etc. inputs
Actual results:
The values of the minpod, maxpod and concurrency input fields can only be changed by clicking '+' or '-'; they cannot be changed by typing.
Expected results:
The values of the minpod, maxpod and concurrency input fields can be changed both by clicking '+' or '-' and by typing.
Additional info:
Works fine in v4.11
CI is failing due to the updated pod security admission controller. We need to update the console test pods with the correct security values.
Error: Command failed: echo '{"apiVersion":"v1","kind":"Pod","metadata":{"name":"test-jxlpt-event-test-pod","namespace":"test-jxlpt"},"spec":{"containers":[{"name":"httpd","image":"image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest"}]}}' | kubectl create -n test-jxlpt -f -
Error from server (Forbidden): error when creating "STDIN": pods "test-jxlpt-event-test-pod" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "httpd" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "httpd" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "httpd" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "httpd" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
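A hedged sketch of a test pod that satisfies the restricted profile, following exactly the fields named in the error (pod name and namespace are the ones from the failing test):

```
# Create the test pod with a securityContext that passes "restricted" PSA.
cat <<'EOF' | kubectl create -n test-jxlpt -f -
apiVersion: v1
kind: Pod
metadata:
  name: test-jxlpt-event-test-pod
  namespace: test-jxlpt
spec:
  containers:
  - name: httpd
    image: image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault
EOF
```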
This is a clone of issue OCPBUGS-4207. The following is the description of the original issue:
—
Description of problem:
We added a line to increase debugging verbosity to aid in debugging WRKLDS-540
Version-Release number of selected component (if applicable):
13
How reproducible:
very
Steps to Reproduce:
1.just a revert 2. 3.
Actual results:
Extra debugging lines are present in the openshift-config-operator pod logs
Expected results:
Extra debugging lines no longer in the openshift-config-operator pod logs
Additional info:
This is a clone of issue OCPBUGS-6714. The following is the description of the original issue:
—
Description of problem:
Traffic from egress IPs was interrupted after a cluster patch to OpenShift 4.10.46.
A customer cluster was patched; it is an OpenShift 4.10.46 cluster with SDN.
More description of the issue is available in a private comment below since it contains customer data.
Description of problem:
An OCP v4.9.31 cluster didn't have the $search domain in /etc/resolv.conf, which was present in the v4.8.29 OCP cluster. This was observed on all nodes of the v4.9.31 cluster.
~~~
OpenShift 4.9.31
sh-4.4# cat /etc/resolv.conf
OpenShift 4.8.29
ENV: OpenStack IAD2, IPI installation. Connected cluster.
Version-Release number of selected component (if applicable):
OCP v4.9.31
How reproducible:
Always
Steps to Reproduce:
1. Install IPI cluster on OpenStack IAD2 platform having cluster version 4.9.31
2. Debug to any of the node(master/worker)
3. Check and confirm the missing search domain on all nodes of the cluster.
Actual results:
The search domain was missing when checked in `/etc/resolv.conf` file on all nodes of the cluster causing serious issues in the cluster.
Expected results:
The installer should embed the search domain in /etc/resolv.conf file on all nodes of the cluster.
Additional info:
set -eo pipefail
DISPATCHER_FILE="/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
DOMAINS="$(grep -E '\s*DOMAINS=.*iad2.dc.paas.redhat.com' $DISPATCHER_FILE \
grep -oE '[a-z0-9]*.dev.iad2.dc.paas.redhat.com' \ |
tr '\n' ' ')" |
>&2 echo "IT-PaaS: overwriting search domains in /etc/resolv.conf with: $DOMAINS"
sed -e "/^search/d" \
-e "/Generated by/c# Generated by KNI resolv prepender NM dispatcher script \nsearch $DOMAINS" \
/etc/resolv.conf > /etc/resolv.tmp
mv /etc/resolv.tmp /etc/resolv.conf
~~~
This is a clone of issue OCPBUGS-3767. The following is the description of the original issue:
—
Description of problem:
Start maintenance action moved from Nodes tab to Bare Metal Hosts tab
Version-Release number of selected component (if applicable):
Cluster version is 4.12.0-0.nightly-2022-11-15-024309
How reproducible:
100%
Steps to Reproduce:
1. Install the Node Maintenance operator
2. Go to Compute -> Nodes
3. Start maintenance from the 3-dots menu of worker-0-0, see https://docs.openshift.com/container-platform/4.11/nodes/nodes/eco-node-maintenance-operator.html#eco-setting-node-maintenance-actions-web-console_node-maintenance-operator
Actual results:
No 'Start maintenance' option
Expected results:
Maintenance started successfully
Additional info:
Worked in 4.11
Description of problem:
During a restart, egress firewall ACLs are deleted and re-created from scratch, meaning egress firewall rules are not applied for some time during the restart.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The install_type field in telemetry data is not automatically set from the installer invoker value. Any values we wish to appear must be explicitly converted to the corresponding install_type value.
Currently this makes clusters installed with the agent-based installer (invoker agent-installer) invisible in telemetry.
We do not have a well defined method to find these all just yet, identifying that would be a good first step.
This is a clone of issue OCPBUGS-3304. The following is the description of the original issue:
—
Assisted-service can use only one mirror of the release image. In the install-config, the user may specify multiple matching mirrors. Currently the last matching mirror is the one used by assisted-service. This is confusing; we should use the first matching one instead.
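A hedged sketch of the install-config fragment in question (mirror hostnames are hypothetical placeholders); with the fix, the first matching mirror would be the one assisted-service uses:

```
# install-config.yaml fragment with multiple matching mirrors
# (mirror hostnames are hypothetical placeholders).
cat <<'EOF' >> install-config.yaml
imageContentSources:
- source: quay.io/openshift-release-dev/ocp-release
  mirrors:
  - mirror-a.example.com/ocp/release   # first match: the one that should be used
  - mirror-b.example.com/ocp/release   # last match: the one currently used
EOF
```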
This is a clone of issue OCPBUGS-3320. The following is the description of the original issue:
—
Description of problem:
New master will be created if add duplicated failuredomains in controlplanemachineset
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-11-06-054655
How reproducible:
Always
Steps to Reproduce:
1. Update the controlplanemachineset and add a duplicated failure domain us-east-2a in the first position of failureDomains:
failureDomains:
  aws:
  - placement:
      availabilityZone: us-east-2a
    subnet:
      filters:
      - name: tag:Name
        values:
        - zhsun117-x6jjt-private-us-east-2a
      type: Filters
  - placement:
      availabilityZone: us-east-2a
    subnet:
      filters:
      - name: tag:Name
        values:
        - zhsun117-x6jjt-private-us-east-2a
      type: Filters
  - placement:
      availabilityZone: us-east-2b
    subnet:
      filters:
      - name: tag:Name
        values:
        - zhsun117-x6jjt-private-us-east-2b
      type: Filters
  - placement:
      availabilityZone: us-east-2c
    subnet:
      filters:
      - name: tag:Name
        values:
        - zhsun117-x6jjt-private-us-east-2c
      type: Filters
platform: AWS
2.
3.
Actual results:
A new master is created in the duplicated zone us-east-2a and the old master in zone us-east-2c is removed.
$ oc get machine
NAME                                     PHASE     TYPE         REGION      ZONE         AGE
zhsun117-x6jjt-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   5h37m
zhsun117-x6jjt-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   5h37m
zhsun117-x6jjt-master-w8785-2            Running   m6i.xlarge   us-east-2   us-east-2a   15m
zhsun117-x6jjt-worker-us-east-2a-nxn6j   Running   m6i.xlarge   us-east-2   us-east-2a   5h34m
zhsun117-x6jjt-worker-us-east-2b-7vmr8   Running   m6i.xlarge   us-east-2   us-east-2b   5h34m
zhsun117-x6jjt-worker-us-east-2c-2zwwv   Running   m6i.xlarge   us-east-2   us-east-2c   5h34m

I1107 08:28:56.243804 1 provider.go:416] "msg"="Created machine" "controller"="controlplanemachineset" "failureDomain"="AWSFailureDomain{AvailabilityZone:us-east-2a, Subnet:{Type:Filters, Value:&[{Name:tag:Name Values:[zhsun117-x6jjt-private-us-east-2a]}]}}" "index"=2 "machineName"="zhsun117-x6jjt-master-lzs4c-2" "name"="zhsun117-x6jjt-master-4v8wl-2" "namespace"="openshift-machine-api" "reconcileID"="eec9a27c-4b7e-467a-b28c-6470c3068ab2" "updateStrategy"="RollingUpdate"
Expected results:
The cluster should not be updated.
Additional info:
If the duplicate failure domain us-east-2a is added at the end of failureDomains, it does not trigger an update.
This is a clone of issue OCPBUGS-6175. The following is the description of the original issue:
—
Description of problem:
When the cluster is configured with a proxy, the Swift client in the image registry operator does not use the proxy to authenticate with OpenStack, so it is unable to reach the OpenStack API. This issue became evident since support was recently added to not fall back to Cinder in case Swift is available [1].
[1]https://github.com/openshift/cluster-image-registry-operator/pull/819
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy a cluster with proxy and restricted installation 2. 3.
Actual results:
Expected results:
Additional info:
Tracker issue for bootimage bump in 4.12. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-1941.
This is a clone of issue OCPBUGS-2873. The following is the description of the original issue:
—
Description of problem:
Prometheus fails to scrape metrics from the storage operator after some time.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Always
Steps to Reproduce:
1. Install the storage operator.
2. Wait for 24h (time for the certificate to be recycled).
3.
Actual results:
Targets being down because Prometheus didn't reload the CA certificate.
Expected results:
Prometheus reloads its client TLS certificate and scrapes the target successfully.
Additional info:
Description of problem:
When installing a private cluster, the install first fails; then we need to run
ibmcloud is security-group-rule-add "${infra}-sg-kube-api-lb" inbound tcp --port-min 6443 --port-max 6443 --remote $sg
and then run openshift-install wait-for again.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Try to create a cluster with BYON, with publish: Internal in install-config.yaml; the install fails
Actual results:
The first install attempt fails
Expected results:
The install should only be needed once; there should be no need to manually run security-group-rule-add.
Additional info:
https://coreos.slack.com/archives/C01U40AM37F/p1664439142279079?thread_ts=1663769891.358229&cid=C01U40AM37F
This issue blocks setting up private clusters automatically.
Description of problem:
i18n translation missing in "Remove component node from application" modal
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the dev console and create a workload under an Application group
2. On the Topology view remove the workload from the Application group
3. See the i18n error in the console
Actual results:
Missing i18n key "Remove component node from application" in namespace "topology" and language "en." in console
Expected results:
No i18n error should be shown in the console.
Additional info:
we need to make sure that the ironic containers use the latest available bugfix versions
Description of problem:
As a downstream consumer of the installer (as a library), I want to be able to choose whether or not the image gallery is used when creating machinesets on Azure so that I can achieve backwards compatibility with pre-4.12
Version-Release number of selected component (if applicable):
4.12+
How reproducible:
always
Steps to Reproduce:
1. Try to generate machinesets in a pre-4.12 environment
2. Lament as the installer automatically uses the image gallery regardless
Actual results:
Installer attempts to guess whether to use image gallery
Expected results:
I should be able to choose myself
Additional info:
Description of problem:
We're seeing frequent private DNS zone creation failures in Azure CI jobs over the recent two days; the Azure CI jobs have been greatly affected. https://search.ci.openshift.org/?search=error+creating%2Fupdating+Private+DNS+Zone+Virtual+network&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

For example, the following error from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1566852244215697408

level=info msg=Consuming Openshift Manifests from target directory
level=info msg=Consuming Common Manifests from target directory
level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error creating/updating Private DNS Zone Virtual network link "ci-op-1w80vs6f-7f65d-t2zlz-network-link" (Resource Group "ci-op-1w80vs6f-7f65d-t2zlz-rg"): privatedns.VirtualNetworkLinksClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="ParentResourceNotFound" Message="Can not perform requested operation on nested resource. Parent resource 'ci-op-1w80vs6f-7f65d.ci2.azure.devcluster.openshift.com' not found."
level=error
level=error msg= with module.dns.azureprivatedns_zone_virtual_network_link.network,
level=error msg= on dns/dns.tf line 13, in resource "azureprivatedns_zone_virtual_network_link" "network":
level=error msg= 13: resource "azureprivatedns_zone_virtual_network_link" "network"
Version-Release number of selected component (if applicable):
All OCP versions
How reproducible:
https://search.ci.openshift.org/chart?name=e2e-azure&search=error+creating%2Fupdating+Private+DNS+Zone&maxAge=24h&type=build-log shows 26% of the failed Azure jobs are related to "error creating/updating Private DNS Zone" in the past day. 3/5 of the failed Azure jobs are caused by this in QE’s CI today.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
No Azure outage was reported from https://status.azure.com/en-us/status. No private zone or DNS records quota exceeded was observed.
Description of problem:
Cluster ingress operator creates router deployments with affinity rules when running in a cluster with non-HA infrastructure plane (InfrastructureTopology=="SingleReplica") and "NodePortService" endpoint publishing strategy. With only one worker node available, rolling update of router-default stalls.
Version-Release number of selected component (if applicable):
All
How reproducible:
Create a single worker node cluster with "NodePortService" endpoint publishing strategy and try to restart the default router. Restart will not go through.
Steps to Reproduce:
1. Create a single worker node OCP cluster with HA control plane (ControlPlaneTopology=="HighlyAvailable"/"External") and one worker node (InfrastructureTopology=="SingleReplica") using the "NodePortService" endpoint publishing strategy. The operator will create the "router-default" deployment with a "podAntiAffinity" block, even though the number of nodes where ingress pods can be scheduled is only one:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  ...
  name: router-default
  namespace: openshift-ingress
  ...
spec:
  ...
  replicas: 1
  ...
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 50%
    type: RollingUpdate
  template:
    ...
    spec:
      affinity:
        ...
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: ingresscontroller.operator.openshift.io/deployment-ingresscontroller
                operator: In
                values:
                - default
              - key: ingresscontroller.operator.openshift.io/hash
                operator: In
                values:
                - 559d6c97f4
            topologyKey: kubernetes.io/hostname
      ...
```
2. Restart the default router:
```
oc rollout restart deployment router-default -n openshift-ingress
```
Actual results:
Deployment restart does not complete and hangs forever:
```
oc get po -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-58d88f8bf6-cxnjk   0/1     Pending   0          2s
router-default-5bb8c8985b-kdg92   1/1     Running   0          2d23h
```
Expected results:
Deployment restart completes
Additional info:
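A hedged workaround sketch (not from the report): with only one schedulable worker, deleting the old Running replica frees the node so the Pending one can schedule and the rollout completes:

```
# Delete the still-running old router pod so the new replica can schedule
# on the single worker node (a brief ingress disruption is expected).
oc -n openshift-ingress delete pod router-default-5bb8c8985b-kdg92
```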
Description of problem:
The alertmanager pod is stuck on OCP 4.11 with OVN in ContainerCreating state. From oc describe of the alertmanager pod:
...
Events:
  Type     Reason                  Age                  From     Message
  ----     ------                  ----                 ----     -------
  Warning  FailedCreatePodSandBox  16s (x459 over 17h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_alertmanager-managed-ocs-alertmanager-0_openshift-storage_3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc_0(88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78): error adding pod openshift-storage_alertmanager-managed-ocs-alertmanager-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/alertmanager-managed-ocs-alertmanager-0/3a55ed54-4eaa-4f65-8a10-e5d21fad1ebc:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-storage/alertmanager-managed-ocs-alertmanager-0 88575547dc0b210307b89dd2bb8e379ece0962b607ac2707a1c2cf630b1aaa78] [openshift
Version-Release number of selected component (if applicable):
OCP 4.11 with OVN
How reproducible:
100%
Steps to Reproduce:
1. Terminate the node on which the alertmanager pod is running
2. The pod will get stuck in ContainerCreating state
3.
Actual results:
AlertManager pod is stuck in container Creating state
Expected results:
Alertmanager pod is ready
Additional info:
The workaround would be to terminate the alertmanager pod
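A hedged sketch of that workaround (pod name taken from the events above):

```
# Delete the stuck pod; its controller recreates it and a fresh CNI add
# request usually succeeds.
oc -n openshift-storage delete pod alertmanager-managed-ocs-alertmanager-0
```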
This is a clone of issue OCPBUGS-5151. The following is the description of the original issue:
—
Description of problem:
The customer is not able to install a new OCP BM IPI cluster. During bootstrapping, the provisioning interfaces on the master nodes do not get an IPv4 DHCP IP address from the bootstrap DHCP server on an OCP IPI BareMetal install.

Please refer to the following bug: https://issues.redhat.com/browse/OCPBUGS-872

The problem was solved by applying rd.net.timeout.carrier=30 to the kernel parameters of compute nodes via the cluster-baremetal operator. The fix also needs to be applied to the control plane. ref: https://github.com/openshift/cluster-baremetal-operator/pull/286/files
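A hedged sketch of applying the same kernel argument to the control plane with a MachineConfig (the manifest name is hypothetical; the referenced PR does this inside the cluster-baremetal operator instead):

```
# Add the carrier timeout kernel argument to all master nodes.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-carrier-timeout
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  kernelArguments:
  - rd.net.timeout.carrier=30
EOF
```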
Version-Release number of selected component (if applicable):
How reproducible:
Perform OCP 4.10.16 IPI BareMetal install.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Customer should be able to install the cluster without any issue.
Additional info:
This is a clone of issue OCPBUGS-1704. The following is the description of the original issue:
—
Description of problem:
According to OCP 4.11 doc (https://docs.openshift.com/container-platform/4.11/installing/installing_gcp/installing-gcp-account.html#installation-gcp-enabling-api-services_installing-gcp-account), the Service Usage API (serviceusage.googleapis.com) is an optional API service to be enabled. But, the installation cannot succeed if this API is disabled.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-25-071630
How reproducible:
Always, if the Service Usage API is disabled in the GCP project.
Steps to Reproduce:
1. Make sure the Service Usage API (serviceusage.googleapis.com) is disabled in the GCP project.
2. Try IPI installation in the GCP project.
Actual results:
The installation would fail finally, without any worker machines launched.
Expected results:
Installation should succeed, or the OCP doc should be updated.
Additional info:
Please see the attached must-gather logs (http://virt-openshift-05.lab.eng.nay.redhat.com/jiwei/jiwei-0926-03-cnxn5/) and the sanity check results. FYI: after enabling the API, without changing anything else, the installation could succeed.
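A hedged sketch of enabling the API with the gcloud CLI (the project ID is a placeholder):

```
# Enable the Service Usage API in the target project.
gcloud services enable serviceusage.googleapis.com --project my-project-id
```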
This is a clone of issue OCPBUGS-5068. The following is the description of the original issue:
—
Description of problem:
virtual media provisioning fails when iLO Ironic driver is used
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. attempt virtual media provisioning on a node configured with ilo-virtualmedia:// drivers 2. 3.
Actual results:
Provisioning fails with "An auth plugin is required to determine endpoint URL" error
Expected results:
Provisioning succeeds
Additional info:
Relevant log snippet:

3742 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector [None req-e58ac1f2-fac6-4d28-be9e-983fa900a19b - - - - - -] Unable to start managed inspection for node e4445d43-3458-4cee-9cbe-6da1de7578cd: An auth plugin is required to determine endpoint URL: keystoneauth1.exceptions.auth_plugins.MissingAuthPlugin: An auth plugin is required to determine endpoint URL
3743 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector Traceback (most recent call last):
3744 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/inspector.py", line 210, in _start_managed_inspection
3745 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     task.driver.boot.prepare_ramdisk(task, ramdisk_params=params)
3746 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic_lib/metrics.py", line 59, in wrapped
3747 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     result = f(*args, **kwargs)
3748 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/ilo/boot.py", line 408, in prepare_ramdisk
3749 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     iso = image_utils.prepare_deploy_iso(task, ramdisk_params,
3750 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 624, in prepare_deploy_iso
3751 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     return prepare_iso_image(inject_files=inject_files)
3752 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 537, in _prepare_iso_image
3753 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     image_url = img_handler.publish_image(
3754 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/drivers/modules/image_utils.py", line 193, in publish_image
3755 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     swift_api = swift.SwiftAPI()
3756 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector   File "/usr/lib/python3.9/site-packages/ironic/common/swift.py", line 66, in __init__
3757 2022-12-19T19:02:05.997747170Z 2022-12-19 19:02:05.995 1 ERROR ironic.drivers.modules.inspector     endpoint = keystone.get_endpoint('swift', session=session)
This is a clone of issue OCPBUGS-6092. The following is the description of the original issue:
—
Description of problem:
While configuring 4.12.0 dualstack baremetal cluster ovs-configuration.service fails
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: Attempt 10 to bring up connection ovs-if-phys1
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + nmcli conn up ovs-if-phys1
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[26588]: Error: Connection activation failed: No suitable device found for this connection (device eno1np0 not available because profile is not compatible with device (mismatching interface name)).
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + s=4
Jan 19 22:01:05 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + sleep 5
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + '[' 4 -eq 0 ']'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + false
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + echo 'ERROR: Cannot bring up connection ovs-if-phys1 after 10 attempts'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: ERROR: Cannot bring up connection ovs-if-phys1 after 10 attempts
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + return 4
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + handle_exit
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + e=4
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + '[' 4 -eq 0 ']'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: + echo 'ERROR: configure-ovs exited with error: 4'
Jan 19 22:01:10 openshift-worker-0.kni-qe-4.lab.eng.rdu2.redhat.com configure-ovs.sh[14588]: ERROR: configure-ovs exited with error: 4
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
So far 100%
Steps to Reproduce:
1. Deploy a dualstack baremetal cluster with bonded interfaces (configured with MC and not NMState within install-config.yaml)
2. Run migration to a second interface, part of machine config (see the MachineConfig sketch after this list):
   - contents:
       source: data:text/plain;charset=utf-8,bond0.117
     filesystem: root
     mode: 420
     path: /etc/ovnk/extra_bridge
3. Install operators:
   * kubevirt-hyperconverged
   * sriov-network-operator
   * cluster-logging
   * elasticsearch-operator
4. Start applying node-tuning profiles
5. During node reboots the ovs-configuration service fails
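A hedged sketch of a complete MachineConfig wrapping the file entry from step 2 (the object name and Ignition version are assumptions):

```
# MachineConfig that tells configure-ovs to set up a second bridge on bond0.117.
cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-extra-bridge
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/ovnk/extra_bridge
        mode: 420
        contents:
          source: data:text/plain;charset=utf-8,bond0.117
EOF
```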
Actual results:
The ovs-configuration service fails on some nodes, resulting in ovnkube-node-* pod failures:
oc get po -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS          AGE
ovnkube-master-dvgx7   6/6     Running            8                 16h
ovnkube-master-vs7mp   6/6     Running            6                 16h
ovnkube-master-zrm4c   6/6     Running            6                 16h
ovnkube-node-2g8mb     4/5     CrashLoopBackOff   175 (3m48s ago)   16h
ovnkube-node-bfbcc     4/5     CrashLoopBackOff   176 (64s ago)     16h
ovnkube-node-cj6vf     5/5     Running            5                 16h
ovnkube-node-f92rm     5/5     Running            5                 16h
ovnkube-node-nmjpn     5/5     Running            5                 16h
ovnkube-node-pfv5z     4/5     CrashLoopBackOff   163 (4m53s ago)   15h
ovnkube-node-z5vf9     5/5     Running            10                15h
Expected results:
ovs-configuration service succeeds on all nodes
Additional info:
This is a clone of issue OCPBUGS-10864. The following is the description of the original issue:
—
Description of problem:
APIServer service not selected correctly for PublicAndPrivate when external-dns isn't configured.

Image: 4.14 Hypershift operator + OCP 4.14.0-0.nightly-2023-03-23-050449

jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jz-test -n clusters -ojsonpath='{.spec.platform.aws.endpointAccess}{"\n"}'
PublicAndPrivate

- lastTransitionTime: "2023-03-24T15:13:15Z"
  message: Cluster operators console, dns, image-registry, ingress, insights, kube-storage-version-migrator, monitoring, openshift-samples, service-ca are not available
  observedGeneration: 3
  reason: ClusterOperatorsNotAvailable
  status: "False"
  type: ClusterVersionSucceeding

services:
- service: APIServer
  servicePublishingStrategy:
    type: LoadBalancer
- service: OAuthServer
  servicePublishingStrategy:
    type: Route
- service: Konnectivity
  servicePublishingStrategy:
    type: Route
- service: Ignition
  servicePublishingStrategy:
    type: Route
- service: OVNSbDb
  servicePublishingStrategy:
    type: Route

jiezhao-mac:hypershift jiezhao$ oc get service -n clusters-jz-test | grep kube-apiserver
kube-apiserver           LoadBalancer   172.30.211.131   aa029c422933444139fb738257aedb86-9e9709e3fa1b594e.elb.us-east-2.amazonaws.com   6443:32562/TCP   34m
kube-apiserver-private   LoadBalancer   172.30.161.79    ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com   6443:32100/TCP   34m

jiezhao-mac:hypershift jiezhao$ cat hostedcluster.kubeconfig | grep server
    server: https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443

jiezhao-mac:hypershift jiezhao$ oc get node --kubeconfig=hostedcluster.kubeconfig
E0324 11:17:44.003589 95300 memcache.go:238] couldn't get current server API group list: Get "https://ab8434aa316e845c59690ca0035332f0-d818b9434f506178.elb.us-east-2.amazonaws.com:6443/api?timeout=32s": dial tcp 10.0.129.24:6443: i/o timeout
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a PublicAndPrivate cluster without external-dns
2. Access the guest cluster (it should fail)
3.
Actual results:
unable to access the guest cluster via 'oc get node --kubeconfig=<guest cluster kubeconfig>', some guest cluster co are not available
Expected results:
The cluster is up and running, the guest cluster can be accessed via 'oc get node --kubeconfig=<guest cluster kubeconfig>'
Additional info:
Description of problem:
When a pod runs to a completed state, we typically rely on the update event that will indicate to us that this pod is completed. At that point the pod IP is released and the port configuration is removed in OVN. The subsequent delete event for this pod will be ignored because it should have been cleaned up in the previous update. However, there can be cases where the update event is missed with pod completed. In this case we will only receive a delete with pod completed event, and ignore tearing down the pod. The end result is the pod is not cleaned up in OVN and the IP address remains allocated, reducing the amount of address range available to launch another pod. This can lead to exhausting all IP addresses available for pod allocation on a node.
Version-Release number of selected component (if applicable):
4.10.24
How reproducible:
Not sure how to reproduce this. I'm guessing some lag in kapi updates can cause the completed update event and the final delete event to be combined into a single event.
Steps to Reproduce:
1. 2. 3.
Actual results:
Port still exists in OVN, IP remains allocated for a deleted pod.
Expected results:
IP should be freed, port should be removed from OVN.
Additional info:
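A hedged way to spot leaked ports, assuming access to the OVN north database via an ovnkube-master pod (the pod name, container, and node name are placeholders; in ovn-kubernetes, pod ports are named <namespace>_<pod> and each node has a logical switch named after it):

```
# List logical switch ports on a node's switch and compare against running pods;
# ports for pods that no longer exist indicate leaked IP allocations.
oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c northd -- \
  ovn-nbctl lsp-list <node-name>
```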
This is a clone of issue OCPBUGS-6222. The following is the description of the original issue:
—
Please review the following PR: https://github.com/openshift/alibaba-cloud-csi-driver/pull/20
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In OpenShift 4.7.0 and 4.6.20, cluster-ingress-operator started using the OpenShift-specific unsupported.do-not-use.openshift.io/override-liveness-grace-period-seconds annotation for router pods as a short-term measure to configure the liveness probe's grace period in order to fix OCPBUGSM-20760 (BZ#1899941). This annotation is implemented by a carry patch in openshift/kubernetes.
Since then, upstream Kubernetes has added a terminationGracePeriodSeconds API field to configure the liveness probe using a formal API (upstream doc reference). Using this API field will allow for the carry patch to be removed from openshift/kubernetes.
Example:
spec:
  terminationGracePeriodSeconds: 3600  # pod-level
  containers:
  - name: test
    image: ...
    ports:
    - name: liveness-port
      containerPort: 8080
      hostPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz
        port: liveness-port
      failureThreshold: 1
      periodSeconds: 60
      # Override pod-level terminationGracePeriodSeconds
      # terminationGracePeriodSeconds: 10
OpenShift 4.13.
Always.
1. Check the annotation and API field on a running cluster: oc -n openshift-ingress get pods -Lingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o 'custom-columns=NAME:.metadata.name,ANNOTATION:.metadata.annotations.unsupported\.do-not-use\.openshift\.io\/override-liveness-grace-period-seconds,SPEC:.spec.containers[0].livenessProbe.terminationGracePeriodSeconds'
The annotation is set, and the spec field is not:
% oc -n openshift-ingress get pods -Lingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o 'custom-columns=NAME:.metadata.name,ANNOTATION:.metadata.annotations.unsupported\.do-not-use\.openshift\.io\/override-liveness-grace-period-seconds,SPEC:.spec.containers[0].livenessProbe.terminationGracePeriodSeconds' NAME ANNOTATION SPEC router-default-677f956f8b-d5lqz 10 <none> router-default-677f956f8b-hntbb 10 <none>
The annotation is not set, and the spec field is:
% oc -n openshift-ingress get pods -Lingresscontroller.operator.openshift.io/deployment-ingresscontroller=default -o 'custom-columns=NAME:.metadata.name,ANNOTATION:.metadata.annotations.unsupported\.do-not-use\.openshift\.io\/override-liveness-grace-period-seconds,SPEC:.spec.containers[0].livenessProbe.terminationGracePeriodSeconds' NAME ANNOTATION SPEC router-default-677f956f8b-d5lqz <none> 10 router-default-677f956f8b-hntbb <none> 10
This is a clone of issue OCPBUGS-4190. The following is the description of the original issue:
—
Description of problem:
Two tests are perma-failing in metal-ipi upgrade tests:
[sig-imageregistry] Image registry remains available using new connections (39m27s)
[sig-imageregistry] Image registry remains available using reused connections (39m27s)
Version-Release number of selected component (if applicable):
4.12 / 4.13
How reproducible:
all ci runs
Steps to Reproduce:
1. 2. 3.
Actual results:
Nov 24 02:58:26.998: INFO: "[sig-imageregistry] Image registry remains available using reused connections": panic: runtime error: invalid memory address or nil pointer dereference
Expected results:
pass
Additional info:
Description of problem:
vSphere 4.12 CI jobs are failing with:
admission webhook "validation.csi.vsphere.vmware.com" denied the request: AllowVolumeExpansion can not be set to true on the in-tree vSphere StorageClass
https://search.ci.openshift.org/?search=can+not+be+set+to+true+on+the+in-tree+vSphere+StorageClass&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
4.12 nigthlies
How reproducible:
consistently in CI
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This appears to have started failing in the past 36 hours.
This is a clone of issue OCPBUGS-3973. The following is the description of the original issue:
—
Description of problem:
Upgrade SNO cluster from 4.12 to 4.13; the csi-snapshot-controller is degraded with this message (same as the log from csi-snapshot-controller-operator):

E1122 09:02:51.867727 1 base_controller.go:272] StaticResourceController reconciliation failed: ["csi_controller_deployment_pdb.yaml" (string): poddisruptionbudgets.policy "csi-snapshot-controller-pdb" is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator" cannot delete resource "poddisruptionbudgets" in API group "policy" in the namespace "openshift-cluster-storage-operator", "webhook_deployment_pdb.yaml" (string): poddisruptionbudgets.policy "csi-snapshot-webhook-pdb" is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller-operator" cannot delete resource "poddisruptionbudgets" in API group "policy" in the namespace "openshift-cluster-storage-operator"]
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-11-19-191518 to 4.13.0-0.nightly-2022-11-19-182111
How reproducible:
1/1
Steps to Reproduce:
Upgrade SNO cluster from 4.12 to 4.13
Actual results:
csi-snapshot-controller is degraded
Expected results:
csi-snapshot-controller should be healthy
Additional info:
It also happened on from scratch cluster on 4.13: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-aws-ovn-arm64-single-node/1594946128904720384
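The errors suggest the operator's service account lacks delete on poddisruptionbudgets. A minimal sketch of the missing RBAC, assuming the fix lands as an in-payload Role (the object name is hypothetical):

```
# Grant the operator's service account PDB management in its namespace.
cat <<'EOF' | oc apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: csi-snapshot-controller-operator-pdb
  namespace: openshift-cluster-storage-operator
rules:
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
EOF
```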
Description of problem:
The statsPort is not correctly set for the HostNetwork endpointPublishingStrategy. When we change the httpPort from 80 to 85 and the statsPort from 1936 to 1939 on the default router like here:

# oc get IngressController default -n openshift-ingress-operator
...
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  endpointPublishingStrategy:
    hostNetwork:
      httpPort: 85
      httpsPort: 443
      statsPort: 1939
    type: HostNetwork
...
status:
...
  endpointPublishingStrategy:
    hostNetwork:
      httpPort: 85
      httpsPort: 443
      protocol: TCP
      statsPort: 1939

we can see that the router pods get restarted:

# oc get pod -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-5b96855754-2wnrp   1/1     Running   0          1m
router-default-5b96855754-9c724   1/1     Running   0          2m

The pods are configured correctly:

# oc get pod router-default-5b96855754-2wnrp -o yaml
...
spec:
  containers:
  - env:
    - name: ROUTER_SERVICE_HTTPS_PORT
      value: "443"
    - name: ROUTER_SERVICE_HTTP_PORT
      value: "85"
    - name: STATS_PORT
      value: "1939"
...
    livenessProbe:
      failureThreshold: 3
      httpGet:
        host: localhost
        path: /healthz
        port: 1939
        scheme: HTTP
...
    ports:
    - containerPort: 85
      hostPort: 85
      name: http
      protocol: TCP
    - containerPort: 443
      hostPort: 443
      name: https
      protocol: TCP
    - containerPort: 1939
      hostPort: 1939
      name: metrics
      protocol: TCP

But the endpoint is incorrect:

# oc get ep router-internal-default -o yaml
...
apiVersion: v1
items:
- apiVersion: v1
  kind: Endpoints
  metadata:
    creationTimestamp: "2022-12-02T13:34:48Z"
    labels:
      ingresscontroller.operator.openshift.io/owning-ingresscontroller: default
    name: router-internal-default
    namespace: openshift-ingress
    resourceVersion: "23216275"
    uid: 50c00fc0-08e5-4a6a-a7eb-7501fa1a7ba6
  subsets:
  - addresses:
    - ip: 10.74.211.203
      nodeName: worker-0.rhodain01.lab.psi.pnq2.redhat.com
      targetRef:
        kind: Pod
        name: router-default-5b96855754-2wnrp
        namespace: openshift-ingress
        uid: eda945b9-9061-4361-b11a-9d895fee0003
    - ip: 10.74.211.216
      nodeName: worker-1.rhodain01.lab.psi.pnq2.redhat.com
      targetRef:
        kind: Pod
        name: router-default-5b96855754-9c724
        namespace: openshift-ingress
        uid: 97a04c3e-ddea-43b7-ac70-673279057929
    ports:
    - name: metrics
      port: 1936
      protocol: TCP
    - name: https
      port: 443
      protocol: TCP
    - name: http
      port: 85
      protocol: TCP
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Notice that the http port is correctly set to 85, but the stats port is still set to 1936 and not to 1939. That is a problem, as the metrics target endpoint is reported as down with an error message:

Get "https://10.74.211.203:1936/metrics": dial tcp 10.74.211.203:1936: connect: connection refused

When the EP is corrected and the ports are changed to:

    ports:
    - name: metrics
      port: 1939
      protocol: TCP

the metrics target endpoint is picked up correctly and metrics are scraped as expected.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
Set endpointPublishingStrategy and modify the statsPort:
endpointPublishingStrategy:
  hostNetwork:
    httpPort: 85
    httpsPort: 443
    protocol: TCP
    statsPort: 1939
Actual results:
Stats are scraped from the standard port and not the one specified.
Expected results:
The endpoint object points to the specified port.
Additional info:
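A hedged one-liner to verify the endpoint ports after changing statsPort:

```
# Show the ports published by the internal router endpoints; the metrics
# port should match the configured statsPort (1939 here).
oc -n openshift-ingress get ep router-internal-default \
  -o jsonpath='{.subsets[0].ports}'
```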
The pipeline run nodes used to show a focus border when they were in focus but no longer do.
There is no indication of which node has the focus
There should be a focus border indicating the current focus node.
always
4.12
Description of problem:
The current version of OpenShift's CoreDNS is based on Kubernetes 1.24 packages. OpenShift 4.12 is based on Kubernetes 1.25.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/coredns/blob/release-4.12/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.24.0.
Expected results:
Kubernetes packages are at version v0.25.0 or later.
Additional info:
Using old Kubernetes API and client packages brings risk of API compatibility issues.
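A hedged sketch of the bump, with module versions taken from the expected results (run inside the coredns repo checkout):

```
# Bump the Kubernetes client libraries to the 1.25 line and tidy the module graph.
go get k8s.io/api@v0.25.0 k8s.io/apimachinery@v0.25.0 k8s.io/client-go@v0.25.0
go mod tidy
```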
This is a clone of issue OCPBUGS-2551. The following is the description of the original issue:
—
Description of problem:
When a normal user selects "All namespaces" using the "Show operands in" radio button, an "Error Loading" error is shown.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-18-192348, 4.11
How reproducible:
Always
Steps to Reproduce:
1. Install the operator "Red Hat Integration - Camel K" on All namespaces
2. Login to the console as a normal user
3. Navigate to the "All instances" tab for the operator
4. Check that the "All namespaces" radio button is selected
5. Check the page
Actual results:
The "Error Loading" info is shown on the page.
Expected results:
The error should not be shown.
Additional info:
This is a clone of issue OCPBUGS-7207. The following is the description of the original issue:
—
At some point in the mtu-migration development, a configuration file was generated at /etc/cno/mtu-migration/config which was used as a flag to indicate to configure-ovs that a migration procedure was in progress. When that file was missing, it was assumed the migration procedure was over and configure-ovs did some cleaning on its behalf.
But that changed and /etc/cno/mtu-migration/config is never set. That causes configure-ovs to remove mtu-migration information while the procedure is still in progress, making it use incorrect MTU values and either causing nodes to be tainted with "ovn.k8s.org/mtu-too-small", blocking the procedure itself, or causing network disruption until the procedure is over.
However, this was not a problem for the CI job as it doesn't use the migration procedure as documented for the sake of saving limited time available to run CI jobs. The CI merges two steps of the procedure into one so that there is never a reboot while the procedure is in progress and hiding this issue.
This was probably not detected in QE as well for the same reason as CI.
Description of problem:
If a cluster has Control Plane Machines that are not indexed from 0, the CPMS currently goes degraded. For example, if you have already replaced the machines and they are indexed as `3`, `4`, `5`, the cluster will degrade.

status:
  conditions:
  - lastTransitionTime: "2022-09-22T11:49:54Z"
    message: Missing 3 available replica(s)
    observedGeneration: 1
    reason: UnavailableReplicas
    status: "False"
    type: Available
  - lastTransitionTime: "2022-09-22T13:25:40Z"
    message: Observed 2 index(es) in excess
    reason: ExcessIndexes
    status: "True"
    type: Degraded
  - lastTransitionTime: "2022-09-22T13:25:40Z"
    message: ""
    reason: OperatorDegraded
    status: "False"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  unavailableReplicas: 3
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a cluster
2. Replace each control plane machine so that you end up with machines master-3, master-4 and master-5
3. Observe the CPMS status
Actual results:
CPMS goes degraded because the current Control Plane Machine names are not indexed from 0.
Expected results:
CPMS should be happy with the current Control Plane Machine names
Additional info:
This is a clone of issue OCPBUGS-948. The following is the description of the original issue:
—
Description of problem:
OLM is setting the "openshift.io/scc" label to "anyuid" on several namespaces:
https://github.com/openshift/operator-framework-olm/blob/d817e09c2565b825afd8bfc9bb546eeff28e47e7/manifests/0000_50_olm_00-namespace.yaml#L23
https://github.com/openshift/operator-framework-olm/blob/d817e09c2565b825afd8bfc9bb546eeff28e47e7/manifests/0000_50_olm_00-namespace.yaml#L8

This label has no effect and will lead to confusion. It should be set to the empty string for now (removing it entirely will have no effect on upgraded clusters because the CVO does not remove deleted labels, so the next best thing is to clear the value). For bonus points, OLM should remove the label entirely from the manifest and add migration logic to remove the existing label from these namespaces to handle upgraded clusters that already have it.
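A hedged sketch of clearing the label value in-cluster on one of the affected namespaces (the manifest change would mirror this; the namespace name is taken from the linked manifest file):

```
# Clear the ineffective SCC label value; --overwrite replaces the "anyuid" value.
oc label namespace openshift-operator-lifecycle-manager openshift.io/scc="" --overwrite
```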
Version-Release number of selected component (if applicable):
Not sure how long this has been an issue, but fixing it in 4.12+ should be sufficient.
How reproducible:
always
Steps to Reproduce:
1. install cluster 2. examine namespace labels
Actual results:
label is present
Expected results:
ideally label should not be present, but in the short term setting it to emptystring is the quick fix and is better than nothing.
This is a clone of issue OCPBUGS-2144. The following is the description of the original issue:
—
Description of problem:
Azure IPI now creates boot images using the image gallery API; it creates two image definition resources, for hyperVGeneration V1 and V2. For an arm64 cluster, the architecture in the hyperVGeneration V1 image definition is x64, but it should be Arm64.
Version-Release number of selected component (if applicable):
./openshift-install version
./openshift-install 4.12.0-0.nightly-arm64-2022-10-07-204251
built from commit 7b739cde1e0239c77fabf7622e15025d32fc272c
release image registry.ci.openshift.org/ocp-arm64/release-arm64@sha256:d2569be4ba276d6474aea016536afbad1ce2e827b3c71ab47010617a537a8b11
release architecture arm64
How reproducible:
always
Steps to Reproduce:
1. Create an arm cluster using the latest arm64 nightly build
2. Check the image definition created for hyperVGeneration V1
Actual results:
The architecture field is x64.

$ az sig image-definition show --gallery-name ${gallery_name} --gallery-image-definition lwanazarm1008-rc8wh --resource-group ${rg} | jq -r ".architecture"
x64

The image version under this image definition is for aarch64.

$ az sig image-version show --gallery-name gallery_lwanazarm1008_rc8wh --gallery-image-definition lwanazarm1008-rc8wh --resource-group lwanazarm1008-rc8wh-rg --gallery-image-version 412.86.20220922 | jq -r ".storageProfile.osDiskImage.source"
{
  "uri": "https://clustermuygq.blob.core.windows.net/vhd/rhcosmuygq.vhd"
}

$ az storage blob show --container-name vhd --name rhcosmuygq.vhd --account-name clustermuygq --account-key $account_key | jq -r ".metadata"
{
  "Source_uri": "https://rhcos.blob.core.windows.net/imagebucket/rhcos-412.86.202209220538-0-azure.aarch64.vhd"
}
Expected results:
Although no VMs with hyperVGeneration V1 can be provisioned, the architecture field should be Arm64 even for hyperVGeneration V1 image definitions.
Additional info:
1. The architecture in the hyperVGeneration V2 image definition is Arm64, and the installer uses V2 by default for arm64 vm_type, so installation doesn't fail by default. But we still need to make the architecture consistent in V1.
2. We need to set the architecture field for both V1 and V2; currently we only set architecture in the V2 image definition resource. https://github.com/openshift/installer/blob/master/data/data/azure/vnet/main.tf#L100-L128
Description of problem:
Some upgrade CI jobs from 4.11.z to 4.12 nightly builds fail because the systemd unit machine-config-daemon-update-rpmostree-via-container fails.
omg get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-6e18de1272fad7a5ca1529941e3ceaed   False     True       True       3              0                   0                     1                      3h53m
master   rendered-master-60f4ff5893c94f53acd9ebb7a6bf53d4   False     True       True       3              0                   0                     1                      3h53m
Check the affected node:
omg get node/ip-10-0-57-74.us-east-2.compute.internal -o yaml | yq -y '.metadata.annotations'
cloud.network.openshift.io/egress-ipconfig: '[{"interface":"eni-0f6de21569b5b65c8","ifaddr":{"ipv4":"10.0.48.0/20"},"capacity":{"ipv4":14,"ipv6":15}}]'
csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-01a34f6b5f2cd1e41"}'
machine.openshift.io/machine: openshift-machine-api/ci-op-kb95kxx9-2a438-r6z94-master-2
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-master-065664319cfbaee64277097d49a8a5a6
machineconfiguration.openshift.io/desiredConfig: rendered-master-60f4ff5893c94f53acd9ebb7a6bf53d4
machineconfiguration.openshift.io/desiredDrain: drain-rendered-master-60f4ff5893c94f53acd9ebb7a6bf53d4
machineconfiguration.openshift.io/lastAppliedDrain: drain-rendered-master-60f4ff5893c94f53acd9ebb7a6bf53d4
machineconfiguration.openshift.io/reason: 'error running systemd-run --unit machine-config-daemon-update-rpmostree-via-container --collect --wait -- podman run --authfile /var/lib/kubelet/config.json --privileged --pid=host --net=host --rm -v /:/run/host quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661 rpm-ostree ex deploy-from-self /run/host: Running as unit: machine-config-daemon-update-rpmostree-via-container.service
  Finished with result: exit-code
  Main processes terminated with: code=exited/status=125
  Service runtime: 2min 52ms
  CPU time consumed: 144ms
  : exit status 125'
machineconfiguration.openshift.io/state: Degraded
volumes.kubernetes.io/controller-managed-attach-detach: 'true'
check mcd log on issued node
omg get pod -n openshift-machine-config-operator -o json | jq -r '.items[]|select(.spec.nodeName=="ip-10-0-57-74.us-east-2.compute.internal")|.metadata.name' | grep daemon
machine-config-daemon-znbvf

2022-10-09T22:12:58.797891917Z I1009 22:12:58.797821 179598 update.go:1917] Updating OS to layered image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661
2022-10-09T22:12:58.797891917Z I1009 22:12:58.797846 179598 rpm-ostree.go:447] Running captured: rpm-ostree --version
2022-10-09T22:12:58.815829171Z I1009 22:12:58.815800 179598 update.go:2068] rpm-ostree is not new enough for layering; forcing an update via container
2022-10-09T22:12:58.817577513Z I1009 22:12:58.817555 179598 update.go:2053] Running: systemd-run --unit machine-config-daemon-update-rpmostree-via-container --collect --wait -- podman run --authfile /var/lib/kubelet/config.json --privileged --pid=host --net=host --rm -v /:/run/host quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661 rpm-ostree ex deploy-from-self /run/host
...
2022-10-09T22:15:00.831959313Z E1009 22:15:00.831949 179598 writer.go:200] Marking Degraded due to: error running systemd-run --unit machine-config-daemon-update-rpmostree-via-container --collect --wait -- podman run --authfile /var/lib/kubelet/config.json --privileged --pid=host --net=host --rm -v /:/run/host quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0daf5c4a35424410e88dde102022fc3581302bc8a98e09e2e4748502c59b3661 rpm-ostree ex deploy-from-self /run/host: Running as unit: machine-config-daemon-update-rpmostree-via-container.service
2022-10-09T22:15:00.831959313Z Finished with result: exit-code
2022-10-09T22:15:00.831959313Z Main processes terminated with: code=exited/status=125
2022-10-09T22:15:00.831959313Z Service runtime: 2min 52ms
2022-10-09T22:15:00.831959313Z CPU time consumed: 144ms
2022-10-09T22:15:00.831959313Z : exit status 125
Version-Release number of selected component (if applicable):
4.12
Steps to Reproduce:
upgrade cluster from 4.11.8 to 4.12.0-0.nightly-2022-10-05-053337
Actual results:
The upgrade fails because a node is degraded: the rpm-ostree update via container fails with exit status 125.
Expected results:
upgrade can be completed successfully
Additional info:
must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.12-nightly-4.12-upgrade-from-stable-4.11-aws-ipi-proxy-p1/1579169944476585984/artifacts/aws-ipi-proxy-p1/gather-must-gather/artifacts/must-gather.tar
Other build logs of failed jobs
This is a clone of issue OCPBUGS-5164. The following is the description of the original issue:
—
Description of problem:
It looks like the ODC doesn't register the KNATIVE_SERVING and KNATIVE_EVENTING flags. Those are based on the KnativeServing and KnativeEventing CRs, but the console looks for the v1alpha1 version of those: https://github.com/openshift/console/blob/f72519fdf2267ad91cc0aa51467113cc36423a49/frontend/packages/knative-plugin/console-extensions.json#L6-L8
This PR, https://github.com/openshift-knative/serverless-operator/pull/1695, moved the CRs to v1beta1, which breaks that ODC discovery.
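If the fix is simply to track the CR move, a sketch of the updated fragment in console-extensions.json might look like the following (shape per the console.flag/model extension; verify against the actual file before relying on it):
~~~
{
  "type": "console.flag/model",
  "properties": {
    "flag": "KNATIVE_SERVING",
    "model": {
      "group": "operator.knative.dev",
      "version": "v1beta1",
      "kind": "KnativeServing"
    }
  }
}
~~~
A more robust fix would have the console set the flag based on whichever version of the CRD is actually served.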
Version-Release number of selected component (if applicable):
OpenShift 4.8, Serverless Operator 1.27
Additional info:
https://coreos.slack.com/archives/CHGU4P8UU/p1671634903447019
This is a clone of issue OCPBUGS-8701. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-8232. The following is the description of the original issue:
—
Description of problem:
The `oc patch project` command fails to annotate the project.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Run the below patch command to update the annotations on an existing project:
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "This is a new project"}}}'
~~~
Actual results:
It produces the error output below:
~~~
The Project "<PROJECT_NAME>" is invalid:
* metadata.namespace: Invalid value: "<PROJECT_NAME>": field is immutable
* metadata.namespace: Forbidden: not allowed on this type
~~~
Expected results:
The `oc patch project` command should patch the project with specified annotation.
Additional info:
Patching the project with the same command works as expected on OCP 4.11.26:
~~~
oc patch project <PROJECT_NAME> --type merge --patch '{"metadata":{"annotations":{"openshift.io/display-name": "null","openshift.io/description": "New project"}}}'
project.project.openshift.io/<PROJECT_NAME> patched
~~~
The issue is specific to OCP 4.12, where the same command fails.
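A possible interim workaround (a sketch, not verified against 4.12) is to annotate the backing namespace directly instead of patching the project resource:
~~~
# Mirrors the annotations from the failing patch command above.
oc annotate namespace <PROJECT_NAME> \
  openshift.io/display-name="null" \
  openshift.io/description="This is a new project" \
  --overwrite
~~~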
This bug is a backport clone of [Bugzilla Bug 2100429](https://bugzilla.redhat.com/show_bug.cgi?id=2100429). The following is the description of the original bug:
—
Description of problem:
[apiserver-auth] The default SCC `restricted` does not include "ephemeral" in its allowed volume types, which causes deployments using Generic Ephemeral Volumes to be stuck at Pending.
Version-Release number of selected component (if applicable):
Cluster version is 4.11.0-0.nightly-2022-06-22-190830
$ oc version
Client Version: 4.11.0-0.nightly-2022-05-11-054135
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-2022-06-22-190830
Kubernetes Version: v1.24.0+284d62a
How reproducible:
Always
Steps to Reproduce:
1. Set up an AWS OCP cluster with a 4.11 nightly
2. Create a deployment with Generic Ephemeral Volumes
3. Waiting for the deployment ready and check the volume could write and read data
Test data:
wangpenghao@MacBook-Pro ~ cat temp.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-dep
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-dep
  template:
    metadata:
      labels:
        app: my-dep
    spec:
      containers:
Actual results:
In step 3, the deployment is stuck at Pending because the pod is unable to validate against any security context constraint.
Expected results:
In step 3, the deployment should become ready with the default SCC `restricted`; the default `restricted` SCC should allow the "ephemeral" type in its `volumes` list.
Additional info:
Generic ephemeral volumes are the safer option of these two: they just create/delete PVCs on behalf of users, and most users can already create PVCs.
The "ephemeral" volume type is not in the SCC's `volumes` list definition:
https://docs.openshift.com/container-platform/4.10/authentication/managing-security-context-constraints.html#authorization-cont[…]ing-internal-oauth
So currently, customers who want to use the "ephemeral" volume type have to use an SCC whose `volumes` list includes it; a sketch follows.
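For illustration, a minimal sketch of what such an SCC fragment could look like (the name is hypothetical and the remaining restricted-equivalent fields are elided; the volume types listed are those of the default restricted SCC plus "ephemeral"):
~~~
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: restricted-with-ephemeral   # hypothetical name
# ...remaining restricted-equivalent fields elided...
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
- ephemeral   # the type missing from the default restricted SCC
~~~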
Discuss record: https://coreos.slack.com/archives/CB48XQ4KZ/p1655465586780419
Generic Ephemeral Volumes docs:
https://kubernetes.io/blog/2020/09/01/ephemeral-volumes-with-storage-capacity-tracking/#generic-ephemeral-volumes
Master Log:
Node Log (of failed PODs):
PV Dump:
PVC Dump:
StorageClass Dump (if StorageClass used by PV/PVC):
Description of problem:
- After upgrading to OCP 4.10.41, thanos-ruler-user-workload-1 in the openshift-user-workload-monitoring namespace is consistently being created and deleted.
- We had to scale down the Prometheus operator multiple times so that the upgrade is considered successful.
- This fix is temporary. After some time the issue appears again, and the Prometheus operator needs to be scaled down and up again.
- The issue is present on all clusters in this customer environment which were upgraded to 4.10.41.
Version-Release number of selected component (if applicable):
How reproducible:
N/A, I wasn't able to reproduce the issue.
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Name: Routing
Description: Please change the "Routing" component to be a subcomponent "router" of the "Networking" component.
Component: change to "Networking".
Subcomponent: change to "router".
Existing fields (default assignee, default QA contact, default CC email list, etc.) should remain the same as they currently are.
Default Assignee: aos-network-edge-staff@bot.bugzilla.redhat.com
Default QA Contact: hongli@redhat.com
Default CC List: aos-network-edge-staff@bot.bugzilla.redhat.com
Additional Notes:
I filled in "Default CC email list" because the form validation would not permit me to omit it. However, it can be left empty in Bugzilla (it is currently empty).
If possible, we would like this change to be done prior to the Bugzilla-to-Jira migration to avoid the need to make the change after the migration.
This is a clone of issue OCPBUGS-6270. The following is the description of the original issue:
—
Similar to how the baremetal platform previously required a bunch of fields that are actually ignored, due to the install-config validation (OCPBUGS-3278), we require values for the following fields in the platform.vsphere section:
None of these values are actually used in the agent-based installer at present, and they should not be required.
Users can work around this by specifying dummy values in the platform config (note that the VIP values are required and must be genuine):
platform:
  vsphere:
    apiVIP: 192.168.111.1
    ingressVIP: 192.168.111.2
    vCenter: a
    username: b
    password: c
    datacenter: d
    defaultDatastore: e
This is a clone of issue OCPBUGS-10220. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-7559. The following is the description of the original issue:
—
Description of problem:
When attempting to add nodes to a long-lived 4.12.3 cluster, net new nodes are not able to join the cluster. They are provisioned in the cloud provider (AWS), but never actually join as a node.
Version-Release number of selected component (if applicable):
4.12.3
How reproducible:
Consistent
Steps to Reproduce:
1. On a long lived cluster, add a new machineset
Actual results:
Machines reach "Provisioned" but don't join the cluster
Expected results:
Machines join cluster as nodes
Additional info:
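A generic first diagnostic step for machines stuck in Provisioned (a sketch; not asserted to be the root cause of this particular bug) is to check for unapproved kubelet CSRs, since nodes cannot join while their CSRs are pending:
~~~
oc get csr --sort-by=.metadata.creationTimestamp | grep Pending
# Approve a specific pending CSR, e.g.:
oc adm certificate approve <csr-name>
~~~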
This is a clone of issue OCPBUGS-4401. The following is the description of the original issue:
—
Description of problem:
cluster-policy-controller has unnecessary permissions and is able to operate on all leases in the KCM namespace. This also applies to the namespace-security-allocation-controller, which was moved some time ago and does not need a locking mechanism.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3172. The following is the description of the original issue:
—
Customer is trying to install the Logging operator, which appears to attempt to install a dynamic plugin. The operator installation fails in the console because permissions aren't available to "patch resource consoles".
We shouldn't block operator installation if permission issues prevent dynamic plugin installation.
This is an OSD cluster, presumably for a customer with "cluster-admin", although it may be a pared-down permission set called "dedicated-admin".
See https://docs.google.com/document/d/1hYS-bm6aH7S6z7We76dn9XOFcpi9CGYcGoJys514YSY/edit for permissions investigation work on OSD
This is a clone of issue OCPBUGS-2479. The following is the description of the original issue:
—
Description of problem:
Right border radius is 0 for the pipeline visualization wrapper in dark mode but looks fine in light mode
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. Switch the theme to dark mode
2. Create a pipeline and navigate to the Pipeline details page
Actual results:
Right border radius is 0, see the screenshots
Expected results:
The right border radius should be the same as the left border radius.
Additional info:
Description of problem:
mapi_machinehealthcheck_short_circuit does not properly reconcile its state when a MachineHealthCheck that was failing because of unhealthy Machines is removed. With two MachineSets (called blue and green, where only one has running Machines at any point in time), each with a MachineAutoscaler and MachineHealthCheck, mapi_machinehealthcheck_short_circuit continues to report 1 for a MachineHealthCheck that was removed during the switch from blue to green.

$ oc get machineset | egrep 'blue|green'
housiocp4-wvqbx-worker-blue-us-east-2a 0 0 2d17h
housiocp4-wvqbx-worker-green-us-east-2a 1 1 1 1 2d17h

$ oc get machineautoscaler
NAME REF KIND REF NAME MIN MAX AGE
worker-green-us-east-1a MachineSet housiocp4-wvqbx-worker-green-us-east-2a 1 4 2d17h

$ oc get machinehealthcheck
NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
machine-api-termination-handler 100% 0 0
worker-green-us-east-1a 40% 1 1

{
  "name": "machine-health-check-unterminated-short-circuit",
  "file": "/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-machine-api-machine-api-operator-prometheus-rules-ccb650d9-6fc4-422b-90bb-70452f4aff8f.yaml",
  "rules": [
    {
      "state": "firing",
      "name": "MachineHealthCheckUnterminatedShortCircuit",
      "query": "mapi_machinehealthcheck_short_circuit == 1",
      "duration": 1800,
      "labels": {
        "severity": "warning"
      },
      "annotations": {
        "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
        "summary": "machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"
      },
      "alerts": [
        {
          "labels": {
            "alertname": "MachineHealthCheckUnterminatedShortCircuit",
            "container": "kube-rbac-proxy-mhc-mtrc",
            "endpoint": "mhc-mtrc",
            "exported_namespace": "openshift-machine-api",
            "instance": "10.128.0.58:8444",
            "job": "machine-api-controllers",
            "name": "worker-blue-us-east-1a",
            "namespace": "openshift-machine-api",
            "pod": "machine-api-controllers-779dcb8769-8gcn6",
            "service": "machine-api-controllers",
            "severity": "warning"
          },
          "annotations": {
            "description": "The number of unhealthy machines has exceeded the `maxUnhealthy` limit for the check, you should check\nthe status of machines in the cluster.\n",
            "summary": "machine health check worker-blue-us-east-1a has been disabled by short circuit for more than 30 minutes"
          },
          "state": "firing",
          "activeAt": "2022-12-09T15:59:25.1287541Z",
          "value": "1e+00"
        }
      ],
      "health": "ok",
      "evaluationTime": 0.000648129,
      "lastEvaluation": "2022-12-12T09:35:55.140174009Z",
      "type": "alerting"
    }
  ],
  "interval": 30,
  "limit": 0,
  "evaluationTime": 0.000661589,
  "lastEvaluation": "2022-12-12T09:35:55.140165629Z"
},

As we can see above, worker-blue-us-east-1a is no longer present and active; worker-green-us-east-1a is. But worker-blue-us-east-1a existed before the switch to green happened and was actually reporting some unhealthy Machines. Since it is now gone, mapi_machinehealthcheck_short_circuit should properly reconcile, as otherwise this is a false-positive alert.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.12.0-rc.3 (but is also seen on previous version)
How reproducible:
- Always
Steps to Reproduce:
1. Set up OpenShift Container Platform 4 on AWS, for example
2. Create blue and green MachineSets with MachineAutoscaler and MachineHealthCheck
3. Have active Machines for blue only
4. Trigger unhealthy Machines in the blue MachineSet
5. Switch to the green MachineSet by removing the blue MachineHealthCheck and MachineAutoscaler and setting the replicas of the blue MachineSet to 0
6. Create the green MachineHealthCheck and MachineAutoscaler and scale the green MachineSet to 1
7. Observe how mapi_machinehealthcheck_short_circuit continues to report the unhealthy state for the blue MachineHealthCheck, which no longer exists
Actual results:
mapi_machinehealthcheck_short_circuit reports a problematic MachineHealthCheck even though the faulty MachineHealthCheck no longer exists.
Expected results:
mapi_machinehealthcheck_short_circuit should properly reconcile its state and drop series for MachineHealthChecks that have been removed at the OpenShift Container Platform level.
Additional info:
It kind of looks like similar to the issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=2013528 respectively https://bugzilla.redhat.com/show_bug.cgi?id=2047702 (although https://bugzilla.redhat.com/show_bug.cgi?id=2047702 may not be super relevant)
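To see which MachineHealthCheck names still have a (possibly stale) short-circuit series, a query along these lines can be run against the cluster Prometheus (a sketch; the label names are taken from the alert output above):
~~~
# MHC names currently reporting a tripped short circuit:
count by (name, exported_namespace) (mapi_machinehealthcheck_short_circuit == 1)
~~~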
Description of problem:
To address: 'Static Pod is managed but errored' err='managed container xxx does not have Resource.Requests'
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-1604. The following is the description of the original issue:
—
Description of problem:
When viewing a resource that exists for multiple clusters, the data may be from the wrong cluster for a short time after switching clusters using the multicluster switcher.
Version-Release number of selected component (if applicable):
4.10.6
How reproducible:
Always
Steps to Reproduce:
1. Install RHACM 2.5 on OCP 4.10 and enable the FeatureGate to get multicluster switching
2. From the local-cluster perspective, view a resource that would exist on all clusters, like /k8s/cluster/config.openshift.io~v1~Infrastructure/cluster/yaml
3. Switch to a different cluster in the cluster switcher
Actual results:
Content for the resource may start out correct, but then switches back to the local-cluster version before switching to the correct cluster several moments later.
Expected results:
Content should always be shown from the selected cluster.
Additional info:
Migrated from bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2075657
Description of problem:
For Hardware Backed Management Ports (e.g. Virtual functions), the Egress IP Health Check Feature will error out with: "unable to start health checking server: no mgmt ip"
Version-Release number of selected component (if applicable):
OVN-Kubernetes 4.12.0
How reproducible:
Always
Steps to Reproduce:
1. Load OVN-Kubernetes 4.12.0 on MLX BlueField-2
2. If in NIC mode, the patches https://github.com/ovn-org/ovn-kubernetes/pull/3160 and https://github.com/ovn-org/ovn-kubernetes/pull/3251 are needed
3. If in DPU mode, those patches are optional
4. Set the OVNKUBE_NODE_MGMT_PORT_NETDEV environment variable to point to the Virtual Function
Actual results:
Error in ovnkube-node: "unable to start health checking server: no mgmt ip", and the ovnkube-node container crashes. The Egress IP Health Check should be compatible with VFs as the management port.
Expected results:
No Error.
Additional info:
A simple workaround is to not return an error:

go-controller/pkg/node/node.go
@@ -660,7 +660,8 @@ func (n *OvnNode) startEgressIPHealthCheckingServer(wg *sync.WaitGroup, mgmtPort
             return fmt.Errorf("failed start health checking server due to unsettled IPv6: %w", err)
         }
     } else {
-        return fmt.Errorf("unable to start health checking server: no mgmt ip")
+        klog.Infof("Unable to start Egress IP health checking server: no mgmt ip")
+        return nil
     }
This is a clone of issue OCPBUGS-6777. The following is the description of the original issue:
—
Description of problem:
"create manifests" without an existing "install-config.yaml" missing 4 YAML files in "<install dir>/openshift" which leads to "create cluster" failure
Version-Release number of selected component (if applicable):
$ ./openshift-install version
./openshift-install 4.13.0-0.nightly-2023-01-27-165107
built from commit fca41376abe654a9124f0450727579bb85591438
release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. "create manifests" 2. "create cluster"
Actual results:
1. After "create manifests", in "<install dir>/openshift", there're 4 YAML files missing, including "99_cloud-creds-secret.yaml", "99_kubeadmin-password-secret.yaml", "99_role-cloud-creds-secret-reader.yaml", and "openshift-install-manifests.yaml", comparing with "create manifests" with an existing "install-config.yaml". 2. The installation failed without any worker nodes due to error getting credentials secret "gcp-cloud-credentials" in namespace "openshift-machine-api".
Expected results:
1. "create manifests" without an existing "install-config.yaml" should generate the same set of YAML files as "create manifests" with an existing "install-config.yaml". 2. Then the subsequent "create cluster" should succeed.
Additional info:
The working scenario: "create manifests" with an existing "install-config.yaml":

$ ./openshift-install version
./openshift-install 4.13.0-0.nightly-2023-01-27-165107
built from commit fca41376abe654a9124f0450727579bb85591438
release image registry.ci.openshift.org/ocp/release@sha256:29b1bc2026e843d7a2d50844f6f31aa0d7eeb0df540c7d9339589ad889eee529
release architecture amd64
$
$ mkdir test30
$ cp install-config.yaml test30
$ yq-3.3.0 r test30/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
$ yq-3.3.0 r test30/install-config.yaml metadata
creationTimestamp: null
name: jiwei-0130a
$ ./openshift-install create manifests --dir test30
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Consuming Install Config from target directory
WARNING Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated
INFO Manifests created in: test30/manifests and test30/openshift
$
$ tree test30
test30
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_cloud-creds-secret.yaml
    ├── 99_kubeadmin-password-secret.yaml
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-machineset-0.yaml
    ├── 99_openshift-cluster-api_worker-machineset-1.yaml
    ├── 99_openshift-cluster-api_worker-machineset-2.yaml
    ├── 99_openshift-cluster-api_worker-machineset-3.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    ├── 99_openshift-machineconfig_99-worker-ssh.yaml
    ├── 99_role-cloud-creds-secret-reader.yaml
    └── openshift-install-manifests.yaml

2 directories, 31 files
$

The problem scenario: "create manifests" without an existing "install-config.yaml", and then "create cluster":

$ ./openshift-install create manifests --dir test31
? SSH Public Key /home/fedora/.ssh/openshift-qe.pub
? Platform gcp
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
? Project ID OpenShift QE (openshift-qe)
? Region us-central1
? Base Domain qe.gcp.devcluster.openshift.com
? Cluster Name jiwei-0130b
? Pull Secret [? for help] *******
INFO Manifests created in: test31/manifests and test31/openshift
$
$ tree test31
test31
├── manifests
│   ├── cloud-controller-uid-config.yml
│   ├── cloud-provider-config.yaml
│   ├── cluster-config.yaml
│   ├── cluster-dns-02-config.yml
│   ├── cluster-infrastructure-02-config.yml
│   ├── cluster-ingress-02-config.yml
│   ├── cluster-network-01-crd.yml
│   ├── cluster-network-02-config.yml
│   ├── cluster-proxy-01-config.yaml
│   ├── cluster-scheduler-02-config.yml
│   ├── cvo-overrides.yaml
│   ├── kube-cloud-config.yaml
│   ├── kube-system-configmap-root-ca.yaml
│   ├── machine-config-server-tls-secret.yaml
│   └── openshift-config-secret-pull-secret.yaml
└── openshift
    ├── 99_openshift-cluster-api_master-machines-0.yaml
    ├── 99_openshift-cluster-api_master-machines-1.yaml
    ├── 99_openshift-cluster-api_master-machines-2.yaml
    ├── 99_openshift-cluster-api_master-user-data-secret.yaml
    ├── 99_openshift-cluster-api_worker-machineset-0.yaml
    ├── 99_openshift-cluster-api_worker-machineset-1.yaml
    ├── 99_openshift-cluster-api_worker-machineset-2.yaml
    ├── 99_openshift-cluster-api_worker-machineset-3.yaml
    ├── 99_openshift-cluster-api_worker-user-data-secret.yaml
    ├── 99_openshift-machine-api_master-control-plane-machine-set.yaml
    ├── 99_openshift-machineconfig_99-master-ssh.yaml
    └── 99_openshift-machineconfig_99-worker-ssh.yaml

2 directories, 27 files
$
$ ./openshift-install create cluster --dir test31
INFO Consuming Common Manifests from target directory
INFO Consuming Openshift Manifests from target directory
INFO Consuming Master Machines from target directory
INFO Consuming Worker Machines from target directory
INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json"
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 4:17PM) for the Kubernetes API at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443...
INFO API v1.25.2+7dab57f up
INFO Waiting up to 30m0s (until 4:28PM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 4:59PM) for the cluster at https://api.jiwei-0130b.qe.gcp.devcluster.openshift.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::OAuthClientsController_SyncError::OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_SyncError::OAuthServerServiceEndpointAccessibleController_SyncError::OAuthServerServiceEndpointsEndpointAccessibleController_SyncError::WellKnownReadyController_SyncError: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
ERROR OAuthClientsControllerDegraded: no ingress for host oauth-openshift.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded: waiting for the oauth-openshift route to contain an admitted ingress: no admitted ingress for route oauth-openshift in namespace openshift-authentication
ERROR OAuthServerDeploymentDegraded:
ERROR OAuthServerRouteEndpointAccessibleControllerDegraded: route "openshift-authentication/oauth-openshift": status does not have a valid host address
ERROR OAuthServerServiceEndpointAccessibleControllerDegraded: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
ERROR WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
ERROR Cluster operator authentication Available is False with OAuthServerDeployment_PreconditionNotFulfilled::OAuthServerRouteEndpointAccessibleController_ResourceNotFound::OAuthServerServiceEndpointAccessibleController_EndpointUnavailable::OAuthServerServiceEndpointsEndpointAccessibleController_ResourceNotFound::ReadyIngressNodes_NoReadyIngressNodes::WellKnown_NotReady: OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found
ERROR OAuthServerServiceEndpointAccessibleControllerAvailable: Get "https://172.30.99.43:443/healthz": dial tcp 172.30.99.43:443: connect: connection refused
ERROR OAuthServerServiceEndpointsEndpointAccessibleControllerAvailable: endpoints "oauth-openshift" not found
ERROR ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
ERROR WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
ERROR Cluster operator cloud-credential Degraded is True with CredentialsFailing: 7 of 7 credentials requests are failing to sync.
INFO Cluster operator cloud-credential Progressing is True with Reconciling: 0 of 7 credentials requests provisioned, 7 reporting errors.
ERROR Cluster operator cluster-autoscaler Degraded is True with MissingDependency: machine-api not ready
ERROR Cluster operator console Degraded is True with DefaultRouteSync_FailedAdmitDefaultRoute::RouteHealth_RouteNotAdmitted::SyncLoopRefresh_FailedIngress: DefaultRouteSyncDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
ERROR RouteHealthDegraded: console route is not admitted
ERROR SyncLoopRefreshDegraded: no ingress for host console-openshift-console.apps.jiwei-0130b.qe.gcp.devcluster.openshift.com in route console in namespace openshift-console
ERROR Cluster operator console Available is False with RouteHealth_RouteNotAdmitted: RouteHealthAvailable: console route is not admitted
ERROR Cluster operator control-plane-machine-set Available is False with UnavailableReplicas: Missing 3 available replica(s)
ERROR Cluster operator control-plane-machine-set Degraded is True with NoReadyMachines: No ready control plane machines found
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
ERROR Cluster operator image-registry Available is False with DeploymentNotFound: Available: The deployment does not exist
ERROR NodeCADaemonAvailable: The daemon set node-ca has available replicas
ERROR ImagePrunerAvailable: Pruner CronJob has been created
INFO Cluster operator image-registry Progressing is True with Error: Progressing: Unable to apply resources: unable to sync storage configuration: unable to get cluster minted credentials "openshift-image-registry/installer-cloud-credentials": secret "installer-cloud-credentials" not found
INFO NodeCADaemonProgressing: The daemon set node-ca is deployed
ERROR Cluster operator image-registry Degraded is True with Unavailable: Degraded: The deployment does not exist
ERROR Cluster operator ingress Available is False with IngressUnavailable: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DNSReady=False (NoZones: The record isn't present in any zones.)
INFO Cluster operator ingress Progressing is True with Reconciling: ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 2 updated replica(s) are available...
INFO ).
INFO Not all ingress controllers are available.
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod "router-default-c68b5786c-prk7x" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Pod "router-default-c68b5786c-ssrv7" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.), DNSReady=False (NoZones: The record isn't present in any zones.), CanaryChecksSucceeding=Unknown (CanaryRouteNotAdmitted: Canary route is not admitted by the default ingress controller)
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights SCAAvailable is True with Updated: SCA certs successfully updated in the etc-pki-entitlement secret
ERROR Cluster operator kube-controller-manager Degraded is True with GarbageCollector_Error: GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp: lookup thanos-querier.openshift-monitoring.svc on 172.30.0.10:53: no such host
INFO Cluster operator machine-api Progressing is True with SyncingResources: Progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107
ERROR Cluster operator machine-api Degraded is True with SyncingFailed: Failed when progressing towards operator: 4.13.0-0.nightly-2023-01-27-165107 because minimum worker replica count (2) not yet met: current running replicas 0, waiting for [jiwei-0130b-25fcm-worker-a-j6t42 jiwei-0130b-25fcm-worker-b-dpw9b jiwei-0130b-25fcm-worker-c-9cdms]
ERROR Cluster operator machine-api Available is False with Initializing: Operator is initializing
ERROR Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 2 unavailable replicas
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
INFO Cluster operator network ManagementStateDegraded is False with :
INFO Cluster operator network Progressing is True with Deploying: Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
INFO Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is waiting for other operators to become ready
INFO Cluster operator storage Progressing is True with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRProgressing: GCPPDDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
ERROR Cluster operator storage Available is False with GCPPDCSIDriverOperatorCR_GCPPDDriverControllerServiceController_Deploying: GCPPDCSIDriverOperatorCRAvailable: GCPPDDriverControllerServiceControllerAvailable: Waiting for Deployment
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
ERROR failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, image-registry, ingress, machine-api, monitoring, storage are not available
$ export KUBECONFIG=test31/auth/kubeconfig
$ ./oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False True 74m Unable to apply 4.13.0-0.nightly-2023-01-27-165107: some cluster operators are not available
$ ./oc get nodes
NAME STATUS ROLES AGE VERSION
jiwei-0130b-25fcm-master-0.c.openshift-qe.internal Ready control-plane,master 69m v1.25.2+7dab57f
jiwei-0130b-25fcm-master-1.c.openshift-qe.internal Ready control-plane,master 69m v1.25.2+7dab57f
jiwei-0130b-25fcm-master-2.c.openshift-qe.internal Ready control-plane,master 69m v1.25.2+7dab57f
$ ./oc get machines -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
jiwei-0130b-25fcm-master-0 73m
jiwei-0130b-25fcm-master-1 73m
jiwei-0130b-25fcm-master-2 73m
jiwei-0130b-25fcm-worker-a-j6t42 65m
jiwei-0130b-25fcm-worker-b-dpw9b 65m
jiwei-0130b-25fcm-worker-c-9cdms 65m
$ ./oc get controlplanemachinesets -n openshift-machine-api
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 Active 74m
$
Please see the attached ".openshift_install.log", install-config.yaml snippet, and more "oc" commands outputs.
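Until this is fixed, a likely workaround consistent with the working scenario above (a sketch, not an official recommendation) is to generate the install-config explicitly first, so that "create manifests" consumes it:
~~~
./openshift-install create install-config --dir test31   # answer the interactive prompts once
./openshift-install create manifests --dir test31        # now consumes the generated install-config.yaml
~~~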
Probably for: 1h or some such; I don't think it needs to go off immediately. But in-cluster admins and folks monitoring submitted Insights should have a way to figure out that the cluster is trying and failing to submit Telemetry. The alert should not fire when Telemetry submission has been explicitly disabled.
There is an existing alert, PrometheusRemoteWriteBehind, in a similar space, but as of today the Telemetry submissions happen via telemeter-client, due to concerns about the load of submitting via remote-write.
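A minimal sketch of what such an alerting rule could look like; the expression is hypothetical (whatever failure metric telemeter-client actually exposes must be substituted), and only the PrometheusRule shape and the 1h hold are the point:
~~~
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: telemetry-submission        # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: telemetry
    rules:
    - alert: TelemetrySubmissionFailing   # hypothetical alert name
      # Hypothetical expression: sustained failures of telemeter-client submissions.
      expr: rate(telemeter_client_failed_requests_total[15m]) > 0
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Telemetry submissions have been failing for more than an hour.
~~~
Suppressing the alert when Telemetry is explicitly disabled would be handled either in the rule expression or by only deploying the rule when submission is configured.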
Description of problem:
Even "create install-config" fails in the epic's scenario.
Version-Release number of selected component (if applicable):
$ ./openshift-install version
./openshift-install 4.12.0-0.nightly-2022-09-28-204419
built from commit 9eb0224926982cdd6cae53b872326292133e532d
release image registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc
release architecture amd64
How reproducible:
Always
Steps to Reproduce:
1. Create a VPC network, subnets, and a firewall rule to allow SSH access to the bastion host
2. Create the bastion host, setting a valid service account with scopes of "https://www.googleapis.com/auth/cloud-platform"
3. scp the pull secret to the bastion host
4. ssh to the bastion host (subsequent steps are run on the bastion host, unless stated otherwise)
5. Get "oc", e.g. curl https://mirror2.openshift.com/pub/openshift-v4/clients/ocp/4.9.9/openshift-client-linux-4.9.9.tar.gz -o openshift-client-linux-4.9.9.tar.gz; tar zxvf openshift-client-linux-4.9.9.tar.gz
6. Obtain the installation program
7. Try "create install-config" for platform "gcp"
Actual results:
[cloud-user@jiwei-0930-02-rhel8-mirror ~]$ ./openshift-install create install-config --dir work
? SSH Public Key /home/cloud-user/.ssh/id_rsa.pub
? Platform gcp
INFO Credentials loaded from gcloud CLI defaults
? Project ID OpenShift QE Shared VPC (openshift-qe-shared-vpc)
? Region us-west1
? Base Domain qe-shared-vpc.qe.gcp.devcluster.openshift.com
? Cluster Name jiwei-0930-03
? Pull Secret [? for help] ******
FATAL failed to fetch Install Config: failed to generate asset "Install Config": credentialsMode: Forbidden: environmental authentication is only supported with Manual credentials mode
[cloud-user@jiwei-0930-02-rhel8-mirror ~]$
Expected results:
"create install-config" should succeed.
Additional info:
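Per the FATAL message, one way to proceed on a bastion that authenticates via the attached service account (a sketch, not a validated fix for this bug) is to pre-create the install-config with Manual credentials mode:
~~~
# install-config.yaml fragment (sketch)
apiVersion: v1
credentialsMode: Manual
platform:
  gcp:
    projectID: openshift-qe-shared-vpc
    region: us-west1
~~~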
Description of problem:
openshift-apiserver, openshift-oauth-apiserver and kube-apiserver pods cannot validate the certificate when trying to reach etcd, reporting certificate validation errors:

}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
W1018 11:36:43.523673 15 logging.go:59] [core] [Channel #186 SubChannel #187] grpc: addrConn.createTransport failed to connect to {
  "Addr": "[2620:52:0:198::10]:2379",
  "ServerName": "2620:52:0:198::10",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-18-041406
How reproducible:
100%
Steps to Reproduce:
1. Deploy SNO with single stack IPv6 via ZTP procedure
Actual results:
Deployment times out and some of the operators aren't deployed successfully.

NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.0-0.nightly-2022-10-18-041406 False False True 124m APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
baremetal 4.12.0-0.nightly-2022-10-18-041406 True False False 112m
cloud-controller-manager 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
cloud-credential 4.12.0-0.nightly-2022-10-18-041406 True False False 115m
cluster-autoscaler 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
config-operator 4.12.0-0.nightly-2022-10-18-041406 True False False 124m
console
control-plane-machine-set 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
csi-snapshot-controller 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
dns 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
etcd 4.12.0-0.nightly-2022-10-18-041406 True False True 121m ClusterMemberControllerDegraded: could not get list of unhealthy members: giving up getting a cached client after 3 tries
image-registry 4.12.0-0.nightly-2022-10-18-041406 False True True 104m Available: The registry is removed...
ingress 4.12.0-0.nightly-2022-10-18-041406 True True True 111m The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
insights 4.12.0-0.nightly-2022-10-18-041406 True False False 118s
kube-apiserver 4.12.0-0.nightly-2022-10-18-041406 True False False 102m
kube-controller-manager 4.12.0-0.nightly-2022-10-18-041406 True False True 107m GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp [fd02::3c5f]:9091: connect: connection refused
kube-scheduler 4.12.0-0.nightly-2022-10-18-041406 True False False 107m
kube-storage-version-migrator 4.12.0-0.nightly-2022-10-18-041406 True False False 117m
machine-api 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
machine-approver 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
machine-config 4.12.0-0.nightly-2022-10-18-041406 True False False 115m
marketplace 4.12.0-0.nightly-2022-10-18-041406 True False False 116m
monitoring False True True 98m deleting Thanos Ruler Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, deleting UserWorkload federate Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, reconciling Alertmanager Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io alertmanager-main), reconciling Thanos Querier Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus API Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io prometheus-k8s), prometheuses.monitoring.coreos.com "k8s" not found
network 4.12.0-0.nightly-2022-10-18-041406 True False False 124m
node-tuning 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
openshift-apiserver 4.12.0-0.nightly-2022-10-18-041406 True False False 104m
openshift-controller-manager 4.12.0-0.nightly-2022-10-18-041406 True False False 107m
openshift-samples False True False 103m The error the server was unable to return a response in the time allotted, but may still be processing the request (get imagestreams.image.openshift.io) during openshift namespace cleanup has left the samples in an unknown state
operator-lifecycle-manager 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
operator-lifecycle-manager-catalog 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
operator-lifecycle-manager-packageserver 4.12.0-0.nightly-2022-10-18-041406 True False False 106m
service-ca 4.12.0-0.nightly-2022-10-18-041406 True False False 124m
storage 4.12.0-0.nightly-2022-10-18-041406 True False False 111m
Expected results:
Deployment succeeds without issues.
Additional info:
I was unable to run must-gather, so I am attaching the pod logs copied from the host file system.
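To confirm which SANs the etcd serving certificate actually contains, a generic check such as the following can be run from a host with reachability (the address is the one from the error above):
~~~
# Print the certificate served by etcd and look at its Subject Alternative Names:
echo | openssl s_client -connect "[2620:52:0:198::10]:2379" 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
~~~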
Description of problem:
The reconciler removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources whether or not the pod is alive.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create pods and check the overlappingrangeipreservations.whereabouts.cni.cncf.io resources:
$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
NAMESPACE          NAME                      AGE
openshift-multus   2001-1b70-820d-4b04--13   4m53s
openshift-multus   2001-1b70-820d-4b05--13   4m49s
2. Verify that the ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources when it runs:
$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        14m             4d13h
$ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A
No resources found
$ oc get cronjob -n openshift-multus
NAME            SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
ip-reconciler   */15 * * * *   False     0        5s              4d13h
Actual results:
The ip-reconciler cronjob removes the overlappingrangeipreservations.whereabouts.cni.cncf.io resources for each created pod, so the "overlapping ranges" are not used.
Expected results:
The overlappingrangeipreservations.whereabouts.cni.cncf.io resources should not be removed while a pod is still using an IP in the overlapping ranges.
Additional info:
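To judge whether a reservation is stale before removing it, the pod reference recorded on each resource can be inspected (a sketch; the spec.podref field name is per the whereabouts CRD and should be verified on the cluster):
~~~
oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -A \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,PODREF:.spec.podref
~~~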
This is a clone of issue OCPBUGS-5306. The following is the description of the original issue:
—
Description of problem:
One old machine is stuck in Deleting, and many cluster operators become degraded, when doing master replacement on a cluster with the OVN network.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-01-02-175114
How reproducible:
Always, after several attempts
Steps to Reproduce:
1.Install a cluster liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.12.0-0.nightly-2023-01-02-175114 True False 30m Cluster version is 4.12.0-0.nightly-2023-01-02-175114 liuhuali@Lius-MacBook-Pro huali-test % oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication 4.12.0-0.nightly-2023-01-02-175114 True False False 33m baremetal 4.12.0-0.nightly-2023-01-02-175114 True False False 80m cloud-controller-manager 4.12.0-0.nightly-2023-01-02-175114 True False False 84m cloud-credential 4.12.0-0.nightly-2023-01-02-175114 True False False 80m cluster-api 4.12.0-0.nightly-2023-01-02-175114 True False False 81m cluster-autoscaler 4.12.0-0.nightly-2023-01-02-175114 True False False 80m config-operator 4.12.0-0.nightly-2023-01-02-175114 True False False 81m console 4.12.0-0.nightly-2023-01-02-175114 True False False 33m control-plane-machine-set 4.12.0-0.nightly-2023-01-02-175114 True False False 79m csi-snapshot-controller 4.12.0-0.nightly-2023-01-02-175114 True False False 81m dns 4.12.0-0.nightly-2023-01-02-175114 True False False 80m etcd 4.12.0-0.nightly-2023-01-02-175114 True False False 79m image-registry 4.12.0-0.nightly-2023-01-02-175114 True False False 74m ingress 4.12.0-0.nightly-2023-01-02-175114 True False False 74m insights 4.12.0-0.nightly-2023-01-02-175114 True False False 21m kube-apiserver 4.12.0-0.nightly-2023-01-02-175114 True False False 77m kube-controller-manager 4.12.0-0.nightly-2023-01-02-175114 True False False 77m kube-scheduler 4.12.0-0.nightly-2023-01-02-175114 True False False 77m kube-storage-version-migrator 4.12.0-0.nightly-2023-01-02-175114 True False False 81m machine-api 4.12.0-0.nightly-2023-01-02-175114 True False False 75m machine-approver 4.12.0-0.nightly-2023-01-02-175114 True False False 80m machine-config 4.12.0-0.nightly-2023-01-02-175114 True False False 74m marketplace 4.12.0-0.nightly-2023-01-02-175114 True False False 80m monitoring 4.12.0-0.nightly-2023-01-02-175114 True False False 72m network 4.12.0-0.nightly-2023-01-02-175114 True False False 83m node-tuning 4.12.0-0.nightly-2023-01-02-175114 True False False 80m openshift-apiserver 4.12.0-0.nightly-2023-01-02-175114 True False False 75m openshift-controller-manager 4.12.0-0.nightly-2023-01-02-175114 True False False 76m openshift-samples 4.12.0-0.nightly-2023-01-02-175114 True False False 22m operator-lifecycle-manager 4.12.0-0.nightly-2023-01-02-175114 True False False 81m operator-lifecycle-manager-catalog 4.12.0-0.nightly-2023-01-02-175114 True False False 81m operator-lifecycle-manager-packageserver 4.12.0-0.nightly-2023-01-02-175114 True False False 75m platform-operators-aggregated 4.12.0-0.nightly-2023-01-02-175114 True False False 74m service-ca 4.12.0-0.nightly-2023-01-02-175114 True False False 81m storage 4.12.0-0.nightly-2023-01-02-175114 True False False 74m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws4d2-fcks7-master-0 Running m6i.xlarge us-east-2 us-east-2a 85m huliu-aws4d2-fcks7-master-1 Running m6i.xlarge us-east-2 us-east-2b 85m huliu-aws4d2-fcks7-master-2 Running m6i.xlarge us-east-2 us-east-2a 85m huliu-aws4d2-fcks7-worker-us-east-2a-m279f Running m6i.xlarge us-east-2 us-east-2a 80m huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps Running m6i.xlarge us-east-2 us-east-2a 80m huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz Running m6i.xlarge us-east-2 us-east-2b 80m liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset NAME 
DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE cluster 3 3 3 3 Active 86m 2.Edit controlplanemachineset, change instanceType to another value to trigger RollingUpdate liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster controlplanemachineset.machine.openshift.io/cluster edited liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws4d2-fcks7-master-0 Running m6i.xlarge us-east-2 us-east-2a 86m huliu-aws4d2-fcks7-master-1 Running m6i.xlarge us-east-2 us-east-2b 86m huliu-aws4d2-fcks7-master-2 Running m6i.xlarge us-east-2 us-east-2a 86m huliu-aws4d2-fcks7-master-mbgz6-0 Provisioning m5.xlarge us-east-2 us-east-2a 5s huliu-aws4d2-fcks7-worker-us-east-2a-m279f Running m6i.xlarge us-east-2 us-east-2a 81m huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps Running m6i.xlarge us-east-2 us-east-2a 81m huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz Running m6i.xlarge us-east-2 us-east-2b 81m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws4d2-fcks7-master-0 Deleting m6i.xlarge us-east-2 us-east-2a 92m huliu-aws4d2-fcks7-master-1 Running m6i.xlarge us-east-2 us-east-2b 92m huliu-aws4d2-fcks7-master-2 Running m6i.xlarge us-east-2 us-east-2a 92m huliu-aws4d2-fcks7-master-mbgz6-0 Running m5.xlarge us-east-2 us-east-2a 5m36s huliu-aws4d2-fcks7-worker-us-east-2a-m279f Running m6i.xlarge us-east-2 us-east-2a 87m huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps Running m6i.xlarge us-east-2 us-east-2a 87m huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz Running m6i.xlarge us-east-2 us-east-2b 87m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws4d2-fcks7-master-1 Running m6i.xlarge us-east-2 us-east-2b 101m huliu-aws4d2-fcks7-master-2 Running m6i.xlarge us-east-2 us-east-2a 101m huliu-aws4d2-fcks7-master-mbgz6-0 Running m5.xlarge us-east-2 us-east-2a 15m huliu-aws4d2-fcks7-master-nbt9g-1 Provisioned m5.xlarge us-east-2 us-east-2b 3m1s huliu-aws4d2-fcks7-worker-us-east-2a-m279f Running m6i.xlarge us-east-2 us-east-2a 96m huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps Running m6i.xlarge us-east-2 us-east-2a 96m huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz Running m6i.xlarge us-east-2 us-east-2b 96m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws4d2-fcks7-master-1 Deleting m6i.xlarge us-east-2 us-east-2b 149m huliu-aws4d2-fcks7-master-2 Running m6i.xlarge us-east-2 us-east-2a 149m huliu-aws4d2-fcks7-master-mbgz6-0 Running m5.xlarge us-east-2 us-east-2a 62m huliu-aws4d2-fcks7-master-nbt9g-1 Running m5.xlarge us-east-2 us-east-2b 50m huliu-aws4d2-fcks7-worker-us-east-2a-m279f Running m6i.xlarge us-east-2 us-east-2a 144m huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps Running m6i.xlarge us-east-2 us-east-2a 144m huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz Running m6i.xlarge us-east-2 us-east-2b 144m liuhuali@Lius-MacBook-Pro huali-test % oc get machine NAME PHASE TYPE REGION ZONE AGE huliu-aws4d2-fcks7-master-1 Deleting m6i.xlarge us-east-2 us-east-2b 4h12m huliu-aws4d2-fcks7-master-2 Running m6i.xlarge us-east-2 us-east-2a 4h12m huliu-aws4d2-fcks7-master-mbgz6-0 Running m5.xlarge us-east-2 us-east-2a 166m huliu-aws4d2-fcks7-master-nbt9g-1 Running m5.xlarge us-east-2 us-east-2b 153m huliu-aws4d2-fcks7-worker-us-east-2a-m279f Running m6i.xlarge us-east-2 us-east-2a 4h7m huliu-aws4d2-fcks7-worker-us-east-2a-qg9ps Running m6i.xlarge us-east-2 us-east-2a 4h7m huliu-aws4d2-fcks7-worker-us-east-2b-ps6tz Running m6i.xlarge us-east-2 us-east-2b 4h7m 
3. master-1 is stuck in Deleting, many cluster operators become degraded, and many pods cannot get Running:

liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2023-01-02-175114   True        True          True       9s      APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver (2 containers are waiting in pending apiserver-7b65bbc76b-mxl99 pod)...
baremetal                                  4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m
cloud-controller-manager                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h11m
cloud-credential                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m
cluster-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m
cluster-autoscaler                         4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m
config-operator                            4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m
console                                    4.12.0-0.nightly-2023-01-02-175114   False       False         False      150m    RouteHealthAvailable: console route is not admitted
control-plane-machine-set                  4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h7m    Observed 1 replica(s) in need of update
csi-snapshot-controller                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h9m    CSISnapshotControllerProgressing: Waiting for Deployment to deploy pods...
dns                                        4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m
etcd                                       4.12.0-0.nightly-2023-01-02-175114   True        True          True       4h7m    GuardControllerDegraded: Missing operand on node ip-10-0-79-159.us-east-2.compute.internal...
image-registry                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m
ingress                                    4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m
insights                                   4.12.0-0.nightly-2023-01-02-175114   True        False         False      3h8m
kube-apiserver                             4.12.0-0.nightly-2023-01-02-175114   True        True          True       4h5m    GuardControllerDegraded: Missing operand on node ip-10-0-79-159.us-east-2.compute.internal
kube-controller-manager                    4.12.0-0.nightly-2023-01-02-175114   True        False         True       4h5m    GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.19.115:9091: i/o timeout
kube-scheduler                             4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h5m
kube-storage-version-migrator              4.12.0-0.nightly-2023-01-02-175114   True        False         False      162m
machine-api                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h3m
machine-approver                           4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m
machine-config                             4.12.0-0.nightly-2023-01-02-175114   False       False         True       139m    Cluster not available for [{operator 4.12.0-0.nightly-2023-01-02-175114}]: error during waitForDeploymentRollout: [timed out waiting for the condition, deployment machine-config-controller is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)]
marketplace                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h8m
monitoring                                 4.12.0-0.nightly-2023-01-02-175114   False       True          True       144m    reconciling Prometheus Operator Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator: got 1 unavailable replicas
network                                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h11m   DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 1 nodes)...
node-tuning                                4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h7m
openshift-apiserver                        4.12.0-0.nightly-2023-01-02-175114   False       True          False      151m    APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
openshift-controller-manager               4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h4m
openshift-samples                          4.12.0-0.nightly-2023-01-02-175114   True        False         False      3h10m
operator-lifecycle-manager                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2023-01-02-175114   True        False         False      2m44s
platform-operators-aggregated              4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h2m
service-ca                                 4.12.0-0.nightly-2023-01-02-175114   True        False         False      4h9m
storage                                    4.12.0-0.nightly-2023-01-02-175114   True        True          False      4h2m    AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test %

liuhuali@Lius-MacBook-Pro huali-test % oc get pod --all-namespaces|grep -v Running
NAMESPACE                               NAME                                                            READY   STATUS              RESTARTS      AGE
openshift-apiserver                     apiserver-5cbdf985f9-85z4t                                      0/2     Init:0/1            0             155m
openshift-authentication               oauth-openshift-5c46d6658b-lkbjj                                 0/1     Pending             0             156m
openshift-cloud-credential-operator    pod-identity-webhook-77bf7c646d-4rtn8                            0/1     ContainerCreating   0             156m
openshift-cluster-api                  capa-controller-manager-d484bc464-lhqbk                          0/1     ContainerCreating   0             156m
openshift-cluster-csi-drivers          aws-ebs-csi-driver-controller-5668745dcb-jc7fm                   0/11    ContainerCreating   0             156m
openshift-cluster-csi-drivers          aws-ebs-csi-driver-operator-5d6b9fbd77-827vs                     0/1     ContainerCreating   0             156m
openshift-cluster-csi-drivers          shared-resource-csi-driver-operator-866d897954-z77gz             0/1     ContainerCreating   0             156m
openshift-cluster-csi-drivers          shared-resource-csi-driver-webhook-d794748dc-kctkn               0/1     ContainerCreating   0             156m
openshift-cluster-samples-operator     cluster-samples-operator-754758b9d7-nbcc9                        0/2     ContainerCreating   0             156m
openshift-cluster-storage-operator     csi-snapshot-controller-6d9c448fdd-wdb7n                         0/1     ContainerCreating   0             156m
openshift-cluster-storage-operator     csi-snapshot-webhook-6966f555f8-cbdc7                            0/1     ContainerCreating   0             156m
openshift-console-operator             console-operator-7d8567876b-nxgpj                                0/2     ContainerCreating   0             156m
openshift-console                      console-855f66f4f8-q869k                                         0/1     ContainerCreating   0             156m
openshift-console                      downloads-7b645b6b98-7jqfw                                       0/1     ContainerCreating   0             156m
openshift-controller-manager           controller-manager-548c7f97fb-bl68p                              0/1     Pending             0             156m
openshift-etcd                         installer-13-ip-10-0-76-132.us-east-2.compute.internal           0/1     ContainerCreating   0             9m39s
openshift-etcd                         installer-3-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h13m
openshift-etcd                         installer-4-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h12m
openshift-etcd                         installer-5-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h7m
openshift-etcd                         installer-6-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h1m
openshift-etcd                         installer-8-ip-10-0-48-21.us-east-2.compute.internal             0/1     Completed           0             168m
openshift-etcd                         revision-pruner-10-ip-10-0-48-21.us-east-2.compute.internal      0/1     ContainerCreating   0             160m
openshift-etcd                         revision-pruner-10-ip-10-0-63-159.us-east-2.compute.internal     0/1     Completed           0             160m
openshift-etcd                         revision-pruner-11-ip-10-0-48-21.us-east-2.compute.internal      0/1     ContainerCreating   0             159m
openshift-etcd                         revision-pruner-11-ip-10-0-63-159.us-east-2.compute.internal     0/1     Completed           0             159m
openshift-etcd                         revision-pruner-11-ip-10-0-79-159.us-east-2.compute.internal     0/1     Completed           0             156m
openshift-etcd                         revision-pruner-12-ip-10-0-48-21.us-east-2.compute.internal      0/1     ContainerCreating   0             156m
openshift-etcd                         revision-pruner-12-ip-10-0-63-159.us-east-2.compute.internal     0/1     Completed           0             156m
openshift-etcd                         revision-pruner-12-ip-10-0-79-159.us-east-2.compute.internal     0/1     Completed           0             156m
openshift-etcd                         revision-pruner-13-ip-10-0-48-21.us-east-2.compute.internal      0/1     ContainerCreating   0             155m
openshift-etcd                         revision-pruner-13-ip-10-0-63-159.us-east-2.compute.internal     0/1     Completed           0             155m
openshift-etcd                         revision-pruner-13-ip-10-0-76-132.us-east-2.compute.internal     0/1     ContainerCreating   0             10m
openshift-etcd                         revision-pruner-13-ip-10-0-79-159.us-east-2.compute.internal     0/1     Completed           0             155m
openshift-etcd                         revision-pruner-6-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             169m
openshift-etcd                         revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             3h57m
openshift-etcd                         revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             168m
openshift-etcd                         revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             168m
openshift-etcd                         revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             168m
openshift-etcd                         revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             168m
openshift-etcd                         revision-pruner-9-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             166m
openshift-etcd                         revision-pruner-9-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             166m
openshift-kube-apiserver               installer-6-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h4m
openshift-kube-apiserver               installer-7-ip-10-0-48-21.us-east-2.compute.internal             0/1     Completed           0             168m
openshift-kube-apiserver               installer-9-ip-10-0-76-132.us-east-2.compute.internal            0/1     ContainerCreating   0             9m52s
openshift-kube-apiserver               revision-pruner-6-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             169m
openshift-kube-apiserver               revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             3h59m
openshift-kube-apiserver               revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             168m
openshift-kube-apiserver               revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             168m
openshift-kube-apiserver               revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             166m
openshift-kube-apiserver               revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             166m
openshift-kube-apiserver               revision-pruner-8-ip-10-0-79-159.us-east-2.compute.internal      0/1     Completed           0             156m
openshift-kube-apiserver               revision-pruner-9-ip-10-0-48-21.us-east-2.compute.internal       0/1     ContainerCreating   0             155m
openshift-kube-apiserver               revision-pruner-9-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             155m
openshift-kube-apiserver               revision-pruner-9-ip-10-0-76-132.us-east-2.compute.internal      0/1     ContainerCreating   0             9m54s
openshift-kube-apiserver               revision-pruner-9-ip-10-0-79-159.us-east-2.compute.internal      0/1     Completed           0             155m
openshift-kube-controller-manager      installer-6-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h11m
openshift-kube-controller-manager      installer-7-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h7m
openshift-kube-controller-manager      installer-8-ip-10-0-48-21.us-east-2.compute.internal             0/1     Completed           0             169m
openshift-kube-controller-manager      installer-8-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h4m
openshift-kube-controller-manager      installer-8-ip-10-0-79-159.us-east-2.compute.internal            0/1     Completed           0             156m
openshift-kube-controller-manager      revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             4h13m
openshift-kube-controller-manager      revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             4h10m
openshift-kube-controller-manager      revision-pruner-8-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             169m
openshift-kube-controller-manager      revision-pruner-8-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             4h5m
openshift-kube-controller-manager      revision-pruner-8-ip-10-0-76-132.us-east-2.compute.internal      0/1     ContainerCreating   0             4m36s
openshift-kube-controller-manager      revision-pruner-8-ip-10-0-79-159.us-east-2.compute.internal      0/1     Completed           0             156m
openshift-kube-scheduler               installer-6-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h11m
openshift-kube-scheduler               installer-7-ip-10-0-48-21.us-east-2.compute.internal             0/1     Completed           0             169m
openshift-kube-scheduler               installer-7-ip-10-0-63-159.us-east-2.compute.internal            0/1     Completed           0             4h10m
openshift-kube-scheduler               installer-7-ip-10-0-79-159.us-east-2.compute.internal            0/1     Completed           0             156m
openshift-kube-scheduler               revision-pruner-6-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             4h13m
openshift-kube-scheduler               revision-pruner-7-ip-10-0-48-21.us-east-2.compute.internal       0/1     Completed           0             169m
openshift-kube-scheduler               revision-pruner-7-ip-10-0-63-159.us-east-2.compute.internal      0/1     Completed           0             4h10m
openshift-kube-scheduler               revision-pruner-7-ip-10-0-76-132.us-east-2.compute.internal      0/1     ContainerCreating   0             4m36s
openshift-kube-scheduler               revision-pruner-7-ip-10-0-79-159.us-east-2.compute.internal      0/1     Completed           0             156m
openshift-machine-config-operator      machine-config-controller-55b4d497b6-p89lb                       0/2     ContainerCreating   0             156m
openshift-marketplace                  qe-app-registry-w8gnc                                            0/1     ContainerCreating   0             148m
openshift-monitoring                   prometheus-operator-776bd79f6d-vz7q5                             0/2     ContainerCreating   0             156m
openshift-multus                       multus-admission-controller-5f88d77b65-nzmj5                     0/2     ContainerCreating   0             156m
openshift-oauth-apiserver              apiserver-7b65bbc76b-mxl99                                       0/1     Init:0/1            0             154m
openshift-operator-lifecycle-manager   collect-profiles-27879975-fpvzk                                  0/1     Completed           0             3h21m
openshift-operator-lifecycle-manager   collect-profiles-27879990-86rk8                                  0/1     Completed           0             3h6m
openshift-operator-lifecycle-manager   collect-profiles-27880005-bscc4                                  0/1     Completed           0             171m
openshift-operator-lifecycle-manager   collect-profiles-27880170-s8cbj                                  0/1     ContainerCreating   0             4m37s
openshift-operator-lifecycle-manager   packageserver-6f8f8f9d54-4r96h                                   0/1     ContainerCreating   0             156m
openshift-ovn-kubernetes               ovnkube-master-lr9pk                                             3/6     CrashLoopBackOff    23 (46s ago)  156m
openshift-route-controller-manager     route-controller-manager-747bf8684f-5vhwx                        0/1     Pending             0             156m
liuhuali@Lius-MacBook-Pro huali-test %
Actual results:
RollingUpdate cannot complete successfully
Expected results:
RollingUpdate should complete successfully
Additional info:
Must gather: https://drive.google.com/file/d/1bvE1XUuZKLBGmq7OTXNVCNcFZkqbarab/view?usp=sharing
Must gather of another cluster that hit the same issue (also template ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-techpreview-ci, with OVN network): https://drive.google.com/file/d/1CqAJlqk2wgnEuMo3lLaObk4Nbxi82y_A/view?usp=sharing
Must gather of another cluster that hit the same issue (template ipi-on-aws/versioned-installer-private_cluster-sts-usgov-ci, with OVN network): https://drive.google.com/file/d/1tnKbeqJ18SCAlJkS80Rji3qMu3nvN_O8/view?usp=sharing
It seems the template ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-techpreview-ci with OVN network can often hit this issue.
This is a clone of issue OCPBUGS-3228. The following is the description of the original issue:
—
While starting a PipelineRun from the UI and providing values in the "Start Pipeline" dialog, the IBM Power customer (Deepak Shetty from IBM) tried creating credentials under "Advanced options" with the "Image Registry Credentials" authentication type. When the customer verified the credentials from the Secrets tab (in Workloads), the secret was found in a broken state. A screenshot of the broken secret is attached.
The issue has been observed on OCP 4.8, OCP 4.9, and OCP 4.10.
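For reference, a well-formed secret of this authentication type normally has the standard Kubernetes dockerconfigjson shape (a hypothetical example for comparison, not the customer's actual object):

apiVersion: v1
kind: Secret
metadata:
  name: my-registry-credentials
type: kubernetes.io/dockerconfigjson
data:
  # base64 encoding of {"auths":{"registry.example.com":{"auth":"<base64 user:password>"}}}
  .dockerconfigjson: <base64-encoded .docker/config.json>

Comparing the broken secret in the screenshot against this shape should show which field the "Start Pipeline" dialog is writing incorrectly.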
This is a clone of issue OCPBUGS-7617. The following is the description of the original issue:
—
Description of problem:
Azure Disk volume is taking time to attach/detach
Version-Release number of selected component (if applicable):
Openshift ARO 4.10.30
How reproducible:
While scaling a StatefulSet down and up, the pod takes a long time to attach and detach its volume from the nodes.
I reviewed the must-gather and test output and will share my findings in the comments.
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
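One generic way to watch attach/detach timing while reproducing the scale down/up (standard Kubernetes tooling, suggested here rather than taken from the original report):

$ oc get volumeattachments -w          # watch Azure Disk attach/detach progress
$ oc describe pod <statefulset-pod>    # events show AttachVolume/DetachVolume timing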
Name: DNS
Description: Please change the "DNS" component to be a subcomponent "DNS" of the "Networking" component.
Component: change to "Networking".
Subcomponent: change to "DNS".
Existing fields (default assignee, default QA contact, default CC email list, etc.) should remain the same as they currently are.
Default Assignee: aos-network-edge-staff@bot.bugzilla.redhat.com
Default QA Contact: hongli@redhat.com
Default CC List: aos-network-edge-staff@bot.bugzilla.redhat.com
Additional Notes:
I filled in "Default CC email list" because the form validation would not permit me to omit it. However, it can be left empty in Bugzilla (it is currently empty).
If possible, we would like this change to be done prior to the Bugzilla-to-Jira migration to avoid the need to make the change after the migration.
This is a clone of issue OCPBUGS-3283. The following is the description of the original issue:
—
Description of problem:
We discovered that we are shipping unnecessary RBAC in https://coreos.slack.com/archives/CC3CZCQHM/p1667571136730989.
This RBAC was only used in 4.2 and 4.3 for
and we should remove it.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-5182. The following is the description of the original issue:
—
Description of problem:
Deploy an IPI cluster on Azure cloud, with region set to westeurope and vm size set to EC96iads_v5 or EC96ias_v5. Installation fails with the error below:

12-15 11:47:03.429 level=error msg=Error: creating Linux Virtual Machine: (Name "jima-15a-m6fzd-bootstrap" / Resource Group "jima-15a-m6fzd-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The VM size 'Standard_EC96iads_v5' is not supported for creation of VMs and Virtual Machine Scale Set with '<NULL>' security type."

Similar to https://bugzilla.redhat.com/show_bug.cgi?id=2055247. From the Azure portal, we can see that both VM sizes EC96iads_v5 and EC96ias_v5 are confidential compute types. We might need to do a similar process for them as was done in bug 2055247.
Version-Release number of selected component (if applicable):
4.12 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Prepare the install-config.yaml file, setting region to westeurope and vm size to EC96iads_v5 or EC96ias_v5
2. Deploy an IPI Azure cluster
3.
Actual results:
Installation fails with the error shown in the description.
Expected results:
The installer should exit during validation and show the expected error message.
Additional info:
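A rough sketch of the kind of validation being requested (hypothetical Go; the function, parameters, and capability flags are invented for illustration and are not the installer's real API):

package validation

import "fmt"

// validateAzureVMSize fails fast when the chosen VM size only supports
// confidential compute but no security type has been configured, instead of
// letting the Azure API reject the VM during provisioning.
func validateAzureVMSize(vmSize string, confidentialOnly bool, securityType string) error {
        if confidentialOnly && securityType == "" {
                return fmt.Errorf("VM size %s only supports confidential compute; configure a security type or choose another size", vmSize)
        }
        return nil
}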
Description of problem:
After installing and uninstalling some Helm charts, Helm charts are now broken on all our releases. The issue is solved in 4.13.
The frontend tries to load /api/helm/releases?ns=christoph and the backend crashes with the error below.
Tl;dr:
It crashes here in the helm lib: https://github.com/openshift/console/blob/release-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go#L66
The missing out-of-bounds check was added on master (https://github.com/openshift/console/blob/master/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go#L66) as part of the Helm bump https://github.com/openshift/console/pull/12246.
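For context, a minimal sketch of the failing pattern and the fix (simplified from the linked helm code, not a verbatim copy):

package driver

import "bytes"

// magicGzip is the gzip magic header that decodeRelease checks after
// base64-decoding the release payload stored in the Secret.
var magicGzip = []byte{0x1f, 0x8b, 0x08}

func looksGzipped(b []byte) bool {
        // 4.12 vendored code: bytes.Equal(b[0:3], magicGzip) panics with
        // "slice bounds out of range [:3] with capacity 0" when the decoded
        // payload is empty.
        // master adds the length guard before slicing:
        return len(b) > 3 && bytes.Equal(b[0:3], magicGzip)
}

An empty release payload in a Secret (as in the reproducer at the end) therefore crashes every /api/helm/releases request on 4.8-4.12.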
2023/02/15 13:09:09 http: panic serving [::1]:43264: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3291 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc0004dfaa0})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc000776930?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc000b2ff80, 0xc0004bbe60)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc0005fb800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc00086d180}, 0x7fea2c6e5900?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc00086d180?}, 0x7fea56daf108?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc0009b8170?, {0x351ae00?, 0xc00086d180?}, 0xc000c5b9f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc00086d180}, 0xc000248800)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc00086d180}, 0x7fea2c5c8248?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc0009ed667?, {0x351ae00?, 0xc00086d180?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc001048120?}, {0x351ae00, 0xc00086d180}, 0xc000248800)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0007580a0, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
2023/02/15 13:09:09 http: panic serving [::1]:43256: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3290 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc000273440})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc0004dc8a0?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc000de8e88, 0xc0011cb400)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc00068d800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc000b60b60}, 0x7fea2c47e700?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc000b60b60?}, 0x7fea56daf5b8?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc0003d72b0?, {0x351ae00?, 0xc000b60b60?}, 0xc000bcd9f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc000b60b60}, 0xc000cabd00)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc000b60b60}, 0x7fea2c6d9838?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc000344f47?, {0x351ae00?, 0xc000b60b60?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc001048180?}, {0x351ae00, 0xc000b60b60}, 0xc000cabd00)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc000758000, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
2023/02/15 13:09:09 http: panic serving [::1]:42956: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3261 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc000273740})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc0005f6000?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc00094a570, 0xc0003d79e0)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc00068d800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc000b48a80}, 0x7fea2c403300?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc000b48a80?}, 0x7fea56dafa68?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc0011cbb60?, {0x351ae00?, 0xc000b48a80?}, 0xc000ff59f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc000b48a80}, 0xc0002a3c00)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc000b48a80}, 0x7fea2c478e18?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc00084bfc7?, {0x351ae00?, 0xc000b48a80?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc000c3f890?}, {0x351ae00, 0xc000b48a80}, 0xc0002a3c00)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0008a9f40, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
2023/02/15 13:09:09 http: panic serving [::1]:42954: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3247 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc000273a88})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc0005f78f0?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc000de9560, 0xc0009b8c00)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc0005fb800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc000b60ee0}, 0x7fea2effb100?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc000b60ee0?}, 0x7fea56daf5b8?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc0002a91d0?, {0x351ae00?, 0xc000b60ee0?}, 0xc000c319f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc000b60ee0}, 0xc000cab000)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc000b60ee0}, 0x7fea2eff84e8?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc000df4be7?, {0x351ae00?, 0xc000b60ee0?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc000d2d320?}, {0x351ae00, 0xc000b60ee0}, 0xc000cab000)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc0002688c0, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
2023/02/15 13:09:09 http: panic serving [::1]:55334: runtime error: slice bounds out of range [:3] with capacity 0
goroutine 3328 [running]:
net/http.(*conn).serve.func1()
        /usr/lib/golang/src/net/http/server.go:1850 +0xbf
panic({0x2f8d700, 0xc000273dd0})
        /usr/lib/golang/src/runtime/panic.go:890 +0x262
helm.sh/helm/v3/pkg/storage/driver.decodeRelease({0x0?, 0xc000d0b020?})
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/util.go:66 +0x305
helm.sh/helm/v3/pkg/storage/driver.(*Secrets).List(0xc000de98a8, 0xc0001cb670)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/storage/driver/secrets.go:95 +0x26f
helm.sh/helm/v3/pkg/action.(*List).Run(0xc000dad800)
        /home/christoph/git/openshift/console-4.12/vendor/helm.sh/helm/v3/pkg/action/list.go:161 +0xc5
github.com/openshift/console/pkg/helm/actions.ListReleases(0xc00037d680?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/actions/list_releases.go:11 +0x6b
github.com/openshift/console/pkg/helm/handlers.(*helmHandlers).HandleHelmList(0xc00014f000, 0xc000844960, {0x351ae00, 0xc000b610a0}, 0x7fea2effb100?)
        /home/christoph/git/openshift/console-4.12/pkg/helm/handlers/handlers.go:154 +0xdb
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func7.1({0x351ae00?, 0xc000b610a0?}, 0x7fea56daf5b8?)
        /home/christoph/git/openshift/console-4.12/pkg/server/server.go:286 +0x3c
net/http.HandlerFunc.ServeHTTP(0xc000430260?, {0x351ae00?, 0xc000b610a0?}, 0xc000e469f8?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.(*ServeMux).ServeHTTP(0x2f32e80?, {0x351ae00, 0xc000b610a0}, 0xc000537900)
        /usr/lib/golang/src/net/http/server.go:2487 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x351ae00, 0xc000b610a0}, 0x7fea2c6da648?)
        /home/christoph/git/openshift/console-4.12/pkg/server/middleware.go:116 +0x3af
net/http.HandlerFunc.ServeHTTP(0xc000df53f7?, {0x351ae00?, 0xc000b610a0?}, 0x109034e?)
        /usr/lib/golang/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc0005f7a10?}, {0x351ae00, 0xc000b610a0}, 0xc000537900)
        /usr/lib/golang/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc000c203c0, {0x351cca0, 0xc000145740})
        /usr/lib/golang/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
        /usr/lib/golang/src/net/http/server.go:3102 +0x4db
Version-Release number of selected component (if applicable):
4.8-4.12 don't show a Helm release list.
4.13 works fine
How reproducible:
Always with this Helm chart secret:
Steps to Reproduce:
Unable to reproduce this manually again.
But you can apply the Secret at the end to any namespace and test it with that on 4.8-4.12.
Actual results:
Crash
Expected results:
No crash
Additional info:
Secret to reproduce this issue:
kind: Secret apiVersion: v1 metadata: name: sh.helm.release.v1.dotnet.v1 labels: name: dotnet owner: helm status: deployed version: '1' data: release: >- H4sIAAAAAAAC/+S9a3ObTNIw/Ff06v74OgkgKxu5aj8YYiEUiUTI4rTZ2mIGDEjD4REgGe2T//7UzAAChGzLcZLr3r2qrorFYejpc/d0z/y7H1qB07/p21EaOmn/qu+HD1H/5t/9B3+bpP+ynRhFuWP3b/ocww3eMdw79vqeG9xcj25Y7v3H4XA0ZJkh9/8z7A3D9K/6yHrNW7aDnJQ8T34kcOvHqR+F/Zu+FCaphVAPRkGMH+pf9ZPUSrMEA11+56ofRqmDL30PjSjb9t7Ld/c9K457ftIDmY9sP3T/v9591Nv5zr6Xeg692kORm1z1tll48z38HkaQXOgB+IHio/fu3UOEULTHd+UodXqpZ6W9HH/iM/l44IRpb+8j1Ns6cbRNe9/7d9utFFiu8y1D6Hu/Z4V273u/usJbcPP14eF7v5eFqY9qsPhJNcn3va8hdLrvXdHP+3hA+mXg9KwsjQIr9aGFUN7bRgg5di/K0vf9H1d96FnbFNM0cFLLtlIL/92m+87ZJhTjzHvmPXtCh9vexEFBj4zVS6MCLjw5SoUK5ciHFn4n6V/1N06+j7Z20r/5R3+L5xs4+HLx0X9e9a3YV6sP77j+Vd8KwygtBrj5N4X9X9kW9W/6XprGyc2HD66fehl4D6PgQxQ7YeL5D+k7z0HBO/J08qH4Z+sgx0qc5IMd7UMUWfaHrWN7VvqOfv8dmWjXtfepe+j/+HHVRxHc9G/CDKGrfuoEMbIIl/2jQl918YP89f5u+T59xLikOO47gySVRBRIwnBpao/I0GU026DDUhsebHGcAIEfPSxiEwzUXBKGXxV14Ro6v5dEdJDEKWtpjxtLG4bSkl+BnOcsTR1IEyUyl7xvaygxBT4BnH2YCXxua9cfBTfeGTm9JonT9WyQ/k0SUWZwj6wprlwpUHa2OES2MMwMjUWSf5tJE3YkCUxqBqMEiKOB4MZfwUBB+DuGvnAdbcRCn78zdT4BA5Sa2pCRJnYMxL0LA3UPBlNGEqZjGE6nQBuHpsqzQNz7kjjOTOHWX2qsZ3LqwtYek0UwXlvMKDB9ybW1IWNpe9cWPVTOlc5b3gGdT0xdQTOf/wYCGbXmHMOcXwOO3QNRZczlvoQxJt9f8gNLe0wkcYokccza4ig1dCU2uHECJhsXknmqG0kcsbZw/YXSSM1MQou//73/46qDuP/yHBQ72+R9GqM2fRVkBigzl7e+KY4YEKjMLBh6QFv5ksBi+v7NyfmNqZmerT0yTV4gz7kzZHpgoiKYD0MgjnxD22cgGKfmasSZ+jS3NEwPdiSE6d9mSx6BYOHOdHYkuHhsxjVFNbC0IZKE6QYMlMzUFxkQx76pPR4k/zZ90JkvlqgmYDk8WMJobYnj3P4cuc4gcWcbOTL0KVPCgp/F1y1tuAYTddOYVygjIKprWxzl98fxsxpssenfZgvO82C4yBY6v1cDNYcc2gGf8LoHJ7eZNVB9U59mmMYwH8YgJ/M8WNoo++rzf3Pys2N8kiZjlvJnEx8Ybiw7syBljUDNMbymPs8s7dMOaOPM0GxkCqzv3BfzRlM8Fw9yq2zFqbkdoLW5PMIIBjwCoRxZmsnMArSbDaYsCJUYaKuPkljI0W2Bf224IfA8QQ/IqYmpyQwYTOeGNkVgMi/54xxOiIxSfPAxCOTExnxQpzHmkaXkzp7GbQxCmTG04drsmPs9GYMv+LSYC4Er+nKev1GK8Xlffl+ntKDPjlkwWbhfntE7X5a3mRqME1tTD+V4hdxgWn6kcFZyUcj2kDE0WPJoIbeEv8/ILTFSMAoffPd9bgWoUzdrhvbIYl4xQjUG4iIztaFnBI+I6oTYgyLSavxZ6KHhDopqBjkvNsMF4TN7fffF4lBmfo7cBRlL+Qy4YWBp8AvQVMbQFM8W7z4K/q1LaFfQo1PWKh1yTeYrCXxS8A15XxJ4Qq/udx89I1ATmBPe+CSJwxgECgLhwpXpnA5QVNdf3cglenDCs/bn6Isk3LrSRNmR6wL5xtbShptSJo/0ovp6FhTvCkPyHJEBAtutK4kE/o/S5KwNojRdNedZ0YXaj8jU7+p8UOHe1pW9rS8ygm/h1lfE0dri1Jzam5Xf5K8TePe2LkcrTl3DQGWOcPON6zU81GTxwnFbOkoS+AO29wa3KunIODqPx55Y+oLSQLTjih7CrWvr0/jst0M1t4j8VrDmpmZQvHfwNgzUoK2vT+cj70CoIGei3Fm6VMCN4WpcP/sNgxtltqhe23dKDP2WbilwDQdKbugKMgebNh7uC/wU/CjvbH26NlWZgcGYMTV7WKNLAINRevyNYUxjECw+SndUp6wGSm5q41QVx2HFm0JT/jr4i/Im+abqAVXxzLzQXQX8T+kPi/o89/ZkigyNXRkazGxdRqsA24DxgfK8eoDiuLA5dfhr987pa13eG5rcgkVNILc60qrGt3DAe5jfztGrC15w/juFrxS5eLyzPBRiP/Dx3tTk3NQXDbjgRE3AWEbGYIrqfP6snHWOyZ/wFsWjmtnr6MtRTxpddEYgNNo4I8/bEz6RBI8BLPKAtj/3bkz0lsBnWP8R3/iz+zRcwq07W5bzvPWVu9HqfqOudFZeLdTpSlX5h9V4+m25UT+rglSDC8sCheE8fmQG+3K2zi9gMPo/2N+QJnsX6urOFldN/dqtGwrdpB4qOa1dK+WMjDFRdpo2ToHQBccUQW7EwkBGMO+0Pye4sSfT2ORsBMPSvt3ibwyhuPoo3ck7EJixyciRoQ1Ds/V+t+2nZj+w4hdYflNU90AcDWfBeA/FRxwtMNjaryZTbOVzW0SY8ggEY59EDxjqYLy3VBPBUI4Bd/1RmhiPQlBqnxJi1oO3cWrqimeKY8a4Jxx5oWeHrXRh0Q+llaaS1/luwzPfuwB7b1gahFtqkYKjF4I9ZiCiNY6QAHedwfp8D5H7IDD17zFQbEjChkaFm6w5zhBz397Up4yFuSLkc+xNw1CJTX1+ziN5AUXtKuPSIqmh83EVJKwjMi2Yj7j5Ii4dmEYAKwQsssXxxtAVjzpBzzorjYCZOHAFmhtC0f1ucnT428pi0eX0dBhkNTO0aWJq2LFWH7uEFwxUBk6w80dY0JU2pwZQ8jelcvJAMNzZImXziq3E0hEf7U1teLBEFBBHLR8xMEChyal5gy2EW1fb14zXJGkqLGFKDLS0jv9WV4DFPeo0rqMySVA3QP7sNmo9f+sXYvFRCi9yKv3mt/nRydhYVLXHzQrjQ8Djy3tTm2MnJoXio2eLqwwOeGTkwzXgcBCCMhwQaAfeLoMXqe6ETI740dvKjo79sVCuhdjMT4xzpZKwMqUq6aiUq2BSKp2n1NCVtXXXUhNP8+WBJCIGyg5unggYShU0URDQ+QQ7bSXPt4MaykOnMLw2WPkqHJ0jgv9D9
OVp9Y0ySydBV0WjplNwer/hPNbU3JdOA6dgWuycJYYRMQvi6I5jgHVPvjkDf3zGQSHOtGdpw7rRazkI/MDUWk5AIaNPmY/Cofva1lk1epDEHSzxOZl6ILBRN//xOxgqh9Mx8P9M05E+BvAtPUC+WZedBf5+6cjY4jg3OZVZiaPcFloOcUUbbEaUncGNkvI9bK5scbQG3L6VFDjiveX4VSary9mpwZqbehF46E3a1Pmkwzk8M37DDD/Ou4KizoCs4rfE0k0EAvUAWT4H3BR1wHzyTIO3X+bcVvADEXEm5s2BjM25by5P+Xt+L70G754pYh6QD9i9MoIOfiF2QInfai4d3+xw3O/yroD9aX2jZrZ/yq+mNuR+Bl78/pflifv2Gr5BIDRFYoNP+aW697OwKuF0B17MH81v2cEosTUW3WsjFoTK4av7NP886WrW/KQuHVTTqx6c8ImlyZ4toh3w2X19nFM9h20dgW9h6Ep0GoBV+G6Oixqub1Yf42JeC80dmKipuWJ3tjZkYN6lJ19Otzbe34rfnImSG6uGbbs0wD7ylu4xMBg37HUHnNfdibYmf8FnfYcLQj//Z3mK2MLA0mzZ0G9P7cul8chr+Eh/PV0qnL7Q5+kO58gKdpJuHSs4F6L/pnioHT9S/2nVsQhUD/Gb4/3VYsrSX5bIIvFoa+v8AnCPsXFMuCaAkz3wOXLtyZR9WVJlG2Wpc0lCJZzubF1Bbc3cxjgMRiyOsp7E+JiO9RfGNAOqSLqEWUYwNOMqnf0KWCH2io/LMx4MbGSPsVe+oONMlP2ZKCXHUkCWzfRpaOrKgi7XX79ASxR0C5WkS/vZ4hF3tqjmQENZkUR6KpKtj8mY+jS1tGGhLY/WzNJwZCqzMFDHpmgjtTH+sOLiF34nBqGMjIGamyt1Y3LqqvFdxO+wN+Esn0ppn+ITTOaZxanD2tLR1tQxTPu00uZkrJeO07Bo3GUp9+5xTVE92GKFt8+LlUw9kYB4T7UIgt+YusxUONnI/IIjli+g1mz1qnk9//23n7PBjT9Ti2sSS15fXjm5d9/MZJHvPs9Pa+O3zKOB//oSXOPbX33+0+y49Ec0eHcERPUrlttZ8Br4O73A0uM4s/yeONudD51nsrX1ZfOqFGPxF0ua17J29gTtO5YOj7gu56CS5Ytq2bfKtq2fh6e7XKTb/tSzORdmseh7HV6cXJVD/fOqv7NQ5pBqPFJPQcryojB1HtPP/rYsj3NCCyDH7t+k28zBP3flHeLnLYmfd2+5J7WHNwNSbYivJbEF8Y2qqq9/1c8SR6F1fPLxiQcLJc6Pq36UpXFGShs3fmj3b2iZ5fFbV/04S7ylA7dOSsH5gS8hVL901d86DxU4MNo67yhIWyeJsi3EU6fPJam1TbP42zZaOzDt3/StOMYgbv3u6sSytNDZOSiKne2HhPPf1T7jPPZ/XBVVrHgSterJb1v8QupTvFfIJRO/6gdRFqbfrNTr3/RryyLlotcHPPHaAP3/+Z/eccCeG/U8Z+vgb9fI5IS78TYKqp+P6dYSojC1/NDZVijwQz89vYr8nRM6SfJtGwEHA5zCeBnBjUNoE0fbtEAQqarEvxtVlOTOVfHcJ+YTQ8BPIxih/k3/XvjWv+qn1tZ10m/VI5gxt45l+43v4pHE4qsFeqqBjwBsHYLnpH/DdlCZ+LgNrFOWrkNQgpwiWqVqCRi3D5h4TjkOPL1kO0nqh4TAwm3HK60v+mHiwGzr3Nmuc+9sg+LVbxHyYd6/6SuO7W8xJ5JK22OhavUk9s5t1yGTLpTxfR5jlAsoS1JnK2HU7iKUBc4c81SFBHotqYTGRRGwUCm8X3fOduvbTnWbyPhRtAtAsLT3iSlIKQjQcwISMpSLRr5yMDgPAe3O/+rf+tZEYeDnaDfj4gPgrlPIyZGpsd4sGOVmPtrAYBzYArOX81H1XrWY01xn9LGpaATW5ULNOnKd/eniElHrS+mjJEx3RhAjY7DoXITCbo0xmMZwQtxdArdSVPzBnGcscVUGkKQyVZrYHlaptvjJLYLTXelaSP7+NFH+3Dy6FsROFt7OzG2c+HCg5KSq2N+7UjAk1br6cv/k+11zLioHd6Z/2ZxnPn9XVsNifIECZ7CsjikSKjNf6oZpwiTn8GHjoL6bvvWFR1JphJ+TQpmB2NTn0rkxC95REOTk3NJ5khwi7yLFMwnsY0aaoJ295AcGqY5WdmVF84wrTe3ZuZxcbyT12nMtEiC/godL90CaTBEQxwwO1TH9GhXaP8mvx8rKC3hWmPqAG2HeyDq//5xsFt+ccUoMORzGruic7kafVwIv0GqeV/CaPo1/H6+pyyWVtYmlK5KtSUX1vRzb4ih/gr/Owk8qAX8X/Bs7tmllot/9bseifMVfHVVNvw3vvGeLLpGD2WY4VgV+X8EgmjHmJTCQXDiZ7qxAXdsCv7H0KXXzw00mud0wPpzVt1OySGrqHqOIKL9knlo+PdiTaQwC6M+wbQjVBAhT+yxe6fs49F/DAO1pGobI2+cSBklUYhjQin9nyQ8sXYks7brsuJhY+oLyIdXjLORWrqHPqXxNpjswWLhmMMqbHRyX82pHheJTeqbgGxJ+ER0AuCml2Sv0x5N6cTLPrXJeoppPB3P3hcUsdRo0Fgqep/lFtv/nZOJCH6Br7s+Of57uddhZyKlV50yjCvZl+DrhCSMY7YCoesD/ORwoolqmMrH/Fz+BC9fS5y6WH2pTG1XAQzDAsjNFlM9UD3KYPliupGfp+/C0/1bot/r3/izfqKfzp37UsSjmUOiM6Wkl9gvwsYie4reL9U+5mHSR3QlGPrUJr7E78t7U5NgMEOapgSWqOcXR0c+2OZQAgffNJe1aKPUS4IbrUtfawrV7r43wuzEIzB0MWJriXUcuibUm84+zfLQBnHxoFf2tAcfsjFqB04wWfwVgME1nh0Um+yOq9ybzssMK25608PfT4wKcxwBt/0t0IO3++LO83JL/CAxgId9FvFl0SxrByoUT9WCJ6uYZO1SHF4FQTQv73iULiSRM7wAnb029uL+c2m+kc5vdLK/Sszyo41kSppmtPSYSYn4K5yuMR5JSpZ0A1BbuXVuXkSnwOxA8DukCMl38/XbPuFNG2RlcimCxEDz9A3qk0fkg/Jm4XW3wJunWdWFB45nPy2Awxfcz7LcBjRZDfPX5yJ4oe3iIdjNO2RmDeWupVt6B5ahW4CcVhbNY5zA7WbjmZlxNv4xHtFBXr+sO+6Hw8w6zgXrAc51pRUGyTuBMSzhhPoxskU1e4V8jEBoX+I48kIJxDoPx8GfxroRTZGoEH6RADMcPpva4eRnOT7taTXG0hvmIMXR5C/NRDGi8n5IlkyXbKkYZbUzNjEGwSvG3LX26AwGLQLhI7WCcWxqJi9OGvs9fFVMeix4vi12O+Yqfi11EGKiI4HHZKOJ0sS0F4iKT7tgdDLAdHRJbVi1bix5jT/jDV//TrqOLltpITt6BQEZwohR/m7GJ50vHqDqNZ3q9A4ZtFI0/gdcLc0FVbqlbj7w0/un0P8plLRyzYzlWNySvRX2yTWM3
AHEUkPyXLrX7STpzT40ek/wpHTjN8XcwDz/93Em+KAahgkxOzV8T792Haopljc6L35mkGLiEg8S55VLfK3IZdb64TP/XaPhmeqghr2+jj9bE/9R5pvg7sDSbhQEdY8axHgiwj8LWZbOhd2D+6TU5oqLMRl2VOuU35YCxPNCxCL9U5T409z2YIoPuZPFWOD2Y+pSzNKJXsH4agAFppAlng+rbO2nAs0bwGEPOIz55tSStz49/L1kCN3yNnhdHuT2ZX5SDMfRpbugb/yd1ejV/LJukMeHt+fZlOObozgjUT7mr4fU1dpOUZ/z+nBb2z3we+1TYT8jMnPeAz39bsLdlSd4vidsuWAc4E1tPd4B7RIZ2/Qx8z+WlFVSWr5kkN2NUuUublMbIERioiUnj7AJP6uac3/kiXdVcr6o1vvwFcjJEphYu4K5dKI42dD0THY744HcWLT5/u7wUKat6LR+8MMdyoTwc7RaRidX9eE5wYmjX7j0jL0p8/Np4aYjsib2DpPwV7qg84tioKGfUFml17UU5llet2Z2Rpaqc9/J330KWL8/JEhpCztvZZKcpTLPjHC6B/0U83ZDxRnnsz+Prcnl/zr5OQaDQkshq7UZBMGBjUPhtRSf+L8XTC8t+/3fgT8S2BmEZdY1AjQzdjMFAomthIsogp65tHfMgjiGHyBiQOR6etyuX5vDqMpsi56K8/y/J5z2Hy5mpb3CsnmB7S9Yhib2RLoD5Rba3seZ6UtJ7Sa7z7fJ1F/gtLXjJTnfeDgaIcZb8ulUC/dvw2PzuX57XPv8hPD1Xbv/TOu51tQAvt6+08V3N7An2iwqf+U7m2+XpNCYmMXKA5wvarQb+ZXh+GU5e8Ny53P2T7z9B5+AxtgO13mB8YY756Me+mM879YKZm5pK8pqSeALTRTlQGhc3d7kzuFFK8p5lc6fP7klNBInNPzGztbFrNegeZlyzsXKmecjQHhlLKHL4Z++znoH92/Go3VgZmdo4sUX3Vfm3dmP5XyJPrY03VqPZHssLjjt/6fptE6/hvEW769QSVQ9MlKiIL9Zn72tjskaun6Xl5TkRskZyGW08GE49Z/mWMkQa/C+hxdbWpggGQ0TjMlTGfztbHB+swRzHzyRvM6Mbp6QtnO7K3UNnmprZAcoBNyR59pluejBAtJZRq8vgK/KlgZrZGzMHHPOXWAujGwi8LZ4x75sBCrGfPdP5nSnU12Gkc/dx3J8a+u1Oqq3vmEvWA+I+tTh1iOkAJip+x7P06WGmoWym3aXleEAb72fa+NCkEYtAaMYwGGUA01VgOUOfxiT+1OevoN9TzeoXrZU8VSf8nA4rW+WKuEa9JrFuME4sPca+B8mXPLnHWI5p/oiey2HCy+b3MzmK19QdP4enO7KOSOJp9UDqEqhP1LJrw2YutagvbeUs6jhGpsDHwOe9xn5zwlvVi9GNJqpNsf54vdjjruR3WmNP4Stw2Yk7W/z0VvWHr8lz/5ZamBfJV7FzKwiVw0ty10/Bc64m5omaCw5wjyzQVBkMSMzwx21Oe/OTX7GGZIujnNQCDKoNN3avqFcpcXf4477Q5LhxyJvXUgxIq6tncqvdcXOR1/g1HRvA/XEfsgOmt+c3skYJxJFnTuY7unYzIpvUNfyQ8FU82LWpzx+R4YWmbGDOU3iWjQ3lDrQ+hdah4PnPaK8GZ2kKtpMezHkPx8RwsCA5C/xMfVNDmP+c34n9P4NT9/ZkvrO5UW5xjztDw7zN7zBNpAHxMRlTK2LmJ/y+0w3nGhv1tPh/Wp4acYYmL10ve9rHs0XPswX+YIkjFogLUs9q4jkG40QSx2sYjA4w5/eGPiUt5IY23JA8uKgWrfS8ZGqPqSTSEydgp09wgZ2l9ezf7EDNYYA2f6wPcmAiGE5jfK/wRfKyT6HY7iWT7lAKxfHansybtcH0vvuF1Ko2T0Gw9LkLRDWwBd4H4jiz8k2B0/ZY9HnM02W9iSmUa/i1+p5zcn+6iViD/8r7+F+yJcBV/8FHzWNwlLvbz/O794F93OVBym+z+426ku48BETGLQ70+LJYNtmgWHItWmiKEoRg4ZbtooRE4p2r5cOviqrYX5opKmaGRVH4FEtiMySSxFEm3ZGUyD1NiVx/Efz5WhrzualPEdTVGHIIi+GXRaAmYACL9ouT+yXrp4a+oeHMZEPKe4o2KNr2xI0P5rL4Rs6TA2+kstT3NjbBoCrZz0ytKKcT5dzUSOrZa5fi0jCBtJXuyVjiOJPG83xZazeSRBLCkzDL1D0Gm1lDIwfzhJYuM6QFbaweytaB8pAfilN5BzSW4NrQ565B2NZEEIdNrfKlSq1OVAx/WXb9UQh41xCnHuTcL4Cb154vS5z5DS1Nl9IaTC7QNykdi61KuGdL9vgsKaVR829L5V5Rp5qiTh9UdTonBxUVBz3MdPVA58uib0tFXTHq8n4zlpXlbTrTilJvn90bunzA6tj8zGxWd+P7FWt/W22a3zM11rO0/Wh6p8qLFZqTed1GXzRtxEqi7IGCxrboxTA/lp0b4caF4vhQtOrgMQ/0erGhemOjezZ1liyh5Uwvrom3FM8ilRGrPBhKvE0kcZQX9F+TMLqWrsBqpmwlMpY8dwzDUWLeR18MkXcdbZyC25jyCtk9ivJRdbiFz5/Ccxu7hnj7haoZIk9DSZwOpQnvwYFMQ3qBSe2yXEJQtMXqUVVU+UHZoLmi3WaCG62lfO4WJfrlZv0Ul0UrBAhUrKIYkJclNSs8D9JuUOOd2PRpW5TNjbH5wM8Xz1B+rFphMA5/nwzUWw9+VgYO1n+4DFRtQCF+3yv5YXQ6f8Wl+/09MQbdCW7UPOxFdaku5SNqV1AGB4oHxEd3JvDYttDWOm6c0eX2MqWH+YHwBU2hTKqSZxIyf3WxLPBDQ2MToWi77zwwpzpwpNle0nmgTClzuhy1ZY7apgKGfSyY2uPOzslhZ540UUjalyyniapncK5LZXBVtrwQWTqWLVV0ood9YduHiC5JbV1miO3VhmEpE4ZYO0xGw7qnOuQIy0NFl0qvDRQPhgoLx3wOuBgZA4XSYDJFZqDmdVccDOysbZtnGLfPHT6i3bpAVw82Lb0j/EDa27QxQ+aPXZ/i0Dz7eEAYCzl1Q3VWZ0k61k9fHgSeurVL/pN01wrfJntXEdXA0NXEFvD98cYUEbmH9cqyGeq6D7fxJyyTsyWT4nfxv7X/3Qfhtvrdcb/1P/9JCPeu0XAFMX/v3ddut9GYD3EZi9blwi873XaD/ySNiRtb8E55oF7pq4waOhLrXY1p7oWpL71f1i5UzDuFxbdgzqaGNiT7dVoTJQXCxjWC8cHUaBljIQfURi4Leol87UAxm+CX7AdaO/TwOZwaBF71gG34g8CviBxRH7f080rbtoOBGmCfkhxmSPy49LSVj2WiN1jSqy3XKsd2Kq5soyq+p8/TqlWlWNqrntWbrVclnme64pkD+eE8/rH+P8F9FZ6QvdpxmKYtShocD9Ipdpc0BCqP7ZLcsqUU82oR3jRae5op8oJPtQWRZWOi5HYhQytOZUrfifpdWN/KyOj2Nco
2D+KbwEDlLOzvc8QnwvqSqUI2n5+b2pij+hQdsO0vxqB7q2q3ZB5n+KoWKj/uHBwLiEOEdT/l0es2vlgYIDIfhUOMNCZbgxTb0FD9a2pDgpciPglr2zR8aizxc4ixJqpfyGNuaoT/U1NUOWx/oKjms8E8t7CfQHamU9ZgMEUzTb2mMlOXreMhl0WbjCdNZI8c5nmy1HHtLoKVCws7DHJ+dwzjCxwe45+TbXUKe1TRp5jnzuZOl5ukyR7re7d+ABrls2pLEvcoU7Tlnshyc5yztHvA/l0dN/fzbL6kOpTwWWGnzS4dfBY/vA+DcWZwbibdYX9fLmz6JxdU6YbKxnfxbnlYaiOWftIm+6XtOX+YFMHb+LglCp1D6QNQvMBQxWOtKA2J30jthV/yxPF5WuZ3G5nLYTjz+TUc4HtRXPhtXmv7kcpelDkB4nNyj3UYqMyTuZV6Gcsj9auOB1Qqh1o728EgKblTmh+3dflz/No4TC842ho4UP2Z1tqepbZNwMlcirYz7ANj3jF1L4YD5UB9Eh77MG09c/YQPMoH57ZnoH5ErfRrY+oyhnNT4sOsxW54ruVZMF/X82yuvZRuRX5pUh3BdSxFe7neKWMaHFMft/MSWc8W5ehYDt38RrEV0JAc3NzOO5U7NE/mJ7S86GBAiodzS8fHHFDHQYcNO6DRGK2S6+D0+crPpHyc1eTi/DYlxziL5AUKXdTM7RFa7et29iI+owcXqoV/drJlSFMfvAgvGJaaT0V0xG3aplMz3lTcwofu3Oblue04sH/ctRUMGEjYr2ht3XAbnclhxBiGrnEehLNLSL+wNYXyEsxHNd+PLejJHg9nJf54y66cOSy1sC0s5NxaDrQ4T4jSyFVPtnc4cxDtZ2YkCXYLt7DNd418GPUJurdFqHTkLyybrNNAL/z1Bn7ZYwxk6yQXl1a+eCOXqpTtQKd+dmhSXq4thRuimtkBOZDVA2HsXTLXs0u7A8WDXIrjjozaT5bs1j/TWj4HVx4u/UR51LJZttXhz1VLHJSG9aXXepwgu9KEHIT90SjtnfDHW51LGtXPKirstkzj/pe3oxNbSu3nUac1zpErW885YlvqJahuC48HWxxnTnBXyd5bLY3W8tjl1hRl/J7WYThZtj6ZT7Plu/BJyI7bZHn3eKh7I8+Mx1hwjywcKAhu0ILyzJuVMZQwHWZaq32vgufTGk5UH4gI2yCsN+vtxtXfMx3PhZwd1po3bXs+4fVa67kxqZ4p6ZgV8S+Sxmm5XpCRNYIxE838UfOasHlL2fjJdurSfpzslP983mDJL+mzzWXr2jIplj0Ghirln7sxWRuAOb8GIqLlUiTfK5N4C/PVPdbTiOpp6uOP15jeMD/nD57IYXmaQO0kD+Xo4xQtuMWaI/ExVrQV+5jnKH1MgWdgMA7MAK2PvhbhWbfU42dwR09oKNuwxSd9zU4cqneP9zNNzYzBdAgnXbj8hbYyUBlaYoHlZ0TnpM93ZVudKbDltZYP0o2HZovmsMvfI3Sg5QMkB0vaSVutnTQm1mVE6UBOG6E6fMCX56xW12cC0bkIBHLu6DxjCHsstx4cyIw1me5szY6MZ+B+WcskOue/Hk9GWdb5qNHynptVjKluyLzGU2SKKK/yAXejfdE+8FESpjwQH3d2sS2Zpct03cLfuxI6vif5z8yLU4eFP5iDgUlLK8QUOcuOcsbSXynyLRZdxyvXpzLpDsvocGcL/NoS1bVVtbSXsSXKMN1MDjGzZ+A6OUGExmiLssyo5BNyeofAe7auRGAwje3Jxm2f6GHUWqhKG/ucvDbHKHKaZV4Vz/sNvvHcqS0ndDnG+LW2Lv7zYsm32+sCslVujefLNnpySLjAdPstJ61VaocP2/VM6euRUzPuTW3laoPqtA6yrqcv3WzWuvblF/v5v7E9qsu3bJ2D2YHLjrM3f70f8do4tIkbmJ/FRWEH5HP3U0MjvPhwrsXplC9p3NT2o01tyL21//wLW4dOeKRqx9G651be+zXxYqPlZg+4RXNLujP3q7abXK6vgaeG9pjMNGw32GKNgt0bGsphzsbAZz2CCzwex3pQYL3mVm3SE1vldfjlLzw/8qyv2pF7rpXxN8suqxzRY+yQvCrKJPHOXRS4uNfG+yfybi88S7XW/kH9wzngqvUJHFOlpJxf6Mw3d7chYJiEUWP7uGPsDHevgrOxlYlcyxW29GZzy5OOLQaesEHts1dP4+eiTH9VnhN7eBO5eL5E/gRX1b2zMKp/DDZLG8b2XT2ul0/sD3lGaJd2vzW8JC5PADfewHy0B5xCc1X64tn8TNeZtKd5p9b5t2+cQ3lheXkJe1keTutqViPO1Kc59lcL/736Dc8cxXL0sevbF/IepgWt1Vm4hmYypi6V22cVW5gX95fXzbX3Yk20a028tb7ZOdbDbfT3/o9//rjq0/OuGkeTtc9Q+nOnj/04e7LYf95BYv+rTwC7/MCv1x/UdeHpXPWztU7O0wqs0H/AP2767969+x7+T29JjjK76VHe+NB9FOP30Ip91dkmfhTe9Hbs9xDz701vSZ/5HgZOatlWat18D3s9TKFyQPwbWcBBCbnV63kOCt4n3gfoWdu0/lSvZ8Xx+00GnG3opE7y3o8+tEfqesYPk9QK4bPPBVZouY79DuQ3vYmDguNzlfjix7ZZmPr1ryaxAwnsaR47N70K0fhS4iAHptH25q0nQNi9GPVdgVDMT/QKvX/TwzxdXSmY/6Z3L3wrLx75sXz4OaofJbqD8FYcJx+O1P9cPfsfzQC9nhWGUUoEsJwEkbDme+nWd11nm9z0/u+7Ev//KP/o9f59/LPX+95/2EbB9/5N4yq+jjH7vX/zvWUZvvev2k9i1JAnC7NEpfx7v/7cj6vWV30H2Vh5kxcxZ78vSf+e/ILVQY3/YP75nsyyPuKP8s9/1uSi1Iw3PbZDJgIrhd6szgAvo/NLKV3CX36uzof4P9T89JP893LYXs6Hl/DiS/mx16uQj/87Eq02zVJr7B1Q5wFC0nKskmdq9+uKpz1UXQGV/1Xf/naikaohuzQTIU3dA2h9s3IGbk6GIx9qw9I0662XCgt/OpSVeje9Dy/7Qv3My2qk4tDLm+cVK/E2qY/UoVnJ3Sbdj3qWxzcF8up/h6WlbnIxjTSqsE3R0dKMxb36BGp8TU9qLYciFlz0C8hd/8gS2danJL/Yjy5H2DoP5fdLv50AkG6t1HHzBgiUSwpRJn8vm4/1ethA1Bj2qam3Jl98+HiHBCE3vQr55V0n3HUojO/9z1/v5bv7fy3vb5X71bd/fVO+Tu+E+6ZlISc844etOKZ3KvtXei2Fv0T4VvCs0HWelxKinhIywQ4p6bC6Rymp4ea/Q0pQFG2ymEYMxWRQBC1008MhxvO4JkFMB5bp9TNYVvDN/xJ/v1Q8rVinLXClv15KeM3nLm1IWqGjFszd9HAsV/iTT4WDN70yGvwe9q/6O0ooEojWsxDQ2/pJGsVe/8f/CwAA//+hqYUMpacAAA== type: helm.sh/release.v1
Decoded JSON:
{ "name": "dotnet", "info": { "first_deployed": "2023-02-14T23:49:12.655951052+01:00", "last_deployed": "2023-02-14T23:49:12.655951052+01:00", "deleted": "", "description": "Install complete", "status": "deployed", "notes": "\nYour .NET app is building! To view the build logs, run:\n\noc logs bc/dotnet --follow\n\nNote that your Deployment will report \"ErrImagePull\" and \"ImagePullBackOff\" until the build is complete. Once the build is complete, your image will be automatically rolled out." }, "chart": { "metadata": { "name": "dotnet", "version": "0.0.1", "description": "A Helm chart to build and deploy .NET applications", "keywords": [ "runtimes", "dotnet" ], "apiVersion": "v2", "annotations": { "chart_url": "https://github.com/openshift-helm-charts/charts/releases/download/redhat-dotnet-0.0.1/redhat-dotnet-0.0.1.tgz" } }, "lock": null, "templates": [ /* removed */ ], "values": { "build": { "contextDir": null, "enabled": true, "env": null, "imageStreamTag": { "name": "dotnet:3.1", "namespace": "openshift", "useReleaseNamespace": false }, "output": { "kind": "ImageStreamTag", "pushSecret": null }, "pullSecret": null, "ref": "dotnetcore-3.1", "resources": null, "startupProject": "app", "uri": "https://github.com/redhat-developer/s2i-dotnetcore-ex" }, "deploy": { "applicationProperties": { "enabled": false, "mountPath": "/deployments/config/", "properties": "## Properties go here" }, "env": null, "envFrom": null, "extraContainers": null, "initContainers": null, "livenessProbe": { "tcpSocket": { "port": "http" } }, "ports": [ { "name": "http", "port": 8080, "protocol": "TCP", "targetPort": 8080 } ], "readinessProbe": { "httpGet": { "path": "/", "port": "http" } }, "replicas": 1, "resources": null, "route": { "enabled": true, "targetPort": "http", "tls": { "caCertificate": null, "certificate": null, "destinationCACertificate": null, "enabled": true, "insecureEdgeTerminationPolicy": "Redirect", "key": null, "termination": "edge" } }, "serviceType": "ClusterIP", "volumeMounts": null, "volumes": null }, "global": { "nameOverride": null }, "image": { "name": null, "tag": "latest" } }, "schema": "removed", "files": [ { "name": "README.md", "data": "removed" } ] }, "config": { "build": { "enabled": true, "imageStreamTag": { "name": "dotnet:3.1", "namespace": "openshift", "useReleaseNamespace": false }, "output": { "kind": "ImageStreamTag" }, "ref": "dotnetcore-3.1", "startupProject": "app", "uri": "https://github.com/redhat-developer/s2i-dotnetcore-ex" }, "deploy": { "applicationProperties": { "enabled": false, "mountPath": "/deployments/config/", "properties": "## Properties go here" }, "livenessProbe": { "tcpSocket": { "port": "http" } }, "ports": [ { "name": "http", "port": 8080, "protocol": "TCP", "targetPort": 8080 } ], "readinessProbe": { "httpGet": { "path": "/", "port": "http" } }, "replicas": 1, "route": { "enabled": true, "targetPort": "http", "tls": { "enabled": true, "insecureEdgeTerminationPolicy": "Redirect", "termination": "edge" } }, "serviceType": "ClusterIP" }, "image": { "tag": "latest" } }, "manifest": "---\n# Source: dotnet/templates/service.yaml\napiVersion: v1\nkind: Service\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\nspec:\n type: ClusterIP\n selector:\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n ports:\n - name: http\n port: 8080\n protocol: TCP\n targetPort: 8080\n---\n# Source: 
dotnet/templates/deployment.yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\n annotations:\n image.openshift.io/triggers: |-\n [\n {\n \"from\":{\n \"kind\":\"ImageStreamTag\",\n \"name\":\"dotnet:latest\"\n },\n \"fieldPath\":\"spec.template.spec.containers[0].image\"\n }\n ]\nspec:\n replicas: 1\n selector:\n matchLabels:\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n template:\n metadata:\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\n spec:\n containers:\n - name: web\n image: dotnet:latest\n ports:\n - name: http\n containerPort: 8080\n protocol: TCP\n livenessProbe:\n tcpSocket:\n port: http\n readinessProbe:\n httpGet:\n path: /\n port: http\n volumeMounts:\n volumes:\n---\n# Source: dotnet/templates/buildconfig.yaml\napiVersion: build.openshift.io/v1\nkind: BuildConfig\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\nspec:\n output:\n to:\n kind: ImageStreamTag\n name: dotnet:latest\n source:\n type: Git\n git:\n uri: https://github.com/redhat-developer/s2i-dotnetcore-ex\n ref: dotnetcore-3.1\n strategy:\n type: Source\n sourceStrategy:\n from:\n kind: ImageStreamTag\n name: dotnet:3.1\n namespace: openshift\n env:\n - name: \"DOTNET_STARTUP_PROJECT\"\n value: \"app\"\n triggers:\n - type: ConfigChange\n---\n# Source: dotnet/templates/imagestream.yaml\napiVersion: image.openshift.io/v1\nkind: ImageStream\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\nspec:\n lookupPolicy:\n local: true\n---\n# Source: dotnet/templates/route.yaml\napiVersion: route.openshift.io/v1\nkind: Route\nmetadata:\n name: dotnet\n labels:\n helm.sh/chart: dotnet\n app.kubernetes.io/name: dotnet\n app.kubernetes.io/instance: dotnet\n app.kubernetes.io/managed-by: Helm\n app.openshift.io/runtime: dotnet\nspec:\n to:\n kind: Service\n name: dotnet\n port:\n targetPort: http\n tls:\n termination: edge\n insecureEdgeTerminationPolicy: Redirect\n", "version": 1 }
Description of problem:
The machine-config-daemon-update-rpmostree-via-container service fails to deploy the commit:
sh-4.4# journalctl -u machine-config-daemon-update-rpmostree-via-container.service | tail
Oct 12 11:45:56 master-00.wduan-1012e-upg.qe.devcluster.openshift.com peaceful_elbakyan[2022141]: Checking out tree 845113b...done
Oct 12 11:45:56 master-00.wduan-1012e-upg.qe.devcluster.openshift.com podman[2019123]: Checking out tree 845113b...done
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com peaceful_elbakyan[2022141]: error: No enabled repositories
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com podman[2019123]: error: No enabled repositories
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com peaceful_elbakyan[2022141]: error: Failed to deploy commit: ExitStatus(unix_wait_status(256))
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com podman[2019123]: error: Failed to deploy commit: ExitStatus(unix_wait_status(256))
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com podman[2022949]: time="2022-10-12T11:45:57Z" level=warning msg="lstat /sys/fs/cgroup/devices/machine.slice/libpod-ea744a45645d9c8d7a79182a78525a0b9f65b13e2e997f55bf80f626dcc0e945.scope: no such file or directory"
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Main process exited, code=exited, status=1/FAILURE
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Failed with result 'exit-code'.
Oct 12 11:45:57 master-00.wduan-1012e-upg.qe.devcluster.openshift.com systemd[1]: machine-config-daemon-update-rpmostree-via-container.service: Consumed 1min 9.080s CPU time
full service log is attached
Version-Release number of selected component (if applicable):
4.12
Steps to Reproduce:
1. Set up an SNO cluster (upi-on-baremetal) with 4.11.8
2. Upgrade it to 4.12.0-0.nightly-2022-10-05-053337
Actual results:
The machine-config-daemon-update-rpmostree-via-container service fails to deploy the commit due to a "No enabled repositories" error.
Expected results:
The machine-config-daemon-update-rpmostree-via-container service deploys the new commit successfully.
Additional info:
No proxy is configured:

sh-4.4# cat /etc/mco/proxy.env
# Proxy environment variables will be populated in this file. Properly
# url encoded passwords with special characters will use '%<HEX><HEX>'.
# Systemd requires that any % used in a password be represented as
# %% in a unit file since % is a prefix for macros; this restriction does not
# apply for environment files. Templates that need the proxy set should use
# 'EnvironmentFile=/etc/mco/proxy.env'.
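As a concrete illustration of the encoding rule that comment describes (hypothetical values): a literal '%' in a proxy password is written url-encoded in this environment file, with no systemd-style doubling:

HTTPS_PROXY=http://user:pa%25ss@proxy.example.com:3128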
Description of problem:
The default dns-default pod is missing the "target.workload.openshift.io/management:" annotation. As a result, when the workload partitioning feature is enabled on SNO, this pod's resources will not get mutated and pinned to the reserved cpuset. This is a regression from 4.10.

Pod spec from 4.10.17:

Annotations: ...
  resources.workload.openshift.io/dns: {"cpushares": 51}
  resources.workload.openshift.io/kube-rbac-proxy: {"cpushares": 10}
  target.workload.openshift.io/management: {"effect":"PreferredDuringScheduling"}
Version-Release number of selected component (if applicable):
4.11.0
How reproducible:
100%
Steps to Reproduce:
1. Install a SNO and check the annotation
2.
3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-5129. The following is the description of the original issue:
—
Description of problem:
I attempted to install a BM SNO with the agent based installer. In the install_config, I disabled all supported capabilities except marketplace.

install_config snippet:

capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - marketplace

The system installed fine but the capabilities config was not passed down to the cluster.

clusterversion:

status:
  availableUpdates: null
  capabilities:
    enabledCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - Storage
    - baremetal
    - marketplace
    - openshift-samples
    knownCapabilities:
    - CSISnapshot
    - Console
    - Insights
    - Storage
    - baremetal
    - marketplace
    - openshift-samples

oc -n kube-system get configmap cluster-config-v1 -o yaml

apiVersion: v1
data:
  install-config: |
    additionalTrustBundlePolicy: Proxyonly
    apiVersion: v1
    baseDomain: ptp.lab.eng.bos.redhat.com
    bootstrapInPlace:
      installationDisk: /dev/disk/by-id/wwn-0x62cea7f04d10350026c6f2ec315557a0
    compute:
    - architecture: amd64
      hyperthreading: Enabled
      name: worker
      platform: {}
      replicas: 0
    controlPlane:
      architecture: amd64
      hyperthreading: Enabled
      name: master
      platform: {}
      replicas: 1
    metadata:
      creationTimestamp: null
      name: cnfde8
    networking:
      clusterNetwork:
      - cidr: 10.128.0.0/14
        hostPrefix: 23
      machineNetwork:
      - cidr: 10.16.231.0/24
      networkType: OVNKubernetes
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      none: {}
    publish: External
    pullSecret: ""
Version-Release number of selected component (if applicable):
4.12.0-rc.5
How reproducible:
100%
Steps to Reproduce:
1. Install SNO with the agent based installer as described above
2.
3.
Actual results:
Capabilities installed
Expected results:
Capabilities not installed
Additional info:
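For reference, one way to confirm whether the install-config capabilities reached the cluster (standard oc usage, not from the original report); on an affected cluster this would be expected to come back empty instead of echoing the configured baselineCapabilitySet:

$ oc get clusterversion version -o jsonpath='{.spec.capabilities}{"\n"}'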
Because the agent ISO is ephemeral, it is probably safe to allow a user to log in to it with a password. If the network configuration is broken, a user may have no other way to debug it other than to log in through the console, which is currently not possible.
The best password to set would be the kubeadmin password used for the OpenShift GUI, since we'll have generated that already.
We must take care to test that this does not result in the installed nodes on disk allowing login with a password.
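As a sketch of one possible shape for this (illustrative Butane config, not a committed design; the hash value is a placeholder):

variant: fcos
version: 1.4.0
passwd:
  users:
    - name: core
      # Placeholder; in practice this would be derived from the generated
      # kubeadmin password rather than hard-coded.
      password_hash: "$y$j9T$EXAMPLEONLYEXAMPLEONLY"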
Description of problem:
NodePort port not accessible
Version-Release number of selected component (if applicable):
OCP 4.8.20
How reproducible:
$oc -n ui-nprd get services -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
docker-registry ClusterIP 10.201.219.240 <none> 5000/TCP 24d app=registry
docker-registry-lb LoadBalancer 10.201.252.253 internal-xxxxxx.xx-xxxx-1.elb.amazonaws.com 5000:30779/TCP 3d22h app=registry
docker-registry-np NodePort 10.201.216.26 <none> 5000:32428/TCP 3d16h app=registry
$oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxx.ca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.96
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -vz 10.81.23.96 32428
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.
In a new-created namespaces the same deployment works:
[RHEL7:> oc project
Using project "test-c1" on server "https://api.xx.xx.xxxx.xx.xx:6443".
[RHEL7:- ~/tmp]> oc port-forward service/docker-registry-np 5000:5000
Forwarding from 127.0.0.1:5000 -> 5000
[1]+ Stopped oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> bg %1
[1]+ oc4 port-forward service/docker-registry-np 5000:5000 &
[RHEL7: ~/tmp]> nc -v localhost 5000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 127.0.0.1:5000.
Handling connection for 5000
[RHEL7: ~/tmp]> kill %1
[RHEL7: ~/tmp]>
[1]+ Terminated oc4 port-forward service/docker-registry-np 5000:5000
[RHEL7: ~/tmp]> oc get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-registry-np NodePort 10.201.224.174 <none> 5000:31793/TCP 68s
[RHEL7: ~/tmp]> oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
registry-75b7c7fd94-rx29j 1/1 Running 0 7m5s 10.201.1.29 ip-xxx.ca-central-1.compute.internal <none> <none>
[RHEL7: ~/tmp]> oc debug node/ip-xxx.ca-central-1.compute.internal
Starting pod/ip-xxxca-central-1computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.81.23.87
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# nc -v 10.81.23.87 31793
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.81.23.87:31793.
Actual results:
Expected results:
Additional info:
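A possible next debugging step (generic tooling, suggested here rather than taken from the report) is to check from the node whether the NodePort was ever programmed into the iptables rules, since the freshly created namespace got a working port while the old one times out:

sh-4.4# iptables-save | grep 32428    # failing NodePort: expect KUBE-NODEPORTS entries
sh-4.4# iptables-save | grep 31793    # working NodePort, for comparison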
Searching recent 4.12 CI, there are a number of failures in the clusteroperator/machine-config should not change condition/Available test case:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Fmachine-config+should+not+change+condition%2FAvailable&maxAge=48h&type=junit' | grep '4[.]12.*failures match' | sort
periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade (all) - 129 runs, 53% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-techpreview-serial (all) - 6 runs, 50% failed, 67% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.12-e2e-azure-ovn-upgrade (all) - 60 runs, 50% failed, 3% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade (all) - 129 runs, 56% failed, 8% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 129 runs, 69% failed, 12% of failures match = 9% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-rt-upgrade (all) - 8 runs, 38% failed, 67% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-ovn-upgrade (all) - 60 runs, 57% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade (all) - 12 runs, 42% failed, 20% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-upgrade (all) - 60 runs, 40% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-serial-virtualmedia (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-sdn-upgrade (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-metal-ipi-serial-ovn-dualstack (all) - 6 runs, 67% failed, 25% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.12-e2e-vsphere-ovn-techpreview-serial (all) - 9 runs, 56% failed, 20% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-metal-ipi-upgrade (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.12-upgrade-from-stable-4.11-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-okd-4.12-e2e-vsphere (all) - 25 runs, 100% failed, 4% of failures match = 4% impact
release-openshift-ocp-installer-e2e-gcp-serial-4.12 (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
It doesn't seem like the reason is getting set?
$ curl -s 'https://search.ci.openshift.org/search?name=periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade&search=clusteroperator%2Fmachine-config+should+not+change+condition%2FAvailable&maxAge=48h&type=junit&context=15' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | grep 'clusteroperator/machine-config condition/Available status/False reason'
Aug 31 01:13:56.724 - 698s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-30-194744}]
Aug 31 09:09:15.460 - 1078s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-30-194744}]
Sep 01 03:31:24.808 - 1131s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-31-111359}]
Sep 01 07:15:58.029 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-31-111359}]
Example runs in the job I've randomly selected to drill into:
$ curl -s 'https://search.ci.openshift.org/search?name=periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade&search=clusteroperator%2Fmachine-config+should+not+change+condition%2FAvailable&maxAge=48h&type=junit' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1564757706458271744
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1564879945233076224
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1565158084484009984
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1565212566194491392
Drilling into that last run, the Available=False was the whole pool-update phase:
And details from the origin's monitor:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-ovn-upgrade/1565212566194491392/artifacts/e2e-aws-ovn-upgrade/openshift-e2e-test/build-log.txt | grep clusteroperator/machine-config
Sep 01 07:15:57.629 E clusteroperator/machine-config condition/Degraded status/True reason/RenderConfigFailed changed: Failed to resync 4.12.0-0.ci-2022-08-31-111359 because: refusing to read osImageURL version "4.12.0-0.ci-2022-09-01-053740", operator version "4.12.0-0.ci-2022-08-31-111359"
Sep 01 07:15:57.629 - 49s E clusteroperator/machine-config condition/Degraded status/True reason/Failed to resync 4.12.0-0.ci-2022-08-31-111359 because: refusing to read osImageURL version "4.12.0-0.ci-2022-09-01-053740", operator version "4.12.0-0.ci-2022-08-31-111359"
Sep 01 07:15:58.029 E clusteroperator/machine-config condition/Available status/False changed: Cluster not available for [{operator 4.12.0-0.ci-2022-08-31-111359}]
Sep 01 07:15:58.029 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [{operator 4.12.0-0.ci-2022-08-31-111359}]
Sep 01 07:16:47.000 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator started a version change from [{operator 4.12.0-0.ci-2022-08-31-111359}] to [{operator 4.12.0-0.ci-2022-09-01-053740}]
Sep 01 07:16:47.377 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.12.0-0.ci-2022-09-01-053740
Sep 01 07:16:47.377 - 1037s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.12.0-0.ci-2022-09-01-053740
Sep 01 07:16:47.405 W clusteroperator/machine-config condition/Degraded status/False changed:
Sep 01 07:18:02.614 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
Sep 01 07:34:03.000 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator version changed from [{operator 4.12.0-0.ci-2022-08-31-111359}] to [{operator 4.12.0-0.ci-2022-09-01-053740}]
Sep 01 07:34:03.699 W clusteroperator/machine-config condition/Available status/True changed: Cluster has deployed [{operator 4.12.0-0.ci-2022-08-31-111359}]
Sep 01 07:34:03.715 W clusteroperator/machine-config condition/Upgradeable status/True changed:
Sep 01 07:34:04.065 I clusteroperator/machine-config versions: operator 4.12.0-0.ci-2022-08-31-111359 -> 4.12.0-0.ci-2022-09-01-053740
Sep 01 07:34:04.663 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.12.0-0.ci-2022-09-01-053740
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Degraded
No idea if whatever was happening there is the same thing that was happening in other runs, and I haven't checked 4.11 and earlier either. The test case is non-fatal, so it doesn't break CI, but it can cause noise like ClusterOperatorDown if it continues for 10 or more minutes. Which PromeCIeus says actually fired in this run, although apparently the origin monitors didn't notice to complain:
So parallel asks (and I'm happy to shard into separate bugs, if that's helpful):
Description of problem:
ovnkube-trace fails on hypershift deployments:
https://bugzilla.redhat.com/show_bug.cgi?id=2066891#c8
getDatabaseURIs looks for pods with container ovnkube-master, and those don't exist in hypershift.
https://github.com/ovn-org/ovn-kubernetes/blob/6b8acf05cb6043ebdc42d9d36e700390baabea4a/go-controller/cmd/ovnkube-trace/ovnkube-trace.go#L540
~~~
// Returns nbAddress, sbAddress, protocol == "ssl", nil
func getDatabaseURIs(coreclient *corev1client.CoreV1Client, restconfig *rest.Config, ovnNamespace string) (string, string, bool, error) {
	containerName := "ovnkube-master"
	var err error
	found := false
	var podName string
	listOptions := metav1.ListOptions{}
	pods, err := coreclient.Pods(ovnNamespace).List(context.TODO(), listOptions)
	if err != nil {
		// error handling elided in this excerpt
	}
	for _, pod := range pods.Items {
		for _, container := range pod.Spec.Containers {
			if container.Name == containerName {
				// records the matching pod name; body elided in this excerpt
			}
		}
	}
	if !found {
		// bails out when no ovnkube-master container exists; body elided in this excerpt
	}
	// ...
}
~~~
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
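A hedged sketch of one possible fix shape: instead of hard-coding a single container name, the lookup could scan a list of candidates. Everything below is illustrative, including the hypershift-style container name, which is an assumption rather than the actual fix:
~~~
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// findOvnkubePod scans pods for any of several candidate container names
// instead of hard-coding "ovnkube-master".
func findOvnkubePod(pods []corev1.Pod) (string, bool) {
	candidates := map[string]bool{
		"ovnkube-master":        true,
		"ovnkube-control-plane": true, // assumed name; hypershift uses a different layout
	}
	for _, pod := range pods {
		for _, container := range pod.Spec.Containers {
			if candidates[container.Name] {
				return pod.Name, true
			}
		}
	}
	return "", false
}

func main() {
	pods := []corev1.Pod{} // in ovnkube-trace this comes from coreclient.Pods(ns).List(...)
	name, found := findOvnkubePod(pods)
	fmt.Println(name, found)
}
~~~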
This is a clone of issue OCPBUGS-3123. The following is the description of the original issue:
—
Description of problem:
Support for tech preview API extensions was introduced in https://github.com/openshift/installer/pull/6336 and https://github.com/openshift/api/pull/1274. In the case of https://github.com/openshift/api/pull/1278, config/v1/0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml was introduced, which results in both 0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml and 0000_10_config-operator_01_infrastructure-Default.crd.yaml being rendered by the bootstrap. As a result, both CRDs are created during bootstrap; however, one of them (in this case the tech preview CRD) fails to be created. We may need to modify the render command to be aware of feature gates when rendering manifests during bootstrap. I'm also open to hearing other views on how this might work.
Version-Release number of selected component (if applicable):
https://github.com/openshift/cluster-config-operator/pull/269 built and running on 4.12-ec5
How reproducible:
consistently
Steps to Reproduce:
1. Bump the version of the OpenShift API to one including a tech preview version of the infrastructure CRD.
2. Install OpenShift with the infrastructure manifest modified to incorporate tech preview fields.
3. Those fields will not be populated upon installation. Also, checking the logs from bootkube will show both CRDs being installed, but one of them fails.
Actual results:
Expected results:
Additional info:
Excerpts from bootkube log:
Nov 02 20:40:01 localhost.localdomain bootkube.sh[4216]: Writing asset: /assets/config-bootstrap/manifests/0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml
Nov 02 20:40:01 localhost.localdomain bootkube.sh[4216]: Writing asset: /assets/config-bootstrap/manifests/0000_10_config-operator_01_infrastructure-Default.crd.yaml
Nov 02 20:41:23 localhost.localdomain bootkube.sh[5710]: Created "0000_10_config-operator_01_infrastructure-Default.crd.yaml" customresourcedefinitions.v1.apiextensions.k8s.io/infrastructures.config.openshift.io -n
Nov 02 20:41:23 localhost.localdomain bootkube.sh[5710]: Skipped "0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml" customresourcedefinitions.v1.apiextensions.k8s.io/infrastructures.config.openshift.io -n as it already exists
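One way the render command could become feature-gate aware is to filter manifests on their feature-set annotation before writing assets. A minimal sketch, assuming the release.openshift.io/feature-set annotation convention from openshift/api; this is not the actual render code:
~~~
package main

import (
	"fmt"
	"strings"
)

// shouldRender skips a manifest whose feature-set annotation does not match
// the cluster's FeatureSet. The annotation key is an assumption here.
func shouldRender(annotations map[string]string, clusterFeatureSet string) bool {
	fs, ok := annotations["release.openshift.io/feature-set"]
	if !ok {
		return true // un-annotated manifests apply to every feature set
	}
	for _, v := range strings.Split(fs, ",") {
		if v == clusterFeatureSet {
			return true
		}
	}
	return false
}

func main() {
	techPreview := map[string]string{"release.openshift.io/feature-set": "TechPreviewNoUpgrade"}
	fmt.Println(shouldRender(techPreview, "Default"))              // false: skip during bootstrap
	fmt.Println(shouldRender(techPreview, "TechPreviewNoUpgrade")) // true
}
~~~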
Description of problem:
When creating a pod with an additional network that contains a `spec.config.ipam.exclude` range, any address within the excluded range is still iterated while searching for a suitable IP candidate. As a result, pod creation times out when large exclude ranges are used.
Version-Release number of selected component (if applicable):
How reproducible:
with big exclude ranges, 100%
Steps to Reproduce:
1. Create a network-attachment-definition with a large exclude range:
$ cat <<EOF | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: nad-w-excludes
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "macvlan-net",
      "type": "macvlan",
      "master": "ens3",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "fd43:01f1:3daa:0baa::/64",
        "exclude": [ "fd43:01f1:3daa:0baa::/100" ],
        "log_file": "/tmp/whereabouts.log",
        "log_level": "debug"
      }
    }
EOF
2. Create a pod with the network attached:
$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-exclude-range
  annotations:
    k8s.v1.cni.cncf.io/networks: nad-w-excludes
spec:
  containers:
  - name: pod-1
    image: openshift/hello-openshift
EOF
3. Check pod status, event log and whereabouts logs after a while:
$ oc get pods
NAME                     READY   STATUS              RESTARTS   AGE
pod-with-exclude-range   0/1     ContainerCreating   0          2m23s
$ oc get events
<...>
6m39s   Normal    Scheduled                pod/pod-with-exclude-range   Successfully assigned default/pod-with-exclude-range to <worker-node>
6m37s   Normal    AddedInterface           pod/pod-with-exclude-range   Add eth0 [10.129.2.49/23] from openshift-sdn
2m39s   Warning   FailedCreatePodSandBox   pod/pod-with-exclude-range   Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
$ oc debug node/<worker-node> - tail /host/tmp/whereabouts.log
Starting pod/<worker-node>-debug ...
To use host binaries, run `chroot /host`
2022-10-27T14:14:50Z [debug] Finished leader election
2022-10-27T14:14:50Z [debug] IPManagement: {fd43:1f1:3daa:baa::1 ffffffffffffffff0000000000000000} , <nil>
2022-10-27T14:14:59Z [debug] Used defaults from parsed flat file config @ /etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.conf
2022-10-27T14:14:59Z [debug] ADD - IPAM configuration successfully read: {Name:macvlan-net Type:whereabouts Routes:[] Datastore:kubernetes Addresses:[] OmitRanges:[fd43:01f1:3daa:0baa::/80] DNS:{Nameservers:[] Domain: Search:[] Options:[]} Range:fd43:1f1:3daa:baa::/64 RangeStart:fd43:1f1:3daa:baa:: RangeEnd:<nil> GatewayStr: EtcdHost: EtcdUsername: EtcdPassword:********* EtcdKeyFile: EtcdCertFile: EtcdCACertFile: LeaderLeaseDuration:1500 LeaderRenewDeadline:1000 LeaderRetryPeriod:500 LogFile:/tmp/whereabouts.log LogLevel:debug OverlappingRanges:true SleepForRace:0 Gateway:<nil> Kubernetes:{KubeConfigPath:/etc/kubernetes/cni/net.d/whereabouts.d/whereabouts.kubeconfig K8sAPIRoot:} ConfigurationPath: PodName:pod-with-exclude-range PodNamespace:default}
2022-10-27T14:14:59Z [debug] Beginning IPAM for ContainerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82
2022-10-27T14:14:59Z [debug] Started leader election
2022-10-27T14:14:59Z [debug] OnStartedLeading() called
2022-10-27T14:14:59Z [debug] Elected as leader, do processing
2022-10-27T14:14:59Z [debug] IPManagement - mode: 0 / containerID: f4ffd0e07d6c1a2b6ffb0fa29910c795258792bb1a1710ff66f6b48fab37af82 / podRef: default/pod-with-exclude-range
2022-10-27T14:14:59Z [debug] IterateForAssignment input >> ip: fd43:1f1:3daa:baa:: | ipnet: {fd43:1f1:3daa:baa:: ffffffffffffffff0000000000000000} | first IP: fd43:1f1:3daa:baa::1 | last IP: fd43:1f1:3daa:baa:ffff:ffff:ffff:ffff
Actual results:
Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Expected results:
additional network gets attached to the pod
Additional info:
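For illustration of the remedy this report implies, skipping an entire excluded CIDR in one jump instead of testing each of its addresses, here is a self-contained sketch. The helper names are mine, not whereabouts' implementation:
~~~
package main

import (
	"fmt"
	"net"
)

// lastIP returns the highest address inside the given network.
func lastIP(n *net.IPNet) net.IP {
	out := make(net.IP, len(n.IP))
	for i := range n.IP {
		out[i] = n.IP[i] | ^n.Mask[i]
	}
	return out
}

// nextIP returns ip+1, carrying across byte boundaries.
func nextIP(ip net.IP) net.IP {
	out := make(net.IP, len(ip))
	copy(out, ip)
	for i := len(out) - 1; i >= 0; i-- {
		out[i]++
		if out[i] != 0 {
			break
		}
	}
	return out
}

func main() {
	_, exclude, _ := net.ParseCIDR("fd43:01f1:3daa:0baa::/100")
	candidate := net.ParseIP("fd43:1f1:3daa:baa::1")
	if exclude.Contains(candidate) {
		// Jump past the whole excluded block in one step instead of
		// iterating each of its 2^28 addresses individually.
		candidate = nextIP(lastIP(exclude))
	}
	fmt.Println(candidate) // first address after the excluded range
}
~~~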
When installing OCP cluster with worker nodes VM type specified as high performance, some of the configuration settings of said VMs do not match the configuration settings a high performance VM should have.
Specific configurations that do not match are described in subtasks.
Default configuration settings of high performance VMs:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/virtual_machine_management_guide/index?extIdCarryOver=true&sc_cid=701f2000001Css5AAC#Configuring_High_Performance_Virtual_Machines_Templates_and_Pools
When installing an OCP cluster with the worker node VM type specified as high performance, both manual and automatic migration are enabled on the said VMs.
However, high performance worker VMs are created with the engine's default values, so only manual migration should be enabled.
Default configuration settings of high performance VMs:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html-single/virtual_machine_management_guide/index?extIdCarryOver=true&sc_cid=701f2000001Css5AAC#Configuring_High_Performance_Virtual_Machines_Templates_and_Pools
How reproducible: 100%
How to reproduce:
1. Create install-config.yaml with a vmType field and set it to high performance, i.e.:
apiVersion: v1
baseDomain: basedomain.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    ovirt:
      affinityGroupsNames: []
      vmType: high_performance
  replicas: 2
...
2. Run installation
./openshift-install create cluster --dir=resources --log-level=debug
3. Check worker VM's configuration in the RHV webconsole.
Expected:
Only manual migration (under Host) should be enabled.
Actual:
Manual and automatic migration is enabled.
In https://github.com/openshift/installer/pull/6237 we are setting the version to v1alpha1, since we are not committing to not making further changes.
Before shipping in an official release we must update to at least v1beta1, or preferably v1.
This is a clone of issue OCPBUGS-2500. The following is the description of the original issue:
—
Description of problem:
When the Ux switches to the Dev console the topology is always blank in a Project that has a large number of components.
Version-Release number of selected component (if applicable):
How reproducible:
Always occurs
Steps to Reproduce:
1. Create a project with at least 12 components (Apps, Operators, Knative Brokers).
2. Go to the Administrator viewpoint.
3. Switch to the Developer viewpoint/Topology.
4. No components are displayed.
5. Click on 'fit to screen'.
6. All components appear.
Actual results:
Topology renders with all controls but no components visible (see screenshot 1)
Expected results:
All components should be visible
Additional info:
This is a clone of issue OCPBUGS-1695. The following is the description of the original issue:
—
Update initial FCOS used in OKD to 36.20220906.3.2
Description of problem:
On MicroShift, the Route API is served by kube-apiserver as a CRD. Reusing the same defaulting implementation as vanilla OpenShift through a patch to kube-apiserver is expected to resolve OCPBUGS-4189 but have no detectable effect on OCP.
Additional info:
This patch will be inert on OCP, but is implemented in openshift/kubernetes because MicroShift ingests kube-apiserver through its build-time dependency on openshift/kubernetes.
This is a clone of issue OCPBUGS-2598. The following is the description of the original issue:
—
Description of problem:
Liveness probes of the ipsec pods fail on large clusters. Currently the command executed in the ipsec container is:
ovs-appctl -t ovs-monitor-ipsec ipsec/status && ipsec status
The problem is with the "ipsec/status" subcommand. In clusters with a high node count, this command returns a list with all the node daemons of the cluster, so as the node count rises, the completion time of the command rises too.
This makes the main command
ovs-appctl -t ovs-monitor-ipsec
hang until the subcommand is finished.
As the liveness and readiness probe values are hardcoded in the manifest of the ipsec container here: https://github.com/openshift/cluster-network-operator/blob/9c1181e34316d34db49d573698d2779b008bcc20/bindata/network/ovn-kubernetes/common/ipsec.yaml, the 60-second liveness timeout of the container probe starts to be insufficient as the node list grows. This resulted in a cluster with 170+ nodes having 15+ ipsec pods in a CrashLoopBackOff state.
Version-Release number of selected component (if applicable):
Openshift Container Platform 4.10 but i think the same will be visible to other versions too.
How reproducible:
I was not able to reproduce this because an extremely high amount of resources is needed, and I think there is no point since we have spotted the issue.
Steps to Reproduce:
1. Install an OpenShift cluster with IPsec enabled.
2. Scale to 170+ nodes or more.
3. Notice that the ipsec pods start getting into a CrashLoopBackOff state with failed liveness/readiness probes.
Actual results:
IPsec pods are stuck in a CrashLoopBackOff state.
Expected results:
IPsec pods work normally.
Additional info:
We have provided a workaround where the CVO and CNO operators are scaled to 0 replicas so that we can increase the liveness probe timeout to a value of 600, which recovered the cluster. As a next step, the customer will try to reduce the node count and restore the default liveness timeout value, along with bringing the operators back, to see if the cluster stabilizes.
This is a clone of issue OCPBUGS-3114. The following is the description of the original issue:
—
Description of problem:
When running a Hosted Cluster on Hypershift the cluster-networking-operator never progressed to Available despite all the components being up and running
Version-Release number of selected component (if applicable):
Hosted clusters: quay.io/openshift-release-dev/ocp-release:4.11.11-x86_64
Hypershift operator: quay.io/hypershift/hypershift-operator:4.11
Management cluster: 4.11.9
How reproducible:
Happened once
Steps to Reproduce:
1. 2. 3.
Actual results:
oc get co network reports False availability
Expected results:
oc get co network reports True availability
Additional info:
Create a script that gathers debug information from a host running the agent ISO and exports it in a standard format so that we can ask customers to provide it for debugging when something has gone wrong (and also use it in CI).
For now, it is fine to require the user to ssh into the host to run the script. The script should already be in place inside the agent ISO.
The output should probably be a compressed tar file. That file could be saved locally, or potentially piped to stdout so that a user only has to run a command like: ssh node0 -c agent-gather >node0.tgz
Things we need to collect:
I'd disabled Telemetry for the bulk of the CI fleet in OTA-740. But that led to many failures for:
[sig-instrumentation] Prometheus when installed on the cluster should report telemetry if a cloud.openshift.com token is present [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
We should extend the checks for Telemetry enablement to include telemeterClient.enabled in the monitoring-specific ConfigMap, as well as the previously-checked pull-secret token.
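A minimal sketch of what the additional check could look like, assuming the usual cluster-monitoring-config layout (a config.yaml key with a telemeterClient.enabled field); the test's real implementation may differ:
~~~
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// monitoringConfig models just the part of the cluster-monitoring-config
// ConfigMap's config.yaml that matters here.
type monitoringConfig struct {
	TelemeterClient struct {
		Enabled *bool `json:"enabled"`
	} `json:"telemeterClient"`
}

// telemetryEnabledInConfig returns false only when the admin explicitly
// disabled the telemeter client; absence of the key means "enabled".
func telemetryEnabledInConfig(configYAML string) (bool, error) {
	var cfg monitoringConfig
	if err := yaml.Unmarshal([]byte(configYAML), &cfg); err != nil {
		return false, err
	}
	if cfg.TelemeterClient.Enabled != nil && !*cfg.TelemeterClient.Enabled {
		return false, nil
	}
	return true, nil
}

func main() {
	// In the real check this string would come from the config.yaml key of
	// the cluster-monitoring-config ConfigMap in openshift-monitoring.
	enabled, err := telemetryEnabledInConfig("telemeterClient:\n  enabled: false\n")
	fmt.Println(enabled, err) // false <nil>
}
~~~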
This is a clone of issue OCPBUGS-5733. The following is the description of the original issue:
—
Description of problem:
Descriptions of parameters are not shown on the PipelineRun details page.
Version-Release number of selected component (if applicable):
Openshift Pipelines 1.9.0 OCP 4.12
How reproducible:
Always
Steps to Reproduce:
1. Create a pipeline with parameters and add descriptions to the params.
2. Start the pipeline and navigate to the created PipelineRun.
3. Select the Parameters tab and check the descriptions of the params.
Actual results:
The description field of the params is empty.
Expected results:
Descriptions of the params should be present.
Additional info:
Description of problem:
Image registry pods panic while deploying OCP in ap-south-2 AWS region
Version-Release number of selected component (if applicable):
4.11.2
How reproducible:
Deploy OCP in AWS ap-south-2 region
Steps to Reproduce:
Deploy OCP in AWS ap-south-2 region
Actual results:
panic: Invalid region provided: ap-south-2
Expected results:
Image registry pods should come up with no errors
Additional info:
This is a clone of issue OCPBUGS-6018. The following is the description of the original issue:
—
This is a public clone of OCPBUGS-3821
The MCO can sometimes render a rendered-config in the middle of an upgrade with old MCs, e.g.:
This will cause the render controller to create a new rendered MC that uses the OLD kubeletconfig-MC, which at best is a double reboot for 1 node, and at worst block the update and break maxUnavailable nodes per pool.
Description of problem: As discovered in https://issues.redhat.com/browse/OCPBUGS-2795, gophercloud fails to list swift containers when the endpoint speaks HTTP2. This means that CIRO will provision a 100GB cinder volume even though swift is available to the tenant.
For example, we're seeing this behavior in our CI on vexxhost.
The gophercloud commit that fixed this issue is https://github.com/gophercloud/gophercloud/commit/b7d5b2cdd7ffc13e79d924f61571b0e5f74ec91c, specifically the `|| ct == ""` part on line 75 of openstack/objectstorage/v1/containers/results.go. This commit made it in gophercloud v0.18.0.
CIRO still depends on gophercloud v0.17.0. We should bump gophercloud to fix the bug.
Version-Release number of selected component (if applicable):
All versions. Fix should go to 4.8 - 4.12.
How reproducible:
Always, when swift speaks HTTP2.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
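For reference, a simplified sketch of the shape of the gophercloud fix (not the library's actual code): the listing parser must accept an empty Content-Type as JSON, since the HTTP2 endpoint in question returned no Content-Type header:
~~~
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// classifyListing mirrors the shape of the gophercloud v0.18.0 fix in
// simplified form: an empty Content-Type is treated as JSON instead of
// being rejected.
func classifyListing(resp *http.Response) (string, error) {
	ct := resp.Header.Get("Content-Type")
	switch {
	case strings.HasPrefix(ct, "application/json"), ct == "":
		return "json", nil
	case strings.HasPrefix(ct, "text/plain"):
		return "plain", nil
	default:
		return "", fmt.Errorf("unrecognized Content-Type %q", ct)
	}
}

func main() {
	resp := &http.Response{Header: http.Header{}} // header without Content-Type
	kind, err := classifyListing(resp)
	fmt.Println(kind, err) // "json <nil>" with the fix; an error without it
}
~~~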
Discovered in the must gather kubelet_service.log from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-gcp-sdn-upgrade/1586093220087992320
It appears the guard pod names are too long, and being truncated down to where they will collide with those from the other masters.
From kubelet logs in this run:
❯ grep openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste kubelet_service.log
Oct 28 23:58:55.693391 ci-op-3hj6pnwf-4f6ab-lv57z-master-1 kubenswrapper[1657]: E1028 23:58:55.693346 1657 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-1" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:03.735726 ci-op-3hj6pnwf-4f6ab-lv57z-master-0 kubenswrapper[1670]: E1028 23:59:03.735671 1670 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-0" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
Oct 28 23:59:11.168082 ci-op-3hj6pnwf-4f6ab-lv57z-master-2 kubenswrapper[1667]: E1028 23:59:11.168041 1667 kubelet_pods.go:413] "Hostname for pod was too long, truncated it" podName="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master-2" hostnameMaxLen=63 truncatedHostname="openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-maste"
This also looks to be happening for openshift-kube-scheduler-guard, kube-controller-manager-guard, possibly others.
Looks like they should be truncated further to make room for random suffixes in https://github.com/openshift/library-go/blame/bd9b0e19121022561dcd1d9823407cd58b2265d0/pkg/operator/staticpod/controller/guard/guard_controller.go#L97-L98
Unsure of the implications here, it looks a little scary.
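As a sketch of the kind of change implied, the base name could be truncated further so a unique per-node suffix always fits within the 63-character DNS label limit. Purely illustrative; not library-go's actual logic:
~~~
package main

import "fmt"

const maxHostnameLen = 63

// guardHostname reserves room for a unique per-node suffix so truncated
// names cannot collide across masters.
func guardHostname(base, uniqueSuffix string) string {
	budget := maxHostnameLen - len(uniqueSuffix) - 1 // 1 for the separator
	if len(base) > budget {
		base = base[:budget]
	}
	return base + "-" + uniqueSuffix
}

func main() {
	base := "openshift-kube-scheduler-guard-ci-op-3hj6pnwf-4f6ab-lv57z-master"
	fmt.Println(guardHostname(base, "0")) // each result stays distinct per node
	fmt.Println(guardHostname(base, "1"))
	fmt.Println(guardHostname(base, "2"))
}
~~~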
The path used by --rotated-pod-logs to gather the rotated pod logs from the /var/log/pods node folder via /api/v1/nodes/${NODE}/proxy/logs/${LOG_PATH} is only valid for regular pods, but not for static pods.
The main problem is that, while normal pods have their rotated logs at this /var/log/pods/${POD_NAME}_${POD_UID_IN_API}/${CONTAINER_NAME}, static pods have them at /var/log/pods/${POD_NAME}_${CONFIG_HASH}/${CONTAINER_NAME} because the UID cannot be known at the time that the static pod is born (because static pods are created by kubelet before registering them in the kube-apiserver, and UID is assigned by the kube-apiserver).
The visible results of that are:
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always if there are static pods.
Steps to Reproduce:
1. oc adm inspect --rotated-pod-logs ns/openshift-etcd (or any other project with static pods).
Actual results:
error: errors occurred while gathering data: one or more errors occurred while gathering pod-specific data for namespace: openshift-etcd [one or more errors occurred while gathering container data for pod etcd-master-0.example.net: the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-1.example.net: the server could not find the requested resource, one or more errors occurred while gathering container data for pod etcd-master-2.example.net: the server could not find the requested resource]
Expected results:
No errors like the ones above, and rotated pod logs to be gathered, if present.
Additional info:
Despite being marked as experimental, --rotated-pod-logs is used in must-gather, so this issue can be easily reproduced by just running a default must-gather. I focused on bare oc adm inspect reproducers for simplicity.
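One conceivable direction, shown only as a hypothetical sketch (names and layout assumptions mine, following the /var/log/pods/${POD_NAME}_${SUFFIX} layout described above): resolve the on-disk log directory by globbing for the pod name instead of composing it from the API UID, so static pods' config-hash directories also match:
~~~
package main

import (
	"fmt"
	"path/filepath"
)

// rotatedLogDirs finds every log directory for the named pod, regardless of
// whether the suffix is the API UID (regular pods) or a config hash (static
// pods). Illustrative only; the kubelet proxy API path layout still applies.
func rotatedLogDirs(podsRoot, podName string) ([]string, error) {
	pattern := filepath.Join(podsRoot, podName+"_*")
	return filepath.Glob(pattern)
}

func main() {
	dirs, err := rotatedLogDirs("/var/log/pods", "etcd-master-0.example.net")
	fmt.Println(dirs, err)
}
~~~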
Description of problem:
The ovn-kubernetes ovnkube-master containers are continuously crashlooping since we updated to 4.11.0-0.okd-2022-10-15-073651.
Log Excerpt:
] [] [] [{kubectl-client-side-apply Update networking.k8s.io/v1 2022-09-12 12:25:06 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:kubectl.kubernetes.io/last-applied-configuration":{}}},"f:spec":{"f:ingress":{},"f:policyTypes":{}}} }]},Spec:NetworkPolicySpec{PodSelector:{map[] []},Ingress:[]NetworkPolicyIngressRule{NetworkPolicyIngressRule{Ports:[]NetworkPolicyPort{},From:[]NetworkPolicyPeer{NetworkPolicyPeer{PodSelector:&v1.LabelSelector{MatchLabels:map[string]string{access: true,},MatchExpressions:[]LabelSelectorRequirement{},},NamespaceSelector:nil,IPBlock:nil,},},},},Egress:[]NetworkPolicyEgressRule{},PolicyTypes:[Ingress],},} &NetworkPolicy{ObjectMeta:{allow-from-openshift-ingress compsci-gradcentral a405f843-c250-40d7-8dd4-a759f764f091 217304038 1 2022-09-22 14:36:38 +0000 UTC <nil> <nil> map[] map[] [] [] [{openshift-apiserver Update networking.k8s.io/v1 2022-09-22 14:36:38 +0000 UTC FieldsV1 {"f:spec":{"f:ingress":{},"f:policyTypes":{}}} }]},Spec:NetworkPolicySpec{PodSelector:{map[] []},Ingress:[]NetworkPolicyIngressRule{NetworkPolicyIngressRule{Ports:[]NetworkPolicyPort{},From:[]NetworkPolicyPeer{NetworkPolicyPeer{PodSelector:nil,NamespaceSelector:&v1.LabelSelector{MatchLabels:map[string]string{policy-group.network.openshift.io/ingress: ,},MatchExpressions:[]LabelSelectorRequirement{},},IPBlock:nil,},},},},Egress:[]NetworkPolicyEgressRule{},PolicyTypes:[Ingress],},}]: cannot clean up egress default deny ACL name: error in transact with ops [{Op:mutate Table:Port_Group Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6}]}}] Timeout:<nil> Where:[where column _uuid == {ccdd01bf-3009-42fb-9672-e1df38190cd7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Port_Group Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:60cb946a-46e9-4623-9ba4-3cb35f018ed6}]}}] Timeout:<nil> Where:[where column _uuid == {10bbf229-8c1b-4c62-b36e-4ba0097722db}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:ACL Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {7b55ba0c-150f-4a63-9601-cfde25f29408}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:ACL Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {60cb946a-46e9-4623-9ba4-3cb35f018ed6}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete ACL row 7b55ba0c-150f-4a63-9601-cfde25f29408 because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete ACL row 7b55ba0c-150f-4a63-9601-cfde25f29408 because of 1 remaining reference(s)
Additional info:
https://github.com/okd-project/okd/issues/1372
Issue persisted through update to 4.11.0-0.okd-2022-10-28-153352.
must-gather: https://nbc9-snips.cloud.duke.edu/snips/must-gather.local.2859117512952590880.zip
This is a clone of issue OCPBUGS-5184. The following is the description of the original issue:
—
Description of problem:
Failed to deploy an IPI Azure cluster with the region set to westus3 and the VM type set to NV8as_v4. The master node shows as running in the Azure portal, but SSH login is not possible. The serial log shows the error below:
[ 3009.547219] amdgpu d1ef:00:00.0: amdgpu: failed to write reg:de0
[ 3011.982399] mlx5_core 6637:00:02.0 enP26167s1: TX timeout detected
[ 3011.987010] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 0, SQ: 0x170, CQ: 0x84d, SQ Cons: 0x823 SQ Prod: 0x840, usecs since last trans: 2418884000
[ 3011.996946] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 1, SQ: 0x175, CQ: 0x852, SQ Cons: 0x248c SQ Prod: 0x24a7, usecs since last trans: 2148366000
[ 3012.006980] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 2, SQ: 0x17a, CQ: 0x857, SQ Cons: 0x44a1 SQ Prod: 0x44c0, usecs since last trans: 2055000000
[ 3012.016936] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 3, SQ: 0x17f, CQ: 0x85c, SQ Cons: 0x405f SQ Prod: 0x4081, usecs since last trans: 1913890000
[ 3012.026954] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 4, SQ: 0x184, CQ: 0x861, SQ Cons: 0x39f2 SQ Prod: 0x3a11, usecs since last trans: 2020978000
[ 3012.037208] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 5, SQ: 0x189, CQ: 0x866, SQ Cons: 0x1784 SQ Prod: 0x17a6, usecs since last trans: 2185513000
[ 3012.047178] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 6, SQ: 0x18e, CQ: 0x86b, SQ Cons: 0x4c96 SQ Prod: 0x4cb3, usecs since last trans: 2124353000
[ 3012.056893] mlx5_core 6637:00:02.0 enP26167s1: TX timeout on queue: 7, SQ: 0x193, CQ: 0x870, SQ Cons: 0x3bec SQ Prod: 0x3c0f, usecs since last trans: 1855857000
[ 3021.535888] amdgpu d1ef:00:00.0: amdgpu: failed to write reg:e15
[ 3021.545955] BUG: unable to handle kernel paging request at ffffb57b90159000
[ 3021.550864] PGD 100145067 P4D 100145067 PUD 100146067 PMD 0
From the Azure doc https://learn.microsoft.com/en-us/azure/virtual-machines/nvv4-series, it looks like the NVv4 series only supports Windows VMs.
Version-Release number of selected component (if applicable):
4.12 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config.yaml, setting the region to westus3 and the VM type to NV8as_v4.
2. Install the cluster.
3.
Actual results:
installation failed
Expected results:
If the NVv4 series is not supported for Linux VMs, the installer should validate this and report that the instance size is not supported.
Additional info:
In order to support 4.12 there needs to be an entry for OS_IMAGES in images.env.template.
Note that the actual URL isn't important, just that there is an entry for 4.12.
Description of problem:
Pod and PDB list pages just report "Not found" when no resources are found.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-15-094115
How reproducible:
Always
Steps to Reproduce:
1. A normal user has a new empty project.
2. The normal user visits the PDB list page via Workloads -> PodDisruptionBudgets.
3.
Actual results:
2. it just reports 'Not found'
Expected results:
2. For other workloads, it reports "No <resource> found", for example: No HorizontalPodAutoscalers found, No StatefulSets found, No Deployments found. So for the Pods and PodDisruptionBudgets list pages, when no resource can be found, it would be better to also report "No Pods found" and "No PodDisruptionBudgets found".
Additional info:
libovsdb builds transaction log messages for every transaction and then throws them away if the log level is not 4 or above. This wastes a bunch of CPU at scale and increases pod ready latency.
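The standard remedy is to guard the expensive message construction behind the verbosity check so it is skipped entirely at lower levels. A minimal sketch using klog for illustration; libovsdb's own logger differs:
~~~
package main

import (
	"flag"
	"fmt"

	"k8s.io/klog/v2"
)

// expensiveDescription stands in for libovsdb's per-transaction message
// building; in the real code this is the costly part worth skipping.
func expensiveDescription(ops []string) string {
	out := ""
	for _, op := range ops {
		out += fmt.Sprintf("op=%s ", op)
	}
	return out
}

func main() {
	klog.InitFlags(nil)
	flag.Parse()

	ops := []string{"insert", "mutate", "delete"}
	// Only build the message when -v=4 or higher is in effect, so the
	// formatting cost disappears at default verbosity.
	if klog.V(4).Enabled() {
		klog.V(4).Infof("transaction: %s", expensiveDescription(ops))
	}
}
~~~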
Description of problem:
Data race seen in unit tests: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1448/pull-ci-openshift-ovn-kubernetes-release-4.11-unit/1604898712423763968/artifacts/test/build-log.txt
This is a clone of issue OCPBUGS-6647. The following is the description of the original issue:
—
Description of problem:
Resource type drop-down menu item 'Last used' is in English
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. Navigate to kube:admin -> User Preferences -> Applications.
2. Click on the Resource type drop-down.
Actual results:
Content is in English
Expected results:
Content should be in target language
Additional info:
Screenshot reference provided
This is a clone of issue OCPBUGS-10622. The following is the description of the original issue:
—
Description of problem:
Unit test failing:
=== RUN   TestNewAppRunAll/app_generation_using_context_dir
    newapp_test.go:907: app generation using context dir: Error mismatch! Expected <nil>, got supplied context directory '2.0/test/rack-test-app' does not exist in 'https://github.com/openshift/sti-ruby'
--- FAIL: TestNewAppRunAll/app_generation_using_context_dir (0.61s)
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1376/pull-ci-openshift-oc-master-images/1638172620648091648
Actual results:
unit tests fail
Expected results:
TestNewAppRunAll unit test should pass
Additional info:
The dependency on openshift/api needs to be bumped in openshift/kubernetes in order to pull in the fix from https://issues.redhat.com/browse/OCPBUGS-3635.
This is a clone of issue OCPBUGS-2260. The following is the description of the original issue:
—
TRT-594 investigates failed CI upgrade runs due to the alert KubePodNotReady firing. The case was a pod getting skipped over for scheduling across two successive master node update/restarts. The case was determined valid, so the ask is for the monitoring to be aware that master nodes are restarting and that scheduling may be delayed. Presuming we don't want to change the existing tolerance for the non-master-node restart cases, could we suppress the alert during those restarts and fall back to a second alert with increased tolerances only during those restarts, if we have metrics indicating we are restarting? Or similar, if there are better ways to handle this.
The scenario is:
Description of problem:
The TestReloadInterval E2E test has completely wrong validations, in which the min value should be 1s, not 5s. But there is a race condition which allows these tests to sometimes pass due to the last test condition. Therefore, failures in CI are actually correct, and successes are wrong based on the E2E conditions.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
50%
Steps to Reproduce:
1.Run TestReloadInterval E2E test (make test-e2e TEST=TestReloadInterval)
Actual results:
Sometimes fails on 5us test case: reloadinterval_test.go:106: router deployment not updated with RELOAD_INTERVAL=5s: timed out waiting for the condition
Expected results:
Should pass E2E
Additional info:
Description of problem:
A nil-pointer dereference occurred in the TestRouterCompressionOperation test in the e2e-gcp-operator CI job for the openshift/cluster-ingress-operator repository.
Version-Release number of selected component (if applicable):
4.12.
How reproducible:
Observed once. However, we run e2e-gcp-operator infrequently.
Steps to Reproduce:
1. Run the e2e-gcp-operator CI job on a cluster-ingress-operator PR.
Actual results:
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x14cabef]
goroutine 8048 [running]:
testing.tRunner.func1.2({0x1624920, 0x265b870})
	/usr/lib/golang/src/testing/testing.go:1389 +0x24e
testing.tRunner.func1()
	/usr/lib/golang/src/testing/testing.go:1392 +0x39f
panic({0x1624920, 0x265b870})
	/usr/lib/golang/src/runtime/panic.go:838 +0x207
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x40e43e5698?})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xd8
panic({0x1624920, 0x265b870})
	/usr/lib/golang/src/runtime/panic.go:838 +0x207
github.com/openshift/cluster-ingress-operator/test/e2e.getHttpHeaders(0xc0002b9380?, 0xc0000e4540, 0x1)
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/router_compression_test.go:257 +0x2ef
github.com/openshift/cluster-ingress-operator/test/e2e.testContentEncoding.func1()
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/router_compression_test.go:220 +0x57
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x18, 0xc00003f000})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:222 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x1b25d40?, 0xc000138000?}, 0xc000befe08?)
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:235 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x1b25d40, 0xc000138000}, 0x48?, 0xc4fa25?, 0x30?)
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:582 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateWithContext({0x1b25d40, 0xc000138000}, 0xc000b1da00?, 0xc000befe98?, 0x414207?)
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:528 +0x4a
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0xc00088cea0?, 0x3b9aca00?, 0xc000138000?)
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:514 +0x50
github.com/openshift/cluster-ingress-operator/test/e2e.testContentEncoding(0xc00088cea0, 0xc000a8a270, 0xc0000e4540, 0x1, {0x17fe569, 0x4})
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/router_compression_test.go:219 +0xfc
github.com/openshift/cluster-ingress-operator/test/e2e.TestRouterCompressionOperation(0xc00088cea0)
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/router_compression_test.go:208 +0x454
testing.tRunner(0xc00088cea0, 0x191cdd0)
	/usr/lib/golang/src/testing/testing.go:1439 +0x102
created by testing.(*T).Run
	/usr/lib/golang/src/testing/testing.go:1486 +0x35f
Expected results:
The test should pass.
Additional info:
The faulty logic was introduced in https://github.com/openshift/cluster-ingress-operator/pull/679/commits/211b9c15b1fd6217dee863790c20f34c26c138aa.
The test was subsequently marked as a parallel test in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc.
The job history shows that the e2e-gcp-operator job has only run once since June: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator. I see failures in May, but none of those failures shows the panic.
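The usual hardening for this failure mode, sketched below under assumed names (not the actual test code): a poll condition should treat a failed request as "retry" rather than dereferencing a possibly-nil response:
~~~
package main

import (
	"fmt"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	// The condition never dereferences the response unless the request
	// succeeded, so transient failures lead to a retry, not a panic.
	err := wait.PollImmediate(time.Second, 30*time.Second, func() (bool, error) {
		resp, err := client.Get("https://example.com/")
		if err != nil || resp == nil {
			return false, nil // transient failure: retry instead of panicking
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK, nil
	})
	fmt.Println(err)
}
~~~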
Description of problem:
Get the below error when upgrading to OCP 4.12 from 4.9->4.10->4.11.
MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-08-24-091058   True        True          4h      Unable to apply 4.12.0-0.nightly-2022-08-24-053339: the workload openshift-operator-lifecycle-manager/package-server-manager cannot roll out
- lastTransitionTime: "2022-08-25T04:47:36Z"
  lastUpdateTime: "2022-08-25T04:47:36Z"
  message: 'pods "package-server-manager-85b6dc4d89-sdzcc" is forbidden: violates PodSecurity "restricted:v1.24": seccompProfile (pod or container "package-server-manager" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")'
  reason: FailedCreate
  status: "True"
  type: ReplicaFailure
Version-Release number of selected component (if applicable):
MacBook-Pro:~ jianzhang$ oc exec catalog-operator-c5c655d5c-b9lcn -- olm --version
OLM version: 0.19.0
git commit: 8a984d41acc67c0bc9bfe807fadeef23f83abd44
How reproducible:
always
Steps to Reproduce:
1. Install OCP 4.11.0-0.nightly-2022-08-24-091058
2. Upgrade it to 4.12.0-0.nightly-2022-08-24-053339
Actual results:
The cluster upgrading is blocked. Get the above errors as described.
Expected results:
Upgrade to 4.12 from old OCP versions such as 4.5 and 4.9 succeeds.
Additional info:
MacBook-Pro:~ jianzhang$ oc get deployment package-server-manager -o yaml apiVersion: apps/v1 kind: Deployment metadata: annotations: deployment.kubernetes.io/revision: "5" include.release.openshift.io/ibm-cloud-managed: "true" include.release.openshift.io/self-managed-high-availability: "true" include.release.openshift.io/single-node-developer: "true" creationTimestamp: "2022-08-25T00:14:08Z" generation: 5 labels: app: package-server-manager name: package-server-manager namespace: openshift-operator-lifecycle-manager ownerReferences: - apiVersion: config.openshift.io/v1 kind: ClusterVersion name: version uid: 3fd29082-0e76-4b09-988e-78cb5fc7c8b5 resourceVersion: "169028" uid: c8f7cbe2-4f82-40ce-9468-817ffefa903f spec: progressDeadlineSeconds: 600 replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: app: package-server-manager strategy: rollingUpdate: maxSurge: 25% maxUnavailable: 25% type: RollingUpdate template: metadata: annotations: target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}' creationTimestamp: null labels: app: package-server-manager spec: containers: - args: - --name - $(PACKAGESERVER_NAME) - --namespace - $(PACKAGESERVER_NAMESPACE) command: - /bin/psm - start env: - name: PACKAGESERVER_NAME value: packageserver - name: PACKAGESERVER_IMAGE value: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d49e1e27114f4b719bc8f3c222b2c5934d3b8028c79ec8e2bd288f6e9b5b3d5c - name: PACKAGESERVER_NAMESPACE valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.namespace - name: RELEASE_VERSION value: 4.12.0-0.nightly-2022-08-24-053339 image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d49e1e27114f4b719bc8f3c222b2c5934d3b8028c79ec8e2bd288f6e9b5b3d5c imagePullPolicy: IfNotPresent livenessProbe: failureThreshold: 3 httpGet: path: /healthz port: 8080 scheme: HTTP initialDelaySeconds: 30 periodSeconds: 10 successThreshold: 1 timeoutSeconds: 1 name: package-server-manager readinessProbe: failureThreshold: 3 httpGet: path: /healthz port: 8080 scheme: HTTP initialDelaySeconds: 30 periodSeconds: 10 successThreshold: 1 timeoutSeconds: 1 resources: requests: cpu: 10m memory: 50Mi securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL terminationMessagePath: /dev/termination-log terminationMessagePolicy: FallbackToLogsOnError dnsPolicy: ClusterFirst nodeSelector: kubernetes.io/os: linux node-role.kubernetes.io/master: "" priorityClassName: system-cluster-critical restartPolicy: Always schedulerName: default-scheduler securityContext: runAsNonRoot: true serviceAccount: olm-operator-serviceaccount serviceAccountName: olm-operator-serviceaccount terminationGracePeriodSeconds: 30 tolerations: - effect: NoSchedule key: node-role.kubernetes.io/master operator: Exists - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 120 - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 120 status: availableReplicas: 1 conditions: - lastTransitionTime: "2022-08-25T03:14:20Z" lastUpdateTime: "2022-08-25T03:14:20Z" message: Deployment has minimum availability. 
reason: MinimumReplicasAvailable status: "True" type: Available - lastTransitionTime: "2022-08-25T04:47:36Z" lastUpdateTime: "2022-08-25T04:47:36Z" message: 'pods "package-server-manager-85b6dc4d89-sdzcc" is forbidden: violates PodSecurity "restricted:v1.24": seccompProfile (pod or container "package-server-manager" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")' reason: FailedCreate status: "True" type: ReplicaFailure - lastTransitionTime: "2022-08-25T04:57:37Z" lastUpdateTime: "2022-08-25T04:57:37Z" message: ReplicaSet "package-server-manager-85b6dc4d89" has timed out progressing. reason: ProgressDeadlineExceeded status: "False" type: Progressing observedGeneration: 5 readyReplicas: 1 replicas: 1 unavailableReplicas: 1
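For reference, a minimal sketch of the field the PodSecurity violation message above asks for, built with the corev1 types; how OLM actually wires this into the deployment may differ:
~~~
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func boolPtr(b bool) *bool { return &b }

func main() {
	// To satisfy the "restricted" PodSecurity profile, the pod (or each
	// container) must declare a RuntimeDefault or Localhost seccomp profile.
	podSC := &corev1.PodSecurityContext{
		RunAsNonRoot: boolPtr(true),
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
	}
	fmt.Printf("%+v\n", podSC)
}
~~~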
This is a clone of issue OCPBUGS-6503. The following is the description of the original issue:
—
Description of problem:
While looking into OCPBUGS-5505 I discovered that some 4.10->4.11 upgrade job runs perform an Admin Ack check, while some do not. 4.11 has an ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs sometimes test that Upgradeable goes false after the upgrade, and sometimes they do not. This is only determined by the polling race condition: the check is executed once per 10 minutes, and we cancel the polling after the upgrade is completed. This means that in some cases we are lucky and manage to run one check before the cancel, and sometimes we are not, and we only check while still on the base version.
Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Jan 6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired
...
Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480
$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Version-Release number of selected component (if applicable):
4.11+ openshift-tests
How reproducible:
nondeterministic, wild guess is ~30% of upgrade jobs
Steps to Reproduce:
1. Inspect the E2E test log of an upgrade job and compare the time of the update ("Completed upgrade") with the time of the last check ("Skipping admin ack", "Gate .* not applicable to current version", "Admin Ack verified") done by the admin ack test.
Actual results:
Jan 23 00:47:43.842: INFO: Admin Ack verified
Jan 23 00:57:43.836: INFO: Admin Ack verified
Jan 23 01:07:43.839: INFO: Admin Ack verified
Jan 23 01:17:33.474: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z09ll8fw/release@sha256:322cf67dc00dd6fa4fdd25c3530e4e75800f6306bd86c4ad1418c92770d58ab8
No check done after the upgrade
Expected results:
Jan 23 00:57:37.894: INFO: Admin Ack verified
Jan 23 01:07:37.894: INFO: Admin Ack verified
Jan 23 01:16:43.618: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-z8h5x1c5/release@sha256:9c4c732a0b4c2ae887c73b35685e52146518e5d2b06726465d99e6a83ccfee8d
Jan 23 01:17:57.937: INFO: Admin Ack verified
One or more checks done after upgrade
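One possible fix shape, sketched with assumed names (not the actual openshift-tests code): guarantee at least one post-upgrade evaluation by running the check once more after the periodic polling is cancelled:
~~~
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkAdminAck stands in for the real test's periodic verification.
func checkAdminAck(ctx context.Context) {
	fmt.Println("admin ack checked at", time.Now())
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	done := make(chan struct{})
	go func() {
		defer close(done)
		// Periodic checks during the upgrade; returns when ctx is cancelled.
		wait.UntilWithContext(ctx, checkAdminAck, 10*time.Minute)
		// Guarantee at least one evaluation on the post-upgrade version,
		// regardless of where the 10-minute ticker happened to be.
		checkAdminAck(context.Background())
	}()

	// ... upgrade completes ...
	cancel()
	<-done
}
~~~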
In order to start 4.12 development, we need to merge the agent-installer branch. We need to create a PR and engage the Installer team on getting it approved.
Description of problem:
When all projects are selected, the workloads list page and details page show inconsistent HorizontalPodAutoscaler actions.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-07-25-010250
How reproducible:
Always
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Description of problem: Knative tests were disabled due to https://issues.redhat.com/browse/OCPBUGS-190 to unblock the queue, and should be enabled again.
https://coreos.slack.com/archives/C6A3NV5J9/p1660659719046909
https://github.com/openshift/console/pull/11956#discussion_r948075848
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Description of problem:
Container networking pods cannot access the host network pods on another node, which caused some operators to be DEGRADED.
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.0-0.nightly-2022-10-23-204408 False True True 63m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jhou.arm.eng.rdu2.redhat.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
baremetal 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
cloud-controller-manager 4.12.0-0.nightly-2022-10-23-204408 True False False 68m
cloud-credential 4.12.0-0.nightly-2022-10-23-204408 True False False 78m
cluster-autoscaler 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
config-operator 4.12.0-0.nightly-2022-10-23-204408 True False False 63m
console 4.12.0-0.nightly-2022-10-23-204408 False False False 30m RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.jhou.arm.eng.rdu2.redhat.com): Get "https://console-openshift-console.apps.jhou.arm.eng.rdu2.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
control-plane-machine-set 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
csi-snapshot-controller 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
dns 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
etcd 4.12.0-0.nightly-2022-10-23-204408 False True True 13m EtcdMembersAvailable: 1 of 2 members are available, openshift-qe-048.arm.eng.rdu2.redhat.com is unhealthy
image-registry 4.12.0-0.nightly-2022-10-23-204408 True False False 39m
ingress 4.12.0-0.nightly-2022-10-23-204408 True False True 47m The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
insights 4.12.0-0.nightly-2022-10-23-204408 True False False 56m
kube-apiserver 4.12.0-0.nightly-2022-10-23-204408 True False False 50m
kube-controller-manager 4.12.0-0.nightly-2022-10-23-204408 True False True 60m GarbageCollectorDegraded: error querying alerts: client_error: client error: 403
kube-scheduler 4.12.0-0.nightly-2022-10-23-204408 True False False 54m
kube-storage-version-migrator 4.12.0-0.nightly-2022-10-23-204408 True False False 63m
machine-api 4.12.0-0.nightly-2022-10-23-204408 True False False 51m
machine-approver 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
machine-config 4.12.0-0.nightly-2022-10-23-204408 True False False 29m
marketplace 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
monitoring 4.12.0-0.nightly-2022-10-23-204408 True False False 38m
network 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
node-tuning 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
openshift-apiserver 4.12.0-0.nightly-2022-10-23-204408 True False False 30m
openshift-controller-manager 4.12.0-0.nightly-2022-10-23-204408 True False False 56m
openshift-samples 4.12.0-0.nightly-2022-10-23-204408 True False False 43m
operator-lifecycle-manager 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
operator-lifecycle-manager-catalog 4.12.0-0.nightly-2022-10-23-204408 True False False 62m
operator-lifecycle-manager-packageserver 4.12.0-0.nightly-2022-10-23-204408 True False False 43m
service-ca 4.12.0-0.nightly-2022-10-23-204408 True False False 63m
storage 4.12.0-0.nightly-2022-10-23-204408 True False False 63m
$ oc get pod -n openshift-ingress -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
router-default-58f6498646-gf6ns 1/1 Running 1 (79m ago) 93m 2620:52:0:1eb:3673:5aff:fe9e:5abc openshift-qe-049.arm.eng.rdu2.redhat.com <none> <none>
router-default-58f6498646-qjtbk 1/1 Running 1 (79m ago) 93m 2620:52:0:1eb:3673:5aff:fe9e:593c openshift-qe-052.arm.eng.rdu2.redhat.com <none> <none>
$ oc get pod -n openshift-network-diagnostics -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
network-check-source-5f967d78bc-cfwz4 1/1 Running 0 103m fd01:0:0:3::9 openshift-qe-052.arm.eng.rdu2.redhat.com <none> <none>
network-check-target-52krv 1/1 Running 0 91m fd01:0:0:4::3 openshift-qe-049.arm.eng.rdu2.redhat.com <none> <none>
network-check-target-56q9q 1/1 Running 0 91m fd01:0:0:3::5 openshift-qe-052.arm.eng.rdu2.redhat.com <none> <none>
network-check-target-ggqsf 1/1 Running 0 103m fd01:0:0:2::4 openshift-qe-048.arm.eng.rdu2.redhat.com <none> <none>
network-check-target-xfrq4 1/1 Running 0 103m fd01:0:0:1::3 openshift-qe-047.arm.eng.rdu2.redhat.com <none> <none>
network-check-target-zrglr 1/1 Running 0 73m fd01:0:0:6::4 openshift-qe-051.arm.eng.rdu2.redhat.com <none> <none>
network-check-target-zwb4t 1/1 Running 0 91m fd01:0:0:5::5 openshift-qe-053.arm.eng.rdu2.redhat.com <none> <none>
#### Failed from containers pod on openshift-qe-053.arm.eng.rdu2.redhat.com to access ingress pods
$ oc rsh -n openshift-network-diagnostics network-check-target-zwb4t
sh-4.4$ curl https://[2620:52:0:1eb:3673:5aff:fe9e:5abc]:443 -k -I
^C
sh-4.4$ curl https://[2620:52:0:1eb:3673:5aff:fe9e:593c]:443 -k -I
^C
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-23-204408
How reproducible:
always
Steps to Reproduce:
1. Deploy an IPv6 disconnected single cluster.
2.
3.
Actual results:
Expected results:
Additional info:
Description of problem:
The current version of openshift/cluster-ingress-operator vendors Kubernetes 1.25 packages. OpenShift 4.13 is based on Kubernetes 1.26.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.13/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.25
Expected results:
Kubernetes packages are at version v0.26.0 or later.
Additional info:
Using old Kubernetes API and client packages brings a risk of API compatibility issues. Also, Gateway API depends on v0.26, so we are required to bump in order to support our Enhanced Dev Preview activities.
Description of problem:
If a customer creates a machine with a networks section like this:

networks:
- filter: {}
  noAllowedAddressPairs: false
  subnets:
  - filter: {}
    uuid: primary-subnet-uuid
- filter: {}
  noAllowedAddressPairs: true
  subnets:
  - filter: {}
    uuid: other-subnet-uuid
primarySubnet: primary-subnet-uuid

then all the ports are created without the allowed address pairs. Doing some research in the source code, I have found that:
- For each entry in the networks: section, networks are filtered per its filter: section [1].
- Then, if the subnets: section of the network entry is not empty, for each of the network IDs found above [2], two things relevant to this situation are done:
  - The net ID is saved in netsWithoutAllowedAddressPairs [3]. That map is later checked while creating any port [4].
  - For each subnet entry that matches the network ID, a port is created [5].

So the problematic behavior happens because:
- Both entries in the networks array have empty filters. This means that both entries select all the Neutron networks.
- This configuration still results in one port per subnet, as expected, because the later traversal of each entry's subnets array [5] filters by subnet and creates a single port.
- However, the entry with "noAllowedAddressPairs: true" selects all the Neutron networks, so it adds all of them to the netsWithoutAllowedAddressPairs map [3], regardless of the subnet filtering.
- Since all the networks end up in the noAllowedAddressPairs set, all the ports created for the VM have their allowed address pairs removed [4].

Why do we consider this behavior undesired? It is understandable that, if we create a port for a network that has no allowed pairs, we create all the other ports on the same network without the pairs. However, it is surprising that a port on a network has its allowed address pairs removed due to a setting in an entry that yielded no port on that network. In other words, one would expect the same subnet filtering that determines which ports are created for the VM to also apply to the noAllowedAddressPairs parameter.
Version-Release number of selected component (if applicable):
4.10.30
How reproducible:
Always
Steps to Reproduce:
1. Create a machineset with a networks section as in the description
Actual results:
All ports have no address pairs
Expected results:
Only the port on the secondary subnet has no address pairs.
Additional info:
A simple workaround would be to just fill in the filter so that a single network is selected for each network entry.

References:
[1] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L576
[2] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L580
[3] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L581-L583
[4] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L658-L660
[5] https://github.com/openshift/cluster-api-provider-openstack/blob/f6b51710d4f395ded401347589447f5f41dd5c4c/pkg/cloud/openstack/clients/machineservice.go#L610-L625
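The filtering mismatch is easier to see in code. Below is a minimal Go sketch (hypothetical types and names, not the actual provider code) contrasting the current per-network bookkeeping with the per-subnet bookkeeping one would expect:

```go
package main

import "fmt"

// networkParam is a stand-in for one entry of the networks: array
// after its filters have been resolved.
type networkParam struct {
	networkIDs            []string // networks matched by filter: ({} matches all)
	subnetIDs             []string // subnets matched by the subnets: filters
	noAllowedAddressPairs bool
}

func main() {
	entries := []networkParam{
		{networkIDs: []string{"netA", "netB"}, subnetIDs: []string{"primary-subnet"}},
		{networkIDs: []string{"netA", "netB"}, subnetIDs: []string{"other-subnet"}, noAllowedAddressPairs: true},
	}

	// Current behavior: the flag is recorded for every *network* the entry
	// matched, even though ports are only created per matched *subnet*.
	netsWithoutPairs := map[string]bool{}
	for _, e := range entries {
		if e.noAllowedAddressPairs {
			for _, n := range e.networkIDs {
				netsWithoutPairs[n] = true // netA AND netB, despite the subnet filter
			}
		}
	}
	fmt.Println("networks stripped of address pairs:", netsWithoutPairs)

	// Expected behavior: key the flag on the subnets that actually yield ports,
	// so only the port on other-subnet loses its allowed address pairs.
	subnetsWithoutPairs := map[string]bool{}
	for _, e := range entries {
		if e.noAllowedAddressPairs {
			for _, s := range e.subnetIDs {
				subnetsWithoutPairs[s] = true
			}
		}
	}
	fmt.Println("subnets stripped of address pairs:", subnetsWithoutPairs)
}
```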
Description of problem:
ControlPlaneMachineSet couldn't be created after deleting a machineset
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-25-210451
How reproducible:
Always
Steps to Reproduce:
1. Delete a machineset
$ oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
mihuangawstest-kjbx9-worker-us-east-2a 1 1 1 1 110m
mihuangawstest-kjbx9-worker-us-east-2b 1 1 1 1 110m
mihuangawstest-kjbx9-worker-us-east-2c 1 1 1 1 110m
$ oc delete machineset mihuangawstest-kjbx9-worker-us-east-2c
machineset.machine.openshift.io "mihuangawstest-kjbx9-worker-us-east-2c" deleted
$ oc get machineset
NAME DESIRED CURRENT READY AVAILABLE AGE
mihuangawstest-kjbx9-worker-us-east-2a 1 1 1 1 120m
mihuangawstest-kjbx9-worker-us-east-2b 1 1 1 1 120m
$ oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 3 Active 110m
2. Delete the controlplanemachineset
$ oc delete controlplanemachineset cluster
controlplanemachineset.machine.openshift.io "cluster" deleted
3. Check whether a new controlplanemachineset is created
$ oc get controlplanemachineset
No resources found in openshift-machine-api namespace.
Actual results:
No new controlplanemachineset was created.

E1026 05:34:55.422062 1 controller.go:326] "msg"="Reconciler error" "error"="error reconciling control plane machine set: unable to create control plane machine set: unable to create control plane machine set: admission webhook \"controlplanemachineset.machine.openshift.io\" denied the request: spec.template.machines_v1beta1_machine_openshift_io.failureDomains: Forbidden: control plane machines are using unspecified failure domain(s) [AWSFailureDomain{AvailabilityZone:us-east-2c, Subnet:{Type:Filters, Value:&[{Name:tag:Name Values:[mihuangawstest-kjbx9-private-us-east-2c]}]}}]" "controller"="controlplanemachinesetgenerator" "reconcileID"="02af2bd5-fa3e-4bb5-8693-c3f5368f4da7"
Expected results:
When the controlplanemachineset is deleted, a new controlplanemachineset should be created.
Additional info:
Also, on a cluster with 5 masters and 3 workers, if the controlplanemachineset is deleted directly without deleting a machineset, the controlplanemachineset can be created and reports "Observed 2 index(es) in excess":

$ oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 5 5 5 Inactive 5s

- lastTransitionTime: "2022-10-26T07:59:14Z"
  message: Observed 2 index(es) in excess
  reason: ExcessIndexes
  status: "True"
  type: Degraded
The Linux kernel was updated to include steal accounting:
https://lkml.org/lkml/2020/3/20/1030
This would greatly assist in troubleshooting vSphere performance issues caused by over-provisioned ESXi hosts.
This is a clone of issue OCPBUGS-4941. The following is the description of the original issue:
—
Description of problem: This is a follow-up to OCPBUGS-3933.
The installer fails to destroy the cluster when the OpenStack object storage omits 'content-type' from responses, and a container is empty.
Version-Release number of selected component (if applicable):
4.8.z
How reproducible:
Likely not happening in customer environments where Swift is exposed directly. We're seeing the issue in our CI where we're using a non-RHOSP managed cloud.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-996. The following is the description of the original issue:
—
Description of problem:
When using the OnDelete update method of the ControlPlaneMachineSet, it should not be possible to have multiple machines in the Running phase in the same machine index. E.g., if machine-1 is in the Running phase, we should not have a machine-replacement-1 also in the Running phase.
Version-Release number of selected component (if applicable):
4.12 / main branch
How reproducible:
Unsure; this is currently not tested in the code and is difficult to reproduce
Steps to Reproduce:
1. Set up a cluster with CPMS in OnDelete update mode
2. Rename one of the master machines to have the same index as another, or manually create a machine to match (this step might be difficult to reproduce)
3. Observe logs from the CPMS operator
Actual results:
No errors are emitted about the extra machine, although perhaps others are. The operator does not degrade.
Expected results:
an error should be produced and the operator should go degraded
Additional info:
This bug is slightly predictive: we have not observed this condition, but we have detected a gap in the code that might make it possible.
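A minimal sketch of the missing check, using hypothetical types (the real operator's machine and phase types differ):

```go
package main

import "fmt"

type machine struct {
	Name  string
	Index int
	Phase string
}

// runningConflicts returns the indexes that hold more than one Running machine,
// which is the invariant the OnDelete strategy should enforce.
func runningConflicts(machines []machine) []int {
	running := map[int]int{}
	for _, m := range machines {
		if m.Phase == "Running" {
			running[m.Index]++
		}
	}
	var bad []int
	for idx, n := range running {
		if n > 1 {
			bad = append(bad, idx)
		}
	}
	return bad
}

func main() {
	ms := []machine{
		{"master-1", 1, "Running"},
		{"machine-replacement-1", 1, "Running"}, // should trigger Degraded
		{"master-2", 2, "Running"},
	}
	if bad := runningConflicts(ms); len(bad) > 0 {
		fmt.Println("degraded: duplicate Running machines at indexes", bad)
	}
}
```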
Tracker bug for bootimage bump in 4.12. This bug should block bugs which need a bootimage bump to fix.
Description of problem:
Have 6 runs of techpreview jobs where the jobs fails due to the MCO:
{Operator degraded (RequiredPoolsFailed): Unable to apply 4.12.0-0.ci.test-2022-09-21-183414-ci-op-qd6plyhc-latest: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)] Operator degraded (RequiredPoolsFailed): Unable to apply 4.12.0-0.ci.test-2022-09-21-183414-ci-op-qd6plyhc-latest: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)]}
Looking at the MCD logs, the master seems to go degraded during bootstrap due to the rendered config not being found?
I0921 18:49:47.091804 8171 daemon.go:444] Node ci-op-qd6plyhc-6dd9a-bfmjd-master-1 is part of the control plane
I0921 18:49:49.213556 8171 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node ci-op-qd6plyhc-6dd9a-bfmjd-master-1: map[csi.volume.kubernetes.io/nodeid: {"pd.csi.storage.gke.io":"projects/openshift-gce-devel-ci-2/zones/us-central1-b/instances/ci-op-qd6plyhc-6dd9a-bfmjd-master-1"} volumes.kubernetes.io/controller-managed-attach-detach:true], in cluster bootstrap, loading initial node annotation from /etc/machine-config-daemon/node-annotations.json
I0921 18:49:49.215186 8171 node.go:45] Setting initial node config: rendered-master-2dde32327e4e5d15092fccbac1dcec49
I0921 18:49:49.253706 8171 daemon.go:1184] In bootstrap mode
E0921 18:49:49.254046 8171 writer.go:200] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-2dde32327e4e5d15092fccbac1dcec49" not found
I0921 18:49:51.232610 8171 daemon.go:499] Transitioned from state: Done -> Degraded
I0921 18:49:51.249618 8171 daemon.go:1184] In bootstrap mode
E0921 18:49:51.249906 8171 writer.go:200] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-2dde32327e4e5d15092fccbac1dcec49" not found
However, looking at the controller, a rendered config was generated correctly, but it's not the missing config from above:
I0921 18:54:06.736984 1 render_controller.go:506] Generated machineconfig rendered-master-acc8491aafab8ef511a40b76372325ee from 6 configs: [{MachineConfig 00-master machineconfiguration.openshift.io/v1 } {MachineConfig 01-master-container-runtime machineconfiguration.openshift.io/v1 } {MachineConfig 01-master-kubelet machineconfiguration.openshift.io/v1 } {MachineConfig 98-master-generated-kubelet machineconfiguration.openshift.io/v1 } {MachineConfig 99-master-generated-registries machineconfiguration.openshift.io/v1 } {MachineConfig 99-master-ssh machineconfiguration.openshift.io/v1 }]
I0921 18:54:06.737226 1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"b2084ca6-4b33-46bf-b83b-9e98010ff085", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"5648", FieldPath:""}): type: 'Normal' reason: 'RenderedConfigGenerated' rendered-master-acc8491aafab8ef511a40b76372325ee successfully generated (release version: 4.12.0-0.ci.test-2022-09-21-183220-ci-op-9ksj7d7g-latest, controller version: a627415c240b4c7dd2f9e90f659690d9c0f623f3)
I0921 18:54:06.742053 1 render_controller.go:532] Pool master: now targeting: rendered-master-acc8491aafab8ef511a40b76372325ee
So far I see this in the following techpreview jobs:
GCP techpreview
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-ci-4.12-e2e-gcp-sdn-techpreview/1572638837954318336
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-ci-4.12-e2e-gcp-sdn-techpreview-serial/1572638838793179136
Vsphere techpreview
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-nightly-4.12-e2e-vsphere-ovn-techpreview/1572638854794448896
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-nightly-4.12-e2e-vsphere-ovn-techpreview-serial/1572638855574589440
AWS Techpreview:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-ci-4.12-e2e-aws-sdn-techpreview/1572638828672323584
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-kubernetes-1360-ci-4.12-e2e-aws-sdn-techpreview-serial/1572638829217583104
The failures above affect the k8s 1.25 bump and are blocking those jobs.
There are also other occurrences not in our PR:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/31965/rehearse-31965-pull-ci-openshift-openshift-controller-manager-master-openshift-e2e-aws-builds-techpreview/1572581504297472000
Also see a quick search:
https://search.ci.openshift.org/?search=timed+out+waiting+for+the+condition%2C+error+pool+master+is+not+ready&maxAge=48h&context=1&type=bug%2Bissue%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Did something change that would affect tech preview jobs?
Also note, this seems like a new failure; some of these jobs were passing within the last ~8 days.
Description of problem:
The OnDelete update strategy does not work when master machines are not indexed as 0, 1, 2
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-10-015203
How reproducible:
always
Steps to Reproduce:
1. Change the master machine names to 3, 4, 5:
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-awso-cnr4j-master-3 Running m6i.xlarge us-east-2 us-east-2a 44m
huliu-awso-cnr4j-master-4 Running m6i.xlarge us-east-2 us-east-2b 33m
huliu-awso-cnr4j-master-5 Running m6i.xlarge us-east-2 us-east-2c 17m
huliu-awso-cnr4j-worker-us-east-2a-7p5c9 Running m6i.xlarge us-east-2 us-east-2a 173m
huliu-awso-cnr4j-worker-us-east-2b-fmk56 Running m6i.xlarge us-east-2 us-east-2b 173m
huliu-awso-cnr4j-worker-us-east-2c-w6n78 Running m6i.xlarge us-east-2 us-east-2c 173m

2. Create a CPMS with the OnDelete update strategy:
liuhuali@Lius-MacBook-Pro huali-test % oc create -f cpms3.yaml
controlplanemachineset.machine.openshift.io/cluster created
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 3 Active 60s
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-awso-cnr4j-master-3 Running m6i.xlarge us-east-2 us-east-2a 45m
huliu-awso-cnr4j-master-4 Running m6i.xlarge us-east-2 us-east-2b 34m
huliu-awso-cnr4j-master-5 Running m6i.xlarge us-east-2 us-east-2c 19m
huliu-awso-cnr4j-worker-us-east-2a-7p5c9 Running m6i.xlarge us-east-2 us-east-2a 174m
huliu-awso-cnr4j-worker-us-east-2b-fmk56 Running m6i.xlarge us-east-2 us-east-2b 174m
huliu-awso-cnr4j-worker-us-east-2c-w6n78 Running m6i.xlarge us-east-2 us-east-2c 174m
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
control-plane-machine-set 4.12.0-0.nightly-2022-10-10-015203 True False False 165m

3. Edit the CPMS and change instanceType to another value, here m5.2xlarge:
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-awso-cnr4j-master-3 Running m6i.xlarge us-east-2 us-east-2a 46m
huliu-awso-cnr4j-master-4 Running m6i.xlarge us-east-2 us-east-2b 35m
huliu-awso-cnr4j-master-5 Running m6i.xlarge us-east-2 us-east-2c 19m
huliu-awso-cnr4j-worker-us-east-2a-7p5c9 Running m6i.xlarge us-east-2 us-east-2a 175m
huliu-awso-cnr4j-worker-us-east-2b-fmk56 Running m6i.xlarge us-east-2 us-east-2b 175m
huliu-awso-cnr4j-worker-us-east-2c-w6n78 Running m6i.xlarge us-east-2 us-east-2c 175m
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 Active 114s
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
control-plane-machine-set 4.12.0-0.nightly-2022-10-10-015203 True True False 167m Observed 3 replica(s) in need of update

4. Delete a master machine:
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-awso-cnr4j-master-3
machine.machine.openshift.io "huliu-awso-cnr4j-master-3" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-awso-cnr4j-master-3 Deleting m6i.xlarge us-east-2 us-east-2a 49m
huliu-awso-cnr4j-master-4 Running m6i.xlarge us-east-2 us-east-2b 38m
huliu-awso-cnr4j-master-5 Running m6i.xlarge us-east-2 us-east-2c 22m
huliu-awso-cnr4j-worker-us-east-2a-7p5c9 Running m6i.xlarge us-east-2 us-east-2a 177m
huliu-awso-cnr4j-worker-us-east-2b-fmk56 Running m6i.xlarge us-east-2 us-east-2b 177m
huliu-awso-cnr4j-worker-us-east-2c-w6n78 Running m6i.xlarge us-east-2 us-east-2c 177m

liuhuali@Lius-MacBook-Pro huali-test % oc logs control-plane-machine-set-operator-75d75dccbd-r4ftb
…
I1013 11:41:52.932759 1 status.go:111] "msg"="Observed Machine Configuration" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "observedGeneration"=2 "readyReplicas"=3 "reconcileID"="f8ebe234-5e1d-469c-a2d0-808ecb785ad1" "replicas"=3 "unavailableReplicas"=0 "updatedReplicas"=0
E1013 11:41:52.932951 1 updates.go:441] "msg"="Error creating machine" "error"="error creating new Machine for index 0: could not get provider config for index 0: cannot inject failure domain in the provider config: failure domain is nil" "controller"="controlplanemachineset" "index"=3 "name"="huliu-awso-cnr4j-master-3" "namespace"="openshift-machine-api" "reconcileID"="f8ebe234-5e1d-469c-a2d0-808ecb785ad1" "updateStrategy"="OnDelete"
I1013 11:41:52.933317 1 controller.go:178] "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="f8ebe234-5e1d-469c-a2d0-808ecb785ad1"
E1013 11:41:52.933353 1 controller.go:326] "msg"="Reconciler error" "error"="error reconciling control plane machine set: error reconciling machines: error reconciling machine updates: error creating new Machine for index 0: could not get provider config for index 0: cannot inject failure domain in the provider config: failure domain is nil" "controller"="controlplanemachineset" "reconcileID"="f8ebe234-5e1d-469c-a2d0-808ecb785ad1"
I1013 11:42:33.894074 1 controller.go:128] "msg"="Reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="be3b87fa-ffff-47f9-a4c4-a13cc077a897"
E1013 11:42:33.894609 1 provider.go:242] "msg"="Unknown Index" "error"="could not find failure domain for index: unknown index 3" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="be3b87fa-ffff-47f9-a4c4-a13cc077a897"
Actual results:
The OnDelete update strategy does not work when master machines are not indexed as 0, 1, 2
Expected results:
The OnDelete update strategy should work when master machines are not indexed as 0, 1, 2
Additional info:
The RollingUpdate update strategy works correctly when master machines are not indexed as 0, 1, 2
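For illustration, here is a minimal sketch (hypothetical, not the operator's actual mapping code) of why a strict index-to-failure-domain lookup fails for indexes 3, 4, 5 while a tolerant mapping would not:

```go
package main

import "fmt"

func main() {
	failureDomains := []string{"us-east-2a", "us-east-2b", "us-east-2c"}

	for _, idx := range []int{3, 4, 5} {
		if idx >= len(failureDomains) {
			// Strict lookup, matching the log above: "unknown index 3".
			fmt.Printf("index %d: could not find failure domain (strict lookup)\n", idx)
		}
		// A tolerant mapping that still spreads machines across domains.
		fmt.Printf("index %d: %s (modulo mapping)\n", idx, failureDomains[idx%len(failureDomains)])
	}
}
```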
Found when running the resource watcher: these keep updating with no real changes, just moving conditions around. Likely needs bugs for all three.
This is a clone of issue OCPBUGS-4181. The following is the description of the original issue:
—
Description of problem:
After configuring a webhook receiver in Alertmanager to send alerts to an external tool, a customer noticed that the alerts they receive have the source "https:///<console-url>" (note the 3 slashes).
Version-Release number of selected component (if applicable):
OCP 4.10
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
https:///<console-url>
Expected results:
https://<console-url>
Additional info:
After investigating I discovered that the problem might be in the CMO code:
→ oc get Alertmanager main -o yaml | grep externalUrl
  externalUrl: https:/console-openshift-console.apps.jakumar-2022-11-27-224014.devcluster.openshift.com/monitoring
→ oc get Prometheus k8s -o yaml | grep externalUrl
  externalUrl: https:/console-openshift-console.apps.jakumar-2022-11-27-224014.devcluster.openshift.com/monitoring
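One hedged guess at how a single slash like this arises in Go code: joining a URL with path.Join collapses the double slash after the scheme. A small sketch, illustrative only and not necessarily the CMO's actual code:

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

func main() {
	host := "console-openshift-console.apps.example.com"

	// Buggy pattern: path.Join treats the URL as a filesystem path and
	// collapses the "//" after the scheme.
	fmt.Println(path.Join("https://"+host, "monitoring"))
	// -> https:/console-openshift-console.apps.example.com/monitoring

	// Safer pattern: build the URL with net/url so the scheme survives.
	u := &url.URL{Scheme: "https", Host: host, Path: "/monitoring"}
	fmt.Println(u.String())
	// -> https://console-openshift-console.apps.example.com/monitoring
}
```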
This bug is a backport clone of [Bugzilla Bug 2073220](https://bugzilla.redhat.com/show_bug.cgi?id=2073220). The following is the description of the original bug:
—
Description of problem:
Version-Release number of selected component (if applicable): 4.*
How reproducible: always
Steps to Reproduce:
1. Set audit profile to WriteRequestBodies
2. Wait for api server rollout to complete
3. tail -f /var/log/kube-apiserver/audit.log | grep routes/status
Actual results:
Write events to routes/status are recorded at the RequestResponse level, which often includes keys and certificates.
Expected results:
Events involving routes should always be recorded at the Metadata level, per the documentation at https://docs.openshift.com/container-platform/4.10/security/audit-log-policy-config.html#about-audit-log-profiles_audit-log-policy-config
Additional info:
This is a clone of issue OCPBUGS-5165. The following is the description of the original issue:
—
Currently, the Dev Sandbox clusters send the clusterType "OSD" instead of "DEVSANDBOX" because the configuration annotations of the console config are automatically overridden by some SyncSets.
To reproduce: open Dev Sandbox, open the browser console, and inspect window.SERVER_FLAGS.telemetry.
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/510
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-4022. The following is the description of the original issue:
—
Description of problem:
Unnecessary react warning:
Warning: Each child in a list should have a unique "key" prop. Check the render method of `NavSection`. See https://reactjs.org/link/warning-keys for more information.
    NavItemHref@http://localhost:9012/static/main-785e94355aeacc12c321.js:5141:88
    NavSection@http://localhost:9012/static/main-785e94355aeacc12c321.js:5294:20
    PluginNavItem@http://localhost:9012/static/main-785e94355aeacc12c321.js:5582:23
    div
    PerspectiveNav@http://localhost:9012/static/main-785e94355aeacc12c321.js:5398:134
Version-Release number of selected component (if applicable):
4.11 was fine
4.12 and 4.13 (master) show this warning
How reproducible:
Always
Steps to Reproduce:
1. Open browser log
2. Open web console
Actual results:
React warning
Expected results:
No React warning
Description of problem:
vSphere privilege checking fails when providing a user-defined folder and/or resource pool
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-30-054458
How reproducible:
consistently
Steps to Reproduce:
1. Provide a pre-existing folder and/or resource pool in the install-config
2. Perform an installation with an account with read-only privileges on the datacenter and cluster
3. The installer fails with missing privileges for the cluster and datacenter. When a pre-existing folder and resource pool are defined, the account only needs read-only privileges on the datacenter and cluster.
Actual results:
Installer reports missing privileges
Expected results:
Installer should succeed
Additional info:
Description of problem:
Added a script to collect PodNetworkConnectivityChecks to be able to view the overall status of pod network connectivity. The current must-gather collects the contents of `openshift-network-diagnostics` but does not collect the PodNetworkConnectivityCheck resources.
Version-Release number of selected component (if applicable):
4.12, 4.11, 4.10
Description of problem:
After enabling FIPS on s390x, the ingress controller repeatedly goes into the degraded state. The observation here is that the ingress controller is in the running state after a few failures, but it keeps recreating the pod and the operator status shows as degraded.
Version-Release number of selected component (if applicable):
OCP Version: 4.11.0-rc.2
How reproducible:
Enable FIPS: True in the image-config file
Steps to Reproduce:
1. Enable FIPS: True in the image-config file before the installation.
2.
3. oc get co
Actual results:
oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-rc.2 True False False 7h29m
baremetal 4.11.0-rc.2 True False False 4d12h
cloud-controller-manager 4.11.0-rc.2 True False False 4d12h
cloud-credential 4.11.0-rc.2 True False False 4d12h
cluster-autoscaler 4.11.0-rc.2 True False False 4d12h
config-operator 4.11.0-rc.2 True False False 4d12h
console 4.11.0-rc.2 True False False 4d11h
csi-snapshot-controller 4.11.0-rc.2 True False False 4d12h
dns 4.11.0-rc.2 True False False 4d12h
etcd 4.11.0-rc.2 True False False 4d11h
image-registry 4.11.0-rc.2 True False False 4d11h
ingress 4.11.0-rc.2 True False True 4d11h The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled:
  Pod "router-default-84689cdc5f-r87hs" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-r87hs": pod router-default-84689cdc5f-r87hs is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe"
  Pod "router-default-84689cdc5f-8z2fh" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-8z2fh": pod router-default-84689cdc5f-8z2fh is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe"
  Pod "router-default-84689cdc5f-s7z96" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-s7z96": pod router-default-84689cdc5f-s7z96 is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe"
  Pod "router-default-84689cdc5f-hslhn" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-hslhn": pod router-default-84689cdc5f-hslhn is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe"
  Pod "router-default-84689cdc5f-nf9vt" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-nf9vt": pod router-default-84689cdc5f-nf9vt is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe"
  Pod "router-default-84689cdc5f-mslzf" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-mslzf": pod router-default-84689cdc5f-mslzf is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe"
  Pod "router-default-84689cdc5f-mc8th" is not yet scheduled: SchedulerError: binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "router-default-84689cdc5f-mc8th": pod router-default-84689cdc5f-mc8th is already assigned to node "worker-0.ocp-m1317001.lnxero1.boe")
insights 4.11.0-rc.2 True False False 4d12h
kube-apiserver 4.11.0-rc.2 True False False 4d11h
kube-controller-manager 4.11.0-rc.2 True False False 4d12h
kube-scheduler 4.11.0-rc.2 True False False 4d12h
kube-storage-version-migrator 4.11.0-rc.2 True False False 4d11h
machine-api 4.11.0-rc.2 True False False 4d12h
machine-approver 4.11.0-rc.2 True False False 4d12h
machine-config 4.11.0-rc.2 True False False 4d12h
marketplace 4.11.0-rc.2 True False False 4d12h
monitoring 4.11.0-rc.2 True False False 4d11h
network 4.11.0-rc.2 True False False 4d12h
node-tuning 4.11.0-rc.2 True False False 4d11h
openshift-apiserver 4.11.0-rc.2 True False False 4d11h
openshift-controller-manager 4.11.0-rc.2 True False False 4d12h
openshift-samples 4.11.0-rc.2 True False False 4d11h
operator-lifecycle-manager 4.11.0-rc.2 True False False 4d12h
operator-lifecycle-manager-catalog 4.11.0-rc.2 True False False 4d12h
operator-lifecycle-manager-packageserver 4.11.0-rc.2 True False False 4d11h
service-ca 4.11.0-rc.2 True False False 4d12h
storage 4.11.0-rc.2 True False False 4d12h
Expected results:
oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-rc.2 True False False 9d
baremetal 4.11.0-rc.2 True False False 13d
cloud-controller-manager 4.11.0-rc.2 True False False 13d
cloud-credential 4.11.0-rc.2 True False False 13d
cluster-autoscaler 4.11.0-rc.2 True False False 13d
config-operator 4.11.0-rc.2 True False False 13d
console 4.11.0-rc.2 True False False 13d
csi-snapshot-controller 4.11.0-rc.2 True False False 13d
dns 4.11.0-rc.2 True False False 13d
etcd 4.11.0-rc.2 True False False 13d
image-registry 4.11.0-rc.2 True False False 13d
ingress 4.11.0-rc.2 True False False 13d
insights 4.11.0-rc.2 True False False 13d
kube-apiserver 4.11.0-rc.2 True False False 13d
kube-controller-manager 4.11.0-rc.2 True False False 13d
kube-scheduler 4.11.0-rc.2 True False False 13d
kube-storage-version-migrator 4.11.0-rc.2 True False False 13d
machine-api 4.11.0-rc.2 True False False 13d
machine-approver 4.11.0-rc.2 True False False 13d
machine-config 4.11.0-rc.2 True False False 13d
marketplace 4.11.0-rc.2 True False False 13d
monitoring 4.11.0-rc.2 True False False 13d
network 4.11.0-rc.2 True False False 13d
node-tuning 4.11.0-rc.2 True False False 13d
openshift-apiserver 4.11.0-rc.2 True False False 13d
openshift-controller-manager 4.11.0-rc.2 True False False 13d
openshift-samples 4.11.0-rc.2 True False False 13d
operator-lifecycle-manager 4.11.0-rc.2 True False False 13d
operator-lifecycle-manager-catalog 4.11.0-rc.2 True False False 13d
operator-lifecycle-manager-packageserver 4.11.0-rc.2 True False False 13d
service-ca 4.11.0-rc.2 True False False 13d
storage 4.11.0-rc.2 True False False 13d
Additional info:
Attached the Running ingress controller logs.
The failed ingress controller pod is repeatedly recreated in the openshift-ingress namespace.
It looks like two ingress controller pods are in the running state, but the other failed pods were not cleaned up, so manually deleting the failed pods fixed the issue.
router-default-84689cdc5f-9j44t 1/1 Running 4 (4d12h ago) 4d12h
router-default-84689cdc5f-qn4gh 1/1 Running 3 (4d12h ago) 4d12h
This is a clone of issue OCPBUGS-4297. The following is the description of the original issue:
—
Description of problem:
The OnDelete update strategy creates two replacement machines when deleting a master machine
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2022-11-29-035943
How reproducible:
Not sure; I hit it twice on this template cluster: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_13/ipi-on-vsphere/versioned-installer-vmc7_techpreview
Steps to Reproduce:
1. Launch a 4.13 cluster on vSphere with TechPreview enabled; we use the automated template: https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_13/ipi-on-vsphere/versioned-installer-vmc7_techpreview
liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2022-11-29-035943 True False 56m Cluster version is 4.13.0-0.nightly-2022-11-29-035943

2. Replace the master machines one by one with indexes 3, 4, 5:
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs13d-rcr7z-master-3 Running 57m
huliu-vs13d-rcr7z-master-4 Running 35m
huliu-vs13d-rcr7z-master-5 Running 12m
huliu-vs13d-rcr7z-worker-ngw2j Running 7h12m
huliu-vs13d-rcr7z-worker-p2xd7 Running 7h12m
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.13.0-0.nightly-2022-11-29-035943 True False False 29m
baremetal 4.13.0-0.nightly-2022-11-29-035943 True False False 7h33m
cloud-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h35m
cloud-credential 4.13.0-0.nightly-2022-11-29-035943 True False False 7h37m
cluster-api 4.13.0-0.nightly-2022-11-29-035943 True False False 7h33m
cluster-autoscaler 4.13.0-0.nightly-2022-11-29-035943 True False False 7h32m
config-operator 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
console 4.13.0-0.nightly-2022-11-29-035943 True False False 28m
control-plane-machine-set 4.13.0-0.nightly-2022-11-29-035943 True False False 5h12m
csi-snapshot-controller 4.13.0-0.nightly-2022-11-29-035943 True False False 7h33m
dns 4.13.0-0.nightly-2022-11-29-035943 True False False 7h32m
etcd 4.13.0-0.nightly-2022-11-29-035943 True False False 7h31m
image-registry 4.13.0-0.nightly-2022-11-29-035943 True False False 74m
ingress 4.13.0-0.nightly-2022-11-29-035943 True False False 7h21m
insights 4.13.0-0.nightly-2022-11-29-035943 True False False 7h26m
kube-apiserver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h22m
kube-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h31m
kube-scheduler 4.13.0-0.nightly-2022-11-29-035943 True False False 7h30m
kube-storage-version-migrator 4.13.0-0.nightly-2022-11-29-035943 True False False 74m
machine-api 4.13.0-0.nightly-2022-11-29-035943 True False False 7h23m
machine-approver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h33m
machine-config 4.13.0-0.nightly-2022-11-29-035943 True False False 27m
marketplace 4.13.0-0.nightly-2022-11-29-035943 True False False 7h32m
monitoring 4.13.0-0.nightly-2022-11-29-035943 True False False 7h19m
network 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
node-tuning 4.13.0-0.nightly-2022-11-29-035943 True False False 7h32m
openshift-apiserver 4.13.0-0.nightly-2022-11-29-035943 True False False 30m
openshift-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h26m
openshift-samples 4.13.0-0.nightly-2022-11-29-035943 True False False 7h25m
operator-lifecycle-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h33m
operator-lifecycle-manager-catalog 4.13.0-0.nightly-2022-11-29-035943 True False False 7h33m
operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h26m
platform-operators-aggregated 4.13.0-0.nightly-2022-11-29-035943 True False False 20m
service-ca 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
storage 4.13.0-0.nightly-2022-11-29-035943 True False False 5h16m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs13d-rcr7z-master-3 Running 77m
huliu-vs13d-rcr7z-master-4 Running 55m
huliu-vs13d-rcr7z-master-5 Running 32m
huliu-vs13d-rcr7z-worker-ngw2j Running 7h32m
huliu-vs13d-rcr7z-worker-p2xd7 Running 7h32m

3. Create the CPMS, yaml as below:
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  replicas: 3
  state: Active
  strategy:
    type: OnDelete
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
  template:
    machineType: machines_v1beta1_machine_openshift_io
    machines_v1beta1_machine_openshift_io:
      metadata:
        labels:
          machine.openshift.io/cluster-api-machine-role: master
          machine.openshift.io/cluster-api-machine-type: master
          machine.openshift.io/cluster-api-cluster: huliu-vs13d-rcr7z
      spec:
        providerSpec:
          value:
            apiVersion: machine.openshift.io/v1beta1
            credentialsSecret:
              name: vsphere-cloud-credentials
            diskGiB: 120
            kind: VSphereMachineProviderSpec
            memoryMiB: 16384
            metadata:
              creationTimestamp: null
            network:
              devices:
              - networkName: qe-segment
            numCPUs: 4
            numCoresPerSocket: 4
            snapshot: ""
            template: huliu-vs13d-rcr7z-rhcos
            userDataSecret:
              name: master-user-data
            workspace:
              datacenter: SDDC-Datacenter
              datastore: WorkloadDatastore
              folder: /SDDC-Datacenter/vm/huliu-vs13d-rcr7z
              resourcePool: /SDDC-Datacenter/host/Cluster-1/Resources
              server: vcenter.sddc-44-236-21-251.vmwarevmc.com

liuhuali@Lius-MacBook-Pro huali-test % oc create -f controlplanemachineset_vsphere.yaml
controlplanemachineset.machine.openshift.io/cluster created
liuhuali@Lius-MacBook-Pro huali-test % oc get controlplanemachineset
NAME DESIRED CURRENT READY UPDATED UNAVAILABLE STATE AGE
cluster 3 3 3 3 Active 9s

4. Edit the CPMS and change numCPUs to 8 to trigger an update:
liuhuali@Lius-MacBook-Pro huali-test % oc edit controlplanemachineset cluster
controlplanemachineset.machine.openshift.io/cluster edited
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.13.0-0.nightly-2022-11-29-035943 True False False 31m
baremetal 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
cloud-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h37m
cloud-credential 4.13.0-0.nightly-2022-11-29-035943 True False False 7h38m
cluster-api 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
cluster-autoscaler 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
config-operator 4.13.0-0.nightly-2022-11-29-035943 True False False 7h35m
console 4.13.0-0.nightly-2022-11-29-035943 True False False 29m
control-plane-machine-set 4.13.0-0.nightly-2022-11-29-035943 True True False 5h14m Observed 3 replica(s) in need of update
csi-snapshot-controller 4.13.0-0.nightly-2022-11-29-035943 True False False 7h35m
dns 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
etcd 4.13.0-0.nightly-2022-11-29-035943 True False False 7h33m
image-registry 4.13.0-0.nightly-2022-11-29-035943 True False False 75m
ingress 4.13.0-0.nightly-2022-11-29-035943 True False False 7h23m
insights 4.13.0-0.nightly-2022-11-29-035943 True False False 7h27m
kube-apiserver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h24m
kube-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h32m
kube-scheduler 4.13.0-0.nightly-2022-11-29-035943 True False False 7h32m
kube-storage-version-migrator 4.13.0-0.nightly-2022-11-29-035943 True False False 75m
machine-api 4.13.0-0.nightly-2022-11-29-035943 True False False 7h24m
machine-approver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h35m
machine-config 4.13.0-0.nightly-2022-11-29-035943 True False False 28m
marketplace 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
monitoring 4.13.0-0.nightly-2022-11-29-035943 True False False 7h21m
network 4.13.0-0.nightly-2022-11-29-035943 True False False 7h35m
node-tuning 4.13.0-0.nightly-2022-11-29-035943 True False False 7h34m
openshift-apiserver 4.13.0-0.nightly-2022-11-29-035943 True False False 31m
openshift-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h27m
openshift-samples 4.13.0-0.nightly-2022-11-29-035943 True False False 7h26m
operator-lifecycle-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h35m
operator-lifecycle-manager-catalog 4.13.0-0.nightly-2022-11-29-035943 True False False 7h35m
operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h27m
platform-operators-aggregated 4.13.0-0.nightly-2022-11-29-035943 True False False 21m
service-ca 4.13.0-0.nightly-2022-11-29-035943 True False False 7h35m
storage 4.13.0-0.nightly-2022-11-29-035943 True False False 5h18m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs13d-rcr7z-master-3 Running 79m
huliu-vs13d-rcr7z-master-4 Running 57m
huliu-vs13d-rcr7z-master-5 Running 33m
huliu-vs13d-rcr7z-worker-ngw2j Running 7h34m
huliu-vs13d-rcr7z-worker-p2xd7 Running 7h34m

5. Delete the master machines one by one; two replacement machines were created when huliu-vs13d-rcr7z-master-4 was deleted:
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs13d-rcr7z-master-5
machine.machine.openshift.io "huliu-vs13d-rcr7z-master-5" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs13d-rcr7z-master-3 Running 79m
huliu-vs13d-rcr7z-master-4 Running 57m
huliu-vs13d-rcr7z-master-5 Deleting 33m
huliu-vs13d-rcr7z-master-6b9x7-5 Provisioning 5s
huliu-vs13d-rcr7z-worker-ngw2j Running 7h34m
huliu-vs13d-rcr7z-worker-p2xd7 Running 7h34m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs13d-rcr7z-master-3 Running 91m
huliu-vs13d-rcr7z-master-4 Running 69m
huliu-vs13d-rcr7z-master-6b9x7-5 Running 12m
huliu-vs13d-rcr7z-worker-ngw2j Running 7h46m
huliu-vs13d-rcr7z-worker-p2xd7 Running 7h46m
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.13.0-0.nightly-2022-11-29-035943 True False False 53m
baremetal 4.13.0-0.nightly-2022-11-29-035943 True False False 7h56m
cloud-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h59m
cloud-credential 4.13.0-0.nightly-2022-11-29-035943 True False False 8h
cluster-api 4.13.0-0.nightly-2022-11-29-035943 True False False 7h56m
cluster-autoscaler 4.13.0-0.nightly-2022-11-29-035943 True False False 7h56m
config-operator 4.13.0-0.nightly-2022-11-29-035943 True False False 7h57m
console 4.13.0-0.nightly-2022-11-29-035943 True False False 18m
control-plane-machine-set 4.13.0-0.nightly-2022-11-29-035943 True True False 18m Observed 2 replica(s) in need of update
csi-snapshot-controller 4.13.0-0.nightly-2022-11-29-035943 True False False 7h57m
dns 4.13.0-0.nightly-2022-11-29-035943 True False False 7h56m
etcd 4.13.0-0.nightly-2022-11-29-035943 True False False 7h55m
image-registry 4.13.0-0.nightly-2022-11-29-035943 True False False 97m
ingress 4.13.0-0.nightly-2022-11-29-035943 True False False 7h45m
insights 4.13.0-0.nightly-2022-11-29-035943 True False False 7h49m
kube-apiserver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h46m
kube-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h54m
kube-scheduler 4.13.0-0.nightly-2022-11-29-035943 True False False 7h54m
kube-storage-version-migrator 4.13.0-0.nightly-2022-11-29-035943 True False False 97m
machine-api 4.13.0-0.nightly-2022-11-29-035943 True False False 7h46m
machine-approver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h56m
machine-config 4.13.0-0.nightly-2022-11-29-035943 True False False 50m
marketplace 4.13.0-0.nightly-2022-11-29-035943 True False False 7h56m
monitoring 4.13.0-0.nightly-2022-11-29-035943 True False False 7h42m
network 4.13.0-0.nightly-2022-11-29-035943 True False False 7h57m
node-tuning 4.13.0-0.nightly-2022-11-29-035943 True False False 7h56m
openshift-apiserver 4.13.0-0.nightly-2022-11-29-035943 True False False 53m
openshift-controller-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h49m
openshift-samples 4.13.0-0.nightly-2022-11-29-035943 True False False 7h48m
operator-lifecycle-manager 4.13.0-0.nightly-2022-11-29-035943 True False False 7h57m
operator-lifecycle-manager-catalog 4.13.0-0.nightly-2022-11-29-035943 True False False 7h57m
operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2022-11-29-035943 True False False 7h49m
platform-operators-aggregated 4.13.0-0.nightly-2022-11-29-035943 True False False 10m
service-ca 4.13.0-0.nightly-2022-11-29-035943 True False False 7h57m
storage 4.13.0-0.nightly-2022-11-29-035943 True False False 5h40m
liuhuali@Lius-MacBook-Pro huali-test % oc delete machine huliu-vs13d-rcr7z-master-4
machine.machine.openshift.io "huliu-vs13d-rcr7z-master-4" deleted
^C
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs13d-rcr7z-master-3 Running 101m
huliu-vs13d-rcr7z-master-4 Deleting 79m
huliu-vs13d-rcr7z-master-6b9x7-5 Running 22m
huliu-vs13d-rcr7z-master-8h9p9-4 Provisioning 6s
huliu-vs13d-rcr7z-master-df78v-4 Provisioning 6s
huliu-vs13d-rcr7z-worker-ngw2j Running 7h56m
huliu-vs13d-rcr7z-worker-p2xd7 Running 7h56m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-vs13d-rcr7z-master-3 Running 115m
huliu-vs13d-rcr7z-master-6b9x7-5 Running 36m
huliu-vs13d-rcr7z-master-8h9p9-4 Running 14m
huliu-vs13d-rcr7z-master-df78v-4 Running 14m
huliu-vs13d-rcr7z-worker-ngw2j Running 8h
huliu-vs13d-rcr7z-worker-p2xd7 Running 8h
Actual results:
When deleting a master machine, two replacement machines are created
Expected results:
When deleting a master machine, only one replacement machine should be created
Additional info:
Must-gather https://drive.google.com/file/d/1VVxGPW0WNDc3CxiJIg90dAQckEWhYy8i/view?usp=sharing
Description of problem:
`create a project` link is enabled for users who do not have permission to create a project. This issue surfaces itself in the developer sandbox.
Version-Release number of selected component (if applicable):
4.11.5
How reproducible:
Steps to Reproduce:
1. Log into dev sandbox, or a cluster where the user does not have permission to create a project
2. Go directly to the URL /topology/all-namespaces
Actual results:
The `create a project` link is enabled. Upon clicking the link and submitting the form, the project fails to create, as expected.
Expected results:
`create a project` link should only be available to users with the correct permissions.
Additional info:
The project list pages are not directly available to the user in the UI through the project selector. The user must go directly to the URL. It's possible to encounter this situation when a user logs in with multiple accounts and returns to a previous URL.
Description of problem:
documentationBaseURL still points to 4.11
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-08-31-101631
How reproducible:
Always
Steps to Reproduce:
1. Check documentationBaseURL on a 4.12 cluster:
# oc get configmap console-config -n openshift-console -o yaml | grep documentationBaseURL
documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.11/
Actual results:
1.documentationBaseURL is still pointing to 4.11
Expected results:
1.documentationBaseURL should point to 4.12
Additional info:
Description of problem:
The default catalogSources in the openshift-4.12 payload are using the 4.11 image tag
Version-Release number of selected component (if applicable):
4.12.0
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.12 OpenShift cluster 2. Inspect the default catalogSource image tags.
Actual results:
The default catalogSources reference the 4.11 image tags.
Expected results:
The default catalogSources reference the 4.12 image tags.
Additional info:
Description of problem:
When services are deleted, the services controller cache should also remove the service from its top-level cache to avoid growing forever. While this is not an issue in 4.13 once the lb_cache rework merges [1], the 4.12 and older branches have this problem because that rework is meant for 4.13 only.

[1]: https://github.com/ovn-org/ovn-kubernetes/pull/3387

This is the location where alreadyApplied is not deleting the removal:
https://github.com/openshift/ovn-kubernetes/blob/cf9fb51510e1870961bf3a0f064b73536757a4f8/go-controller/pkg/ovn/controller/services/services_controller.go#L269

It should do changes similar to those depicted here (currently merged upstream):
https://github.com/ovn-org/ovn-kubernetes/blob/cd78ae1af4657d38bdc41003a8737aa958d62b9d/go-controller/pkg/ovn/controller/services/services_controller.go#L322-L324
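A simplified sketch of the upstream change being referenced (the real code lives in services_controller.go; the types here are stand-ins):

```go
package main

import "fmt"

type lbConfig struct{ vips []string } // stand-in for the real per-service LB config

type controller struct {
	alreadyApplied map[string][]lbConfig // keyed by namespace/name
}

func (c *controller) syncService(key string, svcExists bool, cfg []lbConfig) {
	if !svcExists {
		// The 4.12 code cleaned up the load balancers but left this entry
		// behind, so the map grew by one entry per unique service name.
		delete(c.alreadyApplied, key)
		return
	}
	c.alreadyApplied[key] = cfg
}

func main() {
	c := &controller{alreadyApplied: map[string][]lbConfig{}}
	c.syncService("default/demo", true, []lbConfig{{vips: []string{"10.0.0.1"}}})
	c.syncService("default/demo", false, nil) // delete event
	fmt.Println("cache size after delete:", len(c.alreadyApplied)) // 0, not 1
}
```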
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a service, using a unique name
2. Remove the service
3. Notice how alreadyApplied grows and never gets smaller
4. Repeat
Actual results:
^^
Expected results:
alreadyApplied should not grow forever
Additional info:
Description of problem:
According to https://issues.redhat.com/browse/OCPBUGS-705 (thanks to Junyun for sharing the test env/result for the install part), we need the fix in vsphere-problem-detector. Currently it reports the following privileges as missing when using a pre-existing folder and/or resource pool with ReadOnly permission:

1. vCenter cluster set to ReadOnly permission:
I0902 10:07:50.324782 1 vsphere_check.go:244] CheckComputeClusterPermissions:jima-permission-q84s8-worker-86gd4 failed: missing privileges for compute cluster workloads: Resource.AssignVMToPool, VApp.AssignResourcePool, VApp.Import, VirtualMachine.Config.AddNewDisk

2. Datacenter set to ReadOnly permission:
I0902 08:09:19.462001 1 vsphere_check.go:225] CheckAccountPermissions failed: missing privileges for datacenter OCP-DC: Resource.AssignVMToPool, VApp.Import, VirtualMachine.Config.AddExistingDisk, VirtualMachine.Config.AddNewDisk, VirtualMachine.Config.AddRemoveDevice, VirtualMachine.Config.AdvancedConfig, VirtualMachine.Config.Annotation, VirtualMachine.Config.CPUCount, VirtualMachine.Config.DiskExtend, VirtualMachine.Config.DiskLease, VirtualMachine.Config.EditDevice, VirtualMachine.Config.Memory, VirtualMachine.Config.RemoveDisk, VirtualMachine.Config.Rename, VirtualMachine.Config.ResetGuestInfo, VirtualMachine.Config.Resource, VirtualMachine.Config.Settings, VirtualMachine.Config.UpgradeVirtualHardware, VirtualMachine.Interact.GuestControl, VirtualMachine.Interact.PowerOff, VirtualMachine.Interact.PowerOn, VirtualMachine.Interact.Reset, VirtualMachine.Inventory.Create, VirtualMachine.Inventory.CreateFromExisting, VirtualMachine.Inventory.Delete, VirtualMachine.Provisioning.Clone, VirtualMachine.Provisioning.DeployTemplate, VirtualMachine.Provisioning.MarkAsTemplate, Folder.Create, Folder.Delete
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-02-194931
How reproducible:
Always
Steps to Reproduce:
See Description of problem
Actual results:
The vsphere-problem-detector operator reports missing privileges when using a pre-existing folder and/or resource pool with ReadOnly permission
Expected results:
The vsphere-problem-detector operator should not report missing privileges in that case.
Additional info:
This is a clone of issue OCPBUGS-10221. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-5469. The following is the description of the original issue:
—
Description of problem:
When changing channels, it's possible that multiple new conditional update risks will need to be evaluated. For instance, a cluster running 4.10.34 in a 4.10 channel today only has to evaluate `OpenStackNodeCreationFails`, but when the channel is changed to a 4.11 channel, multiple new risks require evaluation, and the evaluation of new risks is throttled at one every 10 minutes. This means that if there are three new risks, it may take up to 30 minutes after the channel has changed for the full set of conditional updates to be computed. This leads to a perception that no update paths are recommended, because most users will not wait 30 minutes; they expect immediate feedback.
Version-Release number of selected component (if applicable):
4.10.z, 4.11.z, 4.12, 4.13
How reproducible:
100%
Steps to Reproduce:
1. Install 4.10.34
2. Switch from stable-4.10 to stable-4.11
Actual results:
Observe no recommended updates for 10-20 minutes because all available paths to 4.11 have a risk associated with them
Expected results:
Risks are computed in a timely manner for an interactive UX, lets say < 10s
Additional info:
This was intentional in the design; we didn't want risks to continuously re-evaluate or overwhelm the monitoring stack. However, we didn't anticipate a long-standing pile of risks, or realize how confusing the user experience would be. We intend to work around this in the deployed fleet by converting older risks from `type: promql` to `type: Always`, avoiding the evaluation period but preserving the notification. While this may lead customers to believe they're exposed to a risk they may not be, as long as the set of outstanding risks to the latest version is limited to no more than one, it's likely no one will notice. All 4.10 and 4.11 clusters currently have a clear path toward a relatively recent 4.10.z or 4.11.z with no more than one risk to be evaluated.
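A toy sketch of the throttling arithmetic (hypothetical, not the CVO's actual scheduler), just to show why three new promql risks take roughly 30 minutes to settle:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const evalPeriod = 10 * time.Minute              // one promql risk evaluated per period
	risks := []string{"RiskA", "RiskB", "RiskC"}     // fresh risks after a channel change

	for i, r := range risks {
		// Each risk waits its turn for an evaluation slot.
		fmt.Printf("%s resolved ~%v after the channel change\n", r, time.Duration(i+1)*evalPeriod)
	}
	// The full set of conditional update paths is only known after ~30m,
	// which most users read as "no updates are recommended".
}
```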
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/30
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Bootstrap fails in SNO installation
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Test this in a libvirt env. The agent-config and install-config are attached.
2. Use the attached agent-config and install-config to create the image.
3. Install SNO:
virt-install --connect qemu:///system -n control-0 -r 33000 --vcpus 8 --cdrom ./agent.iso --disk pool=installer,size=120 --boot uefi,hd,cdrom --os-variant=rhel8.5 --network network=default,mac=52:54:00:aa:aa:aa --wait=-1 --check mac_in_use=off
4. There is the following error in the bootkube.service log:
-- Logs begin at Fri 2022-09-30 08:58:21 UTC, end at Fri 2022-09-30 09:19:40 UTC. --
Sep 30 09:00:51 test.metalkube.org systemd[1]: Starting Bootkube - bootstrap in place post reboot...
Sep 30 09:00:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Running bootkube bootstrap-in-place post reboot
Sep 30 09:00:52 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:00:57 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:01:02 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:01:07 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:01:12 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Waiting for api ...
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]: error: error executing jsonpath "{.items[0].status.conditions[?(@.type==\"Ready\")].status}": Error executing template: array index out of bounds: index 0, length 0. Printing more information for debugging the template:
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]: template was:
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]: {.items[0].status.conditions[?(@.type=="Ready")].status}
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]: object given to jsonpath engine was:
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[3045]: map[string]interface {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List", "metadata":map[string]interface {}{"resourceVersion":""}}
Sep 30 09:01:17 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]: error: error executing jsonpath "{.items[0].status.conditions[?(@.type==\"Ready\")].status}": Error executing template: array index out of bounds: index 0, length 0. Printing more information for debugging the template:
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]: template was:
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]: {.items[0].status.conditions[?(@.type=="Ready")].status}
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]: object given to jsonpath engine was:
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[3142]: map[string]interface {}{"apiVersion":"v1", "items":[]interface {}{}, "kind":"List", "metadata":map[string]interface {}{"resourceVersion":""}}
Sep 30 09:01:51 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Sep 30 09:02:21 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Sep 30 09:02:52 test.metalkube.org bootstrap-in-place-post-reboot.sh[2409]: Approving csrs ...
Actual results:
Expected results:
Additional info:
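The jsonpath failure above happens because `{.items[0]...}` is evaluated while the node list is still empty. As an illustration only (not the actual bootstrap script; the kubeconfig path is invented), a minimal Go sketch of the guard the script is missing:

```go
// Sketch: check the node list length before reading the Ready condition,
// instead of indexing .items[0] on a possibly-empty list.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func nodeReady(kubeconfig string) (bool, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return false, err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	// The single node may not be registered yet; treat that as "not
	// ready" instead of failing like the jsonpath template does.
	if len(nodes.Items) == 0 {
		return false, nil
	}
	for _, cond := range nodes.Items[0].Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue, nil
		}
	}
	return false, nil
}

func main() {
	ready, err := nodeReady("/etc/kubernetes/kubeconfig") // hypothetical path
	fmt.Println(ready, err)
}
```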
Description of problem:
NPE on topology for the ns which just got deleted, see screenshot below
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Log in as a regular user
2. Create a ns and delete the ns
3. Visit the deleted ns in topology
Actual results:
console breaks due to NPE
Expected results:
console shouldn't break
Additional info:
This is a clone of issue OCPBUGS-10239. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-8082. The following is the description of the original issue:
—
Description of problem:
Currently, some of the ServiceAccounts are lost during the gathering. This task fixes that problem.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-3214. The following is the description of the original issue:
—
Description of problem:
The installer has logic that avoids adding the router CAs to the kubeconfig if the console is not available. It's not clear why it does this, but it means that the router CAs don't get added when the console is deliberately disabled (it is now an optional capability in 4.12).
Version-Release number of selected component (if applicable):
Seen in 4.12+4.13
How reproducible:
Always, when starting a cluster w/o the Console capability
Steps to Reproduce:
1. Edit the install-config to set:
   capabilities:
     baselineCapabilitySet: None
2. Install the cluster.
3. Check the CAs in the kubeconfig; the wildcard route CA will be missing (compare it with a normal cluster).
Actual results:
router CAs missing
Expected results:
router CAs should be present
Additional info:
This needs to be backported to 4.12.
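For illustration, a minimal sketch (not the installer's actual code; file paths are assumptions) of adding the router CA to the kubeconfig unconditionally, i.e. without gating on the console being available:

```go
// Sketch: append the router CA to every cluster entry in the admin
// kubeconfig, regardless of whether the Console capability is enabled.
package main

import (
	"log"
	"os"

	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfigPath := "auth/kubeconfig" // assumed path
	routerCAPath := "router-ca.crt"     // assumed path

	caPEM, err := os.ReadFile(routerCAPath)
	if err != nil {
		log.Fatal(err)
	}
	cfg, err := clientcmd.LoadFromFile(kubeconfigPath)
	if err != nil {
		log.Fatal(err)
	}
	for _, cluster := range cfg.Clusters {
		// Concatenating PEM blocks yields a valid CA bundle.
		cluster.CertificateAuthorityData = append(cluster.CertificateAuthorityData, caPEM...)
	}
	if err := clientcmd.WriteToFile(*cfg, kubeconfigPath); err != nil {
		log.Fatal(err)
	}
}
```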
Description of problem:
When an egress firewall is applied in a namespace whose name is longer than 43 characters, the ACL names get cropped, and all ACLs for the same egress firewall object are considered equivalent. This is a known problem that we faced for network policies too.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
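One common fix for this class of truncation collision (an assumption for illustration, not necessarily the ovn-kubernetes implementation; the length limit and naming scheme here are invented) is to truncate the generated name and append a short hash of the full name, so two long namespaces can no longer collide:

```go
// Sketch: keep ACL names unique under a length limit by suffixing a hash.
package main

import (
	"crypto/sha256"
	"fmt"
)

const maxACLNameLen = 63 // assumed limit for illustration

func aclName(namespace, suffix string) string {
	name := namespace + "_" + suffix
	if len(name) <= maxACLNameLen {
		return name
	}
	sum := sha256.Sum256([]byte(name))
	hash := fmt.Sprintf("%x", sum)[:8]
	// Leave room for "-" plus the 8-character hash.
	return name[:maxACLNameLen-9] + "-" + hash
}

func main() {
	long := "a-namespace-name-that-is-well-over-forty-three-characters-long"
	fmt.Println(aclName(long, "egressFirewall"))
	fmt.Println(aclName(long+"x", "egressFirewall")) // different hash, no collision
}
```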
Changelog between 3.5.4 and 3.5.5:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v355-tbd
Changelog between 3.5.3 and 3.5.4:
https://github.com/etcd-io/etcd/blob/main/CHANGELOG/CHANGELOG-3.5.md#v354-2022-04-24
This is a clone of issue OCPBUGS-3524. The following is the description of the original issue:
—
Description of problem:
Installed a fully private cluster on Azure against 4.12.0-0.nightly-2022-11-10-033725; the storage account (sa) for the CoreOS image has public access.
$ az storage account list -g jima-azure-11a-f58lp-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
clusterptkpx True
imageregistryjimaazrsgcc False
With the same profile on 4.11.0-0.nightly-2022-11-10-202051, the sa for the CoreOS image is not publicly accessible.
$ az storage account list -g jima-azure-11c-kf9hw-rg --query "[].[name,allowBlobPublicAccess]" -o tsv
clusterr8wv9 False
imageregistryjimaaz9btdx False
Checked that terraform-provider-azurerm version is different between 4.11 and 4.12.
4.11: v2.98.0
4.12: v3.19.1
In terraform-provider-azurerm v2.98.0, the property allow_blob_public_access manages sa public access, and its default value is false.
In terraform-provider-azurerm v3.19.1, the property allow_blob_public_access was renamed to allow_nested_items_to_be_public, and its default value is true.
https://github.com/hashicorp/terraform-provider-azurerm/blob/main/CHANGELOG.md#300-march-24-2022
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-11-10-033725
How reproducible:
always on 4.12
Steps to Reproduce:
1. Install a fully private cluster on Azure against a 4.12 payload
Actual results:
sa for coreos image is publicly accessible
Expected results:
sa for coreos image should not be publicly accessible
Additional info:
only happened on 4.12
While backporting the node healthz server to 4.12 (#1570), a number of functions related to checking stale OVS ports (checkForStaleOVSInternalPorts, checkForStaleOVSRepresentorInterfaces, checkForStaleOVSInterfaces) were moved to pkg/node/openflow_manager.go, while their related tests were left in pkg/node/healthcheck_test.go. In 4.13, we have everything under pkg/network-controller-manager. To keep consistency, let's move these to pkg/node/node.go and pkg/node/node_test.go.
An RW mutex was introduced to the project auth cache with https://github.com/openshift/openshift-apiserver/pull/267, taking exclusive access during cache syncs. On clusters with extremely high object counts for namespaces and RBAC, syncs appear to be extremely slow (on the order of several minutes). The project LIST handler acquires the same mutex in shared mode as part of its critical path.
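A minimal sketch of the contention pattern described above and one mitigation (this is illustrative only, not the openshift-apiserver code): if the sync rebuilds the index while holding the write lock, every LIST blocks for the whole rebuild; building the replacement index outside the lock and swapping it under a brief write lock keeps readers responsive.

```go
package main

import (
	"fmt"
	"sync"
)

type authCache struct {
	mu    sync.RWMutex
	index map[string][]string // user -> visible project names
}

// Slow pattern: readers are blocked for the entire rebuild.
func (c *authCache) syncHoldingLock(rebuild func() map[string][]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.index = rebuild() // can take minutes on huge RBAC datasets
}

// Better pattern: rebuild outside the lock, then swap quickly.
func (c *authCache) syncThenSwap(rebuild func() map[string][]string) {
	next := rebuild() // no lock held during the expensive work
	c.mu.Lock()
	c.index = next
	c.mu.Unlock()
}

// The LIST handler's critical path only needs the shared lock briefly.
func (c *authCache) list(user string) []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.index[user]
}

func main() {
	c := &authCache{index: map[string][]string{}}
	c.syncThenSwap(func() map[string][]string {
		return map[string][]string{"alice": {"project-a"}}
	})
	fmt.Println(c.list("alice"))
}
```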
Copied from an upstream issue: https://github.com/operator-framework/operator-lifecycle-manager/issues/2830
What did you do?
When attempting to reinstall an operator that uses conversion webhooks (by deleting the original CSVs and reinstalling), the resulting InstallPlan enters a failed state with a message similar to:
error validating existing CRs against new CRD's schema for "devworkspaces.workspace.devfile.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"workspace.devfile.io", Version:"v1alpha1", Resource:"devworkspaces"}: conversion webhook for workspace.devfile.io/v1alpha2, Kind=DevWorkspace failed: Post "https://devworkspace-controller-manager-service.test-namespace.svc:443/convert?timeout=30s": service "devworkspace-controller-manager-service" not found
When the original CSVs are deleted, the operator's main deployment and service are removed, but CRDs are left in-cluster. However, since the service/CA bundle/deployment that serve the conversion webhook are removed, conversion webhooks are broken at that point. Eventually this impacts garbage collection on the cluster as well.
This can be reproduced by installing the DevWorkspace Operator from the Red Hat catalog. (I can provide yamls/upstream images that reproduce as well, if that's helpful). It may be necessary to create a DevWorkspace in the cluster before deletion, e.g. by oc apply -f https://raw.githubusercontent.com/devfile/devworkspace-operator/main/samples/plain.yaml
What did you expect to see?
Operator is able to be reinstalled without removing CRDs and all instances.
What did you see instead? Under which circumstances?
It's necessary to completely remove the operator including CRDs. For our operator (DevWorkspace), this also makes uninstall especially complicated as finalizers are used (so CRDs cannot be deleted if the controller is removed, and the controller cannot be restored by reinstalling)
Environment
operator-lifecycle-manager version: 4.10.24
Kubernetes version information: Kubernetes Version: v1.23.5+012e945 (OpenShift 4.10.24)
Kubernetes cluster kind: OpenShift
Description of problem:
One multus case always fails in QE e2e testing. Using the same net-attach-def and pod configuration files, testing passed in 4.11 but fails in 4.12 and 4.13.
Version-Release number of selected component (if applicable):
4.12 and 4.13
How reproducible:
Every time
Steps to Reproduce:
[weliang@weliang networking]$ oc create -f https://raw.githubusercontent.com/weliang1/verification-tests/master/testdata/networking/multus-cni/NetworkAttachmentDefinitions/runtimeconfig-def-ipandmac.yaml
networkattachmentdefinition.k8s.cni.cncf.io/runtimeconfig-def created
[weliang@weliang networking]$ oc get net-attach-def -o yaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-01-03T16:33:03Z"
    generation: 1
    name: runtimeconfig-def
    namespace: test
    resourceVersion: "64139"
    uid: bb26c08f-adbf-477e-97ab-2aa7461e50c4
  spec:
    config: '{ "cniVersion": "0.3.1", "name": "runtimeconfig-def", "plugins": [{ "type": "macvlan", "capabilities": { "ips": true }, "mode": "bridge", "ipam": { "type": "static" } }, { "type": "tuning", "capabilities": { "mac": true } }] }'
kind: List
metadata:
  resourceVersion: ""
[weliang@weliang networking]$ oc create -f https://raw.githubusercontent.com/weliang1/verification-tests/master/testdata/networking/multus-cni/Pods/runtimeconfig-pod-ipandmac.yaml
pod/runtimeconfig-pod created
[weliang@weliang networking]$ oc get pod
NAME                READY   STATUS              RESTARTS   AGE
runtimeconfig-pod   0/1     ContainerCreating   0          6s
[weliang@weliang networking]$ oc describe pod runtimeconfig-pod
Name:         runtimeconfig-pod
Namespace:    test
Priority:     0
Node:         weliang-01031-bvxtz-worker-a-qlwz7.c.openshift-qe.internal/10.0.128.4
Start Time:   Tue, 03 Jan 2023 11:33:45 -0500
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/networks: [ { "name": "runtimeconfig-def", "ips": [ "192.168.22.2/24" ], "mac": "CA:FE:C0:FF:EE:00" } ]
              openshift.io/scc: anyuid
Status:       Pending
IP:
IPs:          <none>
Containers:
  runtimeconfig-pod:
    Container ID:
    Image:          quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-k5zqd (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-k5zqd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age   From               Message
  ----     ------                  ----  ----               -------
  Normal   Scheduled               26s   default-scheduler  Successfully assigned test/runtimeconfig-pod to weliang-01031-bvxtz-worker-a-qlwz7.c.openshift-qe.internal
  Normal   AddedInterface          24s   multus             Add eth0 [10.128.2.115/23] from openshift-sdn
  Warning  FailedCreatePodSandBox  23s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runtimeconfig-pod_test_7d5f3e7a-846d-4cfb-ac78-fd08b27102ae_0(cff792dbd07e8936d04aad31964bd7b626c19a90eb9d92a67736323a1a2303c4): error adding pod test_runtimeconfig-pod to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [test/runtimeconfig-pod/7d5f3e7a-846d-4cfb-ac78-fd08b27102ae:runtimeconfig-def]: error adding container to network "runtimeconfig-def": Interface name contains an invalid character /
  Normal   AddedInterface          7s    multus             Add eth0 [10.128.2.116/23] from openshift-sdn
  Warning  FailedCreatePodSandBox  7s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_runtimeconfig-pod_test_7d5f3e7a-846d-4cfb-ac78-fd08b27102ae_0(d2456338fa65847d5dc744dea64972912c10b2a32d3450910b0b81cdc9159ca4): error adding pod test_runtimeconfig-pod to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [test/runtimeconfig-pod/7d5f3e7a-846d-4cfb-ac78-fd08b27102ae:runtimeconfig-def]: error adding container to network "runtimeconfig-def": Interface name contains an invalid character /
[weliang@weliang networking]$
Actual results:
Pod is not running
Expected results:
Pod should be in running state
Additional info:
Description of problem:
A large OpenShift Container Platform 4.10.24 cluster is failing to update the router-certs secret in the openshift-config-managed namespace because the secret is too big:
2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266 Reconciler error {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}
The cluster has 180 IngressControllers configured with endpointPublishingStrategy set to private. Now the default certificate needs to be replaced, but it is not properly replicated to the openshift-authentication namespace (and potentially other locations) because the required secret cannot be updated.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.10.24
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.10
2. Create 180 IngressControllers with specific certificates
3. Check the openshift-ingress-operator logs to see how it fails to update/create the necessary secret in openshift-config-managed
Actual results:
2022-09-01T06:24:15.157333294Z 2022-09-01T06:24:15.157Z ERROR operator.init.controller.certificate_publisher_controller controller/controller.go:266 Reconciler error {"name": "foo-bar", "namespace": "openshift-ingress-operator", "error": "failed to ensure global secret: failed to update published router certificates secret: Secret \"router-certs\" is invalid: data: Too long: must have at most 1048576 bytes"}
Expected results:
No matter how many IngressControllers are created, secret management taken care of by operators needs to work, even if the data exceeds the 1 MiB size limitation. In that case an approach needs to exist to split the data into multiple secrets or handle it otherwise.
Additional info:
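As a rough illustration of the "split into multiple secrets" idea from the expected results (this is a sketch, not the ingress operator's implementation; the "router-certs-N" naming is invented), the certificate data could be packed into as many Secrets as needed to stay under the 1 MiB object limit:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const maxSecretBytes = 1 << 20 // 1 MiB etcd object limit

// splitRouterCerts packs per-domain PEM bundles into Secrets, starting a
// new Secret whenever adding an entry would exceed the size budget.
func splitRouterCerts(certs map[string][]byte) []*corev1.Secret {
	var secrets []*corev1.Secret
	current := map[string][]byte{}
	size := 0
	flush := func() {
		if len(current) == 0 {
			return
		}
		secrets = append(secrets, &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("router-certs-%d", len(secrets)), // hypothetical name
				Namespace: "openshift-config-managed",
			},
			Data: current,
		})
		current, size = map[string][]byte{}, 0
	}
	for domain, pem := range certs {
		if size+len(pem) > maxSecretBytes {
			flush()
		}
		current[domain] = pem
		size += len(pem)
	}
	flush()
	return secrets
}

func main() {
	certs := map[string][]byte{"a.example.com": []byte("PEM..."), "b.example.com": []byte("PEM...")}
	for _, s := range splitRouterCerts(certs) {
		fmt.Println(s.Name, len(s.Data))
	}
}
```

Consumers would then need to read all secrets matching the prefix rather than a single well-known name, which is the main compatibility cost of this design.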
This is a clone of issue OCPBUGS-2895. The following is the description of the original issue:
—
Description of problem:
Current validation will not accept Resource Groups or DiskEncryptionSets which have upper-case letters.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Attempt to create a cluster/machineset using a DiskEncryptionSet with an RG or Name with upper-case letters
Steps to Reproduce:
1. Create cluster with DiskEncryptionSet with upper-case letters in DES name or in Resource Group name
Actual results:
See error message: encountered error: [controlPlane.platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format, compute[0].platform.azure.defaultMachinePlatform.osDisk.diskEncryptionSet.resourceGroup: Invalid value: \"v4-e2e-V62447568-eastus\": invalid resource group format]
Expected results:
Create a cluster/machineset using the existing and valid DiskEncryptionSet
Additional info:
I have submitted a PR for this already, but it needs to be reviewed and backported to 4.11: https://github.com/openshift/installer/pull/6513
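A minimal sketch of the validation fix (the exact character class is an assumption; the real pattern lives in the installer PR linked above): Azure resource group names allow upper-case letters, so the regex must not be lower-case-only.

```go
package main

import (
	"fmt"
	"regexp"
)

// Too strict: rejects "v4-e2e-V62447568-eastus", producing the reported error.
var lowerOnlyRG = regexp.MustCompile(`^[a-z0-9_\-\.\(\)]+$`)

// Accepts mixed case, which Azure allows.
var mixedCaseRG = regexp.MustCompile(`^[a-zA-Z0-9_\-\.\(\)]+$`)

func main() {
	rg := "v4-e2e-V62447568-eastus"
	fmt.Println(lowerOnlyRG.MatchString(rg)) // false: the reported bug
	fmt.Println(mixedCaseRG.MatchString(rg)) // true
}
```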
Similar to how we generate the kubeconfig at the same time as the ISO, we should also generate the admin password.
This will require changes to the installer to allow assisted-service to pass at least the hash of the password in to the installer process that generates the bootstrap ignition, similar in concept to the changes made to pass the kubeconfig.
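As a sketch of the concept (details assumed; the real change spans assisted-service and the installer), generate a random password and hand only its bcrypt hash to the process that renders the bootstrap ignition, mirroring how the kubeconfig is passed:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"

	"golang.org/x/crypto/bcrypt"
)

// Ambiguous characters (l, 1, O, 0) are omitted; the alphabet is an assumption.
const alphabet = "abcdefghijkmnopqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXYZ23456789"

func randomPassword(n int) (string, error) {
	b := make([]byte, n)
	for i := range b {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(alphabet))))
		if err != nil {
			return "", err
		}
		b[i] = alphabet[idx.Int64()]
	}
	return string(b), nil
}

func main() {
	password, err := randomPassword(23)
	if err != nil {
		panic(err)
	}
	hash, err := bcrypt.GenerateFromPassword([]byte(password), bcrypt.DefaultCost)
	if err != nil {
		panic(err)
	}
	// The plain password is shown to the user; only the hash travels onward.
	fmt.Printf("password: %s\nhash: %s\n", password, hash)
}
```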
This is a clone of issue OCPBUGS-5018. The following is the description of the original issue:
—
Description of problem:
When upgrading an IPI AWS cluster which included MachineSet and BYOH Windows nodes from 4.11 to 4.12, the upgrade hung while trying to upgrade the machine-api component:

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-12-16-190443 True True 117m Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api

$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0-0.nightly-2022-12-16-190443 True False False 4h47m
baremetal 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
cloud-controller-manager 4.12.0-rc.5 True False False 5h3m
cloud-credential 4.11.0-0.nightly-2022-12-16-190443 True False False 5h4m
cluster-autoscaler 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
config-operator 4.12.0-rc.5 True False False 5h1m
console 4.11.0-0.nightly-2022-12-16-190443 True False False 4h43m
csi-snapshot-controller 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
dns 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
etcd 4.12.0-rc.5 True False False 4h58m
image-registry 4.11.0-0.nightly-2022-12-16-190443 True False False 4h54m
ingress 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
insights 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
kube-apiserver 4.12.0-rc.5 True False False 4h50m
kube-controller-manager 4.12.0-rc.5 True False False 4h57m
kube-scheduler 4.12.0-rc.5 True False False 4h57m
kube-storage-version-migrator 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
machine-api 4.11.0-0.nightly-2022-12-16-190443 True True False 4h56m Progressing towards operator: 4.12.0-rc.5
machine-approver 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
machine-config 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
marketplace 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
monitoring 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
network 4.11.0-0.nightly-2022-12-16-190443 True False False 5h3m
node-tuning 4.11.0-0.nightly-2022-12-16-190443 True False False 4h59m
openshift-apiserver 4.11.0-0.nightly-2022-12-16-190443 True False False 4h53m
openshift-controller-manager 4.11.0-0.nightly-2022-12-16-190443 True False False 4h56m
openshift-samples 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
operator-lifecycle-manager 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
operator-lifecycle-manager-catalog 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
operator-lifecycle-manager-packageserver 4.11.0-0.nightly-2022-12-16-190443 True False False 4h55m
service-ca 4.11.0-0.nightly-2022-12-16-190443 True False False 5h
storage 4.11.0-0.nightly-2022-12-16-190443 True False False 5h

When digging a little deeper into the exact component hanging, we observed that it was the machine-api-termination-handler pods running on the Windows Machine workers that were in ImagePullBackOff state:

$ oc get pods -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-6ff66b6655-kpgp9 2/2 Running 0 5h5m
cluster-baremetal-operator-6dbcd6f76b-d9dwd 2/2 Running 0 5h5m
machine-api-controllers-cdb8d979b-79xlh 7/7 Running 0 94m
machine-api-operator-86bf4f6d79-g2vwm 2/2 Running 0 97m
machine-api-termination-handler-fcfq2 0/1 ImagePullBackOff 0 94m
machine-api-termination-handler-gj4pf 1/1 Running 0 4h57m
machine-api-termination-handler-krwdg 0/1 ImagePullBackOff 0 94m
machine-api-termination-handler-l95x2 1/1 Running 0 4h54m
machine-api-termination-handler-p6sw6 1/1 Running 0 4h57m

$ oc describe pods machine-api-termination-handler-fcfq2 -n openshift-machine-api
Name: machine-api-termination-handler-fcfq2
Namespace: openshift-machine-api
Priority: 2000001000
Priority Class Name: system-node-critical
.....................................................................
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 94m default-scheduler Successfully assigned openshift-machine-api/machine-api-termination-handler-fcfq2 to ip-10-0-145-114.us-east-2.compute.internal
Warning FailedCreatePodSandBox 94m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "7b80f84cc547310f5370a7dde7c651ca661dd40ebd0730296329d1cbe8981b37": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
Warning FailedCreatePodSandBox 94m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6b3e020a419dde8359a31b56129c65821011e232467d712f9f5081f32fe380c9": plugin type="win-overlay" name="OVNKubernetesHybridOverlayNetwork" failed (add): error while adding HostComputeEndpoint: failed to create the new HostComputeEndpoint: hcnCreateEndpoint failed in Win32: The object already exists. (0x1392) {"Success":false,"Error":"The object already exists. ","ErrorCode":2147947410}
Normal Pulling 93m (x4 over 94m) kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"
Warning Failed 93m (x4 over 94m) kubelet Error: ErrImagePull
Normal BackOff 4m39s (x393 over 94m) kubelet Back-off pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9aa96cb22047b62f785b87bf81ec1762703c1489079dd33008085b5585adc258"

$ oc get pods -n openshift-machine-api -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cluster-autoscaler-operator-6ff66b6655-kpgp9 2/2 Running 0 5h8m 10.130.0.10 ip-10-0-180-35.us-east-2.compute.internal <none> <none>
cluster-baremetal-operator-6dbcd6f76b-d9dwd 2/2 Running 0 5h8m 10.130.0.8 ip-10-0-180-35.us-east-2.compute.internal <none> <none>
machine-api-controllers-cdb8d979b-79xlh 7/7 Running 0 97m 10.128.0.144 ip-10-0-138-246.us-east-2.compute.internal <none> <none>
machine-api-operator-86bf4f6d79-g2vwm 2/2 Running 0 100m 10.128.0.143 ip-10-0-138-246.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-fcfq2 0/1 ImagePullBackOff 0 97m 10.129.0.7 ip-10-0-145-114.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-gj4pf 1/1 Running 0 5h 10.0.223.37 ip-10-0-223-37.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-krwdg 0/1 ImagePullBackOff 0 97m 10.128.0.4 ip-10-0-143-111.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-l95x2 1/1 Running 0 4h57m 10.0.172.211 ip-10-0-172-211.us-east-2.compute.internal <none> <none>
machine-api-termination-handler-p6sw6 1/1 Running 0 5h 10.0.146.227 ip-10-0-146-227.us-east-2.compute.internal <none> <none>

[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
ip-10-0-143-111.us-east-2.compute.internal Ready worker 4h24m v1.24.0-2566+5157800f2a3bc3 10.0.143.111 <none> Windows Server 2019 Datacenter 10.0.17763.3770 containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get nodes -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
ip-10-0-145-114.us-east-2.compute.internal Ready worker 4h18m v1.24.0-2566+5157800f2a3bc3 10.0.145.114 <none> Windows Server 2019 Datacenter 10.0.17763.3770 containerd://1.18
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-145-114.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-v57sh Running m5a.large us-east-2 us-east-2a 4h37m ip-10-0-145-114.us-east-2.compute.internal aws:///us-east-2a/i-0b69d52c625c46a6a running
[jfrancoa@localhost byoh-auto]$ oc get machine.machine.openshift.io -n openshift-machine-api -o wide | grep ip-10-0-143-111.us-east-2.compute.internal
jfrancoa-1912-aws-rvkrp-windows-worker-us-east-2a-j6gkc Running m5a.large us-east-2 us-east-2a 4h37m ip-10-0-143-111.us-east-2.compute.internal aws:///us-east-2a/i-05e422c0051707d16 running

This is blocking the whole upgrade process, as the upgrade is not able to move further from this component.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-12-16-190443 True True 141m Working towards 4.12.0-rc.5: 214 of 827 done (25% complete), waiting on machine-api
$ oc version
Client Version: 4.11.0-0.ci-2022-06-09-065118
Kustomize Version: v4.5.4
Server Version: 4.11.0-0.nightly-2022-12-16-190443
Kubernetes Version: v1.25.4+77bec7a
How reproducible:
Always
Steps to Reproduce:
1. Deploy a 4.11 IPI AWS cluster with Windows workers using a MachineSet
2. Perform the upgrade to 4.12
3. Wait for the upgrade to hang on the machine-api component
Actual results:
The upgrade hangs when upgrading the machine-api component.
Expected results:
The upgrade succeeds.
Additional info:
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
CI is flaky because tests pull the "openshift/origin-node" image from Docker Hub and get rate-limited:
E0803 20:44:32.429877 2066 kuberuntime_image.go:53] "Failed to pull image" err="rpc error: code = Unknown desc = reading manifest latest in docker.io/openshift/origin-node: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" image="openshift/origin-node:latest"
This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/929/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/16871891662673059841687189166267305984. I don't know how to search for this failure using search.ci. I discovered the rate-limiting through Loki: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22PCEB727DF2F34084E%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fpull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator%2F1687189166267305984%5C%22%7D%20%7C%20unpack%20%7C~%20%5C%22pull%20rate%20limit%5C%22%22,%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221691086303449%22,%22to%22:%221691122303451%22%7D%7D.
This happened on 4.14 CI job.
I have observed this once so far, but it is quite obscure.
1. Post a PR and have bad luck.
2. Check Loki using the following query:
{...} {invoker="openshift-internal-ci/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/*"} | unpack | systemd_unit="kubelet.service" |~ "pull rate limit"
CI pulls from Docker Hub and fails.
CI passes, or fails on some other test failure. CI should never pull from Docker Hub.
We have been using the "openshift/origin-node" image in multiple tests for years. I have no idea why it is suddenly pulling from Docker Hub, or how we failed to notice that it was pulling from Docker Hub if that's what it was doing all along.
Description of problem:
If the cluster is installed without the Build and DeploymentConfig capabilities, then when using `oc new-app registry.redhat.io/<namespace>/<image>:<tag>`, the created Deployment has an empty spec.containers[0].image. The Deployment will fail to start a pod.
Version-Release number of selected component (if applicable):
oc version
Client Version: 4.14.0-0.nightly-2023-08-22-221456
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-09-02-132842
Kubernetes Version: v1.27.4+2c83a9f
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster without the build/deploymentconfig functions: set "baselineCapabilitySet: None" in the install-config.
2. Create a deployment using the 'new-app' command: oc new-app registry.redhat.io/ubi8/httpd-24:latest
Actual results:
2. $ oc new-app registry.redhat.io/ubi8/httpd-24:latest
--> Found container image c412709 (11 days old) from registry.redhat.io for "registry.redhat.io/ubi8/httpd-24:latest"

    Apache httpd 2.4
    ----------------
    Apache httpd 2.4 available as container, is a powerful, efficient, and extensible web server. Apache supports a variety of features, many implemented as compiled modules which extend the core functionality. These can range from server-side programming language support to authentication schemes. Virtual hosting allows one Apache installation to serve many different Web sites.

    Tags: builder, httpd, httpd-24

    * An image stream tag will be created as "httpd-24:latest" that will track this image

--> Creating resources ...
    imagestream.image.openshift.io "httpd-24" created
    deployment.apps "httpd-24" created
    service "httpd-24" created
--> Success
    Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
     'oc expose service/httpd-24'
    Run 'oc status' to view your app

3. oc get deploy -o yaml
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"httpd-24:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"httpd-24\")].image"}]'
      openshift.io/generated-by: OpenShiftNewApp
    creationTimestamp: "2023-09-04T07:44:01Z"
    generation: 1
    labels:
      app: httpd-24
      app.kubernetes.io/component: httpd-24
      app.kubernetes.io/instance: httpd-24
    name: httpd-24
    namespace: wxg
    resourceVersion: "115441"
    uid: 909d0c4e-180c-4f88-8fb5-93c927839903
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        deployment: httpd-24
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          openshift.io/generated-by: OpenShiftNewApp
        creationTimestamp: null
        labels:
          deployment: httpd-24
      spec:
        containers:
        - image: ' '
          imagePullPolicy: IfNotPresent
          name: httpd-24
          ports:
          - containerPort: 8080
            protocol: TCP
          - containerPort: 8443
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    conditions:
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Created new replica set "httpd-24-7f6b55cc85"
      reason: NewReplicaSetCreated
      status: "True"
      type: Progressing
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: 'Pod "httpd-24-7f6b55cc85-pvvgt" is invalid: spec.containers[0].image: Invalid value: " ": must not have leading or trailing whitespace'
      reason: FailedCreate
      status: "True"
      type: ReplicaFailure
    observedGeneration: 1
    unavailableReplicas: 1
kind: List
metadata:
Expected results:
Should set spec.containers[0].image to registry.redhat.io/ubi8/httpd-24:latest
Additional info:
Description of problem:
The test case https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926 was created for the NE-577 epic. When we increase spec.tuningOptions.maxConnections to 2000000, the default ingress controller gets stuck in Progressing.
Version-Release number of selected component (if applicable):
How reproducible:
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926
Steps to Reproduce:
1. Edit the default controller with the max value 2000000:
   oc -n openshift-ingress-operator edit ingresscontroller default
   tuningOptions:
     maxConnections: 2000000
2. melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml | grep -A1 tuningOptions
   tuningOptions:
     maxConnections: 2000000
3. melvinjoseph@mjoseph-mac openshift-tests-private % oc get co/ingress
   NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
   ingress 4.15.0-0.nightly-2023-10-16-231617 True True False 3h42m ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
Actual results:
The default ingress controller stuck in progressing
Expected results:
The ingress controller should work as normal
Additional info:
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
NAME READY STATUS RESTARTS AGE
router-default-7cf67f448-gb7mr 0/1 Running 0 38s
router-default-7cf67f448-qmvks 0/1 Running 0 38s
router-default-7dcd556587-kvk8d 0/1 Terminating 0 3h53m
router-default-7dcd556587-vppk4 1/1 Running 0 3h53m
melvinjoseph@mjoseph-mac openshift-tests-private %
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
NAME READY STATUS RESTARTS AGE
router-default-7cf67f448-gb7mr 0/1 Running 0 111s
router-default-7cf67f448-qmvks 0/1 Running 0 111s
router-default-7dcd556587-vppk4 1/1 Running 0 3h55m
melvinjoseph@mjoseph-mac openshift-tests-private % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2023-10-16-231617 True False False 3h28m
baremetal 4.15.0-0.nightly-2023-10-16-231617 True False False 3h55m
cloud-controller-manager 4.15.0-0.nightly-2023-10-16-231617 True False False 3h58m
cloud-credential 4.15.0-0.nightly-2023-10-16-231617 True False False 3h59m
cluster-autoscaler 4.15.0-0.nightly-2023-10-16-231617 True False False 3h55m
config-operator 4.15.0-0.nightly-2023-10-16-231617 True False False 3h56m
console 4.15.0-0.nightly-2023-10-16-231617 True False False 3h34m
control-plane-machine-set 4.15.0-0.nightly-2023-10-16-231617 True False False 3h43m
csi-snapshot-controller 4.15.0-0.nightly-2023-10-16-231617 True False False 3h39m
dns 4.15.0-0.nightly-2023-10-16-231617 True False False 3h54m
etcd 4.15.0-0.nightly-2023-10-16-231617 True False False 3h47m
image-registry 4.15.0-0.nightly-2023-10-16-231617 True False False 176m
ingress 4.15.0-0.nightly-2023-10-16-231617 True True False 3h39m ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
insights 4.15.0-0.nightly-2023-10-16-231617 True False False 3h49m
kube-apiserver 4.15.0-0.nightly-2023-10-16-231617 True False False 3h45m
kube-controller-manager 4.15.0-0.nightly-2023-10-16-231617 True False False 3h46m
kube-scheduler 4.15.0-0.nightly-2023-10-16-231617 True False False 3h46m
kube-storage-version-migrator 4.15.0-0.nightly-2023-10-16-231617 True False False 3h56m
machine-api 4.15.0-0.nightly-2023-10-16-231617 True False False 3h45m
machine-approver 4.15.0-0.nightly-2023-10-16-231617 True False False 3h55m
machine-config 4.15.0-0.nightly-2023-10-16-231617 True False False 3h53m
marketplace 4.15.0-0.nightly-2023-10-16-231617 True False False 3h55m
monitoring 4.15.0-0.nightly-2023-10-16-231617 True False False 3h35m
network 4.15.0-0.nightly-2023-10-16-231617 True False False 3h57m
node-tuning 4.15.0-0.nightly-2023-10-16-231617 True False False 3h39m
openshift-apiserver 4.15.0-0.nightly-2023-10-16-231617 True False False 3h43m
openshift-controller-manager 4.15.0-0.nightly-2023-10-16-231617 True False False 3h39m
openshift-samples 4.15.0-0.nightly-2023-10-16-231617 True False False 3h39m
operator-lifecycle-manager 4.15.0-0.nightly-2023-10-16-231617 True False False 3h54m
operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-10-16-231617 True False False 3h54m
operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-10-16-231617 True False False 3h43m
service-ca 4.15.0-0.nightly-2023-10-16-231617 True False False 3h56m
storage 4.15.0-0.nightly-2023-10-16-231617 True False False 3h36m
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get po
NAME READY STATUS RESTARTS AGE
ingress-operator-c6fd989fd-jsrzv 2/2 Running 4 (3h45m ago) 3h58m
melvinjoseph@mjoseph-mac openshift-tests-private %
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator logs ingress-operator-c6fd989fd-jsrzv -c ingress-operator --tail=20
2023-10-17T11:34:54.327Z INFO operator.ingress_controller handler/enqueue_mapped.go:81 queueing ingress {"name": "default", "related": ""}
2023-10-17T11:34:54.348Z INFO operator.ingress_controller handler/enqueue_mapped.go:81 queueing ingress {"name": "default", "related": ""}
2023-10-17T11:34:54.348Z INFO operator.ingress_controller handler/enqueue_mapped.go:81 queueing ingress {"name": "default", "related": ""}
2023-10-17T11:34:54.394Z INFO operator.ingressclass_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.394Z INFO operator.route_metrics_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.394Z INFO operator.status_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.397Z INFO operator.ingress_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.429Z INFO operator.status_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.446Z INFO operator.certificate_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.553Z INFO operator.ingressclass_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.553Z INFO operator.route_metrics_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.553Z INFO operator.status_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.557Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "59m59.9999758s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
2023-10-17T11:34:54.558Z INFO operator.ingress_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.583Z INFO operator.status_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.657Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "59m59.345629987s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
2023-10-17T11:34:54.794Z INFO operator.certificate_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:36:11.151Z INFO operator.ingress_controller handler/enqueue_mapped.go:81 queueing ingress {"name": "default", "related": ""}
2023-10-17T11:36:11.151Z INFO operator.ingress_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:36:11.248Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "58m42.755479533s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
melvinjoseph@mjoseph-mac openshift-tests-private %
melvinjoseph@mjoseph-mac openshift-tests-private % oc get po -n openshift-ingress
NAME READY STATUS RESTARTS AGE
router-default-7cf67f448-gb7mr 0/1 Running 1 (71s ago) 3m57s
router-default-7cf67f448-qmvks 0/1 Running 1 (70s ago) 3m57s
router-default-7dcd556587-vppk4 1/1 Running 0 3h57m
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress logs router-default-7cf67f448-gb7mr --tail=20
I1017 11:39:22.623928 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:23.623924 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:24.623373 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:25.627359 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:26.623337 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:27.623603 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:28.623866 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:29.623183 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:30.623475 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:31.623949 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
melvinjoseph@mjoseph-mac openshift-tests-private %
melvinjoseph@mjoseph-mac openshift-tests-private %
melvinjoseph@mjoseph-mac openshift-tests-private %
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress logs router-default-7cf67f448-qmvks --tail=20
I1017 11:39:34.553475 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:35.551412 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:36.551421 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
E1017 11:39:37.052068 1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I1017 11:39:37.551648 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:38.551632 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:39.551410 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:40.552620 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:41.552050 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:42.551076 1 healthz.go:261] backend-http check failed: healthz
[-]backend-http failed: backend reported failure
I1017 11:39:42.564293 1 template.go:828] router "msg"="Shutdown requested, waiting 45s for new connections to cease"
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller
NAME AGE
default 3h59m
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml
apiVersion: operator.openshift.io/v1
<-----snip---->
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2023-10-17T07:41:42Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2023-10-17T07:57:01Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2023-10-17T07:57:01Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2023-10-17T11:34:54Z"
    message: 1/2 of replicas are available
    reason: DeploymentReplicasNotAvailable
    status: "False"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2023-10-17T11:34:54Z"
    message: |
      Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
    reason: DeploymentRollingOut
    status: "True"
    type: DeploymentRollingOut
  - lastTransitionTime: "2023-10-17T07:41:43Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2023-10-17T07:57:24Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2023-10-17T07:41:43Z"
    message: LoadBalancer is not progressing
    reason: LoadBalancerNotProgressing
    status: "False"
    type: LoadBalancerProgressing
  - lastTransitionTime: "2023-10-17T07:41:43Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2023-10-17T07:57:26Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2023-10-17T07:57:26Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2023-10-17T11:34:54Z"
    message: |-
      One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
      )
    reason: IngressControllerProgressing
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-10-17T07:57:28Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2023-10-17T07:41:43Z"
<-----snip---->
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/223
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/531
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api/pull/190
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/337
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
maxUnavailable defaults to 50% for anything under 4: https://github.com/openshift/cluster-ingress-operator/blob/master/pkg/operator/controller/ingress/poddisruptionbudget.go#L71
Based on PDB rounding logic, it always rounds up to the next whole integer, so 1.5 becomes 2.
spec:
  maxUnavailable: 50%
  selector:
    matchLabels:
      ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
currentHealthy: 3
desiredHealthy: 1
disruptionsAllowed: 2
Whereas with 4 router pods, we only allow 1 of 4 to be disrupted at a time.
Version-Release number of selected component (if applicable):
4.x
How reproducible:
Always
Steps to Reproduce:
1. Set 3 replicas
2. Look at the disruptionsAllowed on the PDB
Actual results:
You can take down 2 of 3 routers at once, leaving no HA.
Expected results:
With 3+ routers, we should always ensure 2 are up with the PDB.
Additional info:
Reduce the maxUnavailable to 25% for >= 3 pods instead of 4
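The arithmetic can be checked with the k8s.io/apimachinery intstr helper, which to our understanding matches the PDB rounding behaviour (treat this as a worked example, not the operator's code): a percentage maxUnavailable is scaled against the pod count and rounded up.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	cases := []struct {
		maxUnavailable string
		pods           int
	}{
		{"50%", 3}, // current default for <4 pods: ceil(1.5) = 2 disruptions
		{"25%", 3}, // proposed: ceil(0.75) = 1 disruption, preserving HA
		{"25%", 4}, // current default for >=4 pods: ceil(1.0) = 1 disruption
	}
	for _, c := range cases {
		v := intstr.FromString(c.maxUnavailable)
		allowed, err := intstr.GetScaledValueFromIntOrPercent(&v, c.pods, true /* round up */)
		if err != nil {
			panic(err)
		}
		fmt.Printf("maxUnavailable=%s pods=%d -> disruptionsAllowed=%d\n", c.maxUnavailable, c.pods, allowed)
	}
}
```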
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Allow eviction of unhealthy (not ready) pods even if there are no disruptions allowed on a PodDisruptionBudget. This can help to drain/maintain a node and recover without manual intervention when multiple instances of nodes or pods are misbehaving.
This is to prevent possible issues similar to https://issues.redhat.com//browse/OCPBUGS-23796
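For reference, a minimal sketch of the upstream policy/v1 setting this card asks for (the names and selector here are illustrative): unhealthyPodEvictionPolicy: AlwaysAllow lets not-ready pods be evicted even when disruptionsAllowed is 0, so a drain can proceed.

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	maxUnavailable := intstr.FromString("25%")
	alwaysAllow := policyv1.AlwaysAllow

	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pdb", Namespace: "example"}, // hypothetical
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "example"},
			},
			// Unhealthy (not ready) pods may always be evicted.
			UnhealthyPodEvictionPolicy: &alwaysAllow,
		},
	}
	fmt.Printf("%+v\n", pdb.Spec)
}
```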
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/17
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/192
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In https://issues.redhat.com/browse/OCPBUGS-24195 Lukasz is working on a solution to a problem both the auth and apiserver operators have where a large number of identical kube events can be emitted. The kube apiserver was granted an exception here, but the linked bug was never fixed.
These OpenShiftAPICheckFailed events are reportedly originating during bootstrap, and if bootstrap takes too long many can be emitted, which can trip a test that watches for this sort of thing.
Ideally the problem should be fixed, and it sounds like Lukasz is on the path to a solution which we hope could be used for the apiserver operator as well (start a controller monitoring the aggregated API only after the bootstrap is complete).
Fix here would hopefully be to leverage what comes out of OCPBUGS-24195, apply it for the apiserver operator, and then remove the exception linked above in origin.
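A rough sketch of the proposed sequencing (details assumed; the kube-system/bootstrap ConfigMap with a "status: complete" entry is the conventional completion signal, but the actual mechanism may differ): block the aggregated-API monitoring controller until bootstrap reports complete.

```go
package main

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func waitForBootstrapComplete(ctx context.Context, client kubernetes.Interface) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			cm, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx, "bootstrap", metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return false, nil // bootstrap still in progress
			}
			if err != nil {
				return false, nil // transient API errors: keep polling
			}
			return cm.Data["status"] == "complete", nil
		})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // hypothetical path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForBootstrapComplete(context.Background(), client); err != nil {
		panic(err)
	}
	// ...only now start the controller that watches the aggregated API,
	// so no OpenShiftAPICheckFailed events are emitted during bootstrap.
}
```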
Description of problem:
Tests like lint and vet used to be run within a container engine by default if an engine was detected, both locally and in CI. Up until now no container engine was detected in CI, so tests would run natively there. Now that the base image we use in CI has started shipping `podman`, a container engine is detected by default and tests are run within podman by default. But creating nested containers doesn't work in CI at the moment and thus results in a test failure. As such, we are switching the default behaviour for tests (both locally and in CI): by default no container engine is used to run tests, even if one is detected; instead, tests are run natively unless otherwise specified.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
All the apiservers:
Expose both `apiserver_request_slo_duration_seconds` and `apiserver_request_sli_duration_seconds`. The SLI metric was introduced in Kubernetes 1.26 as a replacement for `apiserver_request_slo_duration_seconds`, which was deprecated in Kubernetes 1.27. This change is only a renaming, so both metrics expose the same data. To avoid storing duplicated data in Prometheus, we need to drop `apiserver_request_slo_duration_seconds` in favor of `apiserver_request_sli_duration_seconds`.
Description of problem:
While checking the bug https://issues.redhat.com/browse/OCPBUGS-15976, we found that the default ingresscontroller's DNSReady condition is True even though the DNS records failed to be published to the public zone, and co/ingress doesn't report any error.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-05-191022
How reproducible:
100%
Steps to Reproduce:
1. Install an Azure cluster configured for manual mode with Azure Workload Identity.
2. Check the dnsrecords of default-wildcard:
   $ oc -n openshift-ingress-operator get dnsrecords default-wildcard -oyaml
   <---snip--->
   - conditions:
     - lastTransitionTime: "2023-07-10T04:23:55Z"
       message: 'The DNS provider failed to ensure the record: failed to update dns
       ......
       reason: ProviderError
       status: "False"
       type: Published
     dnsZone:
       id: /subscriptions/xxxxx/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/qe.azure.devcluster.openshift.com
3. Check the ingresscontroller status:
   $ oc -n openshift-ingress-operator get ingresscontroller default -oyaml
   <---snip--->
   - lastTransitionTime: "2023-07-10T04:23:55Z"
     message: The record is provisioned in all reported zones.
     reason: NoFailedZones
     status: "True"
     type: DNSReady
4. Check the co/ingress status:
   $ oc get co/ingress
   NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
   ingress 4.14.0-0.nightly-2023-07-05-191022 True False False 127m
Actual results:
1. DNSReady is True and the message shows: The record is provisioned in all reported zones.
2. co/ingress doesn't report any error
Expected results:
DNSReady should be False since failed to publish to public zone
Additional info:
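An illustrative sketch (not the ingress operator's code) of the status aggregation the bug asks for: DNSReady should be False when any reported zone has Published=False, instead of reporting "NoFailedZones".

```go
package main

import "fmt"

type zoneCondition struct {
	Zone      string
	Published bool
	Message   string
}

// dnsReady is a simplified stand-in for the operator's aggregation logic:
// a single failed zone should make the overall condition False.
func dnsReady(zones []zoneCondition) (bool, string) {
	if len(zones) == 0 {
		return false, "no zones reported"
	}
	for _, z := range zones {
		if !z.Published {
			return false, fmt.Sprintf("zone %s: %s", z.Zone, z.Message)
		}
	}
	return true, "the record is provisioned in all reported zones"
}

func main() {
	zones := []zoneCondition{
		{Zone: "private", Published: true},
		{Zone: "public", Published: false, Message: "the DNS provider failed to ensure the record"},
	}
	ready, msg := dnsReady(zones)
	fmt.Println(ready, msg) // false: the failed public zone surfaces
}
```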
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Various jobs are failing in e2e-gcp-operator due to the LoadBalancer-type Service not going "ready", which most likely means it is not getting an IP address. Tests affected so far are:
- TestUnmanagedDNSToManagedDNSInternalIngressController
- TestScopeChange
- TestInternalLoadBalancerGlobalAccessGCP
- TestInternalLoadBalancer
- TestAllowedSourceRanges
For example, in TestInternalLoadBalancer, the Load Balancer never comes back ready:
operator_test.go:1454: Expected conditions: map[Admitted:True Available:True DNSManaged:True DNSReady:True LoadBalancerManaged:True LoadBalancerReady:True]
Current conditions: map[Admitted:True Available:False DNSManaged:True DNSReady:False Degraded:True DeploymentAvailable:True DeploymentReplicasAllAvailable:True DeploymentReplicasMinAvailable:True DeploymentRollingOut:False EvaluationConditionsDetected:False LoadBalancerManaged:True LoadBalancerProgressing:False LoadBalancerReady:False Progressing:False Upgradeable:True]
Note DNSReady:False and LoadBalancerReady:False.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
10% of the time
Steps to Reproduce:
1. Run e2e-gcp-operator many times until you see one of these failures
Actual results:
Test Failure
Expected results:
No test failure.
Additional info:
Search.CI Links:
TestScopeChange
TestInternalLoadBalancerGlobalAccessGCP & TestInternalLoadBalancer
This does not seem related to https://issues.redhat.com/browse/OCPBUGS-6013. The DNS E2E tests actually pass this same condition check.
CI is flaky because the TestClientTLS test fails.
I have seen these failures in 4.13 and 4.14 CI jobs.
Presently, search.ci reports the following stats for the past 14 days:
Found in 16.07% of runs (20.93% of failures) across 56 total runs and 13 jobs (76.79% failed) in 185ms
1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestClientTLS&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.
The test fails:
=== RUN TestAll/parallel/TestClientTLS === PAUSE TestAll/parallel/TestClientTLS === CONT TestAll/parallel/TestClientTLS === CONT TestAll/parallel/TestClientTLS stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [8 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [313 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [313 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:56:24 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [799 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:56:24 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [802 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:56:25 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=104beed63d6a19782a5559400bd972b6; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact stdout: 000 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [799 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS alert, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS alert, unknown CA (560): { [2 bytes data] * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0 * Closing connection 0 curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0 === CONT TestAll/parallel/TestClientTLS stdout: 000 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [8 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS alert, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS alert, unknown (628): { [2 bytes data] * OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0 * Closing connection 0 curl: (56) OpenSSL SSL_read: error:1409445C:SSL routines:ssl3_read_bytes:tlsv13 alert certificate required, errno 0 === CONT TestAll/parallel/TestClientTLS stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [799 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:57:00 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=683c60a6110214134bed475edc895cb9; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact === CONT TestAll/parallel/TestClientTLS stdout: Healthcheck requested 200 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [802 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): { [1097 bytes data] * TLSv1.3 (IN), TLS app data, [no content] (0): { [1 bytes data] < HTTP/1.1 200 OK < x-request-port: 8080 < date: Wed, 22 Mar 2023 18:57:00 GMT < content-length: 22 < content-type: text/plain; charset=utf-8 < set-cookie: c6e529a6ab19a530fd4f1cceb91c08a9=eb40064e54af58007f579a6c82f2bcd7; path=/; HttpOnly; Secure; SameSite=None < cache-control: private < { [22 bytes data] * Connection #0 to host canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com left intact === CONT TestAll/parallel/TestClientTLS stdout: 000 stderr: * Added canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com:443:172.30.53.236 to DNS cache * Rebuilt URL to: https://canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com/ * Hostname canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com was found in DNS cache * Trying 172.30.53.236... * TCP_NODELAY set % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed * ALPN, offering h2 * ALPN, offering http/1.1 * successfully set certificate verify locations: * CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none } [5 bytes data] * TLSv1.3 (OUT), TLS handshake, Client hello (1): } [512 bytes data] * TLSv1.3 (IN), TLS handshake, Server hello (2): { [122 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8): { [10 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Request CERT (13): { [82 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Certificate (11): { [1763 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, CERT verify (15): { [264 bytes data] * TLSv1.3 (IN), TLS handshake, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS handshake, Finished (20): { [36 bytes data] * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Certificate (11): } [799 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, CERT verify (15): } [264 bytes data] * TLSv1.3 (OUT), TLS handshake, [no content] (0): } [1 bytes data] * TLSv1.3 (OUT), TLS handshake, Finished (20): } [36 bytes data] * SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 * ALPN, server did not agree to a protocol * Server certificate: * subject: CN=*.client-tls.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com * start date: Mar 22 18:55:46 2023 GMT * expire date: Mar 21 18:55:47 2025 GMT * issuer: CN=ingress-operator@1679509964 * SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway. 
} [5 bytes data] * TLSv1.3 (OUT), TLS app data, [no content] (0): } [1 bytes data] > GET / HTTP/1.1 > Host: canary-openshift-ingress-canary.apps.ci-op-21xplx9n-43abb.origin-ci-int-aws.dev.rhcloud.com > User-Agent: curl/7.61.1 > Accept: */* > { [5 bytes data] * TLSv1.3 (IN), TLS alert, [no content] (0): { [1 bytes data] * TLSv1.3 (IN), TLS alert, unknown CA (560): { [2 bytes data] * OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0 * Closing connection 0 curl: (56) OpenSSL SSL_read: error:14094418:SSL routines:ssl3_read_bytes:tlsv1 alert unknown ca, errno 0 === CONT TestAll/parallel/TestClientTLS --- FAIL: TestAll (1538.53s) --- FAIL: TestAll/parallel (0.00s) --- FAIL: TestAll/parallel/TestClientTLS (123.10s)
CI passes, or it fails on a different test.
I saw that TestClientTLS failed on the test case with no client certificate and ClientCertificatePolicy set to "Required". My best guess is that the test is racy and is hitting a terminating router pod. The test uses waitForDeploymentComplete to wait until all new pods are available, but perhaps waitForDeploymentComplete should also wait until all old pods are terminated.
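As an illustrative check of that theory (not part of the test suite), listing each router pod alongside its deletionTimestamp shows whether an old pod is still terminating while the test probes the route; the label selector below assumes the default IngressController:

$ oc -n openshift-ingress get pods \
    -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.deletionTimestamp}{"\n"}{end}'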
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/112
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
It would make debugging easier if we included the namespace in the message for these alerts: https://github.com/openshift/cluster-ingress-operator/blob/master/manifests/0000_90_ingress-operator_03_prometheusrules.yaml#L69
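For illustration, the fix could be as small as interpolating the namespace label into each alert's message template, along these lines (the alert name and expression below are placeholders, not the actual rules):

- alert: ExampleIngressAlert        # placeholder
  expr: haproxy_up == 0             # placeholder
  annotations:
    message: "HAProxy is down for pod {{ $labels.namespace }}/{{ $labels.pod }}"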
Version-Release number of selected component (if applicable):
4.12.x
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
No namespace in the alert message
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Update the OWNERS file in the openshift-state-metrics repository: add new teammates and move old teammates out.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1020
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Create a private Shared VPC cluster on AWS; the Ingress operator is degraded due to the following error:
2023-06-14T09:55:50.240Z INFO operator.dns_controller controller/controller.go:118 reconciling {"request": {"name":"default-wildcard","namespace":"openshift-ingress-operator"}}
2023-06-14T09:55:50.363Z ERROR operator.dns_controller dns/controller.go:354 failed to publish DNS record to zone {"record": {"dnsName":"*.apps.ci-op-2x6lics3-849ce.qe.devcluster.openshift.com.","targets":["internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com"],"recordType":"CNAME","recordTTL":30,"dnsManagementPolicy":"Managed"}, "dnszone": {"id":"Z0698684SM2RRJSYHP43"}, "error": "failed to get hosted zone for load balancer target \"internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com\": couldn't find hosted zone ID of ELB internal-ac656ce4d29f64da289152053f50c908-1642793317.us-east-1.elb.amazonaws.com"}
ingress operator:
ingress   False   True   True   37m   The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DNSReady=False (FailedZones: The record failed to provision in some zones: [{Z0698684SM2RRJSYHP43 map[]}])
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-13-223353
How reproducible:
always
Steps to Reproduce:
1. Create a private Shared VPC cluster on AWS using STS
Actual results:
ingress operator degraded
Expected results:
cluster is healthy
Additional info:
A public cluster has no such issue.
The flowcontrol manifests in the following operators (kube-apiserver, openshift-apiserver, etcd, openshift-controller-manager, auth, and network) should use v1.
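A minimal sketch of the manifest change, using a placeholder FlowSchema (the real objects live in each operator's manifests):

apiVersion: flowcontrol.apiserver.k8s.io/v1   # was flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: example-flowschema                    # placeholder
spec:
  priorityLevelConfiguration:
    name: workload-low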
We saw this in an OKD job:
https://github.com/openshift/machine-config-operator/pull/3358#issuecomment-1267532305
It's simple to reproduce, from say a current RHCOS 4.12 doing:
[root@cosa-devsh ~]# podman run --privileged --pid=host --net=host --rm -v /:/run/host quay.io/fedora/fedora-coreos:testing-devel "rpm-ostree" "ex" "deploy-from-self" "/run/host" NOTICE: Experimental commands are subject to change. error: Writing content object: Setting xattrs: fsetxattr(security.selinux): Invalid argument
I've tried doing `--security-opt label=type:unconfined_t` which gives the same error (of course), but using `install_t` I get:
[root@cosa-devsh ~]# podman run --privileged --security-opt label=type:install_t --pid=host --net=host --rm -v /:/run/host quay.io/fedora/fedora-coreos:testing-devel "rpm-ostree" "ex" "deploy-from-self" "/run/host" exec /usr/bin/rpm-ostree: permission denied [root@cosa-devsh ~]#
I'm really tempted to just `setenforce 0` for the first OS update...
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/39
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When the ingress operator creates or updates a router deployment that specifies spec.template.spec.hostNetwork: true, the operator does not set spec.template.spec.containers[*].ports[*].hostPort. As a result, the API sets each port's hostPort field to the port's containerPort field value. The operator detects this as an external update and attempts to revert it. The operator should not update the deployment in response to API defaulting.
I observed this in CI for OCP 4.14 and was able to reproduce the issue on OCP 4.11.37. The problematic code was added in https://github.com/openshift/cluster-ingress-operator/pull/694/commits/af653f9fa7368cf124e11b7ea4666bc40e601165 in OCP 4.11 to implement NE-674.
Easily.
1. Create an IngressController that specifies the "HostNetwork" endpoint publishing strategy type:
oc create -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: example-hostnetwork
  namespace: openshift-ingress-operator
spec:
  domain: example.xyz
  endpointPublishingStrategy:
    type: HostNetwork
EOF
2. Check the ingress operator's logs:
oc -n openshift-ingress-operator logs -c ingress-operator deployments/ingress-operator
The ingress operator logs "updated router deployment" multiple times for the "example-hostnetwork" IngressController, such as the following:
2023-06-15T02:11:47.229Z INFO operator.ingress_controller ingress/deployment.go:131 updated router deployment {"namespace": "openshift-ingress", "name": "router-example-hostnetwork", "diff": " &v1.Deployment{\n \tTypeMeta: {},\n \tObjectMeta: {Name: \"router-example-hostnetwork\", Namespace: \"openshift-ingress\", UID: \"d7c51022-460e-4962-8521-e00255f649c3\", ResourceVersion: \"3356177\", ...},\n \tSpec: v1.DeploymentSpec{\n \t\tReplicas: &2,\n \t\tSelector: &{MatchLabels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"example-hostnetwork\"}},\n \t\tTemplate: v1.PodTemplateSpec{\n \t\t\tObjectMeta: {Labels: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"example-hostnetwork\", \"ingresscontroller.operator.openshift.io/hash\": \"b7c697fd\"}, Annotations: {\"target.workload.openshift.io/management\": `{\"effect\": \"PreferredDuringScheduling\"}`, \"unsupported.do-not-use.openshift.io/override-liveness-grace-period-seconds\": \"10\"}},\n \t\t\tSpec: v1.PodSpec{\n \t\t\t\tVolumes: []v1.Volume{\n \t\t\t\t\t{Name: \"default-certificate\", VolumeSource: {Secret: &{SecretName: \"router-certs-example-hostnetwork\", DefaultMode: &420}}},\n \t\t\t\t\t{\n \t\t\t\t\t\tName: \"service-ca-bundle\",\n \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n \t\t\t\t\t\t\t... // 16 identical fields\n \t\t\t\t\t\t\tFC: nil,\n \t\t\t\t\t\t\tAzureFile: nil,\n \t\t\t\t\t\t\tConfigMap: &v1.ConfigMapVolumeSource{\n \t\t\t\t\t\t\t\tLocalObjectReference: {Name: \"service-ca-bundle\"},\n \t\t\t\t\t\t\t\tItems: {{Key: \"service-ca.crt\", Path: \"service-ca.crt\"}},\n- \t\t\t\t\t\t\t\tDefaultMode: &420,\n+ \t\t\t\t\t\t\t\tDefaultMode: nil,\n \t\t\t\t\t\t\t\tOptional: &false,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tVsphereVolume: nil,\n \t\t\t\t\t\t\tQuobyte: nil,\n \t\t\t\t\t\t\t... // 8 identical fields\n \t\t\t\t\t\t},\n \t\t\t\t\t},\n \t\t\t\t\t{\n \t\t\t\t\t\tName: \"stats-auth\",\n \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n \t\t\t\t\t\t\t... // 3 identical fields\n \t\t\t\t\t\t\tAWSElasticBlockStore: nil,\n \t\t\t\t\t\t\tGitRepo: nil,\n \t\t\t\t\t\t\tSecret: &v1.SecretVolumeSource{\n \t\t\t\t\t\t\t\tSecretName: \"router-stats-example-hostnetwork\",\n \t\t\t\t\t\t\t\tItems: nil,\n- \t\t\t\t\t\t\t\tDefaultMode: &420,\n+ \t\t\t\t\t\t\t\tDefaultMode: nil,\n \t\t\t\t\t\t\t\tOptional: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tNFS: nil,\n \t\t\t\t\t\t\tISCSI: nil,\n \t\t\t\t\t\t\t... // 21 identical fields\n \t\t\t\t\t\t},\n \t\t\t\t\t},\n \t\t\t\t\t{\n \t\t\t\t\t\tName: \"metrics-certs\",\n \t\t\t\t\t\tVolumeSource: v1.VolumeSource{\n \t\t\t\t\t\t\t... // 3 identical fields\n \t\t\t\t\t\t\tAWSElasticBlockStore: nil,\n \t\t\t\t\t\t\tGitRepo: nil,\n \t\t\t\t\t\t\tSecret: &v1.SecretVolumeSource{\n \t\t\t\t\t\t\t\tSecretName: \"router-metrics-certs-example-hostnetwork\",\n \t\t\t\t\t\t\t\tItems: nil,\n- \t\t\t\t\t\t\t\tDefaultMode: &420,\n+ \t\t\t\t\t\t\t\tDefaultMode: nil,\n \t\t\t\t\t\t\t\tOptional: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tNFS: nil,\n \t\t\t\t\t\t\tISCSI: nil,\n \t\t\t\t\t\t\t... // 21 identical fields\n \t\t\t\t\t\t},\n \t\t\t\t\t},\n \t\t\t\t},\n \t\t\t\tInitContainers: nil,\n \t\t\t\tContainers: []v1.Container{\n \t\t\t\t\t{\n \t\t\t\t\t\t... 
// 3 identical fields\n \t\t\t\t\t\tArgs: nil,\n \t\t\t\t\t\tWorkingDir: \"\",\n \t\t\t\t\t\tPorts: []v1.ContainerPort{\n \t\t\t\t\t\t\t{\n \t\t\t\t\t\t\t\tName: \"http\",\n- \t\t\t\t\t\t\t\tHostPort: 80,\n+ \t\t\t\t\t\t\t\tHostPort: 0,\n \t\t\t\t\t\t\t\tContainerPort: 80,\n \t\t\t\t\t\t\t\tProtocol: \"TCP\",\n \t\t\t\t\t\t\t\tHostIP: \"\",\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t{\n \t\t\t\t\t\t\t\tName: \"https\",\n- \t\t\t\t\t\t\t\tHostPort: 443,\n+ \t\t\t\t\t\t\t\tHostPort: 0,\n \t\t\t\t\t\t\t\tContainerPort: 443,\n \t\t\t\t\t\t\t\tProtocol: \"TCP\",\n \t\t\t\t\t\t\t\tHostIP: \"\",\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t{\n \t\t\t\t\t\t\t\tName: \"metrics\",\n- \t\t\t\t\t\t\t\tHostPort: 1936,\n+ \t\t\t\t\t\t\t\tHostPort: 0,\n \t\t\t\t\t\t\t\tContainerPort: 1936,\n \t\t\t\t\t\t\t\tProtocol: \"TCP\",\n \t\t\t\t\t\t\t\tHostIP: \"\",\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t},\n \t\t\t\t\t\tEnvFrom: nil,\n \t\t\t\t\t\tEnv: {{Name: \"DEFAULT_CERTIFICATE_DIR\", Value: \"/etc/pki/tls/private\"}, {Name: \"DEFAULT_DESTINATION_CA_PATH\", Value: \"/var/run/configmaps/service-ca/service-ca.crt\"}, {Name: \"RELOAD_INTERVAL\", Value: \"5s\"}, {Name: \"ROUTER_ALLOW_WILDCARD_ROUTES\", Value: \"false\"}, ...},\n \t\t\t\t\t\tResources: {Requests: {s\"cpu\": {i: {...}, s: \"100m\", Format: \"DecimalSI\"}, s\"memory\": {i: {...}, Format: \"BinarySI\"}}},\n \t\t\t\t\t\tVolumeMounts: {{Name: \"default-certificate\", ReadOnly: true, MountPath: \"/etc/pki/tls/private\"}, {Name: \"service-ca-bundle\", ReadOnly: true, MountPath: \"/var/run/configmaps/service-ca\"}, {Name: \"stats-auth\", ReadOnly: true, MountPath: \"/var/lib/haproxy/conf/metrics-auth\"}, {Name: \"metrics-certs\", ReadOnly: true, MountPath: \"/etc/pki/tls/metrics-certs\"}},\n \t\t\t\t\t\tVolumeDevices: nil,\n \t\t\t\t\t\tLivenessProbe: &v1.Probe{\n \t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n \t\t\t\t\t\t\t\tExec: nil,\n \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n \t\t\t\t\t\t\t\t\tPath: \"/healthz\",\n \t\t\t\t\t\t\t\t\tPort: {IntVal: 1936},\n \t\t\t\t\t\t\t\t\tHost: \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme: \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme: \"\",\n \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n \t\t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t\tTCPSocket: nil,\n \t\t\t\t\t\t\t\tGRPC: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tInitialDelaySeconds: 0,\n \t\t\t\t\t\t\tTimeoutSeconds: 1,\n- \t\t\t\t\t\t\tPeriodSeconds: 10,\n+ \t\t\t\t\t\t\tPeriodSeconds: 0,\n- \t\t\t\t\t\t\tSuccessThreshold: 1,\n+ \t\t\t\t\t\t\tSuccessThreshold: 0,\n- \t\t\t\t\t\t\tFailureThreshold: 3,\n+ \t\t\t\t\t\t\tFailureThreshold: 0,\n \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n \t\t\t\t\t\t},\n \t\t\t\t\t\tReadinessProbe: &v1.Probe{\n \t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n \t\t\t\t\t\t\t\tExec: nil,\n \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n \t\t\t\t\t\t\t\t\tPath: \"/healthz/ready\",\n \t\t\t\t\t\t\t\t\tPort: {IntVal: 1936},\n \t\t\t\t\t\t\t\t\tHost: \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme: \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme: \"\",\n \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n \t\t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t\tTCPSocket: nil,\n \t\t\t\t\t\t\t\tGRPC: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tInitialDelaySeconds: 0,\n \t\t\t\t\t\t\tTimeoutSeconds: 1,\n- \t\t\t\t\t\t\tPeriodSeconds: 10,\n+ \t\t\t\t\t\t\tPeriodSeconds: 0,\n- \t\t\t\t\t\t\tSuccessThreshold: 1,\n+ \t\t\t\t\t\t\tSuccessThreshold: 0,\n- \t\t\t\t\t\t\tFailureThreshold: 3,\n+ \t\t\t\t\t\t\tFailureThreshold: 0,\n \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n \t\t\t\t\t\t},\n \t\t\t\t\t\tStartupProbe: &v1.Probe{\n 
\t\t\t\t\t\t\tProbeHandler: v1.ProbeHandler{\n \t\t\t\t\t\t\t\tExec: nil,\n \t\t\t\t\t\t\t\tHTTPGet: &v1.HTTPGetAction{\n \t\t\t\t\t\t\t\t\tPath: \"/healthz/ready\",\n \t\t\t\t\t\t\t\t\tPort: {IntVal: 1936},\n \t\t\t\t\t\t\t\t\tHost: \"localhost\",\n- \t\t\t\t\t\t\t\t\tScheme: \"HTTP\",\n+ \t\t\t\t\t\t\t\t\tScheme: \"\",\n \t\t\t\t\t\t\t\t\tHTTPHeaders: nil,\n \t\t\t\t\t\t\t\t},\n \t\t\t\t\t\t\t\tTCPSocket: nil,\n \t\t\t\t\t\t\t\tGRPC: nil,\n \t\t\t\t\t\t\t},\n \t\t\t\t\t\t\tInitialDelaySeconds: 0,\n \t\t\t\t\t\t\tTimeoutSeconds: 1,\n \t\t\t\t\t\t\tPeriodSeconds: 1,\n- \t\t\t\t\t\t\tSuccessThreshold: 1,\n+ \t\t\t\t\t\t\tSuccessThreshold: 0,\n \t\t\t\t\t\t\tFailureThreshold: 120,\n \t\t\t\t\t\t\tTerminationGracePeriodSeconds: nil,\n \t\t\t\t\t\t},\n \t\t\t\t\t\tLifecycle: nil,\n \t\t\t\t\t\tTerminationMessagePath: \"/dev/termination-log\",\n \t\t\t\t\t\t... // 6 identical fields\n \t\t\t\t\t},\n \t\t\t\t},\n \t\t\t\tEphemeralContainers: nil,\n \t\t\t\tRestartPolicy: \"Always\",\n \t\t\t\t... // 31 identical fields\n \t\t\t},\n \t\t},\n \t\tStrategy: {Type: \"RollingUpdate\", RollingUpdate: &{MaxUnavailable: &{Type: 1, StrVal: \"25%\"}, MaxSurge: &{}}},\n \t\tMinReadySeconds: 30,\n \t\t... // 3 identical fields\n \t},\n \tStatus: {ObservedGeneration: 1, Replicas: 2, UpdatedReplicas: 2, UnavailableReplicas: 2, ...},\n }\n"}
Note the following in the diff:
Ports: []v1.ContainerPort{
	{
		Name:          "http",
-		HostPort:      80,
+		HostPort:      0,
		ContainerPort: 80,
		Protocol:      "TCP",
		HostIP:        "",
	},
	{
		Name:          "https",
-		HostPort:      443,
+		HostPort:      0,
		ContainerPort: 443,
		Protocol:      "TCP",
		HostIP:        "",
	},
	{
		Name:          "metrics",
-		HostPort:      1936,
+		HostPort:      0,
		ContainerPort: 1936,
		Protocol:      "TCP",
		HostIP:        "",
	},
},
The operator should ignore updates by the API that only set default values. The operator should not perform these unnecessary updates to the router deployment.
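As an illustration of the defaulting at play (not the operator's actual manifest), populating hostPort explicitly to match containerPort when hostNetwork is true leaves nothing for the API server to default, so no spurious diff arises:

ports:
- name: http
  containerPort: 80
  hostPort: 80     # what the API server defaults when hostNetwork: true
  protocol: TCP
- name: https
  containerPort: 443
  hostPort: 443
  protocol: TCP
- name: metrics
  containerPort: 1936
  hostPort: 1936
  protocol: TCP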
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/203
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The image quay.io/centos7/httpd-24-centos7 used in TestMTLSWithCRLs and TestCRLUpdate is no longer being rebuilt, and has had its 'latest' tag removed. Containers using this image fail to start, and cause the tests to fail.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
Run 'TEST="(TestMTLSWithCRLs|TestCRLUpdate)" make test-e2e' from the cluster-ingress-operator repo
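As a quick, illustrative confirmation that the tag is gone, this pull is expected to fail:

podman pull quay.io/centos7/httpd-24-centos7:latest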
Actual results:
Both tests and all their subtests fail
Expected results:
Tests pass
Additional info:
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/977
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/548
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Install a private cluster using Azure workload identity; the installation failed because no worker machines were provisioned.
install-config:
----------------------
platform:
  azure:
    region: eastus
    networkResourceGroupName: jima971b-12015319-rg
    virtualNetwork: jima971b-vnet
    controlPlaneSubnet: jima971b-master-subnet
    computeSubnet: jima971b-worker-subnet
    resourceGroupName: jima971b-rg
publish: Internal
credentialsMode: Manual
A detailed check on the cluster found that the machine-api/ingress/image-registry operators reported permissions issues and have no access to the customer vnet.
$ oc get machine -n openshift-machine-api
NAME                                  PHASE     TYPE              REGION   ZONE   AGE
jima971b-qqjb7-master-0               Running   Standard_D8s_v3   eastus   2      5h14m
jima971b-qqjb7-master-1               Running   Standard_D8s_v3   eastus   3      5h14m
jima971b-qqjb7-master-2               Running   Standard_D8s_v3   eastus   1      5h15m
jima971b-qqjb7-worker-eastus1-mtc47   Failed                                      4h52m
jima971b-qqjb7-worker-eastus2-ph8bk   Failed                                      4h52m
jima971b-qqjb7-worker-eastus3-hpmvj   Failed                                      4h52m
Errors on the worker machine:
--------------------
errorMessage: 'failed to reconcile machine "jima971b-qqjb7-worker-eastus1-mtc47": network.SubnetsClient#Get: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client ''705eb743-7c91-4a16-a7cf-97164edc0341'' with object id ''705eb743-7c91-4a16-a7cf-97164edc0341'' does not have authorization to perform action ''Microsoft.Network/virtualNetworks/subnets/read'' over scope ''/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima971b-12015319-rg/providers/Microsoft.Network/virtualNetworks/jima971b-vnet/subnets/jima971b-worker-subnet'' or the scope is invalid. If access was recently granted, please refresh your credentials."'
errorReason: InvalidConfiguration
After manually creating a customer role with the missing permissions for machine-api/ingress/cloud-controller-manager/image-registry and assigning it to the machine-api/ingress/cloud-controller-manager/image-registry user-assigned identities on the scope of the customer vnet, the cluster recovered and became running.
Permissions for machine-api/cloud-controller-manager/ingress on the customer vnet:
"Microsoft.Network/virtualNetworks/subnets/read",
"Microsoft.Network/virtualNetworks/subnets/join/action"
Permissions for image-registry on the customer vnet:
"Microsoft.Network/virtualNetworks/subnets/read",
"Microsoft.Network/virtualNetworks/subnets/join/action"
"Microsoft.Network/virtualNetworks/join/action"
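For reference, the manual recovery above can be scripted with the Azure CLI; a sketch assuming inline JSON with the permissions listed (role name and scopes are placeholders):

az role definition create --role-definition '{
  "Name": "ocp-customer-vnet-access",
  "IsCustom": true,
  "Description": "Subnet read/join for OCP operators on the customer vnet",
  "Actions": [
    "Microsoft.Network/virtualNetworks/subnets/read",
    "Microsoft.Network/virtualNetworks/subnets/join/action",
    "Microsoft.Network/virtualNetworks/join/action"
  ],
  "AssignableScopes": ["/subscriptions/<subscription-id>/resourceGroups/<network-rg>"]
}'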
Version-Release number of selected component (if applicable):
4.15 nightly build
How reproducible:
always on recent 4.15 payload
Steps to Reproduce:
1. Prepare an install-config with the private cluster configuration + credentialsMode: Manual
2. Use the ccoctl tool to create the workload identity
3. Install the cluster
Actual results:
Installation failed due to permission issues
Expected results:
ccoctl should also assign the customer role to the machine-api/ccm/image-registry user-assigned identities on the scope of the customer vnet if one is configured in the install-config.
Additional info:
The issue is only detected on 4.15; it works on 4.14.
Description of problem:
When running through the instructions for a smoke test on 4.14, the DNS record for the Gateway is created incorrectly: it is missing a trailing dot in the dnsName.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run through the steps in https://github.com/openshift/network-edge-tools/blob/2fd044d110eb737c94c8b86ea878a130cae0d03e/docs/blogs/EnhancedDevPreviewGatewayAPI/GettingStarted.md until the step "oc get dnsrecord -n openshift-ingress"
2. Check the status of the DNS record: "oc get dnsrecord xxx -n openshift-ingress -ojson | jq .status.zones[].conditions"
Actual results:
The status shows error conditions with a message like 'The DNS provider failed to ensure the record: googleapi: Error 400: Invalid value for ''entity.change.additions[*.gwapi.apps.ci-ln-3vxsgxb-72292.origin-ci-int-gce.dev.rhcloud.com][A].name'': ''*.gwapi.apps.ci-ln-3vxsgxb-72292.origin-ci-int-gce.dev.rhcloud.com'', invalid'
Expected results:
The status of the DNS record should show a successful publishing of the record.
Additional info:
Backport to 4.13.z
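For illustration, the published record should carry a fully qualified name, i.e. with the trailing dot, in the DNSRecord spec (the domain below is a placeholder):

spec:
  dnsName: "*.gwapi.apps.example.devcluster.openshift.com."   # note the trailing dot
  recordType: A
  recordTTL: 30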
Since HyperShift / Hosted Control Planes have adopted include.release.openshift.io/ibm-cloud-managed to tailor the resources of clusters running in the ROKS IBM environment, adding include.release.openshift.io/hypershift will allow HyperShift to express different profile choices than ROKS.
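For illustration, a manifest opted into both profiles would carry annotations of this shape (the surrounding resource is a placeholder):

metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/hypershift: "true"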
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/218
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The CRL list is capped at 1MB due to the configmap max size. If multiple public CRLs are needed for an ingress controller, the CRL PEM file will exceed 1MB.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a CRL configmap with the following distribution points:
Issuer: C=US, O=DigiCert Inc, CN=DigiCert Global G2 TLS RSA SHA256 2020 CA1
Subject: SOME SIGNED CERT
X509v3 CRL Distribution Points:
    Full Name:
        URI:http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.cr
# curl -o DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl http://crl3.digicert.com/DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl
# openssl crl -in DigiCertGlobalG2TLSRSASHA2562020CA1-2.crl -inform DER -out DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
# du -bsh DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
604K DigiCertGlobalG2TLSRSASHA2562020CA1-2.pem
I still need to find more intermediate CRLs to grow this.
Actual results:
2023-01-25T13:45:01.443Z ERROR operator.init controller/controller.go:273 Reconciler error {"controller": "crl", "object": {"name":"custom","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "custom", "reconcileID": "d49d9b96-d509-4562-b3d9-d4fc315226c0", "error": "failed to ensure client CA CRL configmap for ingresscontroller openshift-ingress-operator/custom: failed to update configmap: ConfigMap \"router-client-ca-crl-custom\" is invalid: []: Too long: must have at most 1048576 bytes"}
Expected results:
First, it should be possible to create a configmap where only the data counts toward the 1MB max (see additional info below for more details); second, there should be some way to compress or otherwise allow a CRL list larger than 1MB.
Additional info:
Using only this CRL, which is only 600K, still causes the issue; this could be due to the `last-applied-configuration` annotation on the configmap, which is added since we do an apply operation (update) on the configmap. I am not sure whether it counts toward the 1MB max. https://github.com/openshift/cluster-ingress-operator/blob/release-4.10/pkg/operator/controller/crl/crl_configmap.go#L295 Not sure whether we could just replace the configmap instead.
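A rough, illustrative way to see how much of the 1MB object-size budget the serialized configmap consumes, annotations included (the namespace is assumed):

oc -n openshift-ingress get configmap router-client-ca-crl-custom -o yaml | wc -c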
Description of problem:
Based on the discussion in https://issues.redhat.com/browse/OCPBUGS-24044
and the discussion in this Slack thread (https://redhat-internal.slack.com/archives/CBWMXQJKD/p1700510945375019), we need to update our CI and some of the work done for mutable scope in NE-621.
Specifically, we need to
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Run the CI test TestUnmanagedDNSToManagedDNSInternalIngressController
2. Observe the failure in unmanaged-migrated-internal
Actual results:
CI tests fail.
Expected results:
CI tests shouldn't fail.
Additional info:
This is a change from past behavior, as reported in https://issues.redhat.com/browse/OCPBUGS-24044. Further discussion revealed that the new behavior is currently expected, although the previous behavior could be restored in the future. Notes to SRE and release notes are needed for this change in behavior.
When the ingress operator's clientca-configmap controller reconciles an IngressController, this controller attempts to add a finalizer to the IngressController if that finalizer is absent. This controller erroneously attempts to add the missing finalizer even if the IngressController is marked for deletion, which results in an error. This error causes the controller to retry the deletion and log the error multiple times.
I observed this in CI for OCP 4.14 and was able to reproduce it on 4.11.37, and it probably affects earlier versions as well. The problematic code was added in https://github.com/openshift/cluster-ingress-operator/pull/450/commits/0f36470250c3089769867ebd72e25c413a29cda2 in OCP 4.9 to implement NE-323.
Easily.
1. Create a configmap in the "openshift-config" namespace (to reproduce this issue, it is not necessary that the configmap have a valid TLS certificate and key):
oc -n openshift-config create configmap client-ca-cert
2. Create an IngressController that specifies spec.clientTLS.clientCA.name to point to the configmap from the previous step:
oc create -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: test-client-ca-configmap
  namespace: openshift-ingress-operator
spec:
  domain: example.xyz
  endpointPublishingStrategy:
    type: Private
  clientTLS:
    clientCA:
      name: client-ca-cert
    clientCertificatePolicy: Required
EOF
3. Delete the IngressController:
oc -n openshift-ingress-operator delete ingresscontrollers/test-client-ca-configmap
4. Check the ingress operator's logs:
oc -n openshift-ingress-operator logs -c ingress-operator deployments/ingress-operator
The ingress operator logs several attempts to add the finalizer to the IngressController after it has been marked for deletion:
2023-06-15T02:17:12.419Z ERROR operator.init controller/controller.go:273 Reconciler error {"controller": "clientca_configmap_controller", "object": {"name":"test-client-ca-configmap","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "test-client-ca-configmap", "reconcileID": "2274f55e-e5bd-4fdb-973e-821a44cf2ebf", "error": "failed to add client-ca-configmap finalizer: IngressController.operator.openshift.io \"test-client-ca-configmap\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"ingresscontroller.operator.openshift.io/finalizer-clientca-configmap\"}"}
The deletion does succeed, errors notwithstanding.
The ingress operator should succeed in deleting the IngressController without attempting to re-add the finalizer to the IngressController after it has been marked for deletion.
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/199
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
CI is flaky because the TestRouterCompressionOperation test fails.
I have seen these failures on 4.14 CI jobs.
Presently, search.ci reports the following stats for the past 14 days:
Found in 7.71% of runs (16.58% of failures) across 402 total runs and 24 jobs (46.52% failed)
GCP is most impacted:
pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator (all) - 44 runs, 86% failed, 37% of failures match = 32% impact
Azure and AWS are also impacted:
pull-ci-openshift-cluster-ingress-operator-master-e2e-azure-operator (all) - 36 runs, 64% failed, 43% of failures match = 28% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 38 runs, 79% failed, 23% of failures match = 18% impact
1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=compression+error%3A+expected&maxAge=336h&context=1&type=build-log&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.
The test fails:
TestAll/serial/TestRouterCompressionOperation === RUN TestAll/serial/TestRouterCompressionOperation router_compression_test.go:209: compression error: expected "gzip", got "" for canary route
CI passes, or it fails on a different test.
Description of problem:
Error message seen during testing: 2023-03-23T22:33:02.507Z ERROR operator.dns_controller dns/controller.go:348 failed to publish DNS record to zone {"record": {"dnsName":"*.example.com","targets":["34.67.189.132"],"recordType":"A","recordTTL":30,"dnsManagementPolicy":"Managed"}, "dnszone": {"id":"ci-ln-95xvtb2-72292-9jj4w-private-zone"}, "error": "googleapi: Error 400: Invalid value for 'entity.change.additions[*.example.com][A].name': '*.example.com', invalid"}
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Steps to Reproduce:
1. Set up a 4.13 GCP cluster and install OSSM using http://pastebin.test.redhat.com/1092754
2. Run the Gateway API e2e tests against the cluster (or create a gateway with listener hostname *.example.com)
3. Check the ingress operator logs
Actual results:
The DNS record is not published, and the error repeats continuously in the log.
Expected results:
Should publish DNS record to zone without errors
Additional info:
Miciah: The controller should check ManageDNSForDomain when calling EnsureDNSRecord.
Description of problem:
Bump Kubernetes to 0.27.1 and bump dependencies
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Core CAPI CRDs are not deployed on unsupported platforms even when explicitly needed by other operators. An example of this is on vSphere clusters: CAPI is not yet supported on vSphere, but the CAPI IPAM CRDs are needed by operators other than the usual consumers (cluster-capi-operator and the CAPI controllers).
Version-Release number of selected component (if applicable):
How reproducible:
Launch a TechPreview cluster on an unsupported platform (e.g. vSphere/Azure). Check that the core CAPI CRDs are not present.
Steps to Reproduce:
$ oc get crds | grep cluster.x-k8s.io
Actual results:
Core CAPI CRDs are not present (only the metal ones)
Expected results:
Core CAPI CRDs should be present
Additional info:
Description of problem:
Some of AWS CI jobs failed with error: Error: error creating EC2 NAT Gateway: InvalidElasticIpID.NotFound: The elasticIp ID 'eipalloc-094ec9d0482d5b9f2' does not exist Errors from: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn/1566090383895564288 level=info msg=Creating infrastructure resources...level=errorlevel=error msg=Error: error creating EC2 NAT Gateway: InvalidElasticIpID.NotFound: The elasticIp ID 'eipalloc-094ec9d0482d5b9f2' does not existlevel=error msg= status code: 400, request id: 5223ac0c-77cb-4f29-adc8-4192e9fc3ef8level=errorlevel=error msg= with module.vpc.aws_nat_gateway.nat_gw[0],level=error msg= on vpc/vpc-public.tf line 85, in resource "aws_nat_gateway" "nat_gw":level=error msg= 85: resource "aws_nat_gateway" "nat_gw" {level=errorlevel=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "cluster" stage: failed to create cluster: failed to apply Terraform: exit status 1level=errorlevel=error msg=Error: error creating EC2 NAT Gateway: InvalidElasticIpID.NotFound: The elasticIp ID 'eipalloc-094ec9d0482d5b9f2' does not existlevel=error msg= status code: 400, request id: 5223ac0c-77cb-4f29-adc8-4192e9fc3ef8level=errorlevel=error msg= with module.vpc.aws_nat_gateway.nat_gw[0],level=error msg= on vpc/vpc-public.tf line 85, in resource "aws_nat_gateway" "nat_gw":level=error msg= 85: resource "aws_nat_gateway" "nat_gw" {level=errorlevel=error
Version-Release number of selected component (if applicable):
4.11 - 4.12
How reproducible:
Occasionally. Matching failures can be found in the CI pipeline via https://search.ci.openshift.org/?search=InvalidElasticIpID.NotFound&maxAge=48h&context=1&type=build-log&name=.*4%5C.12.*aws.*&excludeName=.*upgrade.*&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
1. All kinds of AWS IPI installations can encounter this error.
Actual results:
Error: error creating EC2 NAT Gateway: InvalidElasticIpID.NotFound: The elasticIp ID 'eipalloc-094ec9d0482d5b9f2' does not exist
Expected results:
Cluster install succeeds.
Additional info:
4.11 jobs with this error: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.11-ocp-e2e-aws-arm64/1565061451272425472
Description of problem:
Kubernetes and other associated dependencies need to be updated to protect against potential vulnerabilities.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
IHAC (I have a customer) with OCP 4.9 who has configured the IngressControllers with a long httpLogFormat, and the routers print the following warnings every time haproxy reloads:
I0927 13:29:45.495077 1 router.go:612] template "msg"="router reloaded" "output"="[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'public'.\n[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'fe_sni'.\n[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'fe_no_sni'.\n - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
This is the Ingress Contoller configuration:
logging:
  access:
    destination:
      syslog:
        address: 10.X.X.X
        port: 10514
      type: Syslog
    httpCaptureCookies:
    - matchType: Exact
      maxLength: 128
      name: ITXSESSIONID
    httpCaptureHeaders:
      request:
      - maxLength: 128
        name: Host
      - maxLength: 128
        name: itxrequestid
    httpLogFormat: actconn="%ac",backend_name="%b",backend_queue="%bq",backend_source_ip="%bi",backend_source_port="%bp",beconn="%bc",bytes_read="%B",bytes_uploaded="%U",captrd_req_cookie="%CC",captrd_req_headers="%hr",captrd_res_cookie="%CS",captrd_res_headers="%hs",client_ip="%ci",client_port="%cp",cluster="ieec1ocp1",datacenter="ieec1",environment="pro",fe_name_transport="%ft",feconn="%fc",frontend_name="%f",hostname="%H",http_version="%HV",log_type="http",method="%HM",query_string="%HQ",req_date="%tr",request="%HP",res_time="%TR",retries="%rc",server_ip="%si",server_name="%s",server_port="%sp",srv_queue="%sq",srv_conn="%sc",srv_queue="%sq",status_code="%ST",Ta="%Ta",Tc="%Tc",tenant="bk",term_state="%tsc",tot_wait_q="%Tw",Tr="%Tr"
    logEmptyRequests: Ignore
Any way to avoid this truncate warning?
How reproducible:
For every reload of haproxy config
Steps to Reproduce:
You can reproduce easily with the following configuration in the default ingress controller:
logging:
access:
destination:
type: Container
httpCaptureCookies:
2022-10-18T14:13:53.068164+00:00 xxxx xxxxxx haproxy[38]: 10.39.192.203:40698 [18/Oct/2022:14:13:52.488] fe_sni~ be_secure:openshift-console:console/pod:console-5976495467-zxgxr:console:https:10.128.1.116:8443 0/0/0/10/580 200 1130598 _abck=B7EA642C9E828FA8210F329F80B7B2D80YAAQnVozuFVfkOaDAQAADk - --VN 78/37/33/33/0 0/0 "GET /api/kubernetes/openapi/v2 HTTP/1.1"
Since HyperShift / Hosted Control Plane have adopted include.release.openshift.io/ibm-cloud-managed, to tailor the resources of clusters running in the ROKS IBM environment, the include.release.openshift.io/hypershift addition will allow Hypershift to express different profile choices than ROKS
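For illustration, a manifest that participates in both profiles carries the two annotations side by side; a minimal sketch on a hypothetical resource (the annotation keys are the ones named above):
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-manifest   # hypothetical resource
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/hypershift: "true"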
Description of problem:
OCM-o does not support obtaining verbosity through OpenShiftControllerManager.operatorLogLevel object
Version-Release number of selected component (if applicable):
How reproducible:
Modify OpenShiftControllerManager.operatorLogLevel; the OCM-o operator will not display the corresponding logs.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The MAPI metric mapi_current_pending_csr fires even when there are no pending MAPI CSRs but non-MAPI CSRs are present. It may not be appropriately scoping this metric to only its own CSRs.
Version-Release number of selected component (if applicable):
Observed in 4.11.25
How reproducible:
Consistent
Steps to Reproduce:
1. Install a component that uses CSRs (like ACM) but leave the CSRs in a pending state
2. Observe the metric firing
Actual results:
Metric is firing
Expected results:
Metric only fires if there are MAPI specific CSRs pending
Additional info:
This impacts SRE alerting
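A quick way to confirm which pending CSRs are actually in scope is to list them along with their signers; MAPI only cares about the kubelet client/serving signers (the CSR name below is a placeholder):
# List pending CSRs with their signers
oc get csr | grep -i pending
# Inspect a single CSR's signer to see whether it is a kubelet CSR
oc get csr example-csr -o jsonpath='{.spec.signerName}'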
When baselineCapabilitySet is set to None, still see an SA with name `deployer-controller` in the cluster.
steps to Reproduce:
=================
1. Install 4.15 cluster with baselineCapabilitySet to None
2. Run command `oc get sa -A | grep deployer`
Actual Results:
================
[knarra@knarra openshift-tests-private]$ oc get sa -A | grep deployer
openshift-infra deployer-controller 0 63m
Expected Results:
==================
No SA related to deployer should be returned
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/898
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The current version of openshift/cluster-ingress-operator vendors Kubernetes 1.26 packages. OpenShift 4.14 is based on Kubernetes 1.27.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.14/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.26
Expected results:
Kubernetes packages are at version v0.27.0 or later.
Additional info:
Using old Kubernetes API and client packages brings a risk of API compatibility issues. controller-runtime will need to be bumped to v0.15 as well.
Description of problem:
A cluster installed with baselineCapabilitySet: None has the build resource available while the Build capability is disabled:
❯ oc get -o json clusterversion version | jq '.spec.capabilities'
{
  "baselineCapabilitySet": "None"
}
❯ oc get -o json clusterversion version | jq '.status.capabilities.enabledCapabilities'
null
❯ oc get build -A
NAME      AGE
cluster   5h23m
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-04-143709
How reproducible:
100%
Steps to Reproduce:
1. Install a cluster with baselineCapabilitySet: None
Actual results:
❯ oc get build -A
NAME      AGE
cluster   5h23m
Expected results:
❯ oc get -A build
error: the server doesn't have a resource type "build"
slack thread with more info: https://redhat-internal.slack.com/archives/CF8SMALS1/p1696527133380269
As a developer I want to remove the NoUpgrade annotation from the CAPI IPAM CRDs so that I can promote them to General Availability
The SPLAT team is planning to have the CAPI IPAM CRDs promoted to GA because they need them in a component they are promoting to GA.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1006
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/321
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
CAPI E2Es failing to start in some CAPI provider's release branches.
Failing with the following error: `go: errors parsing go.mod: /tmp/tmp.ssf1LXKrim/go.mod:5: unknown directive: toolchain` https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api/199/pull-ci-openshift-cluster-api-master-e2e-aws-capi-techpreview/1765512397532958720#1:build-log.txt%3A91-95
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is because the script launching the e2e tests launches them from the `main` branch of the cluster-capi-operator (which has some backward-incompatible Go toolchain changes), rather than the correctly matching release branch.
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/201
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
`[sig-arch] events should not repeat pathologically` has started failing on aws-ovn-serial jobs.
event happened 25 times, something is wrong: ns/openshift-ovn-kubernetes service/ovn-kubernetes-master - reason/FailedToUpdateEndpointSlices Error updating Endpoint Slices for Service openshift-ovn-kubernetes/ovn-kubernetes-master: node "ip-10-0-176-16.us-east-2.compute.internal" not found
event happened 25 times, something is wrong: ns/openshift-ovn-kubernetes service/ovnkube-db - reason/FailedToUpdateEndpointSlices Error updating Endpoint Slices for Service openshift-ovn-kubernetes/ovnkube-db: node "ip-10-0-176-16.us-east-2.compute.internal" not found
While investigating TRT-413, we discovered that many service monitors are configured to use bearer token authentication. Per this document, https://github.com/deads2k/openshift-enhancements/blob/master/enhancements/monitoring/client-cert-scraping.md, we should try to use client certificate authentication for metrics scraping. This is to make sure metrics collection still works even when the apiserver is not available.
Currently, the following repos have been identified to be fixed:
Additionally, it was discovered that kube-rbac-proxy is not coded to automatically update the client CA certificate. That issue is addressed with https://issues.redhat.com/browse/TRT-464. Until the fix lands in OpenShift, some of the above changes (for repositories that use kube-rbac-proxy) will not be effective.
For the repositories that do not use kube-rbac-proxy (e.g. the storage operator), the above change can be merged and verified.
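As a rough illustration of the target state, a ServiceMonitor endpoint that scrapes with a client certificate instead of a bearer token looks something like the sketch below; the names and mount paths follow common cluster-monitoring conventions but should be treated as assumptions:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-operator        # hypothetical name
  namespace: openshift-example  # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: example-operator
  endpoints:
  - port: metrics
    scheme: https
    tlsConfig:
      # Client cert/key as mounted into the Prometheus pods (paths are assumptions)
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: example-operator.openshift-example.svc
No bearerTokenFile is set, so the scrape no longer depends on token review against the apiserver.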
How to verify
Description of problem:
The following changes are required for openshift/route-controller-manager#22 refactoring.
add POD_NAME to the route-controller-manager deployment (see the sketch after this list)
introduce route-controller-defaultconfig and customize lease name openshift-route-controllers to override the default one supplied by library-go
add RBAC for infrastructures which is used by library-go for configuring leader election
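For the POD_NAME item, the usual pattern is the downward API; a minimal sketch of the container env addition (surrounding deployment fields omitted):
env:
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
This lets each replica report its own pod name, which leader election can use as the lease holder identity.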
Description of problem:
migrator pod in `openshift-kube-storage-version-migrator` project stuck in Pending state
Version-Release number of selected component (if applicable):
4.12
How reproducible:
100%
Steps to Reproduce:
1. Add a default cluster-wide node selector with a label that does not match any node label:
$ oc edit scheduler cluster
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
...
spec:
  defaultNodeSelector: node-role.kubernetes.io/role=app
  mastersSchedulable: false
2. Delete the migrator pod running in the openshift-kube-storage-version-migrator namespace:
$ oc delete pod migrator-6b78665974-zqd47 -n openshift-kube-storage-version-migrator
3. Check whether the migrator pod comes back up in Running state:
$ oc get pods -n openshift-kube-storage-version-migrator
NAME                        READY   STATUS    RESTARTS   AGE
migrator-6b78665974-j4jwp   0/1     Pending   0          2m41s
Actual results:
The pod goes into the pending state because it tries to get scheduled on the node having label `node-role.kubernetes.io/role=app`.
Expected results:
The pod should come up in running state, it should not get affected by the cluster-wide node-selector.
Additional info:
Setting the annotation `openshift.io/node-selector=` into the `openshift-kube-storage-version-migrator` project and then deleting the pending migrator pod helps in bringing the pod up.
The expectation with this bug is that the project `openshift-kube-storage-version-migrator` should have the annotation `openshift.io/node-selector=`, so that the pod running on this project will not get affected by the wrong cluster-wide node-selector configuration.
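The workaround, and the expected default, is a one-line annotation; a minimal sketch (the pod label used for deletion is an assumption):
# Pin the namespace to an empty node selector so the cluster-wide default does not apply
oc annotate namespace openshift-kube-storage-version-migrator openshift.io/node-selector= --overwrite
# Recreate the pending migrator pod (label is an assumption)
oc -n openshift-kube-storage-version-migrator delete pod -l app=migrator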
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/304
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Since HyperShift / Hosted Control Plane have adopted include.release.openshift.io/ibm-cloud-managed, to tailor the resources of clusters running in the ROKS IBM environment, the include.release.openshift.io/hypershift addition will allow Hypershift to express different profile choices than ROKS
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Some minor strings in the Quick Start weren't translated, for example, the error message "No Quick Start found" when all Quick Starts were disabled, and the string "View Prerequisites" in the quick start content drawer.
Version-Release number of selected component (if applicable):
4.11-4.12, maybe earlier
How reproducible:
Always
Steps to Reproduce:
1. Switch the language and open a quick start in the drawer.
2. Or disable all quick starts and check the quick start page.
Actual results:
1. String "View Prerequisites" is not translated
2. Error page is empty
Expected results:
1. String "View Prerequisites" should be translated (at least in pseudo translation)
2. Error page should show an info message that no Quick Starts were found.
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/40
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In case the appsDomain (https://docs.openshift.com/container-platform/4.13/networking/ingress-operator.html#nw-ingress-configuring-application-domain_configuring-ingress) is specified and a cluster-admin accidentally deletes all routes on a cluster, the canary route in the namespace openshift-ingress-canary is recreated with the domain specified in .spec.appsDomain instead of .spec.domain of the definition in Ingress.config.openshift.io. Additionally, the docs are a bit confusing. On one page (https://docs.openshift.com/container-platform/4.13/networking/ingress-operator.html#nw-ingress-configuring-application-domain_configuring-ingress) it's defined as:
As a cluster administrator, you can specify an alternative to the default cluster domain for user-created routes by configuring the appsDomain field. The appsDomain field is an optional domain for OpenShift Container Platform to use instead of the default, which is specified in the domain field. If you specify an alternative domain, it overrides the default cluster domain for the purpose of determining the default host for a new route. For example, you can use the DNS domain for your company as the default domain for routes and ingresses for applications running on your cluster.
In the API spec (https://docs.openshift.com/container-platform/4.11/rest_api/config_apis/ingress-config-openshift-io-v1.html#spec) the correct behaviour is explained
appsDomain is an optional domain to use instead of the one specified in the domain field when a Route is created without specifying an explicit host. If appsDomain is nonempty, this value is used to generate default host values for Route. Unlike domain, appsDomain may be modified after installation. This assumes a new ingresscontroller has been setup with a wildcard certificate.
It would be nice if the wording could be adjusted, as "you can specify an alternative to the default cluster domain for user-created routes by configuring" does not fit well: more or less all newly created routes (operator-created and so on) get created with the appsDomain.
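To see the behavior end to end, one can set appsDomain and create a route without an explicit host; a minimal sketch (the domain and service names are placeholders):
oc patch ingresses.config.openshift.io cluster --type=merge -p '{"spec":{"appsDomain":"apps.example.com"}}'
# A route created without spec.host now derives its default host from appsDomain
oc -n default expose service/example-svc
oc -n default get route example-svc -o jsonpath='{.spec.host}'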
Version-Release number of selected component (if applicable):
OpenShift 4.12.22
How reproducible:
see steps below
Steps to Reproduce:
1. Install OpenShift 2. define .spec.appsDomain in Ingress.config.openshift.io 3. oc delete route canary -n openshift-ingress-canary 4. wait some seconds to get the route recreated and check cluster-operator
Actual results:
Ingress Operator degraded and route recreated with wrong domain (.spec.appsDomain)
Expected results:
Ingress Operator not degraded and route recreated with the correct domain (.spec.domain)
Additional info:
Please see screenshot
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1033
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Adam Kaplan will be assuming the role of "Staff Engineer" for core OpenShift for the team, taking over the role from Ben Parees. To expedite reviews and other OCP processes, Adam needs to be added back as an approver in the following repositories:
Adam may also need to be added to the OWNERS files in the following repos:
Description of problem:
During the testing of the NE-1264 epic, I configured both the syslog and container destination types of logging on the same default ingress controller. In the ingress controller spec we can see it is taking both destination types, but this is not reflected in the ROUTER_LOG_MAX_LENGTH env var or the haproxy.config file:
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
<-----snip--->
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  logging:
    access:
      destination:
        container:
          maxLength: 1024
        syslog:
          address: 1.2.3.4
          maxLength: 1024
          port: 514
        type: Container
      logEmptyRequests: Log
  replicas: 2
  tuningOptions:
    reloadInterval: 0s
  unsupportedConfigOverrides: null
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress exec router-default-6c86ff75d9-g24q5 -- env | grep ROUTER_LOG_MAX_LENGTH
Defaulted container "router" out of: router, logs
ROUTER_LOG_MAX_LENGTH=1024
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress exec router-default-6c86ff75d9-l9rjv -- cat haproxy.config | grep 1024
Defaulted container "router" out of: router, logs
  log /var/lib/rsyslog/rsyslog.sock len 1024 local1 info
When we patch a change to the log length, it is not reflected as expected for one destination:
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator patch ingresscontroller/default -p '{"spec":{"logging":{"access":{"destination":{"container":{"maxLength":480}}}}}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress exec router-default-6476d6c69d-tlhqd -- env | grep ROUTER_LOG_MAX_LENGTH
Defaulted container "router" out of: router, logs
ROUTER_LOG_MAX_LENGTH=480
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator patch ingresscontroller/default -p '{"spec":{"logging":{"access":{"destination":{"syslog":{"maxLength":4096}}}}}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
<----snip---->
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  logging:
    access:
      destination:
        container:
          maxLength: 480
        syslog:
          address: 1.2.3.4
          maxLength: 4096
          port: 514
        type: Container
      logEmptyRequests: Log
  replicas: 2
  tuningOptions:
    reloadInterval: 0s
  unsupportedConfigOverrides: null
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress exec router-default-59cf55666d-shq98 -- env | grep ROUTER_LOG_MAX_LENGTH
Defaulted container "router" out of: router, logs
ROUTER_LOG_MAX_LENGTH=480
In another round of testing, only the syslog destination type was reflected in the env, not the container destination type. I am also not sure whether it is a valid situation to use both destination types on the default ingress controller.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. Edit the default ingress controller and add both destination type configs
Actual results:
Only one destination type's value is reflected in the haproxy.config file
Expected results:
Both types should be reflected
Additional info:
Since HyperShift / Hosted Control Plane have adopted include.release.openshift.io/ibm-cloud-managed, to tailor the resources of clusters running in the ROKS IBM environment, the include.release.openshift.io/hypershift addition will allow Hypershift to express different profile choices than ROKS
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/273
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing
doesn't give much detail or suggest next steps. Expanding it to include at least a more detailed error message would make it easier for the admin to figure out how to resolve the issue.
Version-Release number of selected component (if applicable):
It's in the dev branch, and probably dates back to whenever the canary system was added.
How reproducible:
100%
Steps to Reproduce:
1. Break ingress. FIXME: Maybe by deleting the cloud load balancer, or dropping a firewall in the way, or something.
2. See the canary pods start failing.
3. Ingress operator sets CanaryChecksRepetitiveFailures with a message.
Actual results:
Canary route checks for the default ingress controller are failing
Expected results:
Canary route checks for the default ingress controller are failing: ${ERROR_MESSAGE}. ${POSSIBLY_ALSO_MORE_TROUBLESHOOTING_IDEAS?}
Additional info:
Plumbing the error message through might be as straightforward as passing probeRouteEndpoint's err through to setCanaryFailingStatusCondition for formatting. Or maybe it's more complicated than that?
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/202
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
CI is flaky because the TestHostNetworkPort test fails:
=== NAME TestAll/serial/TestHostNetworkPortBinding operator_test.go:1034: Expected conditions: map[Admitted:True Available:True DNSManaged:False DeploymentReplicasAllAvailable:True LoadBalancerManaged:False] Current conditions: map[Admitted:True Available:True DNSManaged:False Degraded:False DeploymentAvailable:True DeploymentReplicasAllAvailable:False DeploymentReplicasMinAvailable:True DeploymentRollingOut:True EvaluationConditionsDetected:False LoadBalancerManaged:False LoadBalancerProgressing:False Progressing:True Upgradeable:True] operator_test.go:1034: Ingress Controller openshift-ingress-operator/samehost status: { "availableReplicas": 0, "selector": "ingresscontroller.operator.openshift.io/deployment-ingresscontroller=samehost", "domain": "samehost.ci-op-xlwngvym-43abb.origin-ci-int-aws.dev.rhcloud.com", "endpointPublishingStrategy": { "type": "HostNetwork", "hostNetwork": { "protocol": "TCP", "httpPort": 9080, "httpsPort": 9443, "statsPort": 9936 } }, "conditions": [ { "type": "Admitted", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "Valid" }, { "type": "DeploymentAvailable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentAvailable", "message": "The deployment has Available status condition set to True" }, { "type": "DeploymentReplicasMinAvailable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentMinimumReplicasMet", "message": "Minimum replicas requirement is met" }, { "type": "DeploymentReplicasAllAvailable", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentReplicasNotAvailable", "message": "0/1 of replicas are available" }, { "type": "DeploymentRollingOut", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "DeploymentRollingOut", "message": "Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n" }, { "type": "LoadBalancerManaged", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "EndpointPublishingStrategyExcludesManagedLoadBalancer", "message": "The configured endpoint publishing strategy does not include a managed load balancer" }, { "type": "LoadBalancerProgressing", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "LoadBalancerNotProgressing", "message": "LoadBalancer is not progressing" }, { "type": "DNSManaged", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "UnsupportedEndpointPublishingStrategy", "message": "The endpoint publishing strategy doesn't support DNS management." }, { "type": "Available", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z" }, { "type": "Progressing", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "IngressControllerProgressing", "message": "One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 0 of 1 updated replica(s) are available...\n)" }, { "type": "Degraded", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z" }, { "type": "Upgradeable", "status": "True", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "Upgradeable", "message": "IngressController is upgradeable." }, { "type": "EvaluationConditionsDetected", "status": "False", "lastTransitionTime": "2024-02-26T17:25:39Z", "reason": "NoEvaluationCondition", "message": "No evaluation condition is detected." 
} ], "tlsProfile": { "ciphers": [ "ECDHE-ECDSA-AES128-GCM-SHA256", "ECDHE-RSA-AES128-GCM-SHA256", "ECDHE-ECDSA-AES256-GCM-SHA384", "ECDHE-RSA-AES256-GCM-SHA384", "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305", "DHE-RSA-AES128-GCM-SHA256", "DHE-RSA-AES256-GCM-SHA384", "TLS_AES_128_GCM_SHA256", "TLS_AES_256_GCM_SHA384", "TLS_CHACHA20_POLY1305_SHA256" ], "minTLSVersion": "VersionTLS12" }, "observedGeneration": 1 } operator_test.go:1036: failed to observe expected conditions for the second ingresscontroller: timed out waiting for the condition operator_test.go:1059: deleted ingresscontroller samehost operator_test.go:1059: deleted ingresscontroller hostnetworkportbinding
This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1017/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1762147882179235840. Search.ci shows another failure: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/48873/rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi/1762576595890999296. The test has failed sporadically in the past, beyond what search.ci is able to search.
TestHostNetworkPort is marked as a serial test in TestAll and marked with t.Parallel() in the test itself. Not sure if this is what is causing a new failure seen in this test, but something is incorrect.
The test failures have been observed recently on 4.16 as well as on 4.12 (https://github.com/openshift/cluster-ingress-operator/pull/828#issuecomment-1292888086) and 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/914#issuecomment-1526808286). The logic error was introduced in 4.11 (https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc).
The logic error is self-evident. The test failure is very rare. The failure has been observed sporadically over the past couple years. Presently, search.ci shows two failures, with the following impact, for the past 14 days:
rehearse-48873-pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-gatewayapi (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
N/A.
The TestHostNetworkPort test fails. The test is marked as both serial and parallel.
Test should be marked as either serial or parallel, and it should pass consistently.
When TestAll was introduced, TestHostNetworkPortBinding was initially marked parallel in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a22322b25569059c61e1973f37f0a4b49e9407bc. After some discussion, it was moved to the serial list in https://github.com/openshift/cluster-ingress-operator/pull/756/commits/a449e497e35fafeecbee9ea656e0631393182f70, but the commit to remove t.Parallel() evidently got inadvertently dropped.
Description of problem:
The mTLS connection is not working when using an intermediate CA apart from the root CA, both with CRLs defined.
The Intermediate CA Cert had a published CDP which directed to a CRL issued by the root CA.
The config map in the openshift-ingress namespace contains the CRL as issued by the root CA. The CRL issued by the Intermediate CA is not present since that CDP is in the user cert and so not in the bundle.
When attempting to connect using a user certificate issued by the Intermediate CA it fails with an error of unknown CA.
When attempting to connect using a user certificate issued by the Root CA, the connection is successful.
Version-Release number of selected component (if applicable):
4.10.24
How reproducible:
Always
Steps to Reproduce:
1. Configure CA and intermediate CA with CRL
2. Sign client certificate with the intermediate CA
3. Configure mtls in openshift-ingress
Actual results:
When attempting to connect using a user certificate issued by the Intermediate CA it fails with an error of unknown CA.
When attempting to connect using a user certificate issued by the Root CA, the connection is successful.
Expected results:
Be able to connect with client certificates signed by the intermediate CA
Additional info:
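For local debugging, verifying the client certificate against the full chain with both CRLs approximates what the router should accept; a minimal sketch assuming hypothetical file names:
# File names are placeholders for the root/intermediate CA certs, their CRLs, and the client cert
cat root-ca.pem intermediate-ca.pem > ca-chain.pem
cat root.crl intermediate.crl > crl-chain.pem
# Succeeds only if the cert chains to the root and neither CRL revokes it
openssl verify -crl_check_all -CAfile ca-chain.pem -CRLfile crl-chain.pem client.pem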
The cluster-ingress-operator repository vendors controller-runtime v0.15.0, which uses Kubernetes 1.27 packages. OpenShift 4.15 is based on Kubernetes 1.28.
4.15.
Always.
Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.15/go.mod.
The sigs.k8s.io/controller-runtime package is at v0.15.0.
The sigs.k8s.io/controller-runtime package is at v0.16.0 or newer.
https://github.com/openshift/cluster-ingress-operator/pull/990 already bumped the k8s.io/* packages to v0.28.2, but ideally the controller-runtime package should be bumped too. The controller-runtime v0.16 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.16.0.
Description of problem:
When creating an ingresscontroller with an empty spec (or where spec.domain clashes with an existing IC), the ingresscontroller's status shows Admitted as "False" and the reason is "Invalid". However, the "route_metrics_controller_routes_per_shard" metric shows the shard in the Observe tab of the web console. When the invalid ingresscontroller is deleted, the metric does not clear the row corresponding to the deleted invalid IC.
Version-Release number of selected component (if applicable):
4.12.0-ec5
How reproducible:
Always
Steps to Reproduce:
1. Create the invalid IC with the following spec:
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: ic-invalid
  namespace: openshift-ingress-operator
spec: {}
2. Check the status of the IC:
$ oc get ingresscontroller -n openshift-ingress-operator ic-invalid -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operator.openshift.io/v1","kind":"IngressController","metadata":{"annotations":{},"name":"ic-invalid","namespace":"openshift-ingress-operator"},"spec":{}}
  creationTimestamp: "2022-11-11T12:53:41Z"
  generation: 1
  name: ic-invalid
  namespace: openshift-ingress-operator
  resourceVersion: "97453"
  uid: 96eae28e-bb14-447e-822f-602f3a3bb378
spec:
  httpEmptyRequestsPolicy: Respond
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2022-11-11T12:53:41Z"
    message: 'conflicts with: default'
    reason: Invalid
    status: "False"
    type: Admitted
  domain: apps.arsen-cluster1.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      providerParameters:
        aws:
          classicLoadBalancer:
            connectionIdleTimeout: 0s
          type: Classic
        type: AWS
      scope: External
    type: LoadBalancerService
  observedGeneration: 1
  selector: ""
3. Check the "route_metrics_controller_routes_per_shard" metric on the web console
4. Delete the IC
5. Check the "route_metrics_controller_routes_per_shard" metric again on the web console
Actual results:
As shown in the attached screenshot, "route_metrics_controller_routes_per_shard" metric adds one row for the invalid IC. This is not cleared even when the IC is deleted.
Expected results:
The "route_metrics_controller_routes_per_shard" metric should not add metric for invalid ICs. Additionally, when the invalid IC is deleted the metric should clear the corresponding row.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/262
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The ingress operator is constantly reverting internal services when it detects a service change, even though the changed fields are just API-server-defaulted values.
Version-Release number of selected component (if applicable):
4.13, 4.14
How reproducible:
100%
Steps to Reproduce:
1. Create an ingress controller
2. Watch the ingress operator logs for excess "updated internal service" updates
[I'll provide a more specific reproducer if needed]
Actual results:
Excess: 2023-05-04T02:08:02.331Z INFO operator.ingress_controller ingress/internal_service.go:44 updated internal service ...
Expected results:
No updates
Additional info:
The diff looks like: 2023-05-05T15:12:06.668Z INFO operator.ingress_controller ingress/internal_service.go:44 updated internal service {"namespace": "openshift-ingress", "name": "router-internal-default", "diff": " &v1.Service{ TypeMeta: {}, ObjectMeta: {Name: \"router-internal-default\", Namespace: \"openshift-ingress\", UID: \"815f1499-a4d4-4cb8-9a5b-9905580e0ffd\", ResourceVersion: \"8031\", ...}, Spec: v1.ServiceSpec{ Ports: {{Name: \"http\", Protocol: \"TCP\", Port: 80, TargetPort: {Type: 1, StrVal: \"http\"}, ...}, {Name: \"https\", Protocol: \"TCP\", Port: 443, TargetPort: {Type: 1, StrVal: \"https\"}, ...}, {Name: \"metrics\", Protocol: \"TCP\", Port: 1936, TargetPort: {Type: 1, StrVal: \"metrics\"}, ...}}, Selector: {\"ingresscontroller.operator.openshift.io/deployment-ingresscontroller\": \"default\"}, ClusterIP: \"172.30.56.107\", - ClusterIPs: []string{\"172.30.56.107\"}, + ClusterIPs: nil, Type: \"ClusterIP\", ExternalIPs: nil, - SessionAffinity: \"None\", + SessionAffinity: \"\", LoadBalancerIP: \"\", LoadBalancerSourceRanges: nil, ... // 3 identical fields PublishNotReadyAddresses: false, SessionAffinityConfig: nil, - IPFamilies: []v1.IPFamily{\"IPv4\"}, + IPFamilies: nil, - IPFamilyPolicy: &\"SingleStack\", + IPFamilyPolicy: nil, AllocateLoadBalancerNodePorts: nil, LoadBalancerClass: nil, - InternalTrafficPolicy: &\"Cluster\", + InternalTrafficPolicy: nil, }, Status: {}, } "}
Messing around with unit testing, it looks like internalServiceChanged returns true when spec.IPFamilies, spec.IPFamilyPolicy, and spec.InternalTrafficPolicy are set to the default values that you see in the diff above.
Ingress operator then resets back to nil, then the API server sets them to their defaults, and this process repeats.
internalServiceChanged should either ignore, or explicitly set these values.
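The defaulting is easy to observe directly; a minimal sketch that creates a bare ClusterIP service and shows the API server filling in exactly the fields from the diff:
oc create -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: defaulting-demo
  namespace: default
spec:
  ports:
  - name: http
    port: 80
EOF
# The server populates these even though the client never set them
oc -n default get service defaulting-demo -o yaml | grep -E 'ipFamilies|ipFamilyPolicy|internalTrafficPolicy|sessionAffinity'
Any comparison that treats these server-set defaults as drift will loop exactly as described above.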
Description of problem:
After enabling user-defined monitoring on an HyperShift hosted cluster, PrometheusOperatorRejectedResources starts firing.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Start a HyperShift-hosted cluster with cluster-bot
2. Enable user-defined monitoring
Actual results:
PrometheusOperatorRejectedResources alert becomes firing
Expected results:
No alert firing
Additional info:
Need to reach out to the HyperShift folks as the fix should probably be in their code base.
This is a duplicate of https://issues.redhat.com/browse/ART-8361; since we are not able to set `target` on ART bugs, the issue is created here.
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
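For reference, the setting in question is per container; a minimal sketch of what it looks like in a pod spec (container name and image are placeholders):
containers:
- name: example
  image: example-image
  # Surface the last log lines via the API when the container exits
  # without writing a termination message
  terminationMessagePolicy: FallbackToLogsOnError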
Description of problem:
Reported in https://github.com/openshift/cluster-ingress-operator/issues/911
When you open a new issue, it still directs you to Bugzilla, and then doesn't work.
It can be changed here: https://github.com/openshift/cluster-ingress-operator/blob/master/.github/ISSUE_TEMPLATE/config.yml
, but to what?
The correct Jira link is
https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&components=12367900&priority=10300&customfield_12316142=26752
But can the public use this mechanism? Yes - https://redhat-internal.slack.com/archives/CB90SDCAK/p1682527645965899
Version-Release number of selected component (if applicable):
n/a
How reproducible:
May be in other repos too.
Steps to Reproduce:
1. Open an Issue in the repo: click on New Issue
2. Follow the directions and click on the link to open Bugzilla
3. Get a message that this doesn't work anymore
Actual results:
You get instructions that don't work to open a bug from an Issue.
Expected results:
You get instructions to just open an Issue, or get correct instructions on how to open a bug using Jira.
Additional info:
Description of problem:
library-go should use Lease for leader election by default. In 4.10 we switched from configmaps to configmapsleases; now we can switch to leases. Change library-go to use Lease by default; we already have an open PR for that: https://github.com/openshift/library-go/pull/1448. Once the PR merges, we should revendor library-go for:
- kas operator
- oas operator
- etcd operator
- kcm operator
- openshift controller manager operator
- scheduler operator
- auth operator
- cluster policy controller
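Once an operator is revendored, the switch can be confirmed by looking for a Lease object holding the leader-election lock in the operator's namespace; a quick check (the namespace is an example):
oc -n openshift-kube-controller-manager-operator get leases.coordination.k8s.io
# holderIdentity names the current leader
oc -n openshift-kube-controller-manager-operator get lease -o jsonpath='{.items[0].spec.holderIdentity}'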
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/97
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The IngressController and DNSRecord CRDs were moved to dedicated packages following the introduction of a new method for generating CRDs in the OpenShift API repository (openshift/api#1803: https://github.com/openshift/api/pull/1803).
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. go mod edit -replace=github.com/openshift/api=github.com/openshift/api@ff84c2c732279b16baccf08c7dfc9ff8719c4807
2. go mod tidy
3. go mod vendor
4. make update
Actual results:
$ make update
hack/update-generated-crd.sh
--- vendor/github.com/openshift/api/operator/v1/0000_50_ingress-operator_00-ingresscontroller.crd.yaml 1970-01-01 01:00:00.000000000 +0100
+++ manifests/00-custom-resource-definition.yaml 2024-04-17 18:05:05.009605155 +0200
[LONG DIFF]
cp: cannot stat 'vendor/github.com/openshift/api/operator/v1/0000_50_ingress-operator_00-ingresscontroller.crd.yaml': No such file or directory
make: *** [Makefile:39: crd] Error 1
Expected results:
$ make update
hack/update-generated-crd.sh
hack/update-profile-manifests.sh
Additional info:
Description of problem:
I am trying to build the operator image locally and fail because the registry `registry.ci.openshift.org/ocp/` requires authorization
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. git clone git@github.com:openshift/cluster-ingress-operator.git
2. export REPO=<path to a repository to upload the image>
3. Run `make release-local`
Actual results:
[skip several lines]
Step 1/10 : FROM registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.19-openshift-4.12 AS builder
unauthorized: authentication required
Expected results:
image is pulled and the build succeeded
Additional info:
There are two images that are not available:
- registry.ci.openshift.org/ocp/builder:rhel-8-golang-1.19-openshift-4.12
- registry.ci.openshift.org/ocp/4.12:base
I was able to fix this by changing the images to:
- registry.ci.openshift.org/openshift/release:golang-1.19
- registry.ci.openshift.org/origin/4.12:base
See https://github.com/dudinea/cluster-ingress-operator/tree/fix-build-images-not-public
I am not sure what I did is OK, but I suppose that this project, being part of OKD, should be easily buildable by the public, or at least the issue should be documented somewhere. I wanted to post this to the OKD project, but I am unable to select it in Jira.
Description of problem:
SNO installation performed with the assisted-installer failed
Version-Release number of selected component (if applicable):
4.10.32
# oc get co authentication -o yaml
- lastTransitionTime: '2023-01-30T00:51:11Z'
  message: 'IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
  OAuthServerConfigObservationDegraded: secret "v4-0-config-system-router-certs" not found
  OAuthServerDeploymentDegraded: 1 of 1 requested instances are unavailable for oauth-openshift.openshift-authentication (container is waiting in pending oauth-openshift-58b978d7f8-s6x4b pod)
  OAuthServerRouteEndpointAccessibleControllerDegraded: secret "v4-0-config-system-router-certs"
# oc logs ingress-operator-xxx-yyy -c ingress-operator
2023-01-30T08:14:13.701799050Z 2023-01-30T08:14:13.701Z ERROR operator.certificate_publisher_controller certificate-publisher/controller.go:80 failed to list ingresscontrollers for secret {"related": "", "error": "Index with name field:defaultCertificateName does not exist"}
Restarting the ingress-operator pod helped fix the issue, but a permanent fix is required. The bug (https://bugzilla.redhat.com/show_bug.cgi?id=2005351) was filed earlier but closed due to inactivity.
CI is flaky because the TestAWSELBConnectionIdleTimeout test fails. Example failures:
I have seen these failures in 4.14 and 4.13 CI jobs.
Presently, search.ci reports the following stats for the past 14 days:
Found in 1.24% of runs (3.52% of failures) across 404 total runs and 34 jobs (35.15% failed)
This includes two jobs:
1. Post a PR and have bad luck.
2. Check https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestAWSELBConnectionIdleTimeout&maxAge=336h&context=1&type=all&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job.
The test fails because it times out waiting for DNS to resolve:
=== RUN TestAll/parallel/TestAWSELBConnectionIdleTimeout
operator_test.go:2650: lookup idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-sh28dt25-08f48.origin-ci-int-aws.dev.rhcloud.com on 172.30.0.10:53: no such host
(the same lookup failure from operator_test.go:2650 repeats for the remainder of the test)
operator_test.go:2656: failed to observe expected condition: timed out waiting for the condition
panic.go:522: deleted ingresscontroller test-idle-timeout
The above output comes from build-log.txt from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/917/pull-ci-openshift-cluster-ingress-operator-release-4.13-e2e-aws-operator/1658840125502656512.
CI passes, or it fails on a different test.
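The repeated lookups above suggest the test polls DNS until the ELB hostname resolves and gives up at a timeout. A minimal sketch of that pattern (assumed names; the actual polling code lives in operator_test.go):
~~~
package e2e

import (
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForDNS polls until host resolves, logging each failed lookup;
// the caller fails the test if the timeout is reached.
func waitForDNS(host string, logf func(format string, args ...interface{})) error {
	return wait.PollImmediate(10*time.Second, 10*time.Minute, func() (bool, error) {
		if _, err := net.LookupHost(host); err != nil {
			logf("lookup %s: %v", host, err)
			return false, nil // not resolvable yet; keep polling
		}
		return true, nil
	})
}
~~~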
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/217
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
OSSM daily builds were updated to no longer support the spec.techPreview.controlPlaneMode field, and as a result OSSM will not create an SMCP. The field needs to be updated to spec.mode. Gateway API enhanced dev preview is currently broken (we are currently using the latest 2.4 daily build because 2.4 is unreleased). This should be resolved before OSSM 2.4 goes GA.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
100%
Steps to Reproduce:
1. Follow instructions in http://pastebin.test.redhat.com/1092754
Actual results:
CIO fails to create a SMCP "error": "failed to create ServiceMeshControlPlane openshift-ingress/openshift-gateway: admission webhook \"smcp.validation.maistra.io\" denied the request: the spec.techPreview.controlPlaneMode field is not supported in version 2.4+; use spec.mode"
Expected results:
CIO is able to create a SMCP
Additional info:
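For illustration, the manifest change amounts to replacing the removed field with its replacement (a sketch; the mode value shown is an assumption, only the field names come from the webhook error above):
~~~
apiVersion: maistra.io/v2
kind: ServiceMeshControlPlane
metadata:
  name: openshift-gateway
  namespace: openshift-ingress
spec:
  # rejected by the validation webhook in 2.4+:
  # techPreview:
  #   controlPlaneMode: ClusterWide
  # replacement field:
  mode: ClusterWide  # value assumed for illustration
~~~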
Description of problem:
The ingress-canary DaemonSet does not tolerate the infra "NoExecute" taint
Version-Release number of selected component (if applicable):
OCPv4.9
How reproducible:
Always
Steps to Reproduce:
1. Label and taint a node:
$ oc describe node worker-0.cluster49.lab.pnq2.cee.redhat.com | grep infra
Roles: custom,infra,test
node-role.kubernetes.io/infra= <----
Taints: node-role.kubernetes.io/infra=reserved:NoExecute <----
node-role.kubernetes.io/infra=reserved:NoSchedule <----
2. Edit the ingress-canary DaemonSet and add a NoExecute toleration:
$ oc get ds -o yaml | grep -i tole -A6
tolerations:
3. The Daemon Set configuration gets overwritten after some time, probably by the managing operator, and the pods are terminated on the infra nodes.
Actual results:
The infra taint toleration NoExecute gets overwritten:
$ oc get ds -o yaml | grep -i tole -A6
tolerations:
Expected results:
The ingress-canary DaemonSet should tolerate the NoExecute infra taint.
Additional info: The same taint as in the product documentation is used (node-role.kubernetes.io/infra).
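For reference, the tolerations the operator would need to manage itself (a sketch using the reserved taint value from the steps above; not taken from the operator's manifests):
~~~
tolerations:
- key: node-role.kubernetes.io/infra
  value: reserved
  effect: NoExecute
- key: node-role.kubernetes.io/infra
  value: reserved
  effect: NoSchedule
~~~
Because the operator reconciles the DaemonSet, editing it by hand is reverted; the toleration has to come from the operator's own manifests.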
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/180
The PR has been automatically opened by ART (#aos-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #aos-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api/pull/191
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When setting up router sharding with `endpointPublishingStrategy: Private` in an OCP 4.13.11 BareMetal cluster, the restricted-readonly SCC is added to the router pods, causing them to CrashLoopBackOff:
~~~
$ oc get pod -n openshift-ingress router-spinque-xxx -oyaml | grep -i scc
openshift.io/scc: restricted-readonly <<<
$ oc get pod -n openshift-ingress router-spinque-xxxj -oyaml | grep -i scc
openshift.io/scc: restricted-readonly <<<<
$ oc get pod -n openshift-ingress router-spinque-xxx -oyaml | grep -i scc
openshift.io/scc: restricted-readonly <<<<
~~~
~~~
router-spinque-xxx 0/1 CrashLoopBackOff 27 2h
router-spinque-xxx 0/1 CrashLoopBackOff 27 2h
router-spinque-xxx 0/1 CrashLoopBackOff 27 2h
~~~
Please find the must-gather as well as the sos-report from one of the nodes in the case 03624389 in supportshell
—
The following scc config can be used to reproduce this issue on any platform:
allowPrivilegeEscalation: true
allowedCapabilities: []
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups:
- system:authenticated
kind: SecurityContextConstraints
metadata:
  name: bad-router
priority: 0
readOnlyRootFilesystem: true
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
Save the above yaml as bad-router-scc.yaml then apply it to your cluster:
$ oc apply -f bad-router-scc.yaml
Force the restart of router pods, such as by deleting one:
$ oc delete pod router-default-6465854689-gvjhs
The newly started pod(s) should be running but not ready, with the bad-router scc:
$ oc get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-6465854689-7x558   0/1     Running   0          49s
$ oc get pod router-default-6465854689-7x558 -o yaml | grep scc
  openshift.io/scc: bad-router
If you wait long enough, the pod will restart multiple times and eventually enter the CrashLoopBackOff state.
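To clean up after reproducing, delete the bad SCC and force the router pods to be recreated (the label selector below is an assumption for illustration):
~~~
$ oc delete scc bad-router
$ oc -n openshift-ingress delete pod -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
~~~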
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/561
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The ConfigObserver controller waits until all of the given informers are marked as synced, including the build informer. However, when the Build capability is disabled, the build informer never syncs, which blocks ConfigObserver so that it never runs. This likely only happens on 4.15 because the capability-watching mechanism was bound to ConfigObserver in 4.15.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Launch cluster-bot cluster via "launch 4.15.0-0.nightly-2023-11-05-192858,openshift/cluster-openshift-controller-manager-operator#315 no-capabilities"
Steps to Reproduce:
1. 2. 3.
Actual results:
ConfigObserver controller stuck in failure
Expected results:
The ConfigObserver controller runs and successfully clears all deployer service accounts when the DeploymentConfig capability is disabled.
Additional info:
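A minimal sketch of the kind of fix implied above (names assumed; not the actual cluster-openshift-controller-manager-operator code): only wait for caches whose capability is enabled, so a never-syncing build informer cannot block the controller from starting.
~~~
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// waitForEnabledInformers waits only for caches whose capability is
// enabled; with the Build capability disabled, the build informer never
// syncs and must not be part of the wait set.
func waitForEnabledInformers(ctx context.Context, buildCapEnabled bool, buildInformer, configInformer cache.SharedIndexInformer) error {
	synced := []cache.InformerSynced{configInformer.HasSynced}
	if buildCapEnabled {
		synced = append(synced, buildInformer.HasSynced)
	}
	if !cache.WaitForCacheSync(ctx.Done(), synced...) {
		return fmt.Errorf("informer caches failed to sync")
	}
	return nil
}
~~~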
Description of problem:
CAPI manifests have the TechPreviewNoUpgrade annotation but are missing the CustomNoUpgrade annotation
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
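For illustration (assuming the cluster-version operator's release.openshift.io/feature-set manifest annotation is the gating mechanism in question), the fix would amount to listing both feature sets on each CAPI manifest:
~~~
metadata:
  annotations:
    # present today: manifest applied only for TechPreviewNoUpgrade
    # release.openshift.io/feature-set: TechPreviewNoUpgrade
    # fix: also apply it for CustomNoUpgrade
    release.openshift.io/feature-set: CustomNoUpgrade,TechPreviewNoUpgrade
~~~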
Description of problem:
E2E test failing (20% impact):
allows modifying registry poll interval: Interacting with CatalogSource page allows modifying registry poll interval
How reproducible:
Always
Kubernetes 1.27 changes the validation of non-RSA kubelet client/serving CSRs; see https://github.com/kubernetes/kubernetes/issues/109077 and the PR making the change, https://github.com/kubernetes/kubernetes/pull/111660.
For that reason our machine approver needs to relax the validation in https://github.com/openshift/cluster-machine-approver/blob/d74f42bb37c4130ae1e91819d90ad40a51ec472b/pkg/controller/csr_check.go#L84-L86 so that it accepts the key usages such CSRs now request.
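A sketch of that relaxation (assumed helper names; the real logic lives in csr_check.go), under the assumption that non-RSA CSRs no longer request the "key encipherment" usage on 1.27+:
~~~
package main

import (
	certificatesv1 "k8s.io/api/certificates/v1"
)

var (
	// Usages historically requested by kubelets (RSA keys).
	usagesWithKeyEncipherment = []certificatesv1.KeyUsage{
		certificatesv1.UsageDigitalSignature,
		certificatesv1.UsageKeyEncipherment,
		certificatesv1.UsageClientAuth,
	}
	// Usages requested for non-RSA keys on Kubernetes 1.27+.
	usagesWithoutKeyEncipherment = []certificatesv1.KeyUsage{
		certificatesv1.UsageDigitalSignature,
		certificatesv1.UsageClientAuth,
	}
)

// hasExactUsages reports whether the CSR requests exactly the given set.
func hasExactUsages(csr *certificatesv1.CertificateSigningRequest, usages []certificatesv1.KeyUsage) bool {
	if len(csr.Spec.Usages) != len(usages) {
		return false
	}
	seen := map[certificatesv1.KeyUsage]bool{}
	for _, u := range csr.Spec.Usages {
		seen[u] = true
	}
	for _, u := range usages {
		if !seen[u] {
			return false
		}
	}
	return true
}

// validUsages accepts either set rather than requiring key encipherment.
func validUsages(csr *certificatesv1.CertificateSigningRequest) bool {
	return hasExactUsages(csr, usagesWithKeyEncipherment) ||
		hasExactUsages(csr, usagesWithoutKeyEncipherment)
}
~~~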
Description of problem:
cluster-ingress-operator E2E tests log an error message: "[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed". It looks like newClient is called from two places, TestMain and TestIngressStatus.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run E2E tests that call newClient, such as TestIngressStatus
2. Examine the logs
Actual results:
[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 9120 [running]:
runtime/debug.Stack()
	/usr/lib/golang/src/runtime/debug/stack.go:24 +0x65
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:59 +0xbd
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0xc000113000, {0x1dd106b, 0x14})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:147 +0x4c
github.com/go-logr/logr.Logger.WithName({{0x21435e0, 0xc000113000}, 0x0}, {0x1dd106b?, 0xe?})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/github.com/go-logr/logr/logr.go:336 +0x46
sigs.k8s.io/controller-runtime/pkg/client.newClient(0xc00086afc0, {0x0, 0xc0001a0fc0, {0x2144930, 0xc00033ac00}, 0x0, {0x0, 0x0}, 0x0})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:115 +0xb4
sigs.k8s.io/controller-runtime/pkg/client.New(0xc00086afc0?, {0x0, 0xc0001a0fc0, {0x2144930, 0xc00033ac00}, 0x0, {0x0, 0x0}, 0x0})
	/go/src/github.com/openshift/cluster-ingress-operator/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:101 +0x85
github.com/openshift/cluster-ingress-operator/pkg/operator/client.NewClient(0x0?)
	/go/src/github.com/openshift/cluster-ingress-operator/pkg/operator/client/client.go:83 +0x145
github.com/openshift/cluster-ingress-operator/test/e2e.TestIngressStatus(0xc000503520)
	/go/src/github.com/openshift/cluster-ingress-operator/test/e2e/dns_ingressdegrade_test.go:33 +0x95
testing.tRunner(0xc000503520, 0x1f015a0)
	/usr/lib/golang/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
	/usr/lib/golang/src/testing/testing.go:1629 +0x3ea
Expected results:
No error message
Additional info:
This is due to the 1.27 rebase.
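One common way to address this (a sketch, assuming a TestMain already exists in the e2e package) is to register a controller-runtime logger once, before any client is constructed:
~~~
package e2e

import (
	"os"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func TestMain(m *testing.M) {
	// Register a logger before any controller-runtime client is built,
	// so the "log.SetLogger(...) was never called" warning goes away.
	log.SetLogger(zap.New(zap.UseDevMode(true)))
	os.Exit(m.Run())
}
~~~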
Description of problem:
Recently during an audit on a user's cluster, it was discovered that
OLM's certificate generation functionality has a few minor shortcomings.
How reproducible: Always
Joe Lanford, could you please double-check what I've put below? QE is asking for a bug ticket for this fix (which makes sense, as it helps them verify everything is correct and gives us traceability).
Steps to Reproduce:
oc get secret -n openshift-operator-lifecycle-manager packageserver-service-cert -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -text
Actual results:
Expected results:
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When the TestMTLSWithCRLs e2e test fails on a curl, it checks stdout, but stdout can be empty, so the test panics:
--- FAIL: TestAll/parallel/TestMTLSWithCRLs (97.09s)
    --- FAIL: TestAll/parallel/TestMTLSWithCRLs/certificate-distributes-its-own-crl (97.09s)
panic: runtime error: slice bounds out of range [-3:] [recovered]
	panic: runtime error: slice bounds out of range [-3:]
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Experience a failure in the MTLS testing, such as the one seen in https://redhat-internal.slack.com/archives/CBWMXQJKD/p1688596054069399?thread_ts=1688596036.042119&cid=CBWMXQJKD
Search.ci shows two failures in the past two weeks: https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestMTLSWithCRLs&maxAge=336h&context=1&type=bug%2Bissue%2Bjunit&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
1. N/A 2. 3.
Actual results:
Test panics when trying to report an error.
Expected results:
The test reports whatever error it can without panicking.
Additional info:
stdout was empty, but https://github.com/openshift/cluster-ingress-operator/blob/4c92a6d1ee80b6b120dd750855a40145a530153c/test/e2e/client_tls_test.go#L1587 doesn't check that the value is empty before it tries to index it.
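A bounds-checked helper of the kind the fix needs (a sketch; the actual fix would guard the slice expression at the linked line):
~~~
package e2e

// tail returns the last n bytes of s without panicking when s is
// shorter than n (e.g. when the curl produced no stdout at all).
func tail(s string, n int) string {
	if len(s) < n {
		return s
	}
	return s[len(s)-n:]
}
~~~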